[mlpack] Profiling for parallelization

Fri Mar 16 04:09:34 EDT 2018

Hello

Thank you for your help! I have a few more questions.

Sequential algorithms like logistic regression are very hard to
parallelize. While researching for this project, the only approach I could
find was to compute the gradient over a batch in parallel. But from what I
can see in mlpack, the batch is provided as a matrix, and matrix operations
are already parallelized in mlpack because OpenBLAS is parallelized. So I
needn't worry about such algorithms?
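
For concreteness, a minimal sketch of "computing the gradient of a batch in
parallel" with OpenMP could look like the following. It is not mlpack's
implementation: the function name BatchGradient is made up for illustration,
and plain std::vector stands in for the matrix types mlpack uses. Each thread
accumulates a private partial gradient, and the partial results are merged at
the end.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of mlpack. Each row of X is one data point,
// y holds the 0/1 labels, and w holds the current weights.
std::vector<double> BatchGradient(const std::vector<std::vector<double>>& X,
                                  const std::vector<double>& y,
                                  const std::vector<double>& w)
{
  assert(X.size() == y.size());
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(X.size());
  const std::size_t d = w.size();
  std::vector<double> grad(d, 0.0);

  #pragma omp parallel
  {
    // Private accumulator so threads never write to the same memory.
    std::vector<double> local(d, 0.0);

    #pragma omp for nowait
    for (std::ptrdiff_t i = 0; i < n; ++i)
    {
      double z = 0.0;
      for (std::size_t j = 0; j < d; ++j)
        z += w[j] * X[i][j];

      // Per-point error: sigmoid(w . x_i) - y_i.
      const double err = 1.0 / (1.0 + std::exp(-z)) - y[i];
      for (std::size_t j = 0; j < d; ++j)
        local[j] += err * X[i][j];
    }

    // Merge the per-thread partial gradients one thread at a time.
    #pragma omp critical
    {
      for (std::size_t j = 0; j < d; ++j)
        grad[j] += local[j];
    }
  }

  return grad;
}
```

Without OpenMP enabled at compile time the pragmas are ignored and the same
code runs serially. Whether this beats a single multithreaded matrix product
over the whole batch would have to be measured.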

Yes, you're right that we can use environment variables, but wouldn't it be
cleaner to give users an option like 'cores' in the algorithms that have
been parallelized, with the default being the maximum number of cores
available (or 1, whichever you prefer)?

Also, is bagging ensembling implemented in mlpack? It's a pretty popular
algorithm and I couldn't find it there. I was wondering whether it's needed
in mlpack.

Thanks

On Mon, Mar 12, 2018 at 7:57 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Mon, Mar 12, 2018 at 06:51:20PM +0530, Nikhil Goel wrote:
> > Hello
> >
> > I am Nikhil Goel (github:nikhilgoel1997), a pre-final year student from
> > Birla Institute of Technology and Science, Pilani (BITS, Pilani). I've
> > been contributing to mlpack for the past month and have become familiar
> > with the codebase. In the past I've done projects on Sentiment analysis,
> > Image classification and Financial signal processing using machine
> > learning.
> > I wanted to do a project which would help me improve my understanding of
> > multiple algorithms and Profiling for parallelization is ideal for that!
> > In that direction I've studied and grown familiar with the openMP
> > library.
> > While I want to tackle every algorithm that is implemented in mlpack and
> > find a way to parallelize it or have a good explanation as to why it is
> > not parallelizable, doing it properly by 27th (Last day to submit the
> > proposal) might be a little difficult. Since the project description is
> > vague, what would be a good number of algorithms for which proper
> > description on how to parallelize is given in the proposal for a strong
> > proposal. (I believe there are 5 algorithms that have been parallelized
> > in mlpack and till now, I've found how to parallelize other algorithms
> > like knn, logistic regression, naive bayes, pca)
> > As for the API, I think having an additional option in the algorithm for
> > using multi-core can be given to the user. Is this a good idea?
> >
> > I would love to hear suggestions from the mentors to understand if they
> > feel that I'm approaching this project the correct way.
>
> Hi Nikhil,
>
> Thanks for getting in touch.  It's tough to say what a good number of
> algorithms to parallelize would be reasonable, because some algorithms
> will be harder to parallelize than others.  What I would suggest is that
> you take a look at some algorithms that are interesting to you, estimate
> how long it might take to OpenMP-ize them, and then use this to
> structure your proposal.  Don't worry if the timeline isn't exactly
> accurate; we know that sometimes it is hard to estimate, and your mentor
> (which in this case I guess will be me) will work with you to
> restructure the timeline and scope of work as needed.  But you should
> still aim to try and get it as close to reality as you think you can.
>
> For the API, with OpenMP I think no changes are necessary.  The user can
> set their desired number of cores with environment variables like
> OMP_NUM_THREADS and other variables.
>
> I hope this helps; let me know if I can clarify anything.
>
> --
> Ryan Curtin    | "Why is it that the landscape is moving... but the boat is
> ryan at ratml.org | still?"  - Train Driver
>