[mlpack] Profiling for parallelization

Nikhil Goel nikhilgoel199797 at gmail.com
Tue Mar 20 14:50:05 EDT 2018


Hi Ryan,

Thank you for your help. I've submitted the draft of my proposal and it
would be really helpful if you could review it and tell me the changes I
should make.
My main concerns regarding my proposal are:
1) The number of algorithms/functions I've chosen. I'm trying to research
more, but it would be really helpful if you could share your thoughts on
whether that number is reasonable.
2) I looked into logistic regression, and it uses SGD and L-BFGS.
Parallel SGD has been implemented in mlpack, but I'm unsure whether it
will actually provide a significant speedup, since the parallelization is
already there at a lower level. Do you think it is worth investing my
time in? Should I mention it in my GSoC proposal?
3) A similar problem arises for naive Bayes. I've identified the for
loops that could be parallelized, but the papers I followed showed no
significant performance improvement for parallel naive Bayes. Should I
mention this in my proposal?
4) How much change is permitted before I should create a separate file
for the parallel implementation of an algorithm?
5) I've dropped the idea of providing a separate API, since you're right
that it is better for users to learn OpenMP, which is widely used.
6) I've added bagging to my proposal, so I'll implement and parallelize
it. I hope that's fine.
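
For reference on point 2, the batch-gradient pattern I have in mind looks
roughly like the sketch below (a self-contained, hypothetical illustration
with OpenMP, not mlpack's actual code; the function name and the 1-D
simplification are my own). In mlpack the batch is an Armadillo matrix and
OpenBLAS may already parallelize the underlying matrix arithmetic, which is
exactly why the speedup here is uncertain:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a one-dimensional logistic regression gradient over
// a batch, with the per-point loop parallelized via an OpenMP reduction.
// Without -fopenmp the pragma is ignored and the loop runs serially with
// identical results.
double LogisticGradient1D(const std::vector<double>& x,
                          const std::vector<double>& y,
                          double w)
{
  double grad = 0.0;
  #pragma omp parallel for reduction(+ : grad)
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t) x.size(); ++i)
  {
    const double p = 1.0 / (1.0 + std::exp(-w * x[i]));  // sigmoid(w * x_i)
    grad += (p - y[i]) * x[i];  // per-point gradient contribution
  }
  return grad;
}
```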
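For point 3, the naive Bayes loops I identified follow this kind of
independent-iteration pattern (again a minimal standalone sketch, not
mlpack's code; the function name is illustrative). It parallelizes cleanly
with OpenMP, but as the papers suggest, whether that yields a real speedup
depends on the data size:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: parallel per-class counting for naive Bayes
// training. `labels` holds class indices in [0, numClasses). The
// array-section reduction needs OpenMP 4.5+; without -fopenmp the loop
// simply runs serially with the same result.
std::vector<long> CountClasses(const std::vector<int>& labels,
                               int numClasses)
{
  std::vector<long> counts(numClasses, 0);
  long* c = counts.data();
  #pragma omp parallel for reduction(+ : c[:numClasses])
  for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t) labels.size(); ++i)
    c[labels[i]]++;
  return counts;
}
```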
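And for point 6, the bagging structure I'm proposing to parallelize is
roughly the following (a hypothetical sketch with a trivial stand-in
"model" -- just the mean of the resampled values -- rather than a real
mlpack learner). Each bootstrap replicate is resampled and trained
independently, so the outer loop is a natural fit for OpenMP:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: train `numModels` bagged replicates in parallel.
// Each replicate draws a bootstrap sample (sampling with replacement) and
// fits a stand-in model (the sample mean). Each iteration gets its own
// seeded RNG so threads never share generator state.
std::vector<double> BaggedMeans(const std::vector<double>& data,
                                int numModels, unsigned seed)
{
  std::vector<double> models(numModels);
  #pragma omp parallel for
  for (int m = 0; m < numModels; ++m)
  {
    std::mt19937 rng(seed + m);
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)
      sum += data[pick(rng)];  // bootstrap: sample with replacement
    models[m] = sum / data.size();
  }
  return models;
}
```

At prediction time the replicates' outputs would be averaged (or voted on,
for classification), which is cheap and needs no parallelization.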

Thanks


On Fri, Mar 16, 2018 at 8:22 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Fri, Mar 16, 2018 at 01:39:34PM +0530, Nikhil Goel wrote:
> > Hello
> >
> > Thank you for your help! I had a few more questions
> > Sequential algorithms like logistic regression are very hard to
> > parallelize. While researching for this project, the only way I could
> > find was by computing the gradient of a batch in parallel. But from
> > what I could see in mlpack, the batch is provided as a matrix, and
> > matrix operations are already parallelized in mlpack since OpenBLAS
> > is parallelized. So I needn't worry about such algorithms?
>
> Hi there Nikhil,
>
> You are right, there are some algorithms for which specific
> parallelization is not useful and it is better to depend on a parallel
> BLAS.  For logistic regression in particular, there are a few parallel
> optimizers that are implemented; you might consider taking a look at
> those also.
>
> > Yes, you're right that we can use environment variables, but wouldn't
> > it be cleaner to provide users with an option like 'cores', defaulting
> > to the maximum number of cores available (or 1, whichever you choose),
> > in algorithms that have been parallelized?
>
> No, in my view this would be an unnecessary addition of an extra API
> that users have to learn.  If a user learns about OpenMP environment
> variables it is useful anywhere OpenMP is used, but if a user instead
> learns about some mlpack-specific parallelization API, it is not useful
> anywhere except mlpack.
>
> > Also, is bagging ensembling implemented in mlpack? It's a pretty
> > popular algorithm, but I couldn't find it. I was wondering if it's
> > needed in mlpack?
>
> The only ensembling algorithm we have at the minute is AdaBoost.  It may
> be useful to add another algorithm.
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "I can't believe you like money too.  We should
> ryan at ratml.org | hang out."  - Frito
>

