[mlpack] (no subject)

Ryan Curtin ryan at ratml.org
Sat Mar 28 12:00:24 EDT 2020


Hi Aman,

I've got five emails from you in the past two days... please be patient,
right before the GSoC deadline is the busiest time.  Many of us mentors
also have other jobs and work to handle, so quick responses aren't
always feasible---just look at the huge list of open PRs for proof...
On the other hand, we do already have the application guide and other
resources, so I would suggest looking at those.

As for what you wrote, random forests already parallelize trees during
training with OpenMP.  mlpack is not a distributed library so I would
advise against MapReduce; it doesn't fit in any of the abstractions that
we have---so, I agree, OpenMP is the way to go.  We don't have the
resources to effectively turn mlpack into a distributed machine
learning library, and even if we did, it's doubtful that we could
compete with the already established frameworks in that area.
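
For reference, the tree-level parallelism there is conceptually just an
OpenMP parallel for over the trees.  Here is a minimal standalone
sketch of the pattern (a toy illustration, not mlpack's actual code):

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    // Stand-in for a decision tree; Train() just reports which thread
    // handled it.
    struct Tree
    {
      int id;
      void Train()
      {
        std::printf("tree %d trained on thread %d\n", id,
            omp_get_thread_num());
      }
    };

    int main()
    {
      std::vector<Tree> trees(20);
      for (int i = 0; i < 20; ++i)
        trees[i].id = i;

      // Each tree is built from an independent bootstrap sample, so
      // the iterations have no dependencies and can run on separate
      // threads.
      #pragma omp parallel for
      for (int i = 0; i < (int) trees.size(); ++i)
        trees[i].Train();
    }

(Compile with -fopenmp; without it the pragma is ignored and the loop
simply runs serially.)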

If you mean that you are intending to implement gradient boosting
algorithms, please do be clear in your proposal about what you are
hoping to implement, what its API will be, how users will use it, and so
forth.  Note that mlpack already has AdaBoost, which might be of
interest to you as you work on your proposal.

Thanks,

Ryan

On Sat, Mar 28, 2020 at 07:46:39PM +0530, Aman Pandey wrote:
> Waiting for some feedback.
> 
> 
> Have done some more groundwork....
> 
> 
> I'll drop a draft proposal in a day or two.
> 
> 
> Thanks & Regards
> Aman Pandey
> 
> On Fri, 27 Mar 2020, 8:34 pm Aman Pandey, <aman0902pandey at gmail.com> wrote:
> 
> >
> > Hi Ryan/Marcus,
> > Just saw the 2017 work on parallelisation by Shikhar Bhardwaj.
> >
> > https://www.mlpack.org/gsocblog/profiling-for-parallelization-and-parallel-stochastic-optimization-methods-summary.html
> >
> > Impressive work he has done, and excellent documentation, I must say.
> > ---------------------
> >
> > In this email, I'll cover:
> >
> >    - Which algorithms I'll be implementing
> >    - My thoughts on parallelisation
> >    - My rough plan to complete the proposed work in mlpack
> >    - A thought about adding something like *Federated Learning* to
> >    mlpack (this could be very complex, though!)
> >
> >
> > I want to have a slightly clearer understanding of what I am going to
> > do, so please check whether I am planning correctly and whether my
> > approach is feasible.
> >
> > I will be focusing on the following algorithms during my GSoC period:
> > 1) Random Forest
> > 2) KNN
> > 3) A few gradient boosting algorithms
> >
> > Either I can "parallelize" the algorithms according to their
> > computational tasks (e.g., in random forest, I can try training its N
> > trees in parallel), or I can distribute tasks via MapReduce or other
> > distributed computation approaches
> > (https://stackoverflow.com/a/8727235/9982106 lists a few pretty well).
> > MapReduce only works well if very little data moves across machines,
> > and only a few times. *This could be a reason why we should look at
> > some better alternatives.*
> > For example, after each iteration, derivative-based methods have to
> > calculate the gradient over the complete training data, which, in
> > general, requires moving the complete data to a single machine. As the
> > number of iterations increases, this becomes a bottleneck.
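> >
> > As a concrete toy sketch (my own illustration, not mlpack code), this
> > is the kind of full-data gradient step I mean; on a single machine an
> > OpenMP reduction handles it cleanly, but in a distributed setting
> > every iteration would have to move data or partial sums between
> > machines:
> >
> >     #include <omp.h>
> >     #include <cstdio>
> >     #include <vector>
> >
> >     int main()
> >     {
> >       // Toy 1-D least squares: f(w) = 0.5 * sum_i (w * x_i - y_i)^2.
> >       std::vector<double> x(1000000, 1.0), y(1000000, 2.0);
> >       double w = 0.0;
> >
> >       // One gradient evaluation touches every training point; the
> >       // reduction clause sums per-thread partial gradients.
> >       double grad = 0.0;
> >       #pragma omp parallel for reduction(+:grad)
> >       for (long i = 0; i < (long) x.size(); ++i)
> >         grad += (w * x[i] - y[i]) * x[i];
> >
> >       std::printf("gradient: %f\n", grad);
> >     }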
> >
> > I am in favour of working with OpenMP before trying any such thing.
> >
> > Something similar can occur with tree-based algorithms, where the
> > best split has to be computed over the complete data repeatedly.
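> >
> > As a toy sketch (again my own illustration, not mlpack's actual
> > splitting code, and ScoreFeature() here is a hypothetical stand-in),
> > the per-feature split search could be parallelized, since each
> > feature's gain can be scored independently:
> >
> >     #include <omp.h>
> >     #include <cstdio>
> >     #include <vector>
> >
> >     // Hypothetical stand-in for scoring one candidate split feature
> >     // over the full dataset (higher is better).
> >     double ScoreFeature(int feature)
> >     {
> >       return 0.1 * feature - (feature % 7);
> >     }
> >
> >     int main()
> >     {
> >       const int numFeatures = 100;
> >       std::vector<double> gains(numFeatures);
> >
> >       // Each feature is scored independently, so this loop
> >       // parallelizes cleanly; the final argmax stays serial.
> >       #pragma omp parallel for
> >       for (int f = 0; f < numFeatures; ++f)
> >         gains[f] = ScoreFeature(f);
> >
> >       int best = 0;
> >       for (int f = 1; f < numFeatures; ++f)
> >         if (gains[f] > gains[best])
> >           best = f;
> >       std::printf("best feature to split on: %d\n", best);
> >     }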
> >
> > I would follow this "rough" timeline:
> > *(I've tried not to make it complex or unrealistic)*
> >
> > 1) Profiling algorithms to find their bottlenecks, training on a
> > variety of example datasets (small to big, which makes a big
> > difference in the amount of computation) - *Week 1-2*
> > 2) Working on at least one gradient boosting algorithm, to check that
> > my approach is sound and fully in accordance with mlpack; in parallel,
> > working on profiling and designing parallelism for random forest -
> > *Week 2-3*
> > 3) Working on random forest and KNN - *Week 4-8*
> > 4) Building out different distributed computing alternatives to
> > MapReduce. *If this works well, it could transform mlpack into an
> > actual distributed killer.* However, working haphazardly on different
> > algorithms with different distributed computation techniques may lead
> > to randomness in mlpack development. (*I still have to be sure about
> > this.*)
> >
> > ----------------- *An additional idea* -----------------
> > I don't know if this has been discussed before, as I have been away
> > from mlpack for almost a year.
> > Have you ever thought of adding *federated learning* support to
> > mlpack? Something like PySyft (https://github.com/OpenMined/PySyft)
> > could bring a tremendous improvement to mlpack, and would really help
> > people working on big deep learning problems, as well as researchers.
> >
> >
> > Please let me know if we can discuss this idea!
> >
> > --------------------------------------
> > The reason I chose mlpack is that I have knowledge of its codebase,
> > as I applied to mlpack last year, and of course the team is awesome; I
> > have always found good support from everyone here.
> >
> > And, amid this COVID-19 situation, I *will not* be able to complete
> > the internship I earned at *NUS Singapore*, so I need something of
> > that level to work on and make use of this summer.
> > I am very comfortable with any kind of code; as an example, I worked
> > on completely unfamiliar Haskell code while an undergraduate
> > researcher at IITK (one of the finest CSE departments in India).
> > Plus, knowledge of advanced C++ should help me be quick and efficient.
> >
> > I have started drafting a proposal. Please, let me know your thoughts.
> >
> > Will update you within the next 2 days.
> >
> > ----
> > Please be safe!
> > Looking forward to a wonderful experience with MLPACK. :)
> >
> >
> >
> > *Aman Pandey*
> > amanpandey.codes <http://amanpandey.codes>
> >
> > On Mon, Mar 16, 2020 at 7:52 PM Aman Pandey <aman0902pandey at gmail.com>
> > wrote:
> >
> >> Hi Ryan,
> >> I think that is enough information.
> >> Thanks a lot.
> >> I tried mlpack last year, on QGMM; unfortunately, I couldn't make it.
> >>
> >> Will try once again, with a hopefully better proposal. ;)
> >> In parallelisation this time.
> >>
> >> Thanks.
> >> Aman Pandey
> >> GitHub Username: johnsoncarl
> >>
> >> On Mon, Mar 16, 2020 at 7:33 PM Ryan Curtin <ryan at ratml.org> wrote:
> >>
> >>> On Sun, Mar 15, 2020 at 12:38:09PM +0530, Aman Pandey wrote:
> >>> > Hey Ryan/Marcus,
> >>> > Are there any current coordinates to start with, in "Profiling for
> >>> > Parallelization"?
> >>> > I want to know if any, to avoid any redundant work.
> >>>
> >>> Hey Aman,
> >>>
> >>> I don't think that there are any particular directions.  You could
> >>> consider looking at previous messages from previous years in the mailing
> >>> list archives (this project has been proposed in the past and there has
> >>> been some discussion).  My suggestion would be to find some algorithms
> >>> that you think could be useful to parallelize, and spend some time
> >>> thinking about the right way to do that with OpenMP.  The "profiling"
> >>> part may come in useful here, as when you put your proposal together it
> >>> could be useful to find algorithms that have bottlenecks that could be
> >>> easily resolved with parallelism.  (Note that not all algorithms have
> >>> bottlenecks that can be solved with parallelism, and algorithms heavy on
> >>> linear algebra may already be effectively parallelized via the use of
> >>> OpenBLAS at a lower level.)
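> >>>
> >>> (A simple way to start on the profiling side, before reaching for a
> >>> full profiler like perf or gprof, is to put a timer around the
> >>> suspected hot section.  A toy sketch, with a dummy loop standing in
> >>> for the real work:
> >>>
> >>>     #include <chrono>
> >>>     #include <cstdio>
> >>>
> >>>     int main()
> >>>     {
> >>>       auto start = std::chrono::steady_clock::now();
> >>>
> >>>       // Suspected bottleneck goes here; this dummy loop stands in.
> >>>       volatile double sum = 0.0;
> >>>       for (long i = 0; i < 100000000; ++i)
> >>>         sum = sum + i * 0.5;
> >>>
> >>>       auto stop = std::chrono::steady_clock::now();
> >>>       auto ms = std::chrono::duration_cast<
> >>>           std::chrono::milliseconds>(stop - start).count();
> >>>       std::printf("elapsed: %lld ms\n", (long long) ms);
> >>>     }
> >>>
> >>> That is often enough to tell whether a candidate loop is actually
> >>> where the time goes.)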
> >>>
> >>> Thanks,
> >>>
> >>> Ryan
> >>>
> >>> --
> >>> Ryan Curtin    | "I was misinformed."
> >>> ryan at ratml.org |   - Rick Blaine
> >>>
> >>
> >>
> >> --
> >>
> >> Aman Pandey
> >> Junior Undergraduate, Bachelors of Technology
> >> Sardar Vallabhbhai National Institute of Technology,
> >>
> >> Surat, Gujarat, India. 395007
> >> Webpage: https://johnsoncarl.github.io/aboutme/
> >> LinkedIn: https://www.linkedin.com/in/amnpandey/
> >>
> >
> >
> > --
> >
> > Aman Pandey
> > Junior Undergraduate, Bachelors of Technology
> > Sardar Vallabhbhai National Institute of Technology,
> >
> > Surat, Gujarat, India. 395007
> > Webpage: https://johnsoncarl.github.io/aboutme/
> > LinkedIn: https://www.linkedin.com/in/amnpandey/
> >

> _______________________________________________
> mlpack mailing list
> mlpack at lists.mlpack.org
> http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack


-- 
Ryan Curtin    | "Hold still."
ryan at ratml.org |   - Mr. Blonde

