[mlpack] (no subject)

Aman Pandey aman0902pandey at gmail.com
Sat Mar 28 10:16:39 EDT 2020


Waiting for some feedback.


I have done some more groundwork in the meantime.


I'll drop a draft proposal in a day or two.


Thanks & Regards
Aman Pandey

On Fri, 27 Mar 2020, 8:34 pm Aman Pandey, <aman0902pandey at gmail.com> wrote:

>
> Hi Ryan/Marcus,
> I just saw the 2017 work on parallelisation by Shikhar Bhardwaj:
>
> https://www.mlpack.org/gsocblog/profiling-for-parallelization-and-parallel-stochastic-optimization-methods-summary.html
>
> He has done impressive work, and the documentation is excellent.
> ---------------------
>
> In this email, I'll cover:
>
>    - Which algorithms I'll be implementing
>    - My thoughts on parallelisation
>    - My rough plan to complete the proposed work in mlpack
>    - An idea of adding something like *Federated Learning* to
>    mlpack (this could be very complex, though!)
>
>
> I want a clearer picture of what I am going to do, so please check whether
> I am planning correctly and whether my approach is feasible.
>
> I will be focusing on the following algorithms during my GSoC period:
> 1) Random Forest
> 2) KNN
> 3) A few Gradient Boosting Algorithms
>
> I could either parallelise the algorithms according to their computation
> tasks (e.g. in Random Forest, I could try training its N trees in parallel),
> or distribute the tasks via MapReduce or other distributed computation
> frameworks (https://stackoverflow.com/a/8727235/9982106 lists a few
> pretty well). MapReduce only works well if very little data moves across
> machines, and only a few times. *This could be a reason to look at some
> better alternatives*.
> For example, after each iteration, derivative-based methods have to
> calculate the gradient over the complete training data, which in general
> requires moving the complete data to a single machine to compute the
> gradient. As the number of iterations increases, this becomes a bottleneck.
>
> I am in favour of working with OpenMP before trying anything like that.
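>
> As a minimal illustration (nothing mlpack-specific, just a sketch), this is
> roughly what a shared-memory OpenMP version of such a gradient computation
> could look like, using a plain squared-error gradient for a linear model and
> accumulating per-thread partial gradients instead of serialising everything:
>
>   #include <armadillo>
>
>   // Gradient of 0.5 * sum_i (w' x_i - y_i)^2, with one data point per
>   // column.  Each thread accumulates a partial gradient over its chunk of
>   // the points, and the partials are combined at the end.
>   arma::vec FullGradient(const arma::mat& data,
>                          const arma::rowvec& responses,
>                          const arma::vec& weights)
>   {
>     arma::vec grad(weights.n_elem, arma::fill::zeros);
>
>     #pragma omp parallel
>     {
>       arma::vec localGrad(weights.n_elem, arma::fill::zeros);
>
>       #pragma omp for nowait
>       for (size_t i = 0; i < data.n_cols; ++i)
>       {
>         const double residual = arma::dot(weights, data.col(i)) - responses[i];
>         localGrad += residual * data.col(i);
>       }
>
>       // Combine the per-thread partial gradients.
>       #pragma omp critical
>       grad += localGrad;
>     }
>
>     return grad / (double) data.n_cols;
>   }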
>
> Something similar can happen with tree-based algorithms, where the splits
> have to be calculated over the complete data repeatedly.
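>
> To make the random-forest case above concrete, here is a rough sketch of
> training the N trees in parallel with OpenMP. I am writing this from memory,
> so the exact mlpack class name and Train() signature may need adjusting; the
> point is only that the per-tree work is independent:
>
>   #include <mlpack/methods/decision_tree/decision_tree.hpp>
>   #include <vector>
>
>   std::vector<mlpack::tree::DecisionTree<>> TrainForest(
>       const arma::mat& data,            // one point per column
>       const arma::Row<size_t>& labels,
>       const size_t numClasses,
>       const size_t numTrees)
>   {
>     std::vector<mlpack::tree::DecisionTree<>> trees(numTrees);
>
>     // Each tree depends only on its own bootstrap sample, so the trees can
>     // be built completely independently of each other.
>     #pragma omp parallel for
>     for (size_t t = 0; t < numTrees; ++t)
>     {
>       // Bootstrap sample: draw n points with replacement.  (In a real
>       // implementation, per-thread RNG seeding would need some care.)
>       const arma::uvec idx = arma::randi<arma::uvec>(data.n_cols,
>           arma::distr_param(0, (int) data.n_cols - 1));
>
>       const arma::mat sampleData = data.cols(idx);
>       const arma::Row<size_t> sampleLabels = labels.cols(idx);
>
>       trees[t].Train(sampleData, sampleLabels, numClasses);
>     }
>
>     return trees;
>   }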
>
> I would follow this "rough" timeline:
> *(I haven't made it complex or unrealistic)*
>
> 1) Profiling the algorithms to find their bottlenecks, training on a variety
> of example datasets (small to large, which makes a big difference in the
> computations) - *Week 1-2*
> 2) Working on at least one gradient boosting algorithm, to check that my
> approach works and is fully in accordance with mlpack; in parallel, working
> on profiling and designing the parallelism for Random Forest - *Week 2-3*
> 3) Working on Random Forest and KNN - *Week 4-8*
> 4) *Building out different distributed-computing alternatives to MapReduce.*
> If this works well, it could turn mlpack into a real *distributed killer*.
> However, working haphazardly on different algorithms with different
> distributed computation techniques may lead to randomness in mlpack
> development. (*I still have to be sure about this.*)
>
> ----------------- *An additional idea* -----------------
> I don't know if this has been discussed before, as I have been away from
> mlpack for almost a year.
> Have you ever thought of adding FEDERATED LEARNING support to mlpack?
> Something like PySyft (https://github.com/OpenMined/PySyft) could bring a
> tremendous improvement to mlpack, and would really help people working on
> large-scale deep learning as well as researchers.
>
>
> Please let me know if we can discuss this idea!
>
> --------------------------------------
> The reason I chose mlpack is that I know its codebase, since I applied to
> mlpack last year, and of course the team is awesome; I have always found
> good support from everyone here.
>
> And, amid this COVID-19 situation, I will *not* be able to take up the
> internship I had earned at *NUS Singapore*, so I need something of that
> level to work on and to make use of this summer.
> I am very comfortable with any kind of code; as an example, I worked on
> completely unfamiliar Haskell code while working as an undergraduate
> researcher at IITK (one of the finest CSE departments in India).
> Plus, my knowledge of advanced C++ should help me be quick and efficient.
>
> I have started drafting a proposal. Please let me know your thoughts.
>
> I will update you within the next two days.
>
> ----
> Please be safe!
> Looking forward to a wonderful experience with MLPACK. :)
>
>
>
> *Aman Pandey*
> *amanpandey.codes <http://amanpandey.codes>*
>
> On Mon, Mar 16, 2020 at 7:52 PM Aman Pandey <aman0902pandey at gmail.com>
> wrote:
>
>> Hi Ryan,
>> I think that is enough information.
>> Thanks a lot.
>> I tried mlpack last year, on QGMM; unfortunately, I couldn't make it.
>>
>> I will try once again, hopefully with a better proposal. ;)
>> In parallelisation this time.
>>
>> Thanks.
>> Aman Pandey
>> GitHub Username: johnsoncarl
>>
>> On Mon, Mar 16, 2020 at 7:33 PM Ryan Curtin <ryan at ratml.org> wrote:
>>
>>> On Sun, Mar 15, 2020 at 12:38:09PM +0530, Aman Pandey wrote:
>>> > Hey Ryan/Marcus,
>>> > Are there any pointers on where to start with "Profiling for
>>> > Parallelization"?
>>> > I'd like to know, so as to avoid redundant work.
>>>
>>> Hey Aman,
>>>
>>> I don't think that there are any particular directions.  You could
>>> consider looking at previous messages from previous years in the mailing
>>> list archives (this project has been proposed in the past and there has
>>> been some discussion).  My suggestion would be to find some algorithms
>>> that you think could be useful to parallelize, and spend some time
>>> thinking about the right way to do that with OpenMP.  The "profiling"
>>> part may come in useful here, as when you put your proposal together it
>>> could be useful to find algorithms that have bottlenecks that could be
>>> easily resolved with parallelism.  (Note that not all algorithms have
>>> bottlenecks that can be solved with parallelism, and algorithms heavy on
>>> linear algebra may already be effectively parallelized via the use of
>>> OpenBLAS at a lower level.)
>>>
>>> Thanks,
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Curtin    | "I was misinformed."
>>> ryan at ratml.org |   - Rick Blaine
>>>
>>
>>
>> --
>>
>> Aman Pandey
>> Junior Undergraduate, Bachelors of Technology
>> Sardar Vallabhbhai National Institute of Technology,
>>
>> Surat, Gujarat, India. 395007
>> Webpage: https://johnsoncarl.github.io/aboutme/
>> LinkedIn: https://www.linkedin.com/in/amnpandey/
>>
>
>
> --
>
> Aman Pandey
> Junior Undergraduate, Bachelors of Technology
> Sardar Vallabhbhai National Institute of Technology,
>
> Surat, Gujarat, India. 395007
> Webpage: https://johnsoncarl.github.io/aboutme/
> LinkedIn: https://www.linkedin.com/in/amnpandey/
>