[mlpack] Fwd: (no subject)

Aman Pandey aman0902pandey at gmail.com
Fri Mar 27 11:04:15 EDT 2020


Hi Ryan/Marcus,
I just saw the 2017 work on parallelisation by Shikhar Bhardwaj:
https://www.mlpack.org/gsocblog/profiling-for-parallelization-and-parallel-stochastic-optimization-methods-summary.html

He has done impressive work, and the documentation is excellent, I must say.
---------------------

In this email, I'll cover:

   - Which algorithms I'll be implementing
   - My thoughts on parallelisation
   - My rough plan to complete the proposed work in mlpack
   - An idea for adding something like *Federated Learning* to
   mlpack (this could be very complex, though!)


I want a clearer understanding of what I am going to do, so please check
whether I am planning correctly and whether my approach is feasible.

I will be focusing on the following algorithms during my GSoC period:
1) Random Forest
2) KNN
3) A few gradient boosting algorithms

I can either "parallelize" the algorithms according to their computational
structure (e.g., in Random Forest, I can try training its N trees in
parallel), or distribute the work with MapReduce or other distributed
computation frameworks (https://stackoverflow.com/a/8727235/9982106 lists a
few pretty well). MapReduce only works well when very little data moves
across machines, and only a few times. *This could be a reason to look at
some better alternatives.*
For example, after each iteration, derivative-based methods have to
calculate the gradient over the complete training data, which in general
requires moving the complete data to a single machine to compute the
gradient. As the number of iterations increases, this becomes a bottleneck.
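
On a single machine, OpenMP sidesteps that data movement entirely: each
thread accumulates a partial gradient over its own slice of the data, and
the partials are summed at the end. A minimal sketch, assuming a
squared-error loss and the one-point-per-column layout that
mlpack/Armadillo use; this is illustration only, not existing mlpack code:

#include <armadillo>

// Parallel gradient of a squared-error loss (illustration only).
// X holds one data point per column, as in mlpack/Armadillo.
arma::vec Gradient(const arma::mat& X, const arma::vec& y, const arma::vec& w)
{
  arma::vec grad(w.n_elem, arma::fill::zeros);

  #pragma omp parallel
  {
    // Each thread accumulates a partial gradient over its slice of the
    // columns; no data ever needs to move across machines.
    arma::vec local(w.n_elem, arma::fill::zeros);

    #pragma omp for nowait
    for (size_t i = 0; i < X.n_cols; ++i)
      local += (arma::dot(X.col(i), w) - y(i)) * X.col(i);

    #pragma omp critical
    grad += local;
  }

  return grad / X.n_cols;
}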

I am in favour of working with OpenMP before trying anything distributed.
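
For instance, the "N trees in parallel" idea above might look roughly like
this; BuildTree() and Bootstrap() are hypothetical placeholders here, not
mlpack's actual internals:

// Tree-level parallelism for a random forest (sketch only).
std::vector<Tree> trees(numTrees);

#pragma omp parallel for schedule(dynamic)
for (size_t t = 0; t < numTrees; ++t)
{
  // Each tree trains independently on its own bootstrap sample,
  // so the iterations need no synchronization at all.
  trees[t] = BuildTree(Bootstrap(dataset, labels));
}

schedule(dynamic) helps here because individual trees can take uneven time
to build.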

Something similar can occur with tree-based algorithms, where the split
has to be calculated over the complete data repeatedly.
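
That split search itself is also parallel-friendly: each thread can score
a subset of the candidate dimensions, with a small critical section keeping
the best one. Again just a sketch; BestSplitOnDimension() is a hypothetical
helper:

double bestGain = -std::numeric_limits<double>::infinity();
size_t bestDim = 0;

#pragma omp parallel for
for (size_t d = 0; d < data.n_rows; ++d)
{
  // Score the best split on dimension d (hypothetical helper).
  const double gain = BestSplitOnDimension(data, labels, d);

  #pragma omp critical
  {
    if (gain > bestGain) { bestGain = gain; bestDim = d; }
  }
}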

I would follow this "rough" timeline:
*(I haven't made it complex or unrealistic.)*

1) Profiling algorithms to find their bottlenecks, training on a variety
of example datasets (small to large, which makes a big difference in the
computation; see the timing sketch after this list) - *Week 1-2*
2) Working on at least one gradient boosting algorithm, to check that my
approach is sound and fully in accordance with mlpack; in parallel,
profiling and designing the parallelism for Random Forest - *Week 2-3*
3) Working on Random Forest and KNN - *Week 4-8*
4) *Building out distributed computing alternatives to MapReduce.* If this
works well, it could transform mlpack into an actual *distributed killer*.
However, working haphazardly on different algorithms with different
distributed computation techniques could make mlpack development chaotic.
(*I still have to be sure about this.*)
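
For the week 1-2 profiling, even a quick wall-clock comparison across
dataset sizes and thread counts can locate bottlenecks before reaching for
perf or gprof. A minimal sketch, where TrainModel() stands in for whichever
mlpack method is being profiled:

#include <chrono>
#include <iostream>

// Time one run of an arbitrary callable, in seconds.
template<typename F>
double TimeIt(F&& f)
{
  const auto start = std::chrono::steady_clock::now();
  f();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

// Usage: run with OMP_NUM_THREADS=1, 2, 4, ... and compare.
// std::cout << TimeIt([&]() { TrainModel(dataset); }) << " s\n";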

----------------- *An additional idea* -----------------
I don't know if this has been discussed before, as I have been away from
mlpack for almost a year.
Have you ever thought of adding *federated learning* support to mlpack?
Something like *PySyft* (https://github.com/OpenMined/PySyft) could bring
a tremendous improvement to mlpack, and would really help people working
on big deep learning models, as well as researchers.
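
To make the idea concrete: the core of federated learning is that clients
train locally and only model parameters travel, never the raw data. A
bare-bones federated averaging round might look like this (pure
illustration; the Client type and its LocalUpdate() are invented for the
sketch, and nothing like this exists in mlpack today):

#include <armadillo>
#include <vector>

// One round of federated averaging: each client refines the global
// parameters on its private data, and only the parameter vectors
// (never the data) come back to be averaged.
arma::vec FederatedRound(const arma::vec& global,
                         const std::vector<Client>& clients)
{
  arma::vec averaged(global.n_elem, arma::fill::zeros);

  for (const Client& c : clients)
    averaged += c.LocalUpdate(global);  // hypothetical local training step

  return averaged / static_cast<double>(clients.size());
}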


Please let me know if we can discuss this idea!

--------------------------------------
The reason I chose mlpack is that I know its codebase, having tried mlpack
last year, and of course the team is awesome; I have always found good
support from everyone here.

Also, amid this COVID-19 situation, I *will not* be able to complete the
internship I earned at *NUS Singapore*, so I need something of that level
to work on and make use of this summer.
I am very comfortable with any kind of code; for example, I worked on
completely unfamiliar Haskell code as an undergraduate researcher at IITK
(one of the finest CSE departments in India).
Also, my knowledge of advanced C++ should help me be quick and efficient.

I have started drafting a proposal; please let me know your thoughts.

I will update you within the next two days.

----
Please be safe!
Looking forward to a wonderful experience with mlpack. :)



*Aman Pandey*
amanpandey.codes <http://amanpandey.codes>

On Mon, Mar 16, 2020 at 7:52 PM Aman Pandey <aman0902pandey at gmail.com>
wrote:

> Hi Ryan,
> I think that is enough information.
> Thanks a lot.
> I tried mlpack last year, on QGMM; unfortunately, I couldn't make it.
>
> Will try once again, with a possibly better proposal. ;)
> In parallelisation this time.
>
> Thanks.
> Aman Pandey
> GitHub Username: johnsoncarl
>
> On Mon, Mar 16, 2020 at 7:33 PM Ryan Curtin <ryan at ratml.org> wrote:
>
>> On Sun, Mar 15, 2020 at 12:38:09PM +0530, Aman Pandey wrote:
>> > Hey Ryan/Marcus,
>> > Are there any current coordinates to start with, in "Profiling for
>> > Parallelization"?
>> > I want to know if any, to avoid any redundant work.
>>
>> Hey Aman,
>>
>> I don't think that there are any particular directions.  You could
>> consider looking at previous messages from previous years in the mailing
>> list archives (this project has been proposed in the past and there has
>> been some discussion).  My suggestion would be to find some algorithms
>> that you think could be useful to parallelize, and spend some time
>> thinking about the right way to do that with OpenMP.  The "profiling"
>> part may come in useful here, as when you put your proposal together it
>> could be useful to find algorithms that have bottlenecks that could be
>> easily resolved with parallelism.  (Note that not all algorithms have
>> bottlenecks that can be solved with parallelism, and algorithms heavy on
>> linear algebra may already be effectively parallelized via the use of
>> OpenBLAS at a lower level.)
>>
>> Thanks,
>>
>> Ryan
>>
>> --
>> Ryan Curtin    | "I was misinformed."
>> ryan at ratml.org |   - Rick Blaine
>>
>
>
> --
>
> Aman Pandey
> Junior Undergraduate, Bachelors of Technology
> Sardar Vallabhbhai National Institute of Technology,
>
> Surat, Gujarat, India. 395007
> Webpage: https://johnsoncarl.github.io/aboutme/
> LinkedIn: https://www.linkedin.com/in/amnpandey/
>


-- 

Aman Pandey
Junior Undergraduate, Bachelors of Technology
Sardar Vallabhbhai National Institute of Technology,

Surat, Gujarat, India. 395007
Webpage: https://johnsoncarl.github.io/aboutme/
LinkedIn: https://www.linkedin.com/in/amnpandey/

