[mlpack] GSOC 2016 Aspirant - Parallel Stochastic Optimisation Methods

Ryan Curtin ryan at ratml.org
Fri Mar 4 20:46:14 EST 2016


On Thu, Mar 03, 2016 at 11:11:05PM +0530, Aditya Sharma wrote:
> Hi Ryan,
> 
> I read the Hogwild! paper, which, to my understanding, gives
> theoretical convergence guarantees for parallelizing SGD in a
> shared-memory model without worrying about locking, provided the data
> is large enough and updates happen atomically.
> 
> I also went through your implementations of SGD and mini-batch SGD. I think
> it would be fairly easy to OpenMP-ize the current implementations along the
> lines of Hogwild!.
> 
> But, in my opinion, if we just use multi-threading, mlpack might not be
> very attractive for researchers working with truly large-scale data.
> 
> I think it would be a good idea if we could add support for GPU
> processing to the existing optimizers. I have prior experience working
> with CUDA, and I think I would be able to add a CUDA version of Hogwild!
> built on the existing SGD implementations in mlpack over the summer, so
> that researchers with little knowledge of CUDA can directly use mlpack
> to speed up their code without worrying about what's under the hood
> (much like what Theano does for Python).
> 
> Another direction could be to add support for distributed computing, by
> either linking mlpack to the Parameter Server from CMU
> (http://parameterserver.org) or integrating the MPI-based parameter
> server that I've built, and parallelizing the existing SGD and
> mini-batch code in mlpack along the lines of Downpour SGD (similar to
> the TensorFlow and DistBelief systems developed by Google).
> 
> The distributed implementation would be a bit more complicated, but I
> think I should be able to do it over the summer, as that's exactly what
> my research currently focuses on.
> 
> I would love to know your thoughts and suggestions.

Hi Aditya,

We could definitely use OpenMP on the current SGD implementations, but
we would have to be careful to ensure that this wouldn't modify the
result.  Hogwild! is almost certainly easiest to implement in OpenMP.
(Actually, it's simple enough that a Hogwild! implementation alone
would, I think, be too little work for a GSoC project, but it could
definitely be a component of a larger project.)
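
To give a sense of the scale involved: a bare-bones Hogwild!-style
loop in OpenMP could look something like the sketch below.  (This is
not existing mlpack code; HogwildSketch and its parameters are made up
for illustration, and I'm assuming the same DecomposableFunctionType
interface that SGD already expects.)

#include <armadillo>

// Sketch only: a lock-free, Hogwild!-style SGD loop.  `function` is
// assumed to provide NumFunctions() and Gradient(iterate, i, gradient),
// like the decomposable functions SGD already takes.
template<typename DecomposableFunctionType>
void HogwildSketch(DecomposableFunctionType& function,
                   arma::mat& iterate,   // shared across all threads
                   const double stepSize,
                   const size_t maxIterations)
{
  const long n = (long) function.NumFunctions();

  #pragma omp parallel
  {
    arma::mat gradient; // thread-local scratch space

    #pragma omp for
    for (long i = 0; i < (long) maxIterations; ++i)
    {
      // Deliberately no locking around the shared update; Hogwild!'s
      // result is that these races are tolerable when the problem is
      // sparse enough.
      function.Gradient(iterate, (size_t) (i % n), gradient);
      iterate -= stepSize * gradient;
    }
  }
}

A nice property of this approach is that the pragmas compile away to
the plain serial loop when OpenMP is disabled, so we could keep a
single code path.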

The problem with CUDA is that you would have to ship the data back and
forth to the GPU every iteration, because the optimizer is separate
from the function it is optimizing.  The optimizer only makes
calls to function.Evaluate() and function.Gradient(), and it's not
reasonable to expect that every Evaluate() and Gradient() call will be
written for GPUs.  This means that the only step that you could put on a
GPU would realistically be the update step, and given the huge overhead
of the communication cost, I'm doubtful that we'd see any speedup.
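
To make the issue concrete, the optimizer's loop is shaped roughly
like this (paraphrased, not verbatim mlpack source):

#include <armadillo>

// Rough shape of the SGD inner loop.  Everything the optimizer knows
// about the problem arrives through Evaluate() and Gradient(), which
// are user-supplied and run on the CPU.
template<typename DecomposableFunctionType>
double SGDSketch(DecomposableFunctionType& function,
                 arma::mat& iterate,
                 const double stepSize,
                 const size_t maxIterations)
{
  arma::mat gradient;
  for (size_t i = 0; i < maxIterations; ++i)
  {
    // Computed by user code on the CPU; we can't assume an arbitrary
    // DecomposableFunctionType has a GPU implementation.
    function.Gradient(iterate, i % function.NumFunctions(), gradient);

    // The update: the only piece the optimizer itself controls, and
    // so the only realistic candidate for the GPU.  Offloading just
    // this line means copying `gradient` to the device and `iterate`
    // back out on every single iteration.
    iterate -= stepSize * gradient;
  }

  // Final objective, again through CPU-side Evaluate() calls.
  double objective = 0.0;
  for (size_t i = 0; i < function.NumFunctions(); ++i)
    objective += function.Evaluate(iterate, i);
  return objective;
}

So the GPU would only ever see the single update line in the middle,
and everything it needs would have to cross the bus each iteration.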

It's a very hard challenge to support GPUs while still keeping the
algorithms simple enough to be maintained.

I think the same thing is true for MPI; the code written for MPI can end
up being very complex and hard to maintain.  Here we have another
problem: mlpack has no support for distributed matrices or distributed
problems of any form (and in general isn't aimed at that use case;
there are arguably better tools for it, such as Spark).

I don't mean to say these ideas are impossible: what you've suggested is
a set of really great improvements and ideas.  But we would need to do a
lot of thinking to figure out how they would fit into the core
abstractions of mlpack, how we can preserve the basic interface we have
now, and (maybe most importantly) how we can keep the code simple.

Thanks,

Ryan

-- 
Ryan Curtin    | "Reprogram him!"
ryan at ratml.org |   - Master Control Program


