[mlpack] Implementation of HOGWILD! in mlpack

Shikhar Bhardwaj shikharbhardwaj68 at gmail.com
Sat Jun 24 08:00:57 EDT 2017


Thanks a lot for the insights, Ryan. I noticed I sent this as a reply
directly to you, instead of the mailing list. Sorry for the extra mail.

The update step described in the algorithm touches only those components of
the decision variable that are related to the sampled term of the loss
function (i.e. the components that term affects). Updating the entire
decision variable (in the vectorized fashion currently used by the SGD
update policies) would not be lock-free, which the algorithm requires.

I thought of two approaches to this:

1. Update each component of the decision variable from the sparse gradient,
not with a vectorized implementation but by iterating over the gradient and
applying the update only to its non-zero entries. Each such update is atomic
(a single floating-point subtraction). A rough sketch of this is given after
the list.
2. Change the requirements of the function interface to provide a
*Components* function, which takes a data point and returns the list of
components of the decision variable that need to be updated. The optimizer
then updates only the components returned by this function.
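
Here is a rough sketch of what the first approach could look like, just to
make the idea concrete (the SparseUpdate name, the arma::sp_mat gradient
type and the OpenMP atomic are assumptions on my part, not existing mlpack
code):

#include <armadillo>

// Approach 1: walk only the non-zero entries of a sparse gradient and
// update the corresponding components of the iterate with a single
// atomic subtraction each, without taking any locks.
void SparseUpdate(arma::mat& iterate,
                  const arma::sp_mat& gradient,
                  const double stepSize)
{
  for (arma::sp_mat::const_iterator it = gradient.begin();
       it != gradient.end(); ++it)
  {
    double& value = iterate(it.row(), it.col());
    #pragma omp atomic
    value -= stepSize * (*it);
  }
}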

I am trying to work out whether there are cases where one of these would be
better than the other. (The first is clearly better from an API perspective,
because the requirements on the function interface remain unchanged.) Will
there be cases where the second approach does less work?
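
For the second approach, the optimizer side might look roughly like this
(the Components name is hypothetical, and the Gradient signature just
follows the existing DecomposableFunctionType convention; none of this is
settled):

#include <armadillo>

// Approach 2: ask the function which components the i-th term of the loss
// can touch, then update only those entries of the iterate.
template<typename SparseFunctionType>
void UpdateWithComponents(SparseFunctionType& function,
                          arma::mat& iterate,
                          const size_t i,
                          const double stepSize)
{
  arma::mat gradient;
  function.Gradient(iterate, i, gradient);

  const arma::uvec comps = function.Components(i);
  for (size_t j = 0; j < comps.n_elem; ++j)
  {
    double& value = iterate(comps[j]);
    #pragma omp atomic
    value -= stepSize * gradient(comps[j]);
  }
}

If the support of each term is known ahead of time, this avoids scanning a
full sparse gradient, but it pushes more responsibility onto whoever
implements the function.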

On Fri, Jun 23, 2017 at 7:11 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Wed, Jun 21, 2017 at 02:20:31AM +0530, Shikhar Bhardwaj wrote:
> > Hello everyone!
> >
> > Yannis and I have been discussing over the past few days about the
> > implementation of Parallel SGD using the ideas presented in the HOGWILD!
> > paper (https://arxiv.org/pdf/1106.5730.pdf). We also have at hand an
> > implementation of the algorithm by the authors of the paper
> > (http://i.stanford.edu/hazy/victor/Hogwild/) and a talk given by one of
> > the authors about the algorithm
> > (https://www.youtube.com/watch?v=l5JqUvTdZts) (the part relevant to the
> > parallel algorithm starts ~26 min).
> >
> > Overall, we've been able to identify the fact that the algorithm
> > (with its free for all update model and simple work sharing
> > description) does not contain a lot of complexity. The paper describes
> > a hypergraph structure on the loss function to be minimised. However,
> > a sizeable amount of complexity goes into the implementation of the
> > description of this loss function (which depends on the particular
> > problem being solved by the Optimizer). The main concern we have right
> > now is to separate this complexity (of defining the loss function)
> > from the implementation of the Optimizer, while maintaining reasonable
> > ease of use. The iterative procedure of the algorithm (which is shared
> > across the processors) uses parts of this hypergraph structure.
> >
> > Do we need to introduce another function interface (SparseFunctionType),
> > which allows us to maintain and ask for this structure in the Optimizer?
> > The author's implementation is quite difficult to understand. Any
> > insights would be helpful.
>
> Hey Shikhar,
>
> It's been a little while since I've thought about Hogwild, so I had to
> take a little while to refresh my memory a bit.
>
> I agree that the algorithm is simple, but the paper has a lot more
> descriptions and proofs and exposition than I think is completely
> necessary (but that's just my opinion).  I think that the hypergraph
> induced by the different cost functions is just one way to see things,
> and I don't think Hogwild is restricted to only those situations for
> which those hypergraphs are known.  Unless I've misunderstood, the
> actual hypergraph itself doesn't need to be known.
>
> So under the assumption of some separable cost function
>
> f(x) = sum_{i = 0}^{k} f_i(x),
>
> I think that all we need to hope is that each f_i(x) that we might
> sample has sparse support in x (that is, if we take a gradient step
> based only on f_i(x), only a few components of x will be updated).
>
> We already have the DecomposableFunctionType abstraction, which I think
> will work for Hogwild.  But I think that users should be aware that
> Hogwild isn't useful for every problem---only those that are both
> decomposable and sparse.  You certainly could use Hogwild in a dense
> update situation, it just may not work so well...
>
> One example in mlpack where things should work well is high-dimensional
> sparse logistic regression, since the gradient induced by each point in
> the dataset should be pretty sparse.
>
> But, it's possible there's something I overlooked, or some reason that
> the existing DecomposableFunctionType abstraction wouldn't work... :)
>
> By the way, there is some ongoing discussion about potentially changing
> the DecomposableFunctionType abstraction to better support batch
> updates: https://github.com/mlpack/mlpack/pull/1034 , but I don't expect
> any changes that come out of that discussion to significantly impact
> however you choose to implement Hogwild.
>
> --
> Ryan Curtin    | "I like this game."
> ryan at ratml.org |   - Coach
>



-- 
Shikhar Bhardwaj