[mlpack] Student looking to contribute (possibly through GSoC 2016)

Shaoxiang Chen forwchen at gmail.com
Tue Mar 1 10:53:45 EST 2016


Hi Ryan,

Thanks for clarifying things for me. It's good to know there are techniques
like FastMKS.

The main idea of that paper I linked is to reduce the cost of computing
similarity. It does not build any kind of index structures. I think it
might be used with other algorithms that utilize index structures.

Anyway, I will get started with mlpack first and then run the benchmarks.
I'll also read through the paper on FastMKS.

Since I'll not be able to work full time during May~July, I should
contribute on github. Hope to discuss with you when I encounter problems :)

Regards

On Tue, Mar 1, 2016 at 10:30 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Tue, Mar 01, 2016 at 03:12:07PM +0800, Shaoxiang Chen wrote:
> > Hi everyone,
> >
> > I am Shaoxiang Chen, a CS student(junior) of Fudan University, China. I'm
> > currently working with a professor, studying in the areas of multimedia
> > content analysis and machine(deep) learning.
> >
> > I've read a research paper about accelerating approximate nearest
> neighbour
> > search when the distance metric of cosine similarity is used. It achieved
> > decent speed up by quantizing vectors into binary form, making the
> > computation of distance orders of magnitude faster.
> >
> > I see that cosine similarity is not included in neither flann nor
> mlpack's
> > distance metrics. For one thing, it is not directly applicable to the
> space
> > partitioning tree structures(am I right?). And second, if all the vectors
> > are l2-normed, cosine similarity directly corresponds to l2 distance.
> >
> > I've coded the proposed method myself and got decent speed up. The higher
> > dimension, the larger speed up. While the method itself limits its use
> > cases to high dimensional data(otherwise precision drops), I think it
> might
> > serve as a supplement for tree structures when dealing with high
> > dimensional data.
> >
> > I haven't run a benchmark of this my code against other ann search
> > algorithms, but the paper includes some benchmarks.
> >
> > If after you've read the paper you think it's ok to include this
> algorithm
> > in mlpack's neighbour-search, I'd like to contribute and discuss :)
> >
> > link to the paper:
> >
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40572.pdf
>
> Hi Shaoxiang,
>
> Technically the cosine similarity is not a metric, so it can't be used
> out of the box.  I think there are ways you can modify the cosine
> similarity so that it can be used with trees, like for instance the
> angular cosine similarity.
>
> Another technique that can be used for fast cosine similarity search is
> "fast max-kernel search", or FastMKS, a technique that mlpack provides
> (though its implementation gives exact results instead of approximate
> results).
>
> I think it might be interesting to compare your implementation with the
> FastMKS implementation in mlpack, and maybe some other implementations,
> to see where it gives speedup.  And if it does give speedup for a good
> variety of datasets and situations, then we should include it if you are
> willing to take the time to incorporate it (once it's tested,
> documented, optimized, etc.). :)
>
> Another thought is that the Hamming distance is a metric, so that's an
> easier comparison: you could implement the Hamming distance in
> src/mlpack/core/metrics/, then use it with the existing nearest neighbor
> search techniques, and compare.  (Again the existing techniques are
> exact, so that may make the comparison a little bit difficult, but it
> would still be interesting, I think.)
>
> If you are interested, here's a paper on FastMKS:
>
> http://www.ratml.org/pub/pdf/2013fastmks.pdf
>
> When I have a chance, I'll try and read through the paper you
> linked---it looks to be interesting.
>
> This could be a good project for GSoC: implement the algorithm, test it,
> run benchmarks, and so forth.  So please do feel free to consider this
> work as a summer-long project.
>
> Let me know if I can clarify anything.
>
> Thanks!
>
> Ryan
>
> --
> Ryan Curtin    | "Hi Doggy."
> ryan at ratml.org |   - Johnny
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.cc.gatech.edu/pipermail/mlpack/attachments/20160301/56f70f0e/attachment-0002.html>


More information about the mlpack mailing list