[mlpack] Student looking to contribute (possibly through GSoC 2016)

Ryan Curtin ryan at ratml.org
Tue Mar 1 09:30:04 EST 2016


On Tue, Mar 01, 2016 at 03:12:07PM +0800, Shaoxiang Chen wrote:
> Hi everyone,
> 
> I am Shaoxiang Chen, a CS student (junior) at Fudan University, China. I'm
> currently working with a professor, studying in the areas of multimedia
> content analysis and machine (deep) learning.
> 
> I've read a research paper about accelerating approximate nearest neighbour
> search when the distance metric of cosine similarity is used. It achieved a
> decent speedup by quantizing vectors into binary form, making the
> computation of distances orders of magnitude faster.
> 
> I see that cosine similarity is not included in either flann's or mlpack's
> distance metrics. For one thing, it is not directly applicable to the
> space-partitioning tree structures (am I right?). And second, if all the
> vectors are L2-normalized, cosine similarity directly corresponds to L2
> distance.
> 
> I've coded the proposed method myself and got a decent speedup; the higher
> the dimension, the larger the speedup. While the method limits its use
> cases to high-dimensional data (otherwise precision drops), I think it
> might serve as a supplement to tree structures when dealing with
> high-dimensional data.
> 
> I haven't benchmarked my code against other ANN search algorithms, but the
> paper includes some benchmarks.
> 
> If, after you've read the paper, you think it's okay to include this
> algorithm in mlpack's neighbour search, I'd like to contribute and
> discuss :)
> 
> link to the paper:
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40572.pdf

Hi Shaoxiang,

Technically the cosine similarity is not a metric (it does not satisfy
the triangle inequality), so it can't be used with the trees out of the
box.  I think there are ways you can modify the cosine similarity so
that it can be used with trees; for instance, the angular distance,
which is the arccosine of the cosine similarity, does satisfy the
triangle inequality.
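
For concreteness, here is a minimal sketch of what such a metric could
look like; the class name and the shape of Evaluate() just mimic
mlpack's metric policy classes (a static Evaluate() taking two
Armadillo vectors), and none of this is existing mlpack code:

    #include <mlpack/core.hpp>
    #include <algorithm>
    #include <cmath>

    // Hypothetical angular distance: d(a, b) = arccos(cos(a, b)).
    // Unlike the raw cosine similarity, this satisfies the triangle
    // inequality (it is a true metric on the unit sphere).
    class AngularDistance
    {
     public:
      template<typename VecTypeA, typename VecTypeB>
      static double Evaluate(const VecTypeA& a, const VecTypeB& b)
      {
        const double cosine = arma::dot(a, b) /
            (arma::norm(a, 2) * arma::norm(b, 2));
        // Clamp to [-1, 1] to guard against floating-point drift.
        return std::acos(std::max(-1.0, std::min(1.0, cosine)));
      }
    };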

Another technique that can be used for fast cosine similarity search is
"fast max-kernel search", or FastMKS, a technique that mlpack provides
(though its implementation gives exact results instead of approximate
results).
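
As a rough illustration, running FastMKS with the cosine kernel might
look something like the following (a sketch from memory; check the
headers in src/mlpack/methods/fastmks/ for the exact signatures in
your version):

    #include <mlpack/core.hpp>
    #include <mlpack/core/kernels/cosine_distance.hpp>
    #include <mlpack/methods/fastmks/fastmks.hpp>

    using namespace mlpack;

    int main()
    {
      arma::mat referenceData;  // One point per column.
      data::Load("reference.csv", referenceData, true);

      // Max-kernel search with the cosine kernel amounts to
      // max-cosine-similarity search.
      fastmks::FastMKS<kernel::CosineDistance> mks(referenceData);

      arma::Mat<size_t> indices;  // Indices of the best matches.
      arma::mat kernels;          // The corresponding kernel values.
      mks.Search(5, indices, kernels);  // 5 best matches per point.
    }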

I think it might be interesting to compare your implementation with the
FastMKS implementation in mlpack, and maybe some other implementations,
to see where it gives a speedup.  And if it does give a speedup for a
good variety of datasets and situations, then we should include it, if
you are willing to take the time to incorporate it (once it's tested,
documented, optimized, etc.). :)

Another thought is that the Hamming distance is a metric, so that's an
easier comparison: you could implement the Hamming distance in
src/mlpack/core/metrics/, then use it with the existing nearest neighbor
search techniques, and compare.  (Again, the existing techniques are
exact, so that may make the comparison a bit difficult, but it would
still be interesting, I think.)
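
For concreteness, a packed-binary Hamming distance could look roughly
like this (hypothetical code, not part of mlpack; a real implementation
in src/mlpack/core/metrics/ would be templated on Armadillo vector
types like the other metric classes there):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hamming distance over binary codes packed into 64-bit words.
    class HammingDistance
    {
     public:
      static double Evaluate(const std::vector<uint64_t>& a,
                             const std::vector<uint64_t>& b)
      {
        size_t distance = 0;
        for (size_t i = 0; i < a.size(); ++i)
          distance += __builtin_popcountll(a[i] ^ b[i]);  // GCC/Clang.
        return (double) distance;
      }
    };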

If you are interested, here's a paper on FastMKS:

http://www.ratml.org/pub/pdf/2013fastmks.pdf

When I have a chance, I'll try to read through the paper you linked; it
looks interesting.

This could be a good project for GSoC: implement the algorithm, test it,
run benchmarks, and so forth.  So please do feel free to consider this
work as a summer-long project.

Let me know if I can clarify anything.

Thanks!

Ryan

-- 
Ryan Curtin    | "Hi Doggy."
ryan at ratml.org |   - Johnny


