[mlpack] GSoC 2016 - Project Questions

Yannis Mentekidis mentekid at gmail.com
Mon Mar 7 08:03:15 EST 2016


Hello again,

Sorry for the late reply.

Yes, definitely other forms of LSH or improvement to the existing LSH
> implementation would be a good way to go.  Dong's paper is where LSHKIT
> came from, right?  If so it might be interesting to implement that, but
> only if we are pretty sure we can outperform LSHKIT (i.e. if we can make
> a faster implementation).
>

From what I can recall, LSHKIT is quite well optimized, so it might be
hard to outperform. I'm not sure it has been parallelized, though, so that
might be one way to achieve it (if you don't consider that cheating :P).


>
> One thing I found with other LSH implementations was that most of them
> are a little difficult to work with... a user is most likely going to
> want to say "I want k approximate nearest neighbors with a desired
> maximum of e% relative error", so they specify k and e and then the
> algorithm returns results with the desired error bound.  But some LSH
> packages do not offer this kind of interface and can be very hard to
> work with (E2LSH is one example).
>

Yes, that is true; most have a large number of parameters that people
unfamiliar with the algorithm may not understand at first. That's why I
proposed implementing something similar to Dong's paper, since they propose
a way to identify sets of parameters that will yield results near some
specified recall/selectivity.
Of course the tuning step will probably be slower than the naive kNN
implementation, but it is reasonable to run the tuning program once for a
given dataset, produce favorable sets of parameters, and store those for
subsequent runs of LSH.
An alternative would be to develop some heuristic that maps neighbor error
rates to algorithm parameters. That might be tricky, though, since the
error depends heavily on the dataset, but it would be an interesting
endeavour.
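
To make the idea concrete, here is a minimal sketch of the tuning loop I
have in mind. I'm assuming mlpack's current LSHSearch interface
(constructor taking the dataset, number of projections, and number of
tables); the candidate grids, the target recall, and the Recall() helper
are made-up illustrations, not mlpack code:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/lsh/lsh_search.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>
    #include <iostream>

    using namespace mlpack::neighbor;

    // Fraction of the true k-nearest neighbors that the approximate
    // search recovered, averaged over all query points.
    double Recall(const arma::Mat<size_t>& approx,
                  const arma::Mat<size_t>& exact)
    {
      size_t hits = 0;
      for (size_t q = 0; q < exact.n_cols; ++q)
        for (size_t i = 0; i < exact.n_rows; ++i)
          for (size_t j = 0; j < approx.n_rows; ++j)
            if (approx(j, q) == exact(i, q)) { ++hits; break; }
      return (double) hits / exact.n_elem;
    }

    int main()
    {
      arma::mat data;
      mlpack::data::Load("dataset.csv", data);
      const size_t k = 10;

      // Ground truth, computed once by the exact search.
      AllkNN exactSearch(data);
      arma::Mat<size_t> trueNeighbors;
      arma::mat trueDistances;
      exactSearch.Search(k, trueNeighbors, trueDistances);

      // Grid-search a few (projections, tables) pairs and keep the
      // first one that reaches the target recall.
      const double targetRecall = 0.9;
      for (size_t numProj : {10, 20, 30})
      {
        for (size_t numTables : {10, 20, 30})
        {
          LSHSearch<> lsh(data, numProj, numTables);
          arma::Mat<size_t> neighbors;
          arma::mat distances;
          lsh.Search(k, neighbors, distances);
          if (Recall(neighbors, trueNeighbors) >= targetRecall)
          {
            std::cout << "numProj = " << numProj << ", numTables = "
                << numTables << " reaches the target recall." << std::endl;
            return 0;
          }
        }
      }
      std::cout << "No candidate reached the target recall." << std::endl;
    }

The expensive part (the exact search) happens only once, which is why
running the tuner offline and storing the winning parameters seems
reasonable to me.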


>
> Here are some interesting runtime comparisons to LSHKIT that I did last
> year (on page 9 and 10); I compared some tree-based algorithms and then
> also multiprobe LSH:
>
> http://ratml.org/pub/pdf/2015faster.pdf
>
> I think that probably LSH would be faster in higher dimensions, but that
> wasn't the region of interest for the study I linked to.  Also that
> paper might be a decent place to start reading about dual-tree nearest
> neighbor search (and the papers that paper references might be
> interesting as well).
>

I've put this paper on my reading list, as well as "An Optimal Algorithm
for Approximate Nearest Neighbor Searching in Fixed Dimensions", which you
linked to in another mail. I believe I will be able to read them by the
time I start writing my proposal.


Anyway, maybe you would be the right person for this task?
>
> https://github.com/mlpack/mlpack/issues/261
>
> I think that if you have spent a lot of time studying and thinking about
> LSH, then this one may not be so difficult.
>

** I'll respond to this at the end of my email.


>
> > *Another question* is inspired from an old ticket I found in GitHub,
> where
> > you mention interest in parallelization. I would also be interested in,
> as
> > a side goal, implementing a parallel version of a few of the approximate
> > algorithms. Is that something interesting to you guys?
>
> Yes, parallelization is interesting, but should probably be done via
> OpenMP.  It can be really difficult to read parallel code, so if we make
> the code too complex, then suddenly developers need to have good
> knowledge of C++, parallel programming, and also machine learning.  It's
> already hard enough to find people at the intersection of C++ and
> machine learning. :)
>

Yes, OpenMP is probably the best solution: it is much easier to read (not
to mention write) than Pthreads code, and it doesn't require special
compiler or runtime support the way Intel Cilk and TBB do.
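
For instance, parallelizing an independent loop with OpenMP is a single
pragma on top of the serial code. A toy illustration (not mlpack code) of
why it stays readable:

    #include <cmath>
    #include <iostream>
    #include <vector>

    int main()
    {
      const long n = 1000000;
      std::vector<double> results(n);

      // The loop body is untouched; without -fopenmp the pragma is
      // simply ignored and the program runs serially.
      #pragma omp parallel for
      for (long i = 0; i < n; ++i)
        results[i] = std::sqrt((double) i);

      std::cout << results[n - 1] << std::endl;
      return 0;
    }

A developer who knows only the serial code can still follow it, which I
think addresses the readability concern.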

> You're more than welcome to submit two proposals, but often it's better
> to focus on just one, since we can only accept one proposal per student.
> :)

Then I will probably stick with one proposal and try to make it as good as
possible. :)

** Regarding LSH Tests:
The "impossibly hard" label makes this look scary, admittedly :P
I'll look at the test and current LSH code and see what I can think of, in
the meantime could you provide some help regarding where I can find more
documentation for the test function? I haven't figured out how to run only
specific tests instead of all 616 of them.

At the moment, the idea I have is this:
One typically tests how "good" an approximate algorithm is by comparing it
against some base case. For example, I've often used the SIFT dataset
(Evaluation of Approximate nearest neighbors: large datasets
<http://corpus-texmex.irisa.fr/>) to compute metrics like selectivity and
recall.
Of course this doesn't verify that the implementation is correct, but it
produces results we can judge as reasonable or unreasonable.

So maybe an approach would be to run our LSH implementation a few times
with a few parameter sets we've selected, and see how close its results
get to some benchmark we know to be correct? Problems with this idea:
a) It requires packing a benchmark file and the SIFT dataset along with
the rest of the software.
b) We need the benchmarks themselves, which we could get either from an
implementation we know to be good - for example, LSHKIT - or by running
our LSH implementation now and re-running it every time we change the
code.
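
Concretely, I imagine something like the rough sketch below, written in
Boost.Test style since that's what mlpack's test suite uses. The dataset,
LSH parameters, and recall threshold are made-up placeholders, and
Recall() is the helper from the tuning sketch earlier in this mail:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/lsh/lsh_search.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>
    #include <boost/test/unit_test.hpp>

    using namespace mlpack::neighbor;

    BOOST_AUTO_TEST_CASE(LSHRecallBoundTest)
    {
      // Small synthetic dataset instead of shipping SIFT with the code.
      arma::mat data;
      data.randu(10, 1000);
      const size_t k = 5;

      // Ground truth from the exact search.
      AllkNN exact(data);
      arma::Mat<size_t> trueNeighbors;
      arma::mat trueDistances;
      exact.Search(k, trueNeighbors, trueDistances);

      // Average recall over several randomized LSH runs.
      double totalRecall = 0.0;
      const size_t trials = 5;
      for (size_t t = 0; t < trials; ++t)
      {
        LSHSearch<> lsh(data, 10 /* projections */, 20 /* tables */);
        arma::Mat<size_t> neighbors;
        arma::mat distances;
        lsh.Search(k, neighbors, distances);
        totalRecall += Recall(neighbors, trueNeighbors);
      }

      // Loose bound: LSH is randomized, so this should only flag
      // egregious regressions, not normal run-to-run variation.
      BOOST_REQUIRE_GT(totalRecall / trials, 0.5);
    }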

Is this what you had in mind when you mentioned probabilistic tests?


Thanks for the advice, I'll keep in touch!

Yannis