[mlpack] GSoC 2016 - Project Questions

Ryan Curtin ryan at ratml.org
Tue Mar 8 09:41:06 EST 2016


On Mon, Mar 07, 2016 at 01:03:15PM +0000, Yannis Mentekidis wrote:
> Hello again,
> 
> sorry for the late reply
> 
> Yes, definitely other forms of LSH or improvement to the existing LSH
> > implementation would be a good way to go.  Dong's paper is where LSHKIT
> > came from, right?  If so it might be interesting to implement that, but
> > only if we are pretty sure we can outperform LSHKIT (i.e. if we can make
> > a faster implementation).
> 
> From what I can recall, LSHKIT is quite well optimized, so it might be a
> hard task to outperform it. I'm not sure if it has been parallelized,
> though, so that might be a way to achieve that (if you don't consider that
> cheating :P).

Parallelization is fine; I don't think LSHKIT is parallel.  OpenMP could
be a good way to do that, since there will probably be a lot of sections
that are embarrassingly parallel.  I see that LSHKIT uses floats
internally, so if we test the two implementations against each other,
we'll probably want to use floats for mlpack's test (instead of the
default double type that's used for most matrices).
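
For instance, the loop over query points is a natural candidate; here's
a minimal sketch of the idea (SearchSingleQuery() is a hypothetical
placeholder, not mlpack's actual API):

#include <omp.h>
#include <armadillo>

// Hypothetical placeholder for the per-query LSH work: hash the query,
// scan the matching buckets, and keep the k best candidates.  It only
// writes to column i, so no locking is needed.
void SearchSingleQuery(const arma::fmat& queries, const size_t i,
                       arma::Mat<size_t>& neighbors, arma::fmat& distances);

void SearchAllQueries(const arma::fmat& queries,
                      arma::Mat<size_t>& neighbors,
                      arma::fmat& distances)
{
  // Each query is independent of the others, so this loop is
  // embarrassingly parallel.
  #pragma omp parallel for
  for (long i = 0; i < (long) queries.n_cols; ++i)
    SearchSingleQuery(queries, (size_t) i, neighbors, distances);
}

(Note the arma::fmat type, matching the float point above.)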

> > One thing I found with other LSH implementations was that most of them
> > are a little difficult to work with... a user is most likely going to
> > want to say "I want k approximate nearest neighbors with a desired
> > maximum of e% relative error", so they specify k and e and then the
> > algorithm returns results with the desired error bound.  But some LSH
> > packages do not offer this kind of interface and can be very hard to
> > work with (E2LSH is one example).
> 
> Yes, that is true, most have a big number of parameters that people not
> familiar with the algorithm might not understand at first. That's why I
> proposed implementing something similar to Dong's paper, since they propose
> a way to identify sets of parameters that will yield results near some
> specified recall/selectivity.
> Of course this is probably going to be slower than the naive kNN
> implementation, but it is reasonable to run the tuning program once for a
> given dataset, produce favorable sets of parameters, and store these for
> further runs of LSH.
> An alternative would be developing some heuristic that maps neighbor error
> rates to algorithm parameters. This might be tricky, though, since the
> error depends highly on the dataset; still, it would be an interesting
> endeavour.

Do you know if anyone has tried to map neighbor error rates to algorithm
parameters?  It would be interesting to try.

I think that Dong's paper would definitely be nice to implement and add
to mlpack, specifically because of the large number of parameters and
the auto-tuning.
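
Just to make the "tune once, store the parameters" workflow you describe
concrete, here's a rough sketch of what a (much dumber than Dong's
method) grid-search tuner could look like.  The ComputeRecall() helper
and all the parameter values are made up, and the LSHSearch constructor
arguments are from memory, so double-check against the actual headers:

#include <iostream>
#include <mlpack/core.hpp>
#include <mlpack/methods/lsh/lsh_search.hpp>
#include <mlpack/methods/neighbor_search/neighbor_search.hpp>

using namespace mlpack::neighbor;

// Hypothetical helper: fraction of the true neighbors that LSH found.
double ComputeRecall(const arma::Mat<size_t>& trueN,
                     const arma::Mat<size_t>& lshN)
{
  size_t found = 0;
  for (size_t q = 0; q < trueN.n_cols; ++q)
    for (size_t j = 0; j < trueN.n_rows; ++j)
      for (size_t l = 0; l < lshN.n_rows; ++l)
        if (trueN(j, q) == lshN(l, q)) { ++found; break; }
  return (double) found / trueN.n_elem;
}

void TuneLSH(const arma::mat& sample, const size_t k,
             const double targetRecall)
{
  // Ground truth from exact search on the sample.
  AllkNN exact(sample);
  arma::Mat<size_t> trueNeighbors;
  arma::mat trueDistances;
  exact.Search(k, trueNeighbors, trueDistances);

  // Dumb grid search, cheapest configurations first; Dong's method
  // models the data instead of brute-forcing the grid.
  for (size_t tables : { 5, 10, 20 })
  {
    for (size_t proj : { 10, 20, 40 })
    {
      LSHSearch<> lsh(sample, proj, tables);
      arma::Mat<size_t> neighbors;
      arma::mat distances;
      lsh.Search(k, neighbors, distances);

      if (ComputeRecall(trueNeighbors, neighbors) >= targetRecall)
      {
        // Acceptable configuration found; store it for later runs.
        std::cout << proj << " projections, " << tables << " tables."
                  << std::endl;
        return;
      }
    }
  }
}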

> > Here are some interesting runtime comparisons to LSHKIT that I did last
> > year (on pages 9 and 10); I compared some tree-based algorithms and then
> > also multiprobe LSH:
> >
> > http://ratml.org/pub/pdf/2015faster.pdf
> >
> > I think that probably LSH would be faster in higher dimensions, but that
> > wasn't the region of interest for the study I linked to.  Also that
> > paper might be a decent place to start reading about dual-tree nearest
> > neighbor search (and the papers that paper references might be
> > interesting as well).
> 
> I've put this paper on my reading list, as well as "An Optimal Algorithm
> for Approximate Nearest Neighbor Searching in Fixed Dimensions" which you
> linked to in another mail. I believe I will be able to read them by the
> time I start writing my proposal.

Great!  I think the Arya paper is really nice; I enjoyed reading it.
Let me know if there's anything in the "Faster dual-tree traversal for
nearest neighbor search" paper that I can clarify.

> > You're more than welcome to submit two proposals, but often it's better
> > to focus on just one, since we can only accept one proposal per student.
> > :)
> >
> Then I will probably stick with one proposal and try to make it as good as
> possible :)
> 
> ** Regarding LSH Tests:
> The "impossibly hard" label makes this look scary, admittedly :P
> I'll look at the test and current LSH code and see what I can think of; in
> the meantime, could you provide some help regarding where I can find more
> documentation for the test function? I haven't figured out how to run only
> specific tests instead of all 616 of them.

Ah, yeah, it's the Boost Unit Test Framework, so its command-line
options are documented here:
http://www.boost.org/doc/libs/1_46_1/libs/test/doc/html/utf/user-guide/runtime-config/reference.html

You could run just the LSHTest suite like this:

$ bin/mlpack_test -t LSHTest

or you could run just one single test case from the LSHTest suite like
this:

$ bin/mlpack_test -t LSHTest/LSHSearchTest
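
I think you can also turn up Boost's log level if you want to see each
test case reported as it runs:

$ bin/mlpack_test -t LSHTest --log_level=test_suite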

> At the moment, an idea I have is this:
> A typical way to test how "good" approximate algorithms are is against some
> baseline, so for example I've often used the SIFT dataset (Evaluation of
> Approximate nearest neighbors: large datasets
> <http://corpus-texmex.irisa.fr/>) to calculate metrics like selectivity and
> recall.
> Of course this doesn't check whether the implementation is correct, but it
> produces results we can evaluate as reasonable or unreasonable.
> 
> So maybe an approach would be to have our LSH implementation run a few
> times with a few different parameters we've selected, and see how close
> these results get to some benchmark we know to be correct? Problems with
> this idea:
> a) That requires packing a benchmark file and the SIFT dataset along with
> the rest of the software.
> b) We need benchmarks, which we could get either by using some
> implementation we know to be good - for example, LSHKIT - or by running
> the LSH implementation now and then re-running it every time we change
> the code.
> 
> Is this what you had in mind when you mentioned probabilistic tests?

Yeah, I think approach (a) is probably the right way to do this.  I
don't think it's realistic to do an exact correctness check like the
test that Pari originally wrote, because that depends on the random seed
and the specific implementation, and we can't guarantee (and shouldn't
guarantee) that that will never change.  So maybe the better approach
here is to find a dataset, find some LSH parameters, calculate the
theoretical approximation guarantee for how correct the results should
be, and then ensure that the implementation satisfies that guarantee.

I haven't thought too much about that, but I think maybe the theoretical
guarantee would be something like the parameters to Theorem 1 in Datar's
paper:

http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
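
To recap the setup there (this is from memory, so double-check against
the paper): a hash family is called (R, cR, p_1, p_2)-sensitive if, for
any points q and v,

  d(q, v) <= R   implies   Pr[h(q) = h(v)] >= p_1, and
  d(q, v) >= cR  implies   Pr[h(q) = h(v)] <= p_2,

with p_1 > p_2; Theorem 1 then amplifies this, by concatenating hash
functions and using multiple tables, into an algorithm that solves
(R, c)-NN with constant probability.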

The weird thing is that the mlpack implementation performs actual
approximate nearest neighbor search, but their theory is for solving
(R, c)-NN, which I think (correct me if I'm wrong) is just the task of
reporting some point within distance cR of the query point, assuming a
point within distance R exists.  So with properly set c, R, p_1, and
p_2, I think maybe we can just run LSH and then check that the nearest
neighbors returned are within distance cR.
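
As a sketch of what that test could look like (the parameter values are
made up, the LSHSearch constructor arguments are from memory, and the
failure threshold should really be derived from the theorem's success
probability):

#include <boost/test/unit_test.hpp>
#include <mlpack/core.hpp>
#include <mlpack/methods/lsh/lsh_search.hpp>

using namespace mlpack::neighbor;

BOOST_AUTO_TEST_CASE(LSHGuaranteeTest)
{
  // Placeholder data; ideally a small dataset from src/mlpack/tests/data/.
  arma::mat data;
  data.randu(10, 1000);

  // R and c chosen to match whatever guarantee we compute from the
  // theory; these values are made up.
  const double R = 0.5;
  const double c = 2.0;

  LSHSearch<> lsh(data, 10 /* projections */, 10 /* tables */);

  arma::Mat<size_t> neighbors;
  arma::mat distances;
  lsh.Search(1, neighbors, distances);

  // The guarantee is probabilistic, so count violations instead of
  // requiring success on every query; the allowed failure fraction
  // here is made up.
  size_t failures = 0;
  for (size_t i = 0; i < distances.n_cols; ++i)
    if (distances(0, i) > c * R)
      ++failures;

  BOOST_REQUIRE_LT((double) failures / distances.n_cols, 0.1);
}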

We probably don't want to package a very large dataset though; maybe
there is already a dataset in src/mlpack/tests/data/ that is suitable?

https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests

For approach (b), I actually think a way to do that would be with the
automatic benchmarking system found at

https://github.com/zoq/benchmarks/

which displays some results you can look through at

http://www.mlpack.org/benchmark.html

The system can display metrics like recall, precision, etc.; try
selecting "Metric plots for an algorithm/parameter/dataset combination",
then "PERCEPTRON" for the method, "-i 1000" for the parameters, and the
"iris" dataset to see what I mean.  There is a lot more information
there too, but it can be hard to find a compelling visualization in all
of the noise.  I think the perceptron iris graph looks good. :)

Anyway, it would be really great to have this issue solved; I've never
had the time to look into it enough.

Thanks!

Ryan

-- 
Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
ryan at ratml.org |   - Kit


