[mlpack] GSoC 2016 - Project Questions

Yannis Mentekidis mentekid at gmail.com
Tue Mar 8 13:43:13 EST 2016


Hello,

Thank you for the clarifications; I managed to run specific tests after
that.

I'll look at Datar's paper and your comment, and hopefully I'll have a
draft idea for a testing procedure soon. Should I get back to you with it
to discuss it further, or should I just start coding away and see how it
goes?

I believe the dataset used for this shouldn't matter much, so we probably
won't need to include a different one. I'll toy around with the iris data
for now.

Thank you very much for the input!

Yannis

On Tue, Mar 8, 2016 at 2:41 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Mon, Mar 07, 2016 at 01:03:15PM +0000, Yannis Mentekidis wrote:
> > Hello again,
> >
> > sorry for the late reply
> >
> > > Yes, definitely other forms of LSH or improvement to the existing LSH
> > > implementation would be a good way to go.  Dong's paper is where LSHKIT
> > > came from, right?  If so it might be interesting to implement that, but
> > > only if we are pretty sure we can outperform LSHKIT (i.e. if we can make
> > > a faster implementation).
> >
> > From what I can recall, LSHKIT is quite well optimized, so it might be a
> > hard task to outperform it. I'm not sure if it has been parallelized,
> > though, so that might be a way to achieve that (if you don't consider that
> > cheating :P).
>
> Parallelization is fine; I don't think LSHKIT is parallel.  OpenMP could
> be a good way to do that, since there will probably be a lot of sections
> that could be embarrassingly parallel.  I see that LSHKIT uses floats
> internally, so if we test the two implementations against each other
> we'll probably want to use floats for mlpack's test (instead of the
> default double type that's used for most matrices).
>
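> For a sense of what I mean, the query loop could look something like the
> toy sketch below -- a brute-force scan stands in for the per-query LSH
> work, and none of the names come from the actual mlpack code:
>
>   #include <omp.h>
>   #include <armadillo>
>
>   int main()
>   {
>     // Toy data: 3-dimensional reference and query sets.
>     arma::mat referenceSet(3, 1000, arma::fill::randu);
>     arma::mat querySet(3, 100, arma::fill::randu);
>     arma::Mat<size_t> neighbors(1, querySet.n_cols);
>     arma::mat distances(1, querySet.n_cols);
>
>     // Each query is independent, so the loop parallelizes trivially; each
>     // iteration only writes its own column of the output matrices.
>     #pragma omp parallel for schedule(dynamic)
>     for (long q = 0; q < (long) querySet.n_cols; ++q)
>     {
>       // Stand-in for the per-query LSH work (hash the query, gather the
>       // candidate set, score the candidates); here just a brute-force
>       // 1-nearest-neighbor scan.
>       arma::vec d = arma::sqrt(arma::sum(arma::square(
>           referenceSet.each_col() - querySet.col(q)), 0)).t();
>       arma::uword best;
>       distances(0, q) = d.min(best);
>       neighbors(0, q) = best;
>     }
>
>     return 0;
>   }
>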
> > > One thing I found with other LSH implementations was that most of them
> > > are a little difficult to work with... a user is most likely going to
> > > want to say "I want k approximate nearest neighbors with a desired
> > > maximum of e% relative error", so they specify k and e and then the
> > > algorithm returns results with the desired error bound.  But some LSH
> > > packages do not offer this kind of interface and can be very hard to
> > > work with (E2LSH is one example).
> >
> > Yes, that is true; most have a large number of parameters that people not
> > familiar with the algorithm might not understand at first. That's why I
> > proposed implementing something similar to Dong's paper, since they
> > propose a way to identify sets of parameters that will yield results near
> > some specified recall/selectivity.
> > Of course this is probably going to be slower than the naive kNN
> > implementation, but it is reasonable to run the tuning program once for a
> > given dataset, produce favorable sets of parameters, and store these for
> > further runs of LSH.
> > An alternative would be developing some heuristic that maps neighbor error
> > rates to algorithm parameters. This might be tricky, though, since error
> > depends highly on the dataset. It would be an interesting endeavour
> > nonetheless.
>
> Do you know if anyone has tried to map neighbor error rates to algorithm
> parameters?  It would be interesting to do.
>
> I think that Dong's paper would definitely be nice to implement and add
> to mlpack, specifically because of the large number of parameters and the
> auto-tuning.
>
> > > Here are some interesting runtime comparisons to LSHKIT that I did last
> > > year (on page 9 and 10); I compared some tree-based algorithms and then
> > > also multiprobe LSH:
> > >
> > > http://ratml.org/pub/pdf/2015faster.pdf
> > >
> > > I think that probably LSH would be faster in higher dimensions, but that
> > > wasn't the region of interest for the study I linked to.  Also that
> > > paper might be a decent place to start reading about dual-tree nearest
> > > neighbor search (and the papers that paper references might be
> > > interesting as well).
> >
> > I've put this paper on my reading list, as well as "An Optimal Algorithm"
> > for Approximate Nearest Neighbor Searching in Fixed Dimensions" which you
> > linked to in another mail. I believe I will be able to read them by the
> > time I start writing my proposal.
>
> Great!  I think the Arya paper is really nice; I enjoyed reading it.
> Let me know if there's anything in the "Faster dual-tree traversal for
> nearest neighbor search" paper that I can clarify.
>
> > > You're more than welcome to submit two proposals, but often it's better
> > > to focus on just one, since we can only accept one proposal per student.
> > > :)
> > >
> > Then I will probably stick with one proposal and try to make it as good as
> > possible :)
> >
> > ** Regarding LSH Tests:
> > The "impossibly hard" label makes this look scary, admittedly :P
> > I'll look at the test and current LSH code and see what I can think of; in
> > the meantime, could you provide some help regarding where I can find more
> > documentation for the test function? I haven't figured out how to run only
> > specific tests instead of all 616 of them.
>
> Ah, yeah, it's the Boost Unit Test Framework, so their command-line
> options are documented here:
>
> http://www.boost.org/doc/libs/1_46_1/libs/test/doc/html/utf/user-guide/runtime-config/reference.html
>
> You could run just the LSHTest suite like this:
>
> $ bin/mlpack_test -t LSHTest
>
> or you could run just one single test case from the LSHTest suite like
> this:
>
> $ bin/mlpack_test -t LSHTest/LSHSearchTest
>
> > At the moment, an idea I have is this:
> > A typical way to test how "good" approximate algorithms are is against
> > some base case, so for example I've often used the SIFT dataset (Evaluation
> > of Approximate nearest neighbors: large datasets
> > <http://corpus-texmex.irisa.fr/>) to calculate metrics like selectivity
> > and recall.
> > Of course this doesn't check if the implementation is correct, but it
> > produces results we can evaluate as reasonable or unreasonable.
> >
> > So maybe an approach would be to have our LSH implementation run a few
> > times with a few different parameters we've selected, and see how close
> > these results get to some benchmark we know to be correct. Problems with
> > this idea:
> > a) That requires packing a benchmark file and the SIFT dataset along with
> > the rest of the software.
> > b) We need benchmarks, which we could get either by using some
> > implementation we know to be good - for example, LSHKIT - or by running
> > the LSH implementation now and re-running it every time we change the code.
> >
> > Is this what you had in mind when you mentioned probabilistic tests?
>
> Yeah, I think approach (a) is probably the right way to do this.  I
> don't think it's realistic to do an exact correctness check like the
> test that Pari originally wrote, because that depends on the random seed
> and the specific implementation, and we can't guarantee (and shouldn't
> guarantee) that that will never change.  So maybe the better approach
> here is to find a dataset, find some LSH parameters, calculate the
> theoretical approximation guarantee for how correct the results should
> be, and then ensure that the implementation satisfies that guarantee.
>
> I haven't thought too much about that, but I think maybe the theoretical
> guarantee would be something like the parameters to Theorem 1 in Datar's
> paper:
>
>
> http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf
>
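> (For reference, and if I remember the definitions right, Theorem 1 is
> stated in terms of an (R, cR, p_1, p_2)-sensitive hash family: if
> d(p, q) <= R then Pr[h(p) = h(q)] >= p_1, and if d(p, q) >= cR then
> Pr[h(p) = h(q)] <= p_2, with p_1 > p_2 and c > 1.)
>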
> The weird thing is that the mlpack implementation performs actual
> approximate nearest neighbor search, but their theory is for solving
> (R, c)-NN, which I think (correct me if I'm wrong) is just the task of
> reporting a point within the distance cR from the query point.  So with
> properly set c, R, p_1, and p_2, I think maybe we can just run LSH and
> then check to see that the nearest neighbors returned are within
> distance cR.
>
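> Roughly, I'm picturing a test along the lines of the sketch below.  I'm
> writing this from memory, so the LSHSearch constructor arguments and the
> parameter values are only placeholders:
>
>   #include <iostream>
>   #include <mlpack/core.hpp>
>   #include <mlpack/methods/lsh/lsh_search.hpp>
>
>   using namespace mlpack::neighbor;
>
>   int main()
>   {
>     // A small dataset; the real test would load one from
>     // src/mlpack/tests/data/.
>     arma::mat dataset(10, 1000, arma::fill::randu);
>
>     // Placeholder values; the real test would derive R and c from
>     // Theorem 1 and the chosen hash width.
>     const double R = 0.5;
>     const double c = 2.0;
>
>     // Constructor arguments from memory: projections per table, number of
>     // tables, hash width.
>     LSHSearch<> lsh(dataset, 10, 30, 4.0);
>
>     arma::Mat<size_t> neighbors;
>     arma::mat distances;
>     lsh.Search(1, neighbors, distances);
>
>     // Count neighbors that violate the c * R bound; the test would then
>     // assert that this happens with at most the allowed probability.
>     size_t failures = 0;
>     for (size_t i = 0; i < distances.n_cols; ++i)
>       if (distances(0, i) > c * R)
>         ++failures;
>
>     std::cout << failures << " of " << distances.n_cols
>         << " queries exceeded distance c * R." << std::endl;
>   }
>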
> We probably don't want to package a very large dataset though; maybe
> there is already a dataset in src/mlpack/tests/data/ that is suitable?
>
> https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests
>
> For approach (b), I think actually a way to do that would be with the
> automatic benchmarking system found at
>
> https://github.com/zoq/benchmarks/
>
> which displays some results you can look through at
>
> http://www.mlpack.org/benchmark.html
>
> The system can display metrics like recall, precision, etc.; try
> selecting "Metric plots for an algorithm/parameter/dataset combination",
> then "PERCEPTRON" for the method, "-i 1000" for the parameters, and the
> "iris" dataset to see what I mean.  There is a lot more information
> there too, but it can be hard to find a compelling visualization in all
> of the noise.  I think the perceptron iris graph looks good. :)
>
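> As an aside, recall against the exact results is easy to compute once you
> have both sets of neighbors; a quick sketch of a helper, nothing
> mlpack-specific:
>
>   #include <armadillo>
>
>   // Recall: the fraction of true nearest neighbors that the approximate
>   // search also returned.  Both matrices are k x numQueries, one column
>   // of neighbor indices per query point.
>   double Recall(const arma::Mat<size_t>& approx,
>                 const arma::Mat<size_t>& exact)
>   {
>     size_t hits = 0;
>     for (size_t q = 0; q < exact.n_cols; ++q)
>       for (size_t j = 0; j < exact.n_rows; ++j)
>         if (arma::any(approx.col(q) == exact(j, q)))
>           ++hits;
>
>     return (double) hits / exact.n_elem;
>   }
>
> The benchmarking system already reports this kind of metric for you,
> though.
>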
> Anyway, it would be really great to have this issue solved; I've never
> had the time to look into it enough.
>
> Thanks!
>
> Ryan
>
> --
> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
> ryan at ratml.org |   - Kit
>