[mlpack] GSoC MLPack - Approximate Nearest Neighbor Search

Ryan Curtin ryan at ratml.org
Mon Mar 21 10:33:28 EDT 2016


On Mon, Mar 21, 2016 at 08:21:27AM -0300, Marcos Pividori wrote:
> Dear all,
> 
> I am Marcos Pividori, a Computer Science student from the National
> University of Rosario, Argentina ([1]). I am currently finishing my degree
> (a five-year program).
> I am writing because I am really interested in working on mlpack this GSoC.

Hi Marcos,

Thanks for getting in touch.

> I have been reading the papers recommended for this topic. I think I have a
> general idea of the main methods for the Approximate Nearest Neighbor
> problem.
> Also, I have installed the mlpack library and looked at the main code
> implementation.
> So, I would split the project into these main goals:
>  + Implementation of approximate nearest neighbor search.
>  + Implementation of a command line tool (simple task).
>  + Comparisons between mlpack implementation and similar libraries, and
> writing of proper documentation.
> 
> About the implementation of ANN: do you have a preference on which method to
> use? Should we implement many of these different methods and compare their
> results?
> I have found many different options in the literature:
>   + "An optimal algorithm for ANN Searching in Fixed dimensions" proposes
> to use BBD-Trees for the computation, and it is implemented in the ANN
> library. But I couldn't find more information on BBD trees. ANN library
> also uses KD-Trees.
>   + "An Investigation of Practical Approximate Nearest Neighbor Algorithms"
> proposes to use Spill-Trees (maybe Hybrid) with overlapping in points
> separation.
>   + "Fast approximate nearest neighbors with automatic configuration"
> reviews and compares the state of the art in ann. It mentioned the
> randomized kd-trees and k-means clustering trees as the most promising
> methods according to their results. FLANN library implements many of these
> methods.
>   + scikit-learn library implements Locality-Sensitive Hashing for ann.

mlpack has an LSH implementation already (see src/mlpack/methods/lsh/).
The easiest (and best, in my opinion) route to a good approximate
nearest neighbor search algorithm is to extend the existing dual-tree
algorithm from exact to approximate search.
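
(For reference, driving the existing LSH code from C++ looks roughly like
the short sketch below.  The constructor parameters and exact signatures
here are from memory and should be treated as assumptions; check
src/mlpack/methods/lsh/ for the real API.)

#include <mlpack/core.hpp>
#include <mlpack/methods/lsh/lsh_search.hpp>

using namespace mlpack;

int main()
{
  // Load reference and query points (one point per column).
  arma::mat referenceData, queryData;
  data::Load("reference.csv", referenceData, true);
  data::Load("query.csv", queryData, true);

  // Build the hash tables; the projection and table counts below are
  // arbitrary illustration values, not recommended settings.
  neighbor::LSHSearch<> lsh(referenceData, 10, 10);

  // Approximate 5-nearest-neighbor search for each query point.
  arma::Mat<size_t> neighbors;
  arma::mat distances;
  lsh.Search(queryData, 5, neighbors, distances);
}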

If you read the paper "Tree-independent dual-tree algorithms" you will
see that the pruning rule for dual-tree nearest neighbor search is of
the form

"prune if d_min(N_q, N_r) > B(N_q)"

but this can easily be made approximate by relaxing it to something like
"prune if d_min(N_q, N_r) > B(N_q) / (1 + epsilon)", so that more node
combinations are pruned and any neighbor that gets missed is at most a
factor of (1 + epsilon) farther away than the true nearest neighbor.
This would be a pretty simple change (or generalization) of the
algorithm that's already implemented.
But I don't think that's enough work for the whole summer, so one of the
papers you referenced might be worth implementing.
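
As a rough illustration, a relaxed pruning rule could look like the
sketch below.  This is not mlpack's actual NeighborSearchRules code; the
function and variable names are made up for the example.

#include <cfloat>

// minDistance is d_min(N_q, N_r), bound is B(N_q), and epsilon is the
// allowed relative error of the returned neighbors.
double ScoreApproximate(const double minDistance,
                        const double bound,
                        const double epsilon)
{
  // Exact search prunes when minDistance > bound.  Shrinking the bound by
  // a factor of (1 + epsilon) prunes more node combinations, so any
  // neighbor that gets missed is at most (1 + epsilon) times farther away
  // than the true nearest neighbor.
  const double relaxedBound = bound / (1.0 + epsilon);

  if (minDistance > relaxedBound)
    return DBL_MAX;  // Prune: don't recurse into this node combination.

  return minDistance;  // Otherwise recurse; lower scores are visited first.
}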

I might consider implementing spill trees and then the defeatist search
strategy that the spill tree paper uses with them.
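
To make the idea concrete, here is a rough single-tree sketch of
defeatist search.  The TreeType interface assumed below (IsLeaf(),
Left(), Right(), MinDistance(), and the point accessors) is hypothetical
and not mlpack's actual tree API.

#include <cstddef>

// Defeatist search: always descend into the child closer to the query and
// never backtrack.  With spill trees, the overlap between sibling nodes is
// what keeps the chance of missing the true nearest neighbor low despite
// the lack of backtracking.
template<typename TreeType, typename VecType>
void DefeatistSearch(const TreeType& node,
                     const VecType& query,
                     std::size_t& bestIndex,
                     double& bestDistance)
{
  if (node.IsLeaf())
  {
    // Base case: linear scan over the points held in this leaf.
    for (std::size_t i = 0; i < node.NumPoints(); ++i)
    {
      const double d = node.PointDistance(i, query);  // hypothetical accessor
      if (d < bestDistance)
      {
        bestDistance = d;
        bestIndex = node.PointIndex(i);  // hypothetical accessor
      }
    }
    return;
  }

  // Greedy step: recurse into whichever child is closer to the query; the
  // farther child is never visited, which is what makes the result
  // approximate.
  const bool goLeft =
      (node.Left()->MinDistance(query) <= node.Right()->MinDistance(query));
  DefeatistSearch(goLeft ? *node.Left() : *node.Right(),
                  query, bestIndex, bestDistance);
}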

Then you could spend the rest of the time comparing and benchmarking
implementations.

I hope this is helpful.

Thanks,

Ryan

-- 
Ryan Curtin    | "He takes such wonderful pictures with
ryan at ratml.org | his paws."  - Radio man


