mlpack blog

Approximate Nearest Neighbor Search - Summary

Marcos Pividori, 23 August 2016

I have been working this summer on the GSoC Project: "Approximate Nearest Neighbor Search" (Project site: [1]).

The main contribution can be summarized in these pull requests:

[pull/768] (Api details)
[pull/766] (Fix furthest neighbor sort policy)
[pull/765] (Api improvement - move semantics)
[pull/764] (Fix tests)
[pull/762] (Greedy Traverser)
[pull/747] (Spill trees Implementation)
[pull/732] (Heaps for mlpack)
[pull/731] (Fix memory leak)
[pull/727] (Fix TreeTrait for BallTree)
[pull/693] (Modify NSModel to use boost variant)
[pull/684] (Approx KNN for Dual tree algorithms)
[pull/682] (Properly resetting auxBound)
[pull/668] (Add B_aux according to issue #642)
[pull/655] (Fix erroneous ball tree definition)
[pull/646] (Remove duplicated code for traversal info)
[pull/645] (Properly use Enum type)
[pull/637] (Fix a mistake in metric policy)
[pull/19] (Improve programs for ANN and FLANN)
[pull/17] (Approx KNN for the benchmarking system)
[pull/14] (Add Num of BaseCases as a metric)

(List of commits: [link])

Approximate nearest neighbor search:

I modified mlpack's code to include an epsilon value, which represents the maximum relative error. It is considered by the prune rules when deciding if a given node combination should be analyzed, (as suggested at the end of the paper [2]), with a general implementation that works for both KFN and KNN.

The command line tools: mlpack_knn and mlpack_kfn, were updated to include an extra option "-e", to specify the maximum relative error (default value: 0).

The main code developed was included in the [pull/684].

Take into into account that epsilon represents the maximum relative error. So, the average relative error (effective error) will be much smaller than the epsilon value provided.

As expected, higher values of the epsilon parameter implies that more nodes are pruned and, therefore, we have a faster algorithm, as can be seen in the next graphic for the dataset *isolet:

[See graphic]

Spill Trees:

I have implemented Spill Trees, as defined in: "An Investigation of Practial Approximate Nearest Neighbor Algorithms" [3] ([pull/747]). It is a variant of binary space trees in which the children of a node can "spill over" each other, and contain shared datapoints.

One problem with Spill Trees is that their depth varies considerably depending on the overlapping size tau.

For that reason, I have implemented Hybrid Spill Trees [3], which provide better guarantee in the logarithmic depth of the tree.

The extension was incorporated to existing mlpack_knn. You can specify "-t spill" to consider spill trees, and the command line paramenters "--tau" to set different values for the overlapping size (default value is tau=0), and "--rho" to set different values for the balance threshold (default value is rho=0.7).

Spill Trees's decision boundary:

We have considered many different approaches for the implementation of Spill Trees, see discussions in: [issues/728] and [spilltrees]. Finally, we decided to implement a similar approach to the one mentioned in Ting Liu's paper.

Splitting Hyperplanes

Actual implementation of Spill Trees can be configured to choose between different kind of splitting Hyperplanes:

+) Providing the template parameter HyperplaneType between two candidate classes: AxisOrthogonalHyperplane and Hyperplane (not necessarily Axis-Orthogonal). (In both cases the hyperplane with the widest range of projection values will be chosen).

+) Providing the template parameteter SplitType, between two candidate classes: MidpointSpaceSplit, MeanSpaceSplit, which determines the splitting value to be considered.

By default, mlpack_knn considers Axis-Orthogonal Hyperplanes and Midpoint Splits, because it is the most efficient option (we can project any vector in O(1) time).

Defeatist Traversers

I have implemented Hybrid SP-Tree Search as defined in [3]. We can control the hybrid by varying tau. If tau is zero, we have a pure spill tree with defeatist search, very efficient but not accurate enough. If tau is a very large number, then every node is a non-overlapping node and we get back to the traditional metric tree, with prunning rules, perfectly accurate but not very efficient. By setting different values for tau, we have a trade-off between efficiency and accuracy.

Also, I implemented a Defeatist Dual Tree Traverser, where the query tree is built without overlapping.

The DefeatistDualTreeTraverser is faster than the DefeatistSingleTreeTraverser, specially when the value of tau increases.

Some results can be seen in the next graphic for the dataset 1000000-10-randu.

[See graphic]

General Greedy Search Traverser:

Also, I implemented a general greedy single tree traverser to perform greedy search, that always chooses the child with the best score when traversing the tree, and doesn't consider backtracking: GreedySingleTreeTraverser.

It is a tree independent implementation (based on the general TreeType API). As expected, the greedy traverser performs considerably faster than other approachs, at the cost of some relative error in the results. (PR: [pull/762]). Further disccusion in: [issues/761].

We can simply reduce the relative error by increasing the leaf size of the tree, as is shown in the next graphic for the dataset isolet.

[See graphic]

Other improvements

Also, I have been improving some parts of the existing code:

Bound Issues:

I found some issues in the definition of the B2 bound. We were discussing about it in [issues/642] and, after thinking about the different special cases, we found some examples where actual definition could fail. I fixed existing code to consider slighty different bounds.

Improve NSModel implementation:

I have been improving existing code for NSModel, as was suggested in [issues/674], using boost variant to manage different options for tree types. (PR: [pull/693]).

Heaps for the list of candidates:

I realized in many parts of the code, we were keeping track of the best k candidates visited through a sorted list. Instead of maintaining a sorted list, a better approach was to use a priority queue. I implemented this in [pull/732].

Benchmarking system:

I have been updating the benchmarking system to include approximate neighbor search not only with mlpack, but also with other libraries like [ANN] and [FLANN] (PR: [pull/14] and [pull/19] ).

Also, I created a new view to plot the progress of a specific metric for different values of a method parameter. For example, for knn, it is possible to analyze the number of base cases and runtime for different values of approximation error (epsilon), with different libraries/configurations. (PR: [pull/17]).

Acknowledgement:

I want to thank the mlpack community for giving me the opportunity to work with them this summer, it has been a fascinating experience! They were very responsive, I always found someone to talk in the IRC channel, willing to offer their time to discuss ideas or help me with my project! :)

For further information see my previous [blog posts].