## Approximate Nearest Neighbor Search - Conclusion

I have been working this summer on the GSoC Project: "Approximate Nearest Neighbor Search" (Project site: 1).

The main contribution can be summarized in these pull requests:

- pull/768 (Api details)
- pull/766 (Fix furthest neighbor sort policy)
- pull/765 (Api improvement - move semantics)
- pull/764 (Fix tests)
- pull/762 (Greedy Traverser)
- pull/747 (Spill trees Implementation)
- pull/732 (Heaps for mlpack)
- pull/731 (Fix memory leak)
- pull/727 (Fix TreeTrait for BallTree)
- pull/693 (Modify NSModel to use boost variant)
- pull/684 (Approx KNN for Dual tree algorithms)
- pull/682 (Properly resetting auxBound)
- pull/668 (Add B_aux according to issue #642)
- pull/655 (Fix erroneous ball tree definition)
- pull/646 (Remove duplicated code for traversal info)
- pull/645 (Properly use Enum type)
- pull/637 (Fix a mistake in metric policy)
- pull/19 (Improve programs for ANN and FLANN)
- pull/17 (Approx KNN for the benchmarking system)
- pull/14 (Add Num of BaseCases as a metric)

(List of commits: [link])

## Approximate nearest neighbor search:

I modified mlpack's code to include an *epsilon* value, which represents the
maximum relative error. It is considered by the prune rules when deciding if a
given node combination should be analyzed, (as suggested at the end of the paper
2), with a general implementation that works for both *KFN* and *KNN*.

The command line tools: *mlpack_knn* and *mlpack_kfn*, were updated to include
an extra option *"-e"*, to specify the maximum relative error
(default value: 0).

The main code developed was included in the pull/684.

Take into into account that epsilon represents the maximum relative error. So, the average relative error (effective error) will be much smaller than the epsilon value provided.

As expected, higher values of the epsilon parameter implies that more nodes are
pruned and, therefore, we have a faster algorithm, as can be seen in the next graphic for the dataset *isolet*:

## Spill Trees:

I have implemented Spill Trees, as defined in: "An Investigation of Practial Approximate Nearest Neighbor Algorithms" 3 (pull/747). It is a variant of binary space trees in which the children of a node can "spill over" each other, and contain shared datapoints.

One problem with Spill Trees is that their depth varies considerably depending
on the overlapping size *tau*.

For that reason, I have implemented Hybrid Spill Trees 3, which provide better guarantee in the logarithmic depth of the tree.

The extension was incorporated to existing *mlpack_knn*.
You can specify *"-t spill"* to consider spill trees, and the command line
paramenters *"--tau"* to set different values for the overlapping size (default
value is tau=0), and *"--rho"* to set different values for the balance threshold
(default value is rho=0.7).

### Spill Trees's decision boundary:

We have considered many different approaches for the implementation of Spill Trees, see discussions in: issues/728 and [spilltrees]. Finally, we decided to implement a similar approach to the one mentioned in Ting Liu's paper.

### Splitting Hyperplanes

Actual implementation of Spill Trees can be configured to choose between different kind of splitting Hyperplanes:

+) Providing the template parameter `HyperplaneType`

between two candidate classes:
`AxisOrthogonalHyperplane`

and `Hyperplane`

(not necessarily Axis-Orthogonal).
(In both cases the hyperplane with the widest range of projection values will be
chosen).

+) Providing the template parameteter `SplitType`

, between two candidate classes:
`MidpointSpaceSplit`

, `MeanSpaceSplit`

, which determines the splitting value
to be considered.

By default, *mlpack_knn* considers *Axis-Orthogonal Hyperplanes* and *Midpoint
Splits*, because it is the most efficient option (we can project any vector in
O(1) time).

### Defeatist Traversers

I have implemented *Hybrid SP-Tree Search* as defined in 3.
We can control the hybrid by varying *tau*. If *tau* is zero, we have a pure
spill tree with defeatist search, very efficient but not accurate enough.
If *tau* is a very large number, then every node is a non-overlapping node and
we get back to the traditional metric tree, with prunning rules, perfectly
accurate but not very efficient.
By setting different values for *tau*, we have a trade-off between efficiency
and accuracy.

Also, I implemented a Defeatist Dual Tree Traverser, where the *query tree* is
built without overlapping.

The `DefeatistDualTreeTraverser`

is faster than the
`DefeatistSingleTreeTraverser`

, specially when the value of *tau* increases.

Some results can be seen in the next graphic for the dataset *1000000-10-randu*.

## General Greedy Search Traverser:

Also, I implemented a *general greedy single tree traverser* to perform greedy
search, that always chooses the child with the best score when traversing the
tree, and doesn't consider backtracking: `GreedySingleTreeTraverser`

.

It is a tree independent implementation (based on the general TreeType API). As expected, the greedy traverser performs considerably faster than other approachs, at the cost of some relative error in the results. (PR: pull/762). Further disccusion in: issues/761.

We can simply reduce the relative error by increasing the leaf size of the tree,
as is shown in the next graphic for the dataset *isolet*.

## Other improvements

Also, I have been improving some parts of the existing code:

### Bound Issues:

I found some issues in the definition of the B2 bound. We were discussing about it in issues/642 and, after thinking about the different special cases, we found some examples where actual definition could fail. I fixed existing code to consider slighty different bounds.

### Improve NSModel implementation:

I have been improving existing code for `NSModel`

, as was suggested
in issues/674, using boost variant to manage different
options for tree types. (PR: pull/693).

### Heaps for the list of candidates:

I realized in many parts of the code, we were keeping track of the best k candidates visited through a sorted list. Instead of maintaining a sorted list, a better approach was to use a priority queue. I implemented this in pull/732.

### Benchmarking system:

I have been updating the benchmarking system to include approximate neighbor search not only with mlpack, but also with other libraries like ANN and FLANN (PR: pull/14 and pull/19 ).

Also, I created a new view to plot the progress of a specific metric for different values of a method parameter. For example, for knn, it is possible to analyze the number of base cases and runtime for different values of approximation error (epsilon), with different libraries/configurations. (PR: pull/17).

## Acknowledgement:

I want to thank the mlpack community for giving me the opportunity to work with them this summer, it has been a fascinating experience! They were very responsive, I always found someone to talk in the IRC channel, willing to offer their time to discuss ideas or help me with my project! :)

For further information see my previous [blog posts].

more...