I have been working this summer on the GSoC Project:
"Approximate Nearest Neighbor Search" (Project site: ).
The main contribution can be summarized in these pull requests:
- [pull/768] (Api details)
- [pull/766] (Fix furthest neighbor sort policy)
- [pull/765] (Api improvement - move semantics)
- [pull/764] (Fix tests)
- [pull/762] (Greedy Traverser)
- [pull/747] (Spill trees Implementation)
- [pull/732] (Heaps for mlpack)
- [pull/731] (Fix memory leak)
- [pull/727] (Fix TreeTrait for BallTree)
- [pull/693] (Modify NSModel to use boost variant)
- [pull/684] (Approx KNN for Dual tree algorithms)
- [pull/682] (Properly resetting auxBound)
- [pull/668] (Add B_aux according to issue #642)
- [pull/655] (Fix erroneous ball tree definition)
- [pull/646] (Remove duplicated code for traversal info)
- [pull/645] (Properly use Enum type)
- [pull/637] (Fix a mistake in metric policy)
- [pull/19] (Improve programs for ANN and FLANN)
- [pull/17] (Approx KNN for the benchmarking system)
- [pull/14] (Add Num of BaseCases as a metric)
(List of commits: [link])
Approximate nearest neighbor search:
I have modified mlpack's code to include an epsilon value, which represents the maximum relative error. It is considered by the prune rules when deciding if a given node combination should be analyzed (as suggested at the end of the paper), with a general implementation that works for both nearest and furthest neighbor search.
The command line tools mlpack_knn and mlpack_kfn were updated to include an extra option "-e", to specify the maximum relative error (default value: 0).
The main code developed was included in [pull/684].
Take into account that epsilon represents the maximum relative error, so the average relative error (effective error) will be much smaller than the epsilon value provided.
As expected, higher values of the epsilon parameter imply that more nodes are pruned and, therefore, the algorithm is faster, as can be seen in the next graphic for the dataset
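As a rough illustration of how such a tolerance loosens the prune rule, here is a minimal, library-independent sketch (the function name and the toy numbers are mine, not mlpack's actual code):

```python
def can_prune(node_min_dist, best_dist, epsilon):
    """Prune a node if, even after shrinking the current best distance by
    the allowed relative error (1 + epsilon), the node cannot improve it.
    With epsilon = 0 this reduces to the exact prune rule."""
    return node_min_dist > best_dist / (1 + epsilon)

# With no tolerance, a node at distance 1.05 is NOT pruned when the
# current best neighbor is at distance 1.1:
assert can_prune(1.05, 1.1, 0.0) == False

# With epsilon = 0.1 the same node IS pruned: any neighbor inside it
# could improve the result by at most a factor of (1 + 0.1).
assert can_prune(1.05, 1.1, 0.1) == True
```

Since many node combinations are pruned this way that the exact rule would visit, the tree traversal does less work, at the cost of a bounded relative error in the reported distances.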
I have implemented Spill Trees, as defined in "An Investigation of Practical Approximate Nearest Neighbor Algorithms" ([pull/747]). It is a variant of binary space trees in which the children of a node can "spill over" onto each other and contain shared datapoints.
One problem with Spill Trees is that their depth varies considerably depending on the overlapping size tau.
For that reason, I have implemented Hybrid Spill Trees, which provide a better guarantee of logarithmic tree depth.
The extension was incorporated into the existing mlpack_knn. You can specify "-t spill" to use spill trees, and the command line parameters "--tau" to set different values for the overlapping size (default value: tau=0), and "--rho" to set different values for the balance threshold (default value: rho=0.7).
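To make the roles of tau and rho concrete, here is a simplified, one-dimensional sketch of a hybrid spill split (the function name, the midpoint split, and the fallback policy are illustrative assumptions, not mlpack's exact implementation):

```python
def hybrid_spill_split(values, tau, rho):
    """Split 1-d projection values at the midpoint of their range.
    Points within tau of the split go to BOTH children (the "spill").
    If either overlapping child would hold more than a rho fraction of
    the points, fall back to a non-overlapping split instead -- this is
    what makes the tree a *hybrid* spill tree with bounded depth."""
    split = (min(values) + max(values)) / 2.0
    left = [v for v in values if v <= split + tau]
    right = [v for v in values if v >= split - tau]
    n = len(values)
    if len(left) > rho * n or len(right) > rho * n:
        # Too much overlap: use a standard non-overlapping split.
        left = [v for v in values if v <= split]
        right = [v for v in values if v > split]
        return left, right, False  # overlapping = False
    return left, right, True       # overlapping = True

vals = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
l, r, overlapping = hybrid_spill_split(vals, tau=0.5, rho=0.7)
# The point 3.0 lies within tau of the split, so both children share it.
```

With tau large enough that one child would absorb most of the dataset, the rho threshold triggers and the node is built without overlap, which is why hybrid spill trees keep logarithmic depth.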
Spill Trees' decision boundary:
We considered many different approaches for the implementation of Spill Trees; see the discussions in [issues/728] and [spilltrees]. Finally, we decided to implement an approach similar to the one mentioned in Ting Liu's paper.
The actual implementation of Spill Trees can be configured to choose between different kinds of splitting hyperplanes:
+) Providing the template parameter HyperplaneType, between two candidate classes: AxisOrthogonalHyperplane, or Hyperplane (not necessarily axis-orthogonal). (In both cases, the hyperplane with the widest range of projection values will be chosen.)
+) Providing the template parameter SplitType, between two candidate classes: MidpointSpaceSplit, or MeanSpaceSplit, which determines the splitting value to be considered.
By default, mlpack considers Axis-Orthogonal Hyperplanes and Midpoint Splits, because this is the most efficient option (we can project any vector in O(1) time).
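To make the two splitting values concrete, here is a tiny sketch of what a midpoint split and a mean split compute over the projection values, plus the widest-range rule used to pick the hyperplane (illustrative only; the real MidpointSpaceSplit and MeanSpaceSplit classes operate on the projected dataset in the same spirit):

```python
def midpoint_split_value(projections):
    # Midpoint-style value: center of the range of projection values.
    return (min(projections) + max(projections)) / 2.0

def mean_split_value(projections):
    # Mean-style value: average of the projection values.
    return sum(projections) / len(projections)

def widest_range(candidates):
    # Among candidate projections (one list per candidate hyperplane),
    # pick the one whose values have the widest range.
    return max(candidates, key=lambda p: max(p) - min(p))

proj = [0.0, 1.0, 2.0, 9.0]
assert midpoint_split_value(proj) == 4.5    # (0 + 9) / 2
assert mean_split_value(proj) == 3.0        # 12 / 4
assert widest_range([[0.0, 2.0], proj]) == proj  # range 9 beats range 2
```

Note how an outlier (9.0) pulls the midpoint far more than the mean; that is the practical difference between the two SplitType choices.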
I have implemented Hybrid SP-Tree Search, as defined in the paper mentioned above. We can control the hybrid by varying the overlapping size tau. If tau is zero, we have a pure spill tree with defeatist search: very efficient, but not accurate enough. If tau is a very large number, then every node is a non-overlapping node and we get back to the traditional metric tree, with pruning rules: perfectly accurate, but not very efficient. By setting different values for tau, we have a trade-off between efficiency and accuracy.
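The hybrid search logic can be sketched on a toy one-dimensional tree as follows (class and function names are illustrative, not mlpack's traverser): overlapping nodes are searched defeatist-style (closer child only, no backtracking), while non-overlapping nodes keep the usual prune-and-backtrack behavior.

```python
class Node:
    def __init__(self, points, overlapping=False, children=None):
        self.points = points            # 1-d points covered by this node
        self.overlapping = overlapping  # spill node or not
        self.children = children or []

    def min_dist(self, q):
        # Lower bound on the distance from q to any point in the node.
        lo, hi = min(self.points), max(self.points)
        return 0.0 if lo <= q <= hi else min(abs(q - lo), abs(q - hi))

def hybrid_search(node, q, best=float("inf")):
    """Return the (possibly approximate) nearest distance from q."""
    if not node.children:
        return min([best] + [abs(q - p) for p in node.points])
    near, far = sorted(node.children, key=lambda c: c.min_dist(q))
    best = hybrid_search(near, q, best)
    if not node.overlapping and far.min_dist(q) < best:
        best = hybrid_search(far, q, best)  # backtracking with pruning
    return best

# A single overlapping (spill) node whose children share the point 3.0:
left = Node([0.0, 1.0, 2.0, 3.0])
right = Node([3.0, 4.0, 5.0, 6.0])
root = Node([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
            overlapping=True, children=[left, right])
```

Because the children spill over each other, the defeatist descent into the closer child still finds the true nearest neighbor for queries near the boundary, which is exactly what the overlap buys.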
Also, I implemented a Defeatist Dual Tree Traverser, where the
query tree is built without overlapping.
The DefeatistDualTreeTraverser is faster than the DefeatistSingleTreeTraverser, especially when the value of
Some results can be seen in the next graphic for the dataset
General Greedy Search Traverser:
Also, I implemented a general greedy single tree traverser to perform greedy search, which always chooses the child with the best score when traversing the tree, and doesn't consider backtracking.
It is a tree-independent implementation (based on the general TreeType API). As expected, the greedy traverser performs considerably faster than other approaches, at the cost of some relative error in the results. (PR: [pull/762]). Further discussion in: [issues/761].
We can reduce the relative error simply by increasing the leaf size of the tree, as is shown in the next graphic for the dataset
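A stripped-down, one-dimensional sketch of such a greedy descent (the class, the toy tree, and the leaf layout are illustrative, not the actual tree-independent implementation):

```python
class Node:
    def __init__(self, points, children=()):
        self.points = points      # 1-d points covered by this node
        self.children = children

    def min_dist(self, q):
        # Lower bound (the "score") used to rank children.
        lo, hi = min(self.points), max(self.points)
        return 0.0 if lo <= q <= hi else min(abs(q - lo), abs(q - hi))

def greedy_search(node, q):
    """Always descend into the child with the best (lowest) score; never
    backtrack.  Only one root-to-leaf path is visited, so the result may
    be approximate -- larger leaves mean more base cases per leaf and
    therefore a smaller relative error."""
    while node.children:
        node = min(node.children, key=lambda c: c.min_dist(q))
    return min(abs(q - p) for p in node.points)

leaves = [Node([0.0, 1.0]), Node([2.0, 3.0]), Node([4.0, 5.0])]
root = Node([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], children=leaves)
assert greedy_search(root, 2.2) == abs(2.2 - 2.0)
```

The cost of a query is just the depth of the tree plus one leaf scan, which is why this traverser is so much faster than approaches that consider backtracking.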
Also, I have been improving some parts of the existing code:
I found some issues in the definition of the B2 bound. We discussed it in [issues/642] and, after thinking about the different special cases, we found some examples where the existing definition could fail. I fixed the existing code to consider slightly different bounds.
Improve NSModel implementation:
I modified the NSModel class to use boost::variant ([pull/693]).
Heaps for the list of candidates:
I realized that in many parts of the code we were keeping track of the best k candidates visited through a sorted list. Instead of maintaining a sorted list, a better approach is to use a priority queue. I implemented this in [pull/732].
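A minimal sketch of this idea using Python's heapq (illustrative only; the actual change is in mlpack's C++ code): a max-heap of size k replaces the sorted list, so each candidate update costs O(log k) instead of O(k).

```python
import heapq

def k_nearest(distances, k):
    """Keep the k smallest distances seen so far with a max-heap of size
    k (negated values, since heapq is a min-heap)."""
    heap = []
    for d in distances:
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:              # better than the current worst
            heapq.heapreplace(heap, -d)
    return sorted(-x for x in heap)

assert k_nearest([5.0, 1.0, 4.0, 2.0, 3.0], k=3) == [1.0, 2.0, 3.0]
```

The worst candidate kept so far is always at the heap root, which is exactly what the prune rules need to query.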
Also, I created a new view to plot the progress of a specific metric for different values of a method parameter. For example, for knn, it is possible to analyze the number of base cases and runtime for different values of approximation error (epsilon), with different libraries/configurations. (PR: [pull/17]).
I want to thank the
mlpack community for giving me the opportunity to work with them this summer; it has been a fascinating experience! They were very responsive, and I always found someone to talk to in the IRC channel, willing to offer their time to discuss ideas or help me with my project! :)
For further information see my previous [blog posts].