I have been working this summer on the GSoC Project: "Approximate Nearest Neighbor Search" (Project site: 1).

The main contribution can be summarized in these pull requests:

Approximate nearest neighbor search:

I modified mlpack's code to include an epsilon value, which represents the maximum relative error. It is considered by the prune rules when deciding if a given node combination should be analyzed, (as suggested at the end of the paper 2), with a general implementation that works for both KFN and KNN.

The command line tools: mlpack_knn and mlpack_kfn, were updated to include an extra option "-e", to specify the maximum relative error (default value: 0).

The main code developed was included in the pull/684.

Take into into account that epsilon represents the maximum relative error. So, the average relative error (effective error) will be much smaller than the epsilon value provided.

As expected, higher values of the epsilon parameter implies that more nodes are pruned and, therefore, we have a faster algorithm, as can be seen in the next graphic for the dataset isolet:

[See graphic]

Spill Trees:

I have implemented Spill Trees, as defined in: "An Investigation of Practial Approximate Nearest Neighbor Algorithms" 3 (pull/747). It is a variant of binary space trees in which the children of a node can "spill over" each other, and contain shared datapoints.

One problem with Spill Trees is that their depth varies considerably depending on the overlapping size tau.

For that reason, I have implemented Hybrid Spill Trees 3, which provide better guarantee in the logarithmic depth of the tree.

The extension was incorporated to existing mlpack_knn. You can specify "-t spill" to consider spill trees, and the command line paramenters "--tau" to set different values for the overlapping size (default value is tau=0), and "--rho" to set different values for the balance threshold (default value is rho=0.7).

Spill Trees's decision boundary:

We have considered many different approaches for the implementation of Spill Trees, see discussions in: issues/728 and [spilltrees]. Finally, we decided to implement a similar approach to the one mentioned in Ting Liu's paper.

Splitting Hyperplanes

Actual implementation of Spill Trees can be configured to choose between different kind of splitting Hyperplanes:

+) Providing the template parameter HyperplaneType between two candidate classes: AxisOrthogonalHyperplane and Hyperplane (not necessarily Axis-Orthogonal). (In both cases the hyperplane with the widest range of projection values will be chosen).

+) Providing the template parameteter SplitType, between two candidate classes: MidpointSpaceSplit, MeanSpaceSplit, which determines the splitting value to be considered.

By default, mlpack_knn considers Axis-Orthogonal Hyperplanes and Midpoint Splits, because it is the most efficient option (we can project any vector in O(1) time).

Defeatist Traversers

I have implemented Hybrid SP-Tree Search as defined in 3. We can control the hybrid by varying tau. If tau is zero, we have a pure spill tree with defeatist search, very efficient but not accurate enough. If tau is a very large number, then every node is a non-overlapping node and we get back to the traditional metric tree, with prunning rules, perfectly accurate but not very efficient. By setting different values for tau, we have a trade-off between efficiency and accuracy.

Also, I implemented a Defeatist Dual Tree Traverser, where the query tree is built without overlapping.

The DefeatistDualTreeTraverser is faster than the DefeatistSingleTreeTraverser, specially when the value of tau increases.

Some results can be seen in the next graphic for the dataset 1000000-10-randu.

[See graphic]

General Greedy Search Traverser:

Also, I implemented a general greedy single tree traverser to perform greedy search, that always chooses the child with the best score when traversing the tree, and doesn't consider backtracking: GreedySingleTreeTraverser.

It is a tree independent implementation (based on the general TreeType API). As expected, the greedy traverser performs considerably faster than other approachs, at the cost of some relative error in the results. (PR: pull/762). Further disccusion in: issues/761.

We can simply reduce the relative error by increasing the leaf size of the tree, as is shown in the next graphic for the dataset isolet.

[See graphic]

Other improvements

Also, I have been improving some parts of the existing code:

Bound Issues:

I found some issues in the definition of the B2 bound. We were discussing about it in issues/642 and, after thinking about the different special cases, we found some examples where actual definition could fail. I fixed existing code to consider slighty different bounds.

Improve NSModel implementation:

I have been improving existing code for NSModel, as was suggested in issues/674, using boost variant to manage different options for tree types. (PR: pull/693).

Heaps for the list of candidates:

I realized in many parts of the code, we were keeping track of the best k candidates visited through a sorted list. Instead of maintaining a sorted list, a better approach was to use a priority queue. I implemented this in pull/732.

Benchmarking system:

I have been updating the benchmarking system to include approximate neighbor search not only with mlpack, but also with other libraries like ANN and FLANN (PR: pull/14 and pull/19 ).

Also, I created a new view to plot the progress of a specific metric for different values of a method parameter. For example, for knn, it is possible to analyze the number of base cases and runtime for different values of approximation error (epsilon), with different libraries/configurations. (PR: pull/17).

Acknowledgement:

I want to thank the mlpack community for giving me the opportunity to work with them this summer, it has been a fascinating experience! They were very responsive, I always found someone to talk in the IRC channel, willing to offer their time to discuss ideas or help me with my project! :)

For further information see my previous [blog posts].

Last week, I have been improving NeighborSearchRules and considering differents approaches for the implementation of Hybrid Spill Trees [1].

Spill Trees's decision boundary:

The initial implementation of Spill Trees, was based on BinarySpaceTrees's code [2].

In mlpack, KDTrees are built using the midpoint split but, when calculating the score and deciding how to traverse the tree, instead of the distance to the hyperplane determined by the midpoint split, the distance to the hrect bound is considered because it provides a tighter bound.

So, I decided to take the same approach for spill trees: build the tree splitting the points with a midpoint split, and consider the bounds when calculating the score and traversing the tree.

But this implementation is different to the approach mentioned in Ting Liu’s paper [1], where it is supposed that you consider the same hyperplane when splitting the list of points and when deciding which node to visit in a dfs search.

So, the difference is that, we consider a midpoint split hyperplane when building the tree, but we consider a different decision boundary when traversing the tree (the decision boundary is defined by the set of points with the same distance to the left child’s bound and the right child’s bound).

After analysing this difference, I realized that it could make a difference in the overlapping of nodes. We have been discussing about the advantages and disadvantages of these approaches in [issues/728] and [spilltrees.pdf].

Finally, we decided to implement a similar approach to the one mentioned in Ting Liu's paper, considering midpoint cutting hyperplanes when calculating the score, the same hyperplanes that are considered when building the tree.

Heaps for the list of candidates:

I realized in many parts of the code, we keep track of the best k candidates visited. In this situation, it is not necessary to keep a sorted list of the k elements. We only need to know the value of the worst candidate, so we can compare it with new elements and know if we must insert them to this list.

So, instead of maintaining a sorted list, a better approach is to use a priority queue.

I have been working on this improvement in: [pull/732]

Next week, I will work in the implementation of Dual Tree Search for Spill Trees. Follow the progress in: [3].

Last week, I have considered different modifications to the implementation of Hybrid Spill Trees [1].

I have summarized what I have implemented in a pdf file: [spilltrees.pdf]

Tau parameter:

I was wondering why don't consider a percentage instead of a fixed value for tau. I mean, considering a percentage of the range of values in the dimension considered, so the overlapping buffer could change according to the width of the range of values in each split.

I looked for other implementations of spill trees and I found that both approaches are considered.

So, I decided to read more documentation on spill trees. I have read Ting Liu's thesis [2]. She doesn't provide a strong justification but, when analysing the appropiate value for tau parameter, it is mentioned:

"Eq. (5.9) also implies that tau only depends on R_s and d, so it should be a fixed value for a given data set, the idea of changing tau for each partition does not work very well for this reason."

So, according to what was proposed there, I decided to continue considering a fixed value for tau.

Knn search:

As we don't know how many nodes will be pruned by the defeatist search, we can not be sure that we will have considered at least k points after traversing the tree. Maybe, we have visited less than k points by the end of the search, so we don't have enough points. This happens because with defeatist search we don't do backtracking. We can be sure we will visit at least 1 point, because all leaves contain at least one point.

So, I decided to modify actual implementation to guarantee that we always return k neighbors, no matter which tau parameter is used. This is achieved by doing backtracking, even on overlapping nodes, until we have k candidates. Because we do backtracking in some overlapping nodes, I had to modify the implementation to avoid repeated candidates when analysing the same point twice.

However, we have not decided yet if we will include this modification.

Also, I added many test cases for the structure of spill trees and knn search with them.

Next week, I will continue considering different options to improve the implementation of Spill Trees. Follow the progress in: [3].

Last week, I have completed the implementation of Hybrid Spill Trees [1].

It is possible to configure both parameters: the overlapping size ($\tau$) and the balance threshold ($\rho$), as I mentioned in last blog post.

Also, I have included a Single Tree Traverser to perform Hybrid SP-Tree Search [1].

Generally, neighbor search algorithms spend a large fraction of time backtracking to prove a candidate point is the true nearest neighbor. A different approach is to descend the metric tree without backtracking, and then output the point in the first leaf node visited as the nearest neighbor of the query point. This is called "defeatist seach" on a metric-tree. The problem with this approach is very low accuracy, specially when the query point is close to the decision boundary. However, Spill trees are especially appropriate for defeatist search. By including a overlapping buffer, we can increase the accuracy.

As mentioned in last blog post, Hybrid Spill trees have both overlapping and non-overlapping nodes. If we combine defeatist search with the previous approach, we have a Hybrid SP-Tree Search: "we only do defeatist search on overlapping nodes, for non-overlapping nodes, we still do backtracking using Neighbor Search Rules to decide when pruning".

We can control the hybrid by varying $\tau$. If $\tau$ is zero, we have a pure spill tree with defeatist search, very efficient but not accurate enough. If $\tau$ is a very large number, then every node is a non-overlapping node and we get back to the traditional metric tree, with prunning rules, perfectly accurate but not very efficient. By setting different values for $\tau$, we have a trade-off between efficiency and accuracy.

Next week, I will continue improving the implementation of Spill Trees and the Defeatist Seach and write some tests. Also, I have to continue thinking about possible alternatives to implement Hybrid SP-Tree Search as a dual tree algorithm. Follow the progress in: [2].

Last week, I have been working on the implementation of Spill Trees, as defined in: "An Investigation of Practial Approximate Nearest Neighbor Algorithms" [1].

Spill tree is a variant of binary space trees in which the children of a node can "spill over" each other, and contain shared datapoints.

Two new separating planes LP and RP are defined, both of which are parallel to the original decision boundary, at a distance $\tau$ from it. The region between LP and RP is called "overlapping buffer", and determines the points that are shared by the children nodes.

One problem with Spill Trees is that their depth varies considerably depending on the overlapping size $\tau$. For that reason, we are implementing Hybrid Spill Trees [1], which provide better guarantees.

For each node, we first split the points considering the overlapping buffer. If either of its children contains more than $\rho$ fraction of the total points we undo the overlapping splitting. Instead a conventional partition is used. In this way, we can ensure that each split reduces the number of points of a node by at least a constant factor, resulting in logarithmic depth of the tree.

The implementation of spill trees is similar to the implementation of Binary Space Trees. However, we need to manage the list of points differently. We are going to have overlapping nodes, so we can not use range of indexes of the main dataset's matrix as we do with binary trees. Therefore, actual implementation includes a general dataset instance (as we do with binary trees), and leaf nodes hold a vector of indexes pointing to columns of that matrix. Follow the progress in: [2].

Next week, I will continue working in the implementation of Spill Trees and the Defeatist Seach for approximate neighbor search.

Last week, I have been working on the benchmarking system [1]. After considering different options, I created a new view to plot the progress of a specific metric for different values of a method parameter. For example, for knn, it is possible to analyze the number of base cases and runtime for different values of approximation error (epsilon), with different libraries/configurations.

I executed some tests for: mlpack_knn with cover trees and kdtrees, and other libraries like ANN and FLANN. Results are available on: [2] ("Metric analysis with multiple parameters for an algorithm/dataset combination.")

We plan to benchmark approximate neighbor search with bigger datasets, using Jenkins's servers.

Next week, I will continue working in the implementation of Spill Trees [3]. I will consider different approaches to include this extension and analyse the possibility of implementing Hybrid SP-Tree Search as a dual tree algorithm.

Last week, I completed the improvement of existing code for NSModel [1], using boost variant to manage different options for tree types.

The general functionalities were implemented through visitor classes, taking advantage of template specialization to differentiate those tree types that accept leafSize as a parameter, where we need a slightly different procedure.

AppVeyor has shown some problems with the MSVC compiler, in the resolution of template parameters of alias templates, similar to an old issue: [2] We finally managed to fix this including some extra definitions.

All this work was merged in the PR: [3].

Also, I have been updating the benchmarking system to include an epsilon parameter for neighbor search, not only with mlpack, but also with other libraries like ANN and FLANN.

Now, I am working in adding a new view to the benchmarking system, to plot the progress of a specific metric for different values of a method parameter. In particular, we are interesting in analyzing the number of base cases and runtime for different values of approximation error (epsilon). A trade-off between doing neighbor search accurately and doing it quickly.

Next week, I plan to finish this comparison and start working in the implementation of some alternative approaches like Spill Trees [4].

Last week, I finished the extension for approximate neighbor search [1].

Mainly, I modified the code to include an epsilon value, which represents the accepted relative error. It is considered by the prune rules when deciding if a given node combination should be analyzed.

When doing dual tree search, the best between the modified $B_1$ bound (with epsilon) and the original $B_2$ bound is chosen. As was discussed in [2], with actual implementation, $B_2$ can not be relaxed to include the epsilon value.

Then, I added many test cases with different values of epsilon and different kind of trees (KDTree/BallTree/CoverTree), checking the relative error between the approximated results and the true best candidates.

The command line tools: mlpack_knn and mlpack_kfn, were updated to include an extra option "-e", to specify the relative error (default value: 0). We have been discussing about which approximation parameters use for KFN in [3].

Also, I have been working improving existing code for NSModel, as was suggested by Sumedh in [4]. After considering many options, boost variant seems to be the most appropiate.

Next week, I plan to continue working improving existing code and checking some details. Also, I will do some comparisons in performance of exact vs approximate neighbor search.

Last week we finished the discussion about neighbor search's bounds [1]. I updated existing code to consider slighty different bounds. Then, I compared the performance, using the benchmarking system with a modification to measure the number of base cases for ALLKNN/ALLKFN. Results shown exactly the same num of base cases for many datasets, so we decided to go ahead and merge it.

Then, I continued working in existing neighbor search code, updating the dual tree algorithm to do approximate nearest neighbor search. I modified the prune rule to include a relative error, as mentioned at the end of the paper [2], with a general implementation that works for both kfn and knn [3].

As we are doing approximate search, we can prune more than when an exact solution is required. For example, for knn, we consider the prune rule:

Prune if $d_{min}(N_q, N_r) > ( 1 / (1 + \epsilon ) ) * B_1(N_q)$.

Next week, I plan to continue working in the implementation of aprox knn, adding many test cases. Once everything is ready, I will merge it into the main repository.

I am working this summer on the implementation of approximate nearest neighbor search.

First, we want to update the existing dual tree algorithm to do approximate nearest neighbor search, modifying the prune rule to include an epsilon, as mentioned at the end of the paper [1].

After reading the paper: "Tree-Independent Dual-Tree Algorithms" [2] in detail, I found some issues in the definition of the B2 bound. We were discussing about it in [3] and, after thinking about the different special cases, we found some examples where actual definition could fail. We seem to have a solution. New week, I will work in updating existing code to fix this and compare the performance after this modification.

When reviewing the implementation of tree types, I found a subtle mistake! Ball trees were using hyper rectangle bounds instead of ball bounds! [4] It was fixed and is waiting to be merged.

Next week, I plan to continue improving the neighbor search code, to fix the issue related to the bounds. Once this is working properly, I could add the modification to do approximate search and do some tests.