In this blog post I'll try to describe consisely the work that I have done for the mlpack library as part of the GSoC project R+ trees, Hilbert R trees, vantage point trees, random projection trees, UB trees implementation this summer.

### Summary

Here's a list of my Pull Requests. All these PRs are merged except Pull Request 746 (Universal B tree implementation, the code is under review now. The tree works correctly, maybe it requires a number of small fixes, I believe the PR will be merged soon).

• Hilbert R tree: 664
• R+ and R++ trees implementation: 699
• Vantage point tree: 708
• Removed RectangleTree::Points(): 709
• Hilbert R tree fixes: 710
• RectangleTree::NumDescendants() optimization and some RectangleTree fixes: 711
• Replaced SortStruct by std::pair: 721
• Fixed a segfault in RPlusTreeSplit. Fixed traits for the R+/R++ tree: 724
• Random projection trees: 726
• Fixed an error in MidpointSplit that may lead to a segfault: 741
• Universal B tree implementation: 746
• Vantage point tree and HRectBound improvements: 760
• Fixed copy constructor of RectangleTree and added move constructor: 767

(List of commits: list)

### Hilbert R tree

I began my work with the Hilbert R tree. The tree is implemented according to the Hilbert R-tree: An Improved R-tree Using Fractals paper on the basis of the RectangleTree class.

The Hilbert R tree differs from original Guttman's R trees by the split and the insertion methods. All points are being inserted into the Hilbert R tree according to their position on the Hilbert curve (the Hilbert value).

The Hilbert curve is a variant of space-filling Peano curves. The algorithm that translates a point to the Hilbert value is based on the following paper Programming the Hilbert curve. The paper sugests an algorithm that allows to calculate the Hilbert value by means of bit operations whereas previous attempts construct the curve using recursion. The method is implemented in DiscreteHilbertValue class. I tried a different approach that compares the Hilbert order of two points by means of recursion but the algorithm was much slower than the method suggested in the paper.

The changes are contained in Pull Request #664 and Pull Request #710.

### R+/R++ tree

I continued my project with the R+ tree which is based on the paper The R+-tree: A dynamic index for multi-dimensional objects. The main advantage of the tree is the property that children do not overlap each other.

When I had implemented the tree I found an error in the insertion algorithm. The paper does not discuss the method in detail. It is not difficult to think out a case when we can not insert a point and preserve the property of non-overlapping children. Another problem is that the tree does not guarantee that the number of child nodes after the split algorithm is less than the maximum allowed number. The issues appear to be well known problems. I looked through a series of articles and came across the paper R++ tree: an efficient spatial access method for highly redundant point data that describes a variant of the R+ tree that takes into account the maximum bounding rectangle and avoids the issues. So, I've added a workaround that solves the R+ tree problems by increasing the maximum size of the node and I've implemented the R++ tree.

The implementation of the R+ tree and the R++ tree can be found in Pull Request 699 and Pull Request 724.

### Vantage point tree

The vantage point tree is implemented on the basis of the paper Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. The implementation appears to be much slower than the implementation of the k-d tree. The main disadvantage is that the bound of a node is too complex and attempts to calculate the distance between a point and a vp-tree node precisely require a lot of computations. The first version of the vantage point tree contains a point in each intermediate node. Then Ryan suggested to move vantage points to separate nodes in order to simplify tree traversal rules. Then I tried to implement a variant of the vantage point tree that contains points only in leaf nodes. That variant appears to be faster then the previous variants.

Moreover, I tried a number of different bounds. Right now, the tree uses the bound which consists of two non-concentric balls, the outer ball is the minimum bounding ball and the inner one is centered at the vantage point.

The changes can be found in Pull Request 708 and Pull Request 760. Moreover, I implemented a variant of the vantage point tree that uses the hollow-rect bound which consists of the minimum bounding rectangle and a number of rectangular hollows. I didn't issue a PR since the implementation is slower than the implementation of the vantage point tree (I pushed the code here).

### Random projection tree

The implementation of random projection trees are based on the paper Random projection trees and low dimensional manifolds.

I implemented two versions of the split method: the mean split and the max split. Both methods are described in the paper. Unfortunately, the tree performs much slower than the k-d tree since the bound of the random projection tree is a polytope. There are a number of methods that calculates the distance between a point and a polytope but they are based on the gradient descent method and require a lot of computations. Right now, the tree uses an ordinary rectangular bound.

The tree is implemented on the basis of the BinarySpaceTree class. The changes can be found in Pull Request 726.

### Universal B tree

I think the universal B tree is the most complicated algorithm that I have implemented this summer. The implementation is based on the paper The Universal B-Tree for Multidimensional indexing: General Concepts.

The universal B tree is a variant of the B tree. Like the Hilbert R tree, the UB tree maps multidimensional space to a linear ordered space. All points are being inserted into the tree according to that ordering. Notably, this mapping preserves the multidimensional clustering.

The bound of the UB tree is a stairway-shaped figure in multidimensional space. Thus, the bound consists of a number of non-overlapping hyper-rectangles. Of course, the tree allows to use non-overlapping bounds but that leads to a huge number of computations.

The changes can be found in Pull Request 746. The PR is not merged yet, but I believe it will be merged soon.

### Other changes

I made a number of changes that are not connected directly with my project, I'll briefly describe them here. I've removed extensions of the TreeType API in the RectangleTree class. The changes can be found in Pull Request 709. Pull Request 711 contains optimizations that allow to find descendant points by a logarithmic time. Pull Request 721 solves Issue 712. Pull Request 741 solves a problem which has appeared after refactoring. And Pull Request 767 adds the copy-constructor and the move constructor to the RectangleTree class.

### TODOs

Right now, some tree types like the vantage point tree and random projection trees use looser-bounds. That leads to a huge number of base cases in dual-tree algorithms. I think these bounds may be even more optimized but I don't know how.

Another problem is concerned with Rank-Approximate Nearest Neighbor Search. The algorithm often fails with some tree types (e.g. the Hilbert R tree). It is an interesting problem to understand why that happens.

### Acknowledgement

That was an amazing summer, thanks to the mlpack team and especially to Ryan Curtin who looked through the code and has suggested a lot of advices on the optimization of the algorithms and to Marcos Pividori who has proposed a series of ideas on the optimization of the RectangleTree class (such as Pull Request 711 and Pull Request 767).

For further information see my blog posts.

Last week I have been working on fixing various tree types.

### Random projection trees

I made some minor fixes of random projection trees and wrote a simple test that checks if there is a hyperplane that splits a node. I find the hyperplane as a solution of a linear system of inequalities. Curiously, but it seems the easiest iterative method solves the system fine.

### Universal B-tree

I fixed a problem in the address-to-point conversion method. The method had led to wrong results if some addresses correspond to negative numbers. Moreover, since floating point data types have some excess information, some addresses correspond to infinite points. Thus, I have to add some restrictions in the conversion algorithm.

### Vantage point tree

I implemented a method that tries to get a tighter bound by investigating the parent bound. The method calculates the distance to the "corner" (the intersection of two (n-1)-dimensional spheres that is an (n-2)-dimensional sphere) using properties of orthogonal transforms. The results appear to be worse since this calculation requires too many arithmetic operations and the number of base cases has decreased slightly.

### Hollow hyperrectangle vantage point tree

I implemented a bound that consists of an outer rectangle and a number of rectangular hollows. I tried to test the vantage point tree with this bound. Right now, the original VP-tree bound outperforms this one, but I think the hollow-hrect bound should work faster. So, I'll continue working on the optimization of the bound.

Last week I have been working on the universal B tree. I finished the implementation of the bound and wrote a series of tests.

In order to explain what the bound is I have to introduce the notions $\textbf{area}$ and $\textbf{address}$.

An $\textbf{address}\ \alpha$ for an $\textbf{area}$ in an $n$-dimensional cube is a sequence $i_1.i_2.\ldots.i_l$, where $i_j\in {1,2,\ldots,2^n}$.

I'll define the notion $\textbf{area}$ recursively. Let $\alpha$ be an $\textbf{address}$ for an $\textbf{area}$ in an $n$-dimensional cube C ($\textbf{area}(C)$). Partition C into $2^n$ subcubes $sc(i),\ i=1,2,\ldots,2^n$ by dividing C in the middle of each dimension. If $\alpha=k$ then $\textbf{area}(C)[k]=\bigcup_{i=1}^{k}sc(i),\ k=0,1,\ldots,2^n-1$. If $\alpha=k.\beta$ then $\textbf{area}(C)[\alpha]=\textbf{area}(C)[k]\cup \textbf{area}(sc(k+1))[\beta]$, where $\textbf{area}(sc(k+1))[\beta]$ is an area in $sc(k+1)$.

Each child node of the universal B tree is bounded by two addresses i.e. the bound of a node is the difference of two areas. Since I deal with floating point numbers I use fixed-sized addresses in the whole space. The calculation of the distance between two nodes requires a lot of arithmetic operations. So, I decided to restrict the number of steps in the bound. Thus, children may overlap each other.

Last week I have been working on the vantage point tree and the universal B-tree.

### Vantage point tree

I tried various fixes of the vantage point tree. Namely, I moved vantage points to separate nodes. That should considerably simplify pruning rules (we needn't implement the score algorithm for a query node and a reference point). Thus, each intermediate node of the vantage point tree contains three children. The first child contains only the vantage point. In such a way, the first point of the first sibling of a node is the centroid of the node.

This property allows to use some optimizations in dual-tree algorithms and I added this check to NeighborSearchRules and RangeSearchRules. There are still some errors, some base cases are duplicated.

### Universal B-tree

The UB-tree is a variant of the B-tree. It introduces an ordering in multidimensional space i.e. the space is mapped to a linear ordered space. Notably, this mapping preserves the multidimensional clustering. Thus, all points are inserted into the UB-tree according to this ordering. I implemented the split algorithm. Right now, I am working on the implementation of the UB-tree bound. This bound is slightly more complicated than HRectBound since it consists of a number of hyperrectangles. In contrast to VantagaPointTree, this bound preserves a crutial property of BinarySpaceTree. Namely, UB-tree's children do not overlap each other.

Last week I continued working on the vantage point tree. Since the first point of each left subtree of the vantage point tree is the centroid of its bound and the right subtree does not contain the centroid at all, I added a new method (IsFirstPointCentroid()) to the TreeType API. This check should allow dual and single tree algorithms to use some optimizations.

Moreover, I implemented two types of the random projection tree: RPTreeMax and RPTreeMean (these tree types are based on BinarySpaceTree). Right now, these tree types use HRectBound, I implemented only the split algorithm.

The RPTreeMax split divides a node by a random hyperplane. The position of the hyperplane may differ from the median by a random deviation. The RPTreeMean split divides a node by a random hyperplane if the diameter is less or equal to the average distance between points multiplied by a constant.

My implementation is sligthly different from the original algorithms. I use a fixed number of random dataset points in order to calculate the median and the average distance between points. And since the algorithm that the paper introduces in order to find the deviation from the median is weird (one of two resulting nodes may be empty), I shrank the bounds of this deviation.

Last week, I have been working on the implementation of HollowBallBound. This class represents the area included between two concentric ball bounds. The radius of the inner ball may be equal to 0.

The HollowBallBound class is intended to replace the original piece-wise ball bound of the vantage point tree. This bound allows to calculate the minimum distance to a point much faster than the original bound. One of the disadvantages is the loss the crucial property of the binary space tree: the use of HollowBallBound may lead to overlapping children.

Since the vantage point tree is slightly different from the binary space tree, namely each intermediate node contains a point, I had to move the vantage point tree to a separate class. Unfortunately, that led to the duplication of the code.

Moreover, I added some tests for the vantage point tree.

Last week I had been working on some fixes of the RectangleTree class, including the removal of RectangleTree::Points(), the removal of RectangleTree::Children() and the optimization of RectangleTree::NumDescendants().

The removal of the Children() method requires to make SplitType, AuxiliaryInformationType and DescentType friends of the RectangleTree class. The optimization of NumDescendants() method allows Descendant() to take $O(\log N)$ operations instead of $O(N)$.

Moreover, I finished the R+/R++ tree implementation. Namely, I replaced SortStruct by std::pair in MinimalCoverageSweep and MinimalSplitsNumberSweep, I simplified the code of MinimalCoverageSweep and I fixed the error with addition of a huge amount of equal points.

Unfortunately, I failed in the implementation of the PiecewiseBallBund class since I am not able to think out any cost-efficient method of finding the distance between a point and a piece-wise ball bound. The problem is not easy even in 2 dimensions.

As I had planned in the previous post I did the refactoring of the R+/R++ tree. I added a template parameter SweepType to the RPlusTreeSplit class and implemented a sweep method that tries to partition an intermediate node with the minimum number of splits. All sweep techniques are designed in a tree-independent manner. In order to do that I introduced a template parameter SplitPolicy. This parameter helps to determine the subtree in which we should insert any particular child of an intermediate node that is being split. Also I fixed some errors.

Moreover, I began the vantage point tree implementation. Right now the split method (VantagePointSplit class) is implemented. I have to implement a piecewise ball boundary type since the BallBound is not suitable for the vantage point tree. Also I did not think about vantage point tree-specific tests.

Last week I had been working on the R+ tree implementation. I came across some curious issues. Namely, the R+ tree appears to be buggy.

The R+ tree requires that the nodes do not overlap each other. This leads to some problems in the insertion algorithm and in the split method. Firstly, the paper does not explain the insertion algorithm in detail. The paper states that we should insert a rectangle in each node that overlaps the rectangle. But the paper does not discuss the case when there are no such nodes. I thought that I can always enlarge a node in order to insert a point but when I tried to implement that a simple test showed that there are some cases when it is impossible. Later I thought out such case in 2D. Another problem of the R+ tree is the split algorithm. Sometimes the split should be propogated downward (because of the disjointness property). Hence the split algorithm does not guarantee that nodes will not be overflowed after the split.

I looked trough many papers and found a curious modification of the R+ tree: R++ tree. The R++ tree suggests to store the maximum bounding rect of a node. And the maximum bounding rects of two nodes do not overlap. This property helps to avoid the R+ tree problems. But sometimes this leads to empty leaf nodes.

So, I implemented the R+ tree and the R++ tree. I solved the problem with the insertion algorithm by addition of a node. But I am not sure that my implementation of the R+ tree has not problems with the split algorithm. The R+ tree definitely has problems on rectangle data (suppose all data rectangles overlap each other). But I do not know an example in case of point data. Right now the R++ tree uses the sweep method of the R+ tree. I'd like to implement some different sweep techniques.

As I had planned in the previous post I modified base cases and pruning rules for the neighbor search method, the range search algorithm and the k rank-approximate-nearest-neighbors search method.

In order to do that I had to modify the RectangleTree traversal algorithms. Since the methods described above represent results as a vector of the indices of points in the dataset I had to implement an approach for calculating the index for any particular point in the tree.

Also I started working on the implementation of the R+ tree.

P.S. Unfortunately, we decided that these modifications make base cases and pruning rules too complicated. So, these changes have to be removed. But as for me it was interesting to explore this as a possibility.