mlpack
blog

Implementation of tree types  Summary
In this blog post I'll try to describe consisely the work that I have done for the mlpack
library as part of the GSoC project R+ trees, Hilbert R trees, vantage point trees, random projection trees, UB trees implementation this summer.
Summary
Here's a list of my Pull Requests. All these PRs are merged except Pull Request 746 (Universal B tree implementation, the code is under review now. The tree works correctly, maybe it requires a number of small fixes, I believe the PR will be merged soon).
 Hilbert R tree: 664
 R+ and R++ trees implementation: 699
 Vantage point tree: 708
 Removed RectangleTree::Points(): 709
 Hilbert R tree fixes: 710
 RectangleTree::NumDescendants() optimization and some RectangleTree fixes: 711
 Replaced SortStruct by std::pair: 721
 Fixed a segfault in RPlusTreeSplit. Fixed traits for the R+/R++ tree: 724
 Random projection trees: 726
 Fixed an error in MidpointSplit that may lead to a segfault: 741
 Universal B tree implementation: 746
 Vantage point tree and HRectBound improvements: 760
 Fixed copy constructor of RectangleTree and added move constructor: 767
(List of commits: list)
Hilbert R tree
I began my work with the Hilbert R tree. The tree is implemented according to the Hilbert Rtree: An Improved Rtree Using Fractals paper on the basis of the RectangleTree
class.
The Hilbert R tree differs from original Guttman's R trees by the split and the insertion methods. All points are being inserted into the Hilbert R tree according to their position on the Hilbert curve (the Hilbert value).
The Hilbert curve is a variant of spacefilling Peano curves. The algorithm that translates a point to the Hilbert value is based on the following paper Programming the Hilbert curve. The paper sugests an algorithm that allows to calculate the Hilbert value by means of bit operations whereas previous attempts construct the curve using recursion. The method is implemented in DiscreteHilbertValue
class. I tried a different approach that compares the Hilbert order of two points by means of recursion but the algorithm was much slower than the method suggested in the paper.
The changes are contained in Pull Request #664 and Pull Request #710.
R+/R++ tree
I continued my project with the R+ tree which is based on the paper The R+tree: A dynamic index for multidimensional objects. The main advantage of the tree is the property that children do not overlap each other.
When I had implemented the tree I found an error in the insertion algorithm. The paper does not discuss the method in detail. It is not difficult to think out a case when we can not insert a point and preserve the property of nonoverlapping children. Another problem is that the tree does not guarantee that the number of child nodes after the split algorithm is less than the maximum allowed number. The issues appear to be well known problems. I looked through a series of articles and came across the paper R++ tree: an efficient spatial access method for highly redundant point data that describes a variant of the R+ tree that takes into account the maximum bounding rectangle and avoids the issues. So, I've added a workaround that solves the R+ tree problems by increasing the maximum size of the node and I've implemented the R++ tree.
The implementation of the R+ tree and the R++ tree can be found in Pull Request 699 and Pull Request 724.
Vantage point tree
The vantage point tree is implemented on the basis of the paper Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. The implementation appears to be much slower than the implementation of the kd tree. The main disadvantage is that the bound of a node is too complex and attempts to calculate the distance between a point and a vptree node precisely require a lot of computations. The first version of the vantage point tree contains a point in each intermediate node. Then Ryan suggested to move vantage points to separate nodes in order to simplify tree traversal rules. Then I tried to implement a variant of the vantage point tree that contains points only in leaf nodes. That variant appears to be faster then the previous variants.
Moreover, I tried a number of different bounds. Right now, the tree uses the bound which consists of two nonconcentric balls, the outer ball is the minimum bounding ball and the inner one is centered at the vantage point.
The changes can be found in Pull Request 708 and Pull Request 760. Moreover, I implemented a variant of the vantage point tree that uses the hollowrect bound which consists of the minimum bounding rectangle and a number of rectangular hollows. I didn't issue a PR since the implementation is slower than the implementation of the vantage point tree (I pushed the code here).
Random projection tree
The implementation of random projection trees are based on the paper Random projection trees and low dimensional manifolds.
I implemented two versions of the split method: the mean split and the max split. Both methods are described in the paper. Unfortunately, the tree performs much slower than the kd tree since the bound of the random projection tree is a polytope. There are a number of methods that calculates the distance between a point and a polytope but they are based on the gradient descent method and require a lot of computations. Right now, the tree uses an ordinary rectangular bound.
The tree is implemented on the basis of the BinarySpaceTree
class. The changes can be found in Pull Request 726.
Universal B tree
I think the universal B tree is the most complicated algorithm that I have implemented this summer. The implementation is based on the paper The Universal BTree for Multidimensional indexing: General Concepts.
The universal B tree is a variant of the B tree. Like the Hilbert R tree, the UB tree maps multidimensional space to a linear ordered space. All points are being inserted into the tree according to that ordering. Notably, this mapping preserves the multidimensional clustering.
The bound of the UB tree is a stairwayshaped figure in multidimensional space. Thus, the bound consists of a number of nonoverlapping hyperrectangles. Of course, the tree allows to use nonoverlapping bounds but that leads to a huge number of computations.
The changes can be found in Pull Request 746. The PR is not merged yet, but I believe it will be merged soon.
Other changes
I made a number of changes that are not connected directly with my project, I'll briefly describe them here. I've removed extensions of the TreeType API in the RectangleTree
class. The changes can be found in Pull Request 709. Pull Request 711 contains optimizations that allow to find descendant points by a logarithmic time. Pull Request 721 solves Issue 712. Pull Request 741 solves a problem which has appeared after refactoring. And Pull Request 767 adds the copyconstructor and the move constructor to the RectangleTree
class.
TODOs
Right now, some tree types like the vantage point tree and random projection trees use looserbounds. That leads to a huge number of base cases in dualtree algorithms. I think these bounds may be even more optimized but I don't know how.
Another problem is concerned with RankApproximate Nearest Neighbor Search. The algorithm often fails with some tree types (e.g. the Hilbert R tree). It is an interesting problem to understand why that happens.
Acknowledgement
That was an amazing summer, thanks to the mlpack
team and especially to Ryan Curtin who looked through the code and has suggested a lot of advices on the optimization of the algorithms and to Marcos Pividori who has proposed a series of ideas on the optimization of the RectangleTree
class (such as Pull Request 711 and Pull Request 767).
For further information see my blog posts.
Generated by 1.8.13