Dataset and Experimentation Tools : Summary

In this blog post I'll try to describe my contributions I've made to mlpack this summer.


Here is the link for all my pull requests pull requests. Below is the list of the major pull requests with self-explanatory descriptions.

  • Descriptive Statistics command-line program : 742
  • DatasetMapper & Imputer 694
  • delete unused string_util : 692
  • fix default output problem and some styles : 680
  • Binarize Function + Test : 666
  • add Split() without label: 654
  • add cli executable for data_split : 650

Descriptive Statistics

I originally built a class that calculates descriptive statistics. But after a few discussion, I ended up shrinking all of the functions down to minimum to provide maximum performance and maintainability. I also merged all commits to one to discard unnecessary commits.

Sample output on "iris.csv" would be:

[INFO ] dim     var     mean    std     median  min     max     range   skew    kurt    SE      
[INFO ] 0       0.6857  5.8433  0.8281  5.8000  4.3000  7.9000  3.6000  0.3149  -0.5521 0.0676  
[INFO ] 1       0.1880  3.0540  0.4336  3.0000  2.0000  4.4000  2.4000  0.3341  0.2908  0.0354  
[INFO ] 2       3.1132  3.7587  1.7644  4.3500  1.0000  6.9000  5.9000  -0.2745 -1.4019 0.1441  
[INFO ] 3       0.5824  1.1987  0.7632  1.3000  0.1000  2.5000  2.4000  -0.1050 -1.3398 0.0623  

Users can control the width and precision using -w and -p flag. I tested the output using excel and they match perfectly.

DatasetMapper & Imputer

I renamed DatasetInfo to DatasetMapper, which accepts template parameter of MapPolicy. ( can be used to store different kinds of maps.) DatasetMapper, however, still provides backward compatibility with typedef: using DatasetInfo = DatasetMapper<IncrementPolicy>. The IncrementPolicy denotes the original mapping policy used, which increments numbers for different categories, starting from 0.

Imputer class is also added in this pull request. Imputer also accepts template parameter called ImputationStrategy, so that different strategies can be applied.

Lastly, a command line program called "mlpack_preprocess_imputer.cpp" was added to the mlpack.


This is a simple implementation of binarize function which transforms values in matrix to 0 or 1 according to the threshold. You can use umat A = (B > C) but this function has a overload that applies binarize to only one dimension. Plus, it can produce any type of matrix, not umat.


I added TrainTestSplit() and renamed old ones to LabelTrainTestSplit() as discussed in #651 . This is just a naive implementation mostly copied from Tham's work. I believe LabelTrainTestSplit can just reuse the code in TrainTestSplit twice for both data and labels.

I also implemented "mlpack_preprocess_split.cpp".

Other changes

I also made minor contributions in debugging and fixing styles, especially related to data IO.


I wish to keep contributing to mlpack. I will try to polish the works a little bit more, and especially, I would LOVE to contribute to the deep learning modules. I've been personally reading papers about sequence-to-sequence models, which are used widely for natural language processing and timeseries data analytics.


I thank the mlpack mentors and especially to Tham who gave me a lot of advises through code reviews.


We need to go deeper, Googlenet : Project Summary

This blogpost discusses the project summary and my contributions over the summer, as GSoC 2016, approaches its conclusion. First we'll discuss how you can find most of the work that I did, this will be a list of commits from mlpack or from various branches in my fork. Then we'll discuss what were the goals of the project, and how much we accomplished them, and finally what I learnt over the summer and why working on this project with my mentors was great! :)

Project Summary

The goal of this project was to develop googlenet such that it integrates in mlpack's existing ANN API and the modules developed are reusable to other related applications.

We selected the edge_boxes algorithm for object localization. We performed the feature extraction 703 for a sample BSDS500 Dataset. Marcus helped with reusing the Tree implementation to train the structured random forest which detects edges in the image. Next, we started implementing the Neural Net part. We added functionality to the pooling layer, convolutional layer and implemented the inception layer (which is replicated throughout googlenet), concatenation layer, subnetwork layer and connect layer as additional layers and wrote the [tests][tests] for them. This will give mlpack users a significant flexibility in training more complicated and deep neural nets. We made the GoogleNet using these layers. Tests of our implementation on standard datasets still need to be finished. Here is the list of commits and pull requests (from recent to old) in different branches, with their description, you can see to track the work done over summer:

To see the week by week progress of the work done you can look at the blog.

Feature Extraction

For feature extraction the objective was that given images, segmentations and boundaries extract over 7000 features for different color spaces, gradient channels to capture the local edge structure in the images. Fast edge detection using structured forests and Sketch tokens are the papers that describe this task.

We began this process by first writing useful Image Processing algorithms to convert between color spaces (RGB to LUV), interpolating & padding images, performing convolutions, computing HOG features and calculating distance transforms. The distance transform algorithm is implemented in this paper. Then we calculated Regular and Self Similarity features on a 16x16 image patches, for 1000 edge locations and 1000 non-edge locations. For this we also shrunk channels to reduce the dimensionality of our data, and then discretized our features into classes by performing PCA, so the data could be represented by a normal decision tree or random forest. The implementations of these algorithms can be seen in 703. Then I wrote the tests for the functions implemented in the StructuredForests class and compared values against the reference implementations to verify their correctness. Marcus helped by providing implementation of the structured tree using the feature extraction code.

Inception Layer

Next, we proceeded to implement the Inception Layer. Before doing this, I needed to read some papers alexnet, visualizing CNNs to understand some CNN architectures and some ideas like Network in Network, that Googlenet paper uses by replicating the inception network inside it 9 times. It took time to understand the mlpack CNN class implementation as it uses interesting techniques of generating code using compile time recursion on templates which I was previously oblivious of. Then we made an inception layer as a collection of layers as mentioned in googlenet paper and wrote the tests for it to verify correctness. The implementation of inception layer can be seen in 757.

Adding functionality to Pooling Layer and Convolution Layer

While writing tests for inception layer, it was noticed that some functionalities of the existing classes need to be modified. For the pooling layer I added the functionality to pool with a given stride value. Then we improved the convolution layer to support Forward, Backward and Gradient updates when padding is specified. Padding is very important for deep networks, as we are able to preserve the width of our data by specifying padding, otherwise the data will become smaller as we continue to perform pooling and convolution operations on it, and we will not be able to get a neural net "deep enough". Then I wrote the tests for the pooling layer and convolution layer, and now the test for inception layer passed correctly too!

Concatenation Layer and Subnetwork Layer

On Examining the structure of the googlenet network, we felt that we need a concatenation layer. This layer will give us the functionality to concatenate the outputs of two or more layers in the forward pass, and then distribute the errors among the constituent layers for the backward pass. So I wrote a concat_layer that does exactly this and the corresponding tests.

The goal of this project was to create the components of googlenet so they are also reusable to other applications. So to make duplicating any collection of layers in a deep network easier, we decided to implement a subnet layer. The tests for the subnet layer is still under construction which will implement the inception_layer using the subnet_layer and check for correctness.

Connect Layer

With the googlenet network we faced one more interesting problem - auxillary classifiers. From one layer, there could be 2 layers diverging, and both of these layers would end up at separate output layers. Auxillary classifiers are added to googlenet to combat vanishing gradient problem while providing regularization. In mlpack implementation, the layers are stacked sequentially in the form of a tuple. To support this architectural variant, where 2 layers emerge from one layer, we added a connect layer, which contains the 2 separate nets that emerge from it, and has responsibility for passing input to and collect errors from these nets. Tests still need to be written to for the connect layer.


After all the basic components have completed, creating googlenet is as simple as stacking up all of the layers, put the desired values from the paper and calling the Train() and Predict() functions of CNN class to evaluate outputs. When we are able to complete all refinements we need to make to, all the components developed in this project, training deep neural nets with mlpack will become effortless. There is also one variant of googlenet which uses batch normalization, that I plan to contribute to mlpack with the guidance of Marcus after GSoC 2016.


The following things still need to be completed in order to achieve all the goals mentioned in our proposal: 1. Complete the edge boxes implementation. 2. Write rigorous tests for googlenet. 3. Minor improvements suggested by my mentors in the current Pull requests.


I want to thank the mlpack community for giving me this awesome opportunity to work with them on this amazing project over the summer. I was welcomed right from the first day I joined the irc channel in the beginning of the student application period, when I wasn't even sure what project I wanted to apply to for GSoC 2016. Special Thanks to my mentors Marcus Edel and Tham Ngap Wei, for clearing all my doubts (sometimes even unrelated to the project :) ) with so much patience and simple explainations, and helping me with design and debugging of the project. I feel I have learnt a lot from them, and I really enjoy being part of the mlpack community. This was a great experience, Thank you very much!


Approximate Nearest Neighbor Search - Conclusion

I have been working this summer on the GSoC Project: "Approximate Nearest Neighbor Search" (Project site: 1).

The main contribution can be summarized in these pull requests:

(List of commits: [link])

Approximate nearest neighbor search:

I modified mlpack's code to include an epsilon value, which represents the maximum relative error. It is considered by the prune rules when deciding if a given node combination should be analyzed, (as suggested at the end of the paper 2), with a general implementation that works for both KFN and KNN.

The command line tools: mlpack_knn and mlpack_kfn, were updated to include an extra option "-e", to specify the maximum relative error (default value: 0).

The main code developed was included in the pull/684.

Take into into account that epsilon represents the maximum relative error. So, the average relative error (effective error) will be much smaller than the epsilon value provided.

As expected, higher values of the epsilon parameter implies that more nodes are pruned and, therefore, we have a faster algorithm, as can be seen in the next graphic for the dataset isolet:

[See graphic]

Spill Trees:

I have implemented Spill Trees, as defined in: "An Investigation of Practial Approximate Nearest Neighbor Algorithms" 3 (pull/747). It is a variant of binary space trees in which the children of a node can "spill over" each other, and contain shared datapoints.

One problem with Spill Trees is that their depth varies considerably depending on the overlapping size tau.

For that reason, I have implemented Hybrid Spill Trees 3, which provide better guarantee in the logarithmic depth of the tree.

The extension was incorporated to existing mlpack_knn. You can specify "-t spill" to consider spill trees, and the command line paramenters "--tau" to set different values for the overlapping size (default value is tau=0), and "--rho" to set different values for the balance threshold (default value is rho=0.7).

Spill Trees's decision boundary:

We have considered many different approaches for the implementation of Spill Trees, see discussions in: issues/728 and [spilltrees]. Finally, we decided to implement a similar approach to the one mentioned in Ting Liu's paper.

Splitting Hyperplanes

Actual implementation of Spill Trees can be configured to choose between different kind of splitting Hyperplanes:

+) Providing the template parameter HyperplaneType between two candidate classes: AxisOrthogonalHyperplane and Hyperplane (not necessarily Axis-Orthogonal). (In both cases the hyperplane with the widest range of projection values will be chosen).

+) Providing the template parameteter SplitType, between two candidate classes: MidpointSpaceSplit, MeanSpaceSplit, which determines the splitting value to be considered.

By default, mlpack_knn considers Axis-Orthogonal Hyperplanes and Midpoint Splits, because it is the most efficient option (we can project any vector in O(1) time).

Defeatist Traversers

I have implemented Hybrid SP-Tree Search as defined in 3. We can control the hybrid by varying tau. If tau is zero, we have a pure spill tree with defeatist search, very efficient but not accurate enough. If tau is a very large number, then every node is a non-overlapping node and we get back to the traditional metric tree, with prunning rules, perfectly accurate but not very efficient. By setting different values for tau, we have a trade-off between efficiency and accuracy.

Also, I implemented a Defeatist Dual Tree Traverser, where the query tree is built without overlapping.

The DefeatistDualTreeTraverser is faster than the DefeatistSingleTreeTraverser, specially when the value of tau increases.

Some results can be seen in the next graphic for the dataset 1000000-10-randu.

[See graphic]

General Greedy Search Traverser:

Also, I implemented a general greedy single tree traverser to perform greedy search, that always chooses the child with the best score when traversing the tree, and doesn't consider backtracking: GreedySingleTreeTraverser.

It is a tree independent implementation (based on the general TreeType API). As expected, the greedy traverser performs considerably faster than other approachs, at the cost of some relative error in the results. (PR: pull/762). Further disccusion in: issues/761.

We can simply reduce the relative error by increasing the leaf size of the tree, as is shown in the next graphic for the dataset isolet.

[See graphic]

Other improvements

Also, I have been improving some parts of the existing code:

Bound Issues:

I found some issues in the definition of the B2 bound. We were discussing about it in issues/642 and, after thinking about the different special cases, we found some examples where actual definition could fail. I fixed existing code to consider slighty different bounds.

Improve NSModel implementation:

I have been improving existing code for NSModel, as was suggested in issues/674, using boost variant to manage different options for tree types. (PR: pull/693).

Heaps for the list of candidates:

I realized in many parts of the code, we were keeping track of the best k candidates visited through a sorted list. Instead of maintaining a sorted list, a better approach was to use a priority queue. I implemented this in pull/732.

Benchmarking system:

I have been updating the benchmarking system to include approximate neighbor search not only with mlpack, but also with other libraries like ANN and FLANN (PR: pull/14 and pull/19 ).

Also, I created a new view to plot the progress of a specific metric for different values of a method parameter. For example, for knn, it is possible to analyze the number of base cases and runtime for different values of approximation error (epsilon), with different libraries/configurations. (PR: pull/17).


I want to thank the mlpack community for giving me the opportunity to work with them this summer, it has been a fascinating experience! They were very responsive, I always found someone to talk in the IRC channel, willing to offer their time to discuss ideas or help me with my project! :)

For further information see my previous [blog posts].


Implementation of tree types : Summary

In this blog post I'll try to describe consisely the work that I have done for the mlpack library as part of the GSoC project R+ trees, Hilbert R trees, vantage point trees, random projection trees, UB trees implementation this summer.


Here's a list of my Pull Requests. All these PRs are merged except Pull Request 746 (Universal B tree implementation, the code is under review now. The tree works correctly, maybe it requires a number of small fixes, I believe the PR will be merged soon).

  • Hilbert R tree: 664
  • R+ and R++ trees implementation: 699
  • Vantage point tree: 708
  • Removed RectangleTree::Points(): 709
  • Hilbert R tree fixes: 710
  • RectangleTree::NumDescendants() optimization and some RectangleTree fixes: 711
  • Replaced SortStruct by std::pair: 721
  • Fixed a segfault in RPlusTreeSplit. Fixed traits for the R+/R++ tree: 724
  • Random projection trees: 726
  • Fixed an error in MidpointSplit that may lead to a segfault: 741
  • Universal B tree implementation: 746
  • Vantage point tree and HRectBound improvements: 760
  • Fixed copy constructor of RectangleTree and added move constructor: 767

(List of commits: list)

Hilbert R tree

I began my work with the Hilbert R tree. The tree is implemented according to the Hilbert R-tree: An Improved R-tree Using Fractals paper on the basis of the RectangleTree class.

The Hilbert R tree differs from original Guttman's R trees by the split and the insertion methods. All points are being inserted into the Hilbert R tree according to their position on the Hilbert curve (the Hilbert value).

The Hilbert curve is a variant of space-filling Peano curves. The algorithm that translates a point to the Hilbert value is based on the following paper Programming the Hilbert curve. The paper sugests an algorithm that allows to calculate the Hilbert value by means of bit operations whereas previous attempts construct the curve using recursion. The method is implemented in DiscreteHilbertValue class. I tried a different approach that compares the Hilbert order of two points by means of recursion but the algorithm was much slower than the method suggested in the paper.

The changes are contained in Pull Request #664 and Pull Request #710.

R+/R++ tree

I continued my project with the R+ tree which is based on the paper The R+-tree: A dynamic index for multi-dimensional objects. The main advantage of the tree is the property that children do not overlap each other.

When I had implemented the tree I found an error in the insertion algorithm. The paper does not discuss the method in detail. It is not difficult to think out a case when we can not insert a point and preserve the property of non-overlapping children. Another problem is that the tree does not guarantee that the number of child nodes after the split algorithm is less than the maximum allowed number. The issues appear to be well known problems. I looked through a series of articles and came across the paper R++ tree: an efficient spatial access method for highly redundant point data that describes a variant of the R+ tree that takes into account the maximum bounding rectangle and avoids the issues. So, I've added a workaround that solves the R+ tree problems by increasing the maximum size of the node and I've implemented the R++ tree.

The implementation of the R+ tree and the R++ tree can be found in Pull Request 699 and Pull Request 724.

Vantage point tree

The vantage point tree is implemented on the basis of the paper Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. The implementation appears to be much slower than the implementation of the k-d tree. The main disadvantage is that the bound of a node is too complex and attempts to calculate the distance between a point and a vp-tree node precisely require a lot of computations. The first version of the vantage point tree contains a point in each intermediate node. Then Ryan suggested to move vantage points to separate nodes in order to simplify tree traversal rules. Then I tried to implement a variant of the vantage point tree that contains points only in leaf nodes. That variant appears to be faster then the previous variants.

Moreover, I tried a number of different bounds. Right now, the tree uses the bound which consists of two non-concentric balls, the outer ball is the minimum bounding ball and the inner one is centered at the vantage point.

The changes can be found in Pull Request 708 and Pull Request 760. Moreover, I implemented a variant of the vantage point tree that uses the hollow-rect bound which consists of the minimum bounding rectangle and a number of rectangular hollows. I didn't issue a PR since the implementation is slower than the implementation of the vantage point tree (I pushed the code here).

Random projection tree

The implementation of random projection trees are based on the paper Random projection trees and low dimensional manifolds.

I implemented two versions of the split method: the mean split and the max split. Both methods are described in the paper. Unfortunately, the tree performs much slower than the k-d tree since the bound of the random projection tree is a polytope. There are a number of methods that calculates the distance between a point and a polytope but they are based on the gradient descent method and require a lot of computations. Right now, the tree uses an ordinary rectangular bound.

The tree is implemented on the basis of the BinarySpaceTree class. The changes can be found in Pull Request 726.

Universal B tree

I think the universal B tree is the most complicated algorithm that I have implemented this summer. The implementation is based on the paper The Universal B-Tree for Multidimensional indexing: General Concepts.

The universal B tree is a variant of the B tree. Like the Hilbert R tree, the UB tree maps multidimensional space to a linear ordered space. All points are being inserted into the tree according to that ordering. Notably, this mapping preserves the multidimensional clustering.

The bound of the UB tree is a stairway-shaped figure in multidimensional space. Thus, the bound consists of a number of non-overlapping hyper-rectangles. Of course, the tree allows to use non-overlapping bounds but that leads to a huge number of computations.

The changes can be found in Pull Request 746. The PR is not merged yet, but I believe it will be merged soon.

Other changes

I made a number of changes that are not connected directly with my project, I'll briefly describe them here. I've removed extensions of the TreeType API in the RectangleTree class. The changes can be found in Pull Request 709. Pull Request 711 contains optimizations that allow to find descendant points by a logarithmic time. Pull Request 721 solves Issue 712. Pull Request 741 solves a problem which has appeared after refactoring. And Pull Request 767 adds the copy-constructor and the move constructor to the RectangleTree class.


Right now, some tree types like the vantage point tree and random projection trees use looser-bounds. That leads to a huge number of base cases in dual-tree algorithms. I think these bounds may be even more optimized but I don't know how.

Another problem is concerned with Rank-Approximate Nearest Neighbor Search. The algorithm often fails with some tree types (e.g. the Hilbert R tree). It is an interesting problem to understand why that happens.


That was an amazing summer, thanks to the mlpack team and especially to Ryan Curtin who looked through the code and has suggested a lot of advices on the optimization of the algorithms and to Marcos Pividori who has proposed a series of ideas on the optimization of the RectangleTree class (such as Pull Request 711 and Pull Request 767).

For further information see my blog posts.


Summary of Neuroevolution Algorithms Implementation

In this blog, we summarize our project and our current contributions to mlpack project. I wish this will be helpful to anyone who is interested in what we have done, how we did, how to use it, and how to contribute to it if interested.

Brief Summary

Our project is aiming to implement multiple Neuralevolution algorithms, including CNE, NEAT and HyperNEAT (perhaps implementing more in the future).

Currently, I have created the following Pull Requests:

  • My working PR: PR 686
  • The PR for merging CNE algorithm: PR 753
  • The PR for merging NEAT algorithm: PR 752
  • The PR for merging HyperNEAT algorithm: PR 754

Currently, the CNE algorithm can be merged after we adjust some coding styles. The NEAT algorithm is finished and tested with a bunch of testings, will will be state in detail in the following. The HyperNEAT algorithm is finished a version and in the progress of debugging by tests.

Before my works being merged to the mlpack repository, you can check the most updated implementations under my github account:

[1] Neural Evolution module code:NE source code

[2] Neural Evolution tests code:NE test code

After they are merged, they can be found in the same directory under mlpack repository.

CNE Algorithm Implementation

The first algorithm is Conventional Neural Evolution (CNE) algorithm. The main reference papers and code for the implementation of CNE includes:

Generally, different neural evolution algorithms are evolving a number of neural networks iteratively to find out a siutable neural network for solving specific tasks. So we first define some classes to represent key concepts in neural evolution algorithms. Including:

  1. LinkGene: this class defines a link. Basically, a link is defined by the two neurons' id it connected, its own id, and its weight. Detailed implementation is in mlpack/src/mlpack/methods/ne/link_gene.hpp.
  2. NeuronGene: this class defines a neuron. Basically, a neuron is defined by its id, neuron type (INPUT, HIDDEN, OUTPUT, BIAS), activation function type (SIGMOID, LINEAR, RELU, etc.) Detailed implementation is in mlpack/src/mlpack/methods/ne/neuron_gene.hpp.
  3. Genome: this is a critical class. A genome is the encoding format of a neural network. A neural network contains multiple links and neurons. Thus, a genome contains a vector of link genes and neuron genes. Detailed implementation can be found in mlpack/src/mlpack/methods/ne/genome.hpp. A novel idea we made is how we calculate a genome's output given an input vector, which is the Activate function in the Genome class. Briefly speaking, as neural networks in NE algorithms are not in well-defined layered structure, we assign each neuron a height attribute. Input neurons are of height 0. Output neurons are of height 1. Heights of hidden neurons are between 0 and 1. Different neurons with same height value cannot be connected (but a neuron can connect to itself to form a recurrent link). In this way, we can have at least three benefits: first, it makes the activation calculation be quite fast (we just need to loop through all links for once); second, calculation logic of any complex neurl network structure is quite clear: neurons are activated in sequence according to its height: from small (0) to big (1); third, different kind of links can be defined by compare the heights of the two neurons it connected. A FORWARD link is connect a small height neuron to a big height neuron. A BACKWARD link is connect a big height neuron to a small height neuron. And a RECURRENT link is connect a neuron to itself.
  4. Species: Basically a species contains a vector of genomes which will be evolved by NE algorithms, such as CNE algorithm. Detailed implementation can be found in mlpack/src/mlpack/methods/ne/species.hpp .
  5. Population: Basically a population contains a vector of species. As in algorithms such as NEAT, a number of genomes is not just an array of genomes, but be speciated into different species. Thus, we define the class Population to organize a vector of species. Detailed implementation can be found in mlpack/src/mlpack/methods/ne/population.hpp .

If each algorithm is a house, then the above classes are the bricks of different style houses. For different algorithms, including CNE, NEAT, and HyperNEAT, the above classes are their basis.

Specially, for CNE algorithm, we define a class CNE, where details are inside mlpack/src/mlpack/methods/ne/cne.hpp . For each algorithm, including CNE, NEAT and HyperNEAT, there are same key functions (with different implementation), so that different algorithms can be called in similar style. Here we list some key functions:

  1. Reproduce(): This is the key function, where how the algorithm evolves its genome groups to get the next generation genomes is defined.
  2. Evolve(): This is the main function of each algorithm. The whole neural evolution progress is depicted in this function.
  3. Others: Other functions are mainly operator functions such as mutate weight, link or neutron.

Besides all above, we have two more helpful classes defined and being utilized by all neural eolution algorithm classes.

First, considering a problem: how to evaluate a genome given different tasks to solve? For example, given a same genome, its fitness to XOR task, or to Cart Pole Balancing task, are obviously different. To solve this, we propose to define a task class for each different problem. We test our algorithms with a bunch of tests. Each test task, we defined a corresponding task class. They are inside the file mlpack/src/mlpack/methods/ne/tasks.hpp. For example, for the XOR task, we defined class TaskXor; for the Cart Pole Balancing task, we defined task TaskCartPole, etc. Every task class must implement a double EvalFitness(Genome& genome) function, so that different neural evolution algorithms can use this function to evaluate a genome's fitness.

Second, different algorithms will have many parameters. For example, the probability to mutate a weight, the mutate size, the probability to add a new link or neuron, etc. When we create an algorithm instance, we need to specify all these algorithm parameters. If we put all of them as the paramrters of constructor function, the parameter list will be too long. To solve this problem, we define a class Parameters which contains all algorithm parameters. Details are in mlpack/src/mlpack/methods/ne/parameters.hpp. This way gives at least two benefits: first, we just need to pass a Parameter instance to constructor functions of different algorithms, rather than all parameters; second, we can choose the parameters we need to assign values. Different algorithms will share the same parameter name, if they are of the same meaning in different algorithm (for example, all algorithms have the operation to mutate a link's weight. Thus, aMutateWeightProb represents the probability to mutate a genome's weight for all algorithms).

NEAT Algorithm Implementation

The NEAT algorithm is the most critical algorithm we implemented in our project. The main reference paper for NEAT algorithm is: Evolving Neural Networks through Augmenting Topologies.

Our detailed class NEAT implementation can be found in mlpack/src/mlpack/methods/ne/neat.hpp. Compared with CNE algorithm, NEAT contains much more mutate operators. The evolution mechanism is also more complex. To adapt to the increasing complexity, we defined much more functions to model different operators, such as add new link or new neuron. One thing to notice is that the NEAT algorithm also contains a Reproduce() function. How to reproduce genomes to get next generation, i.e., how to organize different mutate or crossover operators in each evolution process, is kind of flexible, and also critical to the performance of algorithm. Similarly, Mutate, BreedChild functions in NEAT class are also kind of flexible and can have different implementations.

During the implementation of NEAT, we mainly referred to the existing implementation: NEAT reference.

An important contribution in our project is that, we implement a bunch of testing tasks to test our algorithms. Currently, we have implemented XOR, Mountain Car, Cart Pole Balancing, Markov/Non-Markov Double Pole Balancing, and Playing Super Mario Game. Here we list the reference materials for different tasks.

Cart Pole Balancing Problem and Mountain Car Problem: The materials I referred to for Cart Pole Balancing and Mountain Car problem includes:

Double Pole Balancing Problem: The materials I reffered to for double pole balancing are:

Play Super Mario Game: Marcus helps to implemented the testing of playing Super Mario game, and our algorithm successfully passed the first level until now! Detailed implementation and how to repeat the experiment can be found under Marcus's github account: Playing Super Mario.

Last but not the least, how we call the NEAT algorithm to solve different tasks? First, given a new problem, a user should define his/her own task class, which offers a EvalFitness function inside the class. Second, calling NEAT to run evolution is organized in the following steps:

  1. Set task parameters and construct a task instance;
  2. Set algorithm parameters and construct a Parameters instance;
  3. Set a seed genome, i.e., the genome used to initialize a population of genomes to evolve;
  4. Construct an algorithm instance using the above three input parameters: task, parameters, and seed genome;
  5. Call algorithm instances' Evolve() function.

For detailed examples, please check mlpack/src/mlpack/tests/ne_test.cpp.

Here we list a few features that we haven't implement but is going to add soon:

  • Save and load genome. So that the learned best genome for tasks can be saved and to solve new inputs.
  • Save and load model. So that an algorithm instance can be created by loading a config file, or save to a config file.
  • Visualize genome. Currently we visualize genome by printing its links and neurons' information. I wish we can implement a more intuitive graphical method.

HyperNEAT Algorithm Implementation

The HyperNEAT algorithm is similar with NEAT. The key difference is that the evolved genomes are not used directly, but being applied to a substrate genome to generate the links. The generated genome will be applied to user's task. The main reference for HyperNEAT is: HyperNEAT.

As HyperNEAT needs to query a substrate, i.e., a set of neuron nodes with coordinates, we defined class Substrate in the mlpack/src/mlpack/methods/ne/substrate.hpp. The algorithm class HyperNEAT is defined in mlpack/src/mlpack/methods/ne/hyperneat.hpp. Currently we have implemented the HyperNEAT algorithm and it is being tested by various tasks we defined before.

I would like to describe more details about HyperNEAT after it has passed all tests, as we may revise the design a lot during debugging.


We have almost achieved all our goals in our proposal. Currently the remain works including:

  1. Debugging HyperNEAT to let it pass all tests;
  2. Check potential bugs and optimize as much as we can;
  3. Clean code and adjust style for merge into mlpack.

In the future, maybe we can implement more interesting neural evolution algorithms.


I have enjoied a great summer working on this interesting project! Thanks a lot to my menter Marcus Edel, who has helped me a lot by keep giving me valuable advices during coding, help with debugging, help with implementing Super Mario task, answer all my questions timely with patience, and so on. He is an excellent mentor and I learned a lot from him. Thanks Marcus! Besides, I am also grateful to the mlpack team that gives me this chance to participate into such an excited project! Thank you very much!



Neuroevolution Algorithms Implementation: week 14

During the last week, we have changed the design of HyperNEAT to avoid class inheritance, and implemented it and its first test case: XOR. We are now debugging this algorithm as it hasn't pass the test. Until now, we have figured out multiple critical bugs through the debugging of XOR. One is the Genome classs's Activate() function. Previously it may skip some neuron's activation due to the disabled links. Another is NEAT's MutateAddNeuron(). Previously it will generate duplicated links. After we figured out why, the problem is now solved. We have re-run previous NEAT's test cases, and they passed.

Now the CNE algorithm is ready to merge once we adjusted its style. The NEAT algorithm may be merged after debugging HyperNEAT. As we may figure out more potential bugs during the debuggingof HyperNEAT, and thus we can optimize it.


Summary of LSH changes for GSoC 2016

As we are approaching the pencils down date, I think it is a good time to create a short summary of my contributions to mlpack this summer. Seeing all the students being very active, and so much code being committed, I believe summing up what I've done in the last months is going to help anyone wanting to come up-to-speed with the part of the changes I'm responsible for.

Executive Summary

TL;DR: A summary of my commits can be found here.

Here's a list of my Pull Requests, each with a short description.

  • LSH Non-deterministic Testing: 605
  • LSH Projection Table Access: 663
  • LSH Deterministic Testing: 676
  • LSH Optimization - find() vs unique(): 623
  • LSH Optimization - secondHashTable: 675
  • Multiprobe LSH: 691
  • LSH Parallelization with OpenMP: 700
  • Gamma Distribution, boost backporting: 729
  • Gamma Distribution, more functionality: 751
  • LSH Tuning (under construction): 749

LSH Testing

This contribution actually started before my GSoC application, and it was based on Issue 261. The question was: How can we create tests that will verify that the LSH module works correctly, given that LSH is a randomized approximate algorithm and there is no "correct" solution to check for?

The accepted solution was twofold.

First, I discussed with Ryan and we came up with some reasonable assumptions that LSH must fulfill: Increasing the number of tables must increase recall, increasing the number of projections per table must decrease recall. A very "expensive" run should examine nearly 100% of the points and have nearly 100% recall. A very "cheap" run should examine almost no points, and have recall near 0. These tests were added in several commits that are mostly summarized by Pull Request 605.

The second part of the solution needed us to have write access to the (otherwise random) projection tables used by the LSHSearch class. I modified the code slightly to be able to do that in Pull Request 663. That PR also changes the way projection tables are used, going from an std::vector of arma::mats to arma::cube. Then, in Pull Request 676, I added deterministic tests for LSH, basically exploiting the fact that, if the identity matrix is used as a projection table, the resulting hash buckets are predictable. An intuition of this idea is given in a comment I made in a different PR.

These three Pull Requests increased LSH testing coverage significantly.

LSH Optimizations

Before moving to the main course (I'm looking at you, Multiprobe LSH), I want to focus on two Pull Requests that I think are important optimizations to the LSH module.

The first one, Pull Request 623, tries to reduce the memory footprint of LSHSearch::ReturnIndicesFromTable(). The original implementation would allocate N spaces for the N points, set them all to 0, then mark the points that were in the same hash bucket as a query and only keep those indices to create the candidate set. The complexity of that was O(N). Instead, we could simply take note of the indices of all the points we find, and only keep the unique ones. Unique runs in O(MlogM), but if M is significantly smaller than N, we reduce both the memory used and the time needed.

To find the sweet spot between M and N, we did extensive tuning with a number of datasets, and allowed our implementation to pick either the O(N) or O(MlogM) method for each point individually.

The second optimization, summarized in Pull Request 675 was made mainly by Ryan, based on our observation that the second-level hash table implementation was too heavy. What the previous implementation did was allocate a secondHashSize x bucketSize armadillo matrix, with each row corresponding to the key for hash value i. The first bucketSize points hashed to each value were kept, and the rest were discarded. Then, the hash table was condensed.

For the default parameters (secondHashTable = 99901, bucketSize = 500), this required almost 50 million objects of type size_t to be allocated. size_t is usually 8 bytes long, resulting in an allocation of about 400Mb when the program launched. This is bad, but it's even worse if a user sets bucketSize to some significantly larger size, like 3000 or 4000 (not unreasonable for larger datasets).

The new version of the code refrains from such excessive allocations, using a std::vec of arma::Col`s instead of a 2-dimensional matrix.

Multiprobe LSH

Now to the interesting part: Implementation of a state-of-the-art algorithm that (promises to) significantly improve the approximation results of naive LSH. The implementation was based on this paper, and it was introduced to mlpack through Pull Request 691.

The implementation was mostly straight-forward, since the paper is quite clear and even provides pseudocode for the most tricky part.

Of course, many mistakes that were made were much easier to test for, now that we could write (semi-)deterministic tests for LSH.

The LSHSearch class of release 2.0.3 does not yet include Multiprobe LSH, so if you want to try it before release 2.0.4 which will (presumably) include all GSoC changes, you should download and install mlpack from the source.

Parallelization with OpenMP

This was another part I was looking forward to, and which I believe is an important improvement over the old implementation. Using OpenMP directives and minimal extra code, we were able to have the LSHSearch class process more than one query in different threads.

Pull Request 700 is merged, so if you're using a multi-core machine (you probably are) running mlpack, you can now process your LSH queries faster (or you will be, from mlpack 2.0.4, and if you're not using Visual Studio to compile).

Implementation of LSH Tuning

Fair warning: LSH Tuning is still under construction.

For the last part of my GSoC contributions, I decided to implement the LSH Tuning algorithm. The algorithm helps identify parameter sets for which Multi-probe LSH will perform well for a specific dataset. Without this algorithm, tuning LSH by hand quickly becomes tedious and annoying.

Among other things, LSH Tuning models pairwise distances by fitting a Gamma Distribution to a sample of them. In order to do that fitting, I implemented an algorithm proposed by Thomas Minka that converges faster than the method proposed in the original paper. The implementation of the Gamma Distribution is included in Pull Request 729, which includes backporting of several features from Boost 1.58 needed by Minka's algorithm.

The Gamma Distribution implementation was incomplete, as I mention in Issue 733. I worked towards closing that issue later, and implemented most of the missing functionality in Pull Request 751. Some work remains to be done, and I will come back to it once everything else is ready.

The rest of the code for the LSHTuning module, a new class and executable that will be added to mlpack, is still in progress. The paper describing the algorithm has been convoluted in a few parts, but I think most of my confusion has been solved (with immeasurable help from Ryan), so I'm confident the code will be ready to ship relatively soon.

I am almost done implementing the core algorithm, but for it to be usable I need to write the code for the corresponding mlpack executable and write useful tests. My progress can be seen in Pull Request 749.


It has been an amazing summer, and although I didn't have the time to complete any of my blue-sky ideas that I discussed in my proposal, I think the experience has made me significantly more aware of the mlpack codebase. I am now much more capable to continue contributing to it, so hopefully I will be implementing many more interesting features soon!

If you made it this far, congratulations! Have an ice cream on me.



We need to go deeper, Googlenet : Week-10 & 11 Highlights

In week 10 & 11, I have, incorporated the fixes suggested in #703 and #696 for the feature extraction part. I have also applied fixes suggested in #737 for the convLayer. Completed the subnet_layer (right now we duplicate some code from cnn.hpp for the subnet layer, this maybe changed later). Completed the basic structure of the googlenet network. What still needs to be discussed is how the error from auxillary classifiers is being propagated into the main network, which I will do forthwith. Regularization in the network and writing tests for its correct working are the other tasks that still need to be done. This is what I will do in the next days. Besides I am also looking at the fixes suggested in #757 and these changes will be made as soon as some things are clear. Once these changes are done we will create the inception layer using the subnet_layer and concat_layer which will fulfill one of the objectives of the project that users can duplicate any complex collection layers in a deep network without having to explicitly write a class for that collection of layers. I will also be writing a full blog post which covers point by point everything done in the project from start to finish in the next week. Thanks for reading.


Approximate Nearest Neighbor Search - Week 11

Last week, I have implemented generalized Spill Trees, to consider general splitting hyperplanes, not necessarily axis-orthogonal.

I added a new template parameter for Spill Trees: HyperplaneType, and I implemented two candidate classes: AxisOrthogonalHyperplane and Hyperplane.

(+) AxisOrthogonalHyperplane: consider an axis-parallel projection vector. So, we can project any vector in O(1) time, considering the appropiate dimension.

(+) Hyperplane consider a general projection vector. So we can project any vector in O(d) time, through a dot product.

Inside Spill Tree, I consider BallBound for non-axis-orthogonal hyperplanes, and HrectBound for axis-orthogonal hyperplanes.

Considering this abstraction, I managed to significantly simplify the Split methods: MeanSplit and MidpointSplit. Now, they share most of the code.

By default, SpillSearch considers AxisOrthogonalHyperplanes because they seem to be faster in many cases, but this is not always true. We have to benchmark both methods and see which is faster and which is more accurate, I mean, which has the best relation between running time and relative approximation error.

All of this changes were included in the pull request: [1].

I plan to work next days in these topics:

+) Add many test cases for all the code developed: SpillTree class, SpillSearch class, Hyperplane class, etc.

+) AppVeyor has shown some problems with the MSVC compiler, in the resolution of template parameters of alias templates, similar to an old issue: [2]. I have to fix it, probably including some extra definitions.

+) Check details in the Spill Tree's implementation.

Follow the progress in: [3].


Implementation of tree types : Week 11

Last week I have been working on fixing various tree types.

Random projection trees

I made some minor fixes of random projection trees and wrote a simple test that checks if there is a hyperplane that splits a node. I find the hyperplane as a solution of a linear system of inequalities. Curiously, but it seems the easiest iterative method solves the system fine.

Universal B-tree

I fixed a problem in the address-to-point conversion method. The method had led to wrong results if some addresses correspond to negative numbers. Moreover, since floating point data types have some excess information, some addresses correspond to infinite points. Thus, I have to add some restrictions in the conversion algorithm.

Vantage point tree

I implemented a method that tries to get a tighter bound by investigating the parent bound. The method calculates the distance to the "corner" (the intersection of two (n-1)-dimensional spheres that is an (n-2)-dimensional sphere) using properties of orthogonal transforms. The results appear to be worse since this calculation requires too many arithmetic operations and the number of base cases has decreased slightly.

Hollow hyperrectangle vantage point tree

I implemented a bound that consists of an outer rectangle and a number of rectangular hollows. I tried to test the vantage point tree with this bound. Right now, the original VP-tree bound outperforms this one, but I think the hollow-hrect bound should work faster. So, I'll continue working on the optimization of the bound.