Summary of LSH changes for GSoC 2016

As we are approaching the pencils-down date, I think it is a good time to write a short summary of my contributions to mlpack this summer. With all the students being so active and so much code being committed, I believe summing up what I've done over the last few months will help anyone who wants to come up to speed with the part of the changes I'm responsible for.

Executive Summary

TL;DR: A summary of my commits can be found here.

Here's a list of my Pull Requests, each with a short description.

  • LSH Non-deterministic Testing: 605
  • LSH Projection Table Access: 663
  • LSH Deterministic Testing: 676
  • LSH Optimization - find() vs unique(): 623
  • LSH Optimization - secondHashTable: 675
  • Multiprobe LSH: 691
  • LSH Parallelization with OpenMP: 700
  • Gamma Distribution, boost backporting: 729
  • Gamma Distribution, more functionality: 751
  • LSH Tuning (under construction): 749

LSH Testing

This contribution actually started before my GSoC application, and it was based on Issue 261. The question was: How can we create tests that will verify that the LSH module works correctly, given that LSH is a randomized approximate algorithm and there is no "correct" solution to check for?

The accepted solution was twofold.

First, I discussed with Ryan and we came up with some reasonable assumptions that LSH must fulfill: increasing the number of tables must increase recall; increasing the number of projections per table must decrease recall. A very "expensive" run should examine nearly 100% of the points and have nearly 100% recall. A very "cheap" run should examine almost no points and have recall near 0. These tests were added in several commits that are mostly summarized by Pull Request 605.

The second part of the solution needed us to have write access to the (otherwise random) projection tables used by the LSHSearch class. I modified the code slightly to be able to do that in Pull Request 663. That PR also changes the way projection tables are used, going from an std::vector of arma::mats to arma::cube. Then, in Pull Request 676, I added deterministic tests for LSH, basically exploiting the fact that, if the identity matrix is used as a projection table, the resulting hash buckets are predictable. An intuition of this idea is given in a comment I made in a different PR.

These three Pull Requests increased LSH testing coverage significantly.

LSH Optimizations

Before moving to the main course (I'm looking at you, Multiprobe LSH), I want to focus on two Pull Requests that I think are important optimizations to the LSH module.

The first one, Pull Request 623, reduces the memory footprint of LSHSearch::ReturnIndicesFromTable(). The original implementation would allocate N spaces for the N points, set them all to 0, then mark the points that were in the same hash bucket as a query, and only keep those indices to create the candidate set. The complexity of that was O(N). Instead, we can simply take note of the indices of all the points we find and keep only the unique ones. unique runs in O(M log M), but if M is significantly smaller than N, we reduce both the memory used and the time needed.

To find the sweet spot between M and N, we did extensive tuning with a number of datasets, and allowed our implementation to pick either the O(N) or O(MlogM) method for each point individually.

The second optimization, summarized in Pull Request 675, was made mainly by Ryan, based on our observation that the second-level hash table implementation was too heavy. The previous implementation allocated a secondHashSize x bucketSize armadillo matrix, with row i holding the keys of the points with hash value i. The first bucketSize points hashed to each value were kept, and the rest were discarded. Then, the hash table was condensed.

For the default parameters (secondHashSize = 99901, bucketSize = 500), this required almost 50 million objects of type size_t to be allocated. size_t is usually 8 bytes long, resulting in an allocation of about 400 MB when the program launched. This is bad on its own, but it's even worse if a user sets bucketSize to some significantly larger value, like 3000 or 4000 (not unreasonable for larger datasets).

The new version of the code refrains from such excessive allocations, using an std::vector of arma::Col objects instead of a two-dimensional matrix.

Multiprobe LSH

Now to the interesting part: Implementation of a state-of-the-art algorithm that (promises to) significantly improve the approximation results of naive LSH. The implementation was based on this paper, and it was introduced to mlpack through Pull Request 691.

The implementation was mostly straight-forward, since the paper is quite clear and even provides pseudocode for the most tricky part.

Of course, many mistakes that were made were much easier to test for, now that we could write (semi-)deterministic tests for LSH.

The LSHSearch class of release 2.0.3 does not yet include Multiprobe LSH, so if you want to try it before release 2.0.4 (which will presumably include all the GSoC changes), you should build and install mlpack from source.

Parallelization with OpenMP

This was another part I was looking forward to, and which I believe is an important improvement over the old implementation. Using OpenMP directives and minimal extra code, we were able to have the LSHSearch class process more than one query in different threads.

Pull Request 700 is merged, so if you're running mlpack on a multi-core machine (you probably are), you can now process your LSH queries faster (or you will be able to, from mlpack 2.0.4 onward - unless you're compiling with Visual Studio).

Implementation of LSH Tuning

Fair warning: LSH Tuning is still under construction.

For the last part of my GSoC contributions, I decided to implement the LSH Tuning algorithm. The algorithm helps identify parameter sets for which Multi-probe LSH will perform well for a specific dataset. Without this algorithm, tuning LSH by hand quickly becomes tedious and annoying.

Among other things, LSH Tuning models pairwise distances by fitting a Gamma Distribution to a sample of them. In order to do that fitting, I implemented an algorithm proposed by Thomas Minka that converges faster than the method proposed in the original paper. The implementation of the Gamma Distribution is included in Pull Request 729, which includes backporting of several features from Boost 1.58 needed by Minka's algorithm.

The Gamma Distribution implementation was incomplete, as I mention in Issue 733. I worked towards closing that issue later, and implemented most of the missing functionality in Pull Request 751. Some work remains to be done, and I will come back to it once everything else is ready.

The rest of the code for the LSHTuning module, a new class and executable that will be added to mlpack, is still in progress. The paper describing the algorithm is convoluted in a few parts, but most of my confusion has been resolved (with immeasurable help from Ryan), so I'm confident the code will be ready to ship relatively soon.

I am almost done implementing the core algorithm, but for it to be usable I need to write the code for the corresponding mlpack executable and write useful tests. My progress can be seen in Pull Request 749.


It has been an amazing summer, and although I didn't have the time to complete any of the blue-sky ideas I discussed in my proposal, I think the experience has made me significantly more aware of the mlpack codebase. I am now much more capable of contributing to it, so hopefully I will be implementing many more interesting features soon!

If you made it this far, congratulations! Have an ice cream on me.



Multiprobe LSH Finalized, LSH Parallelization

I started this week by making the final changes to Multiprobe LSH - I fixed the style issues, corrected some bugs, and implemented or re-implemented some tests.

As I've discussed before, LSH, and by extension Multiprobe LSH, is quite a hard algorithm to test, because it is randomized and furthermore doesn't give any practically usable guarantees on the rate of its error.

So while fixing the documentation and style, I ran into a few bugs in the code. What was worse, I found bugs in my testing code that made it too tolerant of mistakes. So, I re-implemented all the deterministic tests for Multiprobe LSH.

Before the end of the week, Ryan merged my Pull Request to the master branch, so now Multiprobe LSH is available for users of the github build.

I continued my week by trying to implement parallel versions of arma::find() and arma::unique(). I failed miserably, for two reasons: (a) these functions are aggressively optimized, and (b) the overhead from lock contention and shared-memory coherency is so big that the parallel version ends up being slower than the sequential one.

In the end, we are probably going to go with only the first and simpler parallelization of the querying process - where an n-core machine can process n queries in parallel - which is the only thing I have attempted that actually showed any improvement.

In related news, I merged the Multiprobe code into the OpenMP code branch, so AppVeyor could build the Pull Request and tell us if it fails in other platforms. Things look good, which is good news :)


LSH Parallelization - Week 2

This week I focused on implementing the parallelization changes I discussed in my previous blog post. I also spent some time on bug fixes and style changes of Multiprobe LSH, but I will not bore you with that.

Processing Queries in Parallel

First, I parallelized the query processing code. This was pretty straightforward since OpenMP lets us define private and shared variables and takes care of memory access by itself.

Here's a nice example of why memory access becomes a concern when dealing with parallel code.

Let's say you're on vacation on a nice Greek beach, and you run out of money. You call your mom to ask her to deposit some money into your bank account, but she doesn't answer, so you leave a message and then call your dad. Your dad answers and runs out to send you money through the ATM. While he's out, your mom listens to the message and goes online to also send you money...

So, your dad at the ATM and your mom through e-banking both perform these actions:

// balance = how much money you have. In the beginning this is 0.
balance = read_account_balance();
balance += deposit; // mom deposits 100, dad deposits 200
write_account_balance(balance); // update balance

But the bank you use doesn't employ great programmers, so nobody guarded against concurrent access by multithreaded applications. As a result, the commands are executed like this:

balance_mom = read_account_balance(); // mom knows you have 0
balance_mom += 100;
balance_dad = read_account_balance(); // dad knows you have 0
write_account_balance_mom(balance_mom); // mom just set your balance to 100
balance_dad += 200; // dad is more generous
write_account_balance_dad(balance_dad); // dad just set your balance to 200

Well, that's a problem - both mom and dad gave you money, but you only received 200 out of the 300 total. The reason is that this code is what is called a "critical section" - a dangerous region where no two threads can safely operate at the same time.

Does this mean we don't get to use parallelism for fear of ruining our results? Of course not! We just need to verify that what we are doing is correct - essentially forbidding parallel execution of the (usually very small) parts of the code that are critical sections.

OpenMP can force single-core execution of parts of the code with #pragma omp atomic (for one critical expression) or #pragma omp critical (for longer pieces of code).

Thread book-keeping

Once I was done with that and had it tested, I started implementing code to keep track of how many threads we use. That code started becoming very complicated, so I considered implementing a thread-control class and managing threads through it, but that had the potential of complicating the code too much. Many of the parts we're intervening in already have an increased level of code complexity (long functions with a lot of branching), so adding more complexity might make the next person to look at the code want to learn where I live and come murder me - something I would very much like to avoid.

So, I ended up going with a simpler solution. OpenMP by default does not allow nested parallelism - that is, it doesn't allow a spawned thread to spawn threads of its own. This is very good: it's easy to accidentally end up in recursive situations where threads spawn more threads (and so on), leading either to a crash or to very slow execution.

Since the code I want to parallelize further is nested inside parallel regions, it will, by default, be executed by only one thread. If, however, a user wants to enable that parallelism as well (for example in the live-query processing scheme I described in the previous post), then they can enable OpenMP's nested parallelism by setting the environment variable OMP_NESTED to TRUE.

Further parallelism

For now, I have only parallelized the for loops inside ReturnIndicesFromTable(), but they are not the bottleneck - find() and unique() are. So this week I plan to implement parallel versions of these functions and add tests for the second level of parallelism. Once that is done, the chapter of LSH parallelization will be over :)


LSH Parallelization

I have done a few things this week on the LSH code, but I want to talk mainly about parallelization of the LSHSearch object, a topic which I find very interesting.

First off, a few things to consider:

a) Parallelization should be as transparent as possible. Not everyone is familiar and/or comfortable with parallel frameworks and concepts, and we shouldn't make life hard for future code maintainers/developers.

b) Parallelization, if done the wrong way, can lead to all sorts of pain: very slow code (mainly because of thread contention), deadlocks, correctness errors, to name a few.

c) At the same time, parallelization is very important for a scalable library that deals with large datasets, such as mlpack. If done correctly, it can give an important bump in performance, especially since virtually every machine is multicore now.

So we need to keep (a) and (c) in mind while being careful to not fall in any of the traps mentioned in (b). (a) is easy to do with OpenMP, and we should also try to use only the very basic constructs - parallel for, let's say.

Now let's talk specifically about the LSH code. The main time waster there is the search we do for each query, after the object has been trained with a reference set.

The way mlpack does this is with a for loop over the queries: for each query, hash codes are computed, and the points with the same codes are assembled into a candidate set, on which we then perform naive kNN. There are two possible paths to parallelization here:

  1. We can focus on having LSHSearch::Search process queries in parallel. This way, whenever this function is called with an array of queries, OpenMP will let us process different queries on different cores.
  2. We can instead try to reduce the cost of each individual query, by parallelizing the ReturnIndicesFromTable and BaseCase functions.

Option (1) has several benefits:

  • When run with a large query set, the processing cores will be utilized very well - there's plenty of work to go around, so we simply spawn threads and divide the work among them.
  • It is trivial to implement and embarrassingly parallel, meaning the more threads we spawn (up to the number of queries) the faster we will be.
  • It should cause an immediate and noticeable bump in performance.

But, this option also has a problem: If the user is running single-query searches, then this will not benefit them at all. That is, we have increased the number of queries we process per second but have not decreased the time required for each query.

Option 2 serves exactly that purpose: By parallelizing the query processing code itself, each query will complete faster. In use cases where users have a model of LSH already computed, and receive and process queries in real time, this makes much more sense than a parallelization of the query-processing mechanism.

The downside to this is, the part of the process we actually can parallelize is smaller, and isn't embarrassingly parallel. Still, it can benefit quite a lot.

But wait - the upsides and downsides of these two options are complementary. Couldn't we somehow reap the benefits of both methods?

Yes, we could - by utilizing a hybrid parallelization model. The way to do this is for LSHSearch to keep track of the number of threads it currently uses as well as the maximum number of threads it is allowed to use (by default, equal to the number of machine cores) and then if it realizes that it is not taking advantage of the full machine capabilities, also use the per-query parallelization method.

Fortunately, OpenMP makes life very easy for us, with the if clause in its parallelization directive. The if clause is evaluated, and threads are spawned only if it evaluates to true. So, we can do something like

#pragma omp parallel for \
    if (numThreadsUsed < maxThreads) \
    num_threads (maxThreads - numThreadsUsed)

I have already begun implementing this. I've changed the code in LSHSearch to provide for the thread book-keeping, and have completed the parallel query processing implementation.

Of course, at the same time, we should take care that what we are doing is correct (point (b) on my list). To do that, I've added yet another unit test to the LSHSearch suite, which performs two searches with the same query set - one using only one thread and one using the maximum number. These two runs should (and indeed do) return the same thing.


LSH Utility Features, Multiprobe LSH Benchmarking

This is week 3 of GSoC and, according to plan, I am still working on Multiprobe LSH. I thought this would have progressed faster, since a first working version of the algorithm was actually completed by the end of week 1, but I learned that debugging and testing machine learning algorithms is harder than I expected :)

I started week 3 by implementing a function in LSHSearch that computes recall given a reference set of neighbors and the neighbors found by LSH. This functionality is now available through the mlpack_lsh binary using the -t switch and specifying a true neighbor indices file. Computing recall is a very useful step in LSH parameter tuning, so I thought it would be nice to provide this simple computation.

I also implemented a few deterministic tests for LSH. Now we can be relatively sure that what we do doesn't break the algorithm but only improves it. The deterministic tests essentially use given projection tables, so that disjoint clusters of points merge or don't merge controllably.

Finally I started running tests on single- and multiprobe LSH. I've found that different datasets behave differently here, some benefiting from multiprobe while some being better off without it. Profiling with gprof showed that the computation of additional probing bin codes doesn't affect performance significantly (usually accounts for 2-4% of total time), so now I'm trying to understand what the bottleneck is, precisely.

The two ideas I have regarding this are that either the resulting candidate set is very large in some cases, so the final linear search is what hurts performance, or the process of exploring so many additional bins is itself what causes these unpleasant results. In any case, I will investigate more.

By this week's end, I will hopefully be finished with all optimizations that occur from profiling and tests of multiprobe LSH, and will have started parallelization with OpenMP.


LSH optimizations, modifications, and benchmarking

I began this week by debugging my multiprobe implementation which I discussed in my previous post. The algorithm is quite complicated and so I wanted to make sure it runs correctly before moving on to benchmarking and optimization.

Sure enough, I found a lot of minor bugs lying here and there which the tests I had written didn't catch. That worried me, so I decided to write better tests - my idea was to add some simple deterministic test cases.

To do that, I needed to improve access to LSHSearch object's projection tables, which are randomly generated - to have deterministic tests, you need to be able to specify tables instead of allowing the object to generate random ones for you.

In the process of modifying the LSHSearch code to do that, Ryan and I also decided to make a few other modifications, namely

1) Change the data structure that stores the projection tables from an std::vector to an arma::cube. Each slice of the cube is a projection table. This conserves memory and simplifies the code.

2) Change the implementation of the second-level hashing. In the current version, an arma::Mat table is created where each row corresponds to a hash bucket and stores the indices of points hashed to that bucket. This is inefficient, both because the default secondHashSize is pretty large and because the number of points in each bucket can be uneven - so the resulting table is quite sparse. After some demo code and discussion, we decided on a solution to these two problems.

So, with LSHSearch transparent, more easily testable and more efficient, we are now ready to perform benchmarks of single- and multiprobe LSH, see what we can optimize in the multiprobe code, and then move on to parallelization. All this will start today, so stay tuned :D


Implementation of Multiprobe LSH

This summer my goal is to improve various features of the current Locality Sensitive Hashing implementation of mlpack, making it faster, smarter, and easier to use. LSH is an approximate nearest neighbor algorithm that uses hashing to greatly reduce the number of points that need to be examined.

I was looking forward to GSoC week 1 for a while - I began with the implementation of Multiprobe LSH, an algorithm that improves on the classic LSH by identifying more hash buckets in each table where a query's neighboring points might be. The algorithm better utilizes the tables created by LSH, meaning fewer ones need to be created, which makes the search take less time and memory.

The implementation required the modification of the LSH code and corresponding mlpack test cases.

The new parameter, the number of additional probing bins, is now accessible to users both as a command-line argument (-T) and via a new parameter to LSHSearch::Search().

Another mini-feature I implemented, LSHSearch::ComputeRecall(), takes two armadillo matrices and computes the recall (the percentage of true neighbors that LSH found correctly). This is also accessible from the command-line program by using the -t switch to specify a "truth file" - a file of the real neighbors.

Using these two features, a user should be able to reduce the number of tables used by LSH and get as good (or better!) recall by increasing the number of additional probing bins.

I am making documentation, testing and style changes and will be opening a pull request in the next few days.