Summary of LSH changes for GSoC 2016

As we are approaching the pencils down date, I think it is a good time to create a short summary of my contributions to mlpack this summer. Seeing all the students being very active, and so much code being committed, I believe summing up what I've done in the last months is going to help anyone wanting to come up-to-speed with the part of the changes I'm responsible for.

Executive Summary

TL;DR: A summary of my commits can be found here.

Here's a list of my Pull Requests, each with a short description.

  • LSH Non-deterministic Testing: 605
  • LSH Projection Table Access: 663
  • LSH Deterministic Testing: 676
  • LSH Optimization - find() vs unique(): 623
  • LSH Optimization - secondHashTable: 675
  • Multiprobe LSH: 691
  • LSH Parallelization with OpenMP: 700
  • Gamma Distribution, boost backporting: 729
  • Gamma Distribution, more functionality: 751
  • LSH Tuning (under construction): 749

LSH Testing

This contribution actually started before my GSoC application, and it was based on Issue 261. The question was: How can we create tests that will verify that the LSH module works correctly, given that LSH is a randomized approximate algorithm and there is no "correct" solution to check for?

The accepted solution was twofold.

First, I discussed with Ryan and we came up with some reasonable assumptions that LSH must fulfill: Increasing the number of tables must increase recall, increasing the number of projections per table must decrease recall. A very "expensive" run should examine nearly 100% of the points and have nearly 100% recall. A very "cheap" run should examine almost no points, and have recall near 0. These tests were added in several commits that are mostly summarized by Pull Request 605.

The second part of the solution needed us to have write access to the (otherwise random) projection tables used by the LSHSearch class. I modified the code slightly to be able to do that in Pull Request 663. That PR also changes the way projection tables are used, going from an std::vector of arma::mats to arma::cube. Then, in Pull Request 676, I added deterministic tests for LSH, basically exploiting the fact that, if the identity matrix is used as a projection table, the resulting hash buckets are predictable. An intuition of this idea is given in a comment I made in a different PR.

These three Pull Requests increased LSH testing coverage significantly.

LSH Optimizations

Before moving to the main course (I'm looking at you, Multiprobe LSH), I want to focus on two Pull Requests that I think are important optimizations to the LSH module.

The first one, Pull Request 623, tries to reduce the memory footprint of LSHSearch::ReturnIndicesFromTable(). The original implementation would allocate N spaces for the N points, set them all to 0, then mark the points that were in the same hash bucket as a query and only keep those indices to create the candidate set. The complexity of that was O(N). Instead, we could simply take note of the indices of all the points we find, and only keep the unique ones. Unique runs in O(MlogM), but if M is significantly smaller than N, we reduce both the memory used and the time needed.

To find the sweet spot between M and N, we did extensive tuning with a number of datasets, and allowed our implementation to pick either the O(N) or O(MlogM) method for each point individually.

The second optimization, summarized in Pull Request 675 was made mainly by Ryan, based on our observation that the second-level hash table implementation was too heavy. What the previous implementation did was allocate a secondHashSize x bucketSize armadillo matrix, with each row corresponding to the key for hash value i. The first bucketSize points hashed to each value were kept, and the rest were discarded. Then, the hash table was condensed.

For the default parameters (secondHashTable = 99901, bucketSize = 500), this required almost 50 million objects of type size_t to be allocated. size_t is usually 8 bytes long, resulting in an allocation of about 400Mb when the program launched. This is bad, but it's even worse if a user sets bucketSize to some significantly larger size, like 3000 or 4000 (not unreasonable for larger datasets).

The new version of the code refrains from such excessive allocations, using a std::vec of arma::Col`s instead of a 2-dimensional matrix.

Multiprobe LSH

Now to the interesting part: Implementation of a state-of-the-art algorithm that (promises to) significantly improve the approximation results of naive LSH. The implementation was based on this paper, and it was introduced to mlpack through Pull Request 691.

The implementation was mostly straight-forward, since the paper is quite clear and even provides pseudocode for the most tricky part.

Of course, many mistakes that were made were much easier to test for, now that we could write (semi-)deterministic tests for LSH.

The LSHSearch class of release 2.0.3 does not yet include Multiprobe LSH, so if you want to try it before release 2.0.4 which will (presumably) include all GSoC changes, you should download and install mlpack from the source.

Parallelization with OpenMP

This was another part I was looking forward to, and which I believe is an important improvement over the old implementation. Using OpenMP directives and minimal extra code, we were able to have the LSHSearch class process more than one query in different threads.

Pull Request 700 is merged, so if you're using a multi-core machine (you probably are) running mlpack, you can now process your LSH queries faster (or you will be, from mlpack 2.0.4, and if you're not using Visual Studio to compile).

Implementation of LSH Tuning

Fair warning: LSH Tuning is still under construction.

For the last part of my GSoC contributions, I decided to implement the LSH Tuning algorithm. The algorithm helps identify parameter sets for which Multi-probe LSH will perform well for a specific dataset. Without this algorithm, tuning LSH by hand quickly becomes tedious and annoying.

Among other things, LSH Tuning models pairwise distances by fitting a Gamma Distribution to a sample of them. In order to do that fitting, I implemented an algorithm proposed by Thomas Minka that converges faster than the method proposed in the original paper. The implementation of the Gamma Distribution is included in Pull Request 729, which includes backporting of several features from Boost 1.58 needed by Minka's algorithm.

The Gamma Distribution implementation was incomplete, as I mention in Issue 733. I worked towards closing that issue later, and implemented most of the missing functionality in Pull Request 751. Some work remains to be done, and I will come back to it once everything else is ready.

The rest of the code for the LSHTuning module, a new class and executable that will be added to mlpack, is still in progress. The paper describing the algorithm has been convoluted in a few parts, but I think most of my confusion has been solved (with immeasurable help from Ryan), so I'm confident the code will be ready to ship relatively soon.

I am almost done implementing the core algorithm, but for it to be usable I need to write the code for the corresponding mlpack executable and write useful tests. My progress can be seen in Pull Request 749.

Conclusions

It has been an amazing summer, and although I didn't have the time to complete any of my blue-sky ideas that I discussed in my proposal, I think the experience has made me significantly more aware of the mlpack codebase. I am now much more capable to continue contributing to it, so hopefully I will be implementing many more interesting features soon!

If you made it this far, congratulations! Have an ice cream on me.

icecream

more...

Finalized Parallelization, Week 1 of Modeling LSH

This week's work can be organized in two sections - Finalizing LSH Parallelization and Beginning the LSH Model.

Parallelization

On the parallelization front, I did some experiments regarding the static or dynamic scheduling of threads, and applied an OpenMP reduction pattern to remove all thread locks in the parallel loop.

Scheduling of threads is an important part of parallelization - load balancing is a very interesting topic and OpenMP has a few nice predetermined modes you can ask for. The basic three are static, dynamic, and guided.

Static scheduling is, well, the static assignment of constant portions of work ("blocks") to threads, for example saying thread 1 will execute loops 1:4, thread 2 loops 5:8 and so on. This is the simplest and lowest-costing work scheduling scheme.

Dynamic scheduling does the same thing with the slight difference that, after the portions of work are divided, if a thread finishes its job it is assigned a new block to process. For iterations of unequal length, dynamic scheduling is usually good because it doesn't allow threads to slack after they finish their allocated block.

Guided scheduling reduces the block size as time progresses, assigning smaller and smaller blocks. This is useful for situations where, as the loop progresses, iterations last longer, so it is better to schedule work as it comes.

In our case, because different query points might have varying candidate set sizes, going with Dynamic was the best choice even though the scheduling overhead is nontrivial. Running a few tests confirmed that theory - for most datasets, Dynamic scheduling was faster than Static.

Modeling LSH

Now that Multiprobe LSH is implemented and working, I've started examining my next planned contribution - attempting to predict values for each parameter (W, M, L, T - Hash width, Number of Projections, Number of Hash tables, Number of Additional Probes) that will have satisfactory recall (= level of approximation) with the lowest possible cost (interpreted as time or selectivity).

The paper Modeling LSH For Performance Tuning proposes a model of LSH and the data distribution that is able to produce estimations of recall and selectivity based on a set of parameters and a small sample of the dataset, to which the parameters of a specific data distribution (the Gamma Distribution) are fit.

The authors' approach is very theoretical, and although they do produce usable mathematical results, they do not explain how to implement them in code - they are more interested in the what the model does, not how it does it.

So, my task is to decrypt their mathematical formulas and bring them to computable form. This might prove to not be a trivial task, but I have already started diving into the specifics and I hope I will be able to produce at least a prototype by this week's end.

more...

Multiprobe LSH Finalized, LSH Parallelization

I started this week by making the final changes to Multiprobe LSH - I fixed the style issues, corrected some bugs, and implemented or re-implemented some tests.

As I've discussed before, LSH, and by extension Multiprobe LSH, is quite a hard algorithm to test, because it is randomized and furthermore doesn't give any practically usable guarantees on the rate of its error.

So while fixing the documentation and style, I ran into a few bugs in the code. What was worse, I found bugs in my testing code that made it too tolerant of mistakes. So, I re-implemented all the deterministic tests for Multiprobe LSH.

Before the end of the week, Ryan merged my Pull Request to the master branch, so now Multiprobe LSH is available for users of the github build.

I continued my week by trying to implement parallel versions of arma::find() and arma::unique(). I failed miserably, for two reasons: (a) because these functions are aggressively optimized, and (b) because there's a big overhead on lock contention and shared memory coherency, so big that the parallel version ends up being slower than the sequential one.

In the end, we are probably going to go only with the first and simpler parallelization of the querying process - where an n-core machine can process n queries in parallel, which is the only thing I have attempted that actually showed any improvement.

In related news, I merged the Multiprobe code into the OpenMP code branch, so AppVeyor could build the Pull Request and tell us if it fails in other platforms. Things look good, which is good news :)

more...

LSH Parallelization - Week 2

This week I focused on implementing the parallelization changes I discussed in my previous blog post. I also spent some time on bug fixes and style changes of Multiprobe LSH, but I will not bore you with that.

Processing Queries in Parallel

First, I parallelized the query processing code. This was pretty straightforward since OpenMP lets us define private and shared variables and takes care of memory access by itself.

Here's a nice example of why memory access becomes a concern when dealing with parallel code.

Let's say you're on vacation in a nice Greek beach, and you run out of money. You call your mom to deposit some on your bank account but she doesn't answer, so you leave a message and then call your dad. Your dad answers and runs out to send you money through the ATM. While he's out, your mom listens to the message and goes online to also send you money...

So, your dad at the ATM and your mom through e-banking both perform these actions:

// balance = how much money you have. In the beginning this is 0.
balance = read_account_balance(); 
balance += 100; // the new deposit 
write_account_balance(balance); // update balance

But, because the bank you use doesn't employ great programmers, so they didn't care about memory access by multithreaded applications. So these commands are executed like this:

balance_mom = read_account_balance(); // mom knows you have 0
balance_mom += 100;
balance_dad = read_account_balance(); // dad knows you have 0
write_account_balance_mom(balance_mom); // mom just set your balance to 100
balance_dad += 200; // dad is more generous
write_account_balance_dad(balance_dad); // dad just set your balance to 200

Well, that's a problem - both mom and dad gave you money, but you only received 200 out of the 300 total. The reason is that the code is what is called a "critical segment" - a dangerous point where no two threads are safe to operate at the same time.

Does this mean we don't get to use parallelism in fear of ruining our results? Of course not! We just need to verify that what we are doing is correct - essentially forbidding parallel execution of the (usually very small) parts of the code that are critical segments.

OpenMP can force single-core execution of parts of the code with #pragma omp atomic (for one critical expression) or #pragma omp critical (for longer pieces of code).

Thread book-keeping

Once I was done with that and had it tested, I started implementing code to keep track of how many threads we use. That code started becoming very complicated, so I considered implementing a thread-control class and managing threads through it, but that has the potential of complicating the code too much. Many of the parts we're intervening with already have an increased level of code complexity (long functions with a lot of branching) and so adding more complexity might make the next person to look at the code want to learn where I live and come murder me - something I would very much like to avoid.

So, I ended up going with a simpler solution. OpenMP by default does not allow for nested parallelism - that is, it doesn't allow a spawned thread to spawn threads of its own. This is very good: it's easy to accidentally get yourself in recursive situations where spawning more threads that spawn more threads (and so on) leads to either the program to crash or it being very slow.

Since the code I want to parallelize further is nested inside parallel regions, it will, by default, be executed by only one thread. If, however, a user wants to enable that parallelism as well (for example in the live-query processing scheme I described in the previous post), then they can enable OpenMP's nested parallelism by setting the environment variable OMP_NESTED to TRUE.

Further parallelism

For now, I have only parallelized the for loops inside ReturnIndicesFromTables(), but they are not the bottleneck - find() and unique() are. So this week I plan to implement parallel versions of these functions and add tests for the second level of parallelism. Once that is done, the chapter of LSH parallelization will be over :)

more...

LSH Parallelization

I have done a few things this week on the LSH code, but I want to talk mainly about parallelization of the LSHSearch object, a topic which I find very interesting.

First off, a few things to consider:

a) Parallelization should be as transparent as possible. Not everyone is familiar and/or comfortable with parallel frameworks and concepts, and we shouldn't make life hard for future code maintainers/developers.

b) Parallelization, if done the wrong way, can lead to all sorts of pain: very slow code (mainly because of thread contention), deadlocks, correctness errors, to name a few.

c) At the same time, parallelization is very important for a scalable library that deals with large datasets, such as mlpack. If done correctly, it can give an important bump in performance, especially since virtually every machine is multicore now.

So we need to keep (a) and (c) in mind while being careful to not fall in any of the traps mentioned in (b). (a) is easy to do with OpenMP, and we should also try to use only the very basic constructs - parallel for, let's say.

Now let's talk specifically about the LSH code. The main time waster there is the search we do for each query, after the object has been trained with a reference set.

The way mlpack does this is there is a for loop, inside which, codes are computed for each query, and then other points with the same code are assembled into a candidate set where we then perform naive kNN. There's two possible paths to parallelization here:

  1. We can focus on having LSHSearch::Search process queries in parallel. This way, whenever this function is called with an array of queries, OpenMP will let us process different queries on different cores.
  2. We can instead try to reduce the cost of each individual query, by parallelizing the ReturnIndicesFromTable and BaseCase functions.

Option (1) has several benefits:

  • When ran with a large query set, the processing cores will be utilized very well - there's plenty of work to go around, so we simply spawn threads and divide work among them.
  • It is trivial to implement and embarrassingly parallel, meaning the more threads we spawn (up to the number of queries) the faster we will be.
  • It should cause an immediate and noticeable bump in performance.

But, this option also has a problem: If the user is running single-query searches, then this will not benefit them at all. That is, we have increased the number of queries we process per second but have not decreased the time required for each query.

Option 2 serves exactly that purpose: By parallelizing the query processing code itself, each query will complete faster. In use cases where users have a model of LSH already computed, and receive and process queries in real time, this makes much more sense than a parallelization of the query-processing mechanism.

The downside to this is, the part of the process we actually can parallelize is smaller, and isn't embarrassingly parallel. Still, it can benefit quite a lot.

But wait, the upsides and downsides are flipped in these two cases. Couldn't we somehow reap the benefits of both methods?

Yes, we could - by utilizing a hybrid parallelization model. The way to do this is for LSHSearch to keep track of the number of threads it currently uses as well as the maximum number of threads it is allowed to use (by default, equal to the number of machine cores) and then if it realizes that it is not taking advantage of the full machine capabilities, also use the per-query parallelization method.

Fortunately, OpenMP makes life very easy for us, with the if clause in its parallelization directive. The if clause is evaluated, and threads are spawned only if it evaluates to true. So, we can do something like

#pragma omp parallel for \
    if (numThreadsUsed <= maxThreads) \
    num_threads (maxThreads - numThreadsUsed)
    ...

I have already begun implementing this. I've changed the code in LSHSearch to provide for the thread book-keeping, and have completed the parallel query processing implementation.

Of course, at the same time, we should take care that what we are doing is correct (point c on my list). To do that, I've added yet another unit test to the LSHSearch suite, which performs two searches with the same query set - one that uses only one thread and one that uses the maximum number. These two processes should (and indeed do) return the same thing.

more...