Profiling for parallelization and parallel stochastic optimization methods - Week 7 & 8

These past two weeks were spent on two things : running HOGWILD! on larger datasets to find performance issues and fixing them and working on an implementation of SCD.

Running parallel SGD on the Netflix datasets revealed a few issues with the code. Using the generalized form of the parallel SGD Optimize function was proving to be very slow with a dataset of this size (with numFunctions = 100198806). Also, shuffling indices was taking more resources than it should.

The first issue was fixed by introducing an overload for ParallelSGD in the regularized SVD implementation, as is done for StandardSGD. To resolve the second issue, arma::shuffle was switched for std::shuffle.

After the latest commit, the runs on the datasets are here.

Comparing these to the runs mentioned in the paper, we see that the time taken on the RCV1 dataset is nearly the same, with lower error. Netflix performance is worse, and has probably to do something with a better initiliazation strategy or using a "delta" model (where ratings are calculated as a difference from the mean rating for the movie, instead of directly multiplying the user and movie matrices).

One interesting thing to note is that the runs in the paper were done on 10 core machines, whereas the runs for mlpack are on a dual-core, quad-thread CPU.

Regarding the implementation of SCD, I am thinking about adding a ResolvableFunctionType requirement on the function iterface. SCD requires the exact computation of gradient in a single dimension, and none of the existing optimizers seem to utilize this type of information.

more...

Profiling for parallelization and parallel stochastic optimization methods - Week 6

This week was spent in further research into testing methods for the implementations in the PR(#1037). Several runs and experiments are in the Gists linked below :

Also, I have started with the initial implementation of another optimizer method, Stochastic Co Ordinate descent, and would open up a PR soon with the additions.

more...

Profiling for parallelization and parallel stochastic optimization methods - Week 5

As mentioned in the earlier blog post, this week was primarily focussed on adding more tests for the implementation of Parallel SGD, along with progress on the matrix completion example from the paper. I have made some improvements to the code based on initial reviews, while adding new parts.

Both the implemented examples from the paper can be added as separate methods to mlpack. Once enough tests are added, I'll run the methods on the datasets from the paper. While waiting for an in depth review on the PR, I'll work on adding more tests for SparseSVM and SparseMC.

more...

Profiling for parallelization and parallel stochastic optimization methods - Week 4

This week I completed the initial implementation of parallel SGD, along with a couple of tests for the new code. The PR (#1037) is nearly ready for an initial review. With further discussion, I would add more sophisticated and detailed methods of testing. Some SFINAE code needs to be added to handle the current interface of DecomposableFunctionType.

I would like to test this implementation against the original runs mentioned in the paper, to see the difference in results obtained from an OpenMP based implementation. I have already implemented the SparseSVM example.

The next few days would be focussed primarily on implementing more tests, running ParallelSGD on the datasets mentioned in the paper and possibly profiling mlpack's implementation of GMM for potential parallelization with OpenMP.

more...

Profiling for parallelization and parallel stochastic optimization methods - Week 3

During week 3, I finished up work on the parallel implementation of KMeans, making the mlpack implementation nearly 4x faster with 4 threads available. The next algorithms for profiling would be GMMs and dual tree traversers. For now, I have started to focus on the other aspect of my project, the implementation of parallel stochastic optimization methods.

I have discussed about the implementation of parallel SGD with the HOGWILD! scheme of update with Yannis and the changes to the DecomposableFunctionType interface to carry the necessary details for the HOGWILD! update on separate processors. Some of these details are problem specific and the primary challenge in designing the interface is to preserve genericity in the Optimizer to allow users to write their own loss functions with minimal effort.

Next, I'll open up a PR with the initial implementation of this interface for further discussion and testing.

more...

Profiling for parallelization and parallel stochastic optimization methods - Week 2

Week 2 continued with finishing up work on the CTest PR and profiling algorithms in mlpack for parallelization. The first algorithm under testing was the Gaussian Naive Bayes Classifier. The algorithm is simple enough to understand. For training, it takes a single pass through the data, calculating the mean and variance of each label for each feature, assuming that the feature values are sampled from a Gaussian PDF.

Since the algorithm is linear in the size of the dataset, any time spent in IO easily overshadows any time spent in computing the mean and variance of the data(which is very small to begin with). Also, the original form of NBC (in a multi label setting) is not trivially parallelizable with OpenMP. A change in the algorithm was required, which would increase the time complexity, while giving trivial gains from parallelization. Coupled with the fact that NBC cannot model complex datasets, we decided it was probably not worth the added complexity to the codebase to add a parallel implementation of this algorithm. A few measurements of the proposed implementation exist in this gist.

The next algorithm to be measured was KMeans. The implementation of NaiveKMeans was trivially parallelizable and profiling showed that most of the time was spent in computing the Metric(which is good). The mlpack KMeans class uses NaiveKMeans as a subroutine and the parallel implementation can be switched for the sequential one because of the same interface. The speedup in the initial implementation is quite noticiable. Here are the measurements. I have opened up #1024 with this implementation for discussion and further refinement.

Next, I will be going ahead with the implementation of HOGWILD! and possibly parallel dual-tree traversal algorithms.

more...

Profiling for parallelization and parallel stochastic optimization methods - Week 1

This summer, I'll be working on the parallel implementations of the sequential algorithms already implemented in mlpack to improve their performance on multicore systems.

I began week 1 by starting with the integration of CTest in the mlpack testing interface. It makes testing faster by making multiple jobs run in parallel, making use of the available extra threads, reducing development time. The intial results look quite promising, with around 2x reduction in test times while running 4 jobs in parallel. The PR opened for introducing this feature in the mlpack build system is here. Currently, I am working on automating the process of adding tests to CTest by getting the test suite names and procedurally adding them to the list of tests.

In the meantime, I have been writing benchmarks and profiling the algorithms which are to be implemented in parallel to find the existing bottlenecks and the expected performance improvements from the parallel implementations.

In the coming weeks, my work will mainly be focussed on implemeneting the algorithms with significant scope for improvement of performance in their parallel versions and finding out more ways of improving and scaling mlpack's performance on multicore systems.

more...

LSH Parallelization - Week 2

This week I focused on implementing the parallelization changes I discussed in my previous blog post. I also spent some time on bug fixes and style changes of Multiprobe LSH, but I will not bore you with that.

Processing Queries in Parallel

First, I parallelized the query processing code. This was pretty straightforward since OpenMP lets us define private and shared variables and takes care of memory access by itself.

Here's a nice example of why memory access becomes a concern when dealing with parallel code.

Let's say you're on vacation in a nice Greek beach, and you run out of money. You call your mom to deposit some on your bank account but she doesn't answer, so you leave a message and then call your dad. Your dad answers and runs out to send you money through the ATM. While he's out, your mom listens to the message and goes online to also send you money...

So, your dad at the ATM and your mom through e-banking both perform these actions:

// balance = how much money you have. In the beginning this is 0.
balance = read_account_balance(); 
balance += 100; // the new deposit 
write_account_balance(balance); // update balance

But, because the bank you use doesn't employ great programmers, so they didn't care about memory access by multithreaded applications. So these commands are executed like this:

balance_mom = read_account_balance(); // mom knows you have 0
balance_mom += 100;
balance_dad = read_account_balance(); // dad knows you have 0
write_account_balance_mom(balance_mom); // mom just set your balance to 100
balance_dad += 200; // dad is more generous
write_account_balance_dad(balance_dad); // dad just set your balance to 200

Well, that's a problem - both mom and dad gave you money, but you only received 200 out of the 300 total. The reason is that the code is what is called a "critical segment" - a dangerous point where no two threads are safe to operate at the same time.

Does this mean we don't get to use parallelism in fear of ruining our results? Of course not! We just need to verify that what we are doing is correct - essentially forbidding parallel execution of the (usually very small) parts of the code that are critical segments.

OpenMP can force single-core execution of parts of the code with #pragma omp atomic (for one critical expression) or #pragma omp critical (for longer pieces of code).

Thread book-keeping

Once I was done with that and had it tested, I started implementing code to keep track of how many threads we use. That code started becoming very complicated, so I considered implementing a thread-control class and managing threads through it, but that has the potential of complicating the code too much. Many of the parts we're intervening with already have an increased level of code complexity (long functions with a lot of branching) and so adding more complexity might make the next person to look at the code want to learn where I live and come murder me - something I would very much like to avoid.

So, I ended up going with a simpler solution. OpenMP by default does not allow for nested parallelism - that is, it doesn't allow a spawned thread to spawn threads of its own. This is very good: it's easy to accidentally get yourself in recursive situations where spawning more threads that spawn more threads (and so on) leads to either the program to crash or it being very slow.

Since the code I want to parallelize further is nested inside parallel regions, it will, by default, be executed by only one thread. If, however, a user wants to enable that parallelism as well (for example in the live-query processing scheme I described in the previous post), then they can enable OpenMP's nested parallelism by setting the environment variable OMP_NESTED to TRUE.

Further parallelism

For now, I have only parallelized the for loops inside ReturnIndicesFromTables(), but they are not the bottleneck - find() and unique() are. So this week I plan to implement parallel versions of these functions and add tests for the second level of parallelism. Once that is done, the chapter of LSH parallelization will be over :)

more...

LSH Parallelization

I have done a few things this week on the LSH code, but I want to talk mainly about parallelization of the LSHSearch object, a topic which I find very interesting.

First off, a few things to consider:

a) Parallelization should be as transparent as possible. Not everyone is familiar and/or comfortable with parallel frameworks and concepts, and we shouldn't make life hard for future code maintainers/developers.

b) Parallelization, if done the wrong way, can lead to all sorts of pain: very slow code (mainly because of thread contention), deadlocks, correctness errors, to name a few.

c) At the same time, parallelization is very important for a scalable library that deals with large datasets, such as mlpack. If done correctly, it can give an important bump in performance, especially since virtually every machine is multicore now.

So we need to keep (a) and (c) in mind while being careful to not fall in any of the traps mentioned in (b). (a) is easy to do with OpenMP, and we should also try to use only the very basic constructs - parallel for, let's say.

Now let's talk specifically about the LSH code. The main time waster there is the search we do for each query, after the object has been trained with a reference set.

The way mlpack does this is there is a for loop, inside which, codes are computed for each query, and then other points with the same code are assembled into a candidate set where we then perform naive kNN. There's two possible paths to parallelization here:

  1. We can focus on having LSHSearch::Search process queries in parallel. This way, whenever this function is called with an array of queries, OpenMP will let us process different queries on different cores.
  2. We can instead try to reduce the cost of each individual query, by parallelizing the ReturnIndicesFromTable and BaseCase functions.

Option (1) has several benefits:

  • When ran with a large query set, the processing cores will be utilized very well - there's plenty of work to go around, so we simply spawn threads and divide work among them.
  • It is trivial to implement and embarrassingly parallel, meaning the more threads we spawn (up to the number of queries) the faster we will be.
  • It should cause an immediate and noticeable bump in performance.

But, this option also has a problem: If the user is running single-query searches, then this will not benefit them at all. That is, we have increased the number of queries we process per second but have not decreased the time required for each query.

Option 2 serves exactly that purpose: By parallelizing the query processing code itself, each query will complete faster. In use cases where users have a model of LSH already computed, and receive and process queries in real time, this makes much more sense than a parallelization of the query-processing mechanism.

The downside to this is, the part of the process we actually can parallelize is smaller, and isn't embarrassingly parallel. Still, it can benefit quite a lot.

But wait, the upsides and downsides are flipped in these two cases. Couldn't we somehow reap the benefits of both methods?

Yes, we could - by utilizing a hybrid parallelization model. The way to do this is for LSHSearch to keep track of the number of threads it currently uses as well as the maximum number of threads it is allowed to use (by default, equal to the number of machine cores) and then if it realizes that it is not taking advantage of the full machine capabilities, also use the per-query parallelization method.

Fortunately, OpenMP makes life very easy for us, with the if clause in its parallelization directive. The if clause is evaluated, and threads are spawned only if it evaluates to true. So, we can do something like

#pragma omp parallel for \
    if (numThreadsUsed <= maxThreads) \
    num_threads (maxThreads - numThreadsUsed)
    ...

I have already begun implementing this. I've changed the code in LSHSearch to provide for the thread book-keeping, and have completed the parallel query processing implementation.

Of course, at the same time, we should take care that what we are doing is correct (point c on my list). To do that, I've added yet another unit test to the LSHSearch suite, which performs two searches with the same query set - one that uses only one thread and one that uses the maximum number. These two processes should (and indeed do) return the same thing.

more...