Profiling for parallelization and parallel stochastic optimization methods - Week 7 & 8

These past two weeks were spent on two things : running HOGWILD! on larger datasets to find performance issues and fixing them and working on an implementation of SCD.

Running parallel SGD on the Netflix datasets revealed a few issues with the code. Using the generalized form of the parallel SGD Optimize function was proving to be very slow with a dataset of this size (with numFunctions = 100198806). Also, shuffling indices was taking more resources than it should.

The first issue was fixed by introducing an overload for ParallelSGD in the regularized SVD implementation, as is done for StandardSGD. To resolve the second issue, arma::shuffle was switched for std::shuffle.

After the latest commit, the runs on the datasets are here.

Comparing these to the runs mentioned in the paper, we see that the time taken on the RCV1 dataset is nearly the same, with lower error. Netflix performance is worse, and has probably to do something with a better initiliazation strategy or using a "delta" model (where ratings are calculated as a difference from the mean rating for the movie, instead of directly multiplying the user and movie matrices).

One interesting thing to note is that the runs in the paper were done on 10 core machines, whereas the runs for mlpack are on a dual-core, quad-thread CPU.

Regarding the implementation of SCD, I am thinking about adding a ResolvableFunctionType requirement on the function iterface. SCD requires the exact computation of gradient in a single dimension, and none of the existing optimizers seem to utilize this type of information.


Profiling for parallelization and parallel stochastic optimization methods - Week 6

This week was spent in further research into testing methods for the implementations in the PR(#1037). Several runs and experiments are in the Gists linked below :

Also, I have started with the initial implementation of another optimizer method, Stochastic Co Ordinate descent, and would open up a PR soon with the additions.


Profiling for parallelization and parallel stochastic optimization methods - Week 5

As mentioned in the earlier blog post, this week was primarily focussed on adding more tests for the implementation of Parallel SGD, along with progress on the matrix completion example from the paper. I have made some improvements to the code based on initial reviews, while adding new parts.

Both the implemented examples from the paper can be added as separate methods to mlpack. Once enough tests are added, I'll run the methods on the datasets from the paper. While waiting for an in depth review on the PR, I'll work on adding more tests for SparseSVM and SparseMC.


Profiling for parallelization and parallel stochastic optimization methods - Week 4

This week I completed the initial implementation of parallel SGD, along with a couple of tests for the new code. The PR (#1037) is nearly ready for an initial review. With further discussion, I would add more sophisticated and detailed methods of testing. Some SFINAE code needs to be added to handle the current interface of DecomposableFunctionType.

I would like to test this implementation against the original runs mentioned in the paper, to see the difference in results obtained from an OpenMP based implementation. I have already implemented the SparseSVM example.

The next few days would be focussed primarily on implementing more tests, running ParallelSGD on the datasets mentioned in the paper and possibly profiling mlpack's implementation of GMM for potential parallelization with OpenMP.


Profiling for parallelization and parallel stochastic optimization methods - Week 3

During week 3, I finished up work on the parallel implementation of KMeans, making the mlpack implementation nearly 4x faster with 4 threads available. The next algorithms for profiling would be GMMs and dual tree traversers. For now, I have started to focus on the other aspect of my project, the implementation of parallel stochastic optimization methods.

I have discussed about the implementation of parallel SGD with the HOGWILD! scheme of update with Yannis and the changes to the DecomposableFunctionType interface to carry the necessary details for the HOGWILD! update on separate processors. Some of these details are problem specific and the primary challenge in designing the interface is to preserve genericity in the Optimizer to allow users to write their own loss functions with minimal effort.

Next, I'll open up a PR with the initial implementation of this interface for further discussion and testing.


Profiling for parallelization and parallel stochastic optimization methods - Week 2

Week 2 continued with finishing up work on the CTest PR and profiling algorithms in mlpack for parallelization. The first algorithm under testing was the Gaussian Naive Bayes Classifier. The algorithm is simple enough to understand. For training, it takes a single pass through the data, calculating the mean and variance of each label for each feature, assuming that the feature values are sampled from a Gaussian PDF.

Since the algorithm is linear in the size of the dataset, any time spent in IO easily overshadows any time spent in computing the mean and variance of the data(which is very small to begin with). Also, the original form of NBC (in a multi label setting) is not trivially parallelizable with OpenMP. A change in the algorithm was required, which would increase the time complexity, while giving trivial gains from parallelization. Coupled with the fact that NBC cannot model complex datasets, we decided it was probably not worth the added complexity to the codebase to add a parallel implementation of this algorithm. A few measurements of the proposed implementation exist in this gist.

The next algorithm to be measured was KMeans. The implementation of NaiveKMeans was trivially parallelizable and profiling showed that most of the time was spent in computing the Metric(which is good). The mlpack KMeans class uses NaiveKMeans as a subroutine and the parallel implementation can be switched for the sequential one because of the same interface. The speedup in the initial implementation is quite noticiable. Here are the measurements. I have opened up #1024 with this implementation for discussion and further refinement.

Next, I will be going ahead with the implementation of HOGWILD! and possibly parallel dual-tree traversal algorithms.


Profiling for parallelization and parallel stochastic optimization methods - Week 1

This summer, I'll be working on the parallel implementations of the sequential algorithms already implemented in mlpack to improve their performance on multicore systems.

I began week 1 by starting with the integration of CTest in the mlpack testing interface. It makes testing faster by making multiple jobs run in parallel, making use of the available extra threads, reducing development time. The intial results look quite promising, with around 2x reduction in test times while running 4 jobs in parallel. The PR opened for introducing this feature in the mlpack build system is here. Currently, I am working on automating the process of adding tests to CTest by getting the test suite names and procedurally adding them to the list of tests.

In the meantime, I have been writing benchmarks and profiling the algorithms which are to be implemented in parallel to find the existing bottlenecks and the expected performance improvements from the parallel implementations.

In the coming weeks, my work will mainly be focussed on implemeneting the algorithms with significant scope for improvement of performance in their parallel versions and finding out more ways of improving and scaling mlpack's performance on multicore systems.