
LSH optimizations, modifications, and benchmarking - Week 2

Yannis Mentekidis, 19 August 2014

I began this week by debugging my multiprobe implementation which I discussed in my previous post. The algorithm is quite complicated and so I wanted to make sure it runs correctly before moving on to benchmarking and optimization.

Sure enough, I found a number of minor bugs here and there that the tests I had written didn't catch. That worried me, so I decided to write better tests; my idea was to add some simple deterministic test cases.

To do that, I needed to improve access to the LSHSearch object's projection tables, which are randomly generated: to have deterministic tests, you need to be able to specify the tables yourself instead of letting the object generate random ones for you.
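To see why fixed tables make tests deterministic, recall the first-level hash: a point's bucket is determined entirely by the projection vector, the offset, and the hash width. Below is a minimal std-only sketch (the function name and signature are hypothetical, not mlpack's API) of that computation, h(x) = floor((proj . x + offset) / hashWidth):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a single first-level LSH hash function.
// With a fixed (non-random) projection and offset, the bucket a
// point lands in is fully determined, which is what makes
// deterministic test cases possible.
int hashCode(const std::vector<double>& proj,
             const std::vector<double>& x,
             double offset, double hashWidth)
{
  double dot = 0.0;
  for (size_t i = 0; i < proj.size(); ++i)
    dot += proj[i] * x[i];
  return (int) std::floor((dot + offset) / hashWidth);
}
```

A test can then pick projections so that two points provably collide (or provably don't), with no randomness involved.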

In the process of modifying the LSHSearch code to do that, Ryan and I also decided to make a few other modifications, namely

1) Change the data structure that stores the projection tables from an std::vector to an arma::cube. Each slice of the cube is a projection table. This conserves memory and simplifies the code.
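The memory win comes from the cube being one contiguous allocation in which a slice is just an offset, rather than one heap allocation per table. A minimal std-only stand-in (hypothetical, not arma::cube itself) illustrates the layout:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical std-only stand-in for arma::cube: all projection
// tables live in one contiguous buffer, and "slice" t is just an
// offset into it -- one allocation instead of one per table.
struct ProjectionCube
{
  size_t rows, cols, slices;
  std::vector<double> data; // column-major within each slice

  ProjectionCube(size_t r, size_t c, size_t s)
    : rows(r), cols(c), slices(s), data(r * c * s, 0.0) { }

  // Element (i, j) of projection table t.
  double& at(size_t i, size_t j, size_t t)
  { return data[t * rows * cols + j * rows + i]; }
};
```

In the real code, `projections.slice(t)` hands back table t directly, so the per-table loop bodies stay unchanged.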

2) Change the implementation of the second-level hashing. In the current version, an arma::Mat&lt;size_t&gt; table is created where each row corresponds to a hash bucket and stores indices of points hashed to that bucket. This is inefficient, both because the default secondHashSize is quite large and because the number of points per bucket is uneven - so the resulting table is quite sparse. After writing some demo code and discussing it, we settled on a solution to these two problems.

So, with LSHSearch transparent, more easily testable and more efficient, we are now ready to perform benchmarks of single- and multiprobe LSH, see what we can optimize in the multiprobe code, and then move on to parallelization. All this will start today, so stay tuned :D

LSH optimizations, modifications, and benchmarking - Week 3

This is week 3 of GSoC and, according to plan, I am still working on multiprobe LSH. I thought this would have progressed faster, since a first working version of the algorithm was completed by the end of Week 1, but I've learned that debugging and testing machine learning algorithms is harder than I expected :)

I started week 3 by implementing a function in LSHSearch that computes recall, given a reference set of true neighbors and the neighbors found by LSH. This functionality is now available through the mlpack_lsh binary by using the -t switch and specifying a file of true neighbor indices. Computing recall is a very useful step in LSH parameter tuning, so I thought it would be nice to provide this simple computation.

I also implemented a few deterministic tests for LSH, so now we can be relatively confident that our changes improve the algorithm without breaking it. The deterministic tests use pre-specified projection tables, so that we can control whether disjoint clusters of points fall into the same hash bucket or not.

Finally, I started running tests on single- and multiprobe LSH. I've found that different datasets behave differently here: some benefit from multiprobe while others are better off without it. Profiling with gprof showed that the computation of the additional probing bin codes doesn't affect performance significantly (usually 2-4% of total time), so now I'm trying to pin down what the bottleneck actually is.

I have two hypotheses: either the resulting candidate set is very large in some cases, so the final linear search over candidates is what hurts performance, or the process of exploring so many additional bins is itself the cost. In any case, I will investigate further.

By this week's end, I will hopefully be finished with all the optimizations suggested by the profiling and testing of multiprobe LSH, and will have started parallelization with OpenMP.

Yannis Mentekidis, 13 June 2016

LSH optimizations, modifications, and benchmarking - Week 2

The past week, I continued implementing the LSH model, whose skeleton I outlined in my previous blog post.

The LSHModel::Train() method, which uses the dataset to learn some useful parameters, is complete, so only LSHModel::Predict() remains for the proposed algorithm to be complete.

While implementing LSHModel::Predict(), I realized that some of the GammaDistribution class features which I had mentioned were missing will actually be needed by the LSHModel. So I went back and filled in the missing parts, along with the corresponding correctness tests. One more correctness test remains to implement, and then the GammaDistribution class will be complete and issue 733 can be closed.

This week, I plan to fill out the remaining parts of the LSHModel and GammaDistribution classes, and leave the last week for testing, optimization, and final changes.

Yannis Mentekidis, 19 August 2014