[mlpack] Looking for heavyweight usecases of NB Classifier

Ryan Curtin ryan at ratml.org
Mon Jun 5 15:37:01 EDT 2017


On Mon, Jun 05, 2017 at 05:50:08PM +0100, Yannis Mentekidis wrote:
> Hi guys,
> 
> Shikhar is working on his project to profile different mlpack algorithms
> and identify potential bottlenecks he could then parallelize. He's found a
> paper (
> https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf)
> which
> adapts the MapReduce paradigm for certain algorithms, including Naive
> Bayes, so he started with profiling that algorithm.
> 
> However, he and I have been struggling to actually find a dataset that
> makes the algorithm take a significant amount of time. The time taken
> by the mlpack::data::Load() functions is 2-3 orders of magnitude larger
> than that of the Train() and Classify() functions.
> 
> We were wondering:
> 
>    - Has anybody come across any usecases where NBC is slow enough to be
>    worth parallelizing?
>    - Does anyone have any tips on profiling the algorithm so that data
>    loading is excluded and we can focus on the things we can actually improve?

NBC makes essentially a single pass over the data, so either you have to
find a gigantic dataset that takes a long time to pass over, or maybe a
dataset with a very large number of classes (so that the model itself
takes up a large amount of space).

The UCI higgs dataset might be useful:

https://archive.ics.uci.edu/ml/datasets/HIGGS

but it's only two-class.

You can save some loading time by using a binary matrix file instead;
CSV takes a long time to parse.  You could write some simple code to
convert:

  arma::mat dataset;
  data::Load("big_file.csv", dataset); // slow: text parsing
  data::Save("big_file.bin", dataset); // later loads of the .bin are fast

That should reduce the amount of runtime devoted to building the
matrix.

Another option would be to just generate a very very large random
matrix.

-- 
Ryan Curtin    | 
ryan at ratml.org | "Death is the road to awe."

