[mlpack] Looking for heavyweight usecases of NB Classifier
Ryan Curtin
ryan at ratml.org
Mon Jun 5 15:37:01 EDT 2017
On Mon, Jun 05, 2017 at 05:50:08PM +0100, Yannis Mentekidis wrote:
> Hi guys,
>
> Shikhar is working on his project to profile different mlpack algorithms
> and identify potential bottlenecks he could then parallelize. He's found a
> paper
> (https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf),
> which adapts the MapReduce paradigm for certain algorithms, including
> Naive Bayes, so he started with profiling that algorithm.
>
> However, he and I have been struggling to actually find a dataset that
> makes the algorithm take a significant amount of time. The time it takes
> for the mlpack::data::Load() functions is 2-3 orders of magnitude larger
> than the Train() and Classify() functions.
>
> We were wondering:
>
> - Has anybody come across any usecases where NBC is slow enough to be
> worth parallelizing?
> - Does anyone have any tips on profiling the algorithm so that data
> loading is ignored, so we can focus on the things we can actually improve?
NBC only takes basically one pass over the data, so either you have to
find a gigantic dataset that takes a long time to pass over, or maybe a
dataset with a very large number of classes (so that the model itself
takes up a large amount of space).
The UCI higgs dataset might be useful:
https://archive.ics.uci.edu/ml/datasets/HIGGS
but it's only two-class.
You can save some time by loading a binary matrix file instead; CSV
takes a long time to parse.  You could write some simple code to
convert (assuming mlpack/core.hpp is included and you are using
namespace mlpack):

arma::mat dataset;
data::Load("big_file.csv", dataset);
data::Save("big_file.bin", dataset);
That should reduce the amount of runtime devoted to building the
matrix.
Another option would be to just generate a very very large random
matrix.
--
Ryan Curtin       | "Death is the road to awe."
ryan at ratml.org |