[mlpack] Google Summer of Code 2013 - Introduction + Thoughts

Ryan Curtin gth671b at mail.gatech.edu
Wed Apr 17 12:34:48 EDT 2013


On Tue, Apr 16, 2013 at 10:49:04PM +0200, Marcus Edel wrote:
> Hello,
> 
> I'm interested in working for mlpack this summer in a GSoC project and
> would be happy to work on the automatic benchmarking of the mlpack
> methods.

Hello Marcus,

I know I've already written about the automatic benchmarking project but
I can't seem to find the thread to link to, so it looks like I'm going
to be writing it again... :)

That particular project is very exciting to me because it allows fast
generation of benchmarks for new versions.  The benchmarks in the mlpack
paper (http://arxiv.org/abs/1210.6293) took a long time to generate, and
frankly I'd rather spend my time on something other than running timing
trials on a bunch of datasets to generate a table.

So if Jenkins does it automatically, this is a huge burden lifted off
developers' shoulders, and it also means that we can have nice tables
on mlpack.org with the latest updated benchmarks.
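Roughly, the Jenkins job would just run each method repeatedly and keep
the timings.  Here is a minimal sketch of that idea; the commented-out
binary name, flags, and dataset filename are hypothetical placeholders,
not actual mlpack options:

```python
import subprocess
import time

def best_wall_time(run, trials=3):
    """Call the zero-argument callable `run` several times and return
    the best (minimum) wall-clock time in seconds.  Taking the minimum
    over trials damps interference from other jobs on the build machine."""
    best = float("inf")
    for _ in range(trials):
        start = time.time()
        run()
        best = min(best, time.time() - start)
    return best

# Hypothetical usage: time one mlpack command-line method on a real
# dataset (placeholder names, for illustration only):
#
#   t = best_wall_time(lambda: subprocess.run(
#           ["allknn", "-r", "covertype.csv", "-k", "5"], check=True))
#   print("best of 3: %.3f s" % t)
```

A table of these numbers, regenerated on each commit, is essentially
what the project would produce.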

> I've looked at the code and noticed that a lot of the mlpack methods
> have a test which is supposed to be run with some set of parameters,
> but a lot of them generate random datasets. With regard to informing
> the developers which of their changesets have caused speedups or
> slowdowns particularly with regard to compare the results with the
> competing libraries, I consider it advisable to take existing datasets
> from something like mldata.org (mldata.org provides data and task
> downloads in a standardized format), so it would be good to expand the
> task to add read support for the mldata.org datasets.

This is exactly right; random datasets are not great for benchmarking.
Actually, I think that uniformly random datasets cause worst-case
behavior for kd-trees (though I'm not certain of that and I don't feel
like thinking it through right now).  So they are poor benchmarks for
that reason, and also because they are synthetic: machine learning
practitioners work with real datasets, and would be more interested in
seeing mlpack runtime comparisons on datasets applicable to them.

mldata.org is a good place to find datasets, as is the UCI machine
learning repository:

http://archive.ics.uci.edu/ml/datasets.html

I imagine there's a significant amount of overlap between the two...

> Currently I'm looking into the coding practices used in mlpack and
> play with some features of mlpack. I would like to request the mentors
> and the community to please provide any details to resources which
> could be helpful for the project.

The tutorials may be helpful:

http://www.mlpack.org/tutorial.html

Also, the coding standards are laid out here:

http://trac.research.cc.gatech.edu/fastlab/wiki/NewStyleGuidelines

Let me know if you have any problems or questions.

Thanks,

Ryan

-- 
Ryan Curtin       | "Indeed!"
ryan at igglybob.com |   - David Lo Pan
