[mlpack] Patch for Ticket #251 and GSoC Intro

Mon Apr 15 15:03:06 EDT 2013

On Sat, Apr 13, 2013 at 03:41:06PM +0530, Pararth Shah wrote:
> Hi,
> 
> I am interested in working with MLPACK this summer as part of GSoC 2013.
> Since this list has already been flooded with introduction emails, I went
> ahead and created a patch for a bug, in order to (i) get acquainted with
> the code, and (ii) help towards reaching the 1.0.5 milestone.
> 
> *Patch for #251: kmeans -f has no test and does not work*
> 
> The ticket is here <http://trac.research.cc.gatech.edu/fastlab/ticket/251>.
> I figured that the KMeans::FastCluster() function is giving an error due to
> issues in construction of the MRKDStatistic object. I modified the
> constructor implementation (diff attached) which seems to have solved the
> issue.
> 
> I will go ahead and add a test for FastCluster(), but please confirm that I
> am on the right track. If yes, I'll assign the ticket to myself and attach
> the diff to the ticket (already created an account on trac).

Hello Pararth,

Thank you for taking the time to look into ticket #251.  Unfortunately,
the solution is going to be somewhat more complex than your patch,
because the way trees work in mlpack has changed slightly, and the
Pelleg-Moore k-means algorithm needs to be re-derived for the
tree-independent setting.  The kd-trees, for which the algorithm was
originally devised, conveniently store the number of points they
contain, but in general we can't assume this about any tree.

I am planning to re-derive the Pelleg-Moore algorithm at some point, but
I haven't had the time yet.  Then, it can be rewritten in a
tree-independent manner.  The MRKDStatistic class will also require some
thought, though.
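
To make the point-count issue concrete, here is a minimal sketch of how a
per-node count could be computed bottom-up for any tree that exposes only
its children and the points it holds directly.  The Node type and the
CountDescendants() function are hypothetical names made up for
illustration; they are not mlpack's tree classes or the MRKDStatistic API,
and the sketch covers only the counting part, not the re-derived algorithm.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal tree node, for illustration only (not an mlpack type).
struct Node
{
  std::vector<std::size_t> points;  // Indices of points held directly here.
  std::vector<Node> children;       // Child nodes; empty for a leaf.
  std::size_t descendantCount = 0;  // Cached count, set by CountDescendants().
};

// Count the points in the subtree rooted at `node`, caching the result in
// every node visited.  This makes no assumption about the number of children
// or about which nodes hold points.
std::size_t CountDescendants(Node& node)
{
  std::size_t count = node.points.size();
  for (Node& child : node.children)
    count += CountDescendants(child);

  node.descendantCount = count;
  return count;
}
```

A single pass like this after the tree is built would give every node the
count that the kd-tree version gets for free.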

> The "Automated Benchmarking of MLPACK Methods" project sounds interesting
> to me, as I did similar work during my previous GSoC experience. I am working
> on sketching out a proposal for the same, while simultaneously getting a
> better understanding of the MLPACK codebase. However, I am also on the
> lookout for other project ideas that may interest me, and will email again
> once I have something substantial to discuss (two such ideas are
> parallelization using OpenMP, and support for graph min-cut based
> optimizations).

The completion of this project would be very useful for mlpack, and I
would be really happy to see it done.  Right now, producing benchmarks
for mlpack (such as those in the very-recently-published JMLR paper
found at
https://lists.csail.mit.edu/pipermail/jmlr-announce/2013-April/000837.html )
is rather tedious and time-consuming.  An automated benchmarking system
would be set up once and then be able to produce benchmarks like this at
any time.  Then, updates to the benchmarks in the paper are trivial!
Very cool.  :)

Let me know if I can answer any questions about it.  It shares some
concerns with the profile-guided optimization project; I answered a few
questions about that project at
https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000052.html .

> P.S. @Ryan and other mentors, what would be a good time of the day to catch
> one of you on IRC?

I'm usually awake and periodically checking #mlpack between 10am and 5pm
EDT, and after dinner I'm sometimes around from about 7pm to 2am (no
guarantees).  Feel free to drop in anytime.

-- 
Ryan Curtin       | "If it's something that can be stopped, then just try to stop it!"
ryan at igglybob.com |   - Skull Kid