[mlpack] GSoC: Fast k-centers algorithm & implementation

Ryan Curtin gth671b at mail.gatech.edu
Mon Mar 10 23:20:40 EDT 2014


On Mon, Mar 10, 2014 at 09:33:02PM -0400, Wenlin Hu wrote:
> Dear all,
> 
> I am a PhD candidate of Stony Brook University, USA. My major is Applied
> Mathematics. I am very excited about the opportunity to contribute to
> MLPACK community from now on.
> 
> I have more than five-year research experience on mathematical modeling and
> numerical simulation, concentrated on interface and computational fluid
> dynamic related problems. I am proficient in C/C++, Matlab, algorithms,
> data structures, and parallel computing.
> 
> I have strong interest in the field of Machine Learning and Big Data. I am
> doing online course on Machine Learning by Andrew NG on Stanford Online and
> reading classical textbooks on statistical learning now.
> 
> Among all the exciting ideas and projects, the project that caught my eyes
> is:
> 
>    - Fast k-venters algorithm and implementation
> 
> I have done the course on Computational Geometry last year. I am
> comfortable with CG algorithms, data structures, and research-oriented
> project. I will be fully engaged in completing this project during summer
> 2014 if selected.
> 
> My resume or any other information can be furnished upon request. I am
> looking forward to your advice on how should I proceed for GSoC 2014.
> Thanks.

Hi Wenlin,

Thank you for getting in touch.  I actually had not seen the k-centers
idea until today but it is not only related to Bill's research but my
own too.  mlpack implements a nice framework to implement dual-tree
algorithms, so after the derivation of an algorithm, it should be
straightforward to implement (testing will be difficult, of course).

I am not an expert on k-centers and I have yet to understand the problem
fully (given that I just read the description ten minutes ago) but it
appears to be an NP-hard problem.  This means that approximation is the
only way to do it with trees.

It seems to me to be related to the b-matching problem, which I once
spent a long time thinking about.  Basically, in the b-matching problem,
you are attempting to connect each point in the set to exactly b
neighbors, and minimize the sum of the distances of each connection.

My conclusion, in the end, was that it was not possible to do b-matching
well with trees, or, at least, it wasn't possible to do it easily.  To
do anything well with a tree-based algorithm, you need to be able to
make a prune based on geometric reasoning, and with b-matching, you
can't ever perfectly rule out a connection because it is sometimes
possible that you must make a very long-distance connection to find the
best solution; but with trees, the goal is to prune those long-distance
connections.

It may still be possible to apply a dual-tree algorithm to b-matching,
but I have not reconsidered it since probably mid-2012.

With k-centers, I think it is more reasonable to design a dual-tree
algorithm, because I think the pruning decisions are a bit easier to
make.

However, I must also add, the Google Summer of Code stipulations require
that there is some code artifact at the end of the summer, so although
the idea is kind of an exploratory research idea, there must be some
code implementation to be turned into Google at the end of the summer.
So basically, I think a student will have to go into the project with a
basic idea of a pruning rule or algorithm to implement, and then even if
the results are negative and it does not turn into the most important
paper NIPS has ever seen, Google will still have the code and the
requirements of the GSoC program will be fulfilled.

For this project, a good first place to start is to download mlpack,
compile and install it, then spend some time reviewing the tree
abstractions in mlpack (found in src/mlpack/core/tree/) and the
dual-tree algorithms mlpack implements (for instance, the
DualTreeBoruvka class in src/mlpack/methods/emst/).  I can give links to
lots and lots of papers on dual-tree algorithms, if you like.

(Bill, I've CC'ed you in case you have any comments or further thoughts
on the project)

Thanks,

Ryan

-- 
Ryan Curtin    | "Wha' happened?"
ryan at ratml.org |   - Mike LaFontaine



More information about the mlpack mailing list