[mlpack] GSoC: Fast k-centers algorithm & implementation

Wenlin Hu hiroshima596 at gmail.com
Tue Mar 11 00:00:59 EDT 2014


Hi Ryan,

Thanks for sharing me with high level ideas of this project. Basically I
get your points and understand the difficulty with the project. I will
first play with the related parts in mlpack and we could have more
discussions later. Thank you.

Thanks,
Wenlin Hu


On Mon, Mar 10, 2014 at 11:20 PM, Ryan Curtin <gth671b at mail.gatech.edu>wrote:

> On Mon, Mar 10, 2014 at 09:33:02PM -0400, Wenlin Hu wrote:
> > Dear all,
> >
> > I am a PhD candidate of Stony Brook University, USA. My major is Applied
> > Mathematics. I am very excited about the opportunity to contribute to
> > MLPACK community from now on.
> >
> > I have more than five-year research experience on mathematical modeling
> and
> > numerical simulation, concentrated on interface and computational fluid
> > dynamic related problems. I am proficient in C/C++, Matlab, algorithms,
> > data structures, and parallel computing.
> >
> > I have strong interest in the field of Machine Learning and Big Data. I
> am
> > doing online course on Machine Learning by Andrew NG on Stanford Online
> and
> > reading classical textbooks on statistical learning now.
> >
> > Among all the exciting ideas and projects, the project that caught my
> eyes
> > is:
> >
> >    - Fast k-venters algorithm and implementation
> >
> > I have done the course on Computational Geometry last year. I am
> > comfortable with CG algorithms, data structures, and research-oriented
> > project. I will be fully engaged in completing this project during summer
> > 2014 if selected.
> >
> > My resume or any other information can be furnished upon request. I am
> > looking forward to your advice on how should I proceed for GSoC 2014.
> > Thanks.
>
> Hi Wenlin,
>
> Thank you for getting in touch.  I actually had not seen the k-centers
> idea until today but it is not only related to Bill's research but my
> own too.  mlpack implements a nice framework to implement dual-tree
> algorithms, so after the derivation of an algorithm, it should be
> straightforward to implement (testing will be difficult, of course).
>
> I am not an expert on k-centers and I have yet to understand the problem
> fully (given that I just read the description ten minutes ago) but it
> appears to be an NP-hard problem.  This means that approximation is the
> only way to do it with trees.
>
> It seems to me to be related to the b-matching problem, which I once
> spent a long time thinking about.  Basically, in the b-matching problem,
> you are attempting to connect each point in the set to exactly b
> neighbors, and minimize the sum of the distances of each connection.
>
> My conclusion, in the end, was that it was not possible to do b-matching
> well with trees, or, at least, it wasn't possible to do it easily.  To
> do anything well with a tree-based algorithm, you need to be able to
> make a prune based on geometric reasoning, and with b-matching, you
> can't ever perfectly rule out a connection because it is sometimes
> possible that you must make a very long-distance connection to find the
> best solution; but with trees, the goal is to prune those long-distance
> connections.
>
> It may still be possible to apply a dual-tree algorithm to b-matching,
> but I have not reconsidered it since probably mid-2012.
>
> With k-centers, I think it is more reasonable to design a dual-tree
> algorithm, because I think the pruning decisions are a bit easier to
> make.
>
> However, I must also add, the Google Summer of Code stipulations require
> that there is some code artifact at the end of the summer, so although
> the idea is kind of an exploratory research idea, there must be some
> code implementation to be turned into Google at the end of the summer.
> So basically, I think a student will have to go into the project with a
> basic idea of a pruning rule or algorithm to implement, and then even if
> the results are negative and it does not turn into the most important
> paper NIPS has ever seen, Google will still have the code and the
> requirements of the GSoC program will be fulfilled.
>
> For this project, a good first place to start is to download mlpack,
> compile and install it, then spend some time reviewing the tree
> abstractions in mlpack (found in src/mlpack/core/tree/) and the
> dual-tree algorithms mlpack implements (for instance, the
> DualTreeBoruvka class in src/mlpack/methods/emst/).  I can give links to
> lots and lots of papers on dual-tree algorithms, if you like.
>
> (Bill, I've CC'ed you in case you have any comments or further thoughts
> on the project)
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "Wha' happened?"
> ryan at ratml.org |   - Mike LaFontaine
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.cc.gatech.edu/pipermail/mlpack/attachments/20140311/1ae9ec2c/attachment-0003.html>


More information about the mlpack mailing list