[mlpack] GSoC 2016 - Fast k-centers Algorithm & Implementation

Mon Mar 14 10:24:19 EDT 2016

On Sun, Mar 13, 2016 at 10:26:41PM +0100, Borja Menéndez Moreno wrote:
> Hello,
> 
> I am Borja Menéndez, a PhD student in Computer Science starting his third
> year of studies from Madrid, Spain. In particular, my field of study is
> optimization problems. In my thesis I'm working on combinatorial problems
> related to retrieving orders from warehouses. You can see my first JCR
> paper in [1] for more info about this.
> 
> As you can imagine, I'm really interested in the project titled "Fast
> k-centers Algorithm & Implementation". I find this optimization problem
> truly challenging and funny to work with. In the last few weeks I'm being
> attracted to Machine Learning and I think this project is a good start for
> me since it is a good mixture between optimization problems and machine
> learning. During my master I worked with an optimization problem relatively
> similar to this one: MaxMin Diversity Problem. I know it is not the same
> problem but, probably, the ideas I used for that problem could work here.
> 
> Currently, and from the last two years, I'm working with Java, but I also
> have programmed my whole master thesis in C++ (more than 12.000 lines of
> code) in an open source project called JdeRobot, a robotics framework. You
> can see my contribution to the project in [2] (my github username is
> bmenendez [3]).
> 
> Are you open to try other algorithms (different from the Dual-Tree proposed
> in the ideas page) such as Variable Neighborhood Search / GRASP / Tabu
> Search to address the problem? Are you planning to write a paper if the
> obtained solutions are good enough to do it? Do you have a set of problem
> instances to start working on this problem and trying some naïve ideas?
> 
> I also wanted to congratulate for your work in mlpack since it is an
> extremely easy to install library and your documentation is remarkably
> complete.

Hi Borja,

Thanks for getting in touch.

The reason we've suggested using dual-tree algorithms is because it's
what Bill and I are most familiar with; we both spent most of graduate
school writing dual-tree algorithms for various tasks.  So I think that
both of us could come up with an algorithm to approximate a solution to
the k-centers algorithm pretty easily.

I'm open to mentoring other ideas, but we should have the same sort of
aim for the project:

 * devise an approximate strategy for k-centroids that can handle large
   datasets (i.e. hundreds of thousands to millions of points)

 * implement that strategy and also the Gonzales algorithm for
   comparison

 * if the results are good, prepare a paper for submission to a
   conference of journal

What do you think?  If you are confident that one of the approaches you
suggested will work, please go ahead and propose it.  But since Google
Summer of Code is oriented towards code and not research, we at least
have to have a good plan for what you will implement during the proposal
phase.

If you want to read more about dual-tree algorithms, a good place to
start might be here:

http://www.ratml.org/pub/pdf/2013tree.pdf

I think a tree-based approach has a high likelihood of being successful.
If you're interested in that, I can elaborate on some ideas.  (And
please feel free to elaborate on the ideas that you mentioned.  I am
interested to hear more.)

Thanks,

Ryan

-- 
Ryan Curtin    | "That rug really tied the room together."
ryan at ratml.org |   - The Dude