[mlpack] GSoC 2016 - Fast k-centers Algorithm & Implementation

Mon Mar 14 14:03:58 EDT 2016

Hi, Ryan,

Yes, the aim for the project is ok for me, I've asked for using other
metaheuristics instead of approximation algorithms just because I'm used to
them. Maybe VNS, GRASP or TS could do a good job, but I cannot guarantee
that fact since I didn't try to address the problem :-) Anyway, if you
(Bill and yourself) are familiar with dual-tree algorithms (and I'm sure
you are, just looking at the title of your PhD thesis), I think the best
way to tackle the problem and achieve the goals is following your
guidelines.

Considering the points you wrote down: Gonzalez algorithm is this one [1],
right? Do you have a set of problem instances (or datasets, whatever you
want to name it) just to see how the problem looks and, possibly, to try a
naïve greedy algorithm?

To be honest, this is my first time with dual-tree algorithms, and I find
your paper very descriptive. However, I don't see what is a query dataset
and a reference dataset, specially applied to this problem. Maybe I have
not dedicated too much time to that, but this is a busy week for me, I'm
sorry.

As I've said before, I'm not used to dual-tree algorithms, so my ideas
(thinking in a greedy algorithm to have an initial solution, thinking about
how to move from one part of the solution space to another in order to
avoid basins of attraction, thinking about moves to obtain a local optimum
with a local search, etc.) will not work very well here, but I think I can
learn a lot with your knowledge.

Kind regards,

Borja.

[1] - http://www.sciencedirect.com/science/article/pii/0304397585902245

2016-03-14 15:24 GMT+01:00 Ryan Curtin <ryan at ratml.org>:

> On Sun, Mar 13, 2016 at 10:26:41PM +0100, Borja Menéndez Moreno wrote:
> > Hello,
> >
> > I am Borja Menéndez, a PhD student in Computer Science starting his third
> > year of studies from Madrid, Spain. In particular, my field of study is
> > optimization problems. In my thesis I'm working on combinatorial problems
> > related to retrieving orders from warehouses. You can see my first JCR
> > paper in [1] for more info about this.
> >
> > As you can imagine, I'm really interested in the project titled "Fast
> > k-centers Algorithm & Implementation". I find this optimization problem
> > truly challenging and funny to work with. In the last few weeks I'm being
> > attracted to Machine Learning and I think this project is a good start
> for
> > me since it is a good mixture between optimization problems and machine
> > learning. During my master I worked with an optimization problem
> relatively
> > similar to this one: MaxMin Diversity Problem. I know it is not the same
> > problem but, probably, the ideas I used for that problem could work here.
> >
> > Currently, and from the last two years, I'm working with Java, but I also
> > have programmed my whole master thesis in C++ (more than 12.000 lines of
> > code) in an open source project called JdeRobot, a robotics framework.
> You
> > can see my contribution to the project in [2] (my github username is
> > bmenendez [3]).
> >
> > Are you open to try other algorithms (different from the Dual-Tree
> proposed
> > in the ideas page) such as Variable Neighborhood Search / GRASP / Tabu
> > Search to address the problem? Are you planning to write a paper if the
> > obtained solutions are good enough to do it? Do you have a set of problem
> > instances to start working on this problem and trying some naïve ideas?
> >
> > I also wanted to congratulate for your work in mlpack since it is an
> > extremely easy to install library and your documentation is remarkably
> > complete.
>
> Hi Borja,
>
> Thanks for getting in touch.
>
> The reason we've suggested using dual-tree algorithms is because it's
> what Bill and I are most familiar with; we both spent most of graduate
> school writing dual-tree algorithms for various tasks.  So I think that
> both of us could come up with an algorithm to approximate a solution to
> the k-centers algorithm pretty easily.
>
> I'm open to mentoring other ideas, but we should have the same sort of
> aim for the project:
>
>  * devise an approximate strategy for k-centroids that can handle large
>    datasets (i.e. hundreds of thousands to millions of points)
>
>  * implement that strategy and also the Gonzales algorithm for
>    comparison
>
>  * if the results are good, prepare a paper for submission to a
>    conference of journal
>
> What do you think?  If you are confident that one of the approaches you
> suggested will work, please go ahead and propose it.  But since Google
> Summer of Code is oriented towards code and not research, we at least
> have to have a good plan for what you will implement during the proposal
> phase.
>
> If you want to read more about dual-tree algorithms, a good place to
> start might be here:
>
> http://www.ratml.org/pub/pdf/2013tree.pdf
>
> I think a tree-based approach has a high likelihood of being successful.
> If you're interested in that, I can elaborate on some ideas.  (And
> please feel free to elaborate on the ideas that you mentioned.  I am
> interested to hear more.)
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "That rug really tied the room together."
> ryan at ratml.org |   - The Dude
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.cc.gatech.edu/pipermail/mlpack/attachments/20160314/8c69590a/attachment-0002.html>