[mlpack] GSoC 2015: Fast k-centers algorithm & implementation

Lukas Zorich lukas.zorich at gmail.com
Wed Mar 4 01:23:55 EST 2015


Hi Ryan,

First of all, it's too bad that mlpack hasn't been accepted into GSoC'15 :(
I spent a lot of time reading about dual-tree algorithms and diving into
mlpack code. The thing is that I'm still interested in dual-tree
algorithms, and I also want to sharpen my C++ skills, so I still want to
contribute to mlpack :)

I thought that I could implement KDE (
https://github.com/mlpack/mlpack/issues/150) to start. I will use the
various mlpack dual-tree algorithms already implemented as a guide. Do you
think this is achivable for me?

I will let you know if I have questions :)

Cheers!

Lukas.

2015-02-24 13:45 GMT-03:00 Ryan Curtin <ryan at ratml.org>:

> On Mon, Feb 23, 2015 at 03:26:01AM -0300, Lukas Zorich wrote:
> > Reading through the GSoC 2015 ideas, the project that interests me
> > most is *Fast
> > k-centers algorithm & implementation*. This seems that a very interesting
> > and fun project to work on, but it also looks difficult and challenging.
> I
> > wanted to know if someone that didn't know about Dual-Tree Boruvka MST
> > algorithm till today and with little Machine Learning experience, but
> very
> > motivated and willing to learn all the necessary stuff to complete
> > successfully the project will be able to handle it (the project
> > descriptions says that some understanding of geometry and spatial data
> > structures is necessary: I know about kd-trees and also I have worked
> with
> > octrees in a college project before). Don't get me wrong, the fact that
> > this is a challenging project (for me) motivates me a lot, but I also
> want
> > to be realistic.
>
> Hi Lukas,
>
> I think the k-centers project will be a lot of fun.  Since you're
> familiar with kd-trees and octrees, that helps a lot.  There are still
> a few months to learn more about dual-tree algorithms, and mlpack's
> infrastructure for dual-tree algorithms is pretty nice, so you should
> not have to spend too much time thinking about the intricacies of how to
> traverse trees and implement tree types.
>
> If I'm not mistaken, Bill already has most of the idea for the algorithm
> ready; it just needs an implementation, some testing, and a nice API.
> Then we can see if the algorithm is useful and faster than alternatives
> (so, surveying other algorithms for k-centers and implementing those may
> be a part of the project too), and if so, then a paper can be written
> and submitted.
>
> > I already downloaded and built mlpack from source code. I also started to
> > read some papers about dual-tree algorithms [1] and [2], to understand
> how
> > this algorithms work. And, as previous mailing list post [3] says, I also
> > going to look at the tree abstractions in mlpack (src/mlpack/core/tree/)
> > and the dual-tree algorithms mlpack implements, in particular
> DualTreeBoruvka
> > class in src/mlpack/methods/emst/. Are there any other important
> resources
> > I should look at? Maybe some dual-tree algorithm papers or other
> important
> > resources to learn more about this topics would be great!
>
> There are many dual-tree algorithm papers... let me link several below.
> The two you've started with are good places to start.  It's worth
> pointing out that many of the papers I link to below have very different
> notation and may be somewhat hard to understand...
>
> "n-body problems in statistical learning"
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.7138&rep=rep1&type=pdf
> This is the first paper of dual-tree algorithms, which introduces the
> idea of building a tree on query points instead of traversing one tree
> for each query point.
>
> "Cover trees for nearest neighbor"
>
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_BeygelzimerKL06.pdf
> This is a theory paper describing the cover tree, and contains a
> dual-tree algorithm for nearest neighbor search.  If the theory doesn't
> make sense, don't worry about it, but maybe you might find it
> interesting.
>
> "Dual-tree fast exact max-kernel search"
> http://www.ratml.org/pub/pdf/2013fastmks.pdf
> A dual-tree algorithm for max-kernel search, given in the context of
> tree-independent dual-tree algorithms.  The theory might not be
> interesting or useful.
>
> "Approximate Matrix Multiplication and Space Partitioning Trees: An
> Exploration"
> http://www.npslagle.info/articles/matrixMultiplication.pdf
> An interesting exploration into using trees for approximate matrix
> multiplication.
>
> "Fast Particle Smoothing: If I Had a Million Particles"
>
> http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_KlaasBFDML06.pdf
> Another dual-tree algorithm for a different problem.
>
> "Rank-approximate nearest neighbor search: Retaining meaning and speed
> in high dimensions"
>
> http://papers.nips.cc/paper/3864-rank-approximate-nearest-neighbor-search-retaining-meaning-and-speed-in-high-dimensions.pdf
> An interesting new notion of approximation for nearest neighbor search,
> and a dual-tree algorithm to perform it with.
>
> You can follow the references of each of those articles to find lots
> more, if you like, but that's probably a good place to start.  If I can
> answer any questions, please feel free to ask.
>
> It's probably also worth your time to look through the implementation of
> various mlpack dual-tree algorithms to understand how they are written
> and how you might write another one:
>
>   src/mlpack/methods/neighbor_search/
>   src/mlpack/methods/range_search/
>   src/mlpack/methods/emst/
>   src/mlpack/methods/fastmks/
>   src/mlpack/methods/rann/
>
> Hope this helps. :)
>
> Ryan
>
> --
> Ryan Curtin    | "I am."
> ryan at ratml.org |   - Joe
>



-- 
Lukas Zorich
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.cc.gatech.edu/pipermail/mlpack/attachments/20150304/0b6710ed/attachment-0002.html>


More information about the mlpack mailing list