[mlpack] GSoC 2015: Fast k-centers algorithm & implementation

Ryan Curtin ryan at ratml.org
Tue Feb 24 11:45:40 EST 2015


On Mon, Feb 23, 2015 at 03:26:01AM -0300, Lukas Zorich wrote:
> Reading through the GSoC 2015 ideas, the project that interests me
> most is *Fast
> k-centers algorithm & implementation*. This seems that a very interesting
> and fun project to work on, but it also looks difficult and challenging. I
> wanted to know if someone that didn't know about Dual-Tree Boruvka MST
> algorithm till today and with little Machine Learning experience, but very
> motivated and willing to learn all the necessary stuff to complete
> successfully the project will be able to handle it (the project
> descriptions says that some understanding of geometry and spatial data
> structures is necessary: I know about kd-trees and also I have worked with
> octrees in a college project before). Don't get me wrong, the fact that
> this is a challenging project (for me) motivates me a lot, but I also want
> to be realistic.

Hi Lukas,

I think the k-centers project will be a lot of fun.  Since you're
familiar with kd-trees and octrees, that helps a lot.  There are still
a few months to learn more about dual-tree algorithms, and mlpack's
infrastructure for dual-tree algorithms is pretty nice, so you should
not have to spend too much time thinking about the intricacies of how to
traverse trees and implement tree types.

If I'm not mistaken, Bill already has most of the idea for the algorithm
ready; it just needs an implementation, some testing, and a nice API.
Then we can see if the algorithm is useful and faster than alternatives
(so, surveying other algorithms for k-centers and implementing those may
be a part of the project too), and if so, then a paper can be written
and submitted.

> I already downloaded and built mlpack from source code. I also started to
> read some papers about dual-tree algorithms [1] and [2], to understand how
> this algorithms work. And, as previous mailing list post [3] says, I also
> going to look at the tree abstractions in mlpack (src/mlpack/core/tree/)
> and the dual-tree algorithms mlpack implements, in particular DualTreeBoruvka
> class in src/mlpack/methods/emst/. Are there any other important resources
> I should look at? Maybe some dual-tree algorithm papers or other important
> resources to learn more about this topics would be great!

There are many dual-tree algorithm papers... let me link several below.
The two you've started with are good places to start.  It's worth
pointing out that many of the papers I link to below have very different
notation and may be somewhat hard to understand...

"n-body problems in statistical learning"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.7138&rep=rep1&type=pdf
This is the first paper of dual-tree algorithms, which introduces the
idea of building a tree on query points instead of traversing one tree
for each query point.

"Cover trees for nearest neighbor"
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_BeygelzimerKL06.pdf
This is a theory paper describing the cover tree, and contains a
dual-tree algorithm for nearest neighbor search.  If the theory doesn't
make sense, don't worry about it, but maybe you might find it
interesting.

"Dual-tree fast exact max-kernel search"
http://www.ratml.org/pub/pdf/2013fastmks.pdf
A dual-tree algorithm for max-kernel search, given in the context of
tree-independent dual-tree algorithms.  The theory might not be
interesting or useful.

"Approximate Matrix Multiplication and Space Partitioning Trees: An
Exploration"
http://www.npslagle.info/articles/matrixMultiplication.pdf
An interesting exploration into using trees for approximate matrix
multiplication.

"Fast Particle Smoothing: If I Had a Million Particles"
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2006_KlaasBFDML06.pdf
Another dual-tree algorithm for a different problem.

"Rank-approximate nearest neighbor search: Retaining meaning and speed
in high dimensions"
http://papers.nips.cc/paper/3864-rank-approximate-nearest-neighbor-search-retaining-meaning-and-speed-in-high-dimensions.pdf
An interesting new notion of approximation for nearest neighbor search,
and a dual-tree algorithm to perform it with.

You can follow the references of each of those articles to find lots
more, if you like, but that's probably a good place to start.  If I can
answer any questions, please feel free to ask.

It's probably also worth your time to look through the implementation of
various mlpack dual-tree algorithms to understand how they are written
and how you might write another one:

  src/mlpack/methods/neighbor_search/
  src/mlpack/methods/range_search/
  src/mlpack/methods/emst/
  src/mlpack/methods/fastmks/
  src/mlpack/methods/rann/

Hope this helps. :)

Ryan

-- 
Ryan Curtin    | "I am."
ryan at ratml.org |   - Joe



More information about the mlpack mailing list