[mlpack] Question about KMeans benchmark

Thu Jul 17 11:57:30 EDT 2014

On Thu, Jul 17, 2014 at 11:27:14AM -0400, Liu Liu wrote:
> On Thu, Jul 17, 2014 at 11:14 AM, Liu Liu <lliu at stern.nyu.edu> wrote:
> > Hi Ryan,
> >
> > I am interested in using KMeans in MLPACK for my research purpose. I have
> > several questions about the benchmark of Kmeans in your website.

Hello Liu,

I will do my best to answer.

> > 1) What are the datasets? How large (# of items, # of features)?
> 
> Ah, I should have read the benchmark results more carefully. I find the
> dataset names now.

Ok, good to hear you got that figured out.

> > 2) Is the result based on a single run or multiple run? Matlab has a
> > parameter to run Kmeans multiple times and choose the best one as final
> > result.
> > 3) Do you use Bradley-Fayyad "refined start" when test KMeans for
> > benchmark?

The result is based on a single run, but each run is supposed to start
from the same initial cluster centroids.  As a result, the algorithm is
deterministic and will proceed to the same results for every library.

I have successfully determined that the benchmarking system has support
for starting from the same initial centroids for each library, but I
have not yet confirmed if that support is in use.  I'll respond further
when I have figured it out (or alternately, maybe Marcus, the system's
author, can give some insight).

> > 4) How do you select other parameters for each dataset? The result only
> > showed # of clusters.

Once the initial cluster centroids are selected, the only other
parameter is the number of clusters.  These are selected based on
properties of the dataset.

> > Regarding how to select a good initial start, you mentioned in the website
> > that there are multiple strategies for choosing initial points effectively
> > and MLPACK implements some of these, notably the Bradley-Fayyad algorithm.
> > Have you tried other initialization methods, e.g., KMeans++
> > <http://en.wikipedia.org/wiki/K-means%2B%2B> or XMeans
> > <http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf>, or compared their
> > performance?

Both of those algorithms are good choices, but mlpack has neither
implemented.  Thanks to the niceness of C++ templates, it is easy to
implement any other initialization strategy.  See this section in the
tutorial for more information:

http://www.mlpack.org/doxygen.php?doc=kmtutorial.html#kmeans_initial_partition_kmtut

Once a new partitioning policy is implemented, one simply needs to make
a new KMeans object using the relevant partitioning policy:

KMeans<MetricType, YourNewPartitioningPolicy> kmeans;
kmeans.Cluster(...)

and it will use YourNewPartitioningPolicy to make the initial
assignments.

> > btw, I real like the project, the coding style and the nice documentation.
> > Thank you for making it available to us!!

Thanks!  Feel free to contribute to the project -- anyone is welcome to
join the community.

-- 
Ryan Curtin    | "Are you or are you not the Black Angel of Death?"
ryan at ratml.org |   - Steve