[mlpack] Question about KMeans benchmark

Thu Jul 17 11:14:19 EDT 2014

Hi Ryan,

I am interested in using KMeans in MLPACK for my research. I have
several questions about the KMeans benchmark on your website.

1) What are the datasets? How large are they (# of items, # of features)?
2) Is the result based on a single run or multiple runs? MATLAB has a
parameter to run KMeans multiple times and choose the best run as the final
result.
3) Do you use the Bradley-Fayyad "refined start" when testing KMeans for the
benchmark?
4) How do you select the other parameters for each dataset? The results only
show the number of clusters.
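
To make question 2 concrete, here is a minimal sketch of what I mean by
"multiple runs". It is plain Python, not MLPACK's or MATLAB's API; the
names lloyd and best_of_n are hypothetical, and I am assuming "best" means
the lowest within-cluster sum of squared distances.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (centroids, sse)."""
    centroids = rng.sample(points, k)  # k distinct data points as the start
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse

def best_of_n(points, k, n_runs=10, seed=0):
    """Repeat k-means n_runs times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = (lloyd(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda r: r[1])
```

With points given as a list of tuples, best_of_n(points, 2) returns the
centroids and objective value of the best of ten runs.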

Regarding how to select a good starting point, you mention on the website
that there are multiple strategies for choosing initial points effectively
and that MLPACK implements some of these, notably the Bradley-Fayyad algorithm.
Have you tried other initialization methods, e.g., KMeans++
<http://en.wikipedia.org/wiki/K-means%2B%2B> or XMeans
<http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf>, or compared their
performance?
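
For reference, this is roughly the KMeans++ seeding step I have in mind: the
first seed is drawn uniformly, and each later seed is drawn with probability
proportional to its squared distance from the nearest seed chosen so far.
It is only a sketch in plain Python under that assumption, not MLPACK code,
and kmeanspp_seeds is a hypothetical name.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial centroids from points using D^2-weighted sampling.

    Assumes points contains at least k distinct entries; otherwise all
    weights can reach zero and random.choices raises ValueError.
    """
    rng = random.Random(seed)
    seeds = [rng.choice(points)]  # first seed: uniform over the data
    # d2[i] is the squared distance from points[i] to its nearest seed.
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        # Points far from every existing seed are more likely to be drawn;
        # points equal to an existing seed have weight zero.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        seeds.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return seeds
```

The returned seeds would then replace the random starting centroids of an
ordinary KMeans run.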

Thank you!

By the way, I really like the project, the coding style, and the nice
documentation. Thank you for making it available to us!

Best,
Liu