mlpack_kmeans - k-means clustering


mlpack_kmeans [-h] [-v]


This program performs K-Means clustering on the given dataset, storing the learned cluster assignments either as a column of labels in the file containing the input dataset or in a separate file. Empty clusters are not allowed by default; when a cluster becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill that cluster.

Optionally, the Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998) can be used to select initial points by specifying the --refined_start (-r) option. This approach works by taking random samples of the dataset; the --samplings (-S) parameter sets the number of samples, and the --percentage (-p) parameter sets the fraction of the dataset used in each sample (a value between 0.0 and 1.0).
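
To make the interaction between the number of samples and the per-sample fraction concrete, the following Python sketch draws the random subsamples only; it does not perform the clustering and refinement steps of the Bradley and Fayyad method. The function name draw_samples, the rng_seed argument, sampling without replacement, and truncating the sample size to an integer are all assumptions made for illustration, not a description of mlpack's implementation. The default values mirror the documented defaults of 100 samplings and a fraction of 0.02.

```python
import random


def draw_samples(points, samplings=100, percentage=0.02, rng_seed=42):
    """Draw `samplings` random subsets, each holding a `percentage` fraction of `points`.

    Hypothetical helper: sampling without replacement and truncating the
    sample size are assumptions, not mlpack's documented behavior.
    """
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in (0.0, 1.0]")
    rng = random.Random(rng_seed)
    # Each sample holds at least one point, even for very small datasets.
    size = max(1, int(percentage * len(points)))
    return [rng.sample(points, size) for _ in range(samplings)]


# A dataset of 10,000 one-dimensional points.
dataset = [(float(i),) for i in range(10000)]
samples = draw_samples(dataset)
print(len(samples), len(samples[0]))
```

With the defaults, a dataset of 10,000 points yields 100 samples of 200 points each, so the cost of the refined start grows with both parameters.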

There are several options available for the algorithm used for each Lloyd iteration, specified with the --algorithm (-a) option. The standard O(kN) approach can be used (’naive’). Other options include the Pelleg-Moore tree-based algorithm (’pelleg-moore’), Elkan’s triangle-inequality based algorithm (’elkan’), Hamerly’s modification to Elkan’s algorithm (’hamerly’), the dual-tree k-means algorithm (’dualtree’), and the dual-tree k-means algorithm using the cover tree (’dualtree-covertree’).
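
As a point of reference for what the other algorithms accelerate, here is a minimal Python sketch of one standard Lloyd iteration in the O(kN) style: every point is compared against every centroid, then each centroid is moved to the mean of its points. The function names are hypothetical, and keeping the old centroid for a cluster that owns no points is a simplification chosen for this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_naive(points, centroids):
    """Label each point with the index of its nearest centroid (k * N distance evaluations)."""
    labels = []
    for p in points:
        best, best_d = 0, float("inf")
        for j, c in enumerate(centroids):
            d = sq_dist(p, c)
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels


def recompute_centroids(points, labels, centroids):
    """Move each centroid to the mean of its points; a cluster with no points keeps its old centroid."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += p[d]
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centroids[j])
        for j in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
labels = assign_naive(points, centroids)
new_centroids = recompute_centroids(points, labels, centroids)
print(labels, new_centroids)
```

The tree-based and triangle-inequality-based options produce the same kind of assignment while avoiding many of the point-to-centroid distance computations in the inner loop.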

The behavior for when an empty cluster is encountered can be modified with the --allow_empty_clusters (-e) option. When this option is specified and there is a cluster owning no points at the end of an iteration, that cluster’s centroid will simply remain in its position from the previous iteration. If the --kill_empty_clusters (-E) option is specified, then when a cluster owns no points at the end of an iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest of the computation. Note that the default option when neither empty cluster option is specified can be time-consuming to calculate; therefore, specifying -e or -E will often accelerate runtime.
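
The three behaviors can be compared in a short Python sketch. It is an illustration under stated assumptions, not mlpack's code: the function name and the policy strings are hypothetical, and "variance" is taken here to be the mean squared distance of a cluster's points to its centroid. The sketch also shows why the default is the more expensive choice, since it scans every point while the other two policies do constant work per empty cluster.

```python
import sys

DBL_MAX = sys.float_info.max  # largest finite double, the Python counterpart of C's DBL_MAX


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fill_empty_cluster(points, labels, centroids, empty, policy="default"):
    """Return the centroid to use for cluster `empty`, which owns no points.

    `centroids[empty]` holds that cluster's position from the previous
    iteration; the other entries hold the current means. The "default"
    policy also updates `labels` in place.
    """
    dim = len(centroids[empty])
    if policy == "allow":  # like --allow_empty_clusters (-e)
        return list(centroids[empty])
    if policy == "kill":  # like --kill_empty_clusters (-E)
        return [DBL_MAX] * dim
    # Default: take the point furthest from the centroid of the
    # cluster with maximum variance (assumed: mean squared distance).
    members = {}
    for i, label in enumerate(labels):
        members.setdefault(label, []).append(i)

    def variance(j):
        return sum(sq_dist(points[i], centroids[j]) for i in members[j]) / len(members[j])

    donor = max(members, key=variance)
    furthest = max(members[donor], key=lambda i: sq_dist(points[i], centroids[donor]))
    labels[furthest] = empty
    return list(points[furthest])


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 11.0), (10.0, 21.0)]
centroids = [[0.0, 1.0], [10.0, 14.0], [5.0, 5.0]]  # cluster 2 owns no points
labels = [0, 0, 1, 1, 1]
print(fill_empty_cluster(points, labels, centroids, empty=2))
```

In this example cluster 1 has the largest variance, so its furthest point, (10.0, 21.0), is moved into the empty cluster.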

As of October 2014, the --overclustering option has been removed. If you want this support back, let us know---file a bug at or get in touch through another means.


--clusters (-c) [int]

Number of clusters to find (0 autodetects from initial centroids).

--input_file (-i) [string]

Input dataset to perform clustering on.


--algorithm (-a) [string]

Algorithm to use for the Lloyd iteration (’naive’, ’pelleg-moore’, ’elkan’, ’hamerly’, ’dualtree’, or ’dualtree-covertree’). Default value ’naive’.

--allow_empty_clusters (-e)

Allow empty clusters to persist.

--help (-h)

Default help info.

--in_place (-P)

If specified, a column containing the learned cluster assignments will be added to the input dataset file. In this case, --output_file is overridden.

--info [string]

Get help on a specific module or option. Default value ’’.

--initial_centroids (-I) [string]

Start with the specified initial centroids. Default value ’’.

--kill_empty_clusters (-E)

Remove empty clusters when they occur.

--labels_only (-l)

Only output labels into output file.

--max_iterations (-m) [int]

Maximum number of iterations before k-means terminates. Default value 1000.

--percentage (-p) [double]

Percentage of dataset to use for each refined start sampling (use when --refined_start is specified). Default value 0.02.

--refined_start (-r)

Use the refined initial point strategy by Bradley and Fayyad to choose initial points.

--samplings (-S) [int]

Number of samplings to perform for refined start (use when --refined_start is specified). Default value 100.

--seed (-s) [int]

Random seed. If 0, ’std::time(NULL)’ is used. Default value 0.

--verbose (-v)

Display informational messages and the full list of parameters and timers at the end of execution.

--version (-V)

Display the version of mlpack.


--centroid_file (-C) [string]

If specified, the centroids of each cluster will be written to the given file. Default value ’’.

--output_file (-o) [string]

File to write output labels or labeled data to. Default value ’’.



For further information, including relevant papers, citations, and theory, consult the documentation found at or included with your distribution of mlpack.