[mlpack] DBSCAN unusualy slow...or not?
Patrick Marais
patrickcmarais at gmail.com
Thu Jun 8 09:17:17 EDT 2017
Hello,
I've just started using mlpack, and wished to cluster 512,000 points, in a
63-dim feature space. Kmeans does this in < 15 sec on my laptop. I tried
the same with DBSCAN and killed it at about 700 sec while it was still (I
think) churning away (and consuming a lot of memory). I thought the use of
a Rangesearch kd-tree should have reduced this to something fairly quick? I
copied example code from the command line version of dbsacn, since there is
no tutorial. My understanding is that building a default RangeSearch should
give you a kd-tree? That seems to be what the code shows too.
I was wondering if my choice of epsilon/minPts could have such a dramatic
effect. I histogrammed the distance between all point pairs and set epsilon
based on this, minpts I set to 64, i.e.dim+1 as per advice I had seen
elsewhere.
So, I guess my questions are
1) Is the DBSCAN somehow using a naive/brute force search?
2) is DBSCAN intrinsically slow for lage numbers of high dimensional
points? I hear of people clustering 10's of millions if points, so I didn't
think my data set was unreasonably large.
I'd be very disappointed if (2) was indeed the case.
Any advice/comments appreciated. I've included he code snippet below.
Thanks,
Patrick
std::cerr << "Starting DBSCAN clustering for " << trainingData.n_cols <<
" samples...\n";
mlpack::range::RangeSearch<> RS; // default KDtree range search
acceleration
// if minPts == 0, then compute actual minPts used as Dim+1, where dim is
the dimensio of the feature vector
int minclusterpts = (minClustSize < 1 ? trainingData.n_rows+1 :
minClustSize);
mlpack::dbscan::DBSCAN<mlpack::range::RangeSearch<> > dbs(epsilon,
minclusterpts, RS);
// do actual clustering
int numCluster = (int)dbs.Cluster(trainingData, assignments);
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://knife.lugatgt.org/pipermail/mlpack/attachments/20170608/1eea93f6/attachment.html>
More information about the mlpack
mailing list