[mlpack] DBSCAN unusualy slow...or not?

Patrick Marais patrickcmarais at gmail.com
Thu Jun 8 09:17:17 EDT 2017


Hello,

I've just started using mlpack, and wished to cluster 512,000 points, in a
63-dim feature space. Kmeans does this in < 15 sec on my laptop. I tried
the same with DBSCAN and  killed it at about 700 sec while it was still (I
think) churning away (and consuming  a lot of memory). I thought the use of
a Rangesearch kd-tree should have reduced this to something fairly quick? I
copied example code from the command line version of dbsacn, since there is
no tutorial. My understanding is that building a default RangeSearch should
give you a kd-tree? That seems  to be what the code shows too.

I was wondering if my choice of epsilon/minPts could have such a dramatic
effect. I histogrammed the distance between all point pairs and set epsilon
based on this, minpts I set to 64, i.e.dim+1 as per advice I had seen
elsewhere.

So, I guess my questions are
1) Is the DBSCAN somehow using a naive/brute force search?
2) is DBSCAN intrinsically slow for lage numbers of high dimensional
points? I hear of people clustering 10's of millions if points, so I didn't
think my data set was unreasonably large.

I'd be very disappointed if (2) was indeed the case.

Any advice/comments appreciated. I've included he code snippet below.

Thanks,
Patrick

  std::cerr << "Starting DBSCAN clustering for " << trainingData.n_cols  <<
" samples...\n";
  mlpack::range::RangeSearch<> RS; // default KDtree range search
acceleration

  // if minPts == 0, then compute actual minPts used as Dim+1, where dim is
the dimensio of the feature vector
  int minclusterpts = (minClustSize < 1 ? trainingData.n_rows+1 :
minClustSize);

  mlpack::dbscan::DBSCAN<mlpack::range::RangeSearch<> > dbs(epsilon,
minclusterpts, RS);

  // do actual clustering

  int numCluster = (int)dbs.Cluster(trainingData, assignments);
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://knife.lugatgt.org/pipermail/mlpack/attachments/20170608/1eea93f6/attachment.html>


More information about the mlpack mailing list