[mlpack] DBSCAN unusualy slow...or not?

Thu Jun 8 09:52:14 EDT 2017

On Thu, Jun 08, 2017 at 03:17:17PM +0200, Patrick Marais wrote:
> Hello,
> 
> I've just started using mlpack, and wished to cluster 512,000 points, in a
> 63-dim feature space. Kmeans does this in < 15 sec on my laptop. I tried
> the same with DBSCAN and  killed it at about 700 sec while it was still (I
> think) churning away (and consuming  a lot of memory). I thought the use of
> a Rangesearch kd-tree should have reduced this to something fairly quick? I
> copied example code from the command line version of dbsacn, since there is
> no tutorial. My understanding is that building a default RangeSearch should
> give you a kd-tree? That seems  to be what the code shows too.
> 
> I was wondering if my choice of epsilon/minPts could have such a dramatic
> effect. I histogrammed the distance between all point pairs and set epsilon
> based on this, minpts I set to 64, i.e.dim+1 as per advice I had seen
> elsewhere.
> 
> So, I guess my questions are
> 1) Is the DBSCAN somehow using a naive/brute force search?
> 2) is DBSCAN intrinsically slow for lage numbers of high dimensional
> points? I hear of people clustering 10's of millions if points, so I didn't
> think my data set was unreasonably large.
> 
> I'd be very disappointed if (2) was indeed the case.
> 
> Any advice/comments appreciated. I've included he code snippet below.

Hi Patrick,

You are right that DBSCAN uses kd-tree-based range search as a default.
In fact you can simplify your code a little bit:

mlpack::dbscan::DBSCAN<> dbs(epsilon, minclusterpts);

The problem that you are encountering is intrinsic to DBSCAN but it is
solvable.  When I perform a range search with some radius, for DBSCAN I
must collect the indices of every point that lies within that range.
Thus for 512k points, supposing that I choose a large enough epsilon,
then the range search will return every single other point in the
dataset, and the RAM usage to do that will be non-negligible.

Therefore, reducing epsilon significantly can go a long way towards
reducing the computational cost of DBSCAN.  I would try reducing epsilon
by an order of magnitude or more and seeing if this improves your
runtime while still giving acceptable clustering results.

If epsilon can't be reduced to a small enough value that gives good
clustering results in reasonable time, then it's possible that DBSCAN is
not a good fit for the dataset you are clustering.

I hope this helps; let me know if I can clarify anything else.

Thanks,

Ryan

-- 
Ryan Curtin    | "...wildcat... ...wild... cat... ... ...pow...
ryan at ratml.org | ...wildcat... I'm going to go."  - Eli Cash