[mlpack] DBSCAN unusualy slow...or not?

Patrick Marais patrickcmarais at gmail.com
Thu Jun 8 11:58:07 EDT 2017


HI Ryan,

Thanks for the quick reply and suggestions.

I suspect you're right about the neighbor query getting too many points; I
tried reducing it massively, and now it runs for about 15 minutes
 and the seg faults. See below - The number of points is a bit higher than
I stated: 542326, but the input is an arma::mat with this number of columns
and rows=63. I'm fairly sure I have checked everything to remove NaN's and
so on. Could it be possible that the size of the data set is causing
something to fail? The memory usage was not maxed out at this point (only
about 2.5GB over 8GB).

I just ran it through gdb, which doesn't seem to have all the debug
information, so besides the place at which it crashed, I can't say much
else.

Not sure what to try next. If I remove the cal to dbscan and use Kmeans,
everything works (although the clusters are not what I'd really like).

Any ideas or suggestions would be very welcome.

Thanks,
Patrick


#0  0x0000000000438844 in
mlpack::dbscan::DBSCAN<mlpack::range::RangeSearch<mlpack::metric::LMetric<2,
true>, arma::Mat<double>, mlpack::tree::KDTree>,
mlpack::dbscan::RandomPointSelection>::ProcessPoint<arma::Mat<double> > (
    this=this at entry=0x7fffffffd580, data=..., unvisited=...,
    index=<optimized out>, assignments=...,
    currentCluster=currentCluster at entry=0,
    neighbors=std::vector of length 542329, capacity 542329 = {...},
    distances=std::vector of length 542329, capacity 542329 = {...},
    topLevel=false)
    at /usr/local/include/mlpack/methods/dbscan/dbscan_impl.hpp:174
174          ProcessPoint(data, unvisited, neighbors[index][j], assignments,

On Thu, Jun 8, 2017 at 3:52 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Thu, Jun 08, 2017 at 03:17:17PM +0200, Patrick Marais wrote:
> > Hello,
> >
> > I've just started using mlpack, and wished to cluster 512,000 points, in
> a
> > 63-dim feature space. Kmeans does this in < 15 sec on my laptop. I tried
> > the same with DBSCAN and  killed it at about 700 sec while it was still
> (I
> > think) churning away (and consuming  a lot of memory). I thought the use
> of
> > a Rangesearch kd-tree should have reduced this to something fairly
> quick? I
> > copied example code from the command line version of dbsacn, since there
> is
> > no tutorial. My understanding is that building a default RangeSearch
> should
> > give you a kd-tree? That seems  to be what the code shows too.
> >
> > I was wondering if my choice of epsilon/minPts could have such a dramatic
> > effect. I histogrammed the distance between all point pairs and set
> epsilon
> > based on this, minpts I set to 64, i.e.dim+1 as per advice I had seen
> > elsewhere.
> >
> > So, I guess my questions are
> > 1) Is the DBSCAN somehow using a naive/brute force search?
> > 2) is DBSCAN intrinsically slow for lage numbers of high dimensional
> > points? I hear of people clustering 10's of millions if points, so I
> didn't
> > think my data set was unreasonably large.
> >
> > I'd be very disappointed if (2) was indeed the case.
> >
> > Any advice/comments appreciated. I've included he code snippet below.
>
> Hi Patrick,
>
> You are right that DBSCAN uses kd-tree-based range search as a default.
> In fact you can simplify your code a little bit:
>
> mlpack::dbscan::DBSCAN<> dbs(epsilon, minclusterpts);
>
> The problem that you are encountering is intrinsic to DBSCAN but it is
> solvable.  When I perform a range search with some radius, for DBSCAN I
> must collect the indices of every point that lies within that range.
> Thus for 512k points, supposing that I choose a large enough epsilon,
> then the range search will return every single other point in the
> dataset, and the RAM usage to do that will be non-negligible.
>
> Therefore, reducing epsilon significantly can go a long way towards
> reducing the computational cost of DBSCAN.  I would try reducing epsilon
> by an order of magnitude or more and seeing if this improves your
> runtime while still giving acceptable clustering results.
>
> If epsilon can't be reduced to a small enough value that gives good
> clustering results in reasonable time, then it's possible that DBSCAN is
> not a good fit for the dataset you are clustering.
>
> I hope this helps; let me know if I can clarify anything else.
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "...wildcat... ...wild... cat... ... ...pow...
> ryan at ratml.org | ...wildcat... I'm going to go."  - Eli Cash
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://knife.lugatgt.org/pipermail/mlpack/attachments/20170608/e04b44bb/attachment.html>


More information about the mlpack mailing list