[mlpack] DBSCAN unusually slow...or not?

Patrick Marais patrickcmarais at gmail.com
Fri Jun 9 07:07:27 EDT 2017


Hi Ryan,

I have made the data set (273 MB) available (saved with
mlpack::data::Save()) at

https://drive.google.com/open?id=0B_8-awG7IJzyT2xaU3RQZDhmYzQ

This should be 542K points of dim=63.

I ran DBSCAN with epsilon=1 and minPts=64.
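
For reference, the run was set up along these lines (just a sketch, assuming
mlpack 2.x and Armadillo; "points.bin" is a placeholder for the actual file
name):

    #include <mlpack/core.hpp>
    #include <mlpack/methods/dbscan/dbscan.hpp>
    #include <iostream>

    int main()
    {
      // 63 rows x 542326 columns; mlpack treats each column as one point.
      arma::mat data;
      mlpack::data::Load("points.bin", data, true /* fatal on failure */);

      // epsilon = 1.0, minPts = 64, as described above.
      mlpack::dbscan::DBSCAN<> dbscan(1.0, 64);
      arma::Row<size_t> assignments;
      const size_t numClusters = dbscan.Cluster(data, assignments);

      std::cout << "clusters found: " << numClusters << std::endl;
      return 0;
    }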

I have also attached an image (and left a copy in the same folder), which is
a histogram of inter-point distances under the regular Euclidean metric;
based on that, epsilon = 1.0 seems like it should not have returned that
many points.
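
To spot-check that, counting neighbours within epsilon = 1.0 for a small
random subsample of query points should give a rough idea (a sketch using
mlpack's RangeSearch, which is what DBSCAN uses for its neighbour queries by
default; the subsample size of 200 is arbitrary):

    #include <mlpack/methods/range_search/range_search.hpp>

    // "data" is the 63 x 542326 matrix loaded as above.
    arma::uvec sample = arma::randi<arma::uvec>(200,
        arma::distr_param(0, (int) data.n_cols - 1));
    arma::mat queries = data.cols(sample);

    mlpack::range::RangeSearch<> rs(data);
    std::vector<std::vector<size_t>> neighbors;
    std::vector<std::vector<double>> distances;
    rs.Search(queries, mlpack::math::Range(0.0, 1.0), neighbors, distances);

    size_t total = 0;
    for (const auto& n : neighbors)
      total += n.size();
    std::cout << "average neighbours within epsilon = 1.0: "
              << (double) total / neighbors.size() << std::endl;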

Hopefully you can reproduce the error - please let me know if you need
anything more.

I checked for NaN and Inf in the data matrix, and everything seems to be fine.
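
(For what it's worth, that check amounts to something like the following,
using Armadillo's built-in tests on the loaded matrix:)

    // Quick sanity check on the loaded matrix ("data" as above).
    if (data.has_nan() || data.has_inf())
      std::cerr << "data contains NaN or Inf values!" << std::endl;
    std::cout << "min = " << data.min() << ", max = " << data.max() << std::endl;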

Cheers,
Patrick

On Thu, Jun 8, 2017 at 8:25 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Thu, Jun 08, 2017 at 05:58:07PM +0200, Patrick Marais wrote:
> > Hi Ryan,
> >
> > Thanks for the quick reply and suggestions.
> >
> > I suspect you're right about the neighbor query getting too many
> > points; I tried reducing it massively, and now it runs for about 15
> > minutes and then seg faults. See below - the number of points is a bit
> > higher than I stated: 542326, but the input is an arma::mat with this
> > number of columns and rows=63. I'm fairly sure I have checked
> > everything to remove NaNs and so on. Could it be possible that the size
> > of the data set is causing something to fail? The memory usage was not
> > maxed out at this point (only about 2.5 GB of the 8 GB available).
> >
> > I just ran it through gdb, which doesn't seem to have all the debug
> > information, so besides the place at which it crashed, I can't say much
> > else.
> >
> > Not sure what to try next. If I remove the call to DBSCAN and use
> > k-means, everything works (although the clusters are not what I'd
> > really like).
>
> Hmm, not sure exactly what the issue is here.  You may have uncovered a
> bug.  Is there any chance I can get the dataset to try and reproduce the
> failure?
>
> Another sanity check would be to try an even smaller epsilon; if it's
> still taking 15 minutes, then it still may be finding very many points
> for each range search.
>
> Clustering is a hard problem, and there's definitely a big tradeoff
> between "fast and bad" (k-means lives here) and "slow but good" (maybe
> you could say this about DBSCAN, but even DBSCAN is faster than some
> things like spectral clustering methods).
>
> --
> Ryan Curtin    | "Bye-bye, goofy woman.  I enjoyed repeatedly
> ryan at ratml.org | throwing you to the ground." - Ben Jabituya
>