[mlpack] DBSCAN unusually slow...or not?

Thu Jun 8 14:25:39 EDT 2017

On Thu, Jun 08, 2017 at 05:58:07PM +0200, Patrick Marais wrote:
> Hi Ryan,
> 
> Thanks for the quick reply and suggestions.
> 
> I suspect you're right about the neighbor query getting too many points; I
> tried reducing it massively, and now it runs for about 15 minutes
> and then seg faults. See below. The number of points is a bit higher than
> I stated: 542326, but the input is an arma::mat with this number of columns
> and rows=63. I'm fairly sure I have checked everything to remove NaNs and
> so on. Could it be possible that the size of the data set is causing
> something to fail? The memory usage was not maxed out at this point (only
> about 2.5GB of 8GB).
> 
> I just ran it through gdb, which doesn't seem to have all the debug
> information, so besides the place at which it crashed, I can't say much
> else.
> 
> Not sure what to try next. If I remove the call to dbscan and use Kmeans,
> everything works (although the clusters are not what I'd really like).

Hmm, not sure exactly what the issue is here.  You may have uncovered a
bug.  Is there any chance I can get the dataset to try and reproduce the
failure?

Another sanity check would be to try an even smaller epsilon; if it's
still taking 15 minutes, then each range search may still be returning
a very large number of points.
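
One way to make that check concrete before rerunning DBSCAN is to measure
how many neighbors a given epsilon returns per point. Below is a minimal
sketch, assuming column-major storage like arma::mat (one point per
column). The helper name AverageNeighborCount and its parameters are
invented for illustration and are not part of mlpack; it is plain
brute-force distance counting over a small sample of query points, not
the tree-based range search that DBSCAN uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not mlpack API): estimate how many neighbors fall
// within radius `epsilon` (inclusive) of a query point, averaged over an
// evenly spaced sample of `numQueries` query points.
//
// `data` is column-major, as in arma::mat: point j occupies
// data[j * dim] .. data[j * dim + dim - 1].
double AverageNeighborCount(const std::vector<double>& data,
                            const std::size_t dim,
                            const std::size_t numPoints,
                            const double epsilon,
                            const std::size_t numQueries)
{
  assert(dim > 0 && numPoints > 0 && numQueries > 0);
  assert(data.size() == dim * numPoints);

  // Never sample more query points than there are points.
  const std::size_t queries = numQueries < numPoints ? numQueries : numPoints;
  const std::size_t stride = numPoints / queries;
  // Compare squared distances to avoid a square root per pair.
  const double epsSq = epsilon * epsilon;

  std::size_t total = 0;
  for (std::size_t q = 0; q < queries; ++q)
  {
    const std::size_t i = q * stride;
    for (std::size_t j = 0; j < numPoints; ++j)
    {
      if (j == i)
        continue; // Do not count the query point as its own neighbor.

      double distSq = 0.0;
      for (std::size_t d = 0; d < dim; ++d)
      {
        const double diff = data[i * dim + d] - data[j * dim + d];
        distSq += diff * diff;
      }

      if (distSq <= epsSq)
        ++total;
    }
  }

  return static_cast<double>(total) / static_cast<double>(queries);
}
```

If the average comes back as a large fraction of the dataset, epsilon is
probably still too big for DBSCAN to run quickly. Each query scans every
point, so keep the query sample small on a dataset of this size.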

Clustering is a hard problem, and there's definitely a big tradeoff
between "fast and bad" (k-means lives here) and "slow but good" (maybe
you could say this about DBSCAN, but even DBSCAN is faster than some
things like spectral clustering methods).

-- 
Ryan Curtin    | "Bye-bye, goofy woman.  I enjoyed repeatedly
ryan at ratml.org | throwing you to the ground." - Ben Jabituya