[mlpack] DBSCAN unusually slow...or not?

Ryan Curtin ryan at ratml.org
Wed Jun 14 17:40:39 EDT 2017


On Fri, Jun 09, 2017 at 01:07:27PM +0200, Patrick Marais wrote:
> Hi Ryan,
> 
> I have made the data set (273MB)  available (saved with
> mlpack::data::Save() ) at
> 
> https://drive.google.com/open?id=0B_8-awG7IJzyT2xaU3RQZDhmYzQ
> 
> This should be 542K points of dim=63.
> 
> I ran DBSCAN with epsilon=1 and minPts=64.
> 
> I have also attached an image (and left it in the dir) which is a histogram
> of inter-point distances under the regular Euclidean metric, so epsilon
> = 1.0 seems like it should not have returned that many points?
> 
> Hopefully you can reproduce the error - please let me know if you need
> anything more.
> 
> I checked for NaN and INF in the data matrix, and everything seems to be fine.

Hi Patrick,

Thanks for the dataset.  I found multiple issues, and solved each of
them.  First, some code simply needed to be updated, but that part was
easy.

Second, I found that the step that turned the range search results into
assignments for each point was inefficient, and could be significantly
improved with a union-find structure.  Without that change, the old
implementation could also overflow the stack, which happened with your
dataset for large epsilon values.  Happily, mlpack already has a
union-find structure implemented, so I just plugged that in and got a
significant speedup.  Now the vast majority of the runtime on your
dataset is the range search itself.
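
For reference, the new assignment step looks roughly like this (just a
sketch, not the exact committed code; UnionFind here is the class from
mlpack's EMST code, whose header location may differ by version, and
I'm skipping noise/border point handling):

#include <mlpack/core.hpp>
#include <mlpack/methods/emst/union_find.hpp>
#include <mlpack/methods/range_search/range_search.hpp>

using namespace mlpack;

// Turn batch range search results into cluster assignments using
// union-find instead of recursive cluster expansion.
arma::Row<size_t> Cluster(const arma::mat& data,
                          const double epsilon,
                          const size_t minPoints)
{
  // One monochromatic range search over the whole dataset.
  range::RangeSearch<> search(data);
  std::vector<std::vector<size_t>> neighbors;
  std::vector<std::vector<double>> distances;
  search.Search(math::Range(0.0, epsilon), neighbors, distances);

  // Union every core point with its neighbors; no recursion needed,
  // so deep clusters can't overflow the stack.
  emst::UnionFind uf(data.n_cols);
  for (size_t i = 0; i < data.n_cols; ++i)
    if (neighbors[i].size() >= minPoints)
      for (const size_t j : neighbors[i])
        uf.Union(i, j);

  // Find() returns a cluster representative for each point.
  arma::Row<size_t> assignments(data.n_cols);
  for (size_t i = 0; i < data.n_cols; ++i)
    assignments[i] = uf.Find(i);
  return assignments;
}

Since Union() and Find() are nearly constant amortized time, the
assignment step is negligible next to the range search itself.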

Third, I found that with larger epsilon, the system does indeed run out
of RAM, because each point has too many neighbors in the range search
results.  E.g., if each point has 1k neighbors and there are 500k
points, then we need an array of 500M entries to hold them all (about
4GB for the indices alone, at 8 bytes each)...

I thought about it briefly and realized that with the union-find
structure, we do not need to hold the range search results for every
single point at once; instead, we can perform an individual range
search for each point in the dataset.  That might be slower (depending
on the dataset and lots of other things), but it will always use less
RAM.
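
A sketch of that variant, with the same caveats and includes as above
(querying one point at a time like this is my illustration of the idea,
not the exact implementation):

// Low-memory variant: one range search per point, so only a single
// point's neighbor list is held in memory at any time.
void ClusterLowMemory(const arma::mat& data,
                      const double epsilon,
                      const size_t minPoints,
                      emst::UnionFind& uf)
{
  range::RangeSearch<> search(data);
  std::vector<std::vector<size_t>> neighbors;
  std::vector<std::vector<double>> distances;
  for (size_t i = 0; i < data.n_cols; ++i)
  {
    // Query with just one point; the result vectors are overwritten on
    // each call, so peak memory is proportional to the largest single
    // neighbor set rather than the sum over all points.
    search.Search(arma::mat(data.col(i)), math::Range(0.0, epsilon),
                  neighbors, distances);
    if (neighbors[0].size() >= minPoints)
      for (const size_t j : neighbors[0])
        uf.Union(i, j);
  }
}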

The --single_mode option selects that behavior.

So, for instance, now I can run like this:

$ bin/mlpack_dbscan -i ~/Downloads/big-data-rows_63_cols_542K.bin -e 0.9 -v --single_mode -C c.csv
[INFO ] Loading '/home/ryan/Downloads/big-data-rows_63_cols_542K.bin' as Armadillo binary formatted data.  Size is 63 x 542329.
[INFO ] DBSCAN clustering on point 10000...
<...skipped...>
[INFO ] DBSCAN clustering on point 540000...
[INFO ] 1154 clusters found.
[INFO ] Saving CSV data to 'c.csv'.
[INFO ] 
[INFO ] Execution parameters:
[INFO ]   assignments_file: 
[INFO ]   centroids_file: c.csv
[INFO ]   epsilon: 0.9
[INFO ]   help: 0
[INFO ]   info: 
[INFO ]   input_file: /home/ryan/Downloads/big-data-rows_63_cols_542K.bin
[INFO ]   min_size: 5
[INFO ]   naive: 0
[INFO ]   single_mode: 1
[INFO ]   tree_type: kd
[INFO ]   verbose: 1
[INFO ]   version: 0
[INFO ] Program timers:
[INFO ]   loading_data: 0.324149s
[INFO ]   range_search/computing_neighbors: 511.964841s (8 mins, 31.9 secs)
[INFO ]   saving_data: 0.049956s
[INFO ]   total_time: 522.897931s (8 mins, 42.8 secs)

For your dataset it seems like --single_mode is faster, but that's not
necessarily true for all data (especially as the dimensionality gets
lower).

To use the new code, you can clone from the git master branch at
https://github.com/mlpack/mlpack.git and make the 'mlpack_dbscan'
target.  The updated code will either be released soon-ish as part of an
mlpack 2.2.4 backport release, or as part of mlpack 3.0.0 (though the
3.0.0 release may still be some months off, not sure).
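
That is, roughly:

$ git clone https://github.com/mlpack/mlpack.git
$ cd mlpack && mkdir build && cd build
$ cmake ../
$ make mlpack_dbscan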

Let me know if I can clarify anything.

Thanks for the report!  I think mlpack's DBSCAN implementation is
significantly improved as a result.

Ryan

-- 
Ryan Curtin    | "I am."
ryan at ratml.org |   - Joe

