[mlpack] One-class Classification

Ryan Curtin ryan at ratml.org
Tue May 1 13:54:11 EDT 2018


On Tue, May 01, 2018 at 12:48:17AM +0000, Germán Lancioni wrote:
> Hi MLPACK team,
> 
> Congratulations on the fantastic work you are doing. After trying several options, I find MLPACK the most professional, well-written and maintained framework for ML. I have been using Naive Bayes, Random Forest, k-fold CV and model saving. Now I'm wondering if MLPACK somehow supports one-class classification (e.g. one-class SVM), as I have an anomaly detection problem at hand. I tried going through the API doc but couldn't find anything in that regard.
> 
> I appreciate any input, and again cheers for the outstanding work.

Hi Germán,

Thanks for the nice words about mlpack.  I'm glad that you've found it
useful.

At the moment, we don't have any out-of-the-box one-class classification
techniques implemented.  However, at its core, anomaly detection can be
expressed as the following question: assuming my data came from some
distribution D, how likely is it that a given point was also drawn from
D?  More specifically, if we have a probability density estimate
p(x | D) for a point x, we can say "if p(x | D) < threshold, then x is
an anomaly".  Building on that idea, there are a few things in mlpack
you could use:

 * You could use density estimation trees if your data is
   low-dimensional to compute densities of points.  That code is found
   in src/mlpack/methods/det/.

 * KDE (kernel density estimation) is being implemented now in
   https://github.com/mlpack/mlpack/pull/1301, and you could use that to
   do much the same thing.  I think it should work in its current state,
   but it may be easier to wait until it is done and merged.

 * You could build a feedforward network autoencoder and use, e.g., the
   MSE between an input and its reconstruction as a measure of anomaly.
   Here's a similar example:
   https://shiring.github.io/machine_learning/2017/05/01/fraud

 * You could use k-furthest-neighbors (kFN) to compute the mean kFN
   distance for each point; that *could* serve as a measure of
   outlier-ness.

 * This is a little different, but maybe you could use DBSCAN with a
   properly tuned radius, and points that get classified as "noise"
   (i.e. they are far away from any cluster) could be considered
   anomalies.

 * The last option (probably the most time consuming) would be to
   implement a technique and then we can merge it into mlpack so long as
   it's well tested and fast. :)
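To make the "p(x | D) < threshold" idea concrete, here is a minimal
self-contained sketch (plain C++, no mlpack calls; the bandwidth and
threshold are arbitrary placeholders you would need to tune) that scores
a point with a 1-D Gaussian kernel density estimate and flags it when
the density falls below a threshold:

```cpp
#include <cmath>
#include <vector>

// Gaussian kernel density estimate of p(x | D) from a 1-D sample.
double KDEDensity(const std::vector<double>& sample,
                  const double x,
                  const double bandwidth)
{
  double sum = 0.0;
  for (const double xi : sample)
  {
    const double u = (x - xi) / bandwidth;
    sum += std::exp(-0.5 * u * u) / std::sqrt(2.0 * M_PI);
  }
  return sum / (sample.size() * bandwidth);
}

// Flag x as an anomaly when its estimated density is below the threshold.
bool IsAnomaly(const std::vector<double>& sample,
               const double x,
               const double bandwidth,
               const double threshold)
{
  return KDEDensity(sample, x, bandwidth) < threshold;
}
```

Both the bandwidth and the threshold need tuning, e.g. against a
held-out set of known-normal points.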
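For the kFN idea, a rough (untested) sketch using mlpack's
NeighborSearch with the furthest-neighbor policy (the KFN typedef); the
choice of k is just an example:

```cpp
#include <mlpack/core.hpp>
#include <mlpack/methods/neighbor_search/neighbor_search.hpp>

// Mean distance from each point to its k furthest neighbors, as a crude
// per-point outlier score: larger means the point sits farther from the
// bulk of the data.  (Columns are points, per the mlpack convention.)
arma::rowvec KFNOutlierScores(const arma::mat& data, const size_t k)
{
  mlpack::neighbor::KFN kfn(data);
  arma::Mat<size_t> neighbors;  // k x n matrix of neighbor indices.
  arma::mat distances;          // k x n matrix of distances.
  kfn.Search(k, neighbors, distances);
  return arma::mean(distances, 0);  // Mean over the k neighbors.
}
```

You could then sort the scores and inspect the points with the largest
values, or pick a threshold above which a point is called an outlier.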
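And for the DBSCAN option, a sketch of how the "noise" assignments
could be read off as anomalies (epsilon and minPoints below are
placeholders that would need tuning for your data):

```cpp
#include <vector>

#include <mlpack/core.hpp>
#include <mlpack/methods/dbscan/dbscan.hpp>

// Run DBSCAN and return the indices of points it labels as noise
// (assignment SIZE_MAX), i.e. points far from every cluster.
std::vector<size_t> DBSCANAnomalies(const arma::mat& data,
                                    const double epsilon,
                                    const size_t minPoints)
{
  mlpack::dbscan::DBSCAN<> dbscan(epsilon, minPoints);
  arma::Row<size_t> assignments;
  dbscan.Cluster(data, assignments);

  std::vector<size_t> anomalies;
  for (size_t i = 0; i < assignments.n_elem; ++i)
    if (assignments[i] == SIZE_MAX)
      anomalies.push_back(i);
  return anomalies;
}
```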

Maybe some of these things work for your situation, maybe not.  In any
case I hope that the ideas are helpful, and let me know if I can clarify
anything.

-- 
Ryan Curtin    | "This is how Number One works!"
ryan at ratml.org |   - Number One

