[mlpack] Is scanning business transactions for fraud an appropriate use of MLpack?

Sun Nov 11 17:39:20 EST 2018

Thanks, Ryan.

I'm packaging this into a proposal to my colleagues, and I hope to get
buy-in.

           Regards, Rick

On Wed, Nov 7, 2018 at 10:27 PM Ryan Curtin <ryan at ratml.org> wrote:

> On Wed, Nov 07, 2018 at 01:27:59PM -0500, Rick Hedin wrote:
> > Hi, Ryan.
> >
> > Hmm.  Interesting.  Always numeric, eh?
>
> Yeah, mlpack is built on the Armadillo matrix library.  I'm not familiar
> with CRM114 but I would imagine it is doing something that amounts to
> one-hot encoding.
>
> > I could convert our data into numeric values, as you suggest, but I have
> > some misgivings.  Maybe I should say some more about the data records
> > that stream in.
> >
> > 1.  Some of our fields are encoded values.  So for example SERVICE_TYPE =
> > PSYCHIC_READING.  With other possible values being PALM_READING,
> > CASTING_STICKS, TAROT_CARDS.  (We don't really offer psychic readings.)
> >
> > We should be able to assign a number to each of the possible values
> > without any problem.
>
> Right.  The typical way to handle something like this in a machine
> learning library would be one-hot encoding, where instead of having one
> dimension for SERVICE_TYPE, you'd have a dimension
> SERVICE_TYPE_PSYCHIC_READING that takes a value 0/1,
> SERVICE_TYPE_PALM_READING that takes value 0/1, etc., etc. for each
> possibility.  This makes your data matrix pretty big though.
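
The one-hot encoding described above can be sketched in a few lines of
Python.  This is only an illustration: the helper names one_hot and
column_names are made up for this sketch and are not part of mlpack; the
category values are the ones from the SERVICE_TYPE example.

```python
# Category values taken from the SERVICE_TYPE example in this thread.
SERVICE_TYPES = ["PSYCHIC_READING", "PALM_READING",
                 "CASTING_STICKS", "TAROT_CARDS"]


def one_hot(value, categories):
    """Return one 0/1 entry per category, with 1.0 only at `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def column_names(field, categories):
    """Name each new dimension, e.g. SERVICE_TYPE_PALM_READING."""
    return [f"{field}_{c}" for c in categories]


# Hypothetical usage: one categorical field becomes four 0/1 dimensions.
names = column_names("SERVICE_TYPE", SERVICE_TYPES)
row = one_hot("PALM_READING", SERVICE_TYPES)
```

Each extra possible value adds one more dimension, which is why the data
matrix grows quickly for fields with many categories.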
>
> There are a few mlpack algorithms like the decision tree
> (mlpack_decision_tree from the command line) that support categorical
> variables, which can be loaded with the .arff file format.
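
For the categorical route, an ARFF file declares a nominal attribute by
listing its allowed values in braces.  The sketch below writes such a file
as plain text; to_arff is a hypothetical helper, the relation name
"transactions" and the second sample row are invented for illustration,
and a real training file would presumably also need a label column.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, spec) pairs, where spec is the string
                "numeric" or a list of allowed nominal values.
    rows:       list of records, each a list of values in attribute order.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal (categorical) attribute: list the allowed values.
            values = ",".join(spec)
            lines.append("@attribute " + name + " {" + values + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical usage with the two fields mentioned in this thread.
arff_text = to_arff(
    "transactions",
    [
        ("SERVICE_TYPE", ["PSYCHIC_READING", "PALM_READING",
                          "CASTING_STICKS", "TAROT_CARDS"]),
        ("AMOUNT_CHARGED", "numeric"),
    ],
    [["PALM_READING", 9.95], ["TAROT_CARDS", 20.0]],  # made-up sample rows
)
```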
>
> > 2.  Some of our fields are numeric values.  So for example
> > AMOUNT_CHARGED = 9.95.
> >
> > I bet mlpack could handle these directly.
>
> Yep, no need to change these.
>
> > 3.  Some of our fields are free text fields.  For example COMMENT =
> > "Customer seemed agitated.  I couldn't get a clear reading."
> >
> > We could create a big dictionary, and map words to numbers.  Leaving out
> > stemming and phrases.  But that would truly be a very big dictionary.
> > And it's quite likely that information in the comment might be useful for
> > determining whether the transaction is fraudulent.
> >
> > My previous experience, CRM114, handles text swimmingly.  But it doesn't
> > handle numeric fields at all.  (Other than as a very peculiar looking
> > number.)  Perhaps neither engine is really appropriate.
> >
> > Does this information about our fields jog loose any additional ideas?
>
> With text data, dictionary encoding is typically how it's done.
> Alternatively, it could be done at the character level so the dictionary
> is small, but then you need a powerful modeling technique to learn the
> different dependencies between letters.  I am not an NLP expert, so I
> can't say what will be best for your problem, but something like
> dictionary encoding could work.
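
A minimal sketch of word-level dictionary encoding, assuming a simple
lowercase tokenizer and a word-count vector per comment (a bag-of-words
choice made for this sketch, not something the thread prescribes).  The
function names tokenize, build_dictionary and encode are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def build_dictionary(comments):
    """Assign each distinct word the next unused integer index."""
    dictionary = {}
    for comment in comments:
        for word in tokenize(comment):
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


def encode(comment, dictionary):
    """Return word counts, one slot per dictionary entry.

    Words missing from the dictionary are skipped (an assumption of
    this sketch; a real system might reserve an "unknown" slot).
    """
    counts = [0.0] * len(dictionary)
    for word in tokenize(comment):
        index = dictionary.get(word)
        if index is not None:
            counts[index] += 1.0
    return counts


# Hypothetical usage with the COMMENT example from this thread.
comments = ["Customer seemed agitated.  I couldn't get a clear reading."]
dictionary = build_dictionary(comments)
vector = encode("Customer seemed calm, a clear reading.", dictionary)
```

The vector length equals the dictionary size, so a large vocabulary gives
very wide, mostly zero rows; this is the "very big dictionary" concern
raised above.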
>
> word2vec could be another interesting preprocessing technique, but
> I don't think anyone's currently implemented this model in mlpack.
>
> I think my overall suggestion would be this: once you can get your data
> into a numeric format, mlpack could be used just fine for the actual
> machine learning part, but mlpack doesn't have the best facilities for
> text loading and preprocessing.
>
> Hope this helps!
>
> --
> Ryan Curtin    | "And do not attempt to grow a brain!"
> ryan at ratml.org |   - Sgt. Howard Payne
>