[mlpack] Is scanning business transactions for fraud an appropriate use of MLpack?

Ryan Curtin ryan at ratml.org
Wed Nov 7 22:27:17 EST 2018


On Wed, Nov 07, 2018 at 01:27:59PM -0500, Rick Hedin wrote:
> Hi, Ryan.
> 
> Hmm.  Interesting.  Always numeric, eh?

Yeah, mlpack is built on the Armadillo matrix library, so all input data
has to be numeric.  I'm not familiar with CRM114, but I would imagine it
does something internally that amounts to one-hot encoding.

> I could convert our data into numeric values, as you suggest, but I have
> some misgivings.  Maybe I should say some more about the data records that
> stream in.
> 
> 1.  Some of our fields are encoded values.  So for example SERVICE_TYPE =
> PSYCHIC_READING.  With other possible values being PALM_READING,
> CASTING_STICKS, TAROT_CARDS.  (We don't really offer psychic readings.)
> 
> We should be able to assign a number to each of the possible values without
> any problem.

Right.  The typical way to handle something like this in a machine
learning library would be one-hot encoding, where instead of having one
dimension for SERVICE_TYPE, you'd have a dimension
SERVICE_TYPE_PSYCHIC_READING that takes a value 0/1,
SERVICE_TYPE_PALM_READING that takes value 0/1, etc., etc. for each
possibility.  This makes your data matrix pretty big though.
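Just to make that concrete, here's a minimal sketch of one-hot encoding in plain Python (illustrative only; the category names come from your example, and this isn't an mlpack utility):

```python
# One-hot encode a categorical field: one 0/1 dimension per possible value.
SERVICE_TYPES = ["PSYCHIC_READING", "PALM_READING", "CASTING_STICKS", "TAROT_CARDS"]

def one_hot(value, categories):
    """Return a 0/1 vector with a 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

row = one_hot("PALM_READING", SERVICE_TYPES)
# row is [0, 1, 0, 0]: four 0/1 dimensions instead of one SERVICE_TYPE
# dimension.
```

So a field with k possible values turns into k columns, which is where the size blowup comes from.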

There are a few mlpack algorithms, like the decision tree
(mlpack_decision_tree from the command line), that support categorical
variables directly; data with categorical features can be loaded from
the .arff file format.
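For reference, a hypothetical .arff file for your data might look like this (attribute names taken from your examples):

```
% Hypothetical example of the .arff format.
@relation transactions

@attribute SERVICE_TYPE {PSYCHIC_READING,PALM_READING,CASTING_STICKS,TAROT_CARDS}
@attribute AMOUNT_CHARGED numeric

@data
PALM_READING,9.95
TAROT_CARDS,19.95
```

The nominal attribute declaration is what tells the loader which dimensions are categorical.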

> 2.  Some of our fields are numeric values.  So for example AMOUNT_CHARGED =
> 9.95.
> 
> I bet MLPack could handle these directly.

Yep, no need to change these.

> 3.  Some of our fields are free text fields.  For example COMMENT =
> "Customer seemed agitated.  I couldn't get a clear reading."
> 
> We could create a big dictionary, and map words to numbers.  Leaving out
> stemming and phrases.  But that would truly be a very big dictionary.  And
> it's quite likely that information in the comment might be useful for
> determining whether the transaction is fraudulent.
> 
> My previous experience, CRM114, handles text swimmingly.  But it doesn't
> handle numeric fields at all.  (Other than as a very peculiar looking
> number.)  Perhaps neither engine is really appropriate.
> 
> Does this information about our fields jog loose any additional ideas?

With text data, dictionary encoding is the typical approach.
Alternatively, the encoding can be done at the character level so that
the dictionary stays small, but then you need a powerful modeling
technique to learn the dependencies between characters.  I'm not an NLP
expert, so I can't say what will work best for your problem, but
something like dictionary encoding could work.

word2vec could also be another interesting preprocessing technique, but
I don't think anyone's currently implemented this model in mlpack.

My overall suggestion would be: once you can get your data into a
numeric format, mlpack should work just fine for the actual machine
learning part, but mlpack doesn't have the best facilities for text
loading and preprocessing.

Hope this helps!

-- 
Ryan Curtin    | "And do not attempt to grow a brain!"
ryan at ratml.org |   - Sgt. Howard Payne
