[mlpack] Possible integration with MADlib?

Ryan Curtin ryan at ratml.org
Wed Feb 24 10:31:57 EST 2016


On Wed, Feb 17, 2016 at 10:25:24AM -0500, Babak Alipour wrote:
> Greetings everyone,
> 
> I'm a beginner to mlpack and was trying to use it for some large-scale
> data processing.
> 
> While mlpack is a great library and I like the modular standalone
> programs, it lacks support for SQL databases. I also came across another
> great analytical library, MADlib (http://madlib.incubator.apache.org/ ,
> code available on GitHub: https://github.com/apache/incubator-madlib ).
> While MADlib works beautifully on PostgreSQL, it lacks many popular
> machine learning algorithms, such as multilayer perceptrons and hidden
> Markov models, both of which mlpack provides.
> 
> I was wondering if anyone has looked at possible integration of the two.
> The model for MADlib is complex and development of new algorithms for it
> is difficult (steep learning curve). But since the architecture is
> layered, I think it's possible to have the Python drivers not only call
> RDBMS built-in functions and MADlib-developed C++ code, but also call
> other libraries. Integration of a high-performance library such as mlpack
> could be very useful for people trying to do analytics on data residing
> in SQL databases.

Hi Babak,

I wanted to respond to this sooner, but I had to take a look at MADlib
first.

I think that some level of integration and interoperability would be
really nice, but probably the thing that is lacking is manpower. :)

It seems to me that for any integration to work, the key would be a
matrix type that mlpack could use which is backed by a database instead
of memory.  Because mlpack uses Armadillo as its matrix library and is
heavily templated, there are some nice advantages here: if someone were
to implement a class 'db_mat' with the same basic API as the Armadillo
matrices [1], then you could plug that class in anywhere arma::mat is
used in the mlpack code.  And because the mlpack code is mostly
templates, you don't have to modify mlpack itself; you can just do
things like

  typedef Perceptron<SimpleWeightUpdate, ZeroInitialization, db_mat>
      DBPerceptron;

and then when you call the Perceptron constructor or the Train()
function, you can pass a db_mat.  In this way, all of the operations
that are done on your data happen in the database, using MADlib core
functionality (or something like that).
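
To make that a little more concrete, here is a very rough, hypothetical
sketch of what the beginning of such a db_mat class might look like.
To be clear, the Fetch() placeholder and the constructor arguments are
just illustrations; a real db_mat would need to cover a much larger
part of the Armadillo matrix API (submatrix views, arithmetic
operators, element-wise functions, and so on) before the DBPerceptron
typedef above would actually compile:

  // Hypothetical sketch only: a matrix-like class whose element reads
  // are answered by a database instead of a dense block of memory.
  #include <cstddef>

  class db_mat
  {
   public:
    // Mirror the Armadillo member names that mlpack code reads directly.
    std::size_t n_rows;
    std::size_t n_cols;
    std::size_t n_elem;

    // A real constructor would take a database connection and a table
    // name; here we only take the dimensions.
    db_mat(const std::size_t rows, const std::size_t cols) :
        n_rows(rows), n_cols(cols), n_elem(rows * cols) { }

    // Armadillo matrices are column-major, so element (i, j) is element
    // number j * n_rows + i.  Instead of dereferencing a pointer, this
    // would ask the database for that value.
    double operator()(const std::size_t i, const std::size_t j) const
    {
      return Fetch(j * n_rows + i);
    }

   private:
    // Placeholder for "fetch element k from the backing table".
    double Fetch(const std::size_t /* k */) const { return 0.0; }
  };

With enough of that interface in place, you could construct a db_mat
over a table and hand it straight to the DBPerceptron constructor or
its Train() function, and every element access the perceptron makes
would go to the database instead of RAM.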

Building on top of that, you could then take this db_mat class,
together with mlpack, and build them into MADlib in order to provide
some more nice machine learning functionality there.

This is something I've always wanted to do.  A very long time ago, I
remember a Master's student taking the time to rewrite mlpack to work on
databases instead of contiguous dense memory blocks (back when it was
FASTLib/MLPACK), and I remember that it performed well.  However, I
doubt that code is of any use now, and I'm not even sure where to find
it anymore; that was 2009 or so, and the mlpack codebase has changed so
drastically since then that it wouldn't carry over.  But I mention the
anecdote only to point out that it should be possible. :)

Realistically, the big problem with these ideas is finding someone to
implement them.  I'm happy to provide some guidance and support, and
I'd love to see a db_mat class in mlpack (and maybe we could even pass
it back upstream to Armadillo, depending), but I certainly don't have
time to do the actual implementation myself.

I hope this is helpful!  If anything I've written isn't clear, please
let me know so I can clarify.

Thanks,

Ryan

[1] http://arma.sourceforge.net/docs.html

-- 
Ryan Curtin    | "So I got that going for me, which is nice."
ryan at ratml.org |   - Carl Spackler


