[mlpack] GSOC 19 automatic bindings and Rcppmlpack

Ryan Curtin ryan at ratml.org
Thu Mar 28 22:02:14 EDT 2019


On Thu, Mar 28, 2019 at 12:01:43PM -0500, Andrew Bernauer wrote:
> Hello,
> 
> I am a graduating senior with an interest in participating in the
> Rcppmlpack project. Finished building the source on Windows with some help
> from the community this past Monday. Read over the documentation on the
> mlpack site and have a pretty good idea of what the project will entail.
> What steps would you recommend, for getting a prototype binding up for R?
> My remaining course work this semester includes a course in Machine
> Learning using R, so I have a solid grasp of what standard approaches are
> used on the R side of things.  Any suggestions would be greatly
> appreciated.

Hi there Andrew,

Thanks for getting in touch.  Glad to hear you had success building on
Windows---sometimes it can be a bit tricky.

For a binding project to another language (any language, including R),
the most important thing is that we avoid copying matrices wherever
possible when passing them between the two languages.  So for Python, this
means passing the Armadillo pointer through to numpy and vice versa.  In
R we luckily already have Rcpp, which should help with that quite a lot.
So that part should not be too hard.
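
To make the no-copy idea concrete, here is a minimal sketch in plain
C++.  `MatrixView`, `ScaleInPlace`, and `DemoNoCopy` are hypothetical
names for illustration only, not part of mlpack, Armadillo, or Rcpp.
The only point is that the C++ side wraps the caller's column-major
buffer instead of allocating its own:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over a column-major buffer that another
// language runtime allocated.  It never allocates or frees memory.
struct MatrixView
{
  double* mem;       // borrowed pointer; the caller keeps ownership
  std::size_t rows;
  std::size_t cols;

  double& operator()(std::size_t r, std::size_t c) const
  {
    assert(r < rows && c < cols);
    return mem[c * rows + r];  // column-major indexing
  }
};

// Stand-in for an algorithm: scales every element in place.
void ScaleInPlace(MatrixView m, double factor)
{
  for (std::size_t c = 0; c < m.cols; ++c)
    for (std::size_t r = 0; r < m.rows; ++r)
      m(r, c) *= factor;
}

// Returns true if the caller's buffer itself was modified (no copy made).
bool DemoNoCopy()
{
  std::vector<double> callerBuffer = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 };
  MatrixView view{ callerBuffer.data(), 2, 3 };
  ScaleInPlace(view, 2.0);
  return view.mem == callerBuffer.data() && callerBuffer[5] == 12.0;
}
```

With Armadillo itself, the analogous tool should be the advanced `mat`
constructor that takes an existing memory pointer with `copy_aux_mem`
set to `false`; how that interacts with memory owned by R is something
that would still need to be checked.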

Next we have to figure out the mappings of the other types.  The
automatic bindings can have all the types defined in
`src/mlpack/core/util/param.hpp`, which should be (in C++) `double`,
`int`, `std::string`, `std::vector<int>`, `std::vector<std::string>`,
`arma::mat`, `arma::vec`, `arma::rowvec`, `arma::Mat<size_t>`,
`arma::Col<size_t>`, `arma::Row<size_t>`, and then the weird one
`std::tuple<data::DatasetInfo, arma::mat>` (which is just a matrix with
markers specifying whether dimensions are categorical or numeric), and
`ModelType*` (the hard one).
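
One way to organize such a mapping is a small type trait, sketched
below with standard-library types only.  `RTypeName` is a hypothetical
name, and the R-side type names it returns are my assumptions about a
reasonable mapping, not something mlpack defines; the Armadillo and
model types would need their own specializations:

```cpp
#include <string>
#include <vector>

// Hypothetical trait: the primary template covers types with no mapping yet.
template<typename T>
struct RTypeName
{
  static std::string Get() { return "unknown"; }
};

// Assumed R-side names for the simple parameter types.
template<>
struct RTypeName<double>
{
  static std::string Get() { return "numeric"; }
};

template<>
struct RTypeName<int>
{
  static std::string Get() { return "integer"; }
};

template<>
struct RTypeName<std::string>
{
  static std::string Get() { return "character"; }
};

template<>
struct RTypeName<std::vector<int>>
{
  static std::string Get() { return "integer vector"; }
};

template<>
struct RTypeName<std::vector<std::string>>
{
  static std::string Get() { return "character vector"; }
};
```

A generator could then call `RTypeName<T>::Get()` for each parameter
when emitting documentation or wrapper code, and an "unknown" result
flags a type that still needs a mapping.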

Now that I write that, I guess the list is a bit longer than I thought.

In any case, most of those mappings are pretty easy, and it's okay if we
copy strings or vectors of ints or whatever because they won't be large.
We really need to avoid copying matrices though; however, I don't know
enough about R to know in what situations we will be able to avoid that.

The hard one is the model pointers---you can take a look at the bindings
and see that sometimes we have these `--input_model_file` and
`--output_model_file` parameters.  We wrapped this in Python by basically
just having Python hold onto this memory pointer from C++.  I am not
sure what the R solution will be, but that will probably be the most
tricky part of the process.
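
As a rough sketch of the "other language holds onto the pointer" idea,
here is a plain C++ version using an opaque handle.  `ToyModel`,
`ModelHandle`, `TrainModel`, `Predict`, `ReleaseModel`, and
`LiveModelCount` are all hypothetical names; this is not how the Python
bindings are actually implemented, just the general shape:

```cpp
#include <cassert>

// Hypothetical stand-in for a trained model type.
struct ToyModel
{
  double weight;
};

// Opaque handle that the other language stores without understanding it.
using ModelHandle = void*;

// Number of models currently alive; only here to make cleanup observable.
static int liveModels = 0;

int LiveModelCount() { return liveModels; }

// "Training" allocates the model on the C++ side and hands back a handle.
ModelHandle TrainModel(double weight)
{
  ++liveModels;
  return new ToyModel{ weight };
}

// Later calls pass the handle back in; C++ casts it to the real type.
double Predict(ModelHandle handle, double x)
{
  assert(handle != nullptr);
  const ToyModel* model = static_cast<const ToyModel*>(handle);
  return model->weight * x;
}

// Meant to be called from the other language's finalizer or GC hook.
void ReleaseModel(ModelHandle handle)
{
  if (handle == nullptr)
    return;
  delete static_cast<ToyModel*>(handle);
  --liveModels;
}
```

On the R side, Rcpp's `Rcpp::XPtr` external pointer wrapper may turn
out to be a natural fit for holding such a handle and running the
cleanup when R garbage-collects it, but I have not verified that it
covers everything the bindings need.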

Once all those bits are figured out, you can handwrite a binding for one
algorithm (say, PCA or something), make sure it works, then set about
building the bits that go in the automatic binding generator.  There's a
tutorial on the doxygen documentation page that should detail that.

Anyway, this email got longer than I thought.  I hope it's useful both
to you and anyone else on the list who's looking to apply for this
project (whether it be for R or another language---the information
applies either way).  I may have also written more on the mailing list in the
past; you can take a look at the archives for more.

Hope this helps.  Let me know if I can clarify anything else.

Thanks,

Ryan

-- 
Ryan Curtin    | "Open the pig!"
ryan at ratml.org |   - Frank Moses

