[mlpack] GSoC-2021

Ryan Curtin ryan at ratml.org
Fri Mar 12 11:05:46 EST 2021


On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju wrote:
> I am planning to contribute to mlpack under GSoC 2021. Currently, I am
> working on creating a pandas *dataframe-like class* that can be used to
> analyze the datasets in a better way.
> 
> Having a class like this would help in working with datasets as ml is not
> only about the model but about data as well.
> 
> I have a pr already open for this:
> https://github.com/mlpack/mlpack/pull/2727
> 
> I wanted to know if I can work on this in GSoC? As it was not listed on the
> idea page, but I think this would be a start to something useful and big.

Hey Gopi,

Thanks for working on that PR.  Personally I think this kind of support
would be really wonderful---right now, all non-numeric data in mlpack
has to be represented via both a `data::DatasetInfo` and an `arma::mat`
(since all the internal methods operate on an `arma::mat`).
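
To make the "both objects" point concrete, here is a minimal sketch of the
idea, using only the standard library.  `CategoryMapper` and its member names
are hypothetical and are not mlpack's actual `data::DatasetInfo` API; the
sketch only shows how string categories can be turned into numeric codes that
fit in a numeric matrix, with a side table to recover the strings.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in (NOT the mlpack class): each dimension keeps its own
// string -> numeric code table, so a categorical column can be stored as
// doubles in a numeric matrix.
class CategoryMapper
{
 public:
  explicit CategoryMapper(const std::size_t dimensions) :
      forward(dimensions), reverse(dimensions) { }

  // Return the numeric code for `value` in `dimension`, assigning the next
  // unused code the first time a value is seen.
  double Map(const std::string& value, const std::size_t dimension)
  {
    auto& table = forward.at(dimension);
    const auto it = table.find(value);
    if (it != table.end())
      return it->second;

    const double code = static_cast<double>(reverse.at(dimension).size());
    table.emplace(value, code);
    reverse.at(dimension).push_back(value);
    return code;
  }

  // Recover the original string from a numeric code.
  const std::string& Unmap(const double code,
                           const std::size_t dimension) const
  {
    return reverse.at(dimension).at(static_cast<std::size_t>(code));
  }

  // Number of distinct categories seen so far in `dimension`.
  std::size_t NumMappings(const std::size_t dimension) const
  {
    return reverse.at(dimension).size();
  }

 private:
  std::vector<std::unordered_map<std::string, double>> forward;
  std::vector<std::vector<std::string>> reverse;
};
```

A user has to carry the mapper and the numeric matrix around together, which
is exactly the awkwardness a dataframe-like wrapper could hide.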

The internal methods definitely can't change (they simply aren't
designed to work on other types of data), but we certainly could improve
the wrapper class.  You are right that it probably should look like a
Pandas dataframe or similar.

There is a lot of support and functionality already in the
`data::DatasetInfo` class (it's a typedef of the `data::DatasetMapper<>`
class), and we should definitely build on top of that.

If you wanted to work on this project, it would be best to start with
the top-level design: how will users use this dataframe?  How will the
dataframe integrate with mlpack's existing methods?  If I had, e.g., a
text dataset I wanted to use TF-IDF with, what would that look like?
What would loading a set of images look like?  How could we make sure
that whether I had a dataframe or just an `arma::mat` of numerical data,
it "felt the same" to work with either?

Those are just some of the questions that need to be answered.

Take a look at this too:

https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b

I wonder if it would be possible to convert an xframe representation
into an `arma::mat` without copying any memory at the point at which an
mlpack method is called.  (In fact, this even leads to another question:
could we seamlessly support Eigen matrices by wrapping an Eigen matrix
as an Armadillo matrix without any copying?  Or matrices from other
libraries?)
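
As a rough illustration of what "without copying" means here: Armadillo and
Eigen both store dense matrices in column-major order by default, and
Armadillo has an advanced constructor that can reuse external memory rather
than copy it (`arma::mat(ptr, n_rows, n_cols, false)`).  The sketch below
shows the same idea with only the standard library.  `MatrixView` is a
hypothetical name for illustration, not part of mlpack, Armadillo, Eigen, or
xframe, and it assumes the foreign buffer is contiguous and column-major.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning, column-major view over memory that another
// container owns.  Nothing is copied; the view is only valid while the
// original buffer is alive and is not reallocated.
class MatrixView
{
 public:
  MatrixView(double* memory, const std::size_t rows, const std::size_t cols) :
      mem(memory), nRows(rows), nCols(cols) { }

  // Element access: column-major means element (r, c) lives at c * nRows + r.
  double& operator()(const std::size_t r, const std::size_t c)
  {
    return mem[c * nRows + r];
  }

  const double& operator()(const std::size_t r, const std::size_t c) const
  {
    return mem[c * nRows + r];
  }

  std::size_t Rows() const { return nRows; }
  std::size_t Cols() const { return nCols; }

  // Expose the raw pointer so callers can confirm that no copy was made.
  const double* Data() const { return mem; }

 private:
  double* mem;
  std::size_t nRows;
  std::size_t nCols;
};
```

Writes through the view land in the original buffer, so whichever library
owns the data sees them too.  Whether a given dataframe's columns are laid
out contiguously enough to be wrapped this way is the open design question.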

Given that GSoC is a shorter period of work this year, spending lots of
time implementing custom converters and things like this is probably too
much work.  We should see how much existing work we can wrap, and the
GSoC project should probably focus on making sure that the interfaces
all work right.

I hope this is helpful!

Thanks,

Ryan

-- 
Ryan Curtin    | "I am."
ryan at ratml.org |   - Joe

