[mlpack] GSoC-2021

Omar Shrit omar at shrit.me
Fri Mar 12 17:42:50 EST 2021


Hello Gopi,

The data frame class project is indeed a good idea, and we have thought
about it before. But as Ryan said, it may be too big a project for GSoC
given the shorter period of time this year.

I have several ideas to add to what Ryan said. The objective is to make
the project lighter and a better fit for GSoC.

The data load/save part of the mlpack core is the only part that has
implementation files (.cpp), while all of mlpack's methods are header-only.
Therefore:

1) It would be nicer to make mlpack header-only by moving these
implementations into header files.

2) I would avoid re-implementing things that have already been implemented,
especially since these parts of the code (loading, saving, matrix
manipulation, and conversion) need a lot of optimization, which takes
years of work to get right. The Xtensor library, however, looks similar to
Pandas and seems to provide what is needed, in C++, with good performance.

3) Xtensor integration can be realized by adding an mlpack wrapper
(a small, light wrapper) around Xtensor functionality. This wrapper can be
integrated into the mlpack source code, or kept separate (like ensmallen)
so that it is added only when needed. That way we only link against the
libraries we actually use (avoiding extra dependencies).
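
To make point 3 a bit more concrete, here is a minimal sketch of the kind
of light, non-owning wrapper I have in mind. It uses only the standard
library, and the names `MatrixView` and `AsColumnMajor` are hypothetical:
they are not part of mlpack, Armadillo, or Xtensor. It assumes the data
frame side stores a dense row-major table (one point per row). The same
buffer can then be read as a column-major matrix with one point per column,
which is the layout mlpack works with, without copying anything.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view (not an mlpack, Armadillo, or Xtensor type).
// It interprets an existing buffer as a column-major matrix without copying.
class MatrixView
{
 public:
  MatrixView(double* memory, std::size_t numRows, std::size_t numCols) :
      mem(memory), rows(numRows), cols(numCols) { }

  // Column-major indexing: element (r, c) lives at mem[c * rows + r].
  double& operator()(std::size_t r, std::size_t c)
  {
    assert(r < rows && c < cols);
    return mem[c * rows + r];
  }

  std::size_t Rows() const { return rows; }
  std::size_t Cols() const { return cols; }

  // Raw pointer to the wrapped buffer; the view never owns or frees it.
  const double* Data() const { return mem; }

 private:
  double* mem;
  std::size_t rows;
  std::size_t cols;
};

// Hypothetical helper: a row-major (points x dims) table occupies exactly
// the same memory as a column-major (dims x points) matrix, so wrapping it
// only requires swapping the two dimensions.
inline MatrixView AsColumnMajor(std::vector<double>& rowMajorTable,
                                std::size_t points,
                                std::size_t dims)
{
  assert(rowMajorTable.size() == points * dims);
  return MatrixView(rowMajorTable.data(), dims, points);
}
```

With a table of 2 points and 3 dimensions, `AsColumnMajor(table, 2, 3)`
gives a 3 x 2 view whose column `c` is point `c`, and writes through the
view change the original table. The same pointer could in principle be
handed to Armadillo's auxiliary-memory constructor
(`arma::mat(ptr, n_rows, n_cols, false, true)`) to avoid a copy; whether
this works cleanly with Xtensor containers would still need to be checked.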

The steps above will require more than one GSoC to complete, but they
can be done independently. You can choose the one you find most suitable
and build a proposal on it, keeping the tasks as decoupled as possible in
order to maximize the feasibility of the project.

I hope you find this helpful!

Thanks,

Omar


On 03/13, Gopi Manohar Tatiraju wrote:
> Hey Ryan,
> 
> Thanks for the feedback.
> I agree that this can be a very big project considering the time span of
> GSoC this year, if we decide to go ahead with this project it will be very
> important to decide on some base features as you already pointed out.
> 
> how will users use this dataframe?
> 
> 
> We should do it in the same way as we do DatasetInfo, this will keep it
> separate from the dataset(arma::mat) so that we don't need to change how we
> pass data to the agent.
> We will create an object of class mlFrame and pass that to the load
> function. But we have to make sure that we don't end up making another copy
> of the dataset here as well, might use a bit of help here to create the
> skeleton of the class.
> 
>  How will the dataframe integrate with mlpack's existing methods?
> 
> 
> If we could follow the way I mentioned above we won't need to change any
> existing implementations to access or use the data.
> 
> Let's discuss this point by point. Let me know what you think about the
> approach above, or whether I need to clarify anything further. I will
> address the other questions as soon as we have a basic idea of the
> project.
> 
> Regarding the image, there is an ImageInfo class we can extend its
> functionality to work on a directory of images, but I have not yet figured
> out if we need a way to display the methods, I mean the info regarding the
> images should be fine right?
> 
> Also, I was thinking of adding some stats to DatasetInfo class, methods to
> show the numerical summary of the dataset which can include mean, std, min,
> max, etc. These are the same methods that I suggested to implement in this
> PR <https://github.com/mlpack/mlpack/pull/2727>. As most of our datasets
> are numerical for now, I think we should first implement functionality that
> can be utilized by numeric datasets. Let me know what you think.
> 
> Thank you,
> Gopi
> 
> 
> On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <ryan at ratml.org> wrote:
> 
> > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju wrote:
> > > I am planning to contribute to mlpack under GSoC 2021. Currently, I am
> > > working on creating a pandas *dataframe-like class* that can be used to
> > > analyze the datasets in a better way.
> > >
> > > Having a class like this would help in working with datasets as ml is not
> > > only about the model but about data as well.
> > >
> > > I have a pr already open for this:
> > > https://github.com/mlpack/mlpack/pull/2727
> > >
> > > I wanted to know if I can work on this in GSoC? As it was not listed on
> > > the idea page, but I think this would be a start to something useful
> > > and big.
> >
> > Hey Gopi,
> >
> > Thanks for working on that PR.  Personally I think this kind of support
> > would be really wonderful---right now, all non-numeric data in mlpack
> > has to be represented via both a `data::DatasetInfo` and an `arma::mat`
> > (since all the internal methods operate on an `arma::mat`).
> >
> > The internal methods definitely can't change (they simply aren't
> > designed to work on other types of data), but we certainly could improve
> > the wrapper class.  You are right that it probably should look like a
> > Pandas dataframe or similar.
> >
> > There is a lot of support and functionality already in the
> > `data::DatasetInfo` class (it's a typedef of the `data::DatasetMapper<>`
> > class), and we should definitely build on top of that.
> >
> > If you wanted to work on this project, it would be best to start with
> > the top-level design: how will users use this dataframe?  How will the
> > dataframe integrate with mlpack's existing methods?  If I had, e.g., a
> > text dataset I wanted to use TF-IDF with, what would that look like?
> > What would loading a set of images look like?  How could we make sure
> > that whether I had a dataframe or just an `arma::mat` of numerical data,
> > it "felt the same" to work with either?
> >
> > Those are just some of the questions that need to be answered.
> >
> > Take a look at this too:
> >
> > https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b
> >
> > I wonder if it would be possible to convert an xframe representation
> > into an `arma::mat` without copying any memory at the point at which an
> > mlpack method is called.  (In fact, this even leads to another question:
> > could we seamlessly support Eigen matrices by converting an Eigen matrix
> > to an Armadillo matrix without any conversion?  Or other matrix
> > libraries?)
> >
> > Given that GSoC is a shorter period of work this year, spending lots of
> > time implementing custom converters and things like this is probably too
> > much work---we should see how much we can wrap existing work and the
> > GSoC project should probably focus on making sure that the interfaces
> > all work right.
> >
> > I hope this is helpful!
> >
> > Thanks,
> >
> > Ryan
> >
> > --
> > Ryan Curtin    | "I am."
> > ryan at ratml.org |   - Joe
> >
