[mlpack] GSoC-2021

Gopi Manohar Tatiraju deathcoderx at gmail.com
Fri Mar 12 15:25:26 EST 2021


Hey Ryan,

Thanks for the feedback.
I agree that this could be a very big project given the shorter time span
of GSoC this year. If we decide to go ahead with it, it will be very
important to settle on a set of base features, as you already pointed out.

> How will users use this dataframe?


We should do it the same way we do DatasetInfo: this keeps the frame
separate from the dataset (arma::mat), so we don't need to change how we
pass data to the agent.
We would create an object of class mlFrame and pass it to the load
function. We have to make sure we don't end up making another copy of the
dataset here, though. I might need a bit of help creating the skeleton of
the class.
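
A minimal sketch of this sidecar design, under stated assumptions: the
`Matrix` struct is only a stand-in for arma::mat so the example builds
without Armadillo, and the members of `mlFrame` and the `Load` overload
below are hypothetical names, not existing mlpack API. The point it
illustrates is that the frame stores metadata only, while the values are
written once into the matrix that methods already consume.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Stand-in for arma::mat so the sketch builds without Armadillo (assumption).
// Column-major, one point per column, like matrices loaded by mlpack.
struct Matrix
{
  std::size_t rows = 0;  // Number of dimensions.
  std::size_t cols = 0;  // Number of points.
  std::vector<double> values;

  double At(std::size_t dim, std::size_t point) const
  {
    assert(dim < rows && point < cols);
    return values[point * rows + dim];
  }
};

// Hypothetical sidecar class: stores metadata only, never the data itself.
class mlFrame
{
 public:
  void SetDimensionNames(std::vector<std::string> names)
  {
    names_ = std::move(names);
  }

  // Returns the index of the named dimension, or NumDimensions() if absent.
  std::size_t DimensionIndex(const std::string& name) const
  {
    return static_cast<std::size_t>(
        std::find(names_.begin(), names_.end(), name) - names_.begin());
  }

  std::size_t NumDimensions() const { return names_.size(); }

 private:
  std::vector<std::string> names_;
};

// Hypothetical loader: reads a CSV with a header line. Names go into the
// frame; values are written once, directly into the matrix (no second copy).
// Non-numeric cells make std::stod throw; this sketch does not handle that.
bool Load(std::istream& in, Matrix& data, mlFrame& frame)
{
  std::string line;
  if (!std::getline(in, line))
    return false;

  std::vector<std::string> names;
  std::stringstream header(line);
  for (std::string cell; std::getline(header, cell, ',');)
    names.push_back(cell);

  data.rows = names.size();
  data.cols = 0;
  data.values.clear();

  while (std::getline(in, line))
  {
    if (line.empty())
      continue;
    std::stringstream row(line);
    std::size_t count = 0;
    for (std::string cell; std::getline(row, cell, ',');)
    {
      data.values.push_back(std::stod(cell));
      ++count;
    }
    if (count != data.rows)
      return false;  // Ragged row: reject the file.
    ++data.cols;  // Each CSV row becomes one column (point).
  }

  frame.SetDimensionNames(std::move(names));
  return true;
}
```

With this split, a method that takes only the matrix is called exactly as
before, and the frame is consulted only when a name has to be resolved,
for example `data.At(frame.DimensionIndex("weight"), 0)`.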

> How will the dataframe integrate with mlpack's existing methods?


If we follow the approach I described above, we won't need to change any
existing implementations to access or use the data.

Let's discuss this point by point. Let me know what you think about the
approach above, or if I need to clarify anything further. I will address
the other questions as soon as we have a basic idea of the project.

Regarding images, there is an ImageInfo class, and we can extend its
functionality to work on a directory of images. I have not yet figured
out whether we need a way to display the methods; I mean, the info
regarding the images should be fine, right?

Also, I was thinking of adding some stats to the DatasetInfo class: methods
that show a numerical summary of the dataset, including mean, std, min,
max, etc. These are the same methods that I proposed in this
PR <https://github.com/mlpack/mlpack/pull/2727>. As most of our datasets
are numerical for now, I think we should first implement functionality that
numeric datasets can use. Let me know what you think.
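
A minimal sketch of such a per-dimension summary, written against a plain
column-major buffer so it does not depend on Armadillo. `DimensionSummary`
and `Summarize` are hypothetical names, not part of DatasetInfo or of the
PR, and using the sample standard deviation (n - 1 in the denominator) is
an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: one entry per dimension of the dataset.
struct DimensionSummary
{
  double mean;
  double stddev;  // Sample standard deviation (n - 1); 0 for a single point.
  double min;
  double max;
};

// values is column-major with one point per column, so dimension d of
// point p lives at values[p * dims + d].
std::vector<DimensionSummary> Summarize(const std::vector<double>& values,
                                        const std::size_t dims)
{
  assert(dims > 0 && !values.empty() && values.size() % dims == 0);
  const std::size_t points = values.size() / dims;

  std::vector<DimensionSummary> out;
  out.reserve(dims);
  for (std::size_t d = 0; d < dims; ++d)
  {
    // First pass: sum, minimum, and maximum of this dimension.
    double sum = 0.0, lo = values[d], hi = values[d];
    for (std::size_t p = 0; p < points; ++p)
    {
      const double v = values[p * dims + d];
      sum += v;
      lo = std::min(lo, v);
      hi = std::max(hi, v);
    }
    const double mean = sum / static_cast<double>(points);

    // Second pass: squared deviations from the mean.
    double squares = 0.0;
    for (std::size_t p = 0; p < points; ++p)
    {
      const double diff = values[p * dims + d] - mean;
      squares += diff * diff;
    }
    const double stddev = (points > 1)
        ? std::sqrt(squares / static_cast<double>(points - 1))
        : 0.0;

    out.push_back({mean, stddev, lo, hi});
  }
  return out;
}
```

Printing the returned entries next to the dimension names would give a
table similar in spirit to what pandas shows with `describe()`.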

Thank you,
Gopi


On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <ryan at ratml.org> wrote:

> On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju wrote:
> > I am planning to contribute to mlpack under GSoC 2021. Currently, I am
> > working on creating a pandas *dataframe-like class* that can be used to
> > analyze the datasets in a better way.
> >
> > Having a class like this would help in working with datasets as ml is not
> > only about the model but about data as well.
> >
> > I have a pr already open for this:
> > https://github.com/mlpack/mlpack/pull/2727
> >
> > I wanted to know if I can work on this in GSoC? As it was not listed on
> > the idea page, but I think this would be a start to something useful and
> > big.
>
> Hey Gopi,
>
> Thanks for working on that PR.  Personally I think this kind of support
> would be really wonderful---right now, all non-numeric data in mlpack
> has to be represented via both a `data::DatasetInfo` and an `arma::mat`
> (since all the internal methods operate on an `arma::mat`).
>
> The internal methods definitely can't change (they simply aren't
> designed to work on other types of data), but we certainly could improve
> the wrapper class.  You are right that it probably should look like a
> Pandas dataframe or similar.
>
> There is a lot of support and functionality already in the
> `data::DatasetInfo` class (it's a typedef of the `data::DatasetMapper<>`
> class), and we should definitely build on top of that.
>
> If you wanted to work on this project, it would be best to start with
> the top-level design: how will users use this dataframe?  How will the
> dataframe integrate with mlpack's existing methods?  If I had, e.g., a
> text dataset I wanted to use TF-IDF with, what would that look like?
> What would loading a set of images look like?  How could we make sure
> that whether I had a dataframe or just an `arma::mat` of numerical data,
> it "felt the same" to work with either?
>
> Those are just some of the questions that need to be answered.
>
> Take a look at this too:
>
> https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b
>
> I wonder if it would be possible to convert an xframe representation
> into an `arma::mat` without copying any memory at the point at which an
> mlpack method is called.  (In fact, this even leads to another question:
> could we seamlessly support Eigen matrices by converting an Eigen matrix
> to an Armadillo matrix without any conversion?  Or other matrix
> libraries?)
>
> Given that GSoC is a shorter period of work this year, spending lots of
> time implementing custom converters and things like this is probably too
> much work---we should see how much we can wrap existing work and the
> GSoC project should probably focus on making sure that the interfaces
> all work right.
>
> I hope this is helpful!
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "I am."
> ryan at ratml.org |   - Joe
>