[mlpack] GSoC-2021

Omar Shrit omar at shrit.me
Thu Mar 18 15:12:34 EDT 2021


Hello Gopi,

Sorry, I am forgot to answer you earlier (A little bit busy), thanks for the ping.

On 03/18, Gopi Manohar Tatiraju wrote:
> Hey Omar,
> 
> Thank you soo much for such detailed replies and inputs.
> 
> I see that there are many things that we need to get in order before we
> start working on this idea.
> I think we should start from this point:
> 
> 1) It would be nicer to have mlpack as header-only by moving these
> > implementations into header files.
> 
> 
> I went through the mlpack/core/data/ directory and List of .cpp files in
> core/data
> 
>    1. detect_file_type.cpp
>    2. load.cpp
>    3. load_csv.cpp
>    4. load_image.cpp
>    5. save_image.cpp
> 
> The convention we follow in mlpack is having two .hpp files, one for
> declarations and one for implementations.

The convention is related to classes that are defined as
class-template. I do not think it is possible to convert an implementation
file to `_impl.hpp` if the class is not a class template.

Therefore, you can put the entire implementation in the header, where
the class is defined. There are several disadvantages of putting everything
in one place, for instance, increase compilation time and binary footprint.

It would be nice to have a small prototype and a benchmark to see how
much time the compilation time will increase, since we are trying to
reduce these numbers.

I did not look into the code yet, but I do not see anything that might
prevent from having everything in the headers.

Also, do not worry about details such as convention, these can be seen
later especially if a prototype comes back with good results.

> I just want to clarify some things so that I can start writing some code
> regarding this:
> 
>    - Let's consider detect_file_type, we have for now:
>       - detect_file_type.hpp
>       <https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.hpp>
>       - detect_file_type.cpp
>       <https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.cpp>
> 
> What we want in the future:
> 
> 
>    - detect_file_type.hpp
>    - detect_file_type_impl.hpp
> 
> Can you just give me one example or reference or just a list of points that
> we need to consider while restructuring these files? What will be the major
> differences after converting them to .hpp files?
> 
> 
> It feels great getting feedback on my idea and I get to learn soo many new
> things on the way.
> 
> Thank you,
> Gopi.
> 
> 
> On Sat, Mar 13, 2021 at 4:12 AM Omar Shrit <omar at shrit.me> wrote:
> 
> > Hello Gopi,
> >
> > The data frame class project is indeed a good idea, we have thought
> > about that, but as Ryan said, it can be a big project for GSoC given
> > the limited period of time this year.
> >
> > I have several ideas to add on what Ryan said. The objective is to make
> > the project lighter and more fit for a GSoC.
> >
> > Knowing that, the data load/save part from mlpack core is the only part
> > that
> > has implementation files (.cpp) while all methods of mlpack are
> > header-only,
> > therefore:
> >
> > 1) It would be nicer to have mlpack as header-only by moving these
> > implementations into header files.
> >
> > 2) I would avoid re-implementing things that have already implemented,
> > especially that these parts of code (Loading, Saving, Matrix
> > manipulation, and conversion) need a lot of
> > optimization, which requires years of work to have something feasible.
> > However, looking at Xtensor library seems to be similar to Pandas providing
> > what is in need, in C++, with a good performance.
> >
> > 3) Xtensor integration can be realized by adding a mlpack wrapper
> > (a small light wrapper) for Xtensor functionalities. This wrapper can be
> > integrated into mlpack source code, or can be kept separately (as
> > ensmallen)
> > allowing to be added when needed, therefore only link with library that
> > we use (avoiding dependencies).
> >
> > Knowing that the above steps will require more than one GSoC to
> > complete, but they can be done independently. You can choose what you
> > find the most suitable and build a proposal upon it allowing to have
> > the most possible decoupling between the tasks in order to maximize the
> > possible
> > feasibility of the project.
> >
> > I hope you find this helpful !
> >
> > Thanks,
> >
> > Omar
> >
> >
> > On 03/13, Gopi Manohar Tatiraju wrote:
> > > Hey Ryan,
> > >
> > > Thanks for the feedback.
> > > I agree that this can be a very big project considering the time span of
> > > GSoC this year, if we decide to go ahead with this project it will be
> > very
> > > important to decide on some base features as you already pointed out.
> > >
> > > how will users use this dataframe?
> > >
> > >
> > > We should do it in the same way as we do DatasetInfo, this will keep it
> > > separate from the dataset(arma::mat) so that we don't need to change how
> > we
> > > pass data to the agent.
> > > We will create an object of class mlFrame and pass that to the load
> > > function. But we have to make sure that we don't end up making another
> > copy
> > > of the dataset here as well, might use a bit of help here to create the
> > > skeleton of the class.
> > >
> > >  How will the dataframe integrate with mlpack's existing methods?
> > >
> > >
> > > If we could follow the way I mentioned above we won't need to change any
> > > existing implementations to access or use the data.
> > >
> > > Let's discuss point by point, let me know what you think about the
> > > above-mentioned way to implement it or if I need to clear anything more
> > > regarding this, I will address other questions soon as we get a basic
> > idea
> > > of the project.
> > >
> > > Regarding the image, there is an ImageInfo class we can extend its
> > > functionality to work on a directory of images, but I have not yet
> > figured
> > > out if we need a way to display the methods, I mean the info regarding
> > the
> > > images should be fine right?
> > >
> > > Also, I was thinking of adding some stats to DatasetInfo class, methods
> > to
> > > show the numerical summary of the dataset which can include mean, std,
> > min,
> > > max, etc. These are the same methods that I suggested to implement in
> > this
> > > PR <https://github.com/mlpack/mlpack/pull/2727>. As most of our datasets
> > > are numerical for now, I think we should first implement functionality
> > that
> > > can be utilized by numeric datasets. Let me know what you think.
> > >
> > > Thank you,
> > > Gopi
> > >
> > >
> > > On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <ryan at ratml.org> wrote:
> > >
> > > > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju wrote:
> > > > > I am planning to contribute to mlapck under GSoC 2021. Currently, I
> > am
> > > > > working on creating a pandas *dataframe-like class* that can be used
> > to
> > > > > analyze the datasets in a better way.
> > > > >
> > > > > Having a class like this would help in working with datasets as ml
> > is not
> > > > > only about the model but about data as well.
> > > > >
> > > > > I have a pr already open for this:
> > > > > https://github.com/mlpack/mlpack/pull/2727
> > > > >
> > > > > I wanted to know if I can work on this in GSoC? As it was not listed
> > on
> > > > the
> > > > > idea page, but I think this would be a start to something useful and
> > big.
> > > >
> > > > Hey Gopi,
> > > >
> > > > Thanks for working on that PR.  Personally I think this kind of support
> > > > would be really wonderful---right now, all non-numeric data in mlpack
> > > > has to be represented via both a `data::DatasetInfo` and an `arma::mat`
> > > > (since all the internal methods operate on an `arma::mat`).
> > > >
> > > > The internal methods definitely can't change (they simply aren't
> > > > designed to work on other types of data), but we certainly could
> > improve
> > > > the wrapper class.  You are right that it probably should look like a
> > > > Pandas dataframe or similar.
> > > >
> > > > There is a lot of support and functionality already in the
> > > > `data::DatasetInfo` class (it's a typedef of the
> > `data::DatasetMapper<>`
> > > > class), and we should definitely build on top of that.
> > > >
> > > > If you wanted to work on this project, it would be best to start with
> > > > the top-level design: how will users use this dataframe?  How will the
> > > > dataframe integrate with mlpack's existing methods?  If I had, e.g., a
> > > > text dataset I wanted to use TF-IDF with, what would that look like?
> > > > What would loading a set of images look like?  How could we make sure
> > > > that whether I had a dataframe or just an `arma::mat` of numerical
> > data,
> > > > it "felt the same" to work with either?
> > > >
> > > > Those are just some of the questions that need to be answered.
> > > >
> > > > Take a look at this too:
> > > >
> > > >
> > https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b
> > > >
> > > > I wonder if it would be possible to convert an xframe representation
> > > > into an `arma::mat` without copying any memory at the point at which an
> > > > mlpack method is called.  (In fact, this even leads to another
> > question:
> > > > could we seamlessly support Eigen matrices by converting an Eigen
> > matrix
> > > > to an Armadillo matrix without any conversion?  Or other matrix
> > > > libraries?)
> > > >
> > > > Given that GSoC is a shorter period of work this year, spending lots of
> > > > time implementing custom converters and things like this is probably
> > too
> > > > much work---we should see how much we can wrap existing work and the
> > > > GSoC project should probably focus on making sure that the interfaces
> > > > all work right.
> > > >
> > > > I hope this is helpful!
> > > >
> > > > Thanks,
> > > >
> > > > Ryan
> > > >
> > > > --
> > > > Ryan Curtin    | "I am."
> > > > ryan at ratml.org |   - Joe
> > > >
> >
> > > _______________________________________________
> > > mlpack mailing list
> > > mlpack at lists.mlpack.org
> > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> >
> >

> _______________________________________________
> mlpack mailing list
> mlpack at lists.mlpack.org
> http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://knife.lugatgt.org/pipermail/mlpack/attachments/20210318/b77cbb04/attachment-0001.sig>


More information about the mlpack mailing list