[mlpack] GSoC-2021

Gopi Manohar Tatiraju deathcoderx at gmail.com
Mon Mar 22 11:40:47 EDT 2021


Hey,

Thank you for your quick response.

So I can see that we have boost::spirit in load_csv.cpp

Currently we have 2 files for load_csv:

   1. load_csv.hpp
   2. load_csv.cpp

I will move the function implementations for .cpp file to .hpp file and
will declare them as inline. So we will only have one .hpp file for
load_csv.

I will also check the compilation time before making these changes.

Sounds good?

Thanks,
Gopi

On Mon, Mar 22, 2021 at 8:30 PM Omar Shrit <omar at shrit.me> wrote:

> On 03/22, Gopi Manohar Tatiraju wrote:
> > Hey Omar,
> >
> > Thank you for the reply...
> > So what I understood is to make files _impl.hpp we need to declare class
> > templates.
> >
> > But taking the example of *detect_file *we can see that we only have
> > functions declared here. One file for declaring the functions and one for
> > the implementation.
> >
> > So maybe, I need to structure these functions into classes, right? Coz if
> > we want to put functions directly into the header files we need to
> declare
> > them as inline to avoid the violation of C++'s one definition rule.
>
> inline should be fine for now.
>
> > Also, how should I compare the compilation time?
>
> Use the time command with make in order to get the compilation time
>
> > Also about the prototype and benchmark you mentioned, can you explain a
> bit
> > more about that. I really want to make this project work and I am willing
> > to put my time in learning new things.
>
> You can start by moving class implementation (those using boost spirit) to
> header files
> and recompile mlpack. You need to measure the compilation time before
> and after changing implementation into header, that will gives us an
> idea about the increase related to boost spirit
>
> > Thank you for the guidance.
> >
> > PS. Sorry, I was busy with my college project so it took some time to
> reply.
> >
> > Regards,
> > Gopi
> >
> > On Fri, Mar 19, 2021 at 12:42 AM Omar Shrit <omar at shrit.me> wrote:
> >
> > > Hello Gopi,
> > >
> > > Sorry, I am forgot to answer you earlier (A little bit busy), thanks
> for
> > > the ping.
> > >
> > > On 03/18, Gopi Manohar Tatiraju wrote:
> > > > Hey Omar,
> > > >
> > > > Thank you soo much for such detailed replies and inputs.
> > > >
> > > > I see that there are many things that we need to get in order before
> we
> > > > start working on this idea.
> > > > I think we should start from this point:
> > > >
> > > > 1) It would be nicer to have mlpack as header-only by moving these
> > > > > implementations into header files.
> > > >
> > > >
> > > > I went through the mlpack/core/data/ directory and List of .cpp
> files in
> > > > core/data
> > > >
> > > >    1. detect_file_type.cpp
> > > >    2. load.cpp
> > > >    3. load_csv.cpp
> > > >    4. load_image.cpp
> > > >    5. save_image.cpp
> > > >
> > > > The convention we follow in mlpack is having two .hpp files, one for
> > > > declarations and one for implementations.
> > >
> > > The convention is related to classes that are defined as
> > > class-template. I do not think it is possible to convert an
> implementation
> > > file to `_impl.hpp` if the class is not a class template.
> > >
> > > Therefore, you can put the entire implementation in the header, where
> > > the class is defined. There are several disadvantages of putting
> everything
> > > in one place, for instance, increase compilation time and binary
> footprint.
> > >
> > > It would be nice to have a small prototype and a benchmark to see how
> > > much time the compilation time will increase, since we are trying to
> > > reduce these numbers.
> > >
> > > I did not look into the code yet, but I do not see anything that might
> > > prevent from having everything in the headers.
> > >
> > > Also, do not worry about details such as convention, these can be seen
> > > later especially if a prototype comes back with good results.
> > >
> > > > I just want to clarify some things so that I can start writing some
> code
> > > > regarding this:
> > > >
> > > >    - Let's consider detect_file_type, we have for now:
> > > >       - detect_file_type.hpp
> > > >       <
> > >
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.hpp
> > > >
> > > >       - detect_file_type.cpp
> > > >       <
> > >
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.cpp
> > > >
> > > >
> > > > What we want in the future:
> > > >
> > > >
> > > >    - detect_file_type.hpp
> > > >    - detect_file_type_impl.hpp
> > > >
> > > > Can you just give me one example or reference or just a list of
> points
> > > that
> > > > we need to consider while restructuring these files? What will be the
> > > major
> > > > differences after converting them to .hpp files?
> > > >
> > > >
> > > > It feels great getting feedback on my idea and I get to learn soo
> many
> > > new
> > > > things on the way.
> > > >
> > > > Thank you,
> > > > Gopi.
> > > >
> > > >
> > > > On Sat, Mar 13, 2021 at 4:12 AM Omar Shrit <omar at shrit.me> wrote:
> > > >
> > > > > Hello Gopi,
> > > > >
> > > > > The data frame class project is indeed a good idea, we have thought
> > > > > about that, but as Ryan said, it can be a big project for GSoC
> given
> > > > > the limited period of time this year.
> > > > >
> > > > > I have several ideas to add on what Ryan said. The objective is to
> make
> > > > > the project lighter and more fit for a GSoC.
> > > > >
> > > > > Knowing that, the data load/save part from mlpack core is the only
> part
> > > > > that
> > > > > has implementation files (.cpp) while all methods of mlpack are
> > > > > header-only,
> > > > > therefore:
> > > > >
> > > > > 1) It would be nicer to have mlpack as header-only by moving these
> > > > > implementations into header files.
> > > > >
> > > > > 2) I would avoid re-implementing things that have already
> implemented,
> > > > > especially that these parts of code (Loading, Saving, Matrix
> > > > > manipulation, and conversion) need a lot of
> > > > > optimization, which requires years of work to have something
> feasible.
> > > > > However, looking at Xtensor library seems to be similar to Pandas
> > > providing
> > > > > what is in need, in C++, with a good performance.
> > > > >
> > > > > 3) Xtensor integration can be realized by adding a mlpack wrapper
> > > > > (a small light wrapper) for Xtensor functionalities. This wrapper
> can
> > > be
> > > > > integrated into mlpack source code, or can be kept separately (as
> > > > > ensmallen)
> > > > > allowing to be added when needed, therefore only link with library
> that
> > > > > we use (avoiding dependencies).
> > > > >
> > > > > Knowing that the above steps will require more than one GSoC to
> > > > > complete, but they can be done independently. You can choose what
> you
> > > > > find the most suitable and build a proposal upon it allowing to
> have
> > > > > the most possible decoupling between the tasks in order to
> maximize the
> > > > > possible
> > > > > feasibility of the project.
> > > > >
> > > > > I hope you find this helpful !
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Omar
> > > > >
> > > > >
> > > > > On 03/13, Gopi Manohar Tatiraju wrote:
> > > > > > Hey Ryan,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > > I agree that this can be a very big project considering the time
> > > span of
> > > > > > GSoC this year, if we decide to go ahead with this project it
> will be
> > > > > very
> > > > > > important to decide on some base features as you already pointed
> out.
> > > > > >
> > > > > > how will users use this dataframe?
> > > > > >
> > > > > >
> > > > > > We should do it in the same way as we do DatasetInfo, this will
> keep
> > > it
> > > > > > separate from the dataset(arma::mat) so that we don't need to
> change
> > > how
> > > > > we
> > > > > > pass data to the agent.
> > > > > > We will create an object of class mlFrame and pass that to the
> load
> > > > > > function. But we have to make sure that we don't end up making
> > > another
> > > > > copy
> > > > > > of the dataset here as well, might use a bit of help here to
> create
> > > the
> > > > > > skeleton of the class.
> > > > > >
> > > > > >  How will the dataframe integrate with mlpack's existing methods?
> > > > > >
> > > > > >
> > > > > > If we could follow the way I mentioned above we won't need to
> change
> > > any
> > > > > > existing implementations to access or use the data.
> > > > > >
> > > > > > Let's discuss point by point, let me know what you think about
> the
> > > > > > above-mentioned way to implement it or if I need to clear
> anything
> > > more
> > > > > > regarding this, I will address other questions soon as we get a
> basic
> > > > > idea
> > > > > > of the project.
> > > > > >
> > > > > > Regarding the image, there is an ImageInfo class we can extend
> its
> > > > > > functionality to work on a directory of images, but I have not
> yet
> > > > > figured
> > > > > > out if we need a way to display the methods, I mean the info
> > > regarding
> > > > > the
> > > > > > images should be fine right?
> > > > > >
> > > > > > Also, I was thinking of adding some stats to DatasetInfo class,
> > > methods
> > > > > to
> > > > > > show the numerical summary of the dataset which can include mean,
> > > std,
> > > > > min,
> > > > > > max, etc. These are the same methods that I suggested to
> implement in
> > > > > this
> > > > > > PR <https://github.com/mlpack/mlpack/pull/2727>. As most of our
> > > datasets
> > > > > > are numerical for now, I think we should first implement
> > > functionality
> > > > > that
> > > > > > can be utilized by numeric datasets. Let me know what you think.
> > > > > >
> > > > > > Thank you,
> > > > > > Gopi
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <ryan at ratml.org>
> wrote:
> > > > > >
> > > > > > > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju
> > > wrote:
> > > > > > > > I am planning to contribute to mlapck under GSoC 2021.
> > > Currently, I
> > > > > am
> > > > > > > > working on creating a pandas *dataframe-like class* that can
> be
> > > used
> > > > > to
> > > > > > > > analyze the datasets in a better way.
> > > > > > > >
> > > > > > > > Having a class like this would help in working with datasets
> as
> > > ml
> > > > > is not
> > > > > > > > only about the model but about data as well.
> > > > > > > >
> > > > > > > > I have a pr already open for this:
> > > > > > > > https://github.com/mlpack/mlpack/pull/2727
> > > > > > > >
> > > > > > > > I wanted to know if I can work on this in GSoC? As it was not
> > > listed
> > > > > on
> > > > > > > the
> > > > > > > > idea page, but I think this would be a start to something
> useful
> > > and
> > > > > big.
> > > > > > >
> > > > > > > Hey Gopi,
> > > > > > >
> > > > > > > Thanks for working on that PR.  Personally I think this kind of
> > > support
> > > > > > > would be really wonderful---right now, all non-numeric data in
> > > mlpack
> > > > > > > has to be represented via both a `data::DatasetInfo` and an
> > > `arma::mat`
> > > > > > > (since all the internal methods operate on an `arma::mat`).
> > > > > > >
> > > > > > > The internal methods definitely can't change (they simply
> aren't
> > > > > > > designed to work on other types of data), but we certainly
> could
> > > > > improve
> > > > > > > the wrapper class.  You are right that it probably should look
> > > like a
> > > > > > > Pandas dataframe or similar.
> > > > > > >
> > > > > > > There is a lot of support and functionality already in the
> > > > > > > `data::DatasetInfo` class (it's a typedef of the
> > > > > `data::DatasetMapper<>`
> > > > > > > class), and we should definitely build on top of that.
> > > > > > >
> > > > > > > If you wanted to work on this project, it would be best to
> start
> > > with
> > > > > > > the top-level design: how will users use this dataframe?  How
> will
> > > the
> > > > > > > dataframe integrate with mlpack's existing methods?  If I had,
> > > e.g., a
> > > > > > > text dataset I wanted to use TF-IDF with, what would that look
> > > like?
> > > > > > > What would loading a set of images look like?  How could we
> make
> > > sure
> > > > > > > that whether I had a dataframe or just an `arma::mat` of
> numerical
> > > > > data,
> > > > > > > it "felt the same" to work with either?
> > > > > > >
> > > > > > > Those are just some of the questions that need to be answered.
> > > > > > >
> > > > > > > Take a look at this too:
> > > > > > >
> > > > > > >
> > > > >
> > >
> https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b
> > > > > > >
> > > > > > > I wonder if it would be possible to convert an xframe
> > > representation
> > > > > > > into an `arma::mat` without copying any memory at the point at
> > > which an
> > > > > > > mlpack method is called.  (In fact, this even leads to another
> > > > > question:
> > > > > > > could we seamlessly support Eigen matrices by converting an
> Eigen
> > > > > matrix
> > > > > > > to an Armadillo matrix without any conversion?  Or other matrix
> > > > > > > libraries?)
> > > > > > >
> > > > > > > Given that GSoC is a shorter period of work this year, spending
> > > lots of
> > > > > > > time implementing custom converters and things like this is
> > > probably
> > > > > too
> > > > > > > much work---we should see how much we can wrap existing work
> and
> > > the
> > > > > > > GSoC project should probably focus on making sure that the
> > > interfaces
> > > > > > > all work right.
> > > > > > >
> > > > > > > I hope this is helpful!
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Ryan
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Curtin    | "I am."
> > > > > > > ryan at ratml.org |   - Joe
> > > > > > >
> > > > >
> > > > > > _______________________________________________
> > > > > > mlpack mailing list
> > > > > > mlpack at lists.mlpack.org
> > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> > > > >
> > > > >
> > >
> > > > _______________________________________________
> > > > mlpack mailing list
> > > > mlpack at lists.mlpack.org
> > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> > >
> > >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://knife.lugatgt.org/pipermail/mlpack/attachments/20210322/090fa994/attachment-0001.htm>


More information about the mlpack mailing list