[mlpack] GSoC-2021

Gopi Manohar Tatiraju deathcoderx at gmail.com
Mon Mar 22 15:38:06 EDT 2021


Hey Ryan,

Thanks for the feedback, I too looked at that issue you mentioned but
somehow I overlooked it and didn't mention it here.

If that's the case then do you have any suggestions regarding how to make
mlpack header only?
As pointed out by Omar the data load/save part from mlpack core is the only
part that has implementation files (.cpp)
while all methods of mlpack are header-only. So we were planning to remove
these .cpp files. Currently, there are:

   1. detect_file_type.cpp
   2. load.cpp
   3. load_csv.cpp
   4. load_image.cpp
   5. save_image.cpp

Currently, these are .cpp files we have and we want to change them to .hpp
to make mlpack header only.
All this is kinda pre-work for implementing data frame like class for
mlpack, which I've explained above.

Thanks for the suggestions
Gopi

On Mon, Mar 22, 2021 at 9:58 PM Ryan Curtin <ryan at ratml.org> wrote:

> (I have not read the rest of the thread.)
>
> No need to check the compilation time---I can already tell you it will
> be disastrously large.  We can't put load_csv.cpp into load_csv.hpp, as
> we would then be including the boost::spirit headers in everything that
> included mlpack, which we really need to avoid (because boost::spirit is
> huge).  The idea is to force templates to be instantiated in
> load_csv.cpp, so that they get compiled into libmlpack.so.
>
> Really we need to replace our custom CSV parser with something else so
> that we no longer have a boost::spirit dependency.  See here:
>
> https://github.com/mlpack/mlpack/issues/2646
>
> Thanks,
>
> Ryan
>
> On Mon, Mar 22, 2021 at 09:10:47PM +0530, Gopi Manohar Tatiraju wrote:
> > Hey,
> >
> > Thank you for your quick response.
> >
> > So I can see that we have boost::spirit in load_csv.cpp
> >
> > Currently we have 2 files for load_csv:
> >
> >    1. load_csv.hpp
> >    2. load_csv.cpp
> >
> > I will move the function implementations for .cpp file to .hpp file and
> > will declare them as inline. So we will only have one .hpp file for
> > load_csv.
> >
> > I will also check the compilation time before making these changes.
> >
> > Sounds good?
> >
> > Thanks,
> > Gopi
> >
> > On Mon, Mar 22, 2021 at 8:30 PM Omar Shrit <omar at shrit.me> wrote:
> >
> > > On 03/22, Gopi Manohar Tatiraju wrote:
> > > > Hey Omar,
> > > >
> > > > Thank you for the reply...
> > > > So what I understood is to make files _impl.hpp we need to declare
> class
> > > > templates.
> > > >
> > > > But taking the example of *detect_file *we can see that we only have
> > > > functions declared here. One file for declaring the functions and
> one for
> > > > the implementation.
> > > >
> > > > So maybe, I need to structure these functions into classes, right?
> Coz if
> > > > we want to put functions directly into the header files we need to
> > > declare
> > > > them as inline to avoid the violation of C++'s one definition rule.
> > >
> > > inline should be fine for now.
> > >
> > > > Also, how should I compare the compilation time?
> > >
> > > Use the time command with make in order to get the compilation time
> > >
> > > > Also about the prototype and benchmark you mentioned, can you
> explain a
> > > bit
> > > > more about that. I really want to make this project work and I am
> willing
> > > > to put my time in learning new things.
> > >
> > > You can start by moving class implementation (those using boost
> spirit) to
> > > header files
> > > and recompile mlpack. You need to measure the compilation time before
> > > and after changing implementation into header, that will gives us an
> > > idea about the increase related to boost spirit
> > >
> > > > Thank you for the guidance.
> > > >
> > > > PS. Sorry, I was busy with my college project so it took some time to
> > > reply.
> > > >
> > > > Regards,
> > > > Gopi
> > > >
> > > > On Fri, Mar 19, 2021 at 12:42 AM Omar Shrit <omar at shrit.me> wrote:
> > > >
> > > > > Hello Gopi,
> > > > >
> > > > > Sorry, I am forgot to answer you earlier (A little bit busy),
> thanks
> > > for
> > > > > the ping.
> > > > >
> > > > > On 03/18, Gopi Manohar Tatiraju wrote:
> > > > > > Hey Omar,
> > > > > >
> > > > > > Thank you soo much for such detailed replies and inputs.
> > > > > >
> > > > > > I see that there are many things that we need to get in order
> before
> > > we
> > > > > > start working on this idea.
> > > > > > I think we should start from this point:
> > > > > >
> > > > > > 1) It would be nicer to have mlpack as header-only by moving
> these
> > > > > > > implementations into header files.
> > > > > >
> > > > > >
> > > > > > I went through the mlpack/core/data/ directory and List of .cpp
> > > files in
> > > > > > core/data
> > > > > >
> > > > > >    1. detect_file_type.cpp
> > > > > >    2. load.cpp
> > > > > >    3. load_csv.cpp
> > > > > >    4. load_image.cpp
> > > > > >    5. save_image.cpp
> > > > > >
> > > > > > The convention we follow in mlpack is having two .hpp files, one
> for
> > > > > > declarations and one for implementations.
> > > > >
> > > > > The convention is related to classes that are defined as
> > > > > class-template. I do not think it is possible to convert an
> > > implementation
> > > > > file to `_impl.hpp` if the class is not a class template.
> > > > >
> > > > > Therefore, you can put the entire implementation in the header,
> where
> > > > > the class is defined. There are several disadvantages of putting
> > > everything
> > > > > in one place, for instance, increase compilation time and binary
> > > footprint.
> > > > >
> > > > > It would be nice to have a small prototype and a benchmark to see
> how
> > > > > much time the compilation time will increase, since we are trying
> to
> > > > > reduce these numbers.
> > > > >
> > > > > I did not look into the code yet, but I do not see anything that
> might
> > > > > prevent from having everything in the headers.
> > > > >
> > > > > Also, do not worry about details such as convention, these can be
> seen
> > > > > later especially if a prototype comes back with good results.
> > > > >
> > > > > > I just want to clarify some things so that I can start writing
> some
> > > code
> > > > > > regarding this:
> > > > > >
> > > > > >    - Let's consider detect_file_type, we have for now:
> > > > > >       - detect_file_type.hpp
> > > > > >       <
> > > > >
> > >
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.hpp
> > > > > >
> > > > > >       - detect_file_type.cpp
> > > > > >       <
> > > > >
> > >
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.cpp
> > > > > >
> > > > > >
> > > > > > What we want in the future:
> > > > > >
> > > > > >
> > > > > >    - detect_file_type.hpp
> > > > > >    - detect_file_type_impl.hpp
> > > > > >
> > > > > > Can you just give me one example or reference or just a list of
> > > points
> > > > > that
> > > > > > we need to consider while restructuring these files? What will
> be the
> > > > > major
> > > > > > differences after converting them to .hpp files?
> > > > > >
> > > > > >
> > > > > > It feels great getting feedback on my idea and I get to learn soo
> > > many
> > > > > new
> > > > > > things on the way.
> > > > > >
> > > > > > Thank you,
> > > > > > Gopi.
> > > > > >
> > > > > >
> > > > > > On Sat, Mar 13, 2021 at 4:12 AM Omar Shrit <omar at shrit.me>
> wrote:
> > > > > >
> > > > > > > Hello Gopi,
> > > > > > >
> > > > > > > The data frame class project is indeed a good idea, we have
> thought
> > > > > > > about that, but as Ryan said, it can be a big project for GSoC
> > > given
> > > > > > > the limited period of time this year.
> > > > > > >
> > > > > > > I have several ideas to add on what Ryan said. The objective
> is to
> > > make
> > > > > > > the project lighter and more fit for a GSoC.
> > > > > > >
> > > > > > > Knowing that, the data load/save part from mlpack core is the
> only
> > > part
> > > > > > > that
> > > > > > > has implementation files (.cpp) while all methods of mlpack are
> > > > > > > header-only,
> > > > > > > therefore:
> > > > > > >
> > > > > > > 1) It would be nicer to have mlpack as header-only by moving
> these
> > > > > > > implementations into header files.
> > > > > > >
> > > > > > > 2) I would avoid re-implementing things that have already
> > > implemented,
> > > > > > > especially that these parts of code (Loading, Saving, Matrix
> > > > > > > manipulation, and conversion) need a lot of
> > > > > > > optimization, which requires years of work to have something
> > > feasible.
> > > > > > > However, looking at Xtensor library seems to be similar to
> Pandas
> > > > > providing
> > > > > > > what is in need, in C++, with a good performance.
> > > > > > >
> > > > > > > 3) Xtensor integration can be realized by adding a mlpack
> wrapper
> > > > > > > (a small light wrapper) for Xtensor functionalities. This
> wrapper
> > > can
> > > > > be
> > > > > > > integrated into mlpack source code, or can be kept separately
> (as
> > > > > > > ensmallen)
> > > > > > > allowing to be added when needed, therefore only link with
> library
> > > that
> > > > > > > we use (avoiding dependencies).
> > > > > > >
> > > > > > > Knowing that the above steps will require more than one GSoC to
> > > > > > > complete, but they can be done independently. You can choose
> what
> > > you
> > > > > > > find the most suitable and build a proposal upon it allowing to
> > > have
> > > > > > > the most possible decoupling between the tasks in order to
> > > maximize the
> > > > > > > possible
> > > > > > > feasibility of the project.
> > > > > > >
> > > > > > > I hope you find this helpful !
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Omar
> > > > > > >
> > > > > > >
> > > > > > > On 03/13, Gopi Manohar Tatiraju wrote:
> > > > > > > > Hey Ryan,
> > > > > > > >
> > > > > > > > Thanks for the feedback.
> > > > > > > > I agree that this can be a very big project considering the
> time
> > > > > span of
> > > > > > > > GSoC this year, if we decide to go ahead with this project it
> > > will be
> > > > > > > very
> > > > > > > > important to decide on some base features as you already
> pointed
> > > out.
> > > > > > > >
> > > > > > > > how will users use this dataframe?
> > > > > > > >
> > > > > > > >
> > > > > > > > We should do it in the same way as we do DatasetInfo, this
> will
> > > keep
> > > > > it
> > > > > > > > separate from the dataset(arma::mat) so that we don't need to
> > > change
> > > > > how
> > > > > > > we
> > > > > > > > pass data to the agent.
> > > > > > > > We will create an object of class mlFrame and pass that to
> the
> > > load
> > > > > > > > function. But we have to make sure that we don't end up
> making
> > > > > another
> > > > > > > copy
> > > > > > > > of the dataset here as well, might use a bit of help here to
> > > create
> > > > > the
> > > > > > > > skeleton of the class.
> > > > > > > >
> > > > > > > >  How will the dataframe integrate with mlpack's existing
> methods?
> > > > > > > >
> > > > > > > >
> > > > > > > > If we could follow the way I mentioned above we won't need to
> > > change
> > > > > any
> > > > > > > > existing implementations to access or use the data.
> > > > > > > >
> > > > > > > > Let's discuss point by point, let me know what you think
> about
> > > the
> > > > > > > > above-mentioned way to implement it or if I need to clear
> > > anything
> > > > > more
> > > > > > > > regarding this, I will address other questions soon as we
> get a
> > > basic
> > > > > > > idea
> > > > > > > > of the project.
> > > > > > > >
> > > > > > > > Regarding the image, there is an ImageInfo class we can
> extend
> > > its
> > > > > > > > functionality to work on a directory of images, but I have
> not
> > > yet
> > > > > > > figured
> > > > > > > > out if we need a way to display the methods, I mean the info
> > > > > regarding
> > > > > > > the
> > > > > > > > images should be fine right?
> > > > > > > >
> > > > > > > > Also, I was thinking of adding some stats to DatasetInfo
> class,
> > > > > methods
> > > > > > > to
> > > > > > > > show the numerical summary of the dataset which can include
> mean,
> > > > > std,
> > > > > > > min,
> > > > > > > > max, etc. These are the same methods that I suggested to
> > > implement in
> > > > > > > this
> > > > > > > > PR <https://github.com/mlpack/mlpack/pull/2727>. As most of
> our
> > > > > datasets
> > > > > > > > are numerical for now, I think we should first implement
> > > > > functionality
> > > > > > > that
> > > > > > > > can be utilized by numeric datasets. Let me know what you
> think.
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > > Gopi
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <ryan at ratml.org>
> > > wrote:
> > > > > > > >
> > > > > > > > > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar
> Tatiraju
> > > > > wrote:
> > > > > > > > > > I am planning to contribute to mlapck under GSoC 2021.
> > > > > Currently, I
> > > > > > > am
> > > > > > > > > > working on creating a pandas *dataframe-like class* that
> can
> > > be
> > > > > used
> > > > > > > to
> > > > > > > > > > analyze the datasets in a better way.
> > > > > > > > > >
> > > > > > > > > > Having a class like this would help in working with
> datasets
> > > as
> > > > > ml
> > > > > > > is not
> > > > > > > > > > only about the model but about data as well.
> > > > > > > > > >
> > > > > > > > > > I have a pr already open for this:
> > > > > > > > > > https://github.com/mlpack/mlpack/pull/2727
> > > > > > > > > >
> > > > > > > > > > I wanted to know if I can work on this in GSoC? As it
> was not
> > > > > listed
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > idea page, but I think this would be a start to something
> > > useful
> > > > > and
> > > > > > > big.
> > > > > > > > >
> > > > > > > > > Hey Gopi,
> > > > > > > > >
> > > > > > > > > Thanks for working on that PR.  Personally I think this
> kind of
> > > > > support
> > > > > > > > > would be really wonderful---right now, all non-numeric
> data in
> > > > > mlpack
> > > > > > > > > has to be represented via both a `data::DatasetInfo` and an
> > > > > `arma::mat`
> > > > > > > > > (since all the internal methods operate on an `arma::mat`).
> > > > > > > > >
> > > > > > > > > The internal methods definitely can't change (they simply
> > > aren't
> > > > > > > > > designed to work on other types of data), but we certainly
> > > could
> > > > > > > improve
> > > > > > > > > the wrapper class.  You are right that it probably should
> look
> > > > > like a
> > > > > > > > > Pandas dataframe or similar.
> > > > > > > > >
> > > > > > > > > There is a lot of support and functionality already in the
> > > > > > > > > `data::DatasetInfo` class (it's a typedef of the
> > > > > > > `data::DatasetMapper<>`
> > > > > > > > > class), and we should definitely build on top of that.
> > > > > > > > >
> > > > > > > > > If you wanted to work on this project, it would be best to
> > > start
> > > > > with
> > > > > > > > > the top-level design: how will users use this dataframe?
> How
> > > will
> > > > > the
> > > > > > > > > dataframe integrate with mlpack's existing methods?  If I
> had,
> > > > > e.g., a
> > > > > > > > > text dataset I wanted to use TF-IDF with, what would that
> look
> > > > > like?
> > > > > > > > > What would loading a set of images look like?  How could we
> > > make
> > > > > sure
> > > > > > > > > that whether I had a dataframe or just an `arma::mat` of
> > > numerical
> > > > > > > data,
> > > > > > > > > it "felt the same" to work with either?
> > > > > > > > >
> > > > > > > > > Those are just some of the questions that need to be
> answered.
> > > > > > > > >
> > > > > > > > > Take a look at this too:
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b
> > > > > > > > >
> > > > > > > > > I wonder if it would be possible to convert an xframe
> > > > > representation
> > > > > > > > > into an `arma::mat` without copying any memory at the
> point at
> > > > > which an
> > > > > > > > > mlpack method is called.  (In fact, this even leads to
> another
> > > > > > > question:
> > > > > > > > > could we seamlessly support Eigen matrices by converting an
> > > Eigen
> > > > > > > matrix
> > > > > > > > > to an Armadillo matrix without any conversion?  Or other
> matrix
> > > > > > > > > libraries?)
> > > > > > > > >
> > > > > > > > > Given that GSoC is a shorter period of work this year,
> spending
> > > > > lots of
> > > > > > > > > time implementing custom converters and things like this is
> > > > > probably
> > > > > > > too
> > > > > > > > > much work---we should see how much we can wrap existing
> work
> > > and
> > > > > the
> > > > > > > > > GSoC project should probably focus on making sure that the
> > > > > interfaces
> > > > > > > > > all work right.
> > > > > > > > >
> > > > > > > > > I hope this is helpful!
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Ryan
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Ryan Curtin    | "I am."
> > > > > > > > > ryan at ratml.org |   - Joe
> > > > > > > > >
> > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > mlpack mailing list
> > > > > > > > mlpack at lists.mlpack.org
> > > > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> > > > > > >
> > > > > > >
> > > > >
> > > > > > _______________________________________________
> > > > > > mlpack mailing list
> > > > > > mlpack at lists.mlpack.org
> > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> > > > >
> > > > >
> > >
>
> > _______________________________________________
> > mlpack mailing list
> > mlpack at lists.mlpack.org
> > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
>
>
> --
> Ryan Curtin    | "Closets bring order to a chaotic world!"
> ryan at ratml.org |  - Jay
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://knife.lugatgt.org/pipermail/mlpack/attachments/20210323/13b70dab/attachment-0001.htm>


More information about the mlpack mailing list