[mlpack] GSoC-2021

Ryan Curtin ryan at ratml.org
Mon Mar 22 12:28:02 EDT 2021


(I have not read the rest of the thread.)

No need to check the compilation time---I can already tell you it will
be disastrously large.  We can't put load_csv.cpp into load_csv.hpp, as
we would then be including the boost::spirit headers in everything that
included mlpack, which we really need to avoid (because boost::spirit is
huge).  The idea is to force templates to be instantiated in
load_csv.cpp, so that they get compiled into libmlpack.so.

Really we need to replace our custom CSV parser with something else so
that we no longer have a boost::spirit dependency.  See here:

https://github.com/mlpack/mlpack/issues/2646

Thanks,

Ryan

On Mon, Mar 22, 2021 at 09:10:47PM +0530, Gopi Manohar Tatiraju wrote:
> Hey,
> 
> Thank you for your quick response.
> 
> So I can see that we have boost::spirit in load_csv.cpp
> 
> Currently we have 2 files for load_csv:
> 
>    1. load_csv.hpp
>    2. load_csv.cpp
> 
> I will move the function implementations for .cpp file to .hpp file and
> will declare them as inline. So we will only have one .hpp file for
> load_csv.
> 
> I will also check the compilation time before making these changes.
> 
> Sounds good?
> 
> Thanks,
> Gopi
> 
> On Mon, Mar 22, 2021 at 8:30 PM Omar Shrit <omar at shrit.me> wrote:
> 
> > On 03/22, Gopi Manohar Tatiraju wrote:
> > > Hey Omar,
> > >
> > > Thank you for the reply...
> > > So what I understood is to make files _impl.hpp we need to declare class
> > > templates.
> > >
> > > But taking the example of *detect_file *we can see that we only have
> > > functions declared here. One file for declaring the functions and one for
> > > the implementation.
> > >
> > > So maybe, I need to structure these functions into classes, right? Coz if
> > > we want to put functions directly into the header files we need to
> > declare
> > > them as inline to avoid the violation of C++'s one definition rule.
> >
> > inline should be fine for now.
> >
> > > Also, how should I compare the compilation time?
> >
> > Use the time command with make in order to get the compilation time
> >
> > > Also about the prototype and benchmark you mentioned, can you explain a
> > bit
> > > more about that. I really want to make this project work and I am willing
> > > to put my time in learning new things.
> >
> > You can start by moving class implementation (those using boost spirit) to
> > header files
> > and recompile mlpack. You need to measure the compilation time before
> > and after changing implementation into header, that will gives us an
> > idea about the increase related to boost spirit
> >
> > > Thank you for the guidance.
> > >
> > > PS. Sorry, I was busy with my college project so it took some time to
> > reply.
> > >
> > > Regards,
> > > Gopi
> > >
> > > On Fri, Mar 19, 2021 at 12:42 AM Omar Shrit <omar at shrit.me> wrote:
> > >
> > > > Hello Gopi,
> > > >
> > > > Sorry, I am forgot to answer you earlier (A little bit busy), thanks
> > for
> > > > the ping.
> > > >
> > > > On 03/18, Gopi Manohar Tatiraju wrote:
> > > > > Hey Omar,
> > > > >
> > > > > Thank you soo much for such detailed replies and inputs.
> > > > >
> > > > > I see that there are many things that we need to get in order before
> > we
> > > > > start working on this idea.
> > > > > I think we should start from this point:
> > > > >
> > > > > 1) It would be nicer to have mlpack as header-only by moving these
> > > > > > implementations into header files.
> > > > >
> > > > >
> > > > > I went through the mlpack/core/data/ directory and List of .cpp
> > files in
> > > > > core/data
> > > > >
> > > > >    1. detect_file_type.cpp
> > > > >    2. load.cpp
> > > > >    3. load_csv.cpp
> > > > >    4. load_image.cpp
> > > > >    5. save_image.cpp
> > > > >
> > > > > The convention we follow in mlpack is having two .hpp files, one for
> > > > > declarations and one for implementations.
> > > >
> > > > The convention is related to classes that are defined as
> > > > class-template. I do not think it is possible to convert an
> > implementation
> > > > file to `_impl.hpp` if the class is not a class template.
> > > >
> > > > Therefore, you can put the entire implementation in the header, where
> > > > the class is defined. There are several disadvantages of putting
> > everything
> > > > in one place, for instance, increase compilation time and binary
> > footprint.
> > > >
> > > > It would be nice to have a small prototype and a benchmark to see how
> > > > much time the compilation time will increase, since we are trying to
> > > > reduce these numbers.
> > > >
> > > > I did not look into the code yet, but I do not see anything that might
> > > > prevent from having everything in the headers.
> > > >
> > > > Also, do not worry about details such as convention, these can be seen
> > > > later especially if a prototype comes back with good results.
> > > >
> > > > > I just want to clarify some things so that I can start writing some
> > code
> > > > > regarding this:
> > > > >
> > > > >    - Let's consider detect_file_type, we have for now:
> > > > >       - detect_file_type.hpp
> > > > >       <
> > > >
> > https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.hpp
> > > > >
> > > > >       - detect_file_type.cpp
> > > > >       <
> > > >
> > https://github.com/mlpack/mlpack/blob/master/src/mlpack/core/data/detect_file_type.cpp
> > > > >
> > > > >
> > > > > What we want in the future:
> > > > >
> > > > >
> > > > >    - detect_file_type.hpp
> > > > >    - detect_file_type_impl.hpp
> > > > >
> > > > > Can you just give me one example or reference or just a list of
> > points
> > > > that
> > > > > we need to consider while restructuring these files? What will be the
> > > > major
> > > > > differences after converting them to .hpp files?
> > > > >
> > > > >
> > > > > It feels great getting feedback on my idea and I get to learn soo
> > many
> > > > new
> > > > > things on the way.
> > > > >
> > > > > Thank you,
> > > > > Gopi.
> > > > >
> > > > >
> > > > > On Sat, Mar 13, 2021 at 4:12 AM Omar Shrit <omar at shrit.me> wrote:
> > > > >
> > > > > > Hello Gopi,
> > > > > >
> > > > > > The data frame class project is indeed a good idea, we have thought
> > > > > > about that, but as Ryan said, it can be a big project for GSoC
> > given
> > > > > > the limited period of time this year.
> > > > > >
> > > > > > I have several ideas to add on what Ryan said. The objective is to
> > make
> > > > > > the project lighter and more fit for a GSoC.
> > > > > >
> > > > > > Knowing that, the data load/save part from mlpack core is the only
> > part
> > > > > > that
> > > > > > has implementation files (.cpp) while all methods of mlpack are
> > > > > > header-only,
> > > > > > therefore:
> > > > > >
> > > > > > 1) It would be nicer to have mlpack as header-only by moving these
> > > > > > implementations into header files.
> > > > > >
> > > > > > 2) I would avoid re-implementing things that have already
> > implemented,
> > > > > > especially that these parts of code (Loading, Saving, Matrix
> > > > > > manipulation, and conversion) need a lot of
> > > > > > optimization, which requires years of work to have something
> > feasible.
> > > > > > However, looking at Xtensor library seems to be similar to Pandas
> > > > providing
> > > > > > what is in need, in C++, with a good performance.
> > > > > >
> > > > > > 3) Xtensor integration can be realized by adding a mlpack wrapper
> > > > > > (a small light wrapper) for Xtensor functionalities. This wrapper
> > can
> > > > be
> > > > > > integrated into mlpack source code, or can be kept separately (as
> > > > > > ensmallen)
> > > > > > allowing to be added when needed, therefore only link with library
> > that
> > > > > > we use (avoiding dependencies).
> > > > > >
> > > > > > Knowing that the above steps will require more than one GSoC to
> > > > > > complete, but they can be done independently. You can choose what
> > you
> > > > > > find the most suitable and build a proposal upon it allowing to
> > have
> > > > > > the most possible decoupling between the tasks in order to
> > maximize the
> > > > > > possible
> > > > > > feasibility of the project.
> > > > > >
> > > > > > I hope you find this helpful !
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Omar
> > > > > >
> > > > > >
> > > > > > On 03/13, Gopi Manohar Tatiraju wrote:
> > > > > > > Hey Ryan,
> > > > > > >
> > > > > > > Thanks for the feedback.
> > > > > > > I agree that this can be a very big project considering the time
> > > > span of
> > > > > > > GSoC this year, if we decide to go ahead with this project it
> > will be
> > > > > > very
> > > > > > > important to decide on some base features as you already pointed
> > out.
> > > > > > >
> > > > > > > how will users use this dataframe?
> > > > > > >
> > > > > > >
> > > > > > > We should do it in the same way as we do DatasetInfo, this will
> > keep
> > > > it
> > > > > > > separate from the dataset(arma::mat) so that we don't need to
> > change
> > > > how
> > > > > > we
> > > > > > > pass data to the agent.
> > > > > > > We will create an object of class mlFrame and pass that to the
> > load
> > > > > > > function. But we have to make sure that we don't end up making
> > > > another
> > > > > > copy
> > > > > > > of the dataset here as well, might use a bit of help here to
> > create
> > > > the
> > > > > > > skeleton of the class.
> > > > > > >
> > > > > > >  How will the dataframe integrate with mlpack's existing methods?
> > > > > > >
> > > > > > >
> > > > > > > If we could follow the way I mentioned above we won't need to
> > change
> > > > any
> > > > > > > existing implementations to access or use the data.
> > > > > > >
> > > > > > > Let's discuss point by point, let me know what you think about
> > the
> > > > > > > above-mentioned way to implement it or if I need to clear
> > anything
> > > > more
> > > > > > > regarding this, I will address other questions soon as we get a
> > basic
> > > > > > idea
> > > > > > > of the project.
> > > > > > >
> > > > > > > Regarding the image, there is an ImageInfo class we can extend
> > its
> > > > > > > functionality to work on a directory of images, but I have not
> > yet
> > > > > > figured
> > > > > > > out if we need a way to display the methods, I mean the info
> > > > regarding
> > > > > > the
> > > > > > > images should be fine right?
> > > > > > >
> > > > > > > Also, I was thinking of adding some stats to DatasetInfo class,
> > > > methods
> > > > > > to
> > > > > > > show the numerical summary of the dataset which can include mean,
> > > > std,
> > > > > > min,
> > > > > > > max, etc. These are the same methods that I suggested to
> > implement in
> > > > > > this
> > > > > > > PR <https://github.com/mlpack/mlpack/pull/2727>. As most of our
> > > > datasets
> > > > > > > are numerical for now, I think we should first implement
> > > > functionality
> > > > > > that
> > > > > > > can be utilized by numeric datasets. Let me know what you think.
> > > > > > >
> > > > > > > Thank you,
> > > > > > > Gopi
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Mar 12, 2021 at 9:35 PM Ryan Curtin <ryan at ratml.org>
> > wrote:
> > > > > > >
> > > > > > > > On Wed, Mar 10, 2021 at 01:25:04AM +0530, Gopi Manohar Tatiraju
> > > > wrote:
> > > > > > > > > I am planning to contribute to mlapck under GSoC 2021.
> > > > Currently, I
> > > > > > am
> > > > > > > > > working on creating a pandas *dataframe-like class* that can
> > be
> > > > used
> > > > > > to
> > > > > > > > > analyze the datasets in a better way.
> > > > > > > > >
> > > > > > > > > Having a class like this would help in working with datasets
> > as
> > > > ml
> > > > > > is not
> > > > > > > > > only about the model but about data as well.
> > > > > > > > >
> > > > > > > > > I have a pr already open for this:
> > > > > > > > > https://github.com/mlpack/mlpack/pull/2727
> > > > > > > > >
> > > > > > > > > I wanted to know if I can work on this in GSoC? As it was not
> > > > listed
> > > > > > on
> > > > > > > > the
> > > > > > > > > idea page, but I think this would be a start to something
> > useful
> > > > and
> > > > > > big.
> > > > > > > >
> > > > > > > > Hey Gopi,
> > > > > > > >
> > > > > > > > Thanks for working on that PR.  Personally I think this kind of
> > > > support
> > > > > > > > would be really wonderful---right now, all non-numeric data in
> > > > mlpack
> > > > > > > > has to be represented via both a `data::DatasetInfo` and an
> > > > `arma::mat`
> > > > > > > > (since all the internal methods operate on an `arma::mat`).
> > > > > > > >
> > > > > > > > The internal methods definitely can't change (they simply
> > aren't
> > > > > > > > designed to work on other types of data), but we certainly
> > could
> > > > > > improve
> > > > > > > > the wrapper class.  You are right that it probably should look
> > > > like a
> > > > > > > > Pandas dataframe or similar.
> > > > > > > >
> > > > > > > > There is a lot of support and functionality already in the
> > > > > > > > `data::DatasetInfo` class (it's a typedef of the
> > > > > > `data::DatasetMapper<>`
> > > > > > > > class), and we should definitely build on top of that.
> > > > > > > >
> > > > > > > > If you wanted to work on this project, it would be best to
> > start
> > > > with
> > > > > > > > the top-level design: how will users use this dataframe?  How
> > will
> > > > the
> > > > > > > > dataframe integrate with mlpack's existing methods?  If I had,
> > > > e.g., a
> > > > > > > > text dataset I wanted to use TF-IDF with, what would that look
> > > > like?
> > > > > > > > What would loading a set of images look like?  How could we
> > make
> > > > sure
> > > > > > > > that whether I had a dataframe or just an `arma::mat` of
> > numerical
> > > > > > data,
> > > > > > > > it "felt the same" to work with either?
> > > > > > > >
> > > > > > > > Those are just some of the questions that need to be answered.
> > > > > > > >
> > > > > > > > Take a look at this too:
> > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://medium.com/@johan.mabille/xframe-towards-a-c-dataframe-26e1ccde211b
> > > > > > > >
> > > > > > > > I wonder if it would be possible to convert an xframe
> > > > representation
> > > > > > > > into an `arma::mat` without copying any memory at the point at
> > > > which an
> > > > > > > > mlpack method is called.  (In fact, this even leads to another
> > > > > > question:
> > > > > > > > could we seamlessly support Eigen matrices by converting an
> > Eigen
> > > > > > matrix
> > > > > > > > to an Armadillo matrix without any conversion?  Or other matrix
> > > > > > > > libraries?)
> > > > > > > >
> > > > > > > > Given that GSoC is a shorter period of work this year, spending
> > > > lots of
> > > > > > > > time implementing custom converters and things like this is
> > > > probably
> > > > > > too
> > > > > > > > much work---we should see how much we can wrap existing work
> > and
> > > > the
> > > > > > > > GSoC project should probably focus on making sure that the
> > > > interfaces
> > > > > > > > all work right.
> > > > > > > >
> > > > > > > > I hope this is helpful!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Ryan
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ryan Curtin    | "I am."
> > > > > > > > ryan at ratml.org |   - Joe
> > > > > > > >
> > > > > >
> > > > > > > _______________________________________________
> > > > > > > mlpack mailing list
> > > > > > > mlpack at lists.mlpack.org
> > > > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> > > > > >
> > > > > >
> > > >
> > > > > _______________________________________________
> > > > > mlpack mailing list
> > > > > mlpack at lists.mlpack.org
> > > > > http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
> > > >
> > > >
> >

> _______________________________________________
> mlpack mailing list
> mlpack at lists.mlpack.org
> http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack


-- 
Ryan Curtin    | "Closets bring order to a chaotic world!"
ryan at ratml.org |  - Jay


More information about the mlpack mailing list