[mlpack] GSoC-2021

Thu Apr 1 14:26:37 EDT 2021

Hey Omar,

Sorry this took longer than expected. I have been running the benchmark code
since this morning, and it took a long time because my system is a bit slow.
I compared the default Armadillo parser, mlpack's custom parser, and
rapidcsv.

Can you verify the code I used? I might have done something wrong, since it
took a long time to run, though that may just be because my system is not
that powerful.
*Link to the repo and log file:*
https://github.com/heisenbuug/Benchmark-CSV-Parsers
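
For reference, here is a minimal sketch of the kind of timing wrapper such a
comparison could use. It is not the code in the repo above; the name
`MedianMillis` and the choice of the median over several runs are assumptions
made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `load` `repeats` times and returns the median
// wall-clock time in milliseconds. `load` is any callable that parses one
// file with the parser under test.
template <typename Loader>
double MedianMillis(Loader&& load, std::size_t repeats)
{
  assert(repeats > 0);

  std::vector<double> samples;
  samples.reserve(repeats);
  for (std::size_t i = 0; i < repeats; ++i)
  {
    const auto start = std::chrono::steady_clock::now();
    load();
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(
        std::chrono::duration<double, std::milli>(stop - start).count());
  }

  // Take the (upper) median so that one unusually slow run, e.g. the first
  // read with a cold file cache, does not dominate the result.
  const std::size_t mid = samples.size() / 2;
  std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
  return samples[mid];
}
```

Each parser would then be wrapped in a lambda and passed in with the same
input file and the same number of repeats, so the three numbers are
comparable.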

In the meantime, I will also start working on my draft proposal a bit, and
once we do this testing we can use those results to decide our
plan of action. Let me know if you have any suggestions or points for the
draft proposal.

Thank you,
Gopi M. Tatiraju

On Thu, Apr 1, 2021 at 5:59 PM Omar Shrit <omar at shrit.me> wrote:

> Hello Gopi,
>
> Would it be possible to benchmark these two and compare them with the
> already existing Boost Spirit parser? If there is a considerable
> difference in performance between these two parsers, then the obvious
> choice will be the faster one. I know that both of them are called
> (fast, rapid), but I have not seen any benchmark yet that shows which
> one is faster.
>
> The benchmark will help us make a better choice, since this is an
> internal (private) API and will not be used by the user directly.
>
> These are my thoughts; let me know what you think.
>
> Omar.
>
> On 04/01, Gopi Manohar Tatiraju wrote:
> > Hey,
> >
> > So, I went through both of the libraries we considered for CSV parsing.
> > I implemented code to load the data from a small example `csv` file
> > into arma::mat. Here is the sample code; let me know what you think.
> > Am I loading into arma::mat the wrong way? Is there a more efficient
> > way?
> >
> > Fast CSV Parser <https://github.com/ben-strasser/fast-cpp-csv-parser>
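> > #include <armadillo>
> > #include "csv.h"  // fast-cpp-csv-parser header
> >
> > io::CSVReader<4> in("llog.csv");
> > double a, b, c, d;  // arma::mat stores doubles
> > arma::uword row = 0;
> > arma::mat data(20, 4);  // assumes llog.csv has at most 20 rows
> >
> > // Copy each parsed row into the matrix; stop before overrunning it.
> > while (row < data.n_rows && in.read_row(a, b, c, d)) {
> >   data(row, 0) = a;
> >   data(row, 1) = b;
> >   data(row, 2) = c;
> >   data(row, 3) = d;
> >   row++;
> > }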
> >
> > rapidcsv <https://github.com/d99kris/rapidcsv>
> > #include <armadillo>
> > #include <vector>
> > #include "rapidcsv.h"
> >
> > // For headerless csv files
> > rapidcsv::Document doc("llog.csv", rapidcsv::LabelParams(-1, -1));
> > arma::mat data(doc.GetRowCount(), doc.GetColumnCount(), arma::fill::ones);
> >
> > // Copy the document into the matrix one row at a time.
> > std::vector<double> row;
> > for (size_t i = 0; i < doc.GetRowCount(); i++)
> > {
> >   row = doc.GetRow<double>(i);
> >   for (size_t j = 0; j < doc.GetColumnCount(); j++)
> >   {
> >     data(i, j) = row[j];
> >   }
> > }
> >
> > After using both, I feel that `rapidcsv` is easier to grasp and work
> > with, and it seems more structured.
> > Let me know your thoughts. Also, if loading the data as in the examples
> > above is fine, this can be converted into a function that acts as a
> > basic CSV loader for arma::mat, right?
> >
> > Thank You,
> > Gopi
> >
> > On Mon, Mar 29, 2021 at 8:28 PM Omar Shrit <omar at shrit.me> wrote:
> >
> > > Hey Gopi
> > >
> > > On 03/29, Gopi Manohar Tatiraju wrote:
> > > > Hey,
> > > >
> > > > I agree. After going through both candidates a bit, I can see that
> > > > we can offload a lot of work by using a well-implemented existing
> > > > parser. I think I should start by comparing the two libraries
> > > > mentioned to decide which one to use. I will use the same benchmark
> > > > strategy that was discussed in the issue. Does that sound good?
> > >
> > > Sounds good to me.
> > >
> > > > I also think I can work on replacing Boost Spirit in GSoC then.
> > > > This will be a start on the data frame idea. If there is time left
> > > > after this, I can start work on the data frame as well. Would that
> > > > be worth considering?
> > >
> > > Yes of course.
> > >
> > > > Thanks,
> > > > Gopi
> > > >
> > > >
> > > > On Mon, Mar 29, 2021 at 7:33 PM Omar Shrit <omar at shrit.me> wrote:
> > > >
> > > > > Hey Gopi,
> > > > >
> > > > > I totally agree with Ryan: using an existing parser will
> > > > > accelerate the project and allow us to move forward with the
> > > > > dataframe class. Also, I do believe that replacing Boost Spirit
> > > > > with an existing parser will take a considerable amount of the
> > > > > summer.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Omar
> > > > >
> > > > > On 03/29, Ryan Curtin wrote:
> > > > > > On Mon, Mar 29, 2021 at 04:17:35PM +0530, Gopi Manohar Tatiraju wrote:
> > > > > > > Would love to hear your thoughts on whether to go with an
> > > > > > > already implemented parser or build a new one. Also if we are
> > > > > > > planning to build a data frame here then maybe going with an
> > > > > > > in-house parser would be better as we will have the ability
> > > > > > > to design it in such a way that it can extend maximum support
> > > > > > > to the new data frame which we are planning to build ahead.
> > > > > >
> > > > > > Hey Gopi,
> > > > > >
> > > > > > Honestly I think it's best to use another package.  Not only
> > > > > > will this free up time to actually work on the dataframe class,
> > > > > > but also it means we are not responsible for maintenance of the
> > > > > > CSV parser.  There are lots of little complexities and edge
> > > > > > cases in parsing (not to mention efficiency!) and so we can
> > > > > > probably get a lot more bang for our buck here by using an
> > > > > > implementation from someone who has already put down the time
> > > > > > to consider all those details.
> > > > > >
> > > > > > Hope this is helpful. :)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Ryan
> > > > > >
> > > > > > --
> > > > > > Ryan Curtin    | "Kill them, Machine... kill them all."
> > > > > > ryan at ratml.org |   - Dino Velvet
> > > > >
> > >
>