[mlpack] GSoC-2021

Omar Shrit omar at shrit.me
Thu Apr 1 08:29:42 EDT 2021


Hello Gopi,

Would it be possible to benchmark these two and compare them with the
existing Boost Spirit parser? If there is a considerable difference in
performance between the two, then the obvious choice will be the faster
one. I know that both of them are named for speed (fast, rapid), but I
have not seen any benchmark yet that shows which one is faster.

A benchmark will help us make a better choice: since the parser sits
behind an internal (private) API and will not be used by the user
directly, performance matters more than how easy the library is to use.
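
To make the comparison concrete, here is a minimal sketch of the kind of
timing harness I have in mind. It only uses the standard library, and the
names MakeCsv, ParseCsvNaive and BestTimeMs are hypothetical, made up for
this example. ParseCsvNaive is just a stand-in so the sketch compiles on its
own; for the real comparison it would be replaced by thin wrappers around
Boost Spirit, fast-cpp-csv-parser and rapidcsv.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds a synthetic headerless CSV with `rows` lines
// of `cols` numeric fields, so every parser sees the same input.
std::string MakeCsv(const std::size_t rows, const std::size_t cols)
{
  std::ostringstream out;
  for (std::size_t r = 0; r < rows; ++r)
  {
    for (std::size_t c = 0; c < cols; ++c)
    {
      if (c > 0)
        out << ',';
      out << (r * cols + c) * 0.5;
    }
    out << '\n';
  }
  return out.str();
}

// Hypothetical stand-in for a parser under test: splits each line on commas
// and converts every field with std::strtod. Values are returned row by row.
std::vector<double> ParseCsvNaive(const std::string& text)
{
  std::vector<double> values;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line))
  {
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, ','))
      values.push_back(std::strtod(field.c_str(), nullptr));
  }
  return values;
}

// Calls `parse(text)` `repeats` times and returns the smallest wall-clock
// time in milliseconds; the minimum is less affected by system noise than
// a single run.
template<typename ParseFn>
double BestTimeMs(ParseFn parse, const std::string& text,
                  const std::size_t repeats)
{
  assert(repeats > 0);
  double best = std::numeric_limits<double>::max();
  for (std::size_t r = 0; r < repeats; ++r)
  {
    const auto start = std::chrono::steady_clock::now();
    const auto result = parse(text);
    const auto stop = std::chrono::steady_clock::now();

    // Read the result so the parse call is not optimized away.
    volatile std::size_t sink = result.size();
    (void) sink;

    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    if (ms < best)
      best = ms;
  }
  return best;
}
```

Calling BestTimeMs(ParseCsvNaive, MakeCsv(100000, 4), 10) would give a
baseline number; the same call with a wrapper around each library would give
the numbers to compare. This sketch parses from memory, so it leaves out
file I/O, which the real benchmark should probably include.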

These are my thoughts; let me know what you think.

Omar.

On 04/01, Gopi Manohar Tatiraju wrote:
> Hey,
> 
> So, I went through both of the libraries we considered for `csv parsers`.
> I implemented code to load the data from a small example `csv` file
> into arma::mat; here is the sample code, let me know what you think.
> Am I loading into arma::mat the wrong way? Is there a more efficient
> way?
> 
> Fast CSV Parser <https://github.com/ben-strasser/fast-cpp-csv-parser>
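> #include <armadillo>
> #include "csv.h"
>
> // Assumes llog.csv has no header and at most 20 rows of 4 numeric columns.
> io::CSVReader<4> in("llog.csv");
> double a, b, c, d;  // arma::mat stores double, so read as double
> arma::uword row = 0;
> arma::mat data(20, 4);
>
> // Stop at the end of the file or when the matrix is full.
> while (row < data.n_rows && in.read_row(a, b, c, d))
> {
>   data(row, 0) = a;
>   data(row, 1) = b;
>   data(row, 2) = c;
>   data(row, 3) = d;
>   ++row;
> }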
> 
> rapidcsv <https://github.com/d99kris/rapidcsv>
> #include <armadillo>
> #include <vector>
> #include "rapidcsv.h"
>
> // For headerless csv files
> rapidcsv::Document doc("llog.csv", rapidcsv::LabelParams(-1, -1));
> const size_t rows = doc.GetRowCount();
> const size_t cols = doc.GetColumnCount();
> arma::mat data(rows, cols);  // every element is overwritten below
>
> for (size_t i = 0; i < rows; ++i)
> {
>   // One row of the file, converted to double.
>   const std::vector<double> row = doc.GetRow<double>(i);
>   for (size_t j = 0; j < cols; ++j)
>     data(i, j) = row[j];
> }
> 
> After using both, I feel like `rapidcsv` is easier to grasp and work with
> and seems more structured.
> Let me know your thoughts. Also, if loading like the above example is fine,
> this can be converted into a function that acts as basic csv file
> loading into arma::mat, right?
> 
> Thank You,
> Gopi
> 
> On Mon, Mar 29, 2021 at 8:28 PM Omar Shrit <omar at shrit.me> wrote:
> 
> > Hey Gopi
> >
> > On 03/29, Gopi Manohar Tatiraju wrote:
> > > Hey,
> > >
> > > I agree. After going through both candidates a bit, I can see we can
> > > offload a lot of work by using a well-implemented existing parser.
> > > I think I should start by comparing both of the mentioned libraries to
> > > decide which one to use. I will use the same benchmark strategy that
> > > was discussed in the issue. Does that sound good?
> >
> > Sounds good to me.
> >
> > > And also I think I can work on replacing Boost Spirit in GSoC then. This
> > > will be a start to the data frame idea. If there is time left
> > > after this, I can start the work on the data frame as well. Does that
> > > sound reasonable?
> >
> > Yes of course.
> >
> > > Thanks,
> > > Gopi
> > >
> > >
> > > On Mon, Mar 29, 2021 at 7:33 PM Omar Shrit <omar at shrit.me> wrote:
> > >
> > > > Hey Gopi,
> > > >
> > > > I totally agree with Ryan: using an existing parser will accelerate the
> > > > project and allow us to move forward with the dataframe class. Also, I
> > > > do believe that replacing Boost Spirit with an existing parser will
> > > > take a considerable amount of the summer.
> > > >
> > > > Thanks,
> > > >
> > > > Omar
> > > >
> > > > On 03/29, Ryan Curtin wrote:
> > > > > On Mon, Mar 29, 2021 at 04:17:35PM +0530, Gopi Manohar Tatiraju wrote:
> > > > > > Would love to hear your thoughts on whether to go with an already
> > > > > > implemented parser or build a new one. Also if we are planning to
> > > > > > build a data frame here then maybe going with an in-house parser
> > > > > > would be better as we will have the ability to design it in such a
> > > > > > way that it can extend maximum support to the new data frame
> > > > > > which we are planning to build ahead.
> > > > >
> > > > > Hey Gopi,
> > > > >
> > > > > Honestly I think it's best to use another package.  Not only will
> > > > > this free up time to actually work on the dataframe class, but also
> > > > > it means we are not responsible for maintenance of the CSV parser.
> > > > > There are lots of little complexities and edge cases in parsing (not
> > > > > to mention efficiency!) and so we can probably get a lot more bang
> > > > > for our buck here by using an implementation from someone who has
> > > > > already put down the time to consider all those details.
> > > > >
> > > > > Hope this is helpful. :)
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Ryan
> > > > >
> > > > > --
> > > > > Ryan Curtin    | "Kill them, Machine... kill them all."
> > > > > ryan at ratml.org |   - Dino Velvet
> > > >
> >