[mlpack] GSoC-2021

Gopi Manohar Tatiraju deathcoderx at gmail.com
Fri Apr 2 13:14:12 EDT 2021


Hey,

I was not able to work it out with Fast CSV Parser, but I edited the
existing rapidcsv code to read the whole data column-wise. Each column is
returned to us as a std::vector, which I convert to an arma::fvec and then
insert as a column of the arma::fmat.
Suggested code changes:


>       // Allocate once; every entry is overwritten in the loop below.
>       arma::fmat mat(doc.GetRowCount(), doc.GetColumnCount());
>       for (size_t i = 0; i < doc.GetColumnCount(); i++)
>       {
>         // GetColumn returns a std::vector<float>; copy it into column i.
>         mat.col(i) = arma::fvec(doc.GetColumn<float>(i));
>       }
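
To show why copying a whole column at a time is a natural fit, here is a
minimal library-free sketch. Armadillo stores matrices in column-major
order, so each column is one contiguous block of memory. PackColumnMajor is
a hypothetical helper written only for illustration; it is not part of
mlpack, Armadillo, or rapidcsv.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): packs equally sized columns
// into one column-major buffer, where element (r, c) is stored at index
// c * nRows + r.
std::vector<float> PackColumnMajor(
    const std::vector<std::vector<float>>& columns)
{
  if (columns.empty())
    return {};

  const std::size_t nRows = columns[0].size();
  std::vector<float> buffer(nRows * columns.size());

  for (std::size_t c = 0; c < columns.size(); ++c)
  {
    // Every column must have the same number of rows.
    assert(columns[c].size() == nRows);
    // A whole column is copied as one contiguous run of the buffer.
    std::copy(columns[c].begin(), columns[c].end(),
              buffer.begin() + static_cast<std::ptrdiff_t>(c * nRows));
  }
  return buffer;
}
```

A buffer filled this way has the same element order as an arma::fmat with
the same dimensions, which is why the per-column copy above needs no inner
loop over individual elements.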


I am running the benchmark code now. It is going to take some time, so I
will upload once the code finishes compiling.
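
For the timing itself, something like the following could be used. This is
only a sketch under my own assumptions, not the code in the benchmark repo:
MedianMilliseconds is a hypothetical helper name, and taking the median of
several runs is just one reasonable choice on a slow or noisy machine.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): runs `fn` `repeats` times and
// returns the median wall-clock time of one run in milliseconds. For an
// even number of runs the upper of the two middle values is returned.
template <typename Fn>
double MedianMilliseconds(Fn&& fn, int repeats)
{
  assert(repeats > 0);
  std::vector<double> times;
  times.reserve(static_cast<std::size_t>(repeats));

  for (int i = 0; i < repeats; ++i)
  {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    times.push_back(
        std::chrono::duration<double, std::milli>(stop - start).count());
  }

  // The median is less sensitive to one unusually slow run than the mean.
  std::sort(times.begin(), times.end());
  return times[times.size() / 2];
}
```

Each parser's load step would be wrapped in a lambda and passed as `fn`, so
all of them are measured in exactly the same way.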
Also, any idea regarding the other parser would help.

Thanks,
Gopi

On Fri, Apr 2, 2021 at 12:47 AM Gopi Manohar Tatiraju <deathcoderx at gmail.com>
wrote:

> Hey,
>
> Was working on it.
> Here's the link:
> https://github.com/heisenbuug/Benchmark-CSV-Parsers/blob/main/csvparser_log_check.ipynb
>
> Thanks,
> Gopi
>
> On Fri, Apr 2, 2021 at 12:28 AM Omar Shrit <omar at shrit.me> wrote:
>
>> Hello Gopi,
>>
>> Thank you for starting the benchmark, would it be possible to plot the
>> log and add the results to the open pull request to get a better
>> comparison?
>>
>> The code seems to be fine, it can be optimized, but I would wait to see
>> the plots.
>>
>> Thanks,
>>
>> Omar
>>
>> On 04/01, Gopi Manohar Tatiraju wrote:
>> > Hey Omar,
>> >
>> > Sorry, it took longer. I was running the benchmark code since this
>> > morning and it took a lot of time as my system is a bit slow.
>> > I compared the default armadillo parser, mlpack's custom parser, and
>> > rapidcsv.
>> >
>> > Can you verify the code I used? I might have done something wrong and
>> > it took a lot of time to run this code, but that may just be because my
>> > system is not that powerful.
>> > *Link to the repo and log file:*
>> > https://github.com/heisenbuug/Benchmark-CSV-Parsers
>> >
>> > In the meantime, I will also start working on my draft proposal a bit,
>> > and once we do this testing we can use those results to decide our
>> > plan of action. Let me know if you have any suggestions or points for
>> > the draft proposal.
>> >
>> > Thank you,
>> > Gopi M. Tatiraju
>> >
>> >
>> > On Thu, Apr 1, 2021 at 5:59 PM Omar Shrit <omar at shrit.me> wrote:
>> >
>> > > Hello Gopi,
>> > >
>> > > Would it be possible to benchmark these two and compare them with the
>> > > already existing Boost Spirit parser? If there is a considerable
>> > > difference in performance between these two parsers, then the obvious
>> > > choice will be the faster one. I know that both of them are called
>> > > (fast, rapid), but I have not seen any benchmark yet to know which
>> > > one is faster.
>> > >
>> > > Let me know what you think. The benchmark will help us make a better
>> > > choice, since this is the internal (private) API and will not be used
>> > > by the user directly.
>> > >
>> > > These are my thoughts, let me know what you think.
>> > >
>> > > Omar.
>> > >
>> > > On 04/01, Gopi Manohar Tatiraju wrote:
>> > > > Hey,
>> > > >
>> > > > So, I went through both the libraries we considered for `csv
>> > > > parsers`. I implemented code to load the data from a small example
>> > > > `csv` file into arma::mat. Here is the sample code, let me know
>> > > > what you think. Am I loading into arma::mat the wrong way? Is there
>> > > > a more efficient way?
>> > > >
>> > > > Fast CSV Parser <https://github.com/ben-strasser/fast-cpp-csv-parser>
>> > > > io::CSVReader<4> in("llog.csv");
>> > > > float a, b, c, d;
>> > > > int row = 0;
>> > > > // Size is hard-coded: llog.csv must not have more than 20 rows.
>> > > > arma::mat data(20, 4);
>> > > >
>> > > > // Read one row per iteration and copy its four values.
>> > > > while (in.read_row(a, b, c, d))
>> > > > {
>> > > >   data(row, 0) = a;
>> > > >   data(row, 1) = b;
>> > > >   data(row, 2) = c;
>> > > >   data(row, 3) = d;
>> > > >   row++;
>> > > > }
>> > > >
>> > > > Rapid.csv <https://github.com/d99kris/rapidcsv>
>> > > > // For headerless csv files
>> > > > rapidcsv::Document doc("llog.csv", rapidcsv::LabelParams(-1, -1));
>> > > > arma::mat data(doc.GetRowCount(), doc.GetColumnCount(),
>> > > >     arma::fill::ones);
>> > > >
>> > > > // Holds one parsed row of the file at a time.
>> > > > std::vector<float> row;
>> > > > for (size_t i = 0; i < doc.GetRowCount(); i++)
>> > > > {
>> > > >   row = doc.GetRow<float>(i);
>> > > >   for (size_t j = 0; j < doc.GetColumnCount(); j++)
>> > > >   {
>> > > >     data(i, j) = row[j];
>> > > >   }
>> > > > }
>> > > >
>> > > > After using both, I feel like `rapid.csv` is easier to grasp and
>> > > > work with, and it seems more structured.
>> > > > Let me know your thoughts. Also, if loading like the above example
>> > > > is fine, this can be converted into a function that acts as basic
>> > > > csv file loading into arma::mat, right?
>> > > >
>> > > > Thank You,
>> > > > Gopi
>> > > >
>> > > > On Mon, Mar 29, 2021 at 8:28 PM Omar Shrit <omar at shrit.me> wrote:
>> > > >
>> > > > > Hey Gopi
>> > > > >
>> > > > > On 03/29, Gopi Manohar Tatiraju wrote:
>> > > > > > Hey,
>> > > > > >
>> > > > > > I agree. After going through both candidates a bit, I can see
>> > > > > > that we can offload a lot of work by using a well-implemented
>> > > > > > existing parser. I think I should start by comparing both the
>> > > > > > mentioned libraries to decide which one to use. I will use the
>> > > > > > same benchmark strategy that was discussed in the issue. Does
>> > > > > > that sound good?
>> > > > >
>> > > > > Sounds good to me.
>> > > > >
>> > > > > > And also I think I can work on replacing Boost Spirit in GSoC
>> > > > > > then. This will be a start to the data frame idea. If we are
>> > > > > > left with time after this, I can start the work on the data
>> > > > > > frame as well. Is that something we can consider?
>> > > > >
>> > > > > Yes of course.
>> > > > >
>> > > > > > Thanks,
>> > > > > > Gopi
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Mar 29, 2021 at 7:33 PM Omar Shrit <omar at shrit.me> wrote:
>> > > > > >
>> > > > > > > Hey Gopi,
>> > > > > > >
>> > > > > > > I totally agree with Ryan: using an existing parser will
>> > > > > > > accelerate the project and allow us to move forward with the
>> > > > > > > dataframe class. Also, I do believe that replacing Boost
>> > > > > > > Spirit with an existing parser will take a considerable
>> > > > > > > amount of the summer.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Omar
>> > > > > > >
>> > > > > > > On 03/29, Ryan Curtin wrote:
>> > > > > > > > On Mon, Mar 29, 2021 at 04:17:35PM +0530, Gopi Manohar Tatiraju
>> > > > > > > > wrote:
>> > > > > > > > > Would love to hear your thoughts on whether to go with an
>> > > > > > > > > already implemented parser or build a new one. Also if we
>> > > > > > > > > are planning to build a data frame here then maybe going
>> > > > > > > > > with an in-house parser would be better as we will have
>> > > > > > > > > the ability to design it in such a way that it can extend
>> > > > > > > > > maximum support to the new data frame which we are
>> > > > > > > > > planning to build ahead.
>> > > > > > > >
>> > > > > > > > Hey Gopi,
>> > > > > > > >
>> > > > > > > > Honestly I think it's best to use another package.  Not only
>> > > > > > > > will this free up time to actually work on the dataframe
>> > > > > > > > class, but also it means we are not responsible for
>> > > > > > > > maintenance of the CSV parser.  There are lots of little
>> > > > > > > > complexities and edge cases in parsing (not to mention
>> > > > > > > > efficiency!) and so we can probably get a lot more bang for
>> > > > > > > > our buck here by using an implementation from someone who
>> > > > > > > > has already put down the time to consider all those details.
>> > > > > > > >
>> > > > > > > > Hope this is helpful. :)
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > >
>> > > > > > > > Ryan
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Ryan Curtin    | "Kill them, Machine... kill them all."
>> > > > > > > > ryan at ratml.org |   - Dino Velvet
>> > > > > > >
>> > > > >
>> > >
>>
>