[mlpack] Suggestions needed on basic outline

Tue Mar 15 03:06:38 EDT 2016

The quantization I am referring to here is "converting a continuous range of
values into a finite set of discrete values", e.g. treating 10-20 as one group.
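As a rough illustration of that definition, a minimal equal-width binning
sketch in C++ might look like the following. The function name
QuantizeEqualWidth and its signature are hypothetical and not part of mlpack;
it uses only the standard library and assumes the bins should evenly span the
observed minimum and maximum of the data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an mlpack function): map each value to the index
// of the equal-width bin it falls into, with numBins bins spanning the
// observed [min, max] of the input.
std::vector<std::size_t> QuantizeEqualWidth(const std::vector<double>& values,
                                            const std::size_t numBins)
{
  assert(numBins > 0);

  std::vector<std::size_t> bins(values.size(), 0);
  if (values.empty())
    return bins;

  const auto minmax = std::minmax_element(values.begin(), values.end());
  const double lo = *minmax.first;
  const double width = (*minmax.second - lo) / numBins;

  // All values identical: everything stays in bin 0.
  if (width == 0.0)
    return bins;

  for (std::size_t i = 0; i < values.size(); ++i)
  {
    const std::size_t b = static_cast<std::size_t>((values[i] - lo) / width);
    // The maximum value would land one past the end, so clamp it into the
    // last bin.
    bins[i] = std::min(b, numBins - 1);
  }

  return bins;
}
```

With two bins over values ranging from 10 to 30, everything in [10, 20) would
fall into the first group and everything in [20, 30] into the second. Other
schemes (equal-frequency bins, user-supplied cut points) are equally possible;
this is only the simplest one.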
Thanks for the suggestion to include more file formats.
I'll mail you the structure of how the API will work in some time.
I wanted to ask whether we can continue this discussion on the mailing list
after GSoC starts accepting proposals.

I proposed the second idea because I thought the first idea is a little
easier than the others, as also indicated by the difficulty level on the ideas
page, so less preference may be given to people trying to work on the command
line/C++ API idea.
As the number of slots an organization gets is fixed, I am confused about
which project to work on, and that's why I am working on both.
Please guide me on this.
Have a nice day.

Regards,
Nirmal Singhania
B.tech III Yr

On Mon, Mar 14, 2016 at 7:28 PM, Ryan Curtin <ryan at ratml.org> wrote:

> On Mon, Mar 14, 2016 at 05:53:15AM +0530, nirmal singhania wrote:
> > Hello,
>
> Hi Nirmal,
>
> There is no need to send your email multiple times.  Everyone on the
> list received it the first time.
>
> > Preprocessing Modules can include-
> > 1)checking a dataset for loading problems and printing errors
> > 2)Standardization module(mean removal and variance scaling) using z-score
> > 3)Scaling features to range(min-max)
> > 4)Handling Missing values/na
> >      This can be done by removing the entire rows/columns containing
> > missing values,
> >       or by imputing the missing values using the given data
> > 5)Scaling data with outliers
> > 6)converting categorical features into binary features
> >
> > 7)Normalization of data(Not required for every ML algorithm but it
> > doesn't hurt if applied)
> > 8)splitting a dataset into a training and test set
> >
> > Other features we can consider adding
> > 1)Handling Class Imbalance(SMOTE(Synthetic Minority Over-Sampling
> > Technique),Oversampling and Undersampling)
> > 2)Quantization of Numerical Attributes
>
> Do you mean quantization of categorical attributes here?
>
> > A C++ API will be developed which will serve the purpose of
> > pre-processing data before using any mlpack algorithm on it.
> > A command line interface will also be developed through which the user can
> > check for problems and apply pre-processing methods to a data set.
> > The command line and C++ API will initially support CSV and ARFF files, and
> > support for other formats may be added later.
> > There will be an option to save the pre-processed data set.
> > Optional: one extra feature which can be added is converting pre-processed
> > ARFF to CSV and vice versa.
> >
> > Since data handling and pre-processing will be a crucial and common
> > step, extensive documentation will be created using Doxygen on
> > 1)How to use the various methods present in the C++ API
> > 2)How to handle and pre-process data using the command line
> >
> > Sample Programs and Tutorials on various data handling steps will also be
> > created using some open datasets.
> >
> >
> > I want to ask how much information about each of the above steps I should
> > give in my proposal to make it a good proposal.
>
> I like the ideas you've proposed here.  When you put your proposal
> together, though, please spend some time detailing what the proposed C++
> API will be (and we can go back and forth on this if necessary).  I
> think maybe the design guidelines would be helpful here:
>
> https://github.com/mlpack/mlpack/wiki/DesignGuidelines
>
> A couple other thoughts:
>
>  * Don't worry about writing an imputer.  A colleague of mine and I are
>    planning on adding this support in the next few months.  Detecting
>    NaNs and missing values in a dataset is a good idea though.
>
>  * We should try and support all of the file formats that Armadillo
>    supports, instead of just CSV and ARFF.  It would be good to provide
>    a tool that can work with any dataset a user might otherwise use with
>    mlpack.
>
> I hope this is helpful.  Please let me know if I can clarify anything.
>
> > 2)Implementing decision trees and other algorithms in mlpack
> >  I have understood the decision stump implementation done by Udit Saxena
> > for AdaBoost and would like to add more "weak learners" for AdaBoost, some
> > of which are already implemented in mlpack and some of which will be
> > implemented by me.
> > Since decision stumps are basically 1-level decision trees, I would like to
> > continue Udit Saxena's work and implement full-fledged decision trees
> > like ID3, C4.5, C5.0, and CART.
> > I also looked at the code for DET (Density Estimation Trees) and would like
> > to borrow tree construction ideas from it.
> >
> > I will also try to implement NB-Tree (Naive Bayes Tree) and
> > CI-Tree (Conditional Inference Tree), which are very useful in some tasks.
> > I have some knowledge of the above-mentioned methods and am currently going
> > through the literature for more information on them and their
> > implementation.
> >
> > All the above points about documentation and tutorials also apply here.
> > As we are adding new algorithms to the mlpack library here,
> > testing of the implemented algorithms will be an important phase of this
> > project.
> > Also, as everyone knows, mlpack is known for its speed and scalability.
> > We will benchmark it against similar methods available in
> > scikit-learn, Weka, R, and the Shogun machine learning toolkit,
> > and the results will be provided via interactive charts.
> > The automatic benchmarking system developed by Marcus Edel and Anand Soni
> > during GSoC will be used for benchmarking:
> > https://github.com/zoq/benchmarks
>
> I think that you should focus on just one of these two ideas; it's hard
> to write two good proposals.  Again the same advice applies for this
> proposal: make sure to spend some time designing the API and mentioning
> what it will be in your proposal.
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "I just ran out of it, you see."
> ryan at ratml.org |   - Howard Beale
>