[mlpack] Suggestions needed on basic outline

Ryan Curtin ryan at ratml.org
Mon Mar 14 09:58:11 EDT 2016


On Mon, Mar 14, 2016 at 05:53:15AM +0530, nirmal singhania wrote:
> Hello,

Hi Nirmal,

There is no need to send your email multiple times.  Everyone on the
list received it the first time.

> Preprocessing modules can include:
> 1) Checking a dataset for loading problems and printing errors
> 2) Standardization module (mean removal and variance scaling) using z-scores
> 3) Scaling features to a range (min-max)
> 4) Handling missing values/NA, either by removing the entire rows/columns
>    containing missing values, or by imputing the missing values from the
>    given data
> 5) Scaling data with outliers
> 6) Converting categorical features into binary features
> 
> 7) Normalization of data (not required for every ML algorithm, but it
>    doesn't hurt if applied)
> 8) Splitting a dataset into a training and test set
> 
> Other features we can consider adding:
> 1) Handling class imbalance (SMOTE (Synthetic Minority Over-sampling
>    Technique), oversampling, and undersampling)
> 2) Quantization of numerical attributes

Do you mean quantization of categorical attributes here?

> A C++ API will be developed which will serve the purpose of
> pre-processing data before using any mlpack algorithm on it.
> A command-line interface will also be developed through which the user
> can check for problems and apply pre-processing methods to a data set.
> The command line and C++ API will initially support CSV and ARFF files,
> and support for other formats may be added later.
> There will be an option to save the pre-processed data set.
> Optional: one extra feature which could be added is converting
> pre-processed ARFF to CSV and vice versa.
> 
> Since data handling and pre-processing will be a crucial and common
> step, extensive documentation will be created using Doxygen on:
> 1) How to use the various methods present in the C++ API
> 2) How to handle and pre-process data using the command line
> 
> Sample programs and tutorials on the various data handling steps will
> also be created using some open datasets.
> 
> 
> I want to ask how much information about each of the above steps I
> should give in my proposal to make it a good proposal.

I like the ideas you've proposed here.  When you put your proposal
together, though, please spend some time detailing what the proposed C++
API will be (and we can go back and forth on this if necessary).  I
think maybe the design guidelines would be helpful here:

https://github.com/mlpack/mlpack/wiki/DesignGuidelines
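
Just to illustrate the level of detail that is helpful: below is a very
rough sketch of what, say, a z-score standardization class could look
like.  Everything here (the class name, method names, member layout) is
hypothetical and only meant to show the kind of API description the
proposal should contain, not an existing mlpack interface.

    #include <mlpack/core.hpp>

    // Hypothetical example: scales each dimension (row) of a column-major
    // dataset to zero mean and unit variance, following the mlpack
    // convention of one point per column.
    class StandardScaler
    {
     public:
      // Learn per-dimension means and standard deviations from the data.
      void Fit(const arma::mat& input)
      {
        mean = arma::mean(input, 1);
        stdDev = arma::stddev(input, 0, 1);
        stdDev.replace(0.0, 1.0); // Avoid dividing by zero on constant rows.
      }

      // Apply the learned transformation: (x - mean) / stdDev per dimension.
      void Transform(const arma::mat& input, arma::mat& output) const
      {
        output = input;
        output.each_col() -= mean;
        output.each_col() /= stdDev;
      }

     private:
      arma::vec mean;
      arma::vec stdDev;
    };

A short usage example in the proposal (Fit() on the training set,
Transform() on both the training and test sets) would also make the
intended workflow clear.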

A couple other thoughts:

 * Don't worry about writing an imputer.  A colleague of mine and I are
   planning on adding this support in the next few months.  Detecting
   NaNs and missing values in a dataset is a good idea though.

 * We should try to support all of the file formats that Armadillo
   supports, instead of just CSV and ARFF.  It would be good to provide
   a tool that can work with any dataset a user might otherwise use with
   mlpack (a short sketch of this is below).
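
To make those two points concrete, here is a minimal sketch of the
"check a dataset" step using calls that already exist: data::Load()
picks the loader based on the file extension (so it covers the
Armadillo-supported formats for free), and Armadillo's has_nan() can
flag missing values that were loaded as NaN.

    #include <mlpack/core.hpp>
    #include <iostream>

    int main(int argc, char** argv)
    {
      if (argc < 2)
      {
        std::cerr << "Usage: " << argv[0] << " <dataset file>" << std::endl;
        return 1;
      }

      arma::mat dataset;

      // Non-fatal load, so we can print our own diagnostic on failure.
      if (!mlpack::data::Load(argv[1], dataset, false))
      {
        std::cerr << "Could not load '" << argv[1] << "'." << std::endl;
        return 1;
      }

      if (dataset.has_nan())
        std::cout << "Dataset contains NaN / missing values." << std::endl;

      std::cout << "Loaded " << dataset.n_cols << " points with "
                << dataset.n_rows << " dimensions." << std::endl;
    }

The real tool would of course report much more (wrong dimensionality,
non-numeric entries, and so on), but the loading and NaN checks don't
need any new machinery.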

I hope this is helpful.  Please let me know if I can clarify anything.

> 2) Implementing decision trees and other algorithms in mlpack
> I have understood the decision stump implementation done by Udit Saxena
> for AdaBoost, and I would like to add more "weak learners" for AdaBoost,
> some of which are already implemented in mlpack and some of which will
> be implemented by me.
> Since decision stumps are basically 1-level decision trees, I would like
> to continue Udit Saxena's work and implement full-fledged decision trees
> like ID3, C4.5, C5.0, and CART.
> I also looked at the code for DET (density estimation trees) and would
> like to borrow tree construction ideas from it.
> 
> I will also try to implement NB-Tree (Naive Bayes Tree) and CI-Tree
> (Conditional Inference Tree), which are very useful in some tasks.
> I have some knowledge of the above-mentioned methods and am currently
> going through the literature for more information and implementation
> details.
> 
> All the above points about documentation and tutorials also apply here,
> since we are adding new algorithms to the mlpack library.
> Testing of the implemented algorithms will be an important phase of this
> project.
> Also, as everyone knows, mlpack is known for its speed and scalability.
> We will benchmark the new implementations against similar methods
> available in scikit-learn, Weka, R, and the Shogun machine learning
> toolkit, and the results will be provided via interactive charts.
> The automatic benchmarking system built by Marcus Edel and Anand Soni
> during GSoC will be used for benchmarking:
> https://github.com/zoq/benchmarks

I think that you should focus on just one of these two ideas; it's hard
to write two good proposals.  Again the same advice applies for this
proposal: make sure to spend some time designing the API and mentioning
what it will be in your proposal.
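
For the decision tree idea in particular, "designing the API" mostly
means writing down what a weak learner has to look like so that it can
be dropped into the existing AdaBoost<WeakLearnerType> template next to
the decision stump.  As a deliberately trivial sketch (the names and
exact constructor parameters below are illustrative only; the real
requirements should be read off the existing DecisionStump code), a
weak learner that just predicts the class with the largest total
instance weight could look like this:

    #include <mlpack/core.hpp>

    class MajorityClassLearner
    {
     public:
      // Train on weighted data, as a boosting round requires.
      MajorityClassLearner(const arma::mat& /* data */,
                           const arma::Row<size_t>& labels,
                           const size_t numClasses,
                           const arma::rowvec& weights) : majorityClass(0)
      {
        // Pick the class with the largest total instance weight.
        arma::vec classWeights(numClasses, arma::fill::zeros);
        for (size_t i = 0; i < labels.n_elem; ++i)
          classWeights(labels(i)) += weights(i);
        majorityClass = classWeights.index_max();
      }

      // Predict labels for a set of test points (one point per column).
      void Classify(const arma::mat& test,
                    arma::Row<size_t>& predictions) const
      {
        predictions.set_size(test.n_cols);
        predictions.fill(majorityClass);
      }

     private:
      size_t majorityClass;
    };

A proposal that spells out the equivalent interface for ID3/C4.5/CART
(what the constructor takes, how splits are represented, what Classify()
returns) will be much easier to evaluate.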

Thanks,

Ryan

-- 
Ryan Curtin    | "I just ran out of it, you see."
ryan at ratml.org |   - Howard Beale


