[mlpack] Suggestions needed on basic outline

nirmal singhania nirmal.singhania at st.niituniversity.in
Sun Mar 13 20:23:15 EDT 2016


Hello,
I am Nirmal Singhania from NIIT University, India.
I am interested in the projects "command-line program for dataset and
experimentation tools and C++ API" and "Implementing decision trees and
other algorithms" in mlpack.
I've been working on machine learning and data science problems for quite
some time and have experience working with some large and complex datasets
from my coursework. I know the difference between theoretical ML and
practical ML, and the problems you face when you start applying ML to
real-life datasets.

My experience:
I have taken courses on machine learning and data science at Coursera, Udacity,
and edX, and have worked on real-life datasets (e.g. KDD99, NSL-KDD, UNSW-NB15,
MovieLens, Data.gov, the UCI ML Repository, and Kaggle problems). I have
experience with R and scikit-learn.

I am also working on R&D in hybrid classification/clustering techniques
and ensemble learning.

My initial insight into the topics:

1) Command-line program for dataset and experimentation tools and C++ API


As you said on the ideas page, 90% of a data science/ML problem is getting the
data into shape to feed it to an ML algorithm.
If your data is preprocessed correctly, your ML algorithm will work in the
best possible way.
Problems with real-life data are:
1) Noisy data containing outliers
2) Missing values
3) Dirty input data
4) Data format/representation issues
5) Vague data entries

Various other machine learning libraries, such as scikit-learn and R (the caret
package), have their own preprocessing methods.
mlpack also has a speed advantage over other ML libraries, and it should
add a module to preprocess datasets as well.

Preprocessing modules can include:
1) Checking a dataset for loading problems and printing errors
2) Standardization (mean removal and variance scaling) using the z-score
   (a rough code sketch of items 2, 3 and 4 follows this list)
3) Scaling features to a range (min-max)
4) Handling missing values/NA
   This can be done by removing the entire rows/columns containing
   missing values, or by imputing the missing values from the given data.
5) Scaling data with outliers
6) Converting categorical features into binary features
7) Normalization of data (not required for every ML algorithm, but it doesn't
   hurt if applied)
8) Splitting a dataset into a training and test set
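
A rough sketch of how items 2, 3 and 4 could look, written against Armadillo
(which mlpack already depends on). It assumes the mlpack convention that each
column of the matrix is one data point and each row is one feature; the
function names are placeholders I made up, not existing mlpack API.

#include <mlpack/core.hpp>

// Z-score standardization: subtract each feature's mean and divide by
// its standard deviation, in place.
void Standardize(arma::mat& X)
{
  const arma::vec mu = arma::mean(X, 1);      // per-row (per-feature) mean
  arma::vec sigma = arma::stddev(X, 0, 1);    // per-row standard deviation
  sigma.elem(arma::find(sigma == 0.0)).fill(1.0);  // avoid division by zero
  X.each_col() -= mu;
  X.each_col() /= sigma;
}

// Min-max scaling: map each feature into the [0, 1] range, in place.
void MinMaxScale(arma::mat& X)
{
  const arma::vec minVal = arma::min(X, 1);
  const arma::vec maxVal = arma::max(X, 1);
  arma::vec range = maxVal - minVal;
  range.elem(arma::find(range == 0.0)).fill(1.0);  // constant features
  X.each_col() -= minVal;
  X.each_col() /= range;
}

// Mean imputation: replace NaN entries of each feature with the mean of
// that feature's observed values.
void ImputeMean(arma::mat& X)
{
  for (arma::uword r = 0; r < X.n_rows; ++r)
  {
    arma::rowvec row = X.row(r);
    const arma::uvec observedIdx = arma::find_finite(row);
    if (observedIdx.n_elem == 0)
      continue;  // nothing observed for this feature
    const arma::vec observed = row.elem(observedIdx);
    row.elem(arma::find_nonfinite(row)).fill(arma::mean(observed));
    X.row(r) = row;
  }
}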

Other features we can consider adding:
1) Handling class imbalance (SMOTE (Synthetic Minority Over-sampling
   Technique), oversampling, and undersampling); a small sketch follows
   this list
2) Quantization of numerical attributes
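
For item 1 above, full SMOTE (which synthesizes new minority points by
interpolating between nearest neighbours) is fairly involved, so as a first
step something like naive random oversampling could be provided. A very rough
sketch, again assuming points as columns and using made-up function names
(only mlpack::math::RandInt() and the Armadillo calls are existing API):

#include <mlpack/core.hpp>

// Duplicate randomly chosen minority-class points until both classes
// have the same number of points (binary case, for simplicity).
void OversampleMinority(arma::mat& X, arma::Row<size_t>& labels,
                        const size_t minorityClass)
{
  const arma::uvec minorityIdx = arma::find(labels == minorityClass);
  const arma::uvec majorityIdx = arma::find(labels != minorityClass);
  if (minorityIdx.is_empty() || minorityIdx.n_elem >= majorityIdx.n_elem)
    return;  // nothing to balance

  const size_t needed = majorityIdx.n_elem - minorityIdx.n_elem;
  arma::mat extraPoints(X.n_rows, needed);
  arma::Row<size_t> extraLabels(needed);
  for (size_t i = 0; i < needed; ++i)
  {
    // Pick a random minority point and duplicate it.
    const arma::uword j =
        minorityIdx[mlpack::math::RandInt(0, (int) minorityIdx.n_elem)];
    extraPoints.col(i) = X.col(j);
    extraLabels[i] = minorityClass;
  }
  X = arma::join_rows(X, extraPoints);
  labels = arma::join_rows(labels, extraLabels);
}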


A C++ API will be developed which will serve the purpose of
preprocessing data before using any mlpack algorithm on it.
A command-line interface will also be developed, through which the user can
check for problems and apply preprocessing methods to a dataset.
The command-line tool and the C++ API will initially support CSV and ARFF
files, and support for other formats may be added later.
There will be an option to save the preprocessed dataset.
Optional: one extra feature which could be added is converting a preprocessed
ARFF file to CSV and vice versa. A rough end-to-end usage sketch follows.
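
Here is how the C++ side could be used end to end. data::Load() and
data::Save() already exist in mlpack; the preprocessing calls are the
placeholder functions sketched earlier, not existing API:

#include <mlpack/core.hpp>

// Placeholder declarations for the preprocessing functions sketched above.
void ImputeMean(arma::mat& X);
void Standardize(arma::mat& X);

int main()
{
  arma::mat dataset;
  // Load a CSV file; points end up as columns of the matrix.  The third
  // argument makes loading failures fatal instead of just warnings.
  mlpack::data::Load("raw_dataset.csv", dataset, true);

  ImputeMean(dataset);   // proposed: fill in missing values
  Standardize(dataset);  // proposed: zero mean, unit variance per feature

  // Save the preprocessed dataset so any mlpack method can be run on it.
  mlpack::data::Save("clean_dataset.csv", dataset);

  return 0;
}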

Since data handling and preprocessing will be a crucial and common
step, extensive documentation will be created using Doxygen on
1) How to use the various methods present in the C++ API
2) How to handle and preprocess data using the command line

Sample programs and tutorials on various data handling steps will also be
created using some open datasets. A small example of the intended Doxygen
style is given below.
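
As a small illustration of the Doxygen style the documentation would follow
(using the hypothetical Standardize() function from the sketch above):

/**
 * Standardize each feature (row) of the given matrix to zero mean and unit
 * variance, in place.
 *
 * @param X Data matrix; each column is one data point (mlpack convention).
 */
void Standardize(arma::mat& X);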


The implemented C++ API and command-line module will be tested to identify
and resolve bugs and to make them work smoothly with the rest of mlpack.



I want to ask how much information about each of the above steps I should
give in my proposal to make it a good proposal.

Please give your suggestions.



2) Implementing decision trees and other algorithms in mlpack

I have studied the decision stump implementation done by Udit Saxena
for AdaBoost and would like to add more weak learners for AdaBoost, some of
which are already implemented in mlpack and some of which will be implemented
by me.
Since decision stumps are basically 1-level decision trees, I would like to
continue Udit Saxena's work and implement full-fledged decision trees
such as ID3, C4.5, C5.0, and CART.
I also looked at the code for DET (Density Estimation Trees) and would like
to borrow tree-construction ideas from it. A minimal sketch of the
information-gain criterion that ID3-style splits use is given below.
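
Here is that sketch (Shannon entropy and information gain), written against
Armadillo; it is only illustrative and does not mirror mlpack's existing
decision stump internals:

#include <mlpack/core.hpp>
#include <cmath>

// Shannon entropy of a set of class labels.
double Entropy(const arma::Row<size_t>& labels, const size_t numClasses)
{
  double h = 0.0;
  for (size_t c = 0; c < numClasses; ++c)
  {
    const double p = (double) arma::accu(labels == c) / labels.n_elem;
    if (p > 0.0)
      h -= p * std::log2(p);
  }
  return h;
}

// Information gain of splitting `labels` into the subsets `left` and
// `right`; ID3 picks the attribute/split that maximizes this value.
double InformationGain(const arma::Row<size_t>& labels,
                       const arma::Row<size_t>& left,
                       const arma::Row<size_t>& right,
                       const size_t numClasses)
{
  const double n = (double) labels.n_elem;
  return Entropy(labels, numClasses)
      - (left.n_elem / n) * Entropy(left, numClasses)
      - (right.n_elem / n) * Entropy(right, numClasses);
}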

I will also try to implement NB-Tree (Naive Bayes Tree) and
CI-Tree (Conditional Inference Tree), which are very useful for some tasks.
I have some knowledge of the above-mentioned methods and am currently going
through the literature for more information and implementation details.

All the above points about documentation and tutorials also apply here,
since in this project we are adding new algorithms to the mlpack library.
Testing of the implemented algorithms will be an important phase of this
project.
Also, as everyone knows, mlpack is known for its speed and scalability.
We will benchmark the new implementations against similar methods available in
scikit-learn, Weka, R, and the Shogun machine learning toolkit,
and the results will be provided via interactive charts.
The automatic benchmarking system built by Marcus Edel and Anand Soni during
GSoC will be used for benchmarking: https://github.com/zoq/benchmarks

I know the whole mail is very long, but for the initial draft proposal I have
to improve on this outline.

Please give your suggestions and comments.

Have a nice day.







Regards,
Nirmal Singhania
B.tech III Yr