[mlpack] GSoC 2016 (Decision trees and own ideas)

Natalia Teplyakova natalia.teplyakova95 at gmail.com
Thu Mar 10 06:29:15 EST 2016


Hello,
My name is Natalia Teplyakova. I'm a third-year student at Moscow State
University, Russia, pursuing a degree in Applied Mathematics and
Informatics. In addition to my university program, I'm currently taking
courses from Mail.ru Group (a Russian IT company) covering machine
learning, information retrieval, advanced C++, and design patterns. My
skills include C/C++, Python, and several data analysis libraries (NumPy,
pandas, scikit-learn).

I'm interested in the "Decision trees" project. I have already looked
through the decision stump and density estimation tree code in mlpack. I
am not completely sure, but I think it would be best to implement a new
class for decision trees that supports different fitness functions: Gini
impurity and information gain (both already implemented in mlpack), plus
misclassification impurity for classification and mean squared error for
regression. Would it be a good idea to implement some ensemble methods
(random forests, for example) as part of this project?
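
To make the fitness function idea more concrete, here is a rough sketch of
the policy interface I have in mind. The class name and signature are
placeholders of mine, not existing mlpack code:

    #include <mlpack/core.hpp>

    // Hypothetical fitness function policy: each fitness function would
    // expose a static Evaluate() that scores the label distribution of a
    // node, and the tree would pick the split that improves this score.
    class MisclassificationImpurity
    {
     public:
      // labels: class labels of the points that reached the node.
      // numClasses: total number of classes in the problem.
      static double Evaluate(const arma::Row<size_t>& labels,
                             const size_t numClasses)
      {
        if (labels.n_elem == 0)
          return 0.0;

        // Count how many points belong to each class.
        arma::vec counts(numClasses, arma::fill::zeros);
        for (size_t i = 0; i < labels.n_elem; ++i)
          counts[labels[i]] += 1.0;

        // Misclassification impurity: 1 - (fraction of majority class).
        return 1.0 - counts.max() / (double) labels.n_elem;
      }
    };

The tree class itself could then take the fitness function as a template
parameter (template<typename FitnessFunction = GiniImpurity> class
DecisionTree), in the same policy style as other mlpack methods.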

Besides that, I have an idea of my own for a GSoC project: implementing
different clustering methods. mlpack already has efficient data structures
for neighbourhood queries, so they could be reused in DBSCAN clustering.
DBSCAN has several advantages over k-means: it does not require specifying
the number of clusters in advance, it can find arbitrarily shaped
clusters, and it can detect outliers. There is also an open issue about
hierarchical clustering (https://github.com/mlpack/mlpack/issues/356), so
I could implement agglomerative clustering as well. What do you think
about this idea?
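
To show what I mean about reusing the neighbourhood search, here is a rough
sketch of the core-point step of DBSCAN built on mlpack's RangeSearch, as
far as I understand its API (cluster expansion and noise labelling are
omitted, and FindCorePoints is just a placeholder name):

    #include <mlpack/core.hpp>
    #include <mlpack/methods/range_search/range_search.hpp>

    using namespace mlpack;

    // One batch range query finds the epsilon-neighbourhood of every
    // point; points with at least minPoints neighbours are core points.
    void FindCorePoints(const arma::mat& dataset,
                        const double eps,
                        const size_t minPoints,
                        std::vector<bool>& isCore)
    {
      // Build a kd-tree over the dataset (RangeSearch's default).
      range::RangeSearch<> rs(dataset);

      std::vector<std::vector<size_t>> neighbors;
      std::vector<std::vector<double>> distances;
      rs.Search(math::Range(0.0, eps), neighbors, distances);

      isCore.resize(dataset.n_cols);
      for (size_t i = 0; i < dataset.n_cols; ++i)
      {
        // +1 because the query point itself counts towards minPoints
        // but, if I read the code correctly, is not returned as its
        // own neighbour.
        isCore[i] = (neighbors[i].size() + 1 >= minPoints);
      }
    }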

Regards,
Natalia.