[mlpack] GSoC 2016 (Decision trees and own ideas)

Ryan Curtin ryan at ratml.org
Thu Mar 10 09:55:02 EST 2016


On Thu, Mar 10, 2016 at 11:29:15AM +0000, Наталья Теплякова wrote:
> Hello,
> My name is Natalia Teplyakova. I'm a third-year student at Moscow State
> University, Russia, pursuing a degree in Applied Mathematics and
> Informatics. In addition to my university program, I'm currently taking
> courses from Mail.ru Group (a Russian IT company) covering machine
> learning, information retrieval, advanced C++, and design patterns. My
> skills include C/C++, Python, and several data analysis libraries
> (numpy, pandas, scikit-learn).

Hi Natalia,

Thanks for getting in touch.

> I'm interested in the "Decision trees" project. I have already looked
> through the decision stump and density estimation tree code in mlpack.
> I am not quite sure, but I think it would be better to implement a new
> class for decision trees supporting different fitness functions: Gini
> impurity and information gain (both already implemented in mlpack) and
> misclassification impurity for classification, and mean squared error
> for regression. Would it be a good idea to implement some ensemble
> methods (random forests, for example) as part of this project?

Ensemble methods would definitely be useful.  It's worth mentioning that
I'm working on an implementation of random forests based on Hoeffding
trees in my own fork of mlpack:

https://github.com/rcurtin/mlpack/blob/vfdt/src/mlpack/methods/hoeffding_trees/hoeffding_forest.hpp

Maybe that would be interesting to look at.
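
If it helps while drafting a proposal, the overall shape of such a
wrapper is roughly bagging plus a majority vote.  Here is a quick sketch
(all names are hypothetical placeholders, not existing mlpack classes;
'TreeType' stands in for whatever decision tree class the project
produces, assumed to have a (data, labels, numClasses) constructor and a
Classify() method):

    #include <mlpack/core.hpp>

    #include <vector>

    // Hypothetical sketch of a bagged ensemble of decision trees.
    template<typename TreeType>
    class RandomForest
    {
     public:
      RandomForest(const arma::mat& data,
                   const arma::Row<size_t>& labels,
                   const size_t numClasses,
                   const size_t numTrees) : numClasses(numClasses)
      {
        for (size_t i = 0; i < numTrees; ++i)
        {
          // Bootstrap sample: draw n points with replacement.
          const arma::uvec idx = arma::randi<arma::uvec>(data.n_cols,
              arma::distr_param(0, (int) data.n_cols - 1));
          const arma::mat sample = data.cols(idx);
          const arma::Row<size_t> sampleLabels = labels.cols(idx);
          trees.emplace_back(sample, sampleLabels, numClasses);
        }
      }

      // Predict by majority vote over all trees.
      size_t Classify(const arma::vec& point) const
      {
        arma::Col<size_t> votes(numClasses, arma::fill::zeros);
        for (const TreeType& tree : trees)
          ++votes[tree.Classify(point)];

        arma::uword best;
        votes.max(best);
        return (size_t) best;
      }

     private:
      std::vector<TreeType> trees;
      size_t numClasses;
    };

(A real random forest would also subsample the features considered at
each split, but that detail lives inside the tree class itself.)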

I think that the existing DecisionStump class could be refactored and
extended, but you are right that it might be easier to throw it away and
start over.  In either case, if you submit a proposal for this, be sure
to spend some time thinking about the API of the proposed code and make
sure that it fits in with the rest of the mlpack codebase.  (I'm happy
to look at a proposed API and give some comments.)
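
To make that concrete, an API along these lines would fit the
templated, policy-based style of the rest of the codebase.  Everything
below is a hypothetical sketch, not existing mlpack code; GiniImpurity
and InformationGain are placeholders for whatever the fitness function
policies end up being called:

    #include <mlpack/core.hpp>

    // Placeholder fitness function policies; in a proposal these would
    // be real classes exposing something like a static Evaluate().
    class GiniImpurity;
    class InformationGain;

    // Hypothetical API sketch; nothing here exists in mlpack yet.
    template<typename FitnessFunction = GiniImpurity,
             typename MatType = arma::mat>
    class DecisionTree
    {
     public:
      // Grow a tree on the given data; labels must be in
      // [0, numClasses).
      DecisionTree(const MatType& data,
                   const arma::Row<size_t>& labels,
                   const size_t numClasses,
                   const size_t minimumLeafSize = 10);

      // Return the predicted label for a single point.
      template<typename VecType>
      size_t Classify(const VecType& point) const;

      // Fill 'predictions' with one label per column of 'data'.
      void Classify(const MatType& data,
                    arma::Row<size_t>& predictions) const;

      // mlpack uses boost::serialization via a Serialize() method.
      template<typename Archive>
      void Serialize(Archive& ar, const unsigned int /* version */);
    };

Regression could then reuse the same structure with a different
response type and a mean-squared-error policy, which is one reason the
choice of template parameters is worth thinking through carefully in
the proposal.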

> Besides that, I have my own idea for a GSoC project: implement several
> clustering methods. mlpack has efficient data structures for
> neighborhood queries, so they could be used in DBSCAN clustering.
> DBSCAN has several advantages over KMeans: it does not require
> specifying the number of clusters in advance, it can find arbitrarily
> shaped clusters, and it can detect outliers. There is an open issue
> about hierarchical clustering
> (https://github.com/mlpack/mlpack/issues/356), so I could implement
> agglomerative clustering too. What do you think about this idea?

DBSCAN would be interesting, definitely.  As for ticket #356, extending
the EMST code to compute single-linkage clustering should not be
particularly hard, and in low dimensions the clustering should be fast
because it is a dual-tree algorithm.  If you did implement
single-linkage clustering (or other clustering algorithms) with good
tests, I'd be happy to merge them in.
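
To give a sense of how little is missing, here is a rough sketch of
single-linkage clustering on top of the existing EMST code.
DualTreeBoruvka and UnionFind are real classes in methods/emst/, but
treat the function itself as illustrative rather than tested:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/emst/dtb.hpp>
    #include <mlpack/methods/emst/union_find.hpp>

    #include <map>

    // Cluster 'data' into 'k' single-linkage clusters by computing the
    // MST and then dropping the (k - 1) longest edges; the connected
    // components that remain are the clusters.
    arma::Row<size_t> SingleLinkage(const arma::mat& data,
                                    const size_t k)
    {
      // Each column of 'mst' holds (point A, point B, edge weight),
      // and the columns are sorted by increasing edge weight.
      mlpack::emst::DualTreeBoruvka<> dtb(data);
      arma::mat mst;
      dtb.ComputeMST(mst);

      // Merge along every edge except the (k - 1) longest ones.
      mlpack::emst::UnionFind uf(data.n_cols);
      for (size_t i = 0; i + (k - 1) < mst.n_cols; ++i)
        uf.Union((size_t) mst(0, i), (size_t) mst(1, i));

      // Relabel union-find roots as cluster indices 0 .. (k - 1).
      std::map<size_t, size_t> rootToCluster;
      arma::Row<size_t> assignments(data.n_cols);
      for (size_t i = 0; i < data.n_cols; ++i)
      {
        const size_t root = uf.Find(i);
        if (rootToCluster.count(root) == 0)
        {
          const size_t next = rootToCluster.size();
          rootToCluster[root] = next;
        }
        assignments[i] = rootToCluster[root];
      }
      return assignments;
    }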

Thanks,

Ryan

-- 
Ryan Curtin    | "And they say there is no fate, but there is: it's
ryan at ratml.org | what you create." - Minister


