[mlpack] GSoC 2016 (Decision trees and own ideas)

Natalia Teplyakova natalia.teplyakova95 at gmail.com
Mon Mar 14 16:19:10 EDT 2016


Hello!
I'm sorry for not answering for so long; I had a very busy end of the week.

I'd like to discuss the API of decision trees. The decision tree class should
be parametrized by a FitnessFunction template parameter (like the Hoeffding
trees). The class should have several constructors (with and without
training, as mentioned in the design guidelines), Train() methods (with
weight support, so that decision trees can be used as weak learners for
AdaBoost), and Classify() methods (to predict class labels or
probabilities). The constructor should take the following parameters:
maxDepth (a decision stump would have maxDepth = 1, and besides that, this
parameter can be used to prevent overfitting), minSamplesLeaf (the minimum
number of samples required to form a leaf node), and maxFeatures (the number
of features to consider when looking for the best split). It might also be a
good idea to implement pruning.
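
To make this concrete, here is a rough sketch of the interface I have in
mind (the names maxDepth, minSamplesLeaf and maxFeatures are just my working
names, not existing mlpack identifiers, and the defaults are arbitrary):

#include <mlpack/core.hpp>

template<typename FitnessFunction>
class DecisionTree
{
 public:
  // Construct an empty tree; training happens later via Train().
  DecisionTree(const size_t maxDepth = 10,
               const size_t minSamplesLeaf = 1,
               const size_t maxFeatures = 0 /* 0 = use all features */);

  // Construct and train in one step.
  DecisionTree(const arma::mat& data,
               const arma::Row<size_t>& labels,
               const size_t numClasses,
               const size_t maxDepth = 10,
               const size_t minSamplesLeaf = 1,
               const size_t maxFeatures = 0);

  // Train on the given data; the weighted overload is what AdaBoost
  // would use when treating the tree as a weak learner.
  void Train(const arma::mat& data,
             const arma::Row<size_t>& labels,
             const size_t numClasses);
  void Train(const arma::mat& data,
             const arma::Row<size_t>& labels,
             const size_t numClasses,
             const arma::rowvec& weights);

  // Predict class labels, or labels plus class probabilities.
  void Classify(const arma::mat& data,
                arma::Row<size_t>& predictions) const;
  void Classify(const arma::mat& data,
                arma::Row<size_t>& predictions,
                arma::mat& probabilities) const;
};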

I have several questions to discuss:
1) Should I treat ordinal and nominal variables differently when splitting?
(I suggest not doing so: an ordinal variable with n distinct values has only
(n - 1) possible binary splits, whereas a nominal variable has 2^(n-1) - 1,
so splitting a dimension is faster if both types of variables are treated as
ordinal.)
2) I've noticed that the decision stump code in mlpack does non-binary
splits: it constructs several bins and then tries to merge bins that share
the same most frequent class. I'd like to implement a binary decision tree,
so I'm not sure the decision stump code can be refactored for that case.
3) I'd like to use decision trees for regression tasks too. The only
difference between classification and regression trees is the information
stored in the leaf nodes (in classification a leaf stores class
probabilities; in regression, the mean of the samples that reach the node).
I'm not sure how best to handle these two cases. Maybe I can implement a
class parametrized by FitnessFunction that performs the splitting (a
Splitter class) and then reuse it in two separate classes for classification
and regression trees; there is a sketch of this after the list. Is this a
good way to avoid code duplication?
4) Recently I've looked through the scikit-learn code for decision trees.
There, splitting and impurity evaluation are separated: the Splitter uses a
Criterion class to compute the impurity reduction of each candidate split
point and find the best one. I like this idea a lot.
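
Here is the sketch for questions 3 and 4 of how the Splitter could be shared
between the two tree types (all of these names are hypothetical, and I
assume a FitnessFunction interface similar to what the Hoeffding tree code
uses; for regression, the labels argument would instead be a real-valued
response vector and the FitnessFunction would be something like MSE):

#include <mlpack/core.hpp>

// Finds the best binary split of a single dimension by evaluating the
// FitnessFunction gain at each candidate split point (this plays the
// same role as scikit-learn's Splitter/Criterion pair).
template<typename FitnessFunction>
class Splitter
{
 public:
  // Returns the best split value for the dimension and writes the
  // achieved impurity reduction into bestGain.
  static double SplitDimension(const arma::rowvec& dimension,
                               const arma::Row<size_t>& labels,
                               const size_t numClasses,
                               double& bestGain);
};

// Both trees reuse Splitter<FitnessFunction>; only the leaf payload
// differs.
template<typename FitnessFunction>
class DecisionTreeClassifier
{
  // ... leaves store class probabilities.
};

template<typename FitnessFunction>
class DecisionTreeRegressor
{
  // ... leaves store the mean response of the samples in the node.
};
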
That's all for decision trees.

> As for ticket #356, extending
> the EMST code to compute single-linkage clustering should not be
> particularly hard, and in low dimensions the clustering should be fast
> because it is a dual-tree algorithm.  If you did implement
> single-linkage clustering (or other clustering algorithms) with good
> tests, I'd be happy to merge them in.
I'd like to try to implement single-linkage clustering soon.
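
In case it is useful, here is a rough sketch of how I would build
single-linkage clustering on top of the EMST code. I am assuming (as I
believe the DualTreeBoruvka output provides) a 3 x (n - 1) edge matrix of
[point1; point2; distance] columns sorted by increasing distance; cutting
the (k - 1) heaviest MST edges then leaves exactly the k single-linkage
clusters:

#include <mlpack/core.hpp>
#include <mlpack/methods/emst/dtb.hpp>
#include <map>
#include <numeric>
#include <vector>

void SingleLinkage(const arma::mat& data,
                   const size_t k,
                   arma::Row<size_t>& assignments)
{
  // Compute the Euclidean MST with the dual-tree Boruvka algorithm.
  arma::mat edges;
  mlpack::emst::DualTreeBoruvka<> dtb(data);
  dtb.ComputeMST(edges);

  // Union-find with path halving.
  const size_t n = data.n_cols;
  std::vector<size_t> parent(n);
  std::iota(parent.begin(), parent.end(), 0);
  auto find = [&parent](size_t i)
  {
    while (parent[i] != i)
      i = parent[i] = parent[parent[i]];
    return i;
  };

  // Merge along the (n - k) lightest edges; edges are already sorted.
  for (size_t e = 0; e + k < n; ++e)
    parent[find((size_t) edges(0, e))] = find((size_t) edges(1, e));

  // Relabel the remaining roots as contiguous cluster ids.
  assignments.set_size(n);
  std::map<size_t, size_t> ids;
  for (size_t i = 0; i < n; ++i)
  {
    const size_t root = find(i);
    if (ids.count(root) == 0)
    {
      const size_t newId = ids.size();
      ids[root] = newId;
    }
    assignments[i] = ids[root];
  }
}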

Regards, Natalia.