[mlpack] GSoC 2016 (Decision trees and own ideas)

Ryan Curtin ryan at ratml.org
Tue Mar 15 09:09:06 EDT 2016


On Mon, Mar 14, 2016 at 08:19:10PM +0000, Natalia Teplyakova wrote:
> Hello!
> I'm sorry for not answering for so long; I had a very busy end of the week.
> 
> I'd like to discuss the API of decision trees. The decision tree class
> should be parametrized by a FitnessFunction template parameter (like the
> Hoeffding trees). The class should have several constructors (with and
> without training, as mentioned in the design guidelines), Train() methods
> (with weight support, so decision trees can be used as weak learners for
> AdaBoost), and Classify() methods (to predict class labels or
> probabilities). The constructor should take the parameters listed below:
> maxDepth (a decision stump would have maxDepth = 1; this parameter can
> also be used to prevent overfitting), minSamplesLeaf (the minimum number
> of samples required to form a leaf node), and maxFeatures (the number of
> features to consider when looking for the best split). Maybe it would be
> a good idea to implement pruning.

Hi Natalia,

I agree with your proposed API here.  I also think it would be a good
idea to implement pruning.  Maybe a template parameter could be used to
control the type of pruning performed.
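
To make that concrete, here is a rough sketch of the interface I would
imagine.  None of these names exist in mlpack yet (GiniImpurity does, in
the Hoeffding tree code, but here it is only an example default), so take
this purely as an illustration of the design:

template<typename FitnessFunction = GiniImpurity,
         typename PruneType = NoPruning> // pruning as a policy class
class DecisionTree
{
 public:
  // Construct without training; Train() can be called later.
  DecisionTree(const size_t maxDepth = 0,
               const size_t minSamplesLeaf = 1,
               const size_t maxFeatures = 0);

  // Construct and train immediately.
  DecisionTree(const arma::mat& data,
               const arma::Row<size_t>& labels,
               const size_t numClasses,
               const size_t maxDepth = 0,
               const size_t minSamplesLeaf = 1,
               const size_t maxFeatures = 0);

  // Weighted training, so the tree can act as an AdaBoost weak learner.
  void Train(const arma::mat& data,
             const arma::Row<size_t>& labels,
             const size_t numClasses,
             const arma::rowvec& weights);

  // Predict class labels, and optionally class probabilities.
  void Classify(const arma::mat& data, arma::Row<size_t>& labels) const;
  void Classify(const arma::mat& data,
                arma::Row<size_t>& labels,
                arma::mat& probabilities) const;
};

Here I am using maxDepth = 0 and maxFeatures = 0 to mean "no limit"; that
is just one possible convention.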

> I have several questions to discuss:
> 1) Should I treat ordinal and nominal variables differently when
> splitting? (I suggest not doing so, because an ordinal variable with n
> distinct values has (n-1) possible splits, whereas a nominal variable
> with n distinct values has about 2^(n-1) possible splits, so it would be
> faster to split the dimension if both types of variables are treated as
> ordinal.)

Just to make sure we are using the same terminology, "ordinal" and
"numeric" mean the same thing to me, and "nominal" and "categorical"
also mean the same thing to me.  I do think that you should treat these
features differently; treating categorical variables as numeric can
cause drastically lower performance.  For instance, if you train on the
covertype dataset with Hoeffding trees and treat all features as
numeric, you might get 55% accuracy (with certain parameters set that I
don't remember; I did this experiment a while ago); if you treat the
categorical features as categorical instead of numeric, you get more
like 70% accuracy.  So it can make a big difference.
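
Just to illustrate why the split search itself ends up different for the
two feature types, here is a hypothetical helper.  EnumerateSplits() and
EvaluateSplit() are made-up names, not mlpack code; data::DatasetInfo is
what the Hoeffding tree code uses to mark which dimensions are
categorical:

// Hypothetical; defined elsewhere.  Scores one candidate split.
void EvaluateSplit(const double candidateSplit);

void EnumerateSplits(const arma::mat& data,
                     const size_t dim,
                     const mlpack::data::DatasetInfo& info)
{
  if (info.Type(dim) == mlpack::data::Datatype::numeric)
  {
    // Ordinal/numeric: sort the values; only the (n - 1) thresholds
    // between adjacent values need to be considered.
    arma::rowvec sorted = arma::sort(data.row(dim));
    for (size_t i = 0; i + 1 < sorted.n_elem; ++i)
      EvaluateSplit((sorted[i] + sorted[i + 1]) / 2.0);
  }
  else
  {
    // Nominal/categorical: candidate splits are subsets of categories.
    // Trying all 2^(k - 1) subsets is usually too expensive, so one
    // one-vs-rest split per category is a common compromise.
    for (size_t c = 0; c < info.NumMappings(dim); ++c)
      EvaluateSplit((double) c); // category c vs. the rest
  }
}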

> 2) I've noticed that the decision stump code in mlpack does non-binary
> splits. It constructs several bins and then tries to merge them if the
> bins share the same most frequent class. I'd like to implement a binary
> decision tree, so I'm not sure the decision stump code can be refactored
> for this case.

Sure, that's reasonable.  I think that if the splitting type were factored
out into a template parameter, it might be easier to have a basic tree
structure that could handle both binary and non-binary splits.  But either
way is fine.
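
Very roughly, what I mean by factoring the split type out is something
like this (again hypothetical; BinarySplit and BinSplit are made-up
names):

#include <vector>

// The tree only knows that SplitType produces some number of children:
// a BinarySplit policy always produces two, while a BinSplit policy
// (like the current decision stump) may produce more.
template<typename FitnessFunction, typename SplitType>
class DecisionTree
{
 private:
  // The split policy owns whatever it needs to route a point to a child
  // (dimension, threshold, bin boundaries, and so on).
  typename SplitType::SplitInfo splitInfo;

  // The number of children is up to the split policy.
  std::vector<DecisionTree*> children;
};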

> 3) I'd like to use decision trees for regression tasks too. The only
> difference between classification and regression trees is the information
> stored in the leaf nodes (in classification a leaf node contains class
> probabilities; in regression, the mean of the samples in that node). I
> don't know the best way to deal with these two different cases. Maybe I
> can implement a class parametrized by FitnessFunction that performs
> splitting (a Splitter class). Then I can use this class in the
> implementation of two different classes for classification and regression
> trees. Is this a good way to prevent code duplication?

Yes, preventing code duplication is definitely a good thing.  If nodes
in the decision tree hold different information for classification and
regression tasks, why not add a template parameter that holds the
relevant information?  For instance, you could have a
ClassificationInformation and RegressionInformation class, and if you
are building a tree for classification then it is built with
ClassificationInformation (which I guess just holds information on the
class distribution) and if you're building a tree for regression then it
is built with RegressionInformation.
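
In code the idea would be something along these lines.  This is only a
sketch; none of these classes exist, and MSEGain is just a made-up name
for a regression fitness function:

// What a leaf stores is a policy; the tree code itself never needs to
// know whether it is doing classification or regression.
class ClassificationInformation
{
 public:
  // Estimated class probabilities for points that fall into this leaf.
  arma::vec probabilities;
};

class RegressionInformation
{
 public:
  // Mean of the responses of the training points in this leaf.
  double prediction;
};

template<typename FitnessFunction, typename LeafInformation>
class DecisionTree { /* ... */ };

typedef DecisionTree<GiniImpurity, ClassificationInformation>
    ClassificationTree;
typedef DecisionTree<MSEGain, RegressionInformation> RegressionTree;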

> 4) Recently I've looked through the scikit-learn code for decision trees.
> Splitting and impurity evaluation are separated in that code: the
> Splitter uses a Criterion class to compute the impurity reduction of the
> possible splits and find the best one. I like this idea a lot.

Yes, this is how the Hoeffding tree code is written.  It is definitely a
good idea to separate concerns into different parts of the code.
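
That separation is basically what the FitnessFunction template parameter
buys you: the splitter only ever asks the fitness function for an
impurity value.  As a hypothetical sketch (the Evaluate() signature here
is an assumption, not the existing GiniImpurity interface):

// Gain of a candidate binary split: parent impurity minus the weighted
// impurity of the two children.  The splitter never computes impurity
// itself, so Gini, information gain, or an MSE-style criterion for
// regression can be swapped in without touching the splitting code.
template<typename FitnessFunction>
double SplitGain(const arma::vec& parentCounts,
                 const arma::vec& leftCounts,
                 const arma::vec& rightCounts)
{
  const double numLeft = arma::accu(leftCounts);
  const double numRight = arma::accu(rightCounts);
  const double total = numLeft + numRight;

  return FitnessFunction::Evaluate(parentCounts) -
      (numLeft / total) * FitnessFunction::Evaluate(leftCounts) -
      (numRight / total) * FitnessFunction::Evaluate(rightCounts);
}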

> > As for ticket #356, extending
> > the EMST code to compute single-linkage clustering should not be
> > particularly hard, and in low dimensions the clustering should be fast
> > because it is a dual-tree algorithm.  If you did implement
> > single-linkage clustering (or other clustering algorithms) with good
> > tests, I'd be happy to merge them in.
>
> I'd like to try to implement single-linkage clustering soon.

That contribution would be appreciated.  I don't think you will need to
dive too deep into the guts of dual-tree algorithms to do this; I think
that you can look at the DualTreeBoruvka class and figure out how to use
the results from each iteration of the dual-tree algorithm to construct
a single-linkage clustering solution.  I'm happy to help out if you
like.
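
Roughly, once you have the minimum spanning tree, single-linkage
clustering amounts to merging components along the MST edges in order of
increasing weight (equivalently, cutting the k - 1 longest edges to get k
clusters).  Here is a sketch of what I mean; I am assuming the
DualTreeBoruvka output is a 3 x (n - 1) matrix with rows (from, to,
distance) sorted by increasing distance (worth double-checking), and I am
reusing the UnionFind class from the EMST code:

#include <mlpack/methods/emst/dtb.hpp>
#include <mlpack/methods/emst/union_find.hpp>

using namespace mlpack;

// Sketch: assign each point to one of k single-linkage clusters.  The
// returned labels are union-find component representatives, not
// necessarily 0..(k - 1).
arma::Row<size_t> SingleLinkage(const arma::mat& data, const size_t k)
{
  emst::DualTreeBoruvka<> dtb(data);
  arma::mat mst;
  dtb.ComputeMST(mst);

  // Merging along the (n - k) cheapest MST edges leaves exactly k
  // connected components; those components are the clusters.
  emst::UnionFind uf(data.n_cols);
  for (size_t i = 0; i + k < data.n_cols; ++i)
    uf.Union((size_t) mst(0, i), (size_t) mst(1, i));

  arma::Row<size_t> assignments(data.n_cols);
  for (size_t i = 0; i < data.n_cols; ++i)
    assignments[i] = uf.Find(i);

  return assignments;
}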

Thanks,

Ryan

-- 
Ryan Curtin    | "I'm going to be on television!"
ryan at ratml.org |   - Sara Goldfarb


