[mlpack] [GSoC 2016] Some attempts on refactoring decision_stump

Ryan Curtin ryan at ratml.org
Mon Apr 18 14:55:47 EDT 2016


On Sun, Apr 17, 2016 at 11:22:25AM +0800, Cloud Han wrote:
> Hello, everyone,
> 
> My name is Han Guangyun (Cloud Han), a master's student in China.  I
> haven't introduced myself yet, and I'm not going to do it right now ;)
> 
> I have only exchanged a few private emails with Ryan about GSoC, and I
> finally decided to do some experimenting on the original decision stump
> implementation, which is a building block for the decision tree (my
> proposal).
> 
> So what is wrong with the original implementation?  First, if you have
> read the code, you'll find the splitting criterion is hard-coded as
> entropy gain (information gain) and there is a bunch of redundant
> code.  Second, it only supports numeric features.  Some other
> functionality also needs to be added if you want to develop a
> full-blown decision tree.
> 
> Also, I inspected Ryan's implementation of the Hoeffding tree, which is
> basically a decision tree that supports streaming data.  So it has to
> remember statistics (some state) about the data, which is not needed in
> the vanilla one.  It also brings the ability to handle categorical
> features.

Hi Cloud,

Thanks for getting in touch and for taking the time to do this
experiment.  You are right that the existing decision stump
implementation is somewhat limited.

> So in this experiment, I just removed the unneeded features and
> combined it with the stump.  After inspecting the code, here are some
> conclusions:
> 
> 1. Fitness Function (Impurity Gain Function)
> 
>    In Ryan's implementation, these calculations are represented by a
>    FitnessFunction policy: you just prepare `counts`, a numChildren x
>    numClasses matrix that holds the number of times each class occurs
>    in each child, and use it for the gain calculation.  But we still
>    need to add weighting support; sample reweighting is required by
>    AdaBoost, and class weighting is commonly needed for skewed data.
>    So in this case, instead of letting `counts` hold the raw
>    occurrence counts, we let it hold weighted occurrences.  This
>    change should work fine for classification tasks: information gain,
>    information gain ratio, and Gini impurity all only need the
>    probability of each label occurring in each child.
> 
>    However, for a regression tree, I haven't figured out how to use
>    these statistics to calculate MSE, which needs a mean value for
>    each child.  For now, I think a better approach is that, instead of
>    letting the FitnessFunction class compute the gain, we just compute
>    the impurity of a (proposed) node, and combining those impurities
>    into the impurity gain is done in the NumericSplit or
>    CategoricalSplit class.

Adding weighting for the GiniImpurity and InformationGain should be
pretty easy and could be done by adding an overload of
EvaluateFitnessFunction() that takes weights---or by modifying the
statistics matrix to hold the total weight of points for each class
instead of counts.
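
For example, here is a rough sketch of that second option (not existing
code; the function name and matrix layout are made up for illustration):

#include <mlpack/core.hpp>

// `counts` is a numChildren x numClasses matrix; each entry holds the total
// weight of the points of that class that fall into that child, instead of
// a raw count.
double WeightedChildrenGini(const arma::mat& counts)
{
  const double totalWeight = arma::accu(counts);

  double weightedImpurity = 0.0;
  for (size_t i = 0; i < counts.n_rows; ++i)
  {
    const double childWeight = arma::accu(counts.row(i));
    if (childWeight == 0.0)
      continue;

    // Gini impurity of this child: 1 - sum_c p_c^2, where p_c is the
    // weighted probability of class c inside the child.
    double impurity = 1.0;
    for (size_t c = 0; c < counts.n_cols; ++c)
    {
      const double p = counts(i, c) / childWeight;
      impurity -= p * p;
    }

    // Each child's impurity counts proportionally to its share of the
    // total weight.
    weightedImpurity += (childWeight / totalWeight) * impurity;
  }

  // Lower is better; the gain would be (parent impurity - this value).
  return weightedImpurity;
}

If all of the weights are 1, this reduces to the unweighted calculation,
so the same code path could serve both cases.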

Can you explain more of your proposed solution for MSE?  I'm not sure if
I understand what you mean by combining the impurity calculations.
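
Do you mean something roughly like the following?  (A completely
hypothetical sketch just to check my understanding; ImpurityType and its
Impurity() function are made up.)

#include <vector>
#include <mlpack/core.hpp>

// NumericSplit/CategoricalSplit would collect the per-child class weights,
// ask the impurity policy to score each proposed child, and then combine
// the scores into a gain, roughly like this.
template<typename ImpurityType>
double SplitGain(const double parentImpurity,
                 const std::vector<arma::rowvec>& childClassWeights)
{
  double totalWeight = 0.0;
  for (const arma::rowvec& child : childClassWeights)
    totalWeight += arma::accu(child);

  // Weighted average of the children's impurities (assuming totalWeight > 0).
  double childImpurity = 0.0;
  for (const arma::rowvec& child : childClassWeights)
    childImpurity += (arma::accu(child) / totalWeight) *
        ImpurityType::Impurity(child);

  // The gain is the reduction in impurity from the parent to the children.
  return parentImpurity - childImpurity;
}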

> 2. Numeric vs Categorical Feature
> 
>    It is common to see categorical features, e.g. `male` vs. `female`,
>    in data, and decision trees have some advantages over other
>    algorithms in handling them.
> 
>    Ryan uses two classes, HoeffdingNumericSplit and
>    HoeffdingCategoricalSplit, to handle heterogeneous data; combined
>    with a proper FitnessFunction, it is really flexible.  I think this
>    design should be fine, and I just mimicked a NumericSplit in my
>    experiment.

Yeah---you can use mlpack::data::DatasetInfo to get information about
each dimension.  It will probably be necessary to refactor the
DecisionStump class to accept and handle DatasetInfo objects.
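
Something like the sketch below is what I have in mind (the
ChooseSplitTypes() function is made up, but data::DatasetInfo, Type(),
and NumMappings() already exist):

#include <mlpack/core.hpp>

using namespace mlpack;

// Sketch: choose a split type for each dimension based on the DatasetInfo.
void ChooseSplitTypes(const arma::mat& data, const data::DatasetInfo& info)
{
  for (size_t d = 0; d < data.n_rows; ++d)
  {
    if (info.Type(d) == data::Datatype::categorical)
    {
      // Use a CategoricalSplit here; info.NumMappings(d) gives the number
      // of distinct categories seen in this dimension.
    }
    else
    {
      // Use a NumericSplit here.
    }
  }
}

The DatasetInfo object can be filled automatically by the data::Load()
overload that takes one, so the refactored stump would not need to do
any type detection itself.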

> 3. Missing Values
> 
>    Neither the original DecisionStump nor (I'm not quite sure) the
>    Hoeffding tree implementation in mlpack supports missing values,
>    which are quite common in the wild.  For training, it is easy to
>    just use the non-missing data in a particular dimension to train
>    the stump.
> 
>    For prediction, there are two ways to solve it: one is to use
>    another, non-missing value to perform the split, i.e. we add some
>    fallback mechanism.  The other is to handle missing values
>    probabilistically; I think C4.5 does this.  I prefer the latter due
>    to its simplicity.

Missing value support would definitely be appreciated, but the hard part
there is figuring out how we should represent missing data in an
arma::mat object.
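
One option (just an idea, not something mlpack does now) would be to
encode missing entries as NaN and filter them out per dimension during
training, roughly like this:

#include <mlpack/core.hpp>

// A missing entry would be stored as arma::datum::nan in the matrix.
// During training, drop the missing values for the dimension at hand:
arma::rowvec ObservedValues(const arma::mat& data, const size_t dimension)
{
  const arma::rowvec dim = data.row(dimension);

  // find_finite() returns the indices of the non-NaN (and non-Inf) entries.
  return dim.cols(arma::find_finite(dim));
}

The obvious drawback is that NaN could then no longer be used as a
legitimate value, and every algorithm touching the data would need to be
aware of that convention.
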
> 
> Combining the above 3 features and adding weighting support should be
> sufficient to build a generic decision stump or decision tree.
> 
> My attempt is stored here; it just adds weighting support and only
> handles numeric features via a NumericSplit class.  Just a little
> progress; further testing and consideration are needed.  No serious
> tests have been performed, just a proof of concept :)
> https://github.com/cloudhan/mlpack/tree/weak_learner/src/mlpack/methods/weak_learner

I took a look; this seems like a good refactoring.  If you continue the
refactoring and end up testing it, we can definitely integrate it into
mlpack. :)

Thanks,

Ryan

-- 
Ryan Curtin    | "So, it's just you 57 punks against KUNG FU JOE?"
ryan at ratml.org |   - Kung Fu Joe


