[mlpack] Improve mlpack's tree ensemble support - GSoC 2021

RISHABH GARG rishabhgarg108 at gmail.com
Tue Mar 16 06:40:52 EDT 2021


Greetings mlpack family,
I am Rishabh Garg, a second-year Computer Science student at IIT Mandi, India.
I am very interested in pursuing the GSoC idea of “Improving mlpack’s tree
ensemble support” posted on the GSoC Idea List for 2021. A few days ago, I
shared another idea related to time series forecasting. I like both ideas and
it is really difficult for me to choose one, so maybe the mlpack family could
help me figure out which one is better :-)
I apologise in advance if this email gets too long.

I would like to implement Gradient Boosting Classifier and Regressor as a
part of the project. The following is my plan of action.

After digging into the `trees` codebase in mlpack, I realised that we don’t
have a regression tree. A regression tree is at the core of gradient boosted
trees, so the first priority would be to implement a `RegressionTree` class.
I am thinking of making a base `DecisionTree` class from which
`DecisionTreeClassifier` and `RandomForestClassifier` can inherit. This would
require refactoring the existing code a little.

Once the regression tree is ready, the gradient boosting algorithms can be
implemented. For them, I am thinking of a similar approach: a base
`GradientBoosting` class from which `GradientBoostingClassifier` and
`GradientBoostingRegressor` can inherit.

One really nice feature I found in sklearn’s gradient boosting is warm
starting: we can train additional estimator trees on top of an already
trained model. This really helps during development when we are trying
different hyperparameters, so I would love to integrate that feature into
mlpack’s implementation too.

So, coding the algorithms, refactoring existing code, writing unit tests,
adding documentation, making bindings, searching for good default
hyperparameters and adding tutorials/examples for the above three algorithms
would be enough to keep me occupied for the whole summer. I don’t want to be
too ambitious, but if time still permits then I might look into implementing
XGBoost. Once the gradient boosted trees are implemented, it would be
slightly easier to implement XGBoost. But given that XGBoost is really
Xtreme, with its weighted quantile sketch, parallel learning, cache-aware
and out-of-core optimisations, etc., it would be really difficult to finish
it along with the other algorithms within the GSoC time period.

I would love to hear suggestions from the community. Also, if my idea and
goals seem plausible, then I would love to provide a more detailed proposal
of what I would be doing: how the API would look, how the end user would use
these classes, some more implementation details or pseudocode, a timeline
for the project, etc.

The mentor for this project is not yet listed on the GSoC Ideas page
<https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas>. I would love to
know who will be mentoring it.

Also if it feels like there are any flaws in the idea, then please provide
your valuable feedback.

Looking forward to your replies. Thanks for reading till the end.

Best regards,
Rishabh Garg
Github - RishabhGarg108 <https://github.com/RishabhGarg108>