[mlpack] Improve mlpack's tree ensemble support - GSoC 2021

RISHABH GARG rishabhgarg108 at gmail.com
Sun Mar 28 06:32:40 EDT 2021


Hello everyone,
A correction to my previous email: I made a small typo there. It should be
`DecisionTreeRegressor` instead of `RandomForestClassifier`.

I gave this more thought and realised that there is much more I can do
with gradient boosting trees, such as adding feature importance, warm
start, pruning, etc. So, I have decided to drop XGBoost from the project
and will invest the remaining time in implementing these extra features.

I have been digging into the decision tree implementation and found that
it is built very flexibly. A regression tree can be implemented on top of
it just by adding a new template parameter (which will specify whether we
want classification or regression) and adding a few overloads of the
existing helper functions. So, I think there will be no need to make an
abstract class, and regression can be implemented without any drastic
refactoring of the existing `DecisionTree` class, although we will need to
add a few fitness functions. I will share the full technical details in my
proposal.
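
A minimal sketch of the template-parameter idea, under stated assumptions:
this is not mlpack's actual `DecisionTree` code. The names `TreeLeaf`,
`MSEGain` and `UseRegression` are hypothetical and chosen only for
illustration. It shows one class serving both tasks, with a compile-time
flag selecting the response type and the helper overload, plus one possible
regression fitness function.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical regression fitness function: the negative mean squared error
// of the responses around their mean, so larger values mean a purer node.
struct MSEGain
{
  static double Evaluate(const std::vector<double>& responses)
  {
    assert(!responses.empty());
    const double n = static_cast<double>(responses.size());
    const double mean =
        std::accumulate(responses.begin(), responses.end(), 0.0) / n;
    double sse = 0.0;
    for (const double r : responses)
      sse += (r - mean) * (r - mean);
    return -sse / n;
  }
};

// Hypothetical leaf node. The boolean template parameter (an assumption, not
// an existing mlpack parameter) picks classification or regression.
template<bool UseRegression>
class TreeLeaf
{
 public:
  // Real-valued responses for regression, integer labels for classification.
  typedef typename std::conditional<UseRegression, double, std::size_t>::type
      ResponseType;

  void Train(const std::vector<ResponseType>& responses)
  {
    assert(!responses.empty());
    // Tag dispatch: the flag selects one of the two Fit() overloads.
    prediction = Fit(responses, std::integral_constant<bool, UseRegression>());
  }

  ResponseType Predict() const { return prediction; }

 private:
  // Regression overload: predict the mean response.
  static double Fit(const std::vector<double>& responses, std::true_type)
  {
    return std::accumulate(responses.begin(), responses.end(), 0.0) /
        static_cast<double>(responses.size());
  }

  // Classification overload: predict the majority label; ties go to the
  // smallest label because std::map iterates in increasing key order.
  static std::size_t Fit(const std::vector<std::size_t>& responses,
                         std::false_type)
  {
    std::map<std::size_t, std::size_t> counts;
    for (const std::size_t label : responses)
      ++counts[label];

    std::size_t best = responses[0];
    std::size_t bestCount = 0;
    for (const auto& kv : counts)
    {
      if (kv.second > bestCount)
      {
        best = kv.first;
        bestCount = kv.second;
      }
    }
    return best;
  }

  ResponseType prediction = ResponseType();
};
```

With this sketch, `TreeLeaf<true>` trained on the responses 1.0, 2.0, 3.0
predicts their mean, while `TreeLeaf<false>` trained on the labels 0, 1, 1, 2
predicts the majority label. No abstract base class is involved; only the
overload chosen at compile time differs.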

Looking forward to your feedback.

Thanks and regards,
Rishabh Garg

On Tue, Mar 16, 2021 at 4:10 PM RISHABH GARG <rishabhgarg108 at gmail.com>
wrote:

> Greetings mlpack family,
> I am Rishabh Garg, a 2nd year Computer Science student at IIT Mandi, India.
> I am very interested in pursuing the GSoC idea of “Improving mlpack’s tree
> ensemble support” posted on the GSoC Idea List for 2021. A few days ago, I
> shared another idea related to time series forecasting. I like both the
> ideas and it is really difficult for me to choose one. So, maybe the mlpack
> family could help me figure out which one is better :-)
> I apologise in advance if this email gets too big.
>
> I would like to implement Gradient Boosting Classifier and Regressor as a
> part of the project. The following is my plan of action.
>
> After digging into the codebase for `trees` in mlpack, I realised that we
> don’t have a regression tree. A regression tree is at the core of gradient
> boosted trees. Thus, the first priority would be to implement a
> `RegressionTree` class. I am thinking of making a base `DecisionTree` class
> from which `DecisionTreeClassifier` and `RandomForestClassifier` can
> inherit. This would require refactoring the existing code a little
> bit.
>
> Then, once the regression tree is ready, the Gradient Boosting Tree
> algorithms can be implemented. For them also, I am thinking of a similar
> approach of making a base `GradientBoosting` class from which
> `GradientBoostingClassifier` and `GradientBoostingRegressor` can
> inherit.
>
> One really nice feature I found in sklearn’s GradientBoostingTrees is that
> we can train additional estimator trees on top of an already trained
> model. This really helps in the development phase when we are trying
> different hyperparameters. Thus I would love to integrate that feature
> into mlpack’s implementation too.
>
> So, coding the algorithms, refactoring existing code, writing unit tests,
> adding documentation, making bindings, searching for good default
> hyperparameters and adding tutorials/examples for the above three
> algorithms would be enough to keep me occupied for the whole summer. I
> don’t want to be too ambitious, but if time still permits then I might
> look into implementing XGBoost. Once the GradientBoostingTrees are
> implemented, it would be somewhat easier to implement XGBoost. But given
> that XGBoost is really Xtreme due to its weighted quantiles, parallel
> learning, out-of-core computation etc., it would be really difficult to
> finish it along with the other algorithms within the GSoC time period.
>
> I would love to hear suggestions from the community. Also, if my idea and
> goals seem plausible, then I would love to provide a more detailed
> proposal of what I would be doing, such as what the API would look like,
> how the end user will use these classes, some more implementation details
> or pseudocode, the project timeline, etc.
>
> The mentor for this project is not updated on the GSoC Ideas page
> <https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas>. I would love
> to know who will be mentoring it.
>
> Also if it feels like there are any flaws in the idea, then please provide
> your valuable feedback.
>
> Looking forward to the replies. Thanks for reading till the end.
>
> Best regards,
> Rishabh Garg
> Github - RishabhGarg108 <https://github.com/RishabhGarg108>
>