[mlpack] Proposing a possible project for GSOC 2021

Fri Mar 19 09:25:22 EDT 2021

Hi Ryan,

Thanks for your feedback.
Sorry for the late reply; I was thinking about how I would approach the
project and various other details.

I apologize in advance if the mail gets too long.

Brief Overview:

The overall interface will be the same as you mentioned in issue thread
#2421 <https://github.com/mlpack/mlpack/issues/2421>.
The best way this can be explained is with an example.
Let me take the example of the "RandomForest" class. Here we have defined
various methods such as "Train", "Predict", etc. in C++.

For the bindings, we can define multiple functions in separate files
instead of a single function in one "_main.cpp" file.
Each function will perform separate tasks. So, the directory structure
could look something like:

random_forest/
------random_forest.hpp
------random_forest_impl.hpp
------bindings/
------------random_forest_train.cpp    /* has the train function for bindings */
------------random_forest_predict.cpp  /* has the predict function for bindings */

Each file in the "bindings/" directory contains a separate function that
can be wrapped inside a method of a class/struct of the required
programming language.
So, for Python, we can have a "RandomForestPy" class. This class will have
methods like "train" and "predict" that would internally call these functions.
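
A minimal Python sketch of such a wrapper class could look like the following. The functions `random_forest_train` and `random_forest_predict` are hypothetical stand-ins for the functions that would be generated from the files in "bindings/"; here they only implement a majority-class rule so that the sketch runs on its own. None of these names exist in mlpack today.

```python
from collections import Counter


# Hypothetical stand-ins for the functions generated from
# bindings/random_forest_train.cpp and bindings/random_forest_predict.cpp.
# They implement a trivial majority-class rule, not a random forest.
def random_forest_train(X, y):
    """Return an opaque 'model' (here: the most frequent label)."""
    return {"majority": Counter(y).most_common(1)[0][0]}


def random_forest_predict(model, X):
    """Return one prediction per input row."""
    return [model["majority"] for _ in X]


class RandomForestPy:
    """Thin wrapper: each method forwards to exactly one binding function."""

    def __init__(self):
        self._model = None  # opaque handle returned by the train function

    def train(self, X, y):
        self._model = random_forest_train(X, y)
        return self

    def predict(self, X):
        if self._model is None:
            raise RuntimeError("call train() before predict()")
        return random_forest_predict(self._model, X)


rf = RandomForestPy().train([[0.0], [1.0], [2.0]], [1, 1, 0])
predictions = rf.predict([[3.0], [4.0]])
```

The class holds the model handle between calls, so the user never has to pass the model around explicitly.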

After a survey of the different programming languages that mlpack has
bindings in, I think that this kind of interface can be supported either by
using structs (in Go, Julia) or classes (in Python, R).

I have also thought about the questions you mentioned in the issue
description.

For the sake of clarity, I will be referring to the methods implemented in
mlpack (eg- LinearRegression, RandomForest, etc.) as "mlpack_methods" and
methods belonging to a class/struct (eg- Train(), Predict(), etc.) as
"member_methods".

Also the function corresponding to each member_method (functions defined in
random_forest_train.cpp, random_forest_predict.cpp, etc.) is referred to as
a "functionality".
So, the functionality would be wrapped inside a member_method defined in a
class/struct in the required programming language.

Q1) "Does it make sense to revamp the mlpack bindings into separate
bindings for model training and model prediction?"

In answer to this question, I have prepared a list of advantages that this
interface might provide.

1) It will break the rigidity of the current interface while keeping the
interface fundamentally strong.
2) Make the bindings more comfortable to use and give the user more access
to what each mlpack_method can do.
3) Make mlpack compatible with other popular libraries. I am not fully
familiar with the other languages, but for Python we can make the
mlpack_methods compatible with scikit-learn (similar to what the "catboost"
and "xgboost" libraries have done). This may not be possible in the summer
due to limited time but can be a future plan.
4) Make it easier for different contributors who are working on the
bindings of the same mlpack_method to collaborate as each function would be
present in a separate file.

Q2) "If so, what restrictions can we place on the bindings so that they fit
this format? i.e. only one output parameter?"

I could not completely understand what you meant here by "one output
parameter". But I have a plan to formalize this idea.

We can categorize each mlpack_method into various categories. Each category
will have a set of basic functionalities that should be provided to the
bindings through the member_methods.
Following are the categories that we can use:
(These are picked from the mlpack docs page. We can edit this list
accordingly. Maybe you can suggest some changes?)

1) Transformations
2) Regression
3) Classification
4) Clustering
5) Preprocessing
6) Geometry
7) Others

For the "Regression" category we can have some basic member_methods such as
"fit", "predict", "score", "get_params".
For the "Classification" category we can have "fit", "predict",
"predict_proba", "score", "get_params".
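
One possible way to enforce a per-category set of basic member_methods in Python is with abstract base classes. The sketch below is only an illustration: the class names (`SupervisedBase`, `RegressionBase`, `ClassificationBase`) and the toy `MeanRegressor` are hypothetical and are not part of mlpack.

```python
from abc import ABC, abstractmethod
import math


class SupervisedBase(ABC):
    """Member_methods shared by the Regression and Classification categories."""

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...

    @abstractmethod
    def score(self, X, y): ...

    @abstractmethod
    def get_params(self): ...


class RegressionBase(SupervisedBase):
    """Regression needs nothing beyond the shared set."""


class ClassificationBase(SupervisedBase):
    """Classification additionally requires class probabilities."""

    @abstractmethod
    def predict_proba(self, X): ...


class MeanRegressor(RegressionBase):
    """Toy model (not an mlpack method): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

    def score(self, X, y):
        # Root-mean-squared error of the predictions.
        preds = self.predict(X)
        return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y))

    def get_params(self):
        return {}


class IncompleteClassifier(ClassificationBase):
    """Implements only fit, so Python refuses to instantiate it."""

    def fit(self, X, y):
        return self


model = MeanRegressor().fit([[1], [2], [3]], [1.0, 2.0, 3.0])
try:
    IncompleteClassifier()
    rejected = False
except TypeError:
    rejected = True
```

With this layout, a binding that forgets one of the required member_methods fails as soon as the class is instantiated instead of at the first missing call.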

I am still working to find an exhaustive list of the basic member_methods
we should provide to the user that are common to all mlpack_methods present
inside the same category (it would be great to have some suggestions here).

Now, after these basic member_methods have been provided, each
mlpack_method might have some special/unique functionality that we would
like to provide.
For example, in AdaBoost we might want to provide the user with the
weights corresponding to each weak learner. For that, we can add a
member_method called "weights" to the existing basic member_methods and
create a corresponding functionality that will be called through the
member_method.
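
As a rough sketch of this extension mechanism in Python: the base class, the `adaboost_weights` function and the `AdaBoostPy` class below are all hypothetical names, and the weight values are hard-coded placeholders rather than the result of any training.

```python
class ClassifierMethods:
    """Stand-in for the basic member_methods of the Classification category."""

    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


# Hypothetical extra functionality, e.g. exposed from its own file in
# the bindings/ directory next to the train and predict functions.
def adaboost_weights(model):
    """Return a copy of the per-weak-learner weights stored in the model."""
    return list(model["alphas"])


class AdaBoostPy(ClassifierMethods):
    """Basic member_methods plus one method unique to this mlpack_method."""

    def __init__(self):
        self._model = None

    def fit(self, X, y):
        # Placeholder values: a real binding would train weak learners here.
        self._model = {"alphas": [0.7, 0.2, 0.1]}
        return self

    def weights(self):
        if self._model is None:
            raise RuntimeError("call fit() before weights()")
        return adaboost_weights(self._model)


learner_weights = AdaBoostPy().fit([], []).weights()
```

The extra member_method follows the same pattern as the basic ones: one method on the class, one functionality behind it.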

Using this, we can increase the accessibility of mlpack and get the
most out of every mlpack_method while keeping the process automatic.

Q3) "What do we do with bindings that don't fit that format? Do we need a
couple more abstractions? e.g., NMF doesn't fit cleanly into
train/predict ... it's just a transformation."

I think this is answered in the previous part. This issue can be tackled by
categorizing each mlpack_method.

Q4) "Is there a way that we could manage to avoid the multiple loading cost
issue for the command-line bindings by somehow "combining" bindings that
are marked as "grouped" in the CMake configuration or something?"

Classes/structs cannot be accessed from the command line.
So here the best option may be to combine the functionalities into
a single function that can be used from the command line (like the current
implementation).
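
The combined entry point could be pictured roughly as follows. This is a Python illustration of the control flow only; the flag names (`--labels`, `--test_points`) and the `train`/`predict` stand-ins are made up for the sketch and do not correspond to existing mlpack options.

```python
import argparse


# Hypothetical stand-ins for the separate functionalities.
def train(labels):
    """'Train' by remembering the most frequent label."""
    return {"majority": max(set(labels), key=labels.count)}


def predict(model, n_points):
    """Predict the remembered label for every test point."""
    return [model["majority"]] * n_points


def main(argv):
    """Single command-line entry point that chains the functionalities."""
    parser = argparse.ArgumentParser(prog="random_forest")
    parser.add_argument("--labels", type=int, nargs="+",
                        help="training labels; triggers the train step")
    parser.add_argument("--test_points", type=int, default=0,
                        help="number of points to predict; triggers predict")
    args = parser.parse_args(argv)

    outputs = {}
    model = None
    if args.labels:
        model = train(args.labels)
        outputs["model"] = model
    if args.test_points > 0:
        if model is None:
            parser.error("no model available: pass --labels")
        # The model is reused in memory, so it is never reloaded from disk.
        outputs["predictions"] = predict(model, args.test_points)
    return outputs


result = main(["--labels", "1", "1", "0", "--test_points", "2"])
```

Because both steps run in one process, the model built by the train step is handed directly to the predict step.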

Q5) "If we could "group" bindings together in CMake, could this then be
used to generate, e.g., a Python class for each set of bindings? So we
could actually have a RandomForest object that isn't just an opaque pointer
but actually has functions that return something?"

Though the class/struct that we provide will still be a wrapper, it
will have functions that return various things. For example, the "score"
member_method can return the RMSE for regression and the F1 score for
classification.
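
For reference, the two metrics mentioned above can be computed as in this small Python sketch. The function names are illustrative, and the F1 score is shown for the binary case with a chosen positive label; how "score" would actually be implemented in the bindings is still open.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between targets and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reg_score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
clf_score = f1_score([1, 0, 1, 1], [1, 1, 0, 1])
```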

Q6) "...how would we restructure our binding documentation?"

This might be the biggest task because all examples would also have to be
changed.
To keep the documentation up-to-date with the most recent interface we can
update it simultaneously instead of keeping the task for later (this would
require help from the community as it would be difficult for a single
person to go over all the documentation in all languages).

I hope that I was able to convey my ideas clearly and provide satisfactory
answers.
Though I have been contributing to mlpack for a while now, there can still
be things that I do not understand. Please correct me if I mentioned
anything wrong.

I am still working on finding an exhaustive list of basic member_methods
for each category. After that, I will work to create bindings for a single
mlpack_method as a proof-of-concept.

Please let me know what you think about this.

Feedback from everyone is welcome.

Regards,
Nippun Sharma
Github: NippunSharma