[mlpack] Ideas on AdaBoost developments

Ryan Curtin gth671b at mail.gatech.edu
Thu Mar 6 13:14:20 EST 2014


On Thu, Mar 06, 2014 at 09:07:15AM +0800, Mj Liu wrote:
> Hi all,
>       I have to thank 闫林 (godspeed1989 at gmail.com) and Ryan Curtin
> (gth671b at mail.gatech.edu) for their suggestions and their interest in my
> question about applying for GSoC. I have checked the GitHub repository and
> learned a lot, thanks.

We have a Github repository?  I wasn't aware.

The svn repo that we use can be found at
http://svn.cc.gatech.edu/fastlab/mlpack/trunk/

so you might want to look at that code instead of whatever is on Github
(can you give me a link to what you are looking at, so that I know about
it?).

>     Recently, I went through the tutorial part of mlpack, tried to compile
> the package, and checked several of mlpack's methods.  I was thinking about
> the philosophy behind mlpack's design; that is, why we develop mlpack and
> how we should present it to users.
>      I have several suggestions on the design of the AdaBoost part:
> 
>    - The AdaBoost part shall, like any other method, provide:
>       - 1)  a simple command-line executable whose weak learners can be
>       set by parameters
>       - 2)  a simple C++ interface
>       - 3)  a generic, extensible, and powerful C++ class (AdaBoost) for
>       complex usage
>    - The weak learners shall be developed separately from the AdaBoost
>    part, which means the weak learners themselves shall provide
>    self-standing functionality; AdaBoost is just another method that can
>    improve the results of these weak learners.
>    - AdaBoost shall provide a multi-threaded version as well as a general
>    single-threaded version.  Multi-core computers are widely used in
>    industry and research centers, and the AdaBoost method itself runs the
>    same procedure several times, so multi-threading is reasonable.  A
>    single-threaded version is also needed for small problems or small
>    data sets, I think.  (I would like to thank xxx for providing the
>    AdaBoost repository on GitHub, thx :~) )

We should use OpenMP for this, because OpenMP's parallelization
abstractions are quite simple.  One aim of mlpack is that its code be
readable, and in my experience the support code that MPI-enabled
programs require can get quite ugly.  Plus, for what you're proposing,
OpenMP should be just fine.
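To be concrete, here is a rough sketch of the kind of loop I mean (all
of the names here are hypothetical; none of this is existing mlpack
code).  Note that the boosting rounds themselves are sequential, since
each round depends on the weights produced by the previous one, but the
per-point work inside a round parallelizes trivially:

    #include <omp.h>
    #include <armadillo>

    // Sum the weights of the misclassified points in parallel.
    double WeightedError(const arma::Row<size_t>& predictions,
                         const arma::Row<size_t>& labels,
                         const arma::vec& weights)
    {
      double error = 0.0;
      #pragma omp parallel for reduction(+:error)
      for (size_t i = 0; i < predictions.n_elem; ++i)
        if (predictions[i] != labels[i])
          error += weights[i];
      return error;
    }

If OpenMP is disabled at compile time, the pragma is simply ignored and
you get the single-threaded version for free, which covers your small
data set case without maintaining two code paths.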

>    - As I checked the gmm and knn methods and the others, I was wondering
>    whether it is possible to build a uniform interface for all of the
>    methods, like an interface "Algorithms.hpp".  The uniform interface
>    shall provide a uniform mechanism for how the methods are called by
>    users, like methods "run(), load(), save()".  With a uniform interface
>    it would be easy to learn all of the methods.

Yeah, this would be a requirement if the AdaBoost class is going to use
templates to accept the types of the weak learners it will use; see the
sketch below.
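Something like this is what I have in mind (again, just a sketch; none
of this exists in the trunk yet, and the weak learner constructor and
Classify() signature are assumptions):

    #include <mlpack/core.hpp>

    template<typename WeakLearnerType>
    class AdaBoost
    {
     public:
      AdaBoost(const arma::mat& data,
               const arma::Row<size_t>& labels,
               const size_t iterations)
      {
        // Start with uniform weights over the points.
        arma::vec weights(data.n_cols);
        weights.fill(1.0 / data.n_cols);

        for (size_t t = 0; t < iterations; ++t)
        {
          // Every weak learner must be trainable on weighted data...
          WeakLearnerType w(data, labels, weights);

          // ...and must be able to classify points.
          arma::Row<size_t> predictions;
          w.Classify(data, predictions);

          // (Compute the weighted error and the coefficient for this
          // round, update the weights, and store w for later use.)
        }
      }
    };

The uniform interface you describe is exactly what makes the template
work: any class that provides that constructor and a Classify() method
can be dropped in as a weak learner.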

>    - I think mlpack shall be independent from "arma"; I suggest we port
>    some basic methods from "arma" to mlpack.  Then it would be quite easy
>    to install mlpack, and it would save much preparation time when
>    building.

I'm not sure I understand what you mean here.  Armadillo is already
separate from mlpack, although it is a dependency.  What would you want
to port from Armadillo?  Why not just call arma::<whatever method>()?
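Armadillo already does the heavy lifting for us; for example, a
least-squares solve (the core of a linear regression weak learner) is a
single call:

    #include <armadillo>

    int main()
    {
      // Toy overdetermined system: 100 equations, 5 unknowns.
      arma::mat X;
      X.randu(100, 5);
      arma::vec y;
      y.randu(100);

      // For a non-square system, arma::solve() returns the
      // least-squares solution.
      arma::vec coefficients = arma::solve(X, y);

      coefficients.print("coefficients:");
      return 0;
    }

Reimplementing even that much inside mlpack would lose us Armadillo's
LAPACK backend, and it wouldn't really make installation any easier.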

>    - I think the work of developing AdaBoost shall be separated into
>    several periods:
>       - 1) one or two weak learner implementations;
>       - 2) then a single-threaded AdaBoost shall be implemented;
>       - 3) a multi-threaded version of AdaBoost shall be implemented;
>       - 4) more weak learners shall be added to mlpack.
> 
>       I would like to thank 闫林 (godspeed1989 at gmail.com) and Ryan Curtin
> (gth671b at mail.gatech.edu) again for the comments last time.
>      Thanks for reading and thanks for your time.  Any comments are
> welcome.

Everything you've written seems reasonable.  mlpack already has some
weak learners, including least-squares linear regression, ridge
regression, and the naive Bayes classifier.  But we should definitely
add some more, and I agree that we should provide command-line
interfaces to them too, so they can be self-standing (like you wrote).
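For example, using the existing linear regression class on its own
looks roughly like this (check the trunk headers for the exact
signatures before relying on them):

    #include <mlpack/methods/linear_regression/linear_regression.hpp>

    using namespace mlpack::regression;

    int main()
    {
      // Toy data: columns are points, as everywhere in mlpack.
      arma::mat predictors;
      predictors.randu(5, 100);
      arma::vec responses = predictors.row(0).t();

      // Train the model, then predict on the training points.
      LinearRegression lr(predictors, responses);
      arma::vec predictions;
      lr.Predict(predictors, predictions);

      return 0;
    }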

Thanks,

Ryan

-- 
Ryan Curtin    | "Do they hurt?"
ryan at ratml.org |   - Jessica 6


