[mlpack] Basic AdaBoost CLI design ideas

Ryan Curtin gth671b at mail.gatech.edu
Tue Mar 11 12:50:36 EDT 2014


On Wed, Mar 12, 2014 at 12:38:20AM +0800, Mj Liu wrote:
> This is an attempt to define the command line interface (CLI).  The CLI
> shall provide simple, intuitive instructions to the users, and provide
> helpful info when users are not quite sure how to use the AdaBoost
> command.  I think the following command line interface (CLI) options
> shall be provided to users:
>   1) --help           prints the full help info
>   2) --version        AdaBoost may evolve with the development of the
>                       mlpack library, and the algorithm itself may evolve
>                       along with it.
>   3) --weak_learner   followed by a built-in weak learner algorithm in
>                       mlpack.  This *must* be specified.
>   4) --Iteration      maximum number of steps the algorithm can execute;
>                       the default shall be 1000 or some other value
>   5) --InputFile      user-provided dataset path.  This *must* be
>                       specified.
>   6) --outputFile     path for the algorithm's output file; the default
>                       shall be "output.csv" (or some other name) in the
>                       working directory
>   7) --ThreadNo       number of threads used to run the algorithm; the
>                       default shall be a single thread.
> 
> The implementation of the CLI could reference the other methods provided
> by mlpack.  Ryan Curtin has mentioned that OpenMP is a good option for a
> multi-threaded implementation, since it provides a much clearer structure
> and is easy to maintain (the comment was on "Ideas on AdaBoost").  The
> single-threaded algorithm shall be implemented in the first stage.

This is a reasonable interface but I think the C++ interface is more
important (and then providing a CLI interface from that is usually
straightforward).  I am glad you've taken the time to list out these
ideas.  Options (1) and (2) are provided by default by the CLI module so
you won't need to worry about those.
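
For reference, wiring up the rest of those options in an eventual
adaboost_main.cpp would look roughly like the sketch below.  This is only
an illustration using the existing PARAM_* macros from the CLI module; the
option names and descriptions are placeholders rather than a final
interface, and --help/--version are generated automatically from these
declarations:

#include <mlpack/core.hpp>
#include <string>

using namespace mlpack;

// Placeholder program documentation; this is what --help prints.
PROGRAM_INFO("AdaBoost", "An implementation of the AdaBoost meta-algorithm "
    "using weak learners provided by mlpack.");

// Placeholder option declarations, loosely following the list above.
PARAM_STRING_REQ("input_file", "File containing the input dataset.", "i");
PARAM_STRING_REQ("weak_learner", "The weak learner to boost.", "w");
PARAM_STRING("output_file", "File to save predictions into.", "o",
    "output.csv");
PARAM_INT("iterations", "Maximum number of boosting iterations.", "I", 1000);

int main(int argc, char** argv)
{
  CLI::ParseCommandLine(argc, argv);

  const std::string inputFile = CLI::GetParam<std::string>("input_file");
  const size_t iterations = (size_t) CLI::GetParam<int>("iterations");

  // ... load the dataset, construct the weak learners, run AdaBoost ...

  return 0;
}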

However, there is a slight issue -- AdaBoost can be used with any
combination of weak learners.  So I could do AdaBoost with 3 C4.5
decision trees, 1 decision stump, and 14 perceptrons, for instance.  How
would that be specified on the command line?  (How we will make the CLI do
that is a different question that we can figure out later.)
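
One hypothetical way to express that (purely a sketch to illustrate the
problem; it reuses the option names from your list, and the "type:count"
syntax is made up) might be something like

  adaboost --weak_learner decision_stump:1 --weak_learner perceptron:14 \
      --InputFile dataset.csv --outputFile predictions.csv

but I am not claiming that is the right design; it is just the kind of
combination the interface would have to be able to express.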

> I'm not sure if we shall provide the following CLI option:
>   *) --Algorithm      a user-defined algorithm which can be called by the
>                       AdaBoost algorithm.
> If this is provided, would the user-defined algorithm need to be compiled
> into mlpack?

Most mlpack algorithms have a CLI interface that implements multiple
algorithms; take a look at NCA, for instance, which allows either the
L_BFGS optimizer or the SGD optimizer ('nca' is the name of the program,
in src/mlpack/methods/nca/).
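
To give an idea of the pattern, the nca executable does roughly the
following (paraphrased from memory, so the exact class and parameter names
may differ slightly from what is actually in src/mlpack/methods/nca/): a
string parameter selects the optimizer, and the program instantiates the
corresponding template specialization.

#include <mlpack/core.hpp>
#include <mlpack/core/optimizers/sgd/sgd.hpp>
#include <mlpack/methods/nca/nca.hpp>

using namespace mlpack;
using namespace mlpack::nca;
using namespace mlpack::metric;
using namespace mlpack::optimization;

// Rough sketch of the optimizer dispatch done in the nca program.
void RunNCA(const arma::mat& data, const arma::Col<size_t>& labels,
            const std::string& optimizerType, arma::mat& distance)
{
  if (optimizerType == "sgd")
  {
    NCA<SquaredEuclideanDistance, SGD> nca(data, labels);
    nca.LearnDistance(distance);
  }
  else // Default to L-BFGS.
  {
    NCA<SquaredEuclideanDistance, L_BFGS> nca(data, labels);
    nca.LearnDistance(distance);
  }
}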

> As Ryan mentioned, the AdaBoost class definition could look something
> like the following:
> 
> template<
>   typename WeakClassifier1Type,
>   typename WeakClassifier2Type,
>   typename WeakClassifier3Type,
>   ...
> >
> class AdaBoost
> {
>   ...
>   Classify()
>   ...
> };
> 
> 
> Then WeakClassifier1Type would have to be defined.  I believe this is
> doable for the methods defined within mlpack, but for user-defined
> classifiers this would not work properly.  So, the AdaBoost algorithm
> shall have the ability to be inherited by a user-defined class, and the
> user can just override the calling method, like
>         adaboost::setWeakLearner(void (*WeakLearner) (char **args, ...))
> and then AdaBoost does everything else.
> The above can be viewed as an attempt to define the CLI; suggestions,
> comments, and criticisms are welcome! :)

I don't like this solution because there is no other code in mlpack that
uses function pointers.  C++ allows a bewildering array of design
paradigms, but I think it is important to restrict mlpack to only a
subset of these, so that someone working with the code knows what to
expect.  This also helps keep the complexity of the codebase to a
minimum.

In addition, inheritance (and virtual functions) can incur
non-negligible runtime cost in cases where functions are being looked up
over and over again.  I am not sure if this is the case in AdaBoost, but
even so, I'd prefer to avoid inheritance (and all the refactoring that
would have to come with it).

Instead, I would much prefer a templated solution, where the types of
all weak classifiers are given as template arguments to the AdaBoost
class.  I know that this is potentially a more difficult way to go about
it, but like I said above, I'd prefer to avoid inheritance, and I
definitely want to avoid C-style function pointers.
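
To make that a bit more concrete, here is a very rough sketch of the kind
of interface I have in mind, assuming we can use C++11 variadic templates
(whether mlpack should require C++11 is a separate question).  All of the
names here are placeholders, not a proposed final API:

#include <tuple>
#include <mlpack/core.hpp>

// Sketch only: every weak learner type is a template argument, so there is
// no virtual dispatch and no function pointers.
template<typename... WeakLearnerTypes>
class AdaBoost
{
 public:
  // Construct and run AdaBoost on the given data, boosting one
  // already-constructed instance of each weak learner type.
  AdaBoost(const arma::mat& data,
           const arma::Row<size_t>& labels,
           const size_t maxIterations,
           WeakLearnerTypes&... weakLearners);

  // Classify test points with the boosted ensemble.
  void Classify(const arma::mat& testData, arma::Row<size_t>& predictions);

 private:
  // References to the weak learners; iterated over at compile time.
  std::tuple<WeakLearnerTypes&...> weakLearners;
};

// Hypothetical usage (assuming decision stump and perceptron classes
// exist):
//   AdaBoost<DecisionStump, Perceptron> boost(data, labels, 1000, ds, p);

The Classify() implementation can then walk the tuple at compile time, so
there is no vtable or function-pointer lookup anywhere in the inner loop.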

Let me know if you have any other ideas of how to approach the problem.
I am trying to spend time thinking of my own, too.

-- 
Ryan Curtin    | "Avoid the planet Earth at all costs."
ryan at ratml.org |   - The President


