[mlpack] Cross-validation and hyper-parameter tuning infrastructure

Ryan Curtin ryan at ratml.org
Wed Apr 19 19:02:01 EDT 2017


Hi Kirill,

Thanks for the response.  I think this email chain is getting quite long
now, so sorry if there is a lot of reading to do. :)

On Mon, Apr 17, 2017 at 08:57:21AM +0500, Kirill Mishchenko wrote:
> Hi Ryan.
> 
> > - Use template metaprogramming tricks to, given a type, expand all of
> >   its constructor arguments into a list of numeric types.  So say we
> >   had:
> > 
> >     Learner(double a, AuxType b)
> >     AuxType(double c, double d)
> > 
> >   we would ideally want to extract [double, double, double] as our list
> >   of types.  I can't quickly think of a strategy for this but it
> >   *might* be possible...
> 
> Even if we are able to implement this approach, I guess the usage will
> be quite unintuitive. The implementation (if it is possible) will
> force a user to pass AuxType constructor arguments into the
> hyper-parameter tuning module instead of AuxType objects themselves,
> since we can’t
> extract constructor arguments from a created object.

I agree, it is somewhat counterintuitive.  When I take a look at your
proposal again, my understanding is that we are going to use a
std::array<T, n> type to represent the values that should be used for
the optimization.  This supports discrete choices of values.  So in that
case, we could possibly extend this logic for an AuxType class, with
something like (using the example from the previous emails):

std::array<AuxType, 3> a = { AuxType(1.0, 2.0),
                             AuxType(2.0, 3.0),
                             AuxType(4.0, 3.0) };

However this makes it unwieldy to optimize over AuxType objects with
multiple parameters.  Maybe a thought then is to pass something a little
more complex:

std::tuple<std::array<double, 3>, std::array<double, 2>> t =
    std::make_tuple(std::array<double, 3>{{ 1.0, 2.0, 4.0 }},
                    std::array<double, 2>{{ 2.0, 3.0 }});
(Something along those lines, anyway; you get the idea.)

What do you think?  Do these seem like approaches that could work?  I
think the design space here is pretty complex, so we have lots of
choices, but many are not good, so it is tricky to pick the best design.

How can we handle optimizing over a given range, though?  For instance,
we could use simulated annealing to optimize for the best lambda to use
for softmax regression, but that could result in any arbitrary value
between 0 and 1.  So we would want to pass in a range, not just a series
of values to test.

> > - Refactor all classes that take an auxiliary class to instead take a
> >   template parameter pack to be unpacked into the auxiliary classes'
> >   constructors.  This will still be a fair amount of metaprogramming
> >   effort but I can see a closer route to a solution with this one.
> 
> For this implementation the usage should be more understandable for
> users (in this solution we provide a constructor of MLAlgorithm that
> takes the arguments we are going to pass to the hyper-parameter tuning
> module), even though it is still quite complex (we need to pass
> AuxType constructor arguments instead of AuxType objects themselves in
> the first place). But what are we going to do when AuxType is
> std::unordered_map<size_t, std::pair<size_t, size_t>>* or arma::mat
> (they appear in HoeffdingTree and LARS respectively)?

Luckily, in our case we can ignore the strange std::unordered_map<>
object, since that will only be created by child nodes of a Hoeffding
tree or by someone doing something truly strange.  To be honest, we
could probably refactor that so the dimensionMappings parameter is only
available in a private constructor.
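
Something roughly like this, as a sketch of the idea (not the actual
HoeffdingTree code, which is templated and has more parameters):

class HoeffdingTree
{
 public:
  // Public constructor: no dimensionMappings parameter at all.
  HoeffdingTree(const arma::mat& data,
                const data::DatasetInfo& datasetInfo,
                const arma::Row<size_t>& labels,
                const size_t numClasses);

 private:
  // Only used internally when the tree builds its children.
  HoeffdingTree(const data::DatasetInfo& datasetInfo,
                const size_t numClasses,
                std::unordered_map<size_t, std::pair<size_t, size_t>>*
                    dimensionMappings);
};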

The LARS constructor that takes a gramMatrix arma::mat parameter is
another case: instead of passing data (from which the Gram matrix would
then be calculated), the user can pass a Gram matrix directly.  So in
that case the Gram matrix is the data itself, and it doesn't make sense
to optimize over it.

Still I guess that there should be some way for the user to pass one
specific value.  I wrote a proposal idea here, but then I saw the Bind()
idea below, which does the same thing.

Also, as a side note, I see in the proposal that the example
hoeffdingTreeOptimizer also optimizes over the DatasetInfo, but I think
this doesn't make too much sense since the DatasetInfo specifies which
dimensions are categorical and which are numeric.  I suppose it is
possible that, by allowing the user to optimize over DatasetInfos, we
would be letting them decide whether each dimension is better treated as
categorical or numeric.  Maybe that is what you had in mind?

> >> 3. In the case of hyper-parameter tuning I guess a loss function
> >> should be a wrapper for a cross-validation class (we want to optimize
> >> performance on validation sets). But it is not clear what type of
> >> interface it should provide: DecomposableFunctionType (like for SGD)
> >> or FunctionType (like for SA or GradientDescent, all prerequisites for
> >> which can potentially be combined in one class).
> > 
> > I'm not sure I fully follow here, can you clarify?
> 
> Existing optimizers in mlpack take as the first argument in their
> constructors a FunctionType object. There are different requirements
> for what the FunctionType object should implement, depending on the
> optimizer type. For instance, for SGD the FunctionType object should
> have the following method signature:
> 
>   double Evaluate(const arma::mat&, const size_t);
> 
> whereas for GradientDescent the FunctionType object should have this one:
> 
>   double Evaluate(const arma::mat&);
> 
> If I understand the whole discussion in the right way, we are ready to
> restrict ourselves to optimising only numerical parameters in order to
> utilize the existing interface for optimisers.

Agreed.  For categorical variables we can cast them to doubles, though,
as we discussed, this can be problematic for extremely large numbers of
categories.  However, I think a user with ~2^53 categories (if I am
remembering the number right) will run into other problems far before
they ever get the HyperParameterTuner initialized.
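
Concretely, the cast is trivial, since small integer category indices
are exactly representable as doubles; something like this sketch:

// A categorical/integer hyper-parameter travels through the optimizer's
// arma::mat as a double and is rounded back afterwards.
size_t checkInterval = 100;              // the original integer value
double encoded = double(checkInterval);  // what the optimizer works with
size_t decoded = size_t(encoded + 0.5);  // exact for values below 2^53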

> If so, I think it is
> quite possible to design a solution that allows the following usage.
> 
>   arma::mat data /* = ... */;
>   arma::Row<size_t> labels /* = ... */;
> 
>   HyperParameterTuner<HoeffdingTree<>, Accuracy, KFoldCV>
>       hoeffdingTreeTuner(data, labels);
> 
>   // Bound arguments
>   data::DatasetInfo datasetInfo /* = … */;
>   size_t numClasses = 5;
>   bool batchTraining = false;
>   size_t checkInterval = 100;
> 
>   // Setting sets of values to check
>   arma::vec successProbabilities = arma::regspace(0.9, 0.01, 0.99);
>   std::array<size_t, 2> maxSamplesSet = {0, 3};
>   std::array<size_t, 3> minSamplesSet = {50, 100, 150};
> 
>   // Making variables for best parameters
>   double successProbability;
>   size_t maxSamples;
>   size_t minSamples;
> 
>   // Finding best parameters
>   auto bestParameters =
>       hoeffdingTreeTuner.Optimize<GridSearch>(Bind(datasetInfo),
>           Bind(numClasses), Bind(batchTraining), successProbabilities,
>           maxSamplesSet, Bind(checkInterval), minSamplesSet);
> 
>   // Unpacking best parameters
>   std::tie(successProbability, maxSamples, minSamples) = bestParameters;
> 
> In this example we mark the arguments datasetInfo, numClasses,
> batchTraining, and checkInterval as being bound (they should not be
> optimised). For other HoeffdingTree constructor arguments we provide
> sets of values to investigate. Note also that we pass arguments in the
> same order as for the corresponding HoeffdingTree constructor.
> 
> The GridSearch interface will be similar to that of other optimisers.
> 
>   template<typename FunctionType, typename... Collections>
>   class GridSearch
>   {
>   public:
>     GridSearch(FunctionType& function,
>                const Collections&... parameterCollections);
> 
>     double Optimize(arma::mat& iterate);
>   };

And in this case parameterCollections will be the input, e.g.,
Bind(datasetInfo), Bind(numClasses), ..., minSamplesSet?

> A FunctionType function will be an instance of a cross-validation
> wrapper class with approximately the following interface.
>  
>   template<typename CV, int TotalArgs, typename... BoundedArgs>
>   class CVFunction
>   {
>   public:
>     CVFunction(CV& cv, const BoundedArgs&... boundedArgs);
> 
>     double Evaluate(const arma::mat& parameters);
>   };
> 
> During construction of a CVFunction object, we provide a
> cross-validation object, a sequence of bound arguments that carry
> information about their positions in the argument list of the
> cross-validation object's Evaluate method, and the total number of
> arguments that should be passed to that Evaluate method.
> 
> With such design we can reuse GridSearch in other mlpack algorithms,
> as well as add support for other mlpack optimisers in a relatively
> simple way. For example, it should be relatively easy to add support
> for the GradientDescent optimiser with the following usage.
> 
>   HyperParameterTuner<SoftmaxRegression<>, Accuracy, KFoldCV>
>       softmaxTuner(data, labels);
> 
>   // initial value for lambda
>   double lambda = 0.001;
> 
>   // gradient descent parameters
>   double stepSize = 0.001;
>   size_t maxIterations = 20;
> 
>   double bestLambda;
>   std::tie(bestLambda) =
>       softmaxTuner.Optimize<GradientDescent>(Bind(numClasses), lambda,
>           OptimizerArg(stepSize), OptimizerArg(maxIterations));
> 
> Let me know what you think about the proposed idea.

Yes!  I like this.  The use of OptimizerArg() and Bind() to specify
which arguments can and cannot be optimized is a nice way to do this.

Assuming that the HyperParameterTuner is able to turn all of the
OptimizerArg()s into a coherent arma::mat to be optimized, we can use
any of the mlpack optimizers.
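
For the GradientDescent example above, I would imagine the packing looks
roughly like this internally; just a sketch, none of these names are
final:

// Only the non-bound arguments end up in the matrix the optimizer sees;
// with only lambda free, the iterate is a 1x1 matrix.
arma::mat iterate(1, 1);
iterate(0, 0) = 0.001;  // initial value of lambda

// CVFunction::Evaluate(parameters) would then read the free values back
// out of 'parameters', recombine them with the Bind()-ed arguments, and
// call the cross-validation object to compute the objective value.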

To your original question about the DecomposableFunctionType, generally
the idea for a DecomposableFunctionType is that you can express the
objective function as a sum of objective functions.  For the way SGD is
used for many machine learning problems, this is pretty true: the
objective function is a sum of objectives over each data point.  This
could also be true for hyperparameter optimization: one simply
calculates the objective function on an individual point.

So it would be possible to use something like MiniBatchSGD to perform
hyperparameter optimization, using only a subset of the data points for
each calculation of the objective and gradient.

Usually it is pretty easy to take a FunctionType and turn it into a
DecomposableFunctionType (if the objective function is indeed
decomposable), so I think it should not present much of a challenge to adapt
the CVFunction (or whatever else might be used there) to this situation.
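
As a rough sketch of what I mean (all of these names are hypothetical;
the i-th "function" is just the objective restricted to the i-th point
or fold):

// Hypothetical adapter exposing the CV objective in the decomposable
// form that SGD-style optimizers expect.
class DecomposableCVFunction
{
 public:
  // Number of separable terms in the objective.
  size_t NumFunctions() const { return numFunctions; }

  // Contribution of term i to the objective, at the given
  // hyper-parameters.
  double Evaluate(const arma::mat& parameters, const size_t i);

 private:
  size_t numFunctions;
};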

Let me know if that makes sense; I can clarify further if you like.  I
am excited about where this discussion is going; I think there are some
really neat possibilities here for mlpack to provide functionality that
no other package does.

-- 
Ryan Curtin    | "Weeee!"
ryan at ratml.org |   - Bobby

