[mlpack] Cross-validation and hyper-parameter tuning infrastructure

Kirill Mishchenko ki.mishchenko at gmail.com
Thu Apr 20 02:38:32 EDT 2017


Hi Ryan.

> However this makes it unwieldy to optimize over AuxType objects with
> multiple parameters.  Maybe a thought then is to pass something a little
> more complex:
> 
> std::tuple<std::array<double, 3>, std::array<double, 2>> t =
>    std::make_tuple({ 1.0, 2.0, 4.0 }, { 2.0, 3.0 });
> (I think the syntax is probably wrong here, but you get the idea.)
> 
> What do you think?  Are these ways that could work?  I think the design
> space here is pretty complex so we have lots of choices, but many are
> not good so it is tricky to pick the best design.

This approach can be problematic. While handling such arguments we need to create AuxType objects that will be passed to MLAlgorithm constructors, but from an argument type like std::tuple<std::array<double, 3>, std::array<double, 2>> (as in the example above) alone we can't tell what AuxType should be. I guess a simpler and safer solution is to add a constructor MLAlgorithm(const DataType&, const PredictionsType&, …, AuxTypeConstructorArgType1, …, AuxTypeConstructorArgTypeN, …) instead of/in addition to MLAlgorithm(const DataType&, const PredictionsType&, …, AuxType, …) where it is required. In this case we will be able to reuse the general approach.
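
For illustration, here is a minimal sketch of what I mean (MLAlgorithm, AuxType, and the argument names are placeholders, not real mlpack classes):

  #include <mlpack/core.hpp>

  class AuxType
  {
   public:
    AuxType(const double c, const double d);
  };

  class MLAlgorithm
  {
   public:
    // The existing form: the user passes a ready-made AuxType object.
    MLAlgorithm(const arma::mat& data,
                const arma::Row<size_t>& labels,
                const AuxType& aux);

    // The additional form: the user passes the AuxType constructor
    // arguments directly, and the AuxType object is created internally.
    MLAlgorithm(const arma::mat& data,
                const arma::Row<size_t>& labels,
                const double c,
                const double d) :
        MLAlgorithm(data, labels, AuxType(c, d)) { }
  };

With the second constructor the hyper-parameter tuner only ever sees numeric arguments, so the general approach applies unchanged.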

> How can we handle optimizing over a given range, though?  For instance
> we could use, e.g., simulated annealing to optimize for the best lambda
> to use for softmax regression, but this could result in any arbitrary
> value between 0 and 1.  So we would want to pass in a range, not just a
> series of values to test.

I guess simulated annealing is a good choice for such cases, but it should probably be implemented as a separate optimiser, as an extension of what we have already discussed, since grid search is usually understood as iterating over discrete values (see, e.g., the grid search implemented in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

> Also, as a side note, I see in the proposal that the example
> hoeffdingTreeOptimizer also optimizes over the DatasetInfo, but I think
> this doesn't make too much sense since the DatasetInfo specifies which
> dimensions are categorical and which are numeric.  I suppose, it is
> possible, by allowing the user to optimize over DatasetInfos, we might
> be allowing them to decide whether it is better to have their dimensions
> set as categorical or numeric.  Maybe that is what you had in mind?

When I was writing the proposal, I didn't have in mind the requirement that we really want to utilize existing optimizers and reuse grid search in other algorithms. Rather, I proposed the simplest solution (in my humble opinion) to make grid search work for hyper-parameter tuning: passing to the method Optimize all additional arguments of MLAlgorithm as sets of values, even if we want to use only one value. That is why datasetInfo was passed as an array containing one value.

> However, I think a user with ~2^53 categories (if I am
> remembering the number right) will run into other problems far before
> they ever get the HyperParameterTuner initialized.


I certainly agree with you :)

> Assuming that the HyperParameterTuner is able to turn all of the
> OptimizerArg()s into a coherent arma::mat to be optimized, we can use
> any of the mlpack optimizers.

It’s true that we can potentially reuse any of the mlpack optimisers, but I would like to clarify a bit more how the method Optimize is going to be used and implemented (what I have in mind). Firstly, the arguments at the beginning of the argument list should correspond to the additional arguments of the corresponding MLAlgorithm constructor. For the call

  auto bestParameters =
      hoeffdingTreeTuner.Optimize<GridSearch>(Bind(datasetInfo),
          Bind(numClasses), Bind(batchTraining), successProbabilities,
          maxSamplesSet, Bind(checkInterval), minSamplesSet);

such arguments are all of the passed arguments. For the call

  std::tie(bestLambda) =
      softmaxTuner.Optimize<GradientDescent>(Bind(numClasses), lambda,
          OptimizerArg(stepSize), OptimizerArg(maxIterations));

such arguments are Bind(numClasses) and lambda. Secondly, the two examples mentioned above will be handled differently. In the first case, unbound arguments (successProbabilities, maxSamplesSet, minSamplesSet) will be passed to the optimiser constructor (GridSearch). In the second case, unbound arguments (lambda) will be passed to the method Optimize of the optimiser (GradientDescent) by packing them into an arma::mat, and will be used as initial values. We can decide between these two cases by, for example, checking whether the type of the first unbound argument is a class type or not. Thirdly, we use OptimizerArg (or a function with some more understandable name) to mark an argument that is not like the ones from the first example but nevertheless should be passed to the optimizer constructor. Without OptimizerArg we don't know whether, e.g., stepSize should be used as an initial value packed into the arma::mat or as an argument for the GradientDescent constructor. HyperParameterTuner will be implemented as if optimizer constructors are of the following general form:

  Optimizer(FunctionType& function,
      ArgType1 arg1, ArgType2 arg2, ..., ArgTypeN argN,
      // arguments corresponding to the additional arguments of the
      // MLAlgorithm constructor:
      const ArgTypes&... args);

Going back to our examples, successProbabilities, maxSamplesSet, and minSamplesSet from the first example will be passed to the GridSearch constructor as "args"; stepSize and maxIterations from the second example will be passed to the GradientDescent constructor as "arg1" and "arg2". It is also possible to implement an optimizer that uses both "args" and "arg1", …, "argN". A potential instance of that is simulated annealing, which can use "arg1", …, "argN" as its metaparameters and "args" for specifying ranges and distributions for MLAlgorithm hyper-parameters.
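
To make the class-type check concrete, here is a rough sketch of how the internal dispatch could look (just my assumption about an implementation detail; void return types and omitted bodies for brevity):

  #include <type_traits>

  // Case 1: the first unbound argument is a class type (e.g. arma::vec or
  // std::array of values to try), so all unbound arguments are forwarded
  // to the optimizer constructor, as in the GridSearch example.
  template<typename T, typename... Args>
  typename std::enable_if<std::is_class<T>::value, void>::type
  OptimizeImpl(const T& firstUnbound, const Args&... rest);

  // Case 2: the first unbound argument is arithmetic (e.g. double lambda),
  // so the unbound arguments are packed into an arma::mat of initial
  // values that is passed to the optimizer's Optimize() method, as in the
  // GradientDescent example.
  template<typename T, typename... Args>
  typename std::enable_if<std::is_arithmetic<T>::value, void>::type
  OptimizeImpl(const T& firstUnbound, const Args&... rest);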

> To your original question about the DecomposableFunctionType, generally
> the idea for a DecomposableFunctionType is that you can express the
> objective function as a sum of objective functions.  For the way SGD is
> used for many machine learning problems, this is pretty true: the
> objective function is a sum of objectives over each data point.  This
> could also be true for hyperparameter optimization: one simply
> calculates the objective function on an individual point.
> 
> So it would be possible to do something like using MiniBatchSGD in order
> to perform hyperparameter optimization, but only using a subset of the
> data points before each calculation of the objective and gradient.
> 
> Usually it is pretty easy to take a FunctionType and turn it into a
> DecomposableFunctionType (if the objective function is indeed
> decomposable) so I think it should not present much challenge to adapt
> the CVFunction (or whatever else might be used there) to this situation.

So I guess we can start by implementing CVFunction with the same signature for the method Evaluate as in FunctionType classes.
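
For example, a first rough sketch of CVFunction under that assumption could look as follows (how parameters are unpacked into MLAlgorithm constructor arguments is left out here):

  template<typename CVType>
  class CVFunction
  {
   public:
    CVFunction(CVType& cv) : cv(cv) { }

    // The FunctionType signature used by optimizers like GridSearch and
    // GradientDescent: one objective value per parameter vector.
    double Evaluate(const arma::mat& parameters)
    {
      // Unpack parameters into MLAlgorithm constructor arguments and run
      // cross-validation; since optimizers minimize, we negate measures
      // like accuracy.
      return -cv.Evaluate(/* arguments unpacked from parameters */);
    }

   private:
    CVType& cv;
  };

A decomposable Evaluate(const arma::mat&, const size_t) overload could then be added later if we want to try SGD-like optimizers.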

Best regards,

Kirill Mishchenko

> On 20 Apr 2017, at 04:02, Ryan Curtin <ryan at ratml.org> wrote:
> 
> Hi Kirill,
> 
> Thanks for the response.  I think this email chain is getting quite long
> now, so sorry if there is a lot of reading to do. :)
> 
> On Mon, Apr 17, 2017 at 08:57:21AM +0500, Kirill Mishchenko wrote:
>> Hi Ryan.
>> 
>>> - Use template metaprogramming tricks to, given a type, expand all of
>>>  its constructor arguments into a list of numeric types.  So say we
>>>  had:
>>> 
>>>    Learner(double a, AuxType b)
>>>    AuxType(double c, double d)
>>> 
>>>  we would ideally want to extract [double, double, double] as our list
>>>  of types.  I can't quickly think of a strategy for this but it
>>>  *might* be possible...
>> 
>> Even if we are able to implement this approach, I guess the usage will
>> be quite unintuitive. The implementation (if it is possible) will
>> force a user to pass AuxType constuctor arguments into hyper parameter
>> tuning module instead of AuxType objects themselves since we can’t
>> extract constructor arguments from a created object.
> 
> I agree, it is somewhat counterintuitive.  When I take a look at your
> proposal again, my understanding is that we are going to use a
> std::array<T, n> type to represent the values that should be used for
> the optimization.  This supports discrete choices of values.  So in that
> case, we could possibly extend this logic for an AuxType class, with
> something like (using the example from the previous emails):
> 
> std::array<AuxType, 3> a = { AuxType(1.0, 2.0),
>                             AuxType(2.0, 3.0),
>                             AuxType(4.0, 3.0) };
> 
> However this makes it unwieldy to optimize over AuxType objects with
> multiple parameters.  Maybe a thought then is to pass something a little
> more complex:
> 
> std::tuple<std::array<double, 3>, std::array<double, 2>> t =
>    std::make_tuple({ 1.0, 2.0, 4.0 }, { 2.0, 3.0 });
> (I think the syntax is probably wrong here, but you get the idea.)
> 
> What do you think?  Are these ways that could work?  I think the design
> space here is pretty complex so we have lots of choices, but many are
> not good so it is tricky to pick the best design.
> 
> How can we handle optimizing over a given range, though?  For instance
> we could use, e.g., simulated annealing to optimize for the best lambda
> to use for softmax regression, but this could result in any arbitrary
> value between 0 and 1.  So we would want to pass in a range, not just a
> series of values to test.
> 
>>> - Refactor all classes that take an auxiliary class to instead take a
>>>  template parameter pack to be unpacked into the auxiliary classes'
>>>  constructors.  This will still be a fair amount of metaprogramming
>>>  effort but I can see a closer route to a solution with this one.
>> 
>> For this implementation the usage should be more understandable for
>> users (in this solution we provide a constructor of MLAlgorithm that
>> takes arguments that we are going to pass in hyper-parameter tuning
>> module), even though it is still quite complex (we need to pass
>> AuxType constuctor arguments instead of AuxType objects themselves in
>> the first place). But what are we going to do when AuxType is
>> std::unordered_map<size_t, std::pair<size_t, size_t>>* or arma::mat
>> (they appear in HoeffdingTree and LARS respectively)?
> 
> Luckily in our case the strange std::unordered_map<> object we can
> ignore, since that will only be created by children nodes of a Hoeffding
> tree or someone doing something truly strange.  To be honest we could
> probably refactor that so the dimensionMappings parameter is only
> available in a private constructor.
> 
> The LARS constructor that takes a gramMatrix arma::mat parameter is also
> another case where instead of the user passing data (where the Gram
> matrix will subsequently be calculated), they can pass a Gram matrix
> directly.  So in that case the Gram matrix is the data itself, and it
> doesn't make sense to optimize over that.
> 
> Still I guess that there should be some way for the user to pass one
> specific value.  I wrote a proposal idea here, but then I saw the Bind()
> idea below, which does the same thing.
> 
> Also, as a side note, I see in the proposal that the example
> hoeffdingTreeOptimizer also optimizes over the DatasetInfo, but I think
> this doesn't make too much sense since the DatasetInfo specifies which
> dimensions are categorical and which are numeric.  I suppose, it is
> possible, by allowing the user to optimize over DatasetInfos, we might
> be allowing them to decide whether it is better to have their dimensions
> set as categorical or numeric.  Maybe that is what you had in mind?
> 
>>>> 3. In the case of hyper-parameter tuning  I guess a loss function
>>>> should be a wrap for a cross validation class (we want to optimize
>>>> performance on validation sets). But it is not clear what type of
>>>> interface it should provide: DecomposableFunctionType (like for SGD)
>>>> or FunctionType (like for SA or GradientDescent, all prerequisites for
>>>> which can potentially be combined in one class).
>>> 
>>> I'm not sure I fully follow here, can you clarify?
>> 
>> Existing optimizers in mlpack take as the first argument in their
>> constructors a FunctionType object. There are different requirement
>> what the FunctionType object should implement depending on the
>> optimizer type. For instance, for SGD the FunctionType object should
>> have the following method signature:
>> 
>>  double Evaluate(const arma::mat&, const size_t);
>> 
>> whereas for GradientDescent the FunctionType object should have this one:
>> 
>>  double Evaluate(const arma::mat&);
>> 
>> If I understand the whole discussion in the right way, we are ready to
>> restrict ourself to optimise only numerical parameters in order to
>> utilize the existing interface for optimisers.
> 
> Agreed.  For categorical variables we can cast them to doubles, though
> like we discussed for extremely large numbers of categories this can be
> problematic.  However, I think a user with ~2^53 categories (if I am
> remembering the number right) will run into other problems far before
> they ever get the HyperParameterTuner initialized.
> 
>> If so, I think it is
>> quite possible to design a solution that allows the following usage.
>> 
>>  arma::mat data /* = ... */;
>>  arma::Row<size_t> labels /* = ... */;
>> 
>>  HyperParameterTuner<HoeffdingTree<>, Accuracy, KFoldCV>
>>      hoeffdingTreeTuner(data, labels);
>> 
>>  // Bound arguments
>>  data::DatasetInfo datasetInfo /* = … */;
>>  size_t numClasses = 5;
>>  bool batchTraining = false;
>>  size_t checkInterval = 100;
>> 
>>  // Setting sets of values to check
>>  arma::vec successProbabilities = arma::regspace(0.9, 0.01, 0.99);
>>  std::array<size_t, 2> maxSamplesSet = {0, 3};
>>  std::array<size_t, 3> minSamplesSet = {50, 100, 150};
>> 
>>  // Making variables for best parameters
>>  double successProbability;
>>  size_t maxSamples;
>>  size_t minSamples;
>> 
>>  // Finding best parameters
>>  auto bestParameters =
>>      hoeffdingTreeTuner.Optimize<GridSearch>(Bind(datasetInfo),
>>          Bind(numClasses), Bind(batchTraining), successProbabilities,
>>          maxSamplesSet, Bind(checkInterval), minSamplesSet);
>> 
>>  // Unpacking best parameters
>>  std::tie(successProbability, maxSamples, minSamples) = bestParameters;
>> 
>> In this example we mark the arguments datasetInfo, numClasses,
>> batchTraining, and checkInterval as being bound (they should not be
>> optimised). For other HoeffdingTree constructor arguments we provide
>> sets of values to investigate. Note also that we pass arguments in the
>> same order as for the corresponding HoeffdingTree constructor.
>> 
>> The GridSearch interface will be similar to other optimisers.
>> 
>>  template<typename FunctionType, typename...Collections>
>>  class GridSearch
>>  {
>>  public:
>>    GridSearch(FunctionType& function,
>>              const Collections& ...parameterCollections);
>> 
>>    double Optimize(arma::mat& iterate);
>>  };
> 
> And in this case parameterCollections will be the input, e.g.,
> Bind(datasetInfo), Bind(numClasses), ..., minSamplesSet?
> 
>> A FunctionType function will be an instance of a cross validation
>> wrapper class with approximately such interface.
>> 
>>  template<typename CV, typename...BoundedArgs, int TotalArgs>
>>  class CVFunction
>>  {
>>  public:
>>    CVFunction(CV& cv, const BoundedArgs& ...boundedArgs);
>> 
>>    double Evaluate(const arma::mat& parameters);
>>  };
>> 
>> During construction of a CVFunction object, we provide a cross
>> validation object, a sequence of bounded arguments that should contain
>> information about position in the argument list for the method
>> Evaluate of the cross validation object, and total number of arguments
>> that should be passed to the method Evaluate of the cross validation
>> object.
>> 
>> With such design we can reuse GridSearch in other mlpack algorithms,
>> as well as add support for other mlpack optimisers in a relatively
>> simple way. For example, it should be relatively easy to add support
>> for the GradientDescent optimiser with the following usage.
>> 
>>  HyperParameterTuner<SoftmaxRegression<>, Accuracy, KFoldCV>
>>      softmaxTuner(data, labels);
>> 
>>  // initial value for lambda
>>  double lambda = 0.001;
>> 
>>  // gradient descent parameters
>>  double stepSize = 0.001;
>>  size_t maxIterations = 20;
>> 
>>  double bestLambda;
>>  std::tie(bestLambda) =
>>      softmaxTuner.Optimize<GradientDescent>(Bind(numClasses), lambda,
>>          OptimizerArg(stepSize), OptimizerArg(maxIterations));
>> 
>> Let me know what you think about the proposed idea.
> 
> Yes!  I like this.  The use of OptimizerArg() and Bind() to specify
> which arguments can and cannot be optimized is a nice way to do this.
> 
> Assuming that the HyperParameterTuner is able to turn all of the
> OptimizerArg()s into a coherent arma::mat to be optimized, we can use
> any of the mlpack optimizers.
> 
> To your original question about the DecomposableFunctionType, generally
> the idea for a DecomposableFunctionType is that you can express the
> objective function as a sum of objective functions.  For the way SGD is
> used for many machine learning problems, this is pretty true: the
> objective function is a sum of objectives over each data point.  This
> could also be true for hyperparameter optimization: one simply
> calculates the objective function on an individual point.
> 
> So it would be possible to do something like using MiniBatchSGD in order
> to perform hyperparameter optimization, but only using a subset of the
> data points before each calculation of the objective and gradient.
> 
> Usually it is pretty easy to take a FunctionType and turn it into a
> DecomposableFunctionType (if the objective function is indeed
> decomposable) so I think it should not present much challenge to adapt
> the CVFunction (or whatever else might be used there) to this situation.
> 
> Let me know if that makes sense.  I can clarify further if you like.  I
> am excited with where this discussion is going; I think there are some
> real neat possibilities for mlpack to provide some nice functionality
> that no other package does here.
> 
> -- 
> Ryan Curtin    | "Weeee!"
> ryan at ratml.org |   - Bobby
