[mlpack] Cross-validation and hyper-parameter tuning infrastructure

Tue May 9 01:37:59 EDT 2017

Hi Ryan.

>> My suggestion is to add another overload:
>> 
>>  HyperParameterOptimizer<...> h(data, datasetInfo, labels);
>> 
>> This is because I consider the dataset information, which encodes the
>> types of dimensions, to be a part of the dataset.  Not all machine
>> learning methods support a DatasetInfo object; I believe that it is only
>> DecisionTree and HoeffdingTree at the moment (maybe there is one more I
>> forgot).
> 
> There are pros and cons to such a design. Advantage: for some users it
> can be more natural to pass datasetInfo into the constructor rather than
> into the method Optimize. Disadvantages:
>
> 1) We need to double the number of constructors for
>    HyperParameterOptimizer, as well as for the cross-validation classes
>    KFoldCV and SimpleCV (4 in total: weighted/non-weighted learning
>    combined with presence/absence of the datasetInfo parameter).
> 2) We need to double the number of cases handled in the implementation
>    of the method Evaluate of the cross-validation classes (4 in total
>    again: weighted/non-weighted learning combined with presence/absence
>    of the datasetInfo parameter).
> 3) I’m not sure this can be refactored away, so the same will probably
>    be true for new cross-validation classes.

I would just like to ask whether I need to clarify anything in my response. I look forward to hearing what you think about the problem: should we provide additional constructors (for HyperParameterOptimizer and the cross-validation classes), or should we change the constructor signatures as we discussed in https://github.com/mlpack/mlpack/issues/929?
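
To make the count concrete, here is a minimal sketch of what the four constructor overloads could look like for one cross-validation class. It is only an illustration: CrossValidationSketch, Matrix, Labels, Weights and the DatasetInfo stand-in are hypothetical names chosen for this example, not the actual mlpack classes or signatures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins (assumptions), not the real mlpack types.
using Matrix = std::vector<std::vector<double>>;
using Labels = std::vector<std::size_t>;
using Weights = std::vector<double>;
struct DatasetInfo { std::vector<bool> isCategorical; };

// Illustrative class only; it records which overload was used.
class CrossValidationSketch
{
 public:
  // 1) Non-weighted learning, no DatasetInfo.
  CrossValidationSketch(const Matrix& xs, const Labels& /* ys */)
      : points(xs.size()), hasInfo(false), hasWeights(false) {}

  // 2) Non-weighted learning, with DatasetInfo.
  CrossValidationSketch(const Matrix& xs, const DatasetInfo& /* info */,
                        const Labels& /* ys */)
      : points(xs.size()), hasInfo(true), hasWeights(false) {}

  // 3) Weighted learning, no DatasetInfo.
  CrossValidationSketch(const Matrix& xs, const Labels& /* ys */,
                        const Weights& /* weights */)
      : points(xs.size()), hasInfo(false), hasWeights(true) {}

  // 4) Weighted learning, with DatasetInfo.
  CrossValidationSketch(const Matrix& xs, const DatasetInfo& /* info */,
                        const Labels& /* ys */, const Weights& /* weights */)
      : points(xs.size()), hasInfo(true), hasWeights(true) {}

  std::size_t Points() const { return points; }
  bool HasInfo() const { return hasInfo; }
  bool HasWeights() const { return hasWeights; }

 private:
  std::size_t points;
  bool hasInfo;
  bool hasWeights;
};
```

Each of the four combinations needs its own overload here, and an Evaluate method would have to distinguish the same four cases.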

Best regards,

Kirill Mishchenko

> On 28 Apr 2017, at 08:09, Kirill Mishchenko <ki.mishchenko at gmail.com> wrote:
> 
> Hi Ryan.
> 
>> My suggestion is to add another overload:
>> 
>>  HyperParameterOptimizer<...> h(data, datasetInfo, labels);
>> 
>> This is because I consider the dataset information, which encodes the
>> types of dimensions, to be a part of the dataset.  Not all machine
>> learning methods support a DatasetInfo object; I believe that it is only
>> DecisionTree and HoeffdingTree at the moment (maybe there is one more I
>> forgot).
> 
> There are pros and cons to such a design. Advantage: for some users it
> can be more natural to pass datasetInfo into the constructor rather than
> into the method Optimize. Disadvantages:
>
> 1) We need to double the number of constructors for
>    HyperParameterOptimizer, as well as for the cross-validation classes
>    KFoldCV and SimpleCV (4 in total: weighted/non-weighted learning
>    combined with presence/absence of the datasetInfo parameter).
> 2) We need to double the number of cases handled in the implementation
>    of the method Evaluate of the cross-validation classes (4 in total
>    again: weighted/non-weighted learning combined with presence/absence
>    of the datasetInfo parameter).
> 3) I’m not sure this can be refactored away, so the same will probably
>    be true for new cross-validation classes.
> 
>> But now, we have C++11
>> and rvalue references, so we can do a redesign here to work around at
>> least the first issue: we can have the optimizers hold 'FunctionType',
>> and allow the user to pass in a 'FunctionType&&' and then use the move
>> constructor.
> 
> I’m not sure it’s possible since we don’t know the type of the template parameter FunctionType until we initialize it in the body of the method Optimize.
> 
>> Thanks again for the discussion,
> 
> My pleasure.
> 
> Best regards,
> 
> Kirill Mishchenko
> 
>> On 26 Apr 2017, at 20:17, Ryan Curtin <ryan at ratml.org> wrote:
>> 
>> On Wed, Apr 26, 2017 at 11:24:18AM +0500, Kirill Mishchenko wrote:
>>> Hi Ryan.
>>> 
>>>> The key problem, like you said, is that we don't know what AuxType
>>>> should be so we can't call its constructor.  But maybe we can adapt
>>>> things a little bit:
>>>> 
>>>> template<typename AuxType, typename... Args>
>>>> struct Holder /* needs a better name */
>>>> {
>>>> // This typedef allows us access to the type we need to construct.
>>>> typedef AuxType Aux;
>>>> 
>>>> // These are the parameters we will use.
>>>> std::tuple<Args...> args;
>>>> 
>>>> Holder(Args... argsIn) { /* put argsIn into args */ }
>>>> };
>>>> 
>>>> Then we could use this in addition with the Bind() class when calling an
>>>> optimizer:
>>>> 
>>>> std::array<double, 3> param3s = { 1.0, 2.0, 4.0 };
>>>> std::array<double, 2> auxParam1s = { 1.0, 3.0 };
>>>> std::array<double, 4> auxParam2s = { 4.0, 5.0, 6.0, 8.0 };
>>>> auto results = tuner.Optimize<GridSearch>(Bind(param1), Bind(param2),
>>>>     param3s, Holder<AuxType>(auxParam1s, auxParam2s));
>>>> 
>>>> Like most of my other code ideas, this is a very basic sketchup, but I
>>>> think it can work.  Let me know what you think or if there is some
>>>> detail I did not think about enough that will make the idea fail. :)
>>> 
>>> I think this approach is quite implementable. Moreover, we should be
>>> able to provide support of Bind for aux parameters:
>>> 
>>>  std::array<double, 3> param3s = { 1.0, 2.0, 4.0 };
>>>  double auxParam1 = 1.0;
>>>  std::array<double, 4> auxParam2s = { 4.0, 5.0, 6.0, 8.0 };
>>>  auto results = tuner.Optimize<GridSearch>(Bind(param1), Bind(param2),
>>>     param3s, Holder<AuxType>(Bind(auxParam1), auxParam2s));
>> 
>> Yeah, that seems like it will work.  It might be worth spending some
>> time thinking about what would be the easiest for the user to
>> understand, but in either case the general implementation will be the
>> same.
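
A compilable variant of the Holder idea might look like the sketch below. It rests on some assumptions: MakeHolder is a hypothetical helper added here so that the argument types can be deduced while AuxType is given explicitly (a bare Holder<AuxType>(a, b) would instantiate Holder with an empty Args pack), and DummyAux is a placeholder for whatever auxiliary type would actually be constructed.

```cpp
#include <array>
#include <tuple>
#include <type_traits>
#include <utility>

template<typename AuxType, typename... Args>
struct Holder
{
  // This typedef gives access to the type that needs to be constructed later.
  typedef AuxType Aux;

  // The parameters (or sets of parameter values) to be used later.
  std::tuple<Args...> args;

  // Store the given arguments in the tuple.
  Holder(Args... argsIn) : args(std::move(argsIn)...) {}
};

// Hypothetical helper (an assumption, not from the thread): AuxType is given
// explicitly and the argument types are deduced from the call.
template<typename AuxType, typename... Args>
Holder<AuxType, typename std::decay<Args>::type...> MakeHolder(Args&&... args)
{
  return Holder<AuxType, typename std::decay<Args>::type...>(
      std::forward<Args>(args)...);
}

// Placeholder auxiliary type, for illustration only.
struct DummyAux
{
  double param1;
  double param2;
};
```

With this, a call such as MakeHolder<DummyAux>(auxParam1s, auxParam2s) yields an object whose Aux typedef is DummyAux and whose args tuple holds copies of both arrays.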
>> 
>>>> Sure; I think maybe we should allow the user to pass in a DatasetInfo
>>>> with the training data and labels, to keep things simple.
>>> 
>>> Can you clarify a bit more what you mean here?
>> 
>> Yeah, my impression is that the user creates the hyperparameter
>> optimizer like this:
>> 
>>  HyperParameterOptimizer<...> h(data, labels);
>> 
>> My suggestion is to add another overload:
>> 
>>  HyperParameterOptimizer<...> h(data, datasetInfo, labels);
>> 
>> This is because I consider the dataset information, which encodes the
>> types of dimensions, to be a part of the dataset.  Not all machine
>> learning methods support a DatasetInfo object; I believe that it is only
>> DecisionTree and HoeffdingTree at the moment (maybe there is one more I
>> forgot).
>> 
>>>> // move optimizer type to class template parameter
>>>> HyperParameterOptimizer<SoftmaxRegression<>, Accuracy, KFoldCV, SA> h;
>>>> 
>>>> h.Optimizer().Tolerance() = 1e-5;
>>>> h.Optimizer().MoveCtrlSweep() = 3;
>>>> 
>>>> h.Optimize(…);
>>> 
>>> In this approach we need to construct an optimizer before the method
>>> Optimize (of HyperParamOptimizer(Tuner) in the example above) is
>>> called, and that can be very problematic for two reasons.
>>> 
>>> 1. We don’t know what FunctionType object (which wraps cross
>>> validation) to optimize since it depends on what we pass to the method
>>> Optimize (in particular, it depends on whether or not we bind some
>>> arguments).
>>> 
>>> 2. In the case of GridSearch we also don’t know sets of values for
>>> parameters before calling the method Optimize. Recall that we pass
>>> these sets of values during construction of a GridSearch object.
>> 
>> Right, I see what you mean.  At the current time the mlpack optimizers
>> expect a 'FunctionType&' to be passed to the optimizer, and this
>> reference is held internally.  However, that design decision was made
>> before C++11 and was intended to avoid copies.  But now, we have C++11
>> and rvalue references, so we can do a redesign here to work around at
>> least the first issue: we can have the optimizers hold 'FunctionType',
>> and allow the user to pass in a 'FunctionType&&' and then use the move
>> constructor.
>> 
>> In that way, you could create an optimizer without having access to the
>> instantiated FunctionType.
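
The ownership change described above could be sketched roughly as follows. Everything here is hypothetical: OwningOptimizerSketch, QuadraticFunction and the step-halving search are made up for illustration and are not existing mlpack optimizers. Note that FunctionType is still a class template parameter in this sketch; only the function object itself is supplied later.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical move-only objective; the unique_ptr member disables copying.
struct QuadraticFunction
{
  explicit QuadraticFunction(double t) : target(new double(t)) {}

  double Evaluate(double x) const
  {
    const double d = x - *target;
    return d * d;
  }

  std::unique_ptr<double> target;
};

// Hypothetical optimizer that owns its function instead of holding a
// reference to it.
template<typename FunctionType>
class OwningOptimizerSketch
{
 public:
  // Settings can be modified before any function object exists.
  double& Tolerance() { return tolerance; }

  // Take the function by rvalue reference and move it into the optimizer.
  void Function(FunctionType&& f)
  {
    function.reset(new FunctionType(std::move(f)));
  }

  // Simple 1-D step-halving search, only to exercise the stored function.
  // Assumes tolerance > 0.
  double Optimize(double x, double step) const
  {
    if (!function)
      throw std::logic_error("no function has been set");

    while (step > tolerance)
    {
      const double here = function->Evaluate(x);
      if (function->Evaluate(x + step) < here)
        x += step;
      else if (function->Evaluate(x - step) < here)
        x -= step;
      else
        step /= 2.0;
    }
    return x;
  }

 private:
  double tolerance = 1e-5;
  std::unique_ptr<FunctionType> function;
};
```

Because the function is moved in rather than referenced, a temporary such as QuadraticFunction(3.0) can be handed over directly, and no copy is made.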
>> 
>> I can see a few ways to solve the second issue after that change is
>> done.  But in either case, the goal from my end would be to avoid a big
>> long call to Optimize() that has Bind(), Holder<>(), and
>> OptimizerArg() types all in it.  I think the idea of passing optimizer
>> arguments after the arguments to the machine learning algorithm and
>> marking them all with OptimizerArg() might be confusing for users, and
>> it's easier if they can directly modify the parameters of the optimizer.
>> 
>>>> If that's correct, then it might be nice to implement some additional
>>>> idea such as when the user passes a 'math::Range<double> lambda', the
>>>> search will be over all possible values of lambda within the given
>>>> range.  (One can simply modify the objective value to be DBL_MAX when
>>>> outside the bounds of the given lambda, or we can consider visiting how
>>>> optimizers can work in a constrained context.)
>>> 
>>> I think this behaviour should be handled by optimizers since we are
>>> supposed to call them only once. I guess we have already touched on
>>> this feature in the discussion about simulated annealing.
>> 
>> I agree; at the current time we don't have any support for constrained
>> optimizers though.  Whatever you end up implementing for GridSearch
>> might be a good start, since technically grid search is a special case
>> of constrained optimization.
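
The DBL_MAX idea mentioned earlier in this exchange can be sketched as a thin wrapper around an objective. The names are assumptions: RangeSketch is a stand-in for a range type such as math::Range, and BoundedObjective and SquaredError are hypothetical classes written only for this example.

```cpp
#include <cfloat>
#include <utility>

// Stand-in for a closed range [lo, hi]; an assumption, not the mlpack class.
struct RangeSketch
{
  double lo;
  double hi;

  bool Contains(double v) const { return v >= lo && v <= hi; }
};

// Hypothetical example objective with its minimum at `target`.
struct SquaredError
{
  double target;

  double Evaluate(double x) const { return (x - target) * (x - target); }
};

// Wraps an objective so that any point outside the range evaluates to
// DBL_MAX, which an unconstrained minimizer should then avoid.
template<typename FunctionType>
class BoundedObjective
{
 public:
  BoundedObjective(FunctionType f, RangeSketch r)
      : function(std::move(f)), range(r) {}

  double Evaluate(double lambda) const
  {
    return range.Contains(lambda) ? function.Evaluate(lambda) : DBL_MAX;
  }

 private:
  FunctionType function;
  RangeSketch range;
};
```

This only penalizes out-of-range points; it does not make the optimizer itself aware of the constraint, which is the alternative mentioned in the same paragraph.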
>> 
>>> In light of what we have discussed recently, I think it is worth
>>> revisiting what can be implemented as a GSoC project, and when. <...>
>> 
>> I agree with the changes that you have proposed.
>> 
>> Thanks again for the discussion, I think the ideas here are getting
>> really mature.  I think that there is some cool functionality that will
>> be possible with these modules that isn't possible in any other machine
>> learning library.  For instance, even just hyperparameter search over
>> continuous variables isn't very well supported by other toolkits, and
>> would be a really nice thing to showcase for mlpack.
>> 
>> Ryan
>> 
>> -- 
>> Ryan Curtin    | "You can think about it... but don't do it."
>> ryan at ratml.org |   - Sheriff Justice
> 
