[mlpack] Cross-validation and hyper-parameter tuning infrastructure

Ryan Curtin ryan at ratml.org
Thu May 11 14:36:03 EDT 2017


On Tue, May 09, 2017 at 10:37:59AM +0500, Kirill Mishchenko wrote:
> Hi Ryan.
> 
> >> My suggestion is to add another overload:
> >> 
> >>  HyperParameterOptimizer<...> h(data, datasetInfo, labels);
> >> 
> >> This is because I consider the dataset information, which encodes the
> >> types of dimensions, to be a part of the dataset.  Not all machine
> >> learning methods support a DatasetInfo object; I believe that it is only
> >> DecisionTree and HoeffdingTree at the moment (maybe there is one more I
> >> forgot).
> > 
> > There are pros and cons to such a design. Advantage: for some users
> > it may be more natural to pass datasetInfo into the constructor
> > rather than into the Optimize method. Disadvantages: 1) we would
> > need to double the number of constructors for
> > HyperParameterOptimizer, as well as for the cross-validation
> > classes KFoldCV and SimpleCV (4 in total: weighted/non-weighted
> > learning crossed with presence/absence of the datasetInfo
> > parameter); 2) we would need to double the number of cases handled
> > in the Evaluate method of the cross-validation classes (4 in total
> > again, for the same combinations); 3) I'm not sure this can be
> > refactored away, so the same will probably hold for any new
> > cross-validation classes.
> 
> I just wanted to ask whether I need to clarify anything in my
> response. I look forward to hearing what you think about the problem:
> should we provide additional constructors (for HyperParameterOptimizer
> and the cross-validation classes), or should we change the constructor
> signatures as we discussed in
> https://github.com/mlpack/mlpack/issues/929?

Hi Kirill,

Sorry for the slow response; I have been busy with other things and
had not had a chance to reply here, although I had been thinking about
it.

I agree that there are some disadvantages to passing the DatasetInfo
into the constructor, but I think it's important to try to keep the
burden on users as light as possible.  So even though this will mean
some extra code and methods, I personally think it would be easier for
users if DatasetInfo objects were passed into the constructor (when
appropriate).
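
To be concrete, here is a rough sketch of the overloads I have in
mind.  None of this is final API; the class body and member layout are
just for illustration:

#include <mlpack/core.hpp>

// Rough sketch only: the DatasetInfo overload mirrors the existing
// (data, labels) constructor, so users with categorical data only add
// one argument.  The referenced objects must outlive the optimizer.
template<typename MLAlgorithm>
class HyperParameterOptimizer
{
 public:
  // Numeric data only.
  HyperParameterOptimizer(const arma::mat& data,
                          const arma::Row<size_t>& labels) :
      data(data), labels(labels) { }

  // Mixed categorical/numeric data; the DatasetInfo is treated as part
  // of the dataset, so it is bound at construction time too.
  HyperParameterOptimizer(const arma::mat& data,
                          const mlpack::data::DatasetInfo& datasetInfo,
                          const arma::Row<size_t>& labels) :
      data(data), labels(labels), datasetInfo(&datasetInfo) { }

 private:
  const arma::mat& data;
  const arma::Row<size_t>& labels;
  // Left null when no DatasetInfo is given.
  const mlpack::data::DatasetInfo* datasetInfo = nullptr;
};

Internally, Optimize() could then branch on whether a DatasetInfo was
supplied, which is where the extra handled cases you mention come in.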

> >> But now, we have C++11
> >> and rvalue references, so we can do a redesign here to work around at
> >> least the first issue: we can have the optimizers hold 'FunctionType',
> >> and allow the user to pass in a 'FunctionType&&' and then use the move
> >> constructor.
> > 
> > I’m not sure it’s possible since we don’t know the type of the
> > template parameter FunctionType until we initialize it in the body
> > of the method Optimize.

Is this because the FunctionType will contain information about the
specific types to be optimized?  If so, maybe we can add an overload to
OptimizerType constructors so that they can copy parameters from other
OptimizerTypes that have different FunctionTypes, i.e.

template<typename FunctionType>
template<typename OtherFunctionType>
OptimizerType<FunctionType>::OptimizerType(
    const OptimizerType<OtherFunctionType>& other);
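
Fleshed out a little, it could look like the following; SGDType here
is a made-up SGD-like optimizer for the example, not mlpack's actual
SGD class, and the parameters are just placeholders:

// Sketch: a cross-FunctionType "copy the parameters" constructor.
// Only the tuning parameters are copied; the new function is supplied
// separately.
template<typename FunctionType>
class SGDType
{
 public:
  // Normal constructor: hold the function and the tuning parameters.
  SGDType(FunctionType& function,
          const double stepSize = 0.01,
          const size_t maxIterations = 100000) :
      function(function),
      stepSize(stepSize),
      maxIterations(maxIterations) { }

  // Copy the tuning parameters (but not the function) from an
  // optimizer that was instantiated for a different FunctionType.
  template<typename OtherFunctionType>
  SGDType(FunctionType& function,
          const SGDType<OtherFunctionType>& other) :
      function(function),
      stepSize(other.StepSize()),
      maxIterations(other.MaxIterations()) { }

  double StepSize() const { return stepSize; }
  size_t MaxIterations() const { return maxIterations; }

 private:
  FunctionType& function;
  double stepSize;
  size_t maxIterations;
};

That way the hyper-parameter tuner could take whatever optimizer the
user configured and rebuild an equivalent one around its own internal
FunctionType.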

Alternatively, perhaps the FunctionType used for the hyperparameter
optimizer could be re-engineered so that it doesn't require knowledge
of the specific types to be optimized when it is created.

Overall I think it is important to have a clean API, so it is better
to avoid having the user pass in optimizer parameters via
HyperParameterOptimizer::Optimize(..., OptimizerArg(optimizerParam1),
OptimizerArg(optimizerParam2), ...).
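
To make the contrast concrete (purely hypothetical usage, reusing the
SGDType sketch above; h, CVFunction, and cvFunction are placeholders):

// Instead of threading each optimizer parameter through Optimize() in
// a wrapper, e.g.
//
//   h.Optimize(OptimizerArg(0.01), OptimizerArg(100000), ...);
//
// the optimizer would be configured up front and handed over whole:
SGDType<CVFunction> optimizer(cvFunction, 0.01, 100000);
h.Optimize(optimizer);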

Let me know what you think.

Thanks,

Ryan

-- 
Ryan Curtin    | "In honor of the last American hero, to whom speed
ryan at ratml.org | means freedom of the soul."  - Super Soul

