Optimizer implementation tutorial

The field of optimization is vast and complex. In optimization problems, we have to find solutions which are optimal or near-optimal with respect to some goals. Usually, we are not able to solve problems in one step, but we follow some process which guides us through problem-solving. `mlpack`

implements multiple strategies for optimizing different objective functions and provides different strategies for selecting and customizing the appropriate optimization algorithm. This tutorial discusses how to use each of the techniques that `mlpack`

implements.

This visualization allows us to see how many popular optimizers perform on different optimization problems. Select a problem to optimize, then select an optimizer and tune its parameters, and see the steps that the optimizer takes plotted in pink. Note that you can zoom in and rotate the drawing. Also, you can compare how different optimizers perform on a given problem in the second graph. As you try a given problem with more optimizers, the objective function vs. the number of iterations is plotted for each optimizer you have tried.

`Step-size: 0.01`

`Iterations: 4000`

The defaults here are not necessarily good for the given problem, so it is suggested that the values used be tailored to the task at hand. (Use the mouse to zoom/drag the function.)

As intuition says, system has higher probability of staying in the states with a smaller stepsize. As the stepsize goes up, imbalance becomes stronger. When the stepsize is close to zero, the system stays in the state(s) with the highest cost.

`Evaluations: Dynamic`

A plot of the cost reveals distinct properties for each optimizer with its own style of convergence.

AdaDelta - AdaGrad - Adam - AdaGrad - Adam - AdaMax - AMSGrad - Nadam - CNE - IQN - L_BFGS - RMSProp - SMORMS3 - SPALeRASGD - SVRG - SVRG (Barzilai-Borwein) - SARAH - SARAH+ - Katyusha - CMAES

In order to facilitate consistent implementations, we have defined a `FunctionType`

API that describes all the methods that an objective function may implement. `mlpack`

offers a few variations of this API to cover different function characteristics. This leads to several different APIs for different function types:

- FunctionType: a normal, differentiable objective function.
- DecomposableFunctionType: a differentiable objective function that can be decomposed into the sum of many objective functions (for SGD-like optimizers).
- SparseFunctionType: a decomposable, differentiable objective function with a sparse gradient.
- NonDifferentiableFunctionType: a non-differentiable objective function that can only be evaluated.
- ConstrainedFunctionType: an objective function with constraints on the allowable inputs.
- NonDifferentiableDecomposableFunctionType: a decomposable non-differentiable objective function that can only be evaluated.
- ResolvableFunctionType: a differentiable objective function where calculating the gradient with respect to only one parameter is possible.

Each of these types of objective functions require slightly different methods to be implemented. In some cases, methods will be automatically deduced by the optimizers using template metaprogramming and this allows the user to not need to implement every method for a given type of objective function. Each type described above is detailed in the following sections.

A function satisfying the `FunctionType`

API is a general differentiable objective function. It must implement an `Evaluate()`

and a `Gradient()`

method. The interface used for that can be the following two methods:

// For non-separable objectives. This should return the objective value for the

// given parameters.

double Evaluate(const arma::mat& parameters);

// For non-separable differentiable objectives. This should store the gradient

// for the given parameters in the 'gradient' matrix.

void Gradient(const arma::mat& parameters, arma::mat& gradient);

However, there are some objective functions (like logistic regression) for which it is computationally convenient to calculate both the objective and the gradient at the same time. Therefore, optionally, the following function can be implemented:

// For non-separable differentiable objectives. This should store the gradient

// fro the given parameters in the 'gradient' matrix and return the objective

// value.

double EvaluateWithGradient(const arma::mat& parameters, arma::mat& gradient);

It is not a problem to implement all three of these methods, but it is not obligatory to. `EvaluateWithGradient()`

will automatically be inferred if it is not written from the `Evaluate()`

and `Gradient()`

functions; similarly, the `Evaluate()`

and `Gradient()`

functions will be inferred from `EvaluateWithGradient()`

if they are not available. However, any automatically inferred method may be slightly slower.

The following optimizers use the `FunctionType`

API:

A function satisfying the `DecomposableFunctionType`

API is a differentiable objective function that can be decomposed into a number of separable objective functions. Examples of these types of objective functions include those that are a sum of loss on individual data points; so, common machine learning tasks like logistic regression or training a neural network can be expressed as optimizing a decomposable objective function.

Any function implementing the `DecomposableFunctionAPI`

must implement the following four methods:

// For decomposable functions: return the number of parts the optimization

// problem can be decomposed into.

size_t NumFunctions();

// For decomposable objectives. This should calculate the partial objective

// starting at the decomposable function indexed by 'start' and calculate

// 'batchSize' partial objectives and return the sum.

double Evaluate(const arma::mat& parameters,

const size_t start,

const size_t batchSize);

// For separable differentiable objective functions. This should calculate the

// gradient starting at the decomposable function indexed by 'start' and

// calculate 'batchSize' decomposable gradients and store the sum in the

// 'gradient' matrix.

void Gradient(const arma::mat& parameters,

const size_t start,

arma::mat& gradient,

const size_t batchSize);

// Shuffle the ordering of the functions.

void Shuffle();

Note that the decomposable objective functions should support batch computation—this can allow significant speedups. The `Shuffle()`

method shuffles the ordering of the functions. For some optimizers, randomness is an important component, so it is important that it is possible to shuffle the ordering of the decomposable functions.

As with the regular `FunctionType`

API, it is optional to implement an `EvaluateWithGradient()`

method in place of, or in addition to, the `Evaluate()`

and `Gradient()`

methods. The interface used for that should be the following:

// For decomposable objectives. This should calculate the partial objective

// starting at the decomposable function indexed by 'start' and calculate

// 'batchSize' partial objectives and return the sum. This should also

// calculate the gradient starting at the decomposable function indexed by

// 'start' and calculate 'batchSize' decomposable gradients and store the sum in

// the 'gradient' matrix.

double EvaluateWithGradient(const arma::mat& parameters,

const size_t start,

arma::mat& gradient,

const size_t batchSize);

The following mlpack optimizers require functions implementing the `DecomposableFunctionType`

API:

- StandardSGD
- MomentumSGD
- AdaDelta
- AdaGrad
- Adam
- AdaMax
- AMSGrad
- Nadam
- NadaMax
- IQN
- Katyusha
- KatyushaProximal
- RMSProp
- SARAH
- SARAH+
- SGDR
- SMORMS3
- SPALeRA SGD
- SVRG
- SVRG-BB

A function satisfying the `SparseFunctionType`

API is a decomposable differentiable objective function with the condition that a single individual gradient is sparse. The API is slightly different but similar to the `DecomposableFunctionType`

API; the following methods are necessary:

// For decomposable functions: return the number of parts the optimization

// problem can be decomposed into.

size_t NumFunctions();

// For decomposable objectives. This should calculate the partial objective

// starting at the decomposable function indexed by 'start' and calculate

// 'batchSize' partial objectives and return the sum.

double Evaluate(const arma::mat& parameters,

const size_t start,

const size_t batchSize);

// For separable differentiable objective functions. This should calculate the

// gradient starting at the decomposable function indexed by 'start' and

// calculate 'batchSize' decomposable gradients and store the sum in the

// 'gradient' matrix, which is a sparse matrix.

void Gradient(const arma::mat& parameters,

const size_t start,

arma::sp_mat& gradient,

const size_t batchSize);

The `Shuffle()`

method is not needed for the `SparseFunctionType`

API.

Note that it is possible to write a `Gradient()`

method that accepts a template parameter `GradType`

, which may be `arma::mat`

(dense Armadillo matrix) or `arma::sp_mat`

(sparse Armadillo matrix). This allows support for both the `DecomposableFunctionType`

API and the `SparseFunctionType`

API, as below:

template<typename GradType>

void Gradient(const arma::mat& parameters,

const size_t start,

GradType& gradient,

const size_t batchSize);

The following mlpack optimizers require an objective function satisfying the `SparseFunctionType`

API:

A function satisfying the `NonDifferentiableFunctionType`

API is a general non-differentiable objective function. Only an `Evaluate()`

method must be implemented, with the following signature:

// For non-separable objectives. This should return the objective value for the

// given parameters.

double Evaluate(const arma::mat& parameters);

The following mlpack optimizers require an objective function satisfying the `NonDifferentiableFunctionType`

API:

A function satisfying the `ConstrainedFunctionType`

API is a general differentiable objective function that has differentiable constraints for the parameters. This API is more complex than the others and requires five methods to be implemented:

// For non-separable objectives. This should return the objective value for the

// given parameters.

double Evaluate(const arma::mat& parameters);

// For non-separable differentiable objectives. This should store the gradient

// for the given parameters in the 'gradient' matrix.

void Gradient(const arma::mat& parameters, arma::mat& gradient);

// Return the number of constraints.

size_t NumConstraints();

// Evaluate the constraint with the given index.

double EvaluateConstraint(const size_t index, const arma::mat& parameters);

// Store the gradient of the constraint with the given index in the 'gradient'

// matrix.

void GradientConstraint(const size_t index,

const arma::mat& parameters,

arma::mat& gradient);

The following mlpack optimizers require a `ConstrainedFunctionType:`

A function satisfying the `NonDifferentiableDecomposableFunctionType`

API is a decomposable non-differentiable objective function. Only an `Evaluate()`

and a `NumFunctions()`

method must be implemented, with the following signatures:

// For decomposable functions: return the number of parts the optimization

// problem can be decomposed into.

size_t NumFunctions();

// For decomposable objectives. This should calculate the partial objective

// starting at the decomposable function indexed by 'start' and calculate

// 'batchSize' partial objectives and return the sum.

double Evaluate(const arma::mat& parameters,

const size_t start,

const size_t batchSize);

The following mlpack optimizers require a `NonDifferentiableDecomposableFunctionType:`

A function satisfying the `ResolvableFunctionType`

API is a partially differentiable objective function. For this API, three methods must be implemented, with the following signatures:

// For partially differentiable functions: return the number of partial

// derivatives.

size_t NumFeatures();

// For non-separable objectives. This should return the objective value for the

// given parameters.

double Evaluate(const arma::mat& parameters);

// For partially differentiable sparse and non-sparse functions. Store the

// given partial gradient for the parameter index 'j' in the 'gradient' matrix.

template<typename GradType>

void PartialGradient(const arma::mat& parameters,

const size_t j,

GradType& gradient);

It is not required to templatize so that both sparse and dense gradients can be used, but it can be helpful.

The following mlpack optimizers require a `ResolvableFunctionType:`

An optimizer must implement only the method:

template<typename FunctionType>

double Optimize(FunctionType& function, arma::mat& parameters);

The `Optimize()`

method optimizes the given function `function`

, and stores the best set of parameters in the matrix `parameters`

and returns the best objective value.

If the optimizer requires a given API from above, the following functions from `src/mlpack/optimizers/function/static_checks.hpp`

can be helpful:

- mlpack::optimization::traits::CheckFunctionTypeAPI()
- mlpack::optimization::traits::CheckDecomposableFunctionTypeAPI()
- mlpack::optimization::traits::CheckSparseFunctionTypeAPI()
- mlpack::optimization::traits::CheckNonDifferentiableFunctionTypeAPI()
- mlpack::optimization::traits::CheckConstrainedFunctionTypeAPI()
- mlpack::optimization::traits::CheckNonDifferentiableDecomposableFunctionTypeAPI()
- mlpack::optimization::traits::CheckResolvableFunctionTypeAPI()

The example below constructs a simple function, where each dimension has a parabola with a distinct minimum. Note, in this example we maintain an ordering with the vector `order`

; in other situations, such as training neural networks, we could simply shuffle the columns of the data matrix in `Shuffle()`

.

class ObjectiveFunction

{

public:

// A separable function consisting of four quadratics.

ObjectiveFunction()

{

in = arma::vec("20 12 15 100");

bi = arma::vec("-4 -2 -3 -8");

}

size_t NumFunctions() { return 4; }

void Shuffle() { ord = arma::shuffle(arma::uvec("0 1 2 3")); }

double Evaluate(const arma::mat& para, const size_t s, const size_t bs)

{

double cost = 0;

for (size_t i = s; i < s + bs; i++)

cost += para(ord[i]) * para(ord[i]) + bi(ord[i]) * para(ord[i]) + in(ord[i]);

return cost;

}

void Gradient(const arma::mat& para, const size_t s, arma::mat& g, const size_t bs)

{

g.zeros(para.n_rows, para.n_cols);

for (size_t i = s; i < s + bs; i++)

g(ord[i]) += (1.0 / bs) * 2 * para(ord[i]) + bi(ord[i]);

}

private:

// Intercepts.

arma::vec in;

// Coefficient.

arma::vec bi;

// Function order.

arma::uvec ord;

};

For the optimization of the defined `ObjectiveFunction`

and other `mlpack`

objective functions, we must implement only an Optimize() method, and a constructor to set some parameters. The code is given below.

class SimpleOptimizer

{

public:

SimpleOptimizer(const size_t bs = 1, const double lr = 0.02) : bs(bs), lr(lr) { }

template<typename FunctionType>

double Optimize(FunctionType& function, arma::mat& parameter)

{

arma::mat gradient;

for (size_t i = 0; i < 5000; i += bs)

{

if (i % function.NumFunctions() == 0)

{

function.Shuffle();

}

function.Gradient(parameter, i % function.NumFunctions(), gradient, bs);

parameter -= lr * gradient;

}

return function.Evaluate(parameter, 0, function.NumFunctions());

}

private:

size_t bs;

double lr;

};

Note for the sake of simplicity we omitted checks on the batch size (`bs`

). This optimizer assumes that `function.NumFunctions()`

is a multiple of the batch size, we also omitted other typical parts of real implementations, a more detailed example may be found in the complete API documentation". Still, `SimpleOptimizer`

works with any mlpack objective function which implements Evaluate that can compute a partial objective function, and `Gradient`

that can compute a part of the gradient starting with a separable function.

Finding the minimum of `ObjectiveFunction`

or any other `mlpack`

function can be done as shown below.

ObjectiveFunction function;

arma::mat parameter("0 0 0 0;");

SimpleOptimizer optimizer;

double objective = optimizer.Optimize(function, parameter);

std::cout << "Objective: " << objective << std::endl;

std::cout << "Optimized function parameter: " << parameter << std::endl;

The final value of the objective function should be close to the optimal value, which is the sum of values at the vertices of the parabolas.

Further documentation for each `Optimizer`

may be found in the complete API documentation. In addition, more information on the testing functions may be found in its complete API documentation.