mlpack_logistic_regression

NAME

mlpack_logistic_regression - L2-regularized logistic regression and prediction

SYNOPSIS

mlpack_logistic_regression [-h] [-v]

DESCRIPTION

An implementation of L2-regularized logistic regression using either the L-BFGS optimizer or SGD (stochastic gradient descent). This solves the regression problem

y = 1 / (1 + e^-(X * b))

where y takes values 0 or 1.
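As an illustration of the formula above, the following is a minimal Python sketch of the logistic function applied to one point. It is not mlpack code: the names `logistic` and `predict_probability` and the sample values are hypothetical, and the sketch omits any intercept term.

```python
import math

def logistic(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(x, b):
    """Value of 1 / (1 + e^-(x * b)) for one point x and coefficients b.

    Assumption: no intercept term is included in this sketch.
    """
    z = sum(xi * bi for xi, bi in zip(x, b))
    return logistic(z)

x = [1.0, 2.0]     # hypothetical data point
b = [0.5, -0.25]   # hypothetical coefficients
p = predict_probability(x, b)  # here x * b = 0, the midpoint of the curve
```

The output lies strictly between 0 and 1 for moderate inputs and is interpreted as the probability that y is 1.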

This program allows loading a logistic regression model (via the ’--input_model_file (-m)’ parameter), training a logistic regression model given training data (specified with the ’--training_file (-t)’ parameter), or both at once. In addition, this program allows classification on a test dataset (specified with the ’--test_file (-T)’ parameter), and the classification results may be saved with the ’--output_file (-o)’ output parameter. The trained logistic regression model may be saved using the ’--output_model_file (-M)’ output parameter.

The training data, if specified, may have class labels as its last dimension. Alternatively, the ’--labels_file (-l)’ parameter may be used to specify a separate matrix of labels.

When a model is being trained, several options are available. L2 regularization (to prevent overfitting) can be specified with the ’--lambda (-L)’ option, and the optimizer used to train the model can be specified with the ’--optimizer (-O)’ parameter. Available options are ’sgd’ (stochastic gradient descent) and ’lbfgs’ (the L-BFGS optimizer). There are also various parameters for the optimizer; the ’--max_iterations (-n)’ parameter specifies the maximum number of allowed iterations, and the ’--tolerance (-e)’ parameter specifies the tolerance for convergence. For the SGD optimizer, the ’--step_size (-s)’ parameter controls the step size taken at each iteration by the optimizer, and the ’--batch_size (-b)’ parameter controls the batch size. If the objective function for your data is oscillating between Inf and 0, the step size is probably too large. The optimizers have more parameters, but the C++ interface must be used to access these.
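To make the role of the regularization parameter concrete, here is a rough Python sketch of an L2-regularized logistic regression objective. The exact form used by mlpack is not stated in this page; the sketch assumes a penalty of 0.5 * lambda * ||b||^2 applied to every coefficient, and the name `regularized_loss` is hypothetical.

```python
import math

def regularized_loss(X, y, b, lam):
    """Negative log-likelihood plus an L2 penalty.

    Assumption: the penalty is 0.5 * lam * ||b||^2 over all coefficients;
    the scaling and intercept handling in mlpack may differ.
    """
    nll = 0.0
    for xi, yi in zip(X, y):
        z = sum(xij * bj for xij, bj in zip(xi, b))
        # log(1 + e^z), computed without overflow for large |z|
        log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        # per-point negative log-likelihood for a label yi in {0, 1}
        nll += log1pexp - yi * z
    penalty = 0.5 * lam * sum(bj * bj for bj in b)
    return nll + penalty

X = [[1.0], [-1.0]]   # hypothetical two-point dataset
y = [1, 0]
loss_unregularized = regularized_loss(X, y, [2.0], 0.0)
loss_regularized = regularized_loss(X, y, [2.0], 0.1)
```

Under this assumed form, a larger lambda raises the cost of large coefficients, which is how the penalty discourages overfitting.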

For SGD, an iteration refers to a single point. So to take a single pass over the dataset with SGD, ’--max_iterations (-n)’ should be set to the number of points in the dataset.
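The arithmetic above can be sketched in Python as follows. The helper names and the sample data are hypothetical, and the sketch assumes a CSV with one point per non-empty line and no header row.

```python
def count_points(csv_text):
    """Count data rows, assuming one point per non-empty line and no header."""
    return sum(1 for line in csv_text.splitlines() if line.strip())

def sgd_max_iterations(num_points, passes=1):
    """Value for --max_iterations so that SGD visits each point `passes` times,
    given that one SGD iteration processes a single point."""
    return num_points * passes

sample = "1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"   # hypothetical dataset
n = count_points(sample)
one_pass = sgd_max_iterations(n)
five_passes = sgd_max_iterations(n, passes=5)
```

For this three-point sample, one pass corresponds to 3 iterations and five passes to 15.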

Optionally, the model can be used to predict the responses for another matrix of data points, if ’--test_file (-T)’ is specified. The ’--test_file (-T)’ parameter can be specified without the ’--training_file (-t)’ parameter, so long as an existing logistic regression model is given with the ’--input_model_file (-m)’ parameter. The output predictions from the logistic regression model may be saved with the ’--output_file (-o)’ parameter.
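The step from class probabilities to predicted labels follows the rule given for the ’--decision_boundary (-d)’ option: a point whose logistic function value is less than the boundary is assigned class 0, otherwise class 1. A minimal Python sketch of that rule, with a hypothetical function name and sample probabilities:

```python
def classify(probabilities, decision_boundary=0.5):
    """Assign class 0 when the probability is below the boundary, else class 1."""
    return [0 if p < decision_boundary else 1 for p in probabilities]

probs = [0.10, 0.50, 0.73]              # hypothetical logistic function values
labels_default = classify(probs)        # boundary 0.5
labels_strict = classify(probs, 0.8)    # a higher boundary yields fewer 1s
```

A probability exactly equal to the boundary is classified as 1, since only values strictly below the boundary map to 0.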

This implementation of logistic regression supports only the two-class case, not the general multi-class case. Any labels must be either 0 or 1. For more classes, see the softmax_regression program.

As an example, to train a logistic regression model on the data ’data.csv’ with labels ’labels.csv’ with L2 regularization of 0.1, saving the model to ’lr_model.bin’, the following command may be used:

$ mlpack_logistic_regression --training_file data.csv --labels_file labels.csv --lambda 0.1 --output_model_file lr_model.bin

Then, to use that model to predict classes for the dataset ’test.csv’, storing the output predictions in ’predictions.csv’, the following command may be used:

$ mlpack_logistic_regression --input_model_file lr_model.bin --test_file test.csv --output_file predictions.csv

OPTIONAL INPUT OPTIONS

--batch_size (-b) [int]

Batch size for SGD. Default value 64.

--decision_boundary (-d) [double]

Decision boundary for prediction; if the logistic function for a point is less than the boundary, the class is taken to be 0; otherwise, the class is 1. Default value 0.5.

--help (-h) [bool]

Default help info.

--info [string]

Get help on a specific module or option. Default value ’’.

--input_model_file (-m) [string]

Existing model (parameters). Default value ’’.

--labels_file (-l) [string]

A matrix containing labels (0 or 1) for the points in the training set (y). Default value ’’.

--lambda (-L) [double]

L2-regularization parameter for training. Default value 0.

--max_iterations (-n) [int]

Maximum iterations for optimizer (0 indicates no limit). Default value 10000.

--optimizer (-O) [string]

Optimizer to use for training (’lbfgs’ or ’sgd’). Default value ’lbfgs’.

--step_size (-s) [double]

Step size for SGD optimizer. Default value 0.01.

--test_file (-T) [string]

Matrix containing test dataset. Default value ’’.

--tolerance (-e) [double]

Convergence tolerance for optimizer. Default value 1e-10.

--training_file (-t) [string]

A matrix containing the training set (the matrix of predictors, X). Default value ’’.

--verbose (-v) [bool]

Display informational messages and the full list of parameters and timers at the end of execution.

--version (-V) [bool]

Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

--output_file (-o) [string]

If --test_file is specified, this matrix is where the predictions for the test set will be saved. Default value ’’.

--output_model_file (-M) [string]

Output for trained logistic regression model. Default value ’’.

--output_probabilities_file (-p) [string]

If --test_file is specified, this matrix is where the class probabilities for the test set will be saved. Default value ’’.

ADDITIONAL INFORMATION

For further information, including relevant papers, citations, and theory, consult the documentation found at http://www.mlpack.org or included with your distribution of mlpack.