mlpack

🔗 LMNN

The LMNN class implements large margin nearest neighbor, which can be used as both a linear dimensionality reduction technique and a distance learning technique (also called metric learning). LMNN finds a linear transformation of the dataset that improves k-nearest-neighbor classification performance.

Simple usage example:

// Learn a distance metric that improves kNN classification performance.

// All data and labels are uniform random; 10-dimensional data, 5 classes.
// Replace with a data::Load() call or similar for a real application.
arma::mat dataset(10, 1000, arma::fill::randu); // 1000 points.
arma::Row<size_t> labels =
    arma::randi<arma::Row<size_t>>(1000, arma::distr_param(0, 4));

mlpack::LMNN lmnn(3 /* neighbors to consider */); // Step 1: create object.
arma::mat distance;
lmnn.LearnDistance(dataset, labels, distance);    // Step 2: learn distance.

// `distance` can now be used as a transformation matrix for the data.
arma::mat transformedData = distance * dataset;
// Or, you can create a MahalanobisDistance to evaluate points in the
// transformed dataset space.
arma::mat q = distance.t() * distance;
mlpack::MahalanobisDistance d(std::move(q));

std::cout << "Distance between points 0 and 1:" << std::endl;
std::cout << " - Before LMNN: "
    << mlpack::EuclideanDistance::Evaluate(dataset.col(0), dataset.col(1))
    << "." << std::endl;
std::cout << " - After LMNN:  "
    << d.Evaluate(dataset.col(0), dataset.col(1)) << "." << std::endl;

🔗 Constructors



Notes:


🔗 Learning Distances

Once an LMNN object has been created, the LearnDistance() method can be used to learn a distance.

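To use distance, either transform the data directly with distance * dataset, or build a MahalanobisDistance from distance.t() * distance and use it to evaluate distances between points of the original dataset.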

See the examples section for more details.
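
The sketch below checks, on plain std::vector data, that the Euclidean distance between two points after multiplying by a transformation L matches the square root of the quadratic form with Q = L^T L. This is the relationship the usage example relies on when it builds a MahalanobisDistance from distance.t() * distance. The Vec and Mat aliases and all three functions are hypothetical stand-ins written for this illustration, not mlpack or Armadillo API, and the sketch assumes the Mahalanobis distance is evaluated with the square root taken.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for this sketch only (not mlpack or Armadillo types).
using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // Row-major: m[i][j] is row i, column j.

// Compute the matrix-vector product L * x.
Vec Apply(const Mat& l, const Vec& x)
{
  Vec out(l.size(), 0.0);
  for (std::size_t i = 0; i < l.size(); ++i)
  {
    assert(l[i].size() == x.size());
    for (std::size_t j = 0; j < x.size(); ++j)
      out[i] += l[i][j] * x[j];
  }
  return out;
}

// Option 1: Euclidean distance between the transformed points L * x and L * y.
double TransformedDistance(const Mat& l, const Vec& x, const Vec& y)
{
  const Vec lx = Apply(l, x);
  const Vec ly = Apply(l, y);
  double sum = 0.0;
  for (std::size_t i = 0; i < lx.size(); ++i)
    sum += (lx[i] - ly[i]) * (lx[i] - ly[i]);
  return std::sqrt(sum);
}

// Option 2: sqrt((x - y)^T Q (x - y)) on the original points, with Q = L^T L.
double QuadraticFormDistance(const Mat& l, const Vec& x, const Vec& y)
{
  assert(x.size() == y.size());
  const std::size_t d = x.size();

  // Build Q = L^T * L explicitly.
  Mat q(d, Vec(d, 0.0));
  for (std::size_t r = 0; r < l.size(); ++r)
  {
    assert(l[r].size() == d);
    for (std::size_t i = 0; i < d; ++i)
      for (std::size_t j = 0; j < d; ++j)
        q[i][j] += l[r][i] * l[r][j];
  }

  // Evaluate the quadratic form on the difference vector.
  double sum = 0.0;
  for (std::size_t i = 0; i < d; ++i)
    for (std::size_t j = 0; j < d; ++j)
      sum += (x[i] - y[i]) * q[i][j] * (x[j] - y[j]);
  return std::sqrt(sum);
}
```

Calling both functions with the same L, x, and y should give the same value up to floating-point rounding, which is why either form can be used downstream.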

LearnDistance() Parameters:

| name | type | description | default |
|------|------|-------------|---------|
| data | arma::mat | Column-major training matrix. | (N/A) |
| labels | arma::Row<size_t> | Training labels, between 0 and numClasses - 1 (inclusive). Should have length data.n_cols. | (N/A) |
| distance | arma::mat | Output matrix to store transformation matrix representing learned distance. | (N/A) |
| optimizer | any ensmallen optimizer | Instantiated ensmallen optimizer for differentiable functions or differentiable separable functions. | ens::AMSGrad() |
| callbacks... | any set of ensmallen callbacks | Optional callbacks for the ensmallen optimizer, such as ens::ProgressBar(), ens::Report(), or others. | (N/A) |

Note: any matrix type can be used for data and distance, so long as that type implements the Armadillo API. For example, arma::fmat can be used.
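
Because labels must lie between 0 and numClasses - 1, raw labels that use other values (for example 1, 3, 7) need to be remapped before calling LearnDistance(). The following is a minimal sketch of such a remapping on plain standard-library containers. RemapLabels is a hypothetical helper written for this illustration and is not part of mlpack; in a real application the result would still need to be copied into an arma::Row<size_t>.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper for illustration; not part of mlpack.
// Maps arbitrary integer labels to 0, 1, ..., numClasses - 1, ordered by
// increasing raw label value, and returns the number of distinct classes.
std::size_t RemapLabels(const std::vector<int>& raw,
                        std::vector<std::size_t>& labels)
{
  // Collect the distinct raw labels in sorted order.
  std::map<int, std::size_t> mapping;
  for (const int r : raw)
    mapping.emplace(r, 0);

  // Assign consecutive indices to the sorted raw labels.
  std::size_t next = 0;
  for (auto& entry : mapping)
    entry.second = next++;

  // Rewrite every label using the mapping.
  labels.clear();
  labels.reserve(raw.size());
  for (const int r : raw)
    labels.push_back(mapping.at(r));

  assert(labels.size() == raw.size());
  return mapping.size();
}
```

The returned class count is also the value to pass wherever a numClasses argument is expected.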

🔗 Other Functionality

🔗 Simple Examples

Learn a distance metric to improve classification performance on the satellite dataset, and compare the performance of NaiveBayesClassifier with and without the learned transformation.

// See https://datasets.mlpack.org/satellite.test.csv.
// (We are using the test set here just because it is a little smaller and
// we want this example to run quickly.)
arma::mat dataset;
mlpack::data::Load("satellite.test.csv", dataset, true);
// See https://datasets.mlpack.org/satellite.test.labels.csv.
arma::Row<size_t> labels;
mlpack::data::Load("satellite.test.labels.csv", labels, true);

// Create an LMNN object using 5 nearest neighbors and learn a distance.
arma::mat distance;
mlpack::LMNN lmnn(5);
lmnn.LearnDistance(dataset, labels, distance);

// The distance matrix has size equal to the dimensionality of the data.
std::cout << "Learned distance size: " << distance.n_rows << " x "
    << distance.n_cols << "." << std::endl;

// Learn a NaiveBayesClassifier model on the data and print the performance.
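// The number of classes is inferred from the labels, which are assumed to lie
// between 0 and numClasses - 1.
mlpack::NaiveBayesClassifier nbc1(dataset, labels, arma::max(labels) + 1);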
arma::Row<size_t> predictions;
nbc1.Classify(dataset, predictions);
std::cout << "Naive Bayes Classifier without LMNN: "
    << arma::accu(labels == predictions) << " of " << labels.n_elem
    << " correct." << std::endl;

// Now transform the data and learn another NaiveBayesClassifier.
arma::mat transformedDataset = distance * dataset;
mlpack::NaiveBayesClassifier nbc2(transformedDataset, labels,
    arma::max(labels) + 1);
nbc2.Classify(transformedDataset, predictions);
std::cout << "Naive Bayes Classifier with LMNN:    "
    << arma::accu(labels == predictions) << " of " << labels.n_elem
    << " correct." << std::endl;

Learn a distance metric on the vehicle dataset, using 32-bit floating point to represent the data and metric.

// See https://datasets.mlpack.org/vehicle.csv.
arma::fmat dataset;
mlpack::data::Load("vehicle.csv", dataset, true);

// The labels are contained as the last row of the dataset.
arma::Row<size_t> labels =
    arma::conv_to<arma::Row<size_t>>::from(dataset.row(dataset.n_rows - 1));
dataset.shed_row(dataset.n_rows - 1);

// Create an LMNN object with k=1 and learn distance on float32 data.
// Set updateInterval to a large value (100) because we are using the default
// AMSGrad optimizer (which will take very many small steps).
arma::fmat distance;
mlpack::LMNN lmnn(1, 0.5, 100);

lmnn.LearnDistance(dataset, labels, distance, ens::ProgressBar());

// We want to compute six quantities:
//
//  - Average distance to points of the same class before LMNN.
//  - Average distance to points of the same class after LMNN, using
//    MahalanobisDistance.
//  - Average distance to points of the same class after LMNN, using the
//    transformed dataset.
//
//  - The same three quantities above, but for points of other classes.
//
// LMNN should reduce the average distance to points in the same class, while
// increasing the average distance to points in other classes.
float distSums[6] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
size_t sameCount = 0;
arma::fmat q = distance.t() * distance;
mlpack::MahalanobisDistance md(std::move(q));
arma::fmat transformedDataset = distance * dataset;
for (size_t i = 1; i < dataset.n_cols; ++i)
{
  const double d1 = mlpack::EuclideanDistance::Evaluate(
      dataset.col(0), dataset.col(i));
  const double d2 = md.Evaluate(dataset.col(0), dataset.col(i));
  const double d3 = mlpack::EuclideanDistance::Evaluate(
      transformedDataset.col(0), transformedDataset.col(i));

  // Determine whether the point has the same label as point 0.
  if (labels[i] == labels[0])
  {
    distSums[0] += d1;
    distSums[1] += d2;
    distSums[2] += d3;
    ++sameCount;
  }
  else
  {
    distSums[3] += d1;
    distSums[4] += d2;
    distSums[5] += d3;
  }
}

// Turn the results into average distances across the class.
distSums[0] /= sameCount;
distSums[1] /= sameCount;
distSums[2] /= sameCount;
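// Point 0 itself is excluded from the loop above, so only dataset.n_cols - 1
// points were counted in total.
distSums[3] /= (dataset.n_cols - 1 - sameCount);
distSums[4] /= (dataset.n_cols - 1 - sameCount);
distSums[5] /= (dataset.n_cols - 1 - sameCount);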

// Print the results.
std::cout << "Average distance between point 0 and other points of the same "
    << "class:" << std::endl;
std::cout << " - Before LMNN:                           " << distSums[0] << "."
    << std::endl;
std::cout << " - After LMNN (with MahalanobisDistance): " << distSums[1] << "."
    << std::endl;
std::cout << " - After LMNN (with transformed dataset): " << distSums[2] << "."
    << std::endl;
std::cout << std::endl;

std::cout << "Average distance between point 0 and points of other classes: "
    << std::endl;
std::cout << " - Before LMNN:                           " << distSums[3] << "."
    << std::endl;
std::cout << " - After LMNN (with MahalanobisDistance): " << distSums[4] << "."
    << std::endl;
std::cout << " - After LMNN (with transformed dataset): " << distSums[5] << "."
    << std::endl;
std::cout << std::endl;

std::cout << "Ratio of other-class to same-class distances:" << std::endl;
std::cout << "(We expect this to go up.)" << std::endl;
std::cout << " - Before LMNN: " << (distSums[3] / distSums[0]) << "."
    << std::endl;
std::cout << " - After LMNN:  " << (distSums[5] / distSums[2]) << "."
    << std::endl;

Learn a distance metric on the iris dataset, using the L-BFGS optimizer with callbacks.

// See https://datasets.mlpack.org/iris.csv.
arma::mat dataset;
mlpack::data::Load("iris.csv", dataset, true);
// See https://datasets.mlpack.org/iris.labels.csv.
arma::Row<size_t> labels;
mlpack::data::Load("iris.labels.csv", labels, true);

// Learn a distance with ensmallen's L-BFGS optimizer.
ens::L_BFGS lbfgs;
lbfgs.NumBasis() = 5;
lbfgs.MaxIterations() = 1000;

// Use 5 neighbors for LMNN, and leave updateInterval at the default of 1,
// because we are using L-BFGS (a full-batch optimizer).
mlpack::LMNN lmnn(5);

// Use a callback that prints a final optimization report.
arma::mat distance;
lmnn.LearnDistance(dataset, labels, distance, lbfgs, ens::Report());

Learn a distance metric on the vehicle dataset, but instead of using the Euclidean distance as the underlying metric, use the Manhattan distance. This means that LMNN is optimizing k-NN performance under the Manhattan distance, not under the Euclidean distance.

// See https://datasets.mlpack.org/vehicle.csv.
arma::mat dataset;
mlpack::data::Load("vehicle.csv", dataset, true);

// The labels are contained as the last row of the dataset.
arma::Row<size_t> labels =
    arma::conv_to<arma::Row<size_t>>::from(dataset.row(dataset.n_rows - 1));
dataset.shed_row(dataset.n_rows - 1);

// Create the LMNN object and optimize.  Use k=3 and Nesterov momentum SGD,
// printing a progress bar during optimization.  Because Nesterov momentum SGD
// is an ensmallen optimizer for differentiable separable functions, we increase
// updateInterval to reduce the number of neighbor recomputations.  We also set
// the regularization parameter to 1.0 to increase the penalty for nearby
// neighbors of a different class.
mlpack::LMNN<mlpack::ManhattanDistance> lmnn(3, 1.0, 100);
arma::mat distance;
ens::NesterovMomentumSGD opt(0.000001 /* step size */,
                             32 /* batch size */,
                             20 * dataset.n_cols /* 20 epochs */);
lmnn.LearnDistance(dataset, labels, distance, opt, ens::ProgressBar());

// Now inspect distances between points with the Manhattan distance, before and
// after applying the learned transformation.
arma::mat transformedDataset = distance * dataset;

// Points 0 and 1 have the same label (0).  See their original Manhattan
// distance and their Manhattan distance after the transformation.  We expect
// these points to get closer together in the Manhattan distance.
const double d1 = mlpack::ManhattanDistance::Evaluate(
    dataset.col(0), dataset.col(1));
const double d2 = mlpack::ManhattanDistance::Evaluate(
    transformedDataset.col(0), transformedDataset.col(1));

std::cout << "Distance between points 0 and 1 (same class):" << std::endl;
std::cout << " - Manhattan distance:" << std::endl;
std::cout << "   * Before LMNN: " << d1 << std::endl;
std::cout << "   * After LMNN:  " << d2 << std::endl;
std::cout << std::endl;

// Point 3 has a different label.  We therefore expect this point to get further
// from point 0 with the Manhattan distance, but not necessarily with the
// Euclidean distance.
const double d3 = mlpack::ManhattanDistance::Evaluate(
    dataset.col(0), dataset.col(3));
const double d4 = mlpack::ManhattanDistance::Evaluate(
    transformedDataset.col(0), transformedDataset.col(3));

std::cout << "Distance between points 0 and 3 (different class):" << std::endl;
std::cout << " - Manhattan distance:" << std::endl;
std::cout << "   * Before LMNN: " << d3 << std::endl;
std::cout << "   * After LMNN:  " << d4 << std::endl;

// Note that point 3 has been moved further away from point 0 than point 1.
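
The choice of underlying metric matters because the nearest neighbors of a point can differ between metrics, and LMNN optimizes with respect to those neighbors. The sketch below illustrates this on a tiny hand-made example using only the standard library. Point, Euclidean, Manhattan, and Nearest are hypothetical names defined for this illustration and are not mlpack API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a data point in this sketch.
using Point = std::vector<double>;

// Euclidean (L2) distance between two points of equal dimension.
double Euclidean(const Point& a, const Point& b)
{
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(sum);
}

// Manhattan (L1) distance between two points of equal dimension.
double Manhattan(const Point& a, const Point& b)
{
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += std::abs(a[i] - b[i]);
  return sum;
}

// Index of the candidate closest to `query` under the given distance function.
template<typename DistanceFn>
std::size_t Nearest(const Point& query,
                    const std::vector<Point>& candidates,
                    DistanceFn distance)
{
  assert(!candidates.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < candidates.size(); ++i)
  {
    if (distance(query, candidates[i]) < distance(query, candidates[best]))
      best = i;
  }
  return best;
}
```

With the query (0, 0) and candidates (3, 3) and (0, 5), the Euclidean nearest neighbor is (3, 3), while the Manhattan nearest neighbor is (0, 5). A transformation tuned for one metric is therefore not guaranteed to help under the other.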

Learn a distance metric while also performing dimensionality reduction, reducing the dimensionality of the satellite dataset by 3 dimensions.

// See https://datasets.mlpack.org/satellite.train.csv.
arma::mat dataset;
mlpack::data::Load("satellite.train.csv", dataset, true);
// See https://datasets.mlpack.org/satellite.train.labels.csv.
arma::Row<size_t> labels;
mlpack::data::Load("satellite.train.labels.csv", labels, true);

// Use a random initialization for the distance transformation, with the
// specified output dimensionality.
arma::mat distance(dataset.n_rows - 3, dataset.n_rows, arma::fill::randu);
mlpack::LMNN lmnn(3);
ens::L_BFGS opt;
opt.MaxIterations() = 10; // You may want more in a real application.
lmnn.LearnDistance(dataset, labels, distance, opt, ens::Report());

// Now transform the dataset.
arma::mat transformedData = distance * dataset;

std::cout << "Original data has size " << dataset.n_rows << " x "
    << dataset.n_cols << "." << std::endl;
std::cout << "Transformed data has size " << transformedData.n_rows << " x "
    << transformedData.n_cols << "." << std::endl;
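
To see why a rectangular transformation reduces dimensionality, note that a matrix with fewer rows than columns maps each input point to a point with as many entries as the matrix has rows. The sketch below shows this with plain standard-library containers. Point, Matrix, and Project are hypothetical names used only for this illustration, and each point is stored as its own vector rather than as a column of an Armadillo matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for this sketch only.
using Point = std::vector<double>;
using Matrix = std::vector<std::vector<double>>; // Row-major.

// Apply the transformation `l` to every point.  If `l` has r rows and d
// columns, each d-dimensional input point becomes an r-dimensional point.
std::vector<Point> Project(const Matrix& l, const std::vector<Point>& points)
{
  std::vector<Point> out;
  out.reserve(points.size());
  for (const Point& p : points)
  {
    Point projected(l.size(), 0.0);
    for (std::size_t i = 0; i < l.size(); ++i)
    {
      assert(l[i].size() == p.size());
      for (std::size_t j = 0; j < p.size(); ++j)
        projected[i] += l[i][j] * p[j];
    }
    out.push_back(projected);
  }
  return out;
}
```

For instance, a 2 x 3 matrix applied to 3-dimensional points yields 2-dimensional points, while the number of points is unchanged.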