mlpack

DatasetMapper tutorial

DatasetMapper is a class that holds information about a dataset. It is useful when a dataset contains categorical, non-numeric features that must be mapped to numeric values. A simple example is the following dataset:

7,5,True,3
6,3,False,4
4,8,False,2
9,3,True,3

The above dataset will be represented as

7,5,0,3
6,3,1,4
4,8,1,2
9,3,0,3

Here the mappings are: True is mapped to 0, and False is mapped to 1.

Note: DatasetMapper assigns numeric values to non-numeric tokens in the order in which it encounters them in the dataset. Here True is mapped to 0 because it appears before False; if False had appeared first, it would have been mapped to 0 instead. These values 0 and 1 are not to be confused with C++ bool values: they are mappings created by mlpack::DatasetMapper.
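The encounter-order behavior described in the note can be illustrated with a small standalone sketch. This is not mlpack code: MapInEncounterOrder is a hypothetical helper that uses only the C++ standard library, written under the assumption that each new token simply receives the next unused integer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of mlpack): assigns each distinct token the
// next unused integer, in the order the tokens are first encountered.
std::vector<std::size_t> MapInEncounterOrder(
    const std::vector<std::string>& tokens)
{
  std::unordered_map<std::string, std::size_t> seen;
  std::vector<std::size_t> codes;
  codes.reserve(tokens.size());

  for (const std::string& token : tokens)
  {
    // emplace() inserts only if the token is new; the candidate code is the
    // number of distinct tokens seen so far. For a known token, the existing
    // entry is returned unchanged.
    const auto result = seen.emplace(token, seen.size());
    codes.push_back(result.first->second);
  }

  // Every input token receives exactly one code.
  assert(codes.size() == tokens.size());
  return codes;
}
```

Applied to the third column of the example (True, False, False, True), this sketch yields 0, 1, 1, 0. If the column started with False instead, False would receive 0 and True would receive 1.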

DatasetMapper provides an easy API to load such data and stores all the necessary information about the dataset.

Loading data

To use DatasetMapper, call the overload of the data::Load() function that takes a data::DatasetInfo object (data::DatasetInfo is mlpack's alias for the default DatasetMapper type).

using namespace mlpack;

arma::mat data;
data::DatasetInfo info;
data::Load("dataset.csv", data, info);

Dataset:

7, 5, True, 3
6, 3, False, 4
4, 8, False, 2
9, 3, True, 3

Dimensionality

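A DatasetMapper object can be initialized in two ways: by constructing it and setting each property yourself, or by passing it to data::Load(), which populates it from the file. With the second approach, used above, the dimensionality matches that of the data file.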

std::cout << info.Dimensionality() << "\n";

This prints:

4

Type of each dimension

Each dimension is one of two types: data::Datatype::numeric or data::Datatype::categorical.

The function Type(size_t dimension) takes an argument dimension, which is the row number whose type you want to know.

It returns a data::Datatype enum value, which is converted to an integer when printed with std::cout (0 for numeric, 1 for categorical).

std::cout << info.Type(0) << "\n";
std::cout << info.Type(1) << "\n";
std::cout << info.Type(2) << "\n";
std::cout << info.Type(3) << "\n";

This produces:

0
0
1
0
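One way to picture how a dimension ends up numeric or categorical is the standalone sketch below. It is not mlpack's implementation: SketchDatatype, IsNumericToken, and InferType are hypothetical names, and the rule used here (a dimension is categorical as soon as one of its tokens fails to parse as a number) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for data::Datatype, using the same printed values
// as the output above (0 for numeric, 1 for categorical).
enum class SketchDatatype { numeric = 0, categorical = 1 };

// Returns true if the whole token parses as a floating-point number.
bool IsNumericToken(const std::string& token)
{
  if (token.empty())
    return false;

  char* end = nullptr;
  std::strtod(token.c_str(), &end);
  // strtod() stops at the first character it cannot use; the token is
  // numeric only if that stopping point is the end of the string.
  return end == token.c_str() + token.size();
}

// Assumed rule: one non-numeric token makes the whole dimension categorical.
SketchDatatype InferType(const std::vector<std::string>& tokens)
{
  // A dimension with no tokens has no type to infer.
  assert(!tokens.empty());

  for (const std::string& token : tokens)
  {
    if (!IsNumericToken(token))
      return SketchDatatype::categorical;
  }
  return SketchDatatype::numeric;
}
```

Under this sketch, the tokens 7, 6, 4, 9 of the first dimension are inferred as numeric, while True, False, False, True of the third dimension are inferred as categorical.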

Number of mappings

If the type of a dimension is data::Datatype::categorical, then during loading, each unique token in that dimension will be mapped to an integer starting with 0.

NumMappings(size_t dimension) takes a dimension as its argument and returns the number of mappings in that dimension. If the dimension is numeric, or there are no mappings, it returns 0.

std::cout << info.NumMappings(0) << "\n";
std::cout << info.NumMappings(1) << "\n";
std::cout << info.NumMappings(2) << "\n";
std::cout << info.NumMappings(3) << "\n";

This will print:

0
0
2
0

Checking mappings

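There are two ways to check the mappings: UnmapString(), which returns the original string for a mapped value, and UnmapValue(), which returns the mapped value for a string.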

UnmapString()

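The UnmapString() function has the full signature UnmapString(int value, size_t dimension, size_t unmappingIndex = 0UL). Here value is the integer whose original string you want, and dimension is the dimension in which to look up the mapping.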

std::cout << info.UnmapString(0, 2) << "\n";
std::cout << info.UnmapString(1, 2) << "\n";

This will print:

True
False

UnmapValue()

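The UnmapValue() function has the signature UnmapValue(const std::string &input, size_t dimension). Here input is the string whose mapped value you want, and dimension is the dimension in which to look up the mapping.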

std::cout << info.UnmapValue("True", 2) << "\n";
std::cout << info.UnmapValue("False", 2) << "\n";

This will produce:

0
1
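The two lookup directions above can be pictured as a pair of containers kept in sync for one dimension. The sketch below is a simplified illustration using only the standard library, not mlpack's implementation: SketchDimensionMapper is a hypothetical class that handles a single dimension, and its method names merely mirror the DatasetMapper functions discussed in this tutorial.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical single-dimension mapper (not part of mlpack).
class SketchDimensionMapper
{
 public:
  // Returns the existing value for a token, or assigns the next unused one.
  std::size_t MapString(const std::string& token)
  {
    const auto it = forward.find(token);
    if (it != forward.end())
      return it->second;

    const std::size_t value = reverse.size();
    forward.emplace(token, value);
    reverse.push_back(token);
    return value;
  }

  // Mapped value -> original string.
  const std::string& UnmapString(const std::size_t value) const
  {
    assert(value < reverse.size());
    return reverse[value];
  }

  // Original string -> mapped value; throws std::out_of_range if the token
  // was never mapped.
  std::size_t UnmapValue(const std::string& token) const
  {
    return forward.at(token);
  }

  // Number of distinct tokens mapped so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward; // string -> value
  std::vector<std::string> reverse;                     // value -> string
};
```

Feeding this sketch the tokens True, False, False, True in that order gives two mappings, with value 0 unmapping to True and value 1 unmapping to False, matching the outputs shown in this section. Keeping a vector for the reverse direction works here because values are assigned consecutively from 0.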

Further documentation

For further documentation on DatasetMapper and its uses, see the comments in the source code in src/mlpack/core/data/, as well as its uses in the examples repository.