Dataset and Experimentation Tools : Summary

In this blog post I'll describe the contributions I've made to mlpack this summer.


Here is the link to all my pull requests. Below is a list of the major pull requests, with self-explanatory descriptions.

  • Descriptive Statistics command-line program: 742
  • DatasetMapper & Imputer: 694
  • Delete unused string_util: 692
  • Fix default output problem and some styles: 680
  • Binarize function + tests: 666
  • Add Split() without labels: 654
  • Add CLI executable for data_split: 650

Descriptive Statistics

I originally built a class that calculates descriptive statistics, but after a few discussions, I ended up shrinking the functions down to a minimum to provide maximum performance and maintainability. I also squashed all commits into one to discard unnecessary ones.

Sample output on "iris.csv" would be:

[INFO ] dim     var     mean    std     median  min     max     range   skew    kurt    SE      
[INFO ] 0       0.6857  5.8433  0.8281  5.8000  4.3000  7.9000  3.6000  0.3149  -0.5521 0.0676  
[INFO ] 1       0.1880  3.0540  0.4336  3.0000  2.0000  4.4000  2.4000  0.3341  0.2908  0.0354  
[INFO ] 2       3.1132  3.7587  1.7644  4.3500  1.0000  6.9000  5.9000  -0.2745 -1.4019 0.1441  
[INFO ] 3       0.5824  1.1987  0.7632  1.3000  0.1000  2.5000  2.4000  -0.1050 -1.3398 0.0623  

Users can control the width and precision using the -w and -p flags. I verified the output against Excel, and the results match perfectly.

DatasetMapper & Imputer

I renamed DatasetInfo to DatasetMapper, which accepts a MapPolicy template parameter (so it can be used to store different kinds of maps). DatasetMapper still provides backward compatibility via a typedef: using DatasetInfo = DatasetMapper<IncrementPolicy>. IncrementPolicy denotes the original mapping policy, which assigns increasing numbers to different categories, starting from 0.

The Imputer class was also added in this pull request. Imputer likewise accepts a template parameter, ImputationStrategy, so that different strategies can be applied.

Lastly, a command-line program called "mlpack_preprocess_imputer.cpp" was added to mlpack.


Binarize

This is a simple implementation of a binarize function, which transforms the values in a matrix to 0 or 1 according to a threshold. Armadillo already allows umat A = (B > C), but this function has an overload that applies binarization to only one dimension, and it can produce any type of matrix, not just umat.


Split

I added TrainTestSplit() and renamed the old ones to LabelTrainTestSplit(), as discussed in #651. This is just a naive implementation, mostly copied from Tham's work. I believe LabelTrainTestSplit can simply reuse the code in TrainTestSplit twice, once for the data and once for the labels.

I also implemented "mlpack_preprocess_split.cpp".

Other changes

I also made minor contributions in debugging and fixing styles, especially related to data IO.


I wish to keep contributing to mlpack. I will try to polish this work a little more, and I would especially love to contribute to the deep learning modules. I've been reading papers about sequence-to-sequence models, which are widely used for natural language processing and time-series analytics.


I thank the mlpack mentors, and especially Tham, who gave me a lot of advice through code reviews.


Dataset and Experimentation Tools : Week - 10 and 11 Highlights

I've been building preprocess_validate, a CLI executable that prints warnings for possibly invalid values in a dataset. Its output looks like the example below, which is ultimately what we've been trying to achieve from the beginning.

[WARN ] Possibly problematic value at point 1, categorical feature 0 :
5 (numeric value in categorical feature)
[WARN ] Invalid value at point 4, numerical feature 1 : 
[WARN ] Invalid value at point 1, numerical feature 2 : a
[WARN ] Invalid value at point 4, numerical feature 2 : b

It took me longer than expected because I tried several approaches.

1) First, I made a class named Validator that acts similarly to the Imputer class.

2) I found that it is hard to track where the missing values are when there are two or more of them, since every invalid value is turned into "nan". So I tried hacking the DatasetMapper class so that it could store where the invalid values are.

3) The next decision was to make a new validate_policy, which prints warnings as it goes through the tokens of each dimension (though it still has some issues to fix).

4) Lastly, to fundamentally fix the problem, I suggested the approach in #758.

The current maps object of DatasetMapper can be described as map<dimension, pair<bimap<string, MappedType>, numMappings>> (numMappings usually being a numeric primitive type).

I think the process of having multiple map policies can be simplified by having two mapping objects. For validation and imputation purposes, we could have another mapper (I will call it invalidMaps for now), which would look like:

// MapType = map<dimension, pair<bimap<string, MappedType>, numMappings>>;
// InvalidMapType = map<string, std::pair<dimension, point>>;
MapType maps;
InvalidMapType invalidMaps;
size_t numInvalidMappings;

invalidMaps and maps serve two different purposes. maps is used as usual (mapping categorical features to numeric values), while invalidMaps is used as a temporary holder for future imputation. Both coordinates have to be stored in order to track the invalid values, since every invalid value is turned into NaN.

I made commits in this branch to test the idea's usability. The code I am referring to is the "validate_policy" written in this commit. It was made only for testing, so the code still has a lot of room for improvement.

When I run the code on the following dataset using the validate policy:

a, 2, 3
NULL, 6, a
b, 9, 1
a, 2, 3
c, , b

The result matrix produced from the above data by data::Load() becomes:

[INFO ] 3 mappings in dimension 0.
[INFO ] 0 mappings in dimension 1.
[INFO ] 0 mappings in dimension 2.

[DEBUG]             0          nan   1.0000e+00            0   2.0000e+00
[DEBUG]    2.0000e+00   6.0000e+00   9.0000e+00   2.0000e+00          nan
[DEBUG]    3.0000e+00          nan   1.0000e+00   3.0000e+00          nan

The 3 mappings in dimension 0 indicate that it successfully mapped (a->0, b->1, c->2). NULL was not mapped because I set it as one of the user-defined missingValues.

All NaNs are recorded in the invalidMaps object and can later be used for printing errors or for imputation.

I think this is intuitively a good approach, and it can replace all the other mapping policies. This way, we can make mlpack more user-friendly by reducing the number of new concepts users have to learn.


Dataset and Experimentation Tools : Week-9 Highlights

This week, the pull request for DatasetMapper & Imputer was merged. I thank Zoq, rcurtin, and especially Tham for all the feedback. I feel like I gave them more work than I did myself.

DatasetMapper & Imputer

1) I added an Impute() function that applies imputation to all dimensions of the given matrix.

2) I made a program called mlpack_preprocess_check (previously called mlpack_preprocess_verify in this blog). I will make a pull request after adding comments and docs.

Descriptive Statistics

1) After some discussion about how to manage the statistics class, I put it into the preprocess/ folder because it will only be used by the preprocess_describe command-line program. Its sole purpose is to provide a cleaner interface. I might even consider removing the class, because the code grew too large for such a small program. I will make this decision as soon as possible and open a pull request next week.

2) I optimized some functions in the statistics class.

3) Changed the class name from Statistics to DescriptiveStatistics to be more specific.


1) I made a list of the algorithms implemented in mlpack and brought it up to date.


1) I replaced cross_validation's split function with data::Split() inside dt_utils. I will make a pull request for this after a few performance checks.


Dataset and Experimentation Tools : Week-8 Highlights

This week, I:

DatasetMapper & Imputer

1) Optimized Imputer a little bit. The details are discussed in the pull request #694.

2) Debugged and polished some comments.

Descriptive Statistics

1) Made statistics.hpp and statistics_impl.hpp, which are basically a more convenient version of Armadillo's statistics functions, with additional features like skewness and kurtosis. Since they are made for convenience, computational efficiency is hurt a little. I made the results match those given by Excel. My commits are in the describe branch.

2) The first version of the statistics class calculated every statistic in its constructor. The benchmark scores are recorded here.

3) Changed iomanip to boost::format for formatting the output.

I've been studying a little more about how ANNs and RNNs are implemented in mlpack (just personal interest). Deep learning is more fun than I thought; hopefully I can contribute to the neural network parts of mlpack in the future.

Later, I will work a little more on the statistics module, mainly to optimize it further and polish the comments and outputs.

I will also work on the mlpack_preprocess_verify executable, which is a small extension of the Imputer module. This program does not change or replace any values; it only detects invalid ones.


Dataset and Experimentation Tools : Week-7 Highlights

This week, I:

DatasetMapper & Imputer

1) Applied the suggested changes, added more comments, and debugged the DatasetMapper & Imputer pull request.

2) Made an overload of every imputation method that receives only one input matrix as a parameter. The result is overwritten into the input matrix, hopefully providing faster performance.

3) MedianImputation now excludes user-defined missing values and NaNs while calculating the median.

4) A new solution for implementing ListwiseDeletion (suggested by rcurtin) is now used.

Descriptive Statistics

Last week, I said I was going to work on the statistics module. As a result, I made a proof-of-concept in this commit.

I made a class called Statistics and put all the functions inside it. I think the Statistics class may be useful for other things too, so I am considering separating it from the executable and putting it somewhere independent.

A sample run on iris.csv shows results like the following.

[INFO ] Loading 'iris.csv' as CSV data.  Size is 150 x 4.
[INFO ] dim  var       mean     std       median  min  max  range  skewness    kurtosis  SE        
[INFO ] 0    0.681122  5.84333  0.825301  5.8     4.3  7.9  3.6    0.175246    1.12569   0.0673856 
[INFO ] 1    0.186751  3.054    0.432147  3       2    4.4  2.4    0.0266889   0.113048  0.0352846 
[INFO ] 2    3.09242   3.75867  1.75853   4.35    1    6.9  5.9    -1.4776     15.3453   0.143583  
[INFO ] 3    0.578532  1.19867  0.760613  1.3     0.1  2.5  2.4    -0.0457392  0.557191  0.0621038 

The output of this executable is similar to this application.


Dataset and Experimentation Tools : Week-6 Highlights

I continued working on DatasetMapper & Imputer to finalize the pull request last week. The DatasetMapper, Imputer, Policy, and Imputation classes and their tests are all ready for the last review.

The executable is also ready for the final review.

The changes I made are:

1) The Load function can now work with any type of DatasetMapper class. The policy can also be chosen by the user.

2) MissingPolicy now maps user-defined missing variables to NaN.

3) We had a problem with how data::Load maps through the MapToNumerical function. For MissingPolicy to work, the mapping should be done only for the missing variables, not for all variables in the dimension, while IncrementPolicy requires all variables in a dimension to be mapped if even one of them turns out to be categorical (a string). I solved this by moving MapToNumerical from data::Load into the policy classes, so that each policy can decide how to map the tokens. I also renamed the function to MapTokens to be clear.

4) Completed the tests and cleaned up the APIs so that they are more consistent.

This week, I am going to work on the statistics module. It will start as a simple executable application; the features we want to add are somewhat similar to this application.


Dataset and Experimentation Tools : Week-4 Highlights

This week, I worked on restructuring the imputer and imputation methods. Here is a brief summary of what I did.

1) Wrote tests for the imputer and imputation methods.

2) Restructured the imputer and imputation classes. In this new implementation, the imputer works like a wrapper that provides a convenient interface to the imputation classes. The imputation classes can also be used independently if a user wants to replace one numeric value with another. This work took longer than I expected.

I have not made pull requests for the standardization and normalization classes yet, since they are structured like the imputer class. I will be able to make similar changes after getting comments on the imputer class, and then open the pull requests accordingly. (This should be quick.)

I also dropped the one-hot-encoding class I was working on, because I did not see a clear use for it in other mlpack methods.

To-do list:

1) Apply changes to the imputer, imputation classes, and scalers after getting comments.

2) Make an overload of the data::Load function so that it maps missing variables using a different policy.

3) Optimize using OpenMP.

4) Start working on preprocess_scan, a CLI executable which scans through a dataset and finds missing variables or abrupt gaps.

Notice: I already talked about this with my mentors, but I have mandatory military training on June 21, 22, and 23.


Dataset and Experimentation Tools : Week-3 Highlights

Last week, I planned to finalize the missing-variable and imputation strategies. Tham gave me advice and ideas for implementing the Imputer and DatasetMapper classes, so I was able to:

1) Rewrite and finalize the Imputer class, the DatasetMapper class, and the CLI executable that provides imputation methods for missing variables. I modularized the mapping policies and imputation strategies so that they can be used interchangeably.

2) Implement utility functions: one-hot-encoding, standard-scale (standardization), and min-max-scale (normalization).

One of my concerns is that some of the features I have planned are already implemented in the Armadillo library or in mlpack.

I think I have spent more time reading and analyzing code than writing it so far. As a result, I am getting used to the style of mlpack and of C++ in general. Next week, I will:

1) Refine and make pull requests for one-hot-encoding and min-max-scale.

2) Start working on statistical analyzing cli executable.

3) Plan and implement a proof-of-concept for a function that scans through a file and detects faults (it could be used independently or before data::Load). I have to think about how to reuse or modularize the code in data::Load(), since it already has tokenizers.

4) Start worrying about how to treat datetime variables. (As of now, mlpack fails to map variables like "1993.05.12" or "1993/05/12"; it just recognizes the leading "1993" as a number and discards the rest.)


Dataset and Experimentation Tools : Week-2 Highlights

Here are some things I've done in week 2.

1) Fixed the default output problem with this pull request. Previously, when output parameters were not specified by the user, the program saved the results in a file with an arbitrary name, which could overwrite a user's data without warning. I changed the default outputs to required parameters. In cases where output is not strictly necessary, the program now warns the user that it will not save the result unless an output is specified, rather than saving to, or overwriting, a file with a default name.

2) Implemented binarize functions, which transform matrix values to 0 or 1 according to a given threshold. This provides an easy-to-use tool for preprocessing datasets; previously, the user had to learn how to work with Armadillo matrices. It also provides an overload that applies binarization to selected dimensions.

3) I experimented with the proof-of-concept from last week. I thought of changing missing variables to NaNs while mapping the categorical (including missing) data, and then applying various imputation strategies by reverse-mapping the values. But after a few discussions, implementing this while loading seems to be the better idea, since it allows users to specify which values are invalid or missing.

4) Wrote a "How to Install mlpack on Windows 10" tutorial.

5) Discussed and implemented basic one-hot-encoding and min-max-scale functions. These preprocessing features can be used in other methods or projects.

Next week, I am going to (really) finalize the missing-variable and imputation features, one-hot-encoding, and min-max-scale. Along the way, I also hope to solve this issue, which I was unsuccessful at this week because of segmentation faults.