[mlpack] interested in GSoC

Ryan Curtin ryan at ratml.org
Tue Feb 27 15:04:51 EST 2018


On Mon, Feb 26, 2018 at 11:10:24PM +0100, Manos Stergiadis wrote:
> Hello Ryan,
> 
> Over the last few days I have been working through the codebase of mlpack
> after building the project and running the tests. I decided to focus on the
> *String Utilities* project as it is very similar to work I am currently
> involved with in gensim and which I find very interesting. I am working on some
> notes that I plan to use to create my project proposal and I would like to
> share them with you in order to see if I am on the right track.

Hi Manos,

Thanks for getting in touch.  It's important to discuss the high-level
ideas early on because they will make up the substrate on which the
entire project will be built.  If the abstractions we choose aren't
good, then no matter how much work goes into the project it will be
flawed.  So anyway what I am saying is, I think it is great to talk
about this now. :)

> The goal of this project is to enable free text input support for mlpack.
> The obvious issue is that the inner representation of the datasets is an
> arma::Mat, which only supports numeric fields, so we have to
> create an std::string to std::vector<numeric> transformation. This
> transformation could follow any well known algorithm, like One Hot Encoding
> (OHE), TF-IDF, paragraph vectors etc. I can use ideas from gensim since I
> am working on it currently.

Exactly---and another point here is that it's also computationally
inefficient to work on a representation like
std::vector<std::vector<std::string>> or whatever.  It's much better to
work on a direct numeric representation like arma::Mat<numeric type>.

> I found that the approach used in loading categorical fields is somewhat
> similar as it also uses a mapping (DatasetMapper) and a Policy (for example
> IncrementPolicy is very similar to Python's LabelEncoder). The difference
> is that while existing policies produce a 1 to 1 mapping (1 std::string to
> 1 numeric value) we need to define a 1 to many mapping. In the trivial OHE
> approach each string would be mapped to a sparse vector with length equal
> to the number of unique words in the whole dataset for example (probably
> 100k or more). This can be achieved by providing a new version of MapString
> which returns an std::vector<T> instead of T where T is numeric
> (or another container, doesn't have to be vector).

Right, so I think that it is possible to use DatasetMapper to do the
mapping of strings, but it might be a bit tricky and require some
refactoring to support strings in the way we might hope for.  However, I
think this is a good start.

> We should also provide an implementation for DatasetMapper::MapFirstPass in
> order to create the Vocabulary. In the simplest case (no stop-words removal
> or lemmatization) this is just a std::unordered_map from integer to
> std::string, 1 such key-value pair for each unique word in the whole
> dataset. This approach is used (and is very effective) in gensim as well.
> The Datatype Enum should now include a third value, maybe called text
> (besides numeric and categorical). Its type should then change from boolean
> to (short unsigned?) int.

Yeah, I think that is a necessary change (and no problem at all).  I
guess we might need one enum type for each transformation strategy that
is used.

> *API*
> 
> The design principles behind the API should be mainly related to
> user friendliness, which can be achieved by handling text columns in the
> same way that categorical and numerical features are handled (i.e using
> Load variations). We need to provide this functionality in the template
> specialized Load, once for each different
> mapping strategy (which corresponds to a new Policy). For example if we
> define the TFIDF mapping in the class TFIDFPolicy, we will use it in a new
> template specialization for the Load function defined in each data source,
> for example LoadCSV::Load. An implementation signature could look like this:
> 
> ```
> template bool Load<int, TFIDFPolicy>(const std::string&,
>                                          arma::Mat<int>&,
>                                          DatasetMapper<TFIDFPolicy>&,
>                                          const bool,
>                                          const bool);
> ```

I like this a lot, and I think this is the way to go as far as the
lower-level C++ API.  (Possibly not the only way, of course.)

One disadvantage of the current API that might show itself more in this
project is loading sets with an existing DatasetMapper.  A few of the
command-line programs use Load() with a DatasetMapper, and then have to
reuse the DatasetMapper to load the test set properly.  It would likely
be worthwhile to refactor the interface, streamline it, or document it
more fully.  I am not sure what the best route is, but there
is a lot of opportunity to explore ideas. :)

> I would be very grateful if you could take some time to read through this
> and possibly provide some feedback to help me come up with
> as strong a proposal as possible!

Of course, I am happy to try and provide input where I can. :)

Thanks,

Ryan

-- 
Ryan Curtin    | "Hungry."
ryan at ratml.org |   - Sphinx

