Augmented Recurrent Neural Networks: Week 8

Unlike previous weeks, this week was rather quiescent. It didn't feature insane bugs, weird code behavior, mind-boggling compiler messages, or anything else of the kind.

However, it did feature work on "casting" gradient clipping and the cross-entropy performance function to the standard mlpack API. The pull request with the entire discussion: link. As mentioned in the PR discussion, this code is more or less ready to be merged into mlpack/master.
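
As a rough illustration of the gradient clipping operation itself (this is not the API from the PR; the function name and signature are made up, and only the Armadillo call is essential):

```cpp
// Minimal sketch of element-wise gradient clipping with Armadillo.
#include <armadillo>

arma::mat ClipGradient(const arma::mat& gradient,
                       const double minValue,
                       const double maxValue)
{
  // Every component outside [minValue, maxValue] is replaced by the bound.
  return arma::clamp(gradient, minValue, maxValue);
}
```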


Augmented Recurrent Neural Networks: Week 7

This week featured even more fixes for the baseline model. The most notable achievement is that we finally managed to make an LSTM model that is able to completely learn the CopyTask (even for the maximum length and repetition parameters).

To this end, several small (but important) changes had to be implemented:

  • The CrossEntropyError layer - earlier the model was trained with an MSE objective. Cross entropy penalizes mistakes more sharply than MSE (e.g., assigning probability 0 to the true label results in infinite loss with cross entropy, but only +1 loss with MSE); see the numeric sketch after this list.
  • Gradient clipping - to rectify the problem of objective explosion, every gradient component whose absolute value exceeds a fixed bound is replaced with that bound. In math: g' = min(max(g, minValue), maxValue).

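A tiny standalone illustration of the difference between the two losses (plain C++, not code from the mlpack codebase):

```cpp
// Compare the MSE and cross-entropy penalties for a single binary prediction:
// as the predicted probability of the true label approaches 0, MSE saturates
// at 1 while cross entropy grows without bound.
#include <cmath>
#include <iostream>

int main()
{
  const double trueLabel = 1.0;
  const double probs[] = { 0.9, 0.5, 0.1, 1e-6 };
  for (const double p : probs)
  {
    const double mse = (p - trueLabel) * (p - trueLabel);
    const double crossEntropy =
        -(trueLabel * std::log(p) + (1.0 - trueLabel) * std::log(1.0 - p));
    std::cout << "p = " << p << ": MSE = " << mse
              << ", cross-entropy = " << crossEntropy << std::endl;
  }
}
```
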
Also, the model was moved from Adam to MiniBatchSGD - Adam exposed the gradient explosion problem, which resulted in heavy overfitting.

To avoid the overfitting problem completely, the testing scheme was changed. Here's what happens now on every training epoch:

  • The model is trained on some training samples;
  • After that, the model is evaluated on the validation set (we now use three dataset splits, not two);
  • If this epoch gives the best validation score so far, the model is also run on the test set and the test score is recorded.

In other words, we now pick the best model among all the models that were visited during training; the sketch below illustrates the loop.
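
A compact sketch of this scheme, with the model, dataset, and evaluation function left as hypothetical template parameters (none of these names come from mlpack):

```cpp
#include <cstddef>

// Train for maxEpochs epochs; after each epoch evaluate on the validation
// set, and query the test set only when the validation score improves.
// Returns the test score of the best-on-validation model.
template<typename ModelType, typename DatasetType, typename EvaluateFn>
double TrainWithModelSelection(ModelType& model,
                               const DatasetType& train,
                               const DatasetType& validation,
                               const DatasetType& test,
                               EvaluateFn evaluate,
                               const size_t maxEpochs)
{
  double bestValidationScore = 0.0;
  double testScore = 0.0;
  for (size_t epoch = 0; epoch < maxEpochs; ++epoch)
  {
    model.TrainOneEpoch(train);                  // hypothetical training step
    const double valScore = evaluate(model, validation);
    if (valScore > bestValidationScore)
    {
      bestValidationScore = valScore;
      testScore = evaluate(model, test);         // record only for the best model
    }
  }
  return testScore;
}
```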

Also, there is now a stub (non-functional so far) for the highway layer, as proposed in the arXiv paper.

I think the task API & representation problem is now more or less solved, so we are switching to the long-awaited part - the HAM unit implementation. (As I mentioned in the IRC chat, we already have a head start: a ready API for the HAM unit plus an implemented and tested memory structure for it.) Stay tuned!


Augmented Recurrent Neural Networks: Weeks 5 & 6

These two weeks featured a lot of weird bugs. The most memorable ones, in my opinion, are:

  • A missing closing curly brace drove g++ crazy, resulting in an error message 27k lines long (sic!). The link to the make log: gist
  • A copy-paste bug in the mlpack/models repo. Because of it, the LSTM model in all tasks iterated the learning 20 times - even if the epochs parameter was set to another value!

However, there were also several nice code refactorings:

  • Automating the parameter check in mlpack/models. Unlike previous versions, we now use an automated check for parameter validity and error message generation.
  • Templating the RunTask code in mlpack/models. The task type is now a template parameter, unlike previous versions, where a separate function was crafted for every task (see the sketch below).
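
The templating idea, roughly (the real RunTask in mlpack/models has a different signature; ModelType, TaskType, and the Generate/Train calls here are illustrative placeholders):

```cpp
#include <armadillo>

// One runner for every task: the task only has to provide a Generate()
// method and the model only has to provide Train(); neither the runner
// nor the model needs to know which concrete task is being solved.
template<typename TaskType, typename ModelType>
void RunTask(TaskType& task, ModelType& model, const size_t numSamples)
{
  arma::field<arma::mat> predictors, labels;
  task.Generate(predictors, labels, numSamples);  // task-specific data generation
  model.Train(predictors, labels);                // task-agnostic training
}
```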

Also, we keep working on the HAM implementation - for example, there is an open pull request adding HAMUnit, which already features a tested implementation of the tree data structure.

Finally, we have more or less stable experiment results for the three tasks.


Augmented Recurrent Neural Networks: Week 4

Unlike Week 3, this week was more eventful, though it mostly revolved around creating baselines for all three tasks. We finally managed to get clean & working baseline code for CopyTask, AddTask and SortTask - it can be seen in my pull request to mlpack/models.

The results for all three tasks are pending (I hope to add them tomorrow morning). As expected, the copy task was moderately easy for the LSTM model. However, the binary addition task turned out to be much harder - the model didn't quite manage to add 2-bit numbers.

In this situation, a "failure" is, in fact, rather a success - we managed to build a pool of tasks that can check both the sanity of any model (CopyTask) and the superiority of augmented models over vanilla RNN models (AddTask, SortTask).

The results are still pending, so the previous paragraph may be updated if the complete result log disproves any of the findings above.

Also, we have finally begun work on the central point of the project - the Hierarchical Attentive Memory (HAM) unit implementation. For example, we already have some kind of API for it. We are in the process of writing the tree memory structure - the main feature of the HAM unit, which allows it to access data using only O(log(SIZE)) queries. This distinguishes it from other augmented models, which use a "differentiable" one-hot vector for querying the memory, which can be treated as O(SIZE) queries.
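
To give an idea of why the tree gives O(log(SIZE)) access, here is a minimal segment-tree-style sketch (not mlpack's actual tree memory code; the class, its methods, and the join callback are illustrative, and the number of leaves is assumed to be a power of two):

```cpp
#include <armadillo>
#include <cstddef>
#include <functional>
#include <vector>

// Leaves store memory cells; every inner node stores join(left, right).
// Updating one leaf only touches the O(log(SIZE)) nodes on its root path.
class TreeMemorySketch
{
 public:
  TreeMemorySketch(const size_t numLeaves, const size_t cellSize,
      std::function<arma::vec(const arma::vec&, const arma::vec&)> join) :
      numLeaves(numLeaves),
      join(std::move(join)),
      nodes(2 * numLeaves, arma::vec(cellSize, arma::fill::zeros))
  { }

  // Write a new value into leaf `index` and refresh the path to the root.
  void Update(const size_t index, const arma::vec& value)
  {
    size_t node = numLeaves + index;  // leaves occupy indices [n, 2n).
    nodes[node] = value;
    for (node /= 2; node >= 1; node /= 2)
      nodes[node] = join(nodes[2 * node], nodes[2 * node + 1]);
  }

  const arma::vec& Leaf(const size_t index) const { return nodes[numLeaves + index]; }
  const arma::vec& Root() const { return nodes[1]; }

 private:
  size_t numLeaves;
  std::function<arma::vec(const arma::vec&, const arma::vec&)> join;
  std::vector<arma::vec> nodes;  // 1-based heap layout; nodes[0] is unused.
};
```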

Of course, these features haven't really been tested yet, but the main work has begun - so stay tuned for our next weekly report!


Augmented Recurrent Neural Networks: Week 3

Week 3 was rather quiescent - it mostly featured fixing weird bugs from the previous two weeks and refactoring/cleaning the code. Some of the noteworthy changes and observations:

  • We finally managed to feed sequences of vectors to the LSTM network model. The solution was rather simple: stretch the sequence of vectors into one long bit sequence while preserving the needed inputSize. For example, [(0,0),(0,0),(1,0),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1)] should be fed as [0,0,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1] (just dropping the brackets); see the sketch after this list.
  • We also found out why the second representation from my Week 2 report was so inefficient - the model simply couldn't figure out where the sequence ends. The problem is more or less resolved (as a preliminary result, the model scored 100% accuracy for maxLen = 3). Scores for all three problems will be presented in the Week 4 report.
  • Finally, Week 3 saw a massive clean-up of the unit test code - the CopyTask benchmark solution was moved to the models repository. This repo will also feature some real models, such as the long-awaited Hierarchical Attentive Memory unit.
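
The flattening trick from the first bullet, as a one-liner with Armadillo (the function name is made up; this is just the idea, not the code from the repo):

```cpp
#include <armadillo>

// The sequence is stored as a matrix with one column per time step;
// vectorise() concatenates the columns (i.e. the time steps) in order,
// producing the long flat bit sequence shown above.
arma::vec FlattenSequence(const arma::mat& sequence)
{
  return arma::vectorise(sequence);
}
```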

In my opinion, the main result of Week 3 is that we have finally got a clean and lightweight codebase for generating test data and evaluating models on them.


Augmented Recurrent Neural Networks: Week 2

This week I fixed various issues in the task classes - mostly refactoring old code and rewriting it to use native Armadillo features. Also, I managed to write a simple baseline solver for CopyTask (it crashes on AddTask and SortTask, though - I'm planning to fix that bug in Week 3).

LSTM baseline solution

Data representation

There is still an open question about the data representation for CopyTask - which is, incidentally, going to affect the representation choices for the two other tasks. (This question will also be resolved in Week 3, so stay tuned for more updates.) The current options are:

  • Feeding binary sequences to the network as they stand. For example, in this case the model may receive [0,0,1] -> [0,0,1,0,0,1,0,0,1] as the learning example. Although the current implementation of the LSTM solver can handle this representation almost perfectly, it is unclear how to feed it information about the repeat count.
  • Aligning the input and output sequences to equal length, padding the former with zeros on the right and the latter with zeros on the left. For example, in this case the model may receive [0,0,1,0,0,0,0,0,0,0,0,0] -> [0,0,0,0,0,1,0,0,1,0,0,1]. This representation resolves the issue with the repeat-count information and also aligns the input and output sequences, which makes the training procedure easier. In fact, this is the currently used option (sketched after this list), but it has an interesting extension...
  • ..., namely, using 2-dimensional vectors as elements of the input sequence. The second component of each vector stores whether the given element of the sequence is a member of the original sequence (as in a 'standard-compliant' implementation of the CopyTask). For example, the example used in the previous two options transforms into: [(0,0),(0,0),(1,0),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1),(0,1)] -> [0,0,0,0,0,1,0,0,1,0,0,1]. This variant seems to be the most attractive, because it contains all the information necessary for copying in a very convenient form. However, right now there is a problem with feeding sequences of vectors - which will also be resolved in Week 3.
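
For concreteness, here is how the aligned representation (option 2) can be built from a bit sequence and a repeat count; this is an illustrative sketch, not the mlpack/models code, and the function name is made up:

```cpp
#include <armadillo>

// The input is the bit sequence padded with zeros on the right; the output
// is the sequence repeated nRepeats times, padded with zeros on the left,
// so that both have the same total length.
void AlignedRepresentation(const arma::vec& bits, const size_t nRepeats,
                           arma::vec& input, arma::vec& output)
{
  const size_t totalLength = bits.n_elem * (nRepeats + 1);

  input = arma::zeros<arma::vec>(totalLength);
  input.subvec(0, bits.n_elem - 1) = bits;               // pad on the right

  output = arma::zeros<arma::vec>(totalLength);
  output.subvec(bits.n_elem, totalLength - 1) =
      arma::vectorise(arma::repmat(bits, nRepeats, 1));  // pad on the left
}
```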

Benchmarking

Now, briefly, about the running results. For the first two representations (the last one is so far non-functional), the following neural network structure was used:

  • linear layer with 30 hidden units;
  • LSTM layer with 15 output units and a Leaky ReLU nonlinearity (the standard mlpack implementation with alpha = 0.03);
  • linear layer + sigmoid nonlinearity with 1 output unit.

For training the network, mlpack's standard Adam optimizer was used.
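
A rough sketch of this architecture using mlpack's ann module as it looked around that time (exact headers, constructor arguments, and template parameters may differ between mlpack versions, so treat this as illustrative rather than the exact experiment code):

```cpp
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/rnn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>

using namespace mlpack::ann;

// Build and train the baseline; rho is the length of the flattened
// input sequence (e.g. 12 for the aligned example above).
void TrainBaseline(const arma::mat& predictors, const arma::mat& responses,
                   const size_t rho)
{
  RNN<MeanSquaredError<> > model(rho);
  model.Add<Linear<> >(1, 30);      // linear layer with 30 hidden units
  model.Add<LSTM<> >(30, 15, rho);  // LSTM layer with 15 output units
  model.Add<LeakyReLU<> >(0.03);    // leaky ReLU nonlinearity, alpha = 0.03
  model.Add<Linear<> >(15, 1);      // linear layer down to 1 output unit
  model.Add<SigmoidLayer<> >();     // sigmoid output nonlinearity

  // In the experiments, training used mlpack's Adam optimizer; the exact
  // optimizer construction is omitted here since its API changed between versions.
  model.Train(predictors, responses);
}
```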

The results for the aligned representation (rows: sequence length; columns: number of repetitions):

| Sequence length \ Number of repetitions | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.000 (25) | 1.000 (25) | 1.000 (25) | 1.000 (25) | 1.000 (25) | 1.000 (25) | 1.000 (25) | 1.000 (25) | 1.000 (25) |
| 3 | 0.300 (30) | 0.200 (30) | 0.300 (30) | 0.233 (30) | 0.000 (30) | - | - | - | - |
| 4 | 0.114 (35) | 0.229 (35) | 0.171 (35) | - | - | - | - | - | - |
| 5 | 0.025 (40) | 0.100 (40) | - | - | - | - | - | - | - |
| 6 | 0.111 (45) | 0.000 (45) | - | - | - | - | - | - | - |
| 7 | 0.040 (50) | 0.040 (50) | - | - | - | - | - | - | - |
| 8 | 0.127 (55) | - | - | - | - | - | - | - | - |

The 'vanilla' representation is very successful in the nRepeats = 1 case: in this representation, the LSTM passes all the tests from the table above with 100% precision (sic!). I'm planning to find an explanation for this behavior during Week 3.

Next up

As I mentioned above, Week 3 will mostly be about fixing the rather strange bugs that popped up during the implementation of the baseline. However, if they are resolved promptly, we're going to switch to the central part of the project - implementing the Hierarchical Attentive Memory unit.


Augmented Recurrent Neural Networks: Community Bonding & Week 1

This summer, I will be working on the implementation of neural network models augmented with additional memory structures (hence the name). One notable example of such a model is the Neural Turing Machine, which is a straightforward combination of the Turing machine idea and the idea of using "soft" action policies (probabilistic, in contrast to the more traditional "hard" policies, which only choose an action but don't express any degree of uncertainty). It is reported to handle some very simple algorithmic tasks (e.g., copying a sequence) rather well. Of course, there are more advanced models that offer faster convergence and are able to learn in more complex environments.

During the community bonding stage and Week 1, I implemented benchmarking utilities for this kind of model. Specifically, I implemented three basic tasks (classes that generate data for specific problems), namely:

  • CopyTask: given a bit sequence, copy it a required number of times. For example, if the task requires three repetitions, the input/output pair may look like this: 001 -> 001001001 (see the sketch below).
  • SortTask: given a sequence of n-bit numbers, generate a sequence with the same numbers, but ordered from the smallest to the biggest. For example (n=3): [000, 111, 101, 100, 011] -> [000, 011, 100, 101, 111].
  • AddTask: given a sequence representing two binary numbers (separated with a special symbol), generate a sequence representing their sum. For example, 0010+1111 -> 10001, where + is some special value used to separate two numbers.

The grand total and the preceding discussion can be seen here.
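
To make the CopyTask definition concrete, here is an illustrative generator for a single sample (this is not the mlpack/models CopyTask class; the function name and signature are made up):

```cpp
#include <armadillo>

// Generate one CopyTask example: a random bit sequence of the given length
// as the input, and the same sequence repeated nRepeats times as the output.
void GenerateCopySample(const size_t length, const size_t nRepeats,
                        arma::vec& input, arma::vec& output)
{
  input = arma::round(arma::randu<arma::vec>(length));  // random bits
  output = arma::repmat(input, nRepeats, 1);            // input copied nRepeats times
}
```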

Next up: during Week 2, the LSTM & GRU baselines will be implemented. The implementation and performance test results will be reported here next Wednesday.
