Deep Reinforcement Learning Methods - Week-6 Highlights

This week I continued working on async one-step Q-learning. The major challenge this week was making the test case pass in Travis CI, which turned out to be quite tricky. I tuned the network architecture and the hyper-parameters: the test takes only 2s on my Mac but still almost 10 minutes on the server, even after I reduced the number of workers to 4. So I had to try a pre-trained network. Strangely, if I set the number of workers to 0 and only run the test process with the pre-trained, converged network, it fails (this only happens on the server). So I had to set the number of workers to 1, although I don't know why that works. The TrainingConfig class is quite useful for passing in hyper-parameters; however, it doesn't conform to the newest mlpack API style. But I assume the RL methods won't interact with the CrossValidation helper, so I guess the newest API style won't influence my project much. The PR for async one-step Q-learning is almost ready; hopefully it can be merged within 2 days.
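To give an idea of how a hyper-parameter container like this can look, here is a minimal sketch in the spirit of the TrainingConfig class mentioned above, using mlpack's usual const-getter / reference-setter accessor convention. The specific fields and defaults are illustrative assumptions, not the actual class.

```cpp
#include <cstddef>

// Illustrative sketch of a TrainingConfig-style hyper-parameter holder.
// Field names and defaults are guesses for demonstration only.
class TrainingConfig
{
 public:
  TrainingConfig() : numWorkers(1), stepLimit(200), discount(0.99) { }

  //! Get the number of parallel worker agents.
  std::size_t NumWorkers() const { return numWorkers; }
  //! Modify the number of parallel worker agents.
  std::size_t& NumWorkers() { return numWorkers; }

  //! Get the maximum number of steps per episode.
  std::size_t StepLimit() const { return stepLimit; }
  //! Modify the maximum number of steps per episode.
  std::size_t& StepLimit() { return stepLimit; }

  //! Get the discount factor gamma.
  double Discount() const { return discount; }
  //! Modify the discount factor gamma.
  double& Discount() { return discount; }

 private:
  std::size_t numWorkers;
  std::size_t stepLimit;
  double discount;
};
```

A learner can then take one `TrainingConfig` object instead of a long list of loose hyper-parameter arguments, which is what makes this pattern convenient.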


Deep Reinforcement Learning Methods - Week-5 Highlights

This week I started working on implementing async deep RL methods, beginning with async one-step Q-learning. I had an exhaustive discussion with Marcus about the best design pattern and finally settled on policy-based design. I succeeded in implementing async one-step Q-learning with OpenMP, and I believe my framework can be easily extended to async n-step Q-learning and async one-step Sarsa. However, for A3C, extra effort will be needed. I tested my implementation in the CartPole domain, where it works well and is very stable. However, it doesn't work well in Travis CI. My understanding is that the Travis CI machine doesn't perform as well as my laptop, and this test needs 18 threads simultaneously; if the machine parallelizes poorly, the agent's performance is badly hurt. I'm still looking for a solution to this.
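The core idea of policy-based design here is that the learner is a class template composed from interchangeable policy types, so the variants share one worker loop and only the update rule differs. The following toy sketch is my own illustration of that pattern (all names and the toy environment are assumptions, not the actual framework):

```cpp
// Toy one-state environment: action 0 yields reward 1.0, anything else 0.0.
struct ToyEnvironment
{
  double Step(int action) { return action == 0 ? 1.0 : 0.0; }
};

// Update rule supplied as a policy class. Swapping this type for an
// n-step or Sarsa rule changes the algorithm without touching the loop.
struct OneStepQUpdate
{
  static double Target(double reward, double gamma, double maxNextQ)
  {
    return reward + gamma * maxNextQ;
  }
};

// The learner is assembled from its policies at compile time.
template<typename EnvironmentType, typename UpdateRuleType>
class AsyncLearner
{
 public:
  AsyncLearner(double gamma, double alpha) :
      gamma(gamma), alpha(alpha), q{0.0, 0.0} { }

  // One interaction step: act, compute the target, update the estimate.
  void Step(EnvironmentType& env, int action)
  {
    const double reward = env.Step(action);
    const double maxNextQ = q[0] > q[1] ? q[0] : q[1];
    const double target = UpdateRuleType::Target(reward, gamma, maxNextQ);
    q[action] += alpha * (target - q[action]);
  }

  double Q(int action) const { return q[action]; }

 private:
  double gamma, alpha;
  double q[2];  // Tabular estimates for the two actions.
};
```

The appeal of this design is that each worker in the OpenMP loop instantiates the same template, and adding a new async variant means writing only a new update-rule policy.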


Deep Reinforcement Learning Methods - Week-4 Highlights

This week I finished the update of the optimizer API. I think the PR is now ready to merge. Thanks to Ryan for helping me with some complicated optimizers. I also worked on exposing Forward and Backward of FFN, and on supporting a real batch mode. To do this, we need to look into all the layer types and make sure they are compatible with matrices (before this we only used vectors). Thanks to Marcus for helping me with the conv-related layers. There is another PR about this, which is also almost ready to merge. I also investigated OpenMP; it's really surprising that OpenMP doesn't allow a class member variable in its data-sharing clauses -- to work around this, I have to copy the class member variable into a new local variable. I also learned some of OpenMP's synchronization mechanisms. In addition, I noticed on the mailing list that someone is going to implement HOGWILD!, but I don't think my async RL will benefit much from that -- the key point of async RL, I think, is async agents rather than async gradient computation.
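A small sketch of the member-variable workaround described above (the `Trainer` class and counter are illustrative): copy the member into a local, let OpenMP handle the local in its clause, and write the result back. Without OpenMP enabled the pragma is simply ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>

class Trainer
{
 public:
  std::size_t totalSteps = 0;

  void Run(int numWorkers)
  {
    // OpenMP data-sharing clauses cannot name a non-static class member,
    // so copy it into a local variable, reduce over that, and write back.
    std::size_t localSteps = totalSteps;
    #pragma omp parallel for reduction(+:localSteps) num_threads(numWorkers)
    for (int i = 0; i < 100; ++i)
      localSteps += 1;  // Each iteration simulates one worker step.
    totalSteps = localSteps;
  }
};
```

The `reduction` clause initializes each thread's private copy to zero and adds the partial sums back into `localSteps` at the end, so the final count is deterministic regardless of the thread count.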

BTW, I will have a two-week break starting tomorrow due to DLSS/RLSS in Montreal. During that period I'm afraid I can't work on new PRs, but I think I can still work on the two existing PRs to fix issues if necessary and make sure they get merged before I'm back.


Deep Reinforcement Learning Methods - Week-3 Highlights

This week I continued working on the DQN PR and finally got it merged. It's amazing that for CartPole with Double DQN, a small network with only 20 hidden units is better than a bigger one. I also worked on updating the optimizer API. It's really a huge project, with much more work than I expected. During this process, I started to miss pointers -- a pointer parameter can have a default value, but a non-const reference parameter cannot. So I have to write overloaded functions to allow default parameters. It still confuses me why C++ doesn't allow binding an rvalue to a non-const lvalue reference; I think sometimes we do need this feature.
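The overload workaround mentioned above can be sketched like this (the function names are illustrative, not the actual API): the one-argument overload supplies the local that stands in for the missing reference argument.

```cpp
#include <cstddef>
#include <string>

// A non-const reference parameter cannot have a default value...
double Evaluate(const std::string& task, double& score)
{
  score = 1.0;  // Pretend we computed something for `task`.
  return score;
}

// ...so an overload provides the "default" by supplying a local.
double Evaluate(const std::string& task)
{
  double score;
  return Evaluate(task, score);
}

// For contrast, a pointer parameter can simply default to nullptr.
void Train(std::size_t iterations, double* lossOut = nullptr)
{
  (void) iterations;
  if (lossOut != nullptr)
    *lossOut = 0.0;
}
```

As for the rvalue issue: `double& r = Evaluate("x");` does not compile, because an rvalue cannot bind to a non-const lvalue reference, while `const double& r = Evaluate("x");` is fine.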


Deep Reinforcement Learning Methods - Week-2 Highlights

This week I mainly worked on merging my DQN PR. During the merge process, many new ideas came up. For example, we decided to use a pass-by-value convention to replace the old const lvalue reference and rvalue reference overloads in the API. This will give users more flexibility and make the mlpack codebase more compact. We also decided to totally separate the model instance and the optimizer instance, which is necessary for asynchronous deep RL methods and is also helpful for the hyperparameter tuner project.
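A minimal sketch of the pass-by-value convention (the `Model` class here is illustrative, not the actual API): a single by-value "sink" function replaces the pair of const-lvalue-reference and rvalue-reference overloads. Lvalue arguments are copied into the parameter, rvalue arguments are moved into it, and the body then moves from the parameter either way.

```cpp
#include <string>
#include <utility>

class Model
{
 public:
  // One sink function instead of the overload pair
  // Name(const std::string&) and Name(std::string&&).
  void Name(std::string name) { this->name = std::move(name); }

  const std::string& Name() const { return name; }

 private:
  std::string name;
};
```

The cost is at most one extra move compared to the hand-written overloads, which is usually negligible, and the API surface shrinks by half.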

Furthermore, I realized my old design for the abstract interface of the asynchronous methods has a fatal flaw -- it cannot support an LSTM layer for A3C. So I totally redesigned the abstract interface and tested the new design in PyTorch on Atari games. I also tried some interesting things for async deep RL methods, such as a non-centered target network and a target network without a lock.


Deep Reinforcement Learning Methods - Week-1 Highlights

This post summarizes the work that has been done so far, starting from around February.

This Deep RL project will implement DQN (and its variants) and several asynchronous deep RL methods. So far, basic DQN (ready for classical tasks like CartPole) has been finished and is being merged. The skeleton of the async methods is also finished.

I have surveyed a bunch of DQN and A3C implementations in TensorFlow and PyTorch to find the best practices. Furthermore, I implemented DQN, async Q-learning, async n-step Q-learning, and async Sarsa from scratch in PyTorch. This highly modularized PyTorch implementation can be found here. All of these implementations work well on CartPole, and I also tested DQN on Breakout and A3C on Pong.

The future work will mainly be porting my PyTorch implementation to mlpack.