[mlpack] GSoC 2018 : Reinforcement Learning

Marcus Edel marcus.edel at fu-berlin.de
Sat Mar 3 08:09:41 EST 2018


Hello Rajesh,

> The implementation of prioritized experience replay is the smallest of the
> three ideas proposed, as the idea is much simpler than the rest. So, ideally,
> implementing Double DQN and the dueling architecture should take somewhere
> between 2-3 months, considering all components such as testing. If there is
> time left after that, the last extension can be added. Since it is a smaller
> addition and I will be fully familiar with mlpack by then, I think the last
> part can be done quickly, even post-summer, as I feel this component is quite
> useful to any RL library.

This sounds reasonable to me. I think every method you mentioned would fit into
the current codebase, so please feel free to choose the methods you find most
interesting.
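
By the way, since prioritized experience replay keeps coming up: the change is
mostly confined to how transitions are sampled from the replay buffer. Just to
illustrate the idea (a toy sketch in plain C++, nothing to do with mlpack's
replay interface, and all names made up):

  // Proportional prioritized sampling: draw transitions with probability
  // proportional to (|TD error| + eps)^alpha instead of uniformly.
  #include <cmath>
  #include <cstddef>
  #include <iostream>
  #include <random>
  #include <vector>

  int main()
  {
    const double alpha = 0.6, eps = 1e-3;

    // Pretend these are the absolute TD errors of five stored transitions.
    std::vector<double> tdError = { 0.05, 1.2, 0.3, 0.0, 2.5 };

    std::vector<double> priority(tdError.size());
    for (std::size_t i = 0; i < tdError.size(); ++i)
      priority[i] = std::pow(std::abs(tdError[i]) + eps, alpha);

    std::mt19937 rng(42);
    std::discrete_distribution<std::size_t> sample(priority.begin(),
                                                   priority.end());

    // Transitions with larger TD error are drawn far more often.
    std::vector<std::size_t> counts(tdError.size(), 0);
    for (int i = 0; i < 10000; ++i)
      ++counts[sample(rng)];
    for (std::size_t i = 0; i < counts.size(); ++i)
      std::cout << "transition " << i << ": " << counts[i] << " draws\n";
  }

The part that actually touches the existing Q-learning code is the
importance-sampling correction (weights proportional to (N * P(i))^-beta)
applied to the loss, plus updating the stored priorities after each step.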

> While going through the code, though, I noticed something surprising:
> Shangtong Zhang has already implemented Double DQN. I saw it in this code:
> 
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp
> 
> Also, in one of the comments in the PR at
> https://github.com/mlpack/mlpack/pull/934, he mentions testing Double DQN
> (comment on 27th May). So I wanted to know whether there is anything more that
> needs to be done as part of Double DQN.

Ah right, we should close the PR to avoid any further confusion; it was just
used to track the overall progress.

> If Double DQN is already done, then I would propose that the dueling
> architecture and noisy nets be the main part of the project, with prioritized
> experience replay as the possible extension; otherwise, the original plan
> should be an achievable target.

Sounds good; note that it's also possible to improve or extend the existing
Double DQN implementation.
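
For context, the difference between vanilla DQN and Double DQN is only in how
the bootstrap target is formed. Roughly (a self-contained sketch with
placeholder numbers, not the actual code in q_learning_impl.hpp):

  #include <algorithm>
  #include <cstddef>
  #include <iostream>
  #include <iterator>
  #include <vector>

  // Vanilla DQN: the target network both selects and evaluates the action.
  double DqnTarget(double reward, double discount,
                   const std::vector<double>& targetQ)
  {
    return reward + discount *
        *std::max_element(targetQ.begin(), targetQ.end());
  }

  // Double DQN: the online network selects the action, the target network
  // evaluates it, which reduces the overestimation bias.
  double DoubleDqnTarget(double reward, double discount,
                         const std::vector<double>& onlineQ,
                         const std::vector<double>& targetQ)
  {
    const std::size_t best = std::distance(onlineQ.begin(),
        std::max_element(onlineQ.begin(), onlineQ.end()));
    return reward + discount * targetQ[best];
  }

  int main()
  {
    const std::vector<double> onlineQ = { 0.1, 0.9, 0.4 };
    const std::vector<double> targetQ = { 0.8, 0.2, 0.5 };
    std::cout << DqnTarget(1.0, 0.99, targetQ) << " vs "
              << DoubleDqnTarget(1.0, 0.99, onlineQ, targetQ) << "\n";
  }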

> As you suggested, I went through the code to figure out what can be extended,
> and I was very happy to find that the overall code is well structured and
> lends itself to reuse, for example:

You are absolutely right; make sure to include that in your proposal.
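
One note on the dueling architecture specifically, since it fits your point
about reuse: the learning loop and the policies can stay exactly as they are;
only the network head changes. The aggregation at the end is just
Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); a rough sketch with placeholder
numbers (not mlpack's layer API):

  #include <cstddef>
  #include <iostream>
  #include <numeric>
  #include <vector>

  // The network produces a scalar state value V(s) and one advantage per
  // action; the Q-values are recombined before the output.
  std::vector<double> DuelingQ(const double value,
                               const std::vector<double>& advantage)
  {
    const double meanAdv = std::accumulate(advantage.begin(),
        advantage.end(), 0.0) / advantage.size();

    std::vector<double> q(advantage.size());
    for (std::size_t i = 0; i < advantage.size(); ++i)
    {
      // Subtracting the mean advantage is the identifiability trick from
      // the dueling-networks paper.
      q[i] = value + advantage[i] - meanAdv;
    }
    return q;
  }

  int main()
  {
    for (const double q : DuelingQ(1.5, { 0.2, -0.1, 0.5 }))
      std::cout << q << " ";
    std::cout << "\n";
  }

Since the rest of the Q-learning code only ever sees the final Q-values, this
really can be passed in as a different network type, as you describe.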

> The timeline is something I feel can be kept flexible based on progress. That
> is, if whatever has been proposed gets completed earlier than expected, more
> features can be added (towards having all components of the Rainbow
> algorithm); if it goes a little slower than expected, I will make sure to
> complete everything that was part of the proposal, even if that means
> continuing post-summer.

Sounds reasonable; we should see if we can define a minimal set of goals that
should ideally be finished by the end of the summer. Also, see
https://github.com/mlpack/mlpack/wiki/Google-Summer-of-Code-Application-Guide
for some tips.
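
One more pointer on the noisy nets idea, since it is on your list: like the
dueling head, it is mostly confined to the network itself - essentially a
linear layer whose effective weights are mu + sigma * noise, with fresh noise
drawn each forward pass, so exploration comes from the parameter noise rather
than an epsilon-greedy policy. A rough sketch in plain Armadillo (independent
Gaussian noise rather than the factorised variant from the paper, and not
mlpack's layer interface):

  #include <armadillo>

  int main()
  {
    const arma::uword inSize = 4, outSize = 2;
    arma::arma_rng::set_seed(42);

    // Learnable parameters (random initialisation just for the demo).
    arma::mat weightMu(outSize, inSize, arma::fill::randn);
    arma::mat weightSigma(outSize, inSize);
    weightSigma.fill(0.017);
    arma::vec biasMu(outSize, arma::fill::randn);
    arma::vec biasSigma(outSize);
    biasSigma.fill(0.017);

    // Fresh noise drawn for this forward pass.
    const arma::mat weightEps(outSize, inSize, arma::fill::randn);
    const arma::vec biasEps(outSize, arma::fill::randn);

    const arma::vec input(inSize, arma::fill::randu);

    // Effective parameters are mu + sigma % noise (elementwise product).
    const arma::vec output = (weightMu + weightSigma % weightEps) * input
        + (biasMu + biasSigma % biasEps);
    output.print("noisy layer output:");
  }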

I hope this was helpful; let me know if I should clarify anything.

Thanks,
Marcus

> On 1. Mar 2018, at 13:16, яαנєѕн <rajeshdm9 at gmail.com> wrote:
> 
> Hey Marcus, 
>  
> I think each idea you mentioned would fit into the existing codebase, but don't
> underestimate the time you need to implement the method, writing good tests,
> etc. Each part is important and takes time, so my recommendation is to focus on
> two ideas and maybe propose to work on another one or extend an idea if there is
> time left.
> 
> I completely agree with this. It will be a lengthy project, so I will propose something on a smaller scale.
> I was actually asking more about whether the methods would fit into the codebase, and that has been answered. Thank you.
> 
> So, I was thinking the following can be done:
> 
> 1. Implementation of Double DQN
> 
> 2. Implementation of the dueling architecture DQN / Noisy Nets paper - whichever you think might be better
> 
> 3. Extension if time permits: prioritized experience replay. The implementation of prioritized experience replay is the smallest of the three ideas proposed, as the idea is much simpler than the rest. So, ideally, implementing Double DQN and the dueling architecture should take somewhere between 2-3 months, considering all components such as testing. If there is time left after that, the last extension can be added. Since it is a smaller addition and I will be fully familiar with mlpack by then, I think the last part can be done quickly, even post-summer, as I feel this component is quite useful to any RL library.
> 
> While going through the code, though, I noticed something surprising: Shangtong Zhang has already implemented Double DQN. I saw it in this code:
> 
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp
> 
> Also, in one of the comments in the PR at https://github.com/mlpack/mlpack/pull/934, he mentions testing Double DQN (comment on 27th May). So I wanted to know whether there is anything more that needs to be done as part of Double DQN.
> 
> If Double DQN is already done, then I would propose that the dueling architecture and noisy nets be the main part of the project, with prioritized experience replay as the possible extension; otherwise, the original plan should be an achievable target.
> 
> As you suggested, I went through the code to figure out what can be extended, and I was very happy to find that the overall code is well structured and lends itself to reuse, for example:
> 
> The policies are separate, so any change in the way the function approximator works will not affect the policy side. Hence, https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/reinforcement_learning/policy can be used as is and will be very useful for testing new methods.
> 
> The same goes for the environments: https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/reinforcement_learning/environment can be used as is.
> 
> The replay buffer, https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/replay/random_replay.hpp, is the part that prioritized experience replay will extend, since that is the component the algorithm modifies. It stays the same for all the other methods.
> 
> We can reuse most of what is in https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp and https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning.hpp, but the network type will be different for both the dueling architecture and noisy nets; the other parts can be extended.
> 
> The timeline is something I feel can be kept flexible based on progress. That is, if whatever has been proposed gets completed earlier than expected, more features can be added (towards having all components of the Rainbow algorithm); if it goes a little slower than expected, I will make sure to complete everything that was part of the proposal, even if that means continuing post-summer.
> 
> So, I would like to know what more is required as part of the proposal, and also whether Double DQN has been fully implemented or not.
> 
> Regards, 
> Rajesh D M
> 
> 
> 
> On Tue, Feb 27, 2018 at 3:27 AM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
> Hello Rajesh,
> 
>> As you mentioned, I've been working on the new environment (Gridworld from
>> Sutton and Barto - it's a simple environment) for testing. I think it is
>> ready, but I want to test it in the standard way. So could you please tell
>> me how exactly the CartPole and MountainCar environments were tested/run in
>> general, so that I can follow a similar procedure to check whether what I
>> have done is correct.
> 
> That sounds great;
> https://github.com/mlpack/mlpack/blob/master/src/mlpack/tests/rl_components_test.cpp
> should be helpful.
> 
>> So, I think mlpack should have this latest state of the art available as part
>> of the library. It may not be possible to implement all of the above-mentioned
>> techniques in 3 months, but I feel they are not very hard to add either, as
>> they are (for the most part) just extensions on top of each other, and I would
>> also be happy to continue contributing after GSoC.
>> 
>> So, can we work towards Rainbow as the goal for GSoC (with some but not all
>> components)? Would that be a good idea?
> 
> Sounds like you already put some time into the project idea, that is great. I
> think each idea you mentioned would fit into the existing codebase, but don't
> underestimate the time you need to implement the method, writing good tests,
> etc. Each part is important and takes time, so my recommendation is to focus on
> two ideas and maybe propose to work on another one or extend an idea if there is
> time left. Also, another tip for the proposal is to mention the parts that can
> be reused or have to be extended over the summer; a clear structure of the
> project idea helps a lot.
> 
> I hope this was helpful; let me know if I should clarify anything.
> 
> Thanks,
> Marcus
> 
>> On 26. Feb 2018, at 19:23, яαנєѕн <rajeshdm9 at gmail.com> wrote:
>> 
>> Hey Marcus, Rajesh here. 
>> 
>> As you mentioned, I've been working on the new environment (Gridworld from Sutton and Barto - it's a simple environment) for testing. I think it is ready, but I want to test it in the standard way. So could you please tell me how exactly the CartPole and MountainCar environments were tested/run in general, so that I can follow a similar procedure to check whether what I have done is correct.
>> 
>> Also, with this I have gotten a good idea of how mlpack works and am getting more used to it by the day. I also wanted to start working on the proposal in parallel.
>> 
>> I went through everything Shangtong Zhang did last year as part of GSoC (see http://www.mlpack.org/gsocblog/ShangtongZhangPage.html) and learned that DQN and async n-step Q-learning are the major contributions, with the rest of his work revolving around them.
>> 
>> So I think the following can be extensions to his work that would fit well into the existing architecture he built:
>> 
>> 1. Double DQN (as suggested by you guys in the ideas list)
>> 
>> 2. Prioritized experience replay: in this method, samples are no longer drawn uniformly at random from the replay buffer, as they are in DQN, but are prioritized based on a measure such as the TD error. Its results beat those of Double DQN.
>> 
>> 3. After this, DeepMind released their next improvement: the dueling architecture.
>> 
>> In this architecture, the state value and the per-action advantages are computed by separate streams of the network and combined back before the last step. The intuition behind this is that the value of a state does not always depend only on the actions that can be taken from that state.
>> 
>> 4. They then came up with Noisy Nets: another improvement, usable alongside all the above methods, that adds learned noise to the network weights, which according to them improves the overall exploration efficiency.
>> 
>> They also had other improvements in Multi Step RL and Distributional RL.
>> 
>> After this, they came up with their best algorithm:
>> 
>> Rainbow: it is a combination of all the above-mentioned algorithms. They were able to combine them because they each address a different part of the RL agent's learning (exploration, the update rule, etc.). The results of Rainbow far exceed those of any of the other techniques out there. The paper also shows results for other combinations of the above-mentioned methods.
>> So, I think mlpack should have this latest state of the art available as part of the library. It may not be possible to implement all of the above-mentioned techniques in 3 months, but I feel they are not very hard to add either, as they are (for the most part) just extensions on top of each other, and I would also be happy to continue contributing after GSoC.
>> 
>> So, can we work towards Rainbow as the goal for GSoC (with some but not all components)? Would that be a good idea?
>> 
>> I have already read all of these papers as part of my thesis work and am actually working towards improving upon them, so I have a thorough understanding of all the concepts and can start working on them right away.
>> 
>> PS: The other idea, Proximal Policy Optimization (PPO), is actually an improvement over Trust Region Policy Optimization (TRPO), so to implement PPO, TRPO might have to be implemented first. Also, those methods operate in continuous action spaces as well as continuous state spaces (Rainbow and the other techniques above handle continuous state spaces but only discrete action spaces), and the other state of the art in that area is Deep Deterministic Policy Gradient (DDPG). So if you want that to be part of mlpack, it would probably be a good idea to implement those three together. I am equally interested in both sets of implementations (I have already gone through all three of these papers as well).
>> 
>> I personally feel going with the first set is better, as Shangtong Zhang has created a great base to build new methods on top of. Please let me know what you think.
>> 
>> -- 
>> Regards,
>> Rajesh D M
> 
> 
> 
> 
> -- 
> Regards,
> Rajesh D M


