[mlpack] Hints for A3C/PPO

Shangtong Zhang zhangshangtong.cpp at gmail.com
Mon Feb 19 12:22:47 EST 2018


For TRPO you need to read the original paper; I don't have a better suggestion.
Starting from a vanilla policy gradient is good. However, the main concern is that, in my experience, you need either experience replay or multiple workers to make a non-linear function approximator work (they give you uncorrelated data, which is crucial for training a network). Without them it may be hard to tune (although it is possible if you work on a small network and a small task, so it is worth a try).
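
As a rough illustration of the uncorrelated-data point, here is a minimal replay buffer sketch in plain C++ (the Transition type and all names are made up for illustration; this is not mlpack code). Transitions arrive in temporal order, but training minibatches are drawn uniformly at random, which breaks the correlation between consecutive samples:

    #include <cstddef>
    #include <random>
    #include <vector>

    // Toy transition record; a real agent would store full state vectors.
    struct Transition
    {
      int state, action, nextState;
      double reward;
      bool terminal;
    };

    class ReplayBuffer
    {
     public:
      explicit ReplayBuffer(std::size_t capacity) : capacity(capacity), next(0) { }

      // Transitions are added in temporal order, overwriting the oldest when full.
      void Add(const Transition& t)
      {
        if (data.size() < capacity)
          data.push_back(t);
        else
          data[next] = t;
        next = (next + 1) % capacity;
      }

      // Minibatches are sampled uniformly at random (buffer assumed non-empty),
      // so consecutive, highly correlated transitions rarely share a batch.
      std::vector<Transition> Sample(std::size_t batchSize, std::mt19937& rng) const
      {
        std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
        std::vector<Transition> batch;
        for (std::size_t i = 0; i < batchSize; ++i)
          batch.push_back(data[pick(rng)]);
        return batch;
      }

     private:
      std::size_t capacity, next;
      std::vector<Transition> data;
    };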

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
> On Feb 19, 2018, at 10:13, Chirag Ramdas <chiragramdas at gmail.com> wrote:
> 
> Hi Shangtong,
> 
> Thank you so very much for the detailed reply, I appreciate it a lot!
> 
> I spoke to Marcus about an initial contribution to make my GSoC proposal strong, and he suggested that I could implement vanilla stochastic policy gradients. So I was looking to implement a vanilla version with a Monte Carlo value estimate as my advantage function, basically just the simplest of implementations...
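> 
> For concreteness, the Monte Carlo part of that plan boils down to computing discounted returns backwards over an episode and subtracting a simple baseline, here the episode's mean return (a plain C++ sketch with illustrative names, not mlpack code):
> 
> #include <vector>
> 
> // Discounted Monte Carlo returns: G_t = r_t + gamma * G_{t+1}, computed backwards.
> std::vector<double> Returns(const std::vector<double>& rewards, double gamma)
> {
>   std::vector<double> g(rewards.size());
>   double running = 0.0;
>   for (int t = (int) rewards.size() - 1; t >= 0; --t)
>   {
>     running = rewards[t] + gamma * running;
>     g[t] = running;
>   }
>   return g;
> }
> 
> // Simplest "advantage": the return minus the episode's mean return as a baseline.
> std::vector<double> Advantages(const std::vector<double>& rewards, double gamma)
> {
>   std::vector<double> adv = Returns(rewards, gamma);
>   double mean = 0.0;
>   for (double v : adv) mean += v;
>   mean /= adv.size();
>   for (double& v : adv) v -= mean;
>   return adv;
> }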
> 
> I have yet to fully understand TRPO and PPO theoretically, because they are statistically quite heavy. I mean, the papers provide mechanical pseudocode, but the intuition for what is really happening is what I wish to understand. Towards this, I am trying to find blogs, and indeed the past few days have gone by in a beautiful RL blur! It really has been so interesting. If you can provide some resources for understanding the statistical intuition behind trust region algorithms, it would really be helpful!
> 
> Right now, I am just looking at implementing a single-threaded vanilla policy gradient algorithm. I will look at https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35 and see how I can use it! I am not even looking at actor-critic right now, and PPO is for sure the state of the art, but that's way beyond scope for me right now...
> 
> I am attaching a screenshot of what I am aiming to implement.
> What are your inputs on implementing this?
> Would you say that if I refer to the file you have mentioned, it should be doable in a single-threaded environment?
> 
> Thanks a lot again!
> 
> 
> On Feb 19, 2018 10:12 PM, "Shangtong Zhang" <zhangshangtong.cpp at gmail.com <mailto:zhangshangtong.cpp at gmail.com>> wrote:
> Hi Chirag,
> 
> I think it would be better to also cc the mailing list.
> 
> I assume you are trying to implement A3C or something like this.
> Actually, this has almost been done. See my PR https://github.com/mlpack/mlpack/pull/934
> This is my work from last summer. To compute the gradient, you can use
> src/mlpack/methods/ann/layer/policy.hpp <https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35>
> and there is also an actor_critic worker that shows how to use it.
> 
> The most annoying thing is that it doesn't work and I don't know why. Marcus and I tried hard but didn't find any obvious logic bug.
> So if you want to implement A3C, I think the simplest way is to find that bug.
> I have some hints for you:
> 1. Even if we don't have shared layers between the actor and the critic, A3C should work well on a small task like CartPole. If you do want shared layers, you need to look into https://github.com/mlpack/mlpack/pull/1091 (I highly recommend not doing this first, as it is not critical).
> 2. I believe the bug may lie in the async mechanism, so it's difficult to debug (it's possible I'm wrong). A good practice, I think, is to implement A2C and the corresponding PPO, which I believe is the state-of-the-art technique. You can implement a vectorized environment, i.e. the interaction with the environment is parallelized and synchronous, while the optimization occurs in a single thread; see the sketch after this list. See OpenAI baselines (TensorFlow, https://github.com/openai/baselines) or my A2C (PyTorch, https://github.com/ShangtongZhang/DeepRL/blob/master/agent/A2C_agent.py) to see how this idea works. I believe it's much easier to implement and debug. Once you implement the vectorized environment, it's easy to plug in all the algorithms, e.g. one-step/n-step Q-learning, n-step Sarsa, actor-critic and PPO. From my experience, if tuned properly, the speed is comparable to fully async implementations.
> 3. If you do want A3C and want to find that bug, I think you can implement actor-critic with experience replay first, to verify that it works in the single-threaded case. (Note this is theoretically wrong: to do it properly you need off-policy actor-critic. In practice you can just ignore the importance sampling ratio and treat the data in the buffer as on-policy; it should work and is enough to check the implementation on a small task like CartPole.)
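> 
> To make hint 2 concrete, a vectorized environment is just N independent environment copies stepped in lockstep: one Step() call takes a batch of actions and returns a batch of transitions, so a single optimizer thread always sees a decorrelated batch. A minimal sketch with a hypothetical ToyEnv (not mlpack's environment API):
> 
> #include <cstddef>
> #include <tuple>
> #include <vector>
> 
> // Hypothetical single environment with the usual reset/step interface.
> struct ToyEnv
> {
>   int state = 0;
>   int Reset() { state = 0; return state; }
>   // Returns (next state, reward, done).
>   std::tuple<int, double, bool> Step(int action)
>   {
>     state += (action == 1 ? 1 : -1);
>     const bool done = (state >= 5 || state <= -5);
>     return std::make_tuple(state, done ? 1.0 : 0.0, done);
>   }
> };
> 
> // N environments stepped synchronously; the optimizer stays in one thread.
> class VecEnv
> {
>  public:
>   explicit VecEnv(std::size_t n) : envs(n) { for (auto& e : envs) e.Reset(); }
> 
>   // One call = one batch of transitions, one per (uncorrelated) worker env.
>   std::vector<std::tuple<int, double, bool>> Step(const std::vector<int>& actions)
>   {
>     std::vector<std::tuple<int, double, bool>> out;
>     for (std::size_t i = 0; i < envs.size(); ++i)
>     {
>       out.push_back(envs[i].Step(actions[i]));
>       if (std::get<2>(out.back()))
>         envs[i].Reset();  // Auto-reset finished episodes.
>     }
>     return out;
>   }
> 
>  private:
>   std::vector<ToyEnv> envs;
> };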
> 
> BTW, your understanding of how forward and backward work in DQN is absolutely right.
> 
> Hope this can help,
> 
> Best regards,
> 
> Shangtong Zhang,
> Second year graduate student,
> Department of Computing Science,
> University of Alberta
> Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
>> On Feb 19, 2018, at 00:58, Chirag Ramdas <chiragramdas at gmail.com <mailto:chiragramdas at gmail.com>> wrote:
>> 
>> I think I can probably write a custom compute_gradients() method for my backprop here, but I wanted to know if mlpack's implementation provides something similar to a convenient Forward() + Backward() pair which I can use for my requirements here...
>> 
>> 
>> 
>> Yours Sincerely,
>> 
>> Chirag Pabbaraju,
>> B.E.(Hons.) Computer Science Engineering,
>> BITS Pilani K.K. Birla Goa Campus,
>> Off NH17B, Zuarinagar,
>> Goa, India
>> chiragramdas at gmail.com <mailto:chiragramdas at gmail.com> | +91-9860632945
>> 
>> On Mon, Feb 19, 2018 at 1:26 PM, Chirag Ramdas <chiragramdas at gmail.com <mailto:chiragramdas at gmail.com>> wrote:
>> Hello,
>> 
>> I had an implementation question to ask. From the neural network implementation I saw (ffn_impl), e.g. lines 146-156 <https://github.com/chogba/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp#L146-L156>, you first run the network forward on the states and see what output (Q value) it gives for each action. Thereafter, you update the targets for the actions actually taken in the transitions drawn from your experience replay mechanism, and this updated target matrix then behaves like the labels you want the neural net to predict. Now, I saw from the q_learning_test.hpp file that you are initialising the FFN with MeanSquaredError, so I am assuming that if you pass this target matrix to learningNetwork.Backward(), it computes the gradients of the mean squared error with respect to all the parameters. Thereafter, with these gradients and the optimizer you have specified (e.g. Adam), updater.Update() updates the parameters of the network.
>> Do correct me if I am wrong anywhere.
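>> 
>> In other words, something like the following generic sketch (not the actual mlpack code): the target vector equals the network's own predictions everywhere except at the action that was taken, so the mean squared error gradient is zero for every other action.
>> 
>> #include <algorithm>
>> #include <vector>
>> 
>> // Regression target for one transition (s, a, r, s').
>> // qCurrent = Q(s, .) from the learning network, qNext = Q(s', .) from the target network.
>> std::vector<double> QTarget(const std::vector<double>& qCurrent,
>>                             const std::vector<double>& qNext,
>>                             int action, double reward, bool terminal, double gamma)
>> {
>>   std::vector<double> target = qCurrent;  // Copy: other actions keep their own predictions.
>>   const double bootstrap = terminal ? 0.0 : *std::max_element(qNext.begin(), qNext.end());
>>   target[action] = reward + gamma * bootstrap;  // Only the taken action gets a new target.
>>   return target;
>> }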
>> 
>> So now my question is: I am faced with a custom optimisation function, and I need to compute gradients of this function with respect to each of the parameters of my neural net. The Forward() + Backward() pair called in the above implementation required me to supply 1) what my network computes for an input and 2) what I believe it should have computed, and it then computes the gradients by itself. But I simply have an objective function (no notion of what the network should have computed, i.e. labels) and correspondingly an update rule which I want to follow.
>> 
>> Precisely, I have a policy function pi which is approximated by a neural net parameterised by theta, and which outputs the probabilities of performing each action given a state. Now, I want the following update rule for the parameters:
>> 
>> <Screen Shot 2018-02-19 at 1.10.32 PM.png>
>> 
>> 
>> Basically, I am asking if I can have my neural net optimise an objective function which I myself specify, in some form.
>> I looked at the implementation of FFN, but I couldn't figure out how I could do this. I hope my question was clear.
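>> 
>> In other words, something like the standard REINFORCE rule, theta <- theta + alpha * v_t * grad_theta log pi_theta(a_t | s_t). To show what gradient that needs, here is a hand-rolled sketch for a linear softmax policy (purely illustrative; it only shows the gradient the network would have to produce, not how mlpack computes it):
>> 
>> #include <cmath>
>> #include <cstddef>
>> #include <vector>
>> 
>> // Action probabilities of a linear softmax policy: pi(a | s) ~ exp(theta[a] . s).
>> std::vector<double> Softmax(const std::vector<std::vector<double>>& theta,
>>                             const std::vector<double>& s)
>> {
>>   std::vector<double> logits(theta.size(), 0.0), pi(theta.size());
>>   for (std::size_t a = 0; a < theta.size(); ++a)
>>     for (std::size_t i = 0; i < s.size(); ++i)
>>       logits[a] += theta[a][i] * s[i];
>>   double z = 0.0;
>>   for (double l : logits) z += std::exp(l);
>>   for (std::size_t a = 0; a < theta.size(); ++a)
>>     pi[a] = std::exp(logits[a]) / z;
>>   return pi;
>> }
>> 
>> // One REINFORCE step: theta += alpha * v * grad_theta log pi(a | s), where for a
>> // linear softmax grad_{theta[b]} log pi(a | s) = ([b == a] - pi(b | s)) * s.
>> void ReinforceUpdate(std::vector<std::vector<double>>& theta,
>>                      const std::vector<double>& s, int a, double v, double alpha)
>> {
>>   const std::vector<double> pi = Softmax(theta, s);
>>   for (std::size_t b = 0; b < theta.size(); ++b)
>>   {
>>     const double coeff = ((int) b == a ? 1.0 : 0.0) - pi[b];
>>     for (std::size_t i = 0; i < s.size(); ++i)
>>       theta[b][i] += alpha * v * coeff * s[i];
>>   }
>> }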
>> 
>> Thanks a lot!
>> 
>> Yours Sincerely,
>> 
>> Chirag Pabbaraju,
>> B.E.(Hons.) Computer Science Engineering,
>> BITS Pilani K.K. Birla Goa Campus,
>> Off NH17B, Zuarinagar,
>> Goa, India
>> chiragramdas at gmail.com <mailto:chiragramdas at gmail.com> | +91-9860632945
>> 
> 
> <Screenshot_20180218-212917.jpg>
