[mlpack] Hints for A3C/PPO

Shangtong Zhang zhangshangtong.cpp at gmail.com
Tue Feb 20 12:11:21 EST 2018


> So that was stupid of me: forward() in policy.hpp just computes the softmax of the input (first param) and stores it in the output (second param). Does that mean the policy layer has to be the last layer of my neural net?
See the comment here: https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35R21
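In short, Forward() just turns the incoming action preferences into a probability distribution with a softmax and writes it into the output parameter, so in the actor network it naturally goes last, on top of whatever produces those preferences. In isolation the computation is just the following (standalone Armadillo, not the exact layer signature):

#include <armadillo>

// Softmax over a column of action preferences; this is all Forward() computes.
arma::vec Softmax(const arma::vec& x)
{
  arma::vec p = arma::exp(x - x.max());  // subtract the max for numerical stability
  return p / arma::sum(p);
}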

> If you could give me an intuition for this, it would be great, because this is seemingly the main part of the code
You need to differentiate the objective by hand to understand it. Assume the input is x,
p = softmax(x), J = -adv * log p[a] + coef * sum(p * log p),
where a is the chosen action. The objective is to minimize J, so once you compute dJ/dx the code will make sense.
The first term of J minimizes the cross-entropy with respect to the chosen action (weighted by adv); the second term maximizes the policy entropy (weighted by coef, or equivalently minimizes the KL divergence to a uniform policy).
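If you push the derivative through the softmax you get, for each component j,
dJ/dx[j] = adv * (p[j] - [j == a]) + coef * p[j] * (log p[j] - sum(p * log p)),
or in Armadillo notation dJ/dx = adv * (p - onehot(a)) + coef * (p % (log(p) - accu(p % log(p)))). The element-wise products with p (the % prob you asked about) fall out of the softmax Jacobian. A standalone sketch of that computation, for intuition only (not the exact layer code in the PR):

#include <armadillo>

// dJ/dx for J = -adv * log(p[a]) + coef * sum(p % log(p)), with p = softmax(x).
arma::vec PolicyGradient(const arma::vec& x, const size_t a,
                         const double adv, const double coef)
{
  arma::vec p = arma::exp(x - x.max());
  p /= arma::sum(p);                      // p = softmax(x)

  arma::vec onehot(p.n_elem, arma::fill::zeros);
  onehot(a) = 1.0;

  // Cross-entropy term, weighted by the advantage.
  arma::vec g = adv * (p - onehot);

  // Entropy-regularization term; note the element-wise products with p.
  g += coef * (p % (arma::log(p) - arma::accu(p % arma::log(p))));

  return g;
}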

There is a test case at the end of the PR, where the ground truth gradient is computed by PyTorch. You can debug into that to see what happens after you have the equation.
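If you want a second check that doesn't go through PyTorch, a quick central-difference comparison against the analytical gradient also works; this is just a debugging aid I'd suggest, not something that is in the PR:

#include <armadillo>
#include <cmath>

// Evaluate J = -adv * log(p[a]) + coef * sum(p % log(p)) directly.
double PolicyLoss(const arma::vec& x, const size_t a,
                  const double adv, const double coef)
{
  arma::vec p = arma::exp(x - x.max());
  p /= arma::sum(p);
  return -adv * std::log(p(a)) + coef * arma::accu(p % arma::log(p));
}

// Central differences; compare component by component with the analytical dJ/dx.
arma::vec NumericalGradient(const arma::vec& x, const size_t a,
                            const double adv, const double coef)
{
  const double eps = 1e-6;
  arma::vec g(x.n_elem);
  for (size_t i = 0; i < x.n_elem; ++i)
  {
    arma::vec xp = x, xm = x;
    xp(i) += eps;
    xm(i) -= eps;
    g(i) = (PolicyLoss(xp, a, adv, coef) - PolicyLoss(xm, a, adv, coef)) / (2 * eps);
  }
  return g;
}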

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
> On Feb 20, 2018, at 04:52, Chirag Ramdas <chiragramdas at gmail.com> wrote:
> 
> *UPDATE*: I see the % operator between matrices is an element-wise multiplication.
> 
> 
> 
> Yours Sincerely,
> 
> Chirag Pabbaraju,
> B.E.(Hons.) Computer Science Engineering,
> BITS Pilani K.K. Birla Goa Campus,
> Off NH17B, Zuarinagar,
> Goa, India
> chiragramdas at gmail.com | +91-9860632945
> 
> On Tue, Feb 20, 2018 at 4:49 PM, Chirag Ramdas <chiragramdas at gmail.com> wrote:
> *UPDATE*: Ignore the previous mail.
> So that was stupid of me: forward() in policy.hpp just computes the softmax of the input (first param) and stores it in the output (second param). Does that mean the policy layer has to be the last layer of my neural net?
> 
> Also, I figured from the test that after you've called forward() on the states, you now have the probabilities of actions for each state that the network predicted. You then have an advantage matrix, which is intuitively the amount by which you want the network to boost the probability of each action it predicted given that state (for our practical purposes it can be Q predicted by the critic network, or for my vanilla implementation simply the value of the state in that episode). So now how is this being backpropagated? Concretely, I didn't understand these <https://github.com/mlpack/mlpack/pull/934/files#diff-65a3ea5936ee02ebf2245666b5e5a985R44> lines; I don't know where the modulo (% prob) came from at all. If you could give me an intuition for this, it would be great, because this is seemingly the main part of the code.
> 
> Do correct me if I was wrong anywhere in my understanding so far.
> 
> Thanks a lot!
> 
> 
> 
> Yours Sincerely,
> 
> Chirag Pabbaraju,
> B.E.(Hons.) Computer Science Engineering,
> BITS Pilani K.K. Birla Goa Campus,
> Off NH17B, Zuarinagar,
> Goa, India
> chiragramdas at gmail.com | +91-9860632945
> 
> On Tue, Feb 20, 2018 at 2:55 PM, Chirag Ramdas <chiragramdas at gmail.com> wrote:
> Hey Shangtong,
> 
> Could you explain the parameters of the Forward() and Backward() functions in policy.hpp (this <https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35R48> and this <https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35R59>)? For a forward pass I could make sense of the input parameter, but couldn't understand the other one. I also didn't exactly understand the parameters of the Backward() function.
> 
> Could you explain what you are testing in this <https://github.com/mlpack/mlpack/pull/934/files#diff-2b4f212861373da25829bada11b85b97R131> test file too?
> 
> Thanks!
> 
> 
> 
> Yours Sincerely,
> 
> Chirag Pabbaraju,
> B.E.(Hons.) Computer Science Engineering,
> BITS Pilani K.K. Birla Goa Campus,
> Off NH17B, Zuarinagar,
> Goa, India
> chiragramdas at gmail.com | +91-9860632945
> 
> On Mon, Feb 19, 2018 at 11:11 PM, Shangtong Zhang <zhangshangtong.cpp at gmail.com> wrote:
> Sure.
> 
> Shangtong Zhang,
> Second year graduate student,
> Department of Computing Science,
> University of Alberta
> Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
>> On Feb 19, 2018, at 10:40, Chirag Ramdas <chiragramdas at gmail.com> wrote:
>> 
>> Right, will get on this!
>> Will ping you if I've been hopelessly stuck at any point for a long time, if that's okay with you?
>> 
>> On Feb 19, 2018 11:06 PM, "Shangtong Zhang" <zhangshangtong.cpp at gmail.com> wrote:
>> Yes. First try the vanilla implementation; if it doesn't work, augment it with experience replay (ER).
>> However, I would suggest not merging your vanilla implementation with ER, because, as I mentioned before, it's theoretically wrong. I would also suggest not merging your vanilla implementation without ER, as I'm pretty sure it won't work for a large network and a large task.
>> 
>> Anyway, it's a good starting point to prove you are good at this. And if you want it to be merged, you can implement policy gradient + ER + an importance sampling ratio, which is theoretically correct but may be unstable. You can truncate the importance sampling ratio to make it stable (it introduces bias, but that's acceptable).
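>> (For instance, with mu being the policy that generated the data in the buffer and pi the current policy, a common choice is a clipped per-step weight rho = min(c, pi(a|s) / mu(a|s)) for a small constant c such as 1, so a single replayed sample can never blow up the update.)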
>> 
>> Shangtong Zhang,
>> Second year graduate student,
>> Department of Computing Science,
>> University of Alberta
>> Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
>>> On Feb 19, 2018, at 10:26, Chirag Ramdas <chiragramdas at gmail.com> wrote:
>>> 
>>> I see. So what I can probably do is use an experience replay mechanism along with this vanilla implementation. This should intuitively work for a single-threaded worker, right? How does that sound for a start?
>>> 
>>> On Feb 19, 2018 10:52 PM, "Shangtong Zhang" <zhangshangtong.cpp at gmail.com> wrote:
>>> For TRPO you need to read the original paper; I don't have a better idea.
>>> Starting from a vanilla policy gradient is good; however, the main concern is that, from my experience, you need either experience replay or multiple workers to make a non-linear function approximator work (they give you uncorrelated data, which is crucial for training a network). Without them it may be hard to tune (although it's possible if you work on a small network and a small task, so it's worth a try).
>>> 
>>> Shangtong Zhang,
>>> Second year graduate student,
>>> Department of Computing Science,
>>> University of Alberta
>>> Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
>>>> On Feb 19, 2018, at 10:13, Chirag Ramdas <chiragramdas at gmail.com> wrote:
>>>> 
>>>> Hi Shangtong,
>>>> 
>>>> Thank you so very much for the detailed reply, I appreciate it a lot!
>>>> 
>>>> I spoke to Marcus about an initial contribution to make my GSoC proposal strong, and he suggested that I could implement vanilla stochastic policy gradients. So I was looking to implement a vanilla version with a Monte Carlo value estimate as my advantage function - basically just the simplest of implementations.
>>>> 
>>>> I am yet to fully understand TRPO and PPO theoretically, because they are statistically quite heavy. I mean, the papers provide mechanical pseudocode, but the intuition for what is really happening is what I wish to understand. Towards this I am trying to find blogs, and indeed the past few days have gone by in a beautiful RL blur! It really has been so interesting. If you can provide some resources to understand the statistical intuition behind trust region algorithms, it would really be helpful!
>>>> 
>>>> Right now I am just looking at implementing a single-threaded vanilla policy gradient algorithm. I will look at https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35 and see how I can use it! I am not even looking at actor-critic right now, and PPO for sure is the state of the art, but that's way beyond scope for me right now.
>>>> 
>>>> I am attaching a screenshot of what I am aiming to implement.
>>>> What are your inputs on implementing this?
>>>> Would you say that if I refer to the file you have mentioned, it should be doable, considering a single-threaded environment?
>>>> 
>>>> Thanks a lot again!
>>>> 
>>>> 
>>>> On Feb 19, 2018 10:12 PM, "Shangtong Zhang" <zhangshangtong.cpp at gmail.com> wrote:
>>>> Hi Chirag,
>>>> 
>>>> I think it would be better to also cc the mail list.
>>>> 
>>>> I assume you are trying to implement A3C or something like this.
>>>> Actually this has almost been done; see my PR https://github.com/mlpack/mlpack/pull/934
>>>> This is my work from last summer. To compute the gradient, you can use
>>>> src/mlpack/methods/ann/layer/policy.hpp <https://github.com/mlpack/mlpack/pull/934/files#diff-eadb67b1609095b00a8abbc908c7ef35>
>>>> And there is also an actor_critic worker to show how to use this.
>>>> 
>>>> The most annoying thing is that it doesn't work and I don't know why. Marcus and I tried hard but didn't find any obvious logic bug.
>>>> So if you want to implement A3C, I think the simplest way is to find the bug.
>>>> I have some hints for you:
>>>> 1. Even if we don't have shared layers between the actor and the critic, A3C should work well on a small task like CartPole. If you do want shared layers, you need to look into https://github.com/mlpack/mlpack/pull/1091 (I highly recommend not doing this first, as it is not critical).
>>>> 2. I believe the bug may lie in the async mechanism, so it's difficult to debug (it's possible I'm wrong). A good practice, I think, is to implement A2C and the corresponding PPO, which I believe is the state-of-the-art technique. You can implement a vectorized environment, i.e. the interaction with the environment is parallelized and synchronous, while the optimization occurs on a single thread (a rough sketch of this stepping loop follows after these hints). See OpenAI baselines (TensorFlow, https://github.com/openai/baselines) or my A2C (PyTorch, https://github.com/ShangtongZhang/DeepRL/blob/master/agent/A2C_agent.py) to see how this idea works. I believe it's much easier to implement and debug. Once you implement the vectorized environment, it's easy to plug in all the algorithms, e.g. one-step/n-step Q-learning, n-step Sarsa, actor-critic and PPO. From my experience, if tuned properly, the speed is comparable to fully async implementations.
>>>> 3. If you do want A3C and want to find that bug, I think you can implement actor-critic with experience replay first, to verify that it works in the single-threaded case. (Note this is theoretically wrong: to do it properly you need off-policy actor-critic. In practice you can just ignore the importance sampling ratio and treat the data in the buffer as on-policy; it should work and is enough to check the implementation on a small task like CartPole.)
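>>>> To make the vectorized-environment idea in hint 2 concrete, here is a minimal toy sketch (Env is a made-up placeholder with dummy dynamics, not an mlpack class; the real agent would run one batched forward pass over the stacked observations at every step):
>>>>
>>>> #include <armadillo>
>>>> #include <vector>
>>>>
>>>> // A fake 4-dimensional environment, only to illustrate the stepping pattern.
>>>> struct Env
>>>> {
>>>>   arma::vec state = arma::vec(4, arma::fill::zeros);
>>>>   arma::vec Reset() { state.zeros(); return state; }
>>>>   arma::vec Step(size_t /* action */) { state += 0.01; return state; }
>>>> };
>>>>
>>>> int main()
>>>> {
>>>>   std::vector<Env> envs(16);               // 16 synchronous "workers"
>>>>   arma::mat obs(4, envs.size());           // observations stacked as columns
>>>>   for (size_t i = 0; i < envs.size(); ++i)
>>>>     obs.col(i) = envs[i].Reset();
>>>>
>>>>   for (size_t t = 0; t < 100; ++t)
>>>>   {
>>>>     // One batched forward pass over obs would choose an action per column.
>>>>     for (size_t i = 0; i < envs.size(); ++i)
>>>>       obs.col(i) = envs[i].Step(0);
>>>>     // Collect n steps of (observation, action, reward), then run a single
>>>>     // synchronous actor-critic / PPO update on this same thread.
>>>>   }
>>>>   return 0;
>>>> }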
>>>> 
>>>> BTW, your understanding of how forward and backward work in DQN is absolutely right.
>>>> 
>>>> Hope this can help,
>>>> 
>>>> Best regards,
>>>> 
>>>> Shangtong Zhang,
>>>> Second year graduate student,
>>>> Department of Computing Science,
>>>> University of Alberta
>>>> Github <https://github.com/ShangtongZhang> | Stackoverflow <http://stackoverflow.com/users/3650053/slardar-zhang>
>>>>> On Feb 19, 2018, at 00:58, Chirag Ramdas <chiragramdas at gmail.com> wrote:
>>>>> 
>>>>> I think I can probably write a custom compute_gradients() method for my backprop here, but I wanted to know if mlpack's implementation provides something similar to a convenient Forward() + Backward() pair which I can use for my requirements here.
>>>>> 
>>>>> 
>>>>> 
>>>>> Yours Sincerely,
>>>>> 
>>>>> Chirag Pabbaraju,
>>>>> B.E.(Hons.) Computer Science Engineering,
>>>>> BITS Pilani K.K. Birla Goa Campus,
>>>>> Off NH17B, Zuarinagar,
>>>>> Goa, India
>>>>> chiragramdas at gmail.com | +91-9860632945
>>>>> 
>>>>> On Mon, Feb 19, 2018 at 1:26 PM, Chirag Ramdas <chiragramdas at gmail.com> wrote:
>>>>> Hello,
>>>>> 
>>>>> I had an implementation question to ask. From the neural network implementation I saw (ffn_impl), e.g. lines 146-156 <https://github.com/chogba/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp#L146-L156>, you first forward the network on the states and see what output (Q value) it gives for each action. Thereafter, you update the targets for the actions you actually saw from your experience replay mechanism, and this updated target matrix now behaves like the labels you want the neural net to predict. Now, I saw from the q_learning_test.hpp file that you are initialising the FFN with MeanSquaredError, so I am assuming that if you pass this target matrix to learningNetwork.Backward(), it computes the gradients of the mean squared error with respect to all the parameters. Thereafter, with these gradients and the optimizer you have specified (e.g. Adam), updater.Update() updates the parameters of the network.
>>>>> Do correct me if I was wrong anywhere.
>>>>> 
>>>>> So now my question is: I am faced with a custom optimisation function, and I am required to compute gradients of this function with respect to each of the parameters of my neural net. The Forward() + Backward() pair called in the above implementation required me to supply 1) what my network computes for an input and 2) what I believe it should have computed, and it then computes the gradients by itself. But I simply have an objective function (no notion of what the network should have computed, i.e. labels) and correspondingly an update rule which I want to follow.
>>>>> 
>>>>> Precisely, I have a policy function pi which is approximated by a neural net parameterised by theta and which outputs the probabilities of performing each action given a state. Now, I want the following update rule for the parameters:
>>>>> 
>>>>> <Screen Shot 2018-02-19 at 1.10.32 PM.png>
>>>>> 
>>>>> 
>>>>> Basically, I am asking if I can have my neural net optimise an objective function which I specify myself, in some form.
>>>>> I looked at the implementation of ffn, but I couldn't figure out how I could do this. I hope my question was clear.
>>>>> 
>>>>> Thanks a lot!
>>>>> 
>>>>> Yours Sincerely,
>>>>> 
>>>>> Chirag Pabbaraju,
>>>>> B.E.(Hons.) Computer Science Engineering,
>>>>> BITS Pilani K.K. Birla Goa Campus,
>>>>> Off NH17B, Zuarinagar,
>>>>> Goa, India
>>>>> chiragramdas at gmail.com | +91-9860632945
>>>>> 
>>>> 
>>>> <Screenshot_20180218-212917.jpg>
>>> 
>> 
> 
> 
> 
> 
