[mlpack] Fwd: Re: Doubt in deep Q network implementation in mlpack

Chirag Ramdas chiragramdas at gmail.com
Thu Feb 15 15:59:44 EST 2018


---------- Forwarded message ----------
From: "Chirag Ramdas" <chiragramdas at gmail.com>
Date: Feb 16, 2018 2:20 AM
Subject: Fwd: Re: Doubt in deep Q network implementation in mlpack
To: <mlpack at lists.mlpack.org>
Cc:


---------- Forwarded message ----------
From: "Shangtong Zhang" <zhangshangtong.cpp at gmail.com>
Date: Feb 16, 2018 12:33 AM
Subject: Re: Doubt in deep Q network implementation in mlpack
To: "Chirag Ramdas" <chiragramdas at gmail.com>
Cc: <mlpack at lists.mlpack.org>

Hi Chirag,

I think it’s better to also cc the mailing list.

For your questions:
1) Yes.
2) deterministic is a flag to distinguish training mode from test mode. If
it’s false, which indicates we are training, the policy is
*epsilon-greedy* and we do experience replay to update the network. If it’s
true, which indicates we are testing, the policy is *greedy* and we
don’t update the network; in that case we only want to evaluate the learned
Q-network.
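
To make the two modes concrete, the action selection behaves roughly like
the following self-contained sketch (SelectAction, epsilon, etc. are
illustrative names here, not the actual mlpack API):

#include <cstddef>
#include <random>
#include <vector>

// Pick an action from a vector of estimated Q-values. In test mode
// (deterministic == true) we always act greedily; in training mode we act
// epsilon-greedily so the agent keeps exploring.
std::size_t SelectAction(const std::vector<double>& actionValues,
                         const bool deterministic,
                         const double epsilon,
                         std::mt19937& rng)
{
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  if (!deterministic && coin(rng) < epsilon)
  {
    // Explore: pick a uniformly random action.
    std::uniform_int_distribution<std::size_t> pick(0, actionValues.size() - 1);
    return pick(rng);
  }

  // Exploit: greedy action, i.e. argmax over the Q-values.
  std::size_t best = 0;
  for (std::size_t a = 1; a < actionValues.size(); ++a)
    if (actionValues[a] > actionValues[best])
      best = a;
  return best;
}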
3)

Is learningNetwork only changing its parameters during the experience
replay part of Step()?

Yes, it only happens at this line:
https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/reinforcement_learning/q_learning_impl.hpp#L156
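
Conceptually, that line performs one DQN-style update of the learning
network toward the one-step TD target. A rough sketch of the target
computation (with made-up names, not the real code):

#include <algorithm>
#include <vector>

// One-step TD target for a (s, a, r, s') transition, DQN-style:
//   y = r + gamma * max_a' Q_target(s', a')   (just r if s' is terminal)
// The learning network is then nudged toward y for the action taken.
double TdTarget(const double reward,
                const std::vector<double>& nextActionValues,  // Q_target(s', .)
                const double gamma,
                const bool terminal)
{
  if (terminal)
    return reward;
  const double maxNext =
      *std::max_element(nextActionValues.begin(), nextActionValues.end());
  return reward + gamma * maxNext;
}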

In the sense that learning (parameter updates) only starts once totalSteps
has exceeded config.ExplorationSteps(), right?

Yes, before that it doesn’t do any learning; it only collects experience.
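
Schematically, the gating inside each step looks something like this (an
illustrative skeleton only; Config, StoreTransition, and UpdateNetwork are
placeholder names):

#include <cstddef>

// Placeholder types, not the real mlpack ones.
struct Config { std::size_t explorationSteps = 10000; };

struct Agent
{
  Config config;
  std::size_t totalSteps = 0;
  bool deterministic = false;

  void StoreTransition() { /* Push (s, a, r, s') into the replay buffer. */ }
  void UpdateNetwork()   { /* Sample a minibatch, take one gradient step. */ }

  void Step()
  {
    ++totalSteps;
    StoreTransition();  // Experience is collected from the very first step.

    // Test mode never updates the network, and training mode waits until
    // the warm-up phase is over before starting experience replay.
    if (deterministic || totalSteps < config.explorationSteps)
      return;

    UpdateNetwork();  // The only place learningNetwork's parameters change.
  }
};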

I hope my answers help; feel free to reach out if you have any more
questions about the code.

Best regards,

Shangtong Zhang,
Second year graduate student,
Department of Computing Science,
University of Alberta
Github <https://github.com/ShangtongZhang> | Stackoverflow
<http://stackoverflow.com/users/3650053/slardar-zhang>

On Feb 15, 2018, at 04:48, Chirag Ramdas <chiragramdas at gmail.com> wrote:

Hello Shangtong,

I have been going through your implementation of deep Q-networks in
src/mlpack/methods/reinforcement_learning/. The code is really well
written and generalises beautifully to a variety of RL control problems.

I had a few questions; it would be great if you could clarify them.

In q_learning_impl.hpp, you maintain a totalSteps instance variable, as
well as a steps variable initialised on each call to the Episode() method.

1) Assume totalSteps exceeds config.ExplorationSteps() after a number of
initial episode runs. Thereafter, in future calls to Episode(), will every
step include an experience replay update, i.e., will every step cause a
change in the parameters of learningNetwork? (I ask because there is an
if-condition check in Step().)

2) What does the variable deterministic mean in the implementation? I see
that if deterministic is true, you never do experience replay, i.e., the
learningNetwork parameters never get changed. What, then, is happening?

3) The last question is a bit fundamental... Is learningNetwork only
changing its parameters during the experience replay part of Step()? The
parameters don't change anywhere else, right? In the sense that learning
(parameter updates) only starts once totalSteps has exceeded
config.ExplorationSteps()? Until then it's just a sequence of steps over
the initial randomly assigned parameters.

Thank you so much! Really appreciate it!


*Yours Sincerely,*

*Chirag Pabbaraju,*
*B.E.(Hons.) Computer Science Engineering,*
*BITS Pilani K.K. Birla Goa Campus,*
*Off NH17B, Zuarinagar,*
*Goa, India*
*chiragramdas at gmail.com | +91-9860632945*

