Policy gradient methods are a class of reinforcement learning algorithms that take the parameters of a policy and update them according to the reward received. At a high level, it’s a very “dumb” algorithm: encourage every action which led to a good outcome, and discourage actions which led to bad outcomes. You don’t know how good each individual action is, but if you optimize over the outcome as a whole, then eventually even a simple model will be able to learn which actions are good and when.
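As a rough sketch of the objective these notes build toward (notation assumed here, not fixed by the notes: a policy $\pi_\theta$ with parameters $\theta$, trajectories $\tau$ sampled by running the policy, and a return $R(\tau)$ per trajectory), the goal is plain gradient ascent on expected return:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta).
$$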
These are my rough working notes, but I highly recommend you go read the resources in the references.
General outline:
- you want to take gradient steps in the right direction, i.e. update the policy for higher expected reward
- show how to derive the log-derivative (grad-log-prob) trick (sketched after this list)
- why do you even do it that way?
- so that we can do Monte Carlo estimation?
- note that the reward is not guaranteed (returns are noisy samples)
- EGLP (expected grad-log-prob) lemma
- extensions: baseline function and anything else?
- in extensions, also show rewards to go
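A sketch of the derivation the outline points at, under the same assumed notation (with $P(\tau \mid \theta)$ the trajectory distribution induced by the policy and the dynamics): the log-derivative trick, $\nabla_\theta P = P \, \nabla_\theta \log P$, turns the gradient of an expectation into an expectation of a gradient, which is exactly what makes Monte Carlo estimation possible:

$$
\nabla_\theta J(\theta)
= \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau
= \int P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \big].
$$

Because the dynamics don’t depend on $\theta$, $\nabla_\theta \log P(\tau \mid \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, so the expectation can be approximated by averaging over sampled trajectories. The EGLP lemma, $\mathbb{E}_{x \sim P_\theta}\big[\nabla_\theta \log P_\theta(x)\big] = 0$, is what justifies subtracting a baseline $b(s_t)$ without biasing the gradient, and together with rewards-to-go it gives the lower-variance estimator

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)
\Big( \sum_{t' \ge t} r_{t'}^{(i)} - b\big(s_t^{(i)}\big) \Big).
$$

And as a minimal code sketch of that estimator for a single trajectory (a hypothetical helper, assuming PyTorch only so autograd can differentiate through the log-probs):

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                   baseline: torch.Tensor | None = None) -> torch.Tensor:
    """Surrogate loss whose gradient is the single-trajectory policy gradient estimate.

    logprobs: shape (T,), log pi_theta(a_t | s_t), must carry gradients w.r.t. theta
    rewards:  shape (T,), observed per-step rewards
    baseline: optional shape (T,) baseline b(s_t), subtracted to reduce variance
    """
    # rewards-to-go: sum of rewards from step t to the end of the trajectory
    rewards_to_go = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    weights = rewards_to_go if baseline is None else rewards_to_go - baseline
    # minimizing this loss performs gradient ascent on expected return
    return -(logprobs * weights.detach()).sum()
```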
TODO
- read the GAE (generalized advantage estimation) paper
- eventually: extend to TRPO and other methods