[Reinforcement Learning] Study Note

Prediction: estimating the expected total return under a given policy
Control: finding a policy that maximizes the expected total return

Classical Reinforcement Learning

Dynamic Programming Methods

Policy Evaluation: prediction by iterating the Bellman Expectation Equation until the update falls within an error bound epsilon
Policy Iteration: policy evaluation (Bellman Expectation Equation) + greedy policy improvement, repeated until convergence
Value Iteration: repeatedly apply the Bellman Optimality Equation backup until the value function converges, then extract the greedy policy

* DP is a model-based approach: we assume complete knowledge of the environment, such as transition probabilities and rewards. A minimal value-iteration sketch follows.
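
To make the model-based setting concrete, here is a minimal value-iteration sketch on a made-up 3-state, 2-action MDP; the transition matrix P and reward table R below are purely illustrative.

```python
import numpy as np

# Made-up MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
# DP assumes both are known in advance.
gamma, tol = 0.9, 1e-6
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [0.0, 0.0]])

V = np.zeros(3)
while True:
    Q = R + gamma * (P @ V)        # Bellman optimality backup for every (s, a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < tol:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy extracted from the converged values
print(V, policy)
```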

Monte Carlo Methods & Temporal-Difference Methods

                      On-policy      Off-policy
Monte Carlo           Monte Carlo    Monte Carlo
Temporal-Difference   SARSA          Q-Learning

* MC and TD are model-free approaches: we estimate the value functions directly from interaction with the environment, without a model of its dynamics.
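
A tiny sketch of the two TD control updates from the table above, in tabular form; the transition (s, a, r, s_next, a_next) is a made-up sample, whereas in practice it comes from interacting with the environment.

```python
import numpy as np

alpha, gamma = 0.1, 0.99
Q_sarsa = np.zeros((5, 2))    # 5 states, 2 actions (arbitrary sizes)
Q_qlearn = np.zeros((5, 2))

s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0   # one sampled transition

# SARSA (on-policy): bootstrap from the action the current policy actually takes next.
Q_sarsa[s, a] += alpha * (r + gamma * Q_sarsa[s_next, a_next] - Q_sarsa[s, a])

# Q-learning (off-policy): bootstrap from the greedy next action, whatever is actually taken.
Q_qlearn[s, a] += alpha * (r + gamma * Q_qlearn[s_next].max() - Q_qlearn[s, a])
```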

The biggest breakthrough with one of these methods, I think, was probably DQN, which gave the field a good nudge and led to the current surge of interest in reinforcement learning.

Value-based to Policy-based

All the methods above are value-based: V is the state-value function, Q is the action-value function, and we have been deriving a policy from them. DQN is also a value-based method; it essentially augments Q-learning with neural networks for the behavior and target functions.
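
A rough sketch of that target/behavior split, with a linear Q function standing in for the real neural networks and a made-up replay-buffer transition:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.99, 0.1

W_behavior = rng.normal(size=(n_states, n_actions))   # updated every step
W_target = W_behavior.copy()                          # frozen, synced every C steps

def q(W, s):
    return W[s]   # Q(s, .) for a tabular state index

s, a, r, s_next, done = 0, 1, 1.0, 2, False           # transition sampled from a replay buffer

# The TD target comes from the frozen target network; the behavior network is
# regressed toward it (here a single tabular/linear gradient step).
td_target = r + (0.0 if done else gamma * np.max(q(W_target, s_next)))
W_behavior[s, a] += alpha * (td_target - q(W_behavior, s)[a])

# ...and every C steps: W_target = W_behavior.copy()
```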

Policy-based methods have several advantages: better convergence properties, suitability for continuous action spaces, and the ability to learn stochastic policies. On the other hand, they can get stuck in a local optimum, are rather inefficient to evaluate, and suffer from high variance.

Recent research seems to be making policy-based methods more sample efficient, lower in variance, and able to reach better optima.

The following are some of the main categories.

Finite Difference Policy Gradient

This is a numerical method: we perturb each parameter by a small epsilon and re-evaluate the objective to estimate that partial derivative, so computing the policy gradient in n dimensions takes n evaluations.
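
A small sketch of the idea; evaluate_return is a stand-in for running rollouts with the given policy parameters (here a toy objective is used so the script runs on its own).

```python
import numpy as np

def evaluate_return(theta):
    # Placeholder for "run episodes with parameters theta and average the return".
    return -np.sum((theta - 1.0) ** 2)

def finite_difference_gradient(theta, eps=1e-4):
    # n perturbed evaluations (plus one baseline) give an n-dimensional gradient estimate.
    base = evaluate_return(theta)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        perturbed = theta.copy()
        perturbed[i] += eps
        grad[i] = (evaluate_return(perturbed) - base) / eps
    return grad

theta = np.zeros(3)
print(finite_difference_gradient(theta))   # roughly [2, 2, 2] for the toy objective
```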

Monte Carlo Policy Gradient (REINFORCE)

This is the analytical way to get the gradient, where we take the gradient of the objective function directly (assuming the policy is differentiable).

We derive it using the standard likelihood-ratio (log-derivative) trick, sketched below.
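
A sketch of the derivation, where R(τ) is the return of a trajectory τ sampled from π_θ and G_t is the return from time t:

\begin{align}
\nabla_\theta J(\theta)
  &= \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
   = \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau
   = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right] \\
  &= \mathbb{E}_{\pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]
\end{align}

The environment dynamics inside log π_θ(τ) do not depend on θ, so only the policy terms survive, and rewards earned before time t can be dropped without changing the expectation. REINFORCE then estimates the result with Monte Carlo returns G_t from sampled episodes.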

One more design choice is how to represent the stochastic policy. This is typically done with either (1) a softmax policy for discrete actions or (2) a Gaussian policy for continuous actions.
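
A small sketch of both parameterizations with linear features; the feature vector and parameter shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, phi_s):
    # (1) Softmax over discrete actions, with linear preferences theta[a] . phi(s).
    prefs = theta @ phi_s
    probs = np.exp(prefs - prefs.max())
    return probs / probs.sum()              # pi(a | s)

def gaussian_policy_sample(w, phi_s, sigma=0.5):
    # (2) Gaussian policy for a continuous action: mean linear in features, fixed sigma.
    mu = w @ phi_s
    return rng.normal(mu, sigma)            # a ~ N(mu(s), sigma^2)

phi_s = np.array([1.0, 0.5])                # toy state features
theta = rng.normal(size=(3, 2))             # 3 discrete actions, 2 features
w = rng.normal(size=2)

print(softmax_policy(theta, phi_s))         # action probabilities
print(gaussian_policy_sample(w, phi_s))     # sampled continuous action
```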

Actor-Critic Policy Gradient

The critic approximates the Q function (with parameters w) and the actor approximates the policy (with parameters theta).
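
A minimal one-step actor-critic sketch with linear function approximation; the features and the single transition are made up, and it is only meant to show where w and theta get updated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
alpha_w, alpha_theta, gamma = 0.1, 0.01, 0.99

w = np.zeros((n_features, n_actions))       # critic: Q_w(s, a) = phi(s) . w[:, a]
theta = np.zeros((n_features, n_actions))   # actor: softmax over phi(s) . theta[:, a]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# One assumed transition (phi_s, a, r, phi_s2) plus the next action a2 drawn from the actor.
phi_s, phi_s2 = rng.normal(size=n_features), rng.normal(size=n_features)
a, r = 1, 1.0
probs = softmax(phi_s @ theta)
a2 = rng.choice(n_actions, p=softmax(phi_s2 @ theta))

# Critic: TD(0) update of Q_w toward the one-step target.
td_error = r + gamma * (phi_s2 @ w)[a2] - (phi_s @ w)[a]
w[:, a] += alpha_w * td_error * phi_s

# Actor: policy-gradient step, scaling grad log pi(a|s) by the critic's Q estimate.
grad_log_pi = -np.outer(phi_s, probs)       # d log pi / d theta for a linear softmax policy
grad_log_pi[:, a] += phi_s
theta += alpha_theta * (phi_s @ w)[a] * grad_log_pi
```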

Reference

Sutton’s Book: http://incompleteideas.net/book/the-book.html
PPO paper: https://arxiv.org/pdf/1707.06347.pdf
Policy Gradient Algorithms: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#reinforce
