Online: the model is updated as data comes in, so you keep getting more data, usually under timing constraints.

Offline: you have a static dataset. You acquire a large corpus up front and train a classifier on it.

https://stats.stackexchange.com/questions/897/online-vs-offline-learning
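To make the distinction concrete, here is a minimal sketch using a perceptron. The perceptron, the helper names (`train_offline`, `train_online`) and the toy data are all assumptions for illustration, not something taken from the linked thread. The only difference between the two modes is whether the learner can sweep a fixed dataset repeatedly or has to update once per sample as it arrives.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label in {-1, +1})


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    """Sign of the linear score, with ties going to -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


def perceptron_update(w: List[float], x: Sequence[float], y: int) -> None:
    """Update the weights in place if (x, y) is misclassified."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score <= 0:
        for i, xi in enumerate(x):
            w[i] += y * xi


def train_offline(data: List[Sample], n_features: int, epochs: int = 20) -> List[float]:
    """Offline: the whole dataset is available, so we can sweep it many times."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            perceptron_update(w, x, y)
    return w


def train_online(stream: Iterable[Sample], n_features: int) -> Iterator[List[float]]:
    """Online: each sample is seen once, as it arrives; yield weights after each one."""
    w = [0.0] * n_features
    for x, y in stream:
        perceptron_update(w, x, y)
        yield list(w)


# Toy data: first feature is a constant bias, label is the sign of the second.
data = [((1.0, 2.0), 1), ((1.0, -1.0), -1), ((1.0, 3.0), 1), ((1.0, -2.0), -1)]
offline_weights = train_offline(data, n_features=2)
online_snapshots = list(train_online(iter(data), n_features=2))
```

In the online case a usable model exists after every sample, which is what matters when predictions are needed under timing constraints.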

Some details need more research, but works for now!

- **MuJoCo** is an advanced physics simulation tool, developed by Emo Todorov for Roboti LLC
- **mujoco-py** is the glue that lets people use MuJoCo in python, developed by the OpenAI Robotics team
- **OpenAI Gym** is a modular python package that provides environments for RL, developed by OpenAI

__Get License__

- Download **mjpro150_linux.zip** from https://www.roboti.us/index.html
- Get 1 year “Personal License” or 30 days “Trial License”
- Download **mjkey.txt** that will be emailed

__Install MuJoCo__

- Make directory in root named **/.mujoco** (in root for convenience in integration with mujoco-py)
- Unzip **mjpro150_linux.zip** to **/.mujoco** which will make **/.mujoco/mjpro150/**
- Save **mjkey.txt** to **/.mujoco** and **/.mujoco/mjpro150/bin/** (the copy in **/.mujoco/mjpro150/bin/** is for **./simulate**, used to test that MuJoCo works)
- Add environment variables to **~/.bashrc**

```
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/.mujoco/mjpro150/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib/nvidia-384"
```

__Test MuJoCo__

- Go to **/.mujoco/mjpro150/bin**
- Run **./simulate**

__Install mujoco-py__

- Clone mujoco-py from https://github.com/openai/mujoco-py
- Install to python3 (using **sudo python3 -m pip install -e . --no-cache**)
- Add environment variables to **~/.bashrc**

```
export MUJOCO_PY_MJPRO_PATH="/.mujoco/mjpro150"
export MUJOCO_PY_MUJOCO_PATH="/.mujoco"
export MUJOCO_PY_MJKEY_PATH="/.mujoco/mjkey.txt"
```
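Before moving on, it can save time to check that these variables are set and point at real paths. The script below is only a sketch: `find_missing_paths` is a hypothetical helper written for this post, not part of mujoco-py, and the defaults simply mirror the paths used above.

```python
import os
from typing import Dict, List, Mapping

# Variable names come from the export lines above; the example values mirror
# the paths used in this post and are assumptions about your layout.
EXPECTED_PATHS: Dict[str, str] = {
    "MUJOCO_PY_MJPRO_PATH": "/.mujoco/mjpro150",
    "MUJOCO_PY_MUJOCO_PATH": "/.mujoco",
    "MUJOCO_PY_MJKEY_PATH": "/.mujoco/mjkey.txt",
}


def find_missing_paths(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return one message per variable that is unset or points nowhere."""
    problems = []
    for name, example in EXPECTED_PATHS.items():
        path = env.get(name)
        if path is None:
            problems.append(f"{name} is not set (expected something like {example})")
        elif not os.path.exists(path):
            problems.append(f"{name} points to a missing path: {path}")
    return problems


if __name__ == "__main__":
    for line in find_missing_paths() or ["All MuJoCo paths look fine."]:
        print(line)
```

Remember to open a new shell (or `source ~/.bashrc`) first, otherwise the exports will not be visible to the script.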

__Install OpenAI Gym__

I recommend installing OpenAI Gym after MuJoCo and mujoco-py.

- Install to python3 (using **sudo python3 -m pip install gym[ALL]**)

- Test using python script (if this works, you are good to go)

```
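import gym

# Humanoid-v2 is one of the MuJoCo-backed environments
env = gym.make('Humanoid-v2')
env.reset()
for _ in range(1000):
    env.render()
    # Take a random action just to exercise the simulator
    action = env.action_space.sample()
    env.step(action)
env.close()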
```

https://deeprobotics.wordpress.com/2018/01/23/installing-mujoco-and-integrating-it-with-python-on-ubuntu/

https://github.com/openai/mujoco-py/issues/190

Prediction: estimating the expected total return under a given policy

Control: finding the policy that maximizes the expected total return

__Dynamic Programming Methods__

Policy Evaluation: prediction by iterating the Bellman Expectation Equation until the values change by less than an error bound epsilon

Policy Iteration: policy evaluation (Bellman Expectation Equation) + greedy policy update until convergence

Value Iteration: control by iterating the Bellman Optimality Equation, which takes the max over actions at every backup

* DP is a model-based approach, where we have complete knowledge of the environment, such as transition probabilities and rewards
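The three DP methods above can be sketched in a few lines on a toy problem. Everything specific here is an assumption for illustration: the 2-state, 2-action MDP `P`, the discount `GAMMA`, and the function names. The point is only to show where the Bellman expectation backup (fixed policy) and the Bellman optimality backup (max over actions) appear.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP: P[s][a] = [(probability, next_state, reward), ...]
Model = Dict[int, Dict[int, List[Tuple[float, int, float]]]]
P: Model = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9


def q_value(P: Model, V: Dict[int, float], s: int, a: int, gamma: float = GAMMA) -> float:
    """One-step lookahead using the known model."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P: Model, policy: Dict[int, int], gamma: float = GAMMA, eps: float = 1e-8):
    """Bellman expectation backups until the largest change is below eps."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:
            return V


def greedy_policy(P: Model, V: Dict[int, float], gamma: float = GAMMA) -> Dict[int, int]:
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}


def policy_iteration(P: Model, gamma: float = GAMMA):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma)
        improved = greedy_policy(P, V, gamma)
        if improved == policy:
            return policy, V
        policy = improved


def value_iteration(P: Model, gamma: float = GAMMA, eps: float = 1e-8):
    """Bellman optimality backups: take the max over actions at every sweep."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:
            return greedy_policy(P, V, gamma), V
```

On this toy MDP both `policy_iteration(P)` and `value_iteration(P)` should settle on moving from state 0 to state 1 and then staying there, since staying in state 1 pays the highest reward per step.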

__Monte Carlo Methods & Temporal-Difference Methods__

| | On-policy | Off-policy |
|---|---|---|
| Monte Carlo | Monte Carlo | |
| Temporal-Difference | SARSA | Q-Learning |

* MC and TD are model-free approaches, where we just focus on figuring out the value functions directly from the interactions with the environment
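The on-policy versus off-policy split in the TD row comes down to one term in the update target. A minimal tabular sketch follows; the function names, the `defaultdict` Q-table, and the step size and discount values are assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def sarsa_update(Q: QTable, s, a, r: float, s_next, a_next,
                 alpha: float = 0.1, gamma: float = 0.9) -> None:
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q: QTable, s, a, r: float, s_next, actions: Sequence,
                      alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Off-policy: bootstrap from the greedy action, whatever is taken next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def make_table() -> QTable:
    """Fresh Q-table with two known values in state "B"."""
    Q: QTable = defaultdict(float)
    Q[("B", "left")] = 1.0
    Q[("B", "right")] = 5.0
    return Q


# Same transition, but the agent explores "left" in state "B".
Q_sarsa, Q_qlearn = make_table(), make_table()
sarsa_update(Q_sarsa, "A", "right", 0.0, "B", "left")
q_learning_update(Q_qlearn, "A", "right", 0.0, "B", ["left", "right"])
```

SARSA's estimate for `("A", "right")` reflects the exploratory action that was taken, while Q-Learning's reflects the best action available in `"B"`.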

The biggest breakthrough with one of these methods, I think, was probably DQN, which gave a good nudge that has now led to great interest in reinforcement learning.

All the above methods are value-based methods. V is the state-value function and Q is the action-value function, and we have been using them to build a policy. **DQN is also a value-based method**, which just adds neural networks for the target and behavior functions.

Using policy-based methods has several advantages: better convergence, being well-suited for continuous actions, and the possibility of a stochastic policy. On the other hand, they can fall into a local optimum, and evaluating a policy is rather inefficient and has high variance.

Recent research seems to have been making policy-based methods more sample efficient, lower in variance, and better at reaching good optima.

The following are some of the categories.

__Finite Difference Policy Gradient__

This is a **numerical method** where we tweak each parameter by epsilon to evaluate derivatives, taking n evaluations to compute policy gradient in n dimensions.
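A minimal sketch of the idea, assuming a forward difference and a deterministic stand-in objective. In practice `J` would be a noisy estimate of return from rollouts of the policy with parameters `theta`; the name `toy_objective` and the value of `eps` are illustrative only.

```python
from typing import Callable, List, Sequence


def finite_difference_gradient(J: Callable[[Sequence[float]], float],
                               theta: Sequence[float],
                               eps: float = 1e-4) -> List[float]:
    """Estimate dJ/dtheta by perturbing one parameter at a time.

    Uses one baseline evaluation plus one perturbed evaluation per dimension.
    """
    base = J(theta)
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps
        grad.append((J(perturbed) - base) / eps)
    return grad


def toy_objective(theta: Sequence[float]) -> float:
    """Hypothetical stand-in for the policy objective, maximized at (1, -2)."""
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2


grad = finite_difference_gradient(toy_objective, [0.0, 0.0])
```

Note that nothing here requires the policy to be differentiable, which is the main appeal of the method; the cost is that the number of evaluations grows with the number of parameters.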

__Monte Carlo Policy Gradient (REINFORCE)__

This is the **analytical way** to get the gradient, where we take the gradient of the objective function directly (assuming the policy is differentiable).

We derive it the following way

[include derivation]
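One common route is the likelihood-ratio (score function) trick, sketched here; it may differ in notation from the derivation originally intended for this spot. The symbols are assumptions: $\tau$ is a trajectory $(s_0, a_0, \dots, s_T)$, $R(\tau)$ its total return, and $p_\theta(\tau)$ its probability under the policy $\pi_\theta$.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
           = \sum_{\tau} p_\theta(\tau)\, R(\tau) \\
\nabla_\theta J(\theta)
  &= \sum_{\tau} \nabla_\theta p_\theta(\tau)\, R(\tau)
   = \sum_{\tau} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right] \\
\log p_\theta(\tau)
  &= \log p(s_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(a_t \mid s_t)
     + \log p(s_{t+1} \mid s_t, a_t) \right] \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1}
     \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\end{aligned}
```

The initial-state and transition terms do not depend on $\theta$, so they drop out of the gradient, which is why no model of the environment is needed. REINFORCE then samples this expectation and uses the return from step $t$ onward, $G_t$, in place of $R(\tau)$ (an action cannot affect rewards that came before it), giving the update $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$.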

One more design choice is how to express the stochastic policy. This is usually done with either (1) a softmax policy, for discrete actions, or (2) a Gaussian policy, for continuous actions.
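Both choices give a policy that is easy to sample from and has a simple gradient of log-probability (the score), which is the quantity the policy gradient needs. The sketch below assumes the simplest parameterization: one preference value per action for softmax, and a mean with a fixed standard deviation for the Gaussian. The function names are illustrative.

```python
import math
import random
from typing import List, Sequence


def softmax_probs(prefs: Sequence[float]) -> List[float]:
    """Action probabilities from per-action preferences (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_softmax(prefs: Sequence[float], rng: random.Random) -> int:
    return rng.choices(range(len(prefs)), weights=softmax_probs(prefs))[0]


def softmax_score(prefs: Sequence[float], action: int) -> List[float]:
    """Gradient of log pi(action) w.r.t. each preference: 1[b == action] - pi(b)."""
    probs = softmax_probs(prefs)
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]


def sample_gaussian(mu: float, sigma: float, rng: random.Random) -> float:
    return rng.gauss(mu, sigma)


def gaussian_score_mu(action: float, mu: float, sigma: float) -> float:
    """Gradient of log N(action; mu, sigma^2) w.r.t. mu, with sigma held fixed."""
    return (action - mu) / sigma ** 2


rng = random.Random(0)
discrete_action = sample_softmax([0.0, 1.0, 2.0], rng)
continuous_action = sample_gaussian(0.0, 1.0, rng)
```

With function approximation the preferences and the mean would be outputs of a model of the state, and the scores above would be chained through that model's gradient.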

__Actor-Critic Policy Gradient__

The Critic approximates the Q function (with parameters w) and the Actor approximates the policy (with parameters theta).
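A minimal sketch of one update, under simplifying assumptions: both the critic and the actor are tabular (one entry of `w` and one entry of `theta` per state-action pair), the actor is a softmax over `theta[s]`, and the critic learns with a TD(0) rule. The function name and the step sizes `alpha` and `beta` are illustrative.

```python
import math
from typing import Dict, List


def softmax(prefs: List[float]) -> List[float]:
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta: Dict[int, List[float]], w: Dict[int, List[float]],
                      s: int, a: int, r: float, s_next: int, a_next: int,
                      alpha: float = 0.1, beta: float = 0.5,
                      gamma: float = 0.9) -> float:
    """One update from a transition (s, a, r, s_next, a_next); returns the TD error."""
    q_sa = w[s][a]
    # Critic: move Q(s, a) toward the one-step TD target.
    td_error = r + gamma * w[s_next][a_next] - q_sa
    w[s][a] += beta * td_error
    # Actor: step along grad log pi(a|s), scaled by the critic's Q(s, a).
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha * grad_log_pi * q_sa
    return td_error


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
w = {0: [1.0, 0.0], 1: [2.0, 0.0]}
delta = actor_critic_step(theta, w, s=0, a=0, r=1.0, s_next=1, a_next=0)
```

Compared with REINFORCE, the critic's estimate replaces the sampled return, which trades some bias for lower variance.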

Sutton’s Book: http://incompleteideas.net/book/the-book.html

PPO paper: https://arxiv.org/pdf/1707.06347.pdf

Policy Gradient Algorithms: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#reinforce