DL + RL $\subset$ Artificial General Intelligence (?)
Deep learning $\subset$ Representation learning
Replace hand-engineering with learning features
Solve problems with the right representations
Wouldn't perform object classification straight from pixels
Learn representations using general-purpose priors
Deep learning [1, 2]
Reinforcement learning [3]
Deep Q-network [4] & advantage actor-critic [5]
Assorted topics [6]
Neural networks (NNs) are powerful function approximators
NNs can learn features directly from data
Stacking layers enables learning hierarchical features
Backpropagation ≈ calculating gradients with the chain rule
Stochastic gradient descent (or variants)
Requires differentiable loss function, computational graph
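A minimal sketch of these ideas (not from the slides): a tiny two-layer network trained on a made-up regression task, with a differentiable MSE loss, gradients computed by the chain rule (backpropagation) and gradient-descent weight updates. All sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) + noise (made up for illustration).
x = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

# Two-layer network: x -> tanh(x W1 + b1) -> W2 + b2.
W1, b1 = rng.normal(scale=0.5, size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)

lr = 1e-2
for step in range(2000):
    # Forward pass (the "computational graph", written out by hand).
    h_pre = x @ W1 + b1
    h = np.tanh(h_pre)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)        # differentiable MSE loss

    # Backward pass: chain rule, layer by layer (backpropagation).
    d_y_hat = 2 * (y_hat - y) / len(x)
    dW2 = h.T @ d_y_hat
    db2 = d_y_hat.sum(axis=0)
    d_h = d_y_hat @ W2.T
    d_h_pre = d_h * (1 - h ** 2)            # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_h_pre
    db1 = d_h_pre.sum(axis=0)

    # Gradient descent step (sampling minibatches would make this stochastic).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```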
Agent interacts with a (generally stochastic) environment
and learns through trial-and-error
Agent perceives environment state $\mathbf{s}_t$ and chooses action $\mathbf{a}_t$
Performing $\mathbf{a}_t$ transitions $\mathbf{s}_t$ to $\mathbf{s}_{t+1}$ with scalar reward $r_{t+1}$
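A minimal sketch of this agent-environment loop; `Env` and `random_policy` are illustrative stand-ins, not a real API.

```python
import random

class Env:
    """Toy 1-D chain: move left/right, reward +1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                       # a in {-1, +1}
        self.s = max(0, min(5, self.s + a))
        r = 1.0 if self.s == 5 else 0.0      # scalar reward r_{t+1}
        done = self.s == 5
        return self.s, r, done

def random_policy(s):
    return random.choice([-1, +1])           # trial-and-error behaviour

env = Env()
s = env.reset()
for t in range(100):
    a = random_policy(s)                     # agent perceives s_t, chooses a_t
    s, r, done = env.step(a)                 # environment returns s_{t+1}, r_{t+1}
    if done:
        s = env.reset()
```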
Supervised learning: receive correct answer,
produce correct answer
Reinforcement learning: receive reward signal,
produce correct action?
Correct action unknown
Agent affects its own observations (no i.i.d.)
Long-range time dependencies (credit assignment)
Maximise expected return (a.k.a. value) $\mathbb{E}[R]$
Return is cumulative (discounted) reward: $R = \sum\limits_{t=0}^{T-1} \gamma^tr_{t+1}$
Discount $\gamma \in [0, 1]$ determines "far-sightedness"
If non-episodic ($T = \infty$), $\gamma \in [0, 1)$
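A small numeric check of the return definition (the reward list and $\gamma = 0.9$ are made up):

```python
def discounted_return(rewards, gamma):
    # R = sum_t gamma^t * r_{t+1}
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.5]            # r_1, ..., r_T
print(discounted_return(rewards, 0.9))    # 0.9**2 * 1 + 0.9**3 * 0.5 = 1.1745
```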
Learn a policy $\pi$ that maps states to actions
to maximise $\mathbb{E}[R]$
Optimal policy $\pi^*$ maximises $\mathbb{E}[R]$ from all states
Collect history, e.g. $\mathbf{h}_2 = \{\mathbf{s}_0, \mathbf{a}_0, r_1, \mathbf{s}_1, \mathbf{a}_1, r_2, \mathbf{s}_2\}$
RL assumes Markov decision process (MDP)
Choose $\mathbf{a}_2$ based purely on $\mathbf{s}_2$, not $\mathbf{h}_2$
State is a sufficient statistic of the future;
allows dynamic programming instead of Monte Carlo estimates
Realistic problems are usually partially observable MDPs
Receive observation $\mathbf{o}_{t+1} \sim O(\mathbf{s}_{t+1}, \mathbf{a}_t)$
Value functions: estimate the value (expected return)
of being in a given state
Policy search: directly find a policy
Actor-critic: combine a value function (critic)
with policy search (actor)
Can be combined with (learned) models in many ways,
e.g., training from simulation, model predictive control
Considering tabular value functions/policies,
i.e., $|\pi| = |\mathcal{S}| \times |\mathcal{A}|$
Define the state value function: $V^\pi(\mathbf{s}_t) = \mathbb{E}_\pi[R|\mathbf{s}_t]$
Optimal value function comes from optimal policy:
$V^*(\mathbf{s}) = V^{\pi^*}(\mathbf{s}) = \max\limits_\pi V^\pi(\mathbf{s}) \ \forall \mathbf{s}$
With the environment model, $\mathbf{s}_{t+1} \sim P(\mathbf{s}_t, \mathbf{a}_t)$,
we could use dynamic programming with $V^\pi$
Define the state-action value function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_\pi[R|\mathbf{s}_t, \mathbf{a}_t]$
If we had $Q^*$, $\pi^*(\mathbf{s}_t) = \arg\!\max\limits_{\mathbf{a}}Q^*(\mathbf{s}_t, \mathbf{a})$
$Q^\pi$ satisfies a recursive relation (Bellman equation): $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}_{t+1},\pi}\big[r_{t+1} + \gamma Q^\pi(\mathbf{s}_{t+1}, \pi(\mathbf{s}_{t+1}))\big]$
Therefore, $Q^\pi$ can be improved by bootstrapping
Can also define relative advantage of action against baseline: $A(\mathbf{s}_t, \mathbf{a}_t) = Q(\mathbf{s}_t, \mathbf{a}_t) - V(\mathbf{s}_t)$
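To make these definitions concrete, a small tabular sketch on a made-up 2-state MDP with a fixed stochastic policy: iterate the Bellman equation for $Q^\pi$, recover $V^\pi$ as the policy-weighted average, and compute the advantage.

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],        # pi[s, a]: fixed stochastic policy
               [0.2, 0.8]])

Q = np.zeros((2, 2))
for _ in range(500):              # iterate the Bellman equation to a fixed point
    V = (pi * Q).sum(axis=1)      # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
    Q = R + gamma * P @ V         # Q^pi(s, a) = R(s, a) + gamma E_{s'}[V^pi(s')]

V = (pi * Q).sum(axis=1)
A = Q - V[:, None]                # advantage of each action over the baseline V
```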
Learn from experience: $Q'(\mathbf{s}_t, \mathbf{a}_t) = Q(\mathbf{s}_t, \mathbf{a}_t) + \alpha \delta$,
where $\alpha$ is the learning rate and $\delta$ is the TD-error [7]
$\delta = Y - Q = \left(r_{t+1} + \gamma\max\limits_{\mathbf{a}}Q(\mathbf{s}_{t+1}, \mathbf{a})\right) - Q(\mathbf{s}_t, \mathbf{a}_t)$
$Y$ is reward received + discounted max Q-value of next state
Minimising $\delta$ satisfies recursive relationship
Loss is Mean Squared Error (over batch): $\mathcal{L}(\delta) = \frac{1}{N}\sum\limits_{n=1}^{N}(\delta_n)^2$
DL Note: RL updates are usually formulated for gradient ascent
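A minimal tabular Q-learning sketch of this update (the chain environment and hyperparameters are made up; in the deep setting the same TD-error is instead minimised as an MSE loss by gradient descent):

```python
import random

n_states, n_actions = 6, 2                 # toy chain: action 0 = left, 1 = right
alpha, gamma = 0.1, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0          # reward r_{t+1}
    return s_next, r, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(n_actions)                  # random behaviour (Q-learning is off-policy)
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        delta = target - Q[s][a]                         # TD-error
        Q[s][a] += alpha * delta                         # Q <- Q + alpha * delta
        s = s_next
```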
Used to get $Q^*$ from $Q^\pi$
Interleave steps of policy evaluation and policy improvement
Policy evaluation: with updated policy,
improve estimate of value function
Policy improvement: with updated value function,
improve policy
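A compact sketch of policy iteration on a made-up tabular MDP, alternating exact policy evaluation with greedy policy improvement (all numbers are illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s']
              [[0.6, 0.4], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                      # R[s, a]
              [0.5, 2.0]])
pi = np.zeros(2, dtype=int)                    # deterministic policy: s -> a

for _ in range(20):
    # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
    P_pi = P[np.arange(2), pi]                 # transition matrix under pi
    R_pi = R[np.arange(2), pi]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the updated value function.
    Q = R + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                                  # policy is stable -> optimal
    pi = new_pi
```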
Directly output actions (parameterised policy): $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$
Search methods can include black-box optimisers
such as genetic algorithms or even random search [8]
"Direct" policy methods easily allow continuous action outputs,
rather than searching for $\arg\!\max\limits_{\mathbf{a}}Q(\mathbf{s}, \mathbf{a})$
Increase the log probability of actions, weighted by reward
Score function gradient estimator (REINFORCE) [9]: $\nabla_\theta \mathbb{E}_{\mathbf{s}}[R(\mathbf{s})] = \mathbb{E}[R(\mathbf{s})\nabla_\theta\log \pi_\theta(\mathbf{s})]$
Stochastic estimation when $r$ is non-differentiable but
$\pi_\theta$ can be sampled from
For more details, see Deep Reinforcement Learning: Pong from Pixels
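A minimal REINFORCE sketch on a made-up bandit-style problem with a softmax policy: the score-function estimator weights $\nabla_\theta \log \pi_\theta$ by the sampled (noisy, non-differentiable) reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                    # logits of a softmax policy over 2 actions
true_reward = np.array([1.0, 3.0])     # made-up expected rewards
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = true_reward[a] + rng.normal()          # reward is only sampled, never differentiated
    grad_logp = -probs
    grad_logp[a] += 1.0                        # d/dtheta log pi_theta(a)
    theta += lr * r * grad_logp                # gradient ascent on E[R]
```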
Actor: policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, trained with policy gradients
Critic: state value function $V(\mathbf{s}_t)$, trained with TD-error $\delta$
2 (of several) "dimensions" in RL
Full backups require a model,
shallow backups require a value function
Previous approaches do not scale
Deep neural networks are powerful function approximators
Convolutional NNs (CNNs) for visual inputs,
Recurrent NNs (RNNs) for sequential inputs
Differentiable attention, differentiable memory, etc.
Browser demo: ReinforceJS WaterWorld
$Q(\mathbf{s}, \mathbf{a})$ is approximated by a neural net
Process raw pixels with convolutional layers
Efficient: output $Q(\mathbf{s}, \mathbf{a}) \ \forall \mathbf{a} \in \mathcal{A}$ (discrete set)
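A PyTorch sketch of such a network; the layer sizes follow the common Atari setup and are illustrative rather than taken from the slides.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """DQN-style network: conv layers over pixels, one Q-value per discrete action."""
    def __init__(self, n_actions, in_channels=4):        # e.g. 4 stacked 84x84 frames
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                    # Q(s, a) for every action at once
        )

    def forward(self, x):
        return self.head(self.conv(x / 255.0))            # raw pixels in, Q-values out

q_net = DQN(n_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))               # shape: (1, 6)
```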
Trade-off in all RL
$\epsilon$-greedy: pick a random action with probability $\epsilon$
Anneal $\epsilon$ to decrease exploration over time
Otherwise use policy, pick $\arg\!\max$ action
Note: DQN outputs $\neq$ probability distribution
Ongoing research topic in its own right;
aided by, e.g., hierarchical RL and learning from demonstration
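A small sketch of annealed $\epsilon$-greedy action selection (the schedule and constants are illustrative):

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end (values are illustrative)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step):
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(len(q_values))                        # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit: argmax action
```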
Saliency maps can show attention of network [10]
Store transitions $(\mathbf{s}_t, \mathbf{a}_t, r_{t+1}, \mathbf{s}_{t+1}, \text{term.})$ in memory $\xi$ [11]
Sample minibatches from $\xi$ offline
Breaks strong temporal correlations
Efficiency for samples and hardware
Experience replay $\implies$ off-policy training;
on-policy methods include SARSA, TD(λ), basic actor-critic methods
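A minimal replay memory sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s, a, r, s', terminal) and samples them uniformly."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # old transitions are discarded first

    def push(self, s, a, r, s_next, terminal):
        self.memory.append((s, a, r, s_next, terminal))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the strong temporal correlations of the trajectory.
        return random.sample(self.memory, batch_size)
```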
Function approximation in RL is unstable
Occasionally freeze weights $\theta$ in a target network: $\theta^-$
$\delta = \left(r_{t+1} + \gamma\max\limits_{\mathbf{a}}Q(\mathbf{s}_{t+1}, \mathbf{a}; \theta^-)\right) - Q(\mathbf{s}_t, \mathbf{a}_t; \theta)$
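A sketch of computing this TD-error with a frozen target network, assuming a `q_net` module as in the earlier sketch (tensor shapes and $\gamma$ are illustrative):

```python
import copy
import torch

# Assume q_net is a PyTorch module mapping a batch of states to Q-values.
target_net = copy.deepcopy(q_net)            # theta^-: frozen copy of theta

def td_error(batch, gamma=0.99):
    s, a, r, s_next, term = batch            # minibatch tensors; a is a LongTensor of actions
    with torch.no_grad():                    # no gradients flow into theta^-
        target = r + gamma * (1 - term) * target_net(s_next).max(dim=1).values
    return target - q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

# Periodically refresh the frozen weights:
# target_net.load_state_dict(q_net.state_dict())
```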
State-of-the-art (1 GPU): DQN with several extensions [12]
Open-source implementation: Kaixhin/Rainbow
Can decouple acting and learning for parallelism/distribution
Instead of experience replay, train multiple agents in parallel
Update weights asynchronously for improved exploration (?)
A3C = Asynchronous + Advantage + Actor-Critic
Policy gradients based on Monte Carlo backups can have large variance
Can use action-independent baseline subtraction
to reduce variance: $(R(\mathbf{s}) - b)\nabla_\theta\log \pi_\theta(\mathbf{s})$
Hence can utilise $A(\mathbf{s}) = R(\mathbf{s}) - V(\mathbf{s})$: $\ A(\mathbf{s})\nabla_\theta\log \pi_\theta(\mathbf{s})$
Later shown that A2C (synchronous) suffices
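A sketch of the A2C objective for one batch (coefficients and the entropy bonus are typical choices, not taken from the slides):

```python
import torch

def a2c_losses(log_prob_a, values, returns,
               value_coef=0.5, entropy=None, entropy_coef=0.01):
    """log_prob_a: log pi(a_t|s_t) for taken actions, values: V(s_t), returns: (n-step) R_t."""
    advantage = returns - values.detach()            # A(s) = R(s) - V(s), no gradient to the critic here
    policy_loss = -(advantage * log_prob_a).mean()   # gradient ascent on A * log pi (actor)
    value_loss = (returns - values).pow(2).mean()    # regression of the critic onto the returns
    loss = policy_loss + value_coef * value_loss
    if entropy is not None:
        loss = loss - entropy_coef * entropy.mean()  # optional entropy bonus for exploration
    return loss
```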
IMPALA = Importance-Weighted Actor-Learner Architecture [18]
Many (even distributed) actors collecting data asynchronously
Single/several learners updating parameters synchronously
Uses (truncated) importance weights for off-policy [19]
Importance weight $c$ with evaluation policy $\pi$
and behaviour policy $\mu$ [20]: $c = \frac{\pi(\mathbf{a}|\mathbf{s})}{\mu(\mathbf{a}|\mathbf{s})}$
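A sketch of the truncated importance weight used for off-policy correction (the truncation constant `c_bar` is illustrative):

```python
import torch

def truncated_importance_weights(log_pi, log_mu, c_bar=1.0):
    """c = pi(a|s) / mu(a|s), truncated at c_bar.

    log_pi: log-probabilities of the taken actions under the evaluation (learner) policy
    log_mu: log-probabilities of the same actions under the behaviour policy
    """
    c = torch.exp(log_pi - log_mu)
    return torch.clamp(c, max=c_bar)
```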
Many topics: model-based RL, hierarchical RL,
policy gradients, deep neuroevolution, meta-learning,
transfer learning, distributed training...
Slides mainly cite "older" DRL works
Check out A Brief Survey of Deep Reinforcement Learning [6]
Large policy updates can result in a disastrous policy
Can enforce a soft/hard constraint on how far the policy deviates
from the current policy, using the probability ratio $c(\theta) = \frac{\pi_\theta(\mathbf{a}|\mathbf{s})}{\pi_{\theta_{old}}(\mathbf{a}|\mathbf{s})}$
Trust region policy optimisation (TRPO) uses a hard constraint,
enforced with the Fisher information matrix (natural gradient) and a line search [20]:
$c(\theta)A(\mathbf{s}, \mathbf{a})\quad\text{s.t.} \ D_{\text{KL}}(\pi_{\theta_{old}}(\cdot|\mathbf{s})\Vert \pi_\theta(\cdot|\mathbf{s})) \leq \delta$
Proximal policy optimisation (PPO) uses soft constraint [21]:
$\min(c(\theta)A(\mathbf{s}, \mathbf{a}), \text{clip}(c(\theta), 1 - \epsilon, 1 + \epsilon)A(\mathbf{s}, \mathbf{a}))$
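A sketch of the PPO clipped surrogate objective ($\epsilon = 0.2$ is a common but illustrative choice; negated so it can be minimised):

```python
import torch

def ppo_clip_loss(log_pi_new, log_pi_old, advantage, eps=0.2):
    ratio = torch.exp(log_pi_new - log_pi_old)                   # c(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()                 # minimise negative -> gradient ascent
```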
Policy search is difficult; local minima are a big problem
Example demonstrations, e.g. via offline planning,
to escape local minima
Problem with naive SL is compounding errors
Guided policy search =
guiding samples + importance sampling [22]
Guiding samples from fitting dynamics model to examples
Importance sampling corrects for off-policy samples
For an overview, see Guided Policy Search - TechTalks.tv
Simulate playouts with random actions:
Monte Carlo Tree Search (MCTS)
AlphaGo = Policy gradients + MCTS [23]
AlphaZero [24]/Expert iteration [25] use MCTS
to provide regression targets
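A sketch of evaluating actions by random playouts (flat Monte Carlo rather than full MCTS, which additionally builds a search tree; `env.step(s, a)` and `env.actions(s)` are an assumed simulator interface):

```python
import random

def rollout_value(env, s, depth=20, gamma=0.99):
    """Estimate the value of state s by one random playout."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        a = random.choice(env.actions(s))
        s, r, done = env.step(s, a)           # assumed interface: step(s, a) -> (s', r, done)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total

def best_action(env, s, n_playouts=50, gamma=0.99):
    # Average several random playouts per action and pick the best.
    def score(a):
        vals = []
        for _ in range(n_playouts):
            s2, r, done = env.step(s, a)
            vals.append(r + (0.0 if done else gamma * rollout_value(env, s2)))
        return sum(vals) / n_playouts
    return max(env.actions(s), key=score)
```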
Combine symbols and logic with deep RL [26]
Extract symbols using autoencoders [27, 28, 29, 30]
Know dynamics $P(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$ or model $\hat{P}(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$
Use model-based RL or control theory
Can make predictions conditional on actions [31]
Errors compound
Decompose $Q$ as successor map $\cdot$ reward predictor
Extract subgoals (e.g. entrances)
using successor representation [32]
Temporally-extended actions/hierarchical policies
Hierarchical-DQN [33], deep skill networks [34],
option heads [35]
Strategic attentive writer [36]
Intrinsic motivation: add general-purpose internal reward
"Intelligent" exploration is hard
Motivating exploration helps with sparse rewards [37]
Vital for continual/lifelong learning
Pre-trained teacher networks, train student network [38, 39]
Train, freeze weights, change task, expand, repeat [40, 41]
Behavioural cloning is supervised learning on trajectories
Fails when cloned policy diverges from "expert states"
Inverse RL (IRL) is learning cost/reward function
from expert demonstrations
NNs are more expressive than linear, cheaper than GPs [42]
Learn policy directly with adversarial networks [43]
Character prediction as RL problem [44, 45, 46, 47, 48]
Directly optimise non-differentiable cost functions [44]
Make stochastic (binary) decisions [46]
Not all wrong outputs are equally bad;
reward-augmented maximum likelihood [47]
Can train sequential generative (adversarial) networks [48]
"Learning to learn"
Also important for lifelong learning
Learn optimisation algorithms [49]
Learn neural network architectures [50]
Meta-reinforcement learning [51, 52]
What else can be converted into/augmented with RL?
Impressive results!
Resurgence of old RL techniques, now with DL
Distributed training can actually improve performance
Sample efficiency is still poor
Long-term dependencies are still an open problem
Watch this space...
Pedro Mediano, Feryal Behbahani and
other colleagues at BICV and Computational Neurodynamics