DL + RL $\subset$ Artificial General Intelligence (?)
Deep learning $\subset$ Representation learning
Replace hand-engineering with learning features
Solve problems with the right representations
Wouldn't perform object classification straight from pixels
Learn representations using general-purpose priors
Deep learning [1, 2]
Reinforcement learning [3]
Deep Q-network [4] & advantage actor-critic [5]
Assorted topics [6]
Neural networks (NNs) are powerful function approximators
NNs can learn features directly from data
Stacking layers enables learning hierarchical features
Backpropagation ≈ calculating gradients with the chain rule
Stochastic gradient descent (or variants)
Requires differentiable loss function, computational graph
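A minimal sketch of these ideas (not from the slides): a tiny two-layer network trained on a made-up regression task, with a differentiable MSE loss, gradients computed by the chain rule (backpropagation) and gradient-descent weight updates. All sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) + noise (made up for illustration).
x = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

# Two-layer network: x -> tanh(x W1 + b1) -> W2 + b2.
W1, b1 = rng.normal(scale=0.5, size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)

lr = 1e-2
for step in range(2000):
    # Forward pass (the "computational graph", written out by hand).
    h_pre = x @ W1 + b1
    h = np.tanh(h_pre)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)        # differentiable MSE loss

    # Backward pass: chain rule, layer by layer (backpropagation).
    d_y_hat = 2 * (y_hat - y) / len(x)
    dW2 = h.T @ d_y_hat
    db2 = d_y_hat.sum(axis=0)
    d_h = d_y_hat @ W2.T
    d_h_pre = d_h * (1 - h ** 2)            # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_h_pre
    db1 = d_h_pre.sum(axis=0)

    # Gradient descent step (sampling minibatches would make this stochastic).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```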
Agent interacts with a (generally stochastic) environment
and learns through trial-and-error
Agent perceives environment state $\mathbf{s}_t$ and chooses action $\mathbf{a}_t$
Performing $\mathbf{a}_t$ transitions $\mathbf{s}_t$ to $\mathbf{s}_{t+1}$ with scalar reward $r_{t+1}$
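A minimal sketch of this agent-environment loop; `Env` and `random_policy` are illustrative stand-ins, not a real API.

```python
import random

class Env:
    """Toy 1-D chain: move left/right, reward +1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                       # a in {-1, +1}
        self.s = max(0, min(5, self.s + a))
        r = 1.0 if self.s == 5 else 0.0      # scalar reward r_{t+1}
        done = self.s == 5
        return self.s, r, done

def random_policy(s):
    return random.choice([-1, +1])           # trial-and-error behaviour

env = Env()
s = env.reset()
for t in range(100):
    a = random_policy(s)                     # agent perceives s_t, chooses a_t
    s, r, done = env.step(a)                 # environment returns s_{t+1}, r_{t+1}
    if done:
        s = env.reset()
```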
Supervised learning: receive correct answer,
produce correct answer
Reinforcement learning: receive reward signal,
produce correct action?
Correct action unknown
Agent affects its own observations (no i.i.d.)
Long-range time dependencies (credit assignment)
Maximise expected return (a.k.a. value) $\mathbb{E}[R]$
Return is cumulative (discounted) reward: $R = \sum\limits_{t=0}^{T-1} \gamma^tr_{t+1}$
Discount $\gamma \in [0, 1]$ determines "far-sightedness"
If non-episodic ($T = \infty$), $\gamma \in [0, 1)$
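A small numeric check of the return definition (the reward list and $\gamma = 0.9$ are made up):

```python
def discounted_return(rewards, gamma):
    # R = sum_t gamma^t * r_{t+1}
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.5]            # r_1, ..., r_T
print(discounted_return(rewards, 0.9))    # 0.9**2 * 1 + 0.9**3 * 0.5 = 1.1745
```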
Learn a policy $\pi$ that maps states to actions
to maximise $\mathbb{E}[R]$
Optimal policy $\pi^*$ maximises $\mathbb{E}[R]$ from all states
Collect history, e.g. $\mathbf{h}_2 = \{\mathbf{s}_0, \mathbf{a}_0, r_1, \mathbf{s}_1, \mathbf{a}_1, r_2, \mathbf{s}_2\}$
RL assumes Markov decision process (MDP)
Choose $\mathbf{a}_2$ based purely on $\mathbf{s}_2$, not $\mathbf{h}_2$
State is a sufficient statistic of the future;
allows dynamic programming instead of Monte Carlo estimates
Realistic problems are usually partially observable MDPs
Receive observation $\mathbf{o}_{t+1} \sim O(\mathbf{s}_{t+1}, \mathbf{a}_t)$
Value functions: estimate the value (expected return)
of being in a given state
Policy search: directly find a policy
Actor-critic: combine a value function (critic)
with policy search (actor)
Can be combined with (learned) models in many ways,
e.g., training from simulation, model predictive control
Considering tabular value functions/policies,
i.e., $|\pi| = |\mathcal{S}| \times |\mathcal{A}|$
Define the state value function: $V^\pi(\mathbf{s}_t) = \mathbb{E}_\pi[R|\mathbf{s}_t]$
Optimal value function comes from optimal policy:
$V^*(\mathbf{s}) = V^{\pi^*}(\mathbf{s}) = \max\limits_\pi V^\pi(\mathbf{s}) \ \forall \mathbf{s}$
With the environment model, $\mathbf{s}_{t+1} \sim P(\mathbf{s}_t, \mathbf{a}_t)$,
we could use dynamic programming with $V^\pi$
Define the state-action value function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_\pi[R|\mathbf{s}_t, \mathbf{a}_t]$
If we had $Q^*$, $\pi^*(\mathbf{s}_t) = \arg\!\max\limits_{\mathbf{a}}Q^*(\mathbf{s}_t, \mathbf{a})$
$Q^\pi$ satisfies a recursive relation (Bellman equation): $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}_{t+1},\pi}\big[r_{t+1} + \gamma Q^\pi(\mathbf{s}_{t+1}, \pi(\mathbf{s}_{t+1}))\big]$
Therefore, $Q^\pi$ can be improved by bootstrapping
Can also define relative advantage of action against baseline: $A(\mathbf{s}_t, \mathbf{a}_t) = Q(\mathbf{s}_t, \mathbf{a}_t) - V(\mathbf{s}_t)$
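To make these definitions concrete, a small tabular sketch on a made-up 2-state MDP with a fixed stochastic policy: iterate the Bellman equation for $Q^\pi$, recover $V^\pi$ as the policy-weighted average, and compute the advantage.

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],        # pi[s, a]: fixed stochastic policy
               [0.2, 0.8]])

Q = np.zeros((2, 2))
for _ in range(500):              # iterate the Bellman equation to a fixed point
    V = (pi * Q).sum(axis=1)      # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
    Q = R + gamma * P @ V         # Q^pi(s, a) = R(s, a) + gamma E_{s'}[V^pi(s')]

V = (pi * Q).sum(axis=1)
A = Q - V[:, None]                # advantage of each action over the baseline V
```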
Learn from experience: $Q'(\mathbf{s}_t, \mathbf{a}_t) = Q(\mathbf{s}_t, \mathbf{a}_t) + \alpha \delta$,
where $\alpha$ is the learning rate and $\delta$ is the TD-error [7]
$\delta = Y - Q = \left(r_{t+1} + \gamma\max\limits_{\mathbf{a}}Q(\mathbf{s}_{t+1}, \mathbf{a})\right) - Q(\mathbf{s}_t, \mathbf{a}_t)$
$Y$ is reward received + discounted max Q-value of next state
Minimising $\delta$ satisfies recursive relationship
Loss is Mean Squared Error (over batch): $\mathcal{L}(\delta) = \frac{1}{N}\sum\limits_{n=1}^{N}(\delta_n)^2$
DL Note: RL updates are usually formulated for gradient ascent
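A minimal tabular Q-learning sketch of this update (the chain environment and hyperparameters are made up; in the deep setting the same TD-error is instead minimised as an MSE loss by gradient descent):

```python
import random

n_states, n_actions = 6, 2                 # toy chain: action 0 = left, 1 = right
alpha, gamma = 0.1, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0          # reward r_{t+1}
    return s_next, r, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(n_actions)                  # random behaviour (Q-learning is off-policy)
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        delta = target - Q[s][a]                         # TD-error
        Q[s][a] += alpha * delta                         # Q <- Q + alpha * delta
        s = s_next
```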
Used to get $Q^*$ from $Q^\pi$
Interleave steps of policy evaluation and policy improvement
Policy evaluation: with updated policy,
improve estimate of value function
Policy improvement: with updated value function,
improve policy
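A compact sketch of policy iteration on a made-up tabular MDP, alternating exact policy evaluation with greedy policy improvement (all numbers are illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s']
              [[0.6, 0.4], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                      # R[s, a]
              [0.5, 2.0]])
pi = np.zeros(2, dtype=int)                    # deterministic policy: s -> a

for _ in range(20):
    # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
    P_pi = P[np.arange(2), pi]                 # transition matrix under pi
    R_pi = R[np.arange(2), pi]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the updated value function.
    Q = R + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                                  # policy is stable -> optimal
    pi = new_pi
```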
Directly output actions (parameterised policy): $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$
Search methods can include black-box optimisers
such as genetic algorithms or even random search [8]
"Direct" policy methods easily allow continuous action outputs,
rather than searching for $\arg\!\max\limits_{\mathbf{a}}Q(\mathbf{s}, \mathbf{a})$
Increase the log probability of actions, weighted by reward
Score function gradient estimator (REINFORCE) [9]: $\nabla_\theta \mathbb{E}_{\mathbf{s}}[R(\mathbf{s})] = \mathbb{E}[R(\mathbf{s})\nabla_\theta\log \pi_\theta(\mathbf{s})]$
Stochastic estimation when $r$ is non-differentiable but
$\pi_\theta$ can be sampled from
For more details, see Deep Reinforcement Learning: Pong from Pixels
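A minimal REINFORCE sketch on a made-up bandit-style problem with a softmax policy: the score-function estimator weights $\nabla_\theta \log \pi_\theta$ by the sampled (noisy, non-differentiable) reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                    # logits of a softmax policy over 2 actions
true_reward = np.array([1.0, 3.0])     # made-up expected rewards
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = true_reward[a] + rng.normal()          # reward is only sampled, never differentiated
    grad_logp = -probs
    grad_logp[a] += 1.0                        # d/dtheta log pi_theta(a)
    theta += lr * r * grad_logp                # gradient ascent on E[R]
```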
Actor: policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$, trained with policy gradients
Critic: state value function $V(\mathbf{s}_t)$, trained with TD-error $\delta$
2 (of several) "dimensions" in RL
Full backups require a model,
shallow backups require a value function
Previous approaches do not scale
Deep neural networks are powerful function approximators
Convolutional NNs (CNNs) for visual inputs,
Recurrent NNs (RNNs) for sequential inputs
Differentiable attention, differentiable memory, etc.
Browser demo: ReinforceJS WaterWorld
$Q(\mathbf{s}, \mathbf{a})$ is approximated by a neural net
Process raw pixels with convolutional layers
Efficient: output $Q(\mathbf{s}, \mathbf{a}) \ \forall \mathbf{a} \in \mathcal{A}$ (discrete set)
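A PyTorch sketch of such a network; the layer sizes follow the common Atari setup and are illustrative rather than taken from the slides.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """DQN-style network: conv layers over pixels, one Q-value per discrete action."""
    def __init__(self, n_actions, in_channels=4):        # e.g. 4 stacked 84x84 frames
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                    # Q(s, a) for every action at once
        )

    def forward(self, x):
        return self.head(self.conv(x / 255.0))            # raw pixels in, Q-values out

q_net = DQN(n_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))               # shape: (1, 6)
```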
Trade-off in all RL
$\epsilon$-greedy: pick a random action with probability $\epsilon$
Anneal $\epsilon$ to decrease exploration over time
Otherwise use policy, pick $\arg\!\max$ action
Note: DQN outputs $\neq$ probability distribution
Ongoing research topic in its own right;
aided by, e.g., hierarchical RL and learning from demonstration
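A small sketch of annealed $\epsilon$-greedy action selection (the schedule and constants are illustrative):

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end (values are illustrative)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step):
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(len(q_values))                        # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit: argmax action
```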
Saliency maps can show attention of network [10]
Store transitions $(\mathbf{s}_t, \mathbf{a}_t, r_{t+1}, \mathbf{s}_{t+1}, \text{term.})$ in memory $\xi$ [11]
Sample minibatches from $\xi$ offline
Breaks strong temporal correlations
Efficiency for samples and hardware
Experience replay $\implies$ off-policy training;
on-policy methods include SARSA, TD(λ), basic actor-critic methods
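A minimal replay memory sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s, a, r, s', terminal) and samples them uniformly."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # old transitions are discarded first

    def push(self, s, a, r, s_next, terminal):
        self.memory.append((s, a, r, s_next, terminal))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the strong temporal correlations of the trajectory.
        return random.sample(self.memory, batch_size)
```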
Function approximation in RL is unstable
Occasionally freeze weights $\theta$ in a target network: $\theta^-$
$\delta = \left(r_{t+1} + \gamma\max\limits_{\mathbf{a}}Q(\mathbf{s}_{t+1}, \mathbf{a}; \theta^-)\right) - Q(\mathbf{s}_t, \mathbf{a}_t; \theta)$
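A sketch of computing this TD-error with a frozen target network, assuming a `q_net` module as in the earlier sketch (tensor shapes and $\gamma$ are illustrative):

```python
import copy
import torch

# Assume q_net is a PyTorch module mapping a batch of states to Q-values.
target_net = copy.deepcopy(q_net)            # theta^-: frozen copy of theta

def td_error(batch, gamma=0.99):
    s, a, r, s_next, term = batch            # minibatch tensors; a is a LongTensor of actions
    with torch.no_grad():                    # no gradients flow into theta^-
        target = r + gamma * (1 - term) * target_net(s_next).max(dim=1).values
    return target - q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

# Periodically refresh the frozen weights:
# target_net.load_state_dict(q_net.state_dict())
```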
State-of-the-art (1 GPU): DQN with several extensions [12]
Open-source implementation: Kaixhin/Rainbow
Can decouple acting and learning for parallelism/distribution
Instead of experience replay, train multiple agents in parallel
Update weights asynchronously for improved exploration (?)
A3C = Asynchronous + Advantage + Actor-Critic
Policy gradients based on Monte Carlo backups can have large variance
Can use action-independent baseline subtraction
to reduce variance: $(R(\mathbf{s}) - b)\nabla_\theta\log \pi_\theta(\mathbf{s})$
Hence can utilise $A(\mathbf{s}) = R(\mathbf{s}) - V(\mathbf{s})$: $\ A(\mathbf{s})\nabla_\theta\log \pi_\theta(\mathbf{s})$
Later shown that A2C (synchronous) suffices
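A sketch of the A2C objective for one batch (coefficients and the entropy bonus are typical choices, not taken from the slides):

```python
import torch

def a2c_losses(log_prob_a, values, returns,
               value_coef=0.5, entropy=None, entropy_coef=0.01):
    """log_prob_a: log pi(a_t|s_t) for taken actions, values: V(s_t), returns: (n-step) R_t."""
    advantage = returns - values.detach()            # A(s) = R(s) - V(s), no gradient to the critic here
    policy_loss = -(advantage * log_prob_a).mean()   # gradient ascent on A * log pi (actor)
    value_loss = (returns - values).pow(2).mean()    # regression of the critic onto the returns
    loss = policy_loss + value_coef * value_loss
    if entropy is not None:
        loss = loss - entropy_coef * entropy.mean()  # optional entropy bonus for exploration
    return loss
```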
IMPALA = Importance-Weighted Actor-Learner Architecture [18]
Many (even distributed) actors collecting data asynchronously
Single/several learners updating parameters synchronously
Uses (truncated) importance weights for off-policy [19]
Importance weight $c$ with evaluation policy $\pi$
and behaviour policy $\mu$ [20]: $c = \frac{\pi(\mathbf{a}|\mathbf{s})}{\mu(\mathbf{a}|\mathbf{s})}$
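A sketch of the truncated importance weight used for off-policy correction (the truncation constant `c_bar` is illustrative):

```python
import torch

def truncated_importance_weights(log_pi, log_mu, c_bar=1.0):
    """c = pi(a|s) / mu(a|s), truncated at c_bar.

    log_pi: log-probabilities of the taken actions under the evaluation (learner) policy
    log_mu: log-probabilities of the same actions under the behaviour policy
    """
    c = torch.exp(log_pi - log_mu)
    return torch.clamp(c, max=c_bar)
```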
Many topics: model-based RL, hierarchical RL,
policy gradients, deep neuroevolution, meta-learning,
transfer learning, distributed training...
Slides mainly cite "older" DRL works
Check out A Brief Survey of Deep Reinforcement Learning [6]
Large policy updates can result in a disastrous policy
Can enforce a soft/hard constraint on how far the policy deviates
from the current policy, using the probability ratio $c(\theta) = \frac{\pi_\theta(\mathbf{a}|\mathbf{s})}{\pi_{\theta_{old}}(\mathbf{a}|\mathbf{s})}$
Trust region policy optimisation (TRPO) uses a hard constraint,
enforced with the Fisher information matrix (natural gradient) and a line search [20]:
$c(\theta)A(\mathbf{s}, \mathbf{a})\quad\text{s.t.} \ D_{\text{KL}}(\pi_{\theta_{old}}(\cdot|\mathbf{s})\Vert \pi_\theta(\cdot|\mathbf{s})) \leq \delta$
Proximal policy optimisation (PPO) uses soft constraint [21]:
$\min(c(\theta)A(\mathbf{s}, \mathbf{a}), \text{clip}(c(\theta), 1 - \epsilon, 1 + \epsilon)A(\mathbf{s}, \mathbf{a}))$
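A sketch of the PPO clipped surrogate objective ($\epsilon = 0.2$ is a common but illustrative choice; negated so it can be minimised):

```python
import torch

def ppo_clip_loss(log_pi_new, log_pi_old, advantage, eps=0.2):
    ratio = torch.exp(log_pi_new - log_pi_old)                   # c(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()                 # minimise negative -> gradient ascent
```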
Policy search is difficult; local minima are a big problem
Example demonstrations, e.g. via offline planning,
to escape local minima
Problem with naive SL is compounding errors
Guided policy search =
guiding samples + importance sampling [22]
Guiding samples from fitting dynamics model to examples
Importance sampling corrects for off-policy samples
For an overview, see Guided Policy Search - TechTalks.tv
Simulate playouts with random actions:
Monte Carlo Tree Search (MCTS)
AlphaGo = Policy gradients + MCTS [23]
AlphaZero [24]/Expert iteration [25] use MCTS
to provide regression targets
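A sketch of evaluating actions by random playouts (flat Monte Carlo rather than full MCTS, which additionally builds a search tree; `env.step(s, a)` and `env.actions(s)` are an assumed simulator interface):

```python
import random

def rollout_value(env, s, depth=20, gamma=0.99):
    """Estimate the value of state s by one random playout."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        a = random.choice(env.actions(s))
        s, r, done = env.step(s, a)           # assumed interface: step(s, a) -> (s', r, done)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total

def best_action(env, s, n_playouts=50, gamma=0.99):
    # Average several random playouts per action and pick the best.
    def score(a):
        vals = []
        for _ in range(n_playouts):
            s2, r, done = env.step(s, a)
            vals.append(r + (0.0 if done else gamma * rollout_value(env, s2)))
        return sum(vals) / n_playouts
    return max(env.actions(s), key=score)
```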
Combine symbols and logic with deep RL [26]
Extract symbols using autoencoders [27, 28, 29, 30]
Know dynamics $P(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$ or model $\hat{P}(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$
Use model-based RL or control theory
Can make predictions conditional on actions [31]
Errors compound
Decompose $Q$ as successor map $\cdot$ reward predictor
Extract subgoals (e.g. entrances)
using successor representation [32]
Temporally-extended actions/hierarchical policies
Hierarchical-DQN [33], deep skill networks [34],
option heads [35]
Strategic attentive writer [36]
Intrinsic motivation: add general-purpose internal reward
"Intelligent" exploration is hard
Motivating exploration helps with sparse rewards [37]
Vital for continual/lifelong learning
Pre-trained teacher networks, train student network [38, 39]
Train, freeze weights, change task, expand, repeat [40, 41]
Behavioural cloning is supervised learning on trajectories
Fails when cloned policy diverges from "expert states"
Inverse RL (IRL) is learning cost/reward function
from expert demonstrations
NNs are more expressive than linear, cheaper than GPs [42]
Learn policy directly with adversarial networks [43]
Character prediction as RL problem [44, 45, 46, 47, 48]
Directly optimise non-differentiable cost functions [44]
Make stochastic (binary) decisions [46]
Not all wrong outputs are equally bad;
reward-augmented maximum likelihood [47]
Can train sequential generative (adversarial) networks [48]
"Learning to learn"
Also important for lifelong learning
Learn optimisation algorithms [49]
Learn neural network architectures [50]
Meta-reinforcement learning [51, 52]
What else can be converted into/augmented with RL?
Impressive results!
Resurgence of old RL techniques, now with DL
Distributed training can actually improve performance
Sample efficiency is still poor
Long-term dependencies are still an open problem
Watch this space...
Pedro Mediano, Feryal Behbahani and
other colleagues at BICV and Computational Neurodynamics