[toc]

Playing Atari with Deep Reinforcement Learning

多平台维护不易，内容实时更新于 个人网站，请移步阅读最新内容。

# Abstract

文章提出第一个深度学习模型，能够使用强化学习算法从高维的感知输入(high-dimensional sensory input)中学习到控制策略(control policies)。这个模型是一个卷积神经网络(convolutional neural network)，使用Q-learning的变种进行训练，可以直接输入像素，然后输出评估未来奖励的值函数(a value function estimating future rewards)。

The deep learning model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

# Introduction

研究现状：很多RL应用依赖于人工制定的特征，其性能表现严重依赖特征表示的质量，不具备自学习能力。 > Most successful RL applica- tions that operate on these domains have relied on

**hand-crafted features combined with linear value functions or policy representations**. Clearly, the performance of such systems heavily relies on the quality of the**feature representation**.从深度学习DL在计算机视觉和语音识别上取得进步的现状，思考类似的方法能否应用到通过强化学习RL来控制智能体。 > Recent advances in deep learning have made it possible to

**extract high-level features from raw sensory data**, leading to breakthroughs in computer vision and speech recognition.- DL --> extract high-level features

It seems natural to ask whether similar techniques could also be beneficial for RL with sensory data.

从DL的视角，讨论RL面临的挑战。

- 比较DL和RL
- RL有一些问题，DL依然无法解决。

Deep Learning | Reinforcement Learning | |
---|---|---|

Data source | required large amounts of hand-labelled training data | learn from a scalar reward signal that is frequently sparse, noisy and delayed (actions and rewards) |

数据来源 | 从大量人工标注的数据中学习 | 只能从稀少、噪声和延迟的奖励中学习 |

Data features | assume the data samples to be independent | typically sequences of highly correlated states |

数据特点 | 采样数据独立 | 一系列高度关联的状态 |

Data distribution | a fixed underlying distribution | data distribution changes as the algorithm learns new behaviours |

数据分布 | 固定的数据分布 | 学习新行为后，数据分布会调整 |

## 本文贡献

卷积神经网络可以在复杂的RL环境中从视频输入数据中学习控制策略。

This paper demonstrates that a convolutional neural network can overcome these challenges to learn successful control policies from raw video data in complex RL environments.

- 使用Q-learning的变种训练
- The network is trained with a variant of the Q-learning algorithm, with stochastic gradient descent to update the weights.

- 为了缓和关联数据和非平稳分布，使用经验回放机制(an experience replay
mechanism)，随机采样之前的转换，所以可以平滑过去行为的训练分布。
- To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.

- 使用Q-learning的变种训练

# Background

## 任务描述

环境(environment) > We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards.

行动(actions):通过合法的行为来进一步改变模拟器的内部状态和游戏得分。 > At each time-step the agent selects an action at from the set of legal game actions, A = {1, . . . ,K}.

The action is passed to the emulator and modifies its internal state and the game score.

观察(observations)：屏幕像素 > The emulator’s internal state is not observed by the agent; instead it observes an image rom the emulator, which is a vector of raw pixel values representing the current screen.

奖励(rewards)：游戏得分 > In addition it receives a reward representing the change in game score.

交互特征

行动和奖励的延时性(feedback delay) > Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.

智能体最大化得分 > The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards.

## 问题抽象

1 | graph TB |

- 从一系列的行动和观察中学习 > We therefore consider sequences of
actions and observations, st = x1, a1, x2, ..., at−1, xt, and learn game
strategies that depend upon these sequences.
- 有效时间内终结 > All sequences in the emulator are assumed to terminate in a finite number of time-steps.
- 转化为MDP问题 > This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state.
- 使用强化学习来处理 MDPs 问题 > We can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence st as the state representation at time t.

- Optimal action-value function：在确定初始状态后(st, at,
pi)后，在规定的迭代时间内，不同的policy值返回的奖励是不同的；选择序列中最大的奖励，那么对于的policy就是最优的。
- We define the optimal action-value function Q∗(s, a) as the maximum expected return achievable by following any strategy, after seeing some sequence s and then taking some action a, Q∗(s, a) = maxπ E [Rt|st = s, at = a, π], where π is a policy mapping sequences to actions (or distributions over actions).

- Bellman equation
- The optimal action-value function obeys an important identity known as the Bellman equation.
- Intuition : 转化问题为寻找当前最优
- 因为the optimal value Q∗(s?, a?) of the sequence s? at the next
time-step 不可知，所以进一步简化问题。 > The basic idea behind many
reinforcement learning algorithms is to estimate the action- value
function, by using the Bellman equation as an iterative update, Qi+1(s,
a) = E [r + γ maxa?Qi(s?, a?)|s, a]. Such value
- 证明收敛性： > Such value iteration algorithms converge to the optimal action-value function.

- 泛化能力
实践中，原模型严重依赖当前的s，缺乏泛化能力。 > In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation.

我们希望使用一个近似函数来估计行为-值函数 > It is common to use a function approximator to estimate the action-value function.

- 近似函数可以使用线性函数，或者非线性函数，比如神经网络。 > In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network.

问题转换：把一个复杂的数学模型，通过 Bellman equation 转换为 Q-learning 的问题。

- Training -- Q-network
- A Q-network can be trained by minimising a sequence of loss functions that changes at each iteration.
- Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins.
- Differentiating the loss function with respect to the weights at the gradient.
- [又一次简化问题] Rather than computing the full expectations in the above gradient, it is often computationally expe- dient to optimise the loss function by stochastic gradient descent.

- Summary
- model-free: it solves the reinforcement learning task directly using samples from the emulator E, without explicitly constructing an estimate of E.
- off-policy: it learns about the greedy strategy a = maxa Q(s, a; θ), while following a behaviour distribution that ensures adequate exploration of the state space.

# Related Work

## Timeline

1995: TD-gammon used a model-free reinforcement learning algorithm similar to Q-learning, and approximated the value function using

**a multi-layer perceptron with one hidden layer**.- Early attempts to follow up on TD-gammon were less successful.
- 1997: It was shown that combining model-free reinforcement learning
algorithms such as Q-learning with
**non-linear function approximators**, or indeed with off-policy learning could cause the Q-network to diverge.

Subsequently, the majority of work in reinforcement learning focused on

**linear function approximators**with better convergence guarantees.More recently, there has been a revival of interest in combining deep learning with reinforcement learning.

- Deep neural networks have been used to estimate the environment; restricted Boltzmann machines have been used to estimate the value function [2004]; or the policy [2012].
- The divergence issues with Q-learning have been partially addressed by gradient temporal-difference methods.
- Pros
- These methods are proven to converge when evaluating a
**fixed policy**with a nonlinear function approximator [2009]; or when learning a control policy with linear function approximation using a restricted variant of Q-learning [2010].

- These methods are proven to converge when evaluating a
- Cons
- However, these methods have not yet been extended to nonlinear control.

2005: NFQ optimises the sequence of loss functions, using the RPROP algorithm to update the parameters of the Q-network.

- NFQ has also been successfully applied to simple real-world control
tasks using purely visual input, by
**first using deep autoencoders to learn a low dimensional representation of the task**, and then applying NFQ to this representation.- Our approach applies reinforcement learning
**end-to-end**, directly from the visual inputs; as a result it may**learn features**that are directly relevant to discriminating.

- Our approach applies reinforcement learning
- Cons
- It uses a batch update that has a computational cost per iteration that is proportional to the size of the data set, whereas we consider stochastic gradient updates that have a low constant cost per iteration and scale to large data-sets.

- NFQ has also been successfully applied to simple real-world control
tasks using purely visual input, by
1993: Q-learning has also previously been combined with experience replay and a simple neural network.

- But again starting with
**a low-dimensional state**rather than raw visual inputs.

- But again starting with
2013: The use of the Atari 2600 emulator as a reinforcement learning platform, who applied standard reinforcement learning algorithms with linear function approximation and generic visual features.

- The HyperNEAT evolutionary architecture has also been applied to the Atari platform, where it was used to evolve (separately, for each distinct game) a neural network representing a strategy for that game.

# Deep Reinforcement Learning

现有的基于深度学习在视觉和声音上的研究，激发作者的本文研究。 > Recent breakthroughs in computer vision and speech recognition have relied on efficiently training deep neural networks on very large training sets. The most successful approaches are trained directly from the raw inputs, using lightweight updates based on

**stochastic gradient descent**. By feeding sufficient data into deep neural networks, it is often possible to**learn better representations than handcrafted features**.本文的研究目标，连接RL和DNN。 > Our goal is to connect a reinforcement learning algorithm to a deep neural network which operates directly on RGB images and efficiently process training data by using stochastic gradient updates.

Deep Q-learning with Experience Replay

- We store the agent’s experiences at each time-step.
- We apply Q-learning updates, or minibatch updates, to samples of experience, drawn at random from the pool of stored samples.
- After performing experience replay, the agent selects and executes an action according to an ?-greedy policy.

- DQN (off-policy) vs online Q-learning
- Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
- Learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
- When learning on-policy the current parameters determine the next
data sample that the parameters are trained on.
- It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically.
- Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.

- By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
- In practice, our algorithm only stores
**the last N experience tuples**in the replay memory, and samples uniformly at random from D when performing updates.- A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping.

## Preprocessing and Model Architecture

- 降低输入数据的维度
- Reduce the input dimensionality
- [降采样和灰度化] The raw frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a 110×84 image.

- Reduce the input dimensionality
- How to parameterize Q using a neural network?
- We instead use an architecture in which there is
**a separate output uni**t for each possible action, and only**the state representation is an input to the neural network**.- The outputs correspond to the predicted Q-values of the individual action for the input state.

- The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.

- We instead use an architecture in which there is
- Deep Q-Networks
- Input: state representation > The input to the neural network consists is an 84 × 84 × 4 image.
- Three hidden layers, followed by a rectifier nonlinearity.
- Output: Q-values for actions > The output layer is a fully- connected linear layer with a single output for each valid action.
- We refer to convolutional networks trained with our approach as
**Deep Q-Networks**(DQN).

# Experiments

方法的通用性和鲁棒性：不需要根据游戏定制信息。 > We use the same network architecture, learning algorithm and hyperparameters settings across all seven games, showing that our approach is robust enough to work on a variety of games without incorporating game-specific information.

适应游戏的做法

- [修改分数比值] Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged.
- [frame-skipping technique] The agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames.

## Training and Stability

- 如何评估训练 > Since our evaluation metric is the
**total reward**the agent collects in an episode or game averaged over a number of games, we periodically compute it during training.- The
**average total reward**evolves during training are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress. - More stable, metric is the policy’s
**estimated action-value function Q**, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. - 收敛性缺乏理论依据保证 > This suggests that,
**despite lacking any theoretical convergence guarantees**, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.

- The

## Visualizing the Value Function

- 尽管理论依据少，但通过 predicted value function 和实际游戏界面的比较和关联，来证明算法的有效性。 > Figure 3 demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events.

## Main Evaluation

和现有的算法、做法比较（理论和实际） > We compare our results with the best performing methods from the RL literature.

- 找出其他算法的缺陷，或者验证算法提高点。 > Note that both of
these methods
**incorporate significant prior knowledge**about the visual problem by using background sub- traction and treating each of the 128 colors as a separate channel. In contrast, our agents only receive**the raw RGB screenshots**as input and must learn to detect objects on their own. > Our approach (labeled DQN) outperforms the other learning methods by a substantial margin on all seven games despite incorporating almost no prior knowledge about the inputs.

- 找出其他算法的缺陷，或者验证算法提高点。 > Note that both of
these methods
人机比较 > In addition to the learned agents, we also report scores for an expert human game player and a policy that selects actions uniformly at random.

和 evolutionary policy search 比较 > We also include a comparison to the evolutionary policy search approach from [8] in the last three rows of table.

# Conclusion

- 贡献 > This paper introduced a new deep learning model for
reinforcement learning, and demonstrated its ability to master difficult
control policies for Atari 2600 computer games, using only raw pixels as
input.

> We also presented a variant of online Q-learning that combines stochastic minibatch up- dates with experience replay memory to ease the training of deep networks for RL.