解读 Playing Atari with Deep Reinforcement Learning
[toc]
Playing Atari with Deep Reinforcement Learning
多平台维护不易，内容实时更新于 个人网站，请移步阅读最新内容。
Abstract
文章提出第一个深度学习模型，能够使用强化学习算法从高维的感知输入 (highdimensional sensory input) 中学习到控制策略 (control policies)。这个模型是一个卷积神经网络 (convolutional neural network)，使用 Qlearning 的变种进行训练，可以直接输入像素，然后输出评估未来奖励的值函数 (a value function estimating future rewards)。
The deep learning model is a convolutional neural network, trained with a variant of Qlearning, whose input is raw pixels and whose output is a value function estimating future rewards.
Introduction

研究现状：很多 RL 应用依赖于人工制定的特征，其性能表现严重依赖特征表示的质量，不具备自学习能力。
Most successful RL applica tions that operate on these domains have relied on handcrafted features combined with linear value functions or policy representations. Clearly, the performance of such systems heavily relies on the quality of the feature representation.

从深度学习 DL 在计算机视觉和语音识别上取得进步的现状，思考类似的方法能否应用到通过强化学习 RL 来控制智能体。
Recent advances in deep learning have made it possible to extract highlevel features from raw sensory data, leading to breakthroughs in computer vision and speech recognition.
 DL > extract highlevel features
It seems natural to ask whether similar techniques could also be beneficial for RL with sensory data.

从 DL 的视角，讨论 RL 面临的挑战。
 比较 DL 和 RL
 RL 有一些问题，DL 依然无法解决。
Deep Learning  Reinforcement Learning  

Data source  required large amounts of handlabelled training data  learn from a scalar reward signal that is frequently sparse, noisy and delayed (actions and rewards) 
数据来源  从大量人工标注的数据中学习  只能从稀少、噪声和延迟的奖励中学习 
Data features  assume the data samples to be independent  typically sequences of highly correlated states 
数据特点  采样数据独立  一系列高度关联的状态 
Data distribution  a fixed underlying distribution  data distribution changes as the algorithm learns new behaviours 
数据分布  固定的数据分布  学习新行为后，数据分布会调整 
本文贡献

卷积神经网络可以在复杂的 RL 环境中从视频输入数据中学习控制策略。
This paper demonstrates that a convolutional neural network can overcome these challenges to learn successful control policies from raw video data in complex RL environments.
 使用 Qlearning 的变种训练
 The network is trained with a variant of the Qlearning algorithm, with stochastic gradient descent to update the weights.
 为了缓和关联数据和非平稳分布，使用经验回放机制 (an experience replay mechanism)，随机采样之前的转换，所以可以平滑过去行为的训练分布。
 To alleviate the problems of correlated data and nonstationary distributions, we use an experience replay mechanism which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.
 使用 Qlearning 的变种训练
Background
任务描述

环境 (environment)
We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards.

行动 (actions): 通过合法的行为来进一步改变模拟器的内部状态和游戏得分。
At each timestep the agent selects an action at from the set of legal game actions, A = {1, . . . ,K}.
The action is passed to the emulator and modifies its internal state and the game score.

观察 (observations)：屏幕像素
The emulator’s internal state is not observed by the agent; instead it observes an image rom the emulator, which is a vector of raw pixel values representing the current screen.

奖励 (rewards)：游戏得分
In addition it receives a reward representing the change in game score.

交互特征

行动和奖励的延时性 (feedback delay)
Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of timesteps have elapsed.

智能体最大化得分
The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards.

问题抽象
1  graph TB 

从一系列的行动和观察中学习
We therefore consider sequences of actions and observations, st = x1, a1, x2, …, at−1, xt, and learn game strategies that depend upon these sequences.
 有效时间内终结
All sequences in the emulator are assumed to terminate in a finite number of timesteps.
 转化为 MDP 问题
This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state.
 使用强化学习来处理 MDPs 问题
We can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence st as the state representation at time t.
 有效时间内终结

Optimal actionvalue function：在确定初始状态后 (st, at, pi) 后，在规定的迭代时间内，不同的 policy 值返回的奖励是不同的；选择序列中最大的奖励，那么对于的 policy 就是最优的。
 We define the optimal actionvalue function Q∗(s, a) as the maximum expected return achievable by following any strategy, after seeing some sequence s and then taking some action a, Q∗(s, a) = maxπ E [Rtst = s, at = a, π], where π is a policy mapping sequences to actions (or distributions over actions).

Bellman equation
 The optimal actionvalue function obeys an important identity known as the Bellman equation.
 Intuition : 转化问题为寻找当前最优
 因为 the optimal value Q∗(s?, a?) of the sequence s? at the next timestep 不可知，所以进一步简化问题。
The basic idea behind many reinforcement learning algorithms is to estimate the action value function, by using the Bellman equation as an iterative update, Qi+1(s, a) = E [r + γ maxa?Qi(s?, a?)s, a]. Such value
 证明收敛性：
Such value iteration algorithms converge to the optimal actionvalue function.
 证明收敛性：

泛化能力

实践中，原模型严重依赖当前的 s，缺乏泛化能力。
In practice, this basic approach is totally impractical, because the actionvalue function is estimated separately for each sequence, without any generalisation.

我们希望使用一个近似函数来估计行为  值函数
It is common to use a function approximator to estimate the actionvalue function.
 近似函数可以使用线性函数，或者非线性函数，比如神经网络。
In the reinforcement learning community this is typically a linear function approximator, but sometimes a nonlinear function approximator is used instead, such as a neural network.
 近似函数可以使用线性函数，或者非线性函数，比如神经网络。

问题转换：把一个复杂的数学模型，通过 Bellman equation 转换为 Qlearning 的问题。


Training – Qnetwork
 A Qnetwork can be trained by minimising a sequence of loss functions that changes at each iteration.
 Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins.
 Differentiating the loss function with respect to the weights at the gradient.
 [又一次简化问题] Rather than computing the full expectations in the above gradient, it is often computationally expe dient to optimise the loss function by stochastic gradient descent.
 Summary
 modelfree: it solves the reinforcement learning task directly using samples from the emulator E, without explicitly constructing an estimate of E.
 offpolicy: it learns about the greedy strategy a = maxa Q(s, a; θ), while following a behaviour distribution that ensures adequate exploration of the state space.
Related Work
Timeline

1995: TDgammon used a modelfree reinforcement learning algorithm similar to Qlearning, and approximated the value function using a multilayer perceptron with one hidden layer.
 Early attempts to follow up on TDgammon were less successful.
 1997: It was shown that combining modelfree reinforcement learning algorithms such as Qlearning with nonlinear function approximators, or indeed with offpolicy learning could cause the Qnetwork to diverge.

Subsequently, the majority of work in reinforcement learning focused on linear function approximators with better convergence guarantees.

More recently, there has been a revival of interest in combining deep learning with reinforcement learning.
 Deep neural networks have been used to estimate the environment; restricted Boltzmann machines have been used to estimate the value function [2004]; or the policy [2012].
 The divergence issues with Qlearning have been partially addressed by gradient temporaldifference methods.
 Pros
 These methods are proven to converge when evaluating a fixed policy with a nonlinear function approximator [2009]; or when learning a control policy with linear function approximation using a restricted variant of Qlearning [2010].
 Cons
 However, these methods have not yet been extended to nonlinear control.

2005: NFQ optimises the sequence of loss functions, using the RPROP algorithm to update the parameters of the Qnetwork.
 NFQ has also been successfully applied to simple realworld control tasks using purely visual input, by first using deep autoencoders to learn a low dimensional representation of the task, and then applying NFQ to this representation.
 Our approach applies reinforcement learning endtoend, directly from the visual inputs; as a result it may learn features that are directly relevant to discriminating.
 Cons
 It uses a batch update that has a computational cost per iteration that is proportional to the size of the data set, whereas we consider stochastic gradient updates that have a low constant cost per iteration and scale to large datasets.
 NFQ has also been successfully applied to simple realworld control tasks using purely visual input, by first using deep autoencoders to learn a low dimensional representation of the task, and then applying NFQ to this representation.

1993: Qlearning has also previously been combined with experience replay and a simple neural network.
 But again starting with a lowdimensional state rather than raw visual inputs.

2013: The use of the Atari 2600 emulator as a reinforcement learning platform, who applied standard reinforcement learning algorithms with linear function approximation and generic visual features.
 The HyperNEAT evolutionary architecture has also been applied to the Atari platform, where it was used to evolve (separately, for each distinct game) a neural network representing a strategy for that game.
Deep Reinforcement Learning

现有的基于深度学习在视觉和声音上的研究，激发作者的本文研究。
Recent breakthroughs in computer vision and speech recognition have relied on efficiently training deep neural networks on very large training sets. The most successful approaches are trained directly from the raw inputs, using lightweight updates based on stochastic gradient descent. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features.

本文的研究目标，连接 RL 和 DNN。
Our goal is to connect a reinforcement learning algorithm to a deep neural network which operates directly on RGB images and efficiently process training data by using stochastic gradient updates.

Deep Qlearning with Experience Replay
 We store the agent’s experiences at each timestep.
 We apply Qlearning updates, or minibatch updates, to samples of experience, drawn at random from the pool of stored samples.
 After performing experience replay, the agent selects and executes an action according to an ?greedy policy.
 DQN (offpolicy) vs online Qlearning
 Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
 Learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
 When learning onpolicy the current parameters determine the next data sample that the parameters are trained on.
 It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically.
 Note that when learning by experience replay, it is necessary to learn offpolicy (because our current parameters are different to those used to generate the sample), which motivates the choice of Qlearning.
 By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
 In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates.
 A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping.
Preprocessing and Model Architecture

降低输入数据的维度
 Reduce the input dimensionality
 [降采样和灰度化] The raw frames are preprocessed by first converting their RGB representation to grayscale and downsampling it to a 110×84 image.
 Reduce the input dimensionality

How to parameterize Q using a neural network?
 We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network.
 The outputs correspond to the predicted Qvalues of the individual action for the input state.
 The main advantage of this type of architecture is the ability to compute Qvalues for all possible actions in a given state with only a single forward pass through the network.
 We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network.

Deep QNetworks
 Input: state representation
The input to the neural network consists is an 84 × 84 × 4 image.
 Three hidden layers, followed by a rectifier nonlinearity.
 Output: Qvalues for actions
The output layer is a fully connected linear layer with a single output for each valid action.
 We refer to convolutional networks trained with our approach as Deep QNetworks (DQN).
 Input: state representation
Experiments

方法的通用性和鲁棒性：不需要根据游戏定制信息。
We use the same network architecture, learning algorithm and hyperparameters settings across all seven games, showing that our approach is robust enough to work on a variety of games without incorporating gamespecific information.

适应游戏的做法
 [修改分数比值] Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged.
 [frameskipping technique] The agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames.
Training and Stability
 如何评估训练
Since our evaluation metric is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training.
 The average total reward evolves during training are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress.
 More stable, metric is the policy’s estimated actionvalue function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state.
 收敛性缺乏理论依据保证
This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.
Visualizing the Value Function
 尽管理论依据少，但通过 predicted value function 和实际游戏界面的比较和关联，来证明算法的有效性。
Figure 3 demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events.
Main Evaluation

和现有的算法、做法比较（理论和实际）
We compare our results with the best performing methods from the RL literature.
 找出其他算法的缺陷，或者验证算法提高点。
Note that both of these methods incorporate significant prior knowledge about the visual problem by using background sub traction and treating each of the 128 colors as a separate channel. In contrast, our agents only receive the raw RGB screenshots as input and must learn to detect objects on their own.
Our approach (labeled DQN) outperforms the other learning methods by a substantial margin on all seven games despite incorporating almost no prior knowledge about the inputs.
 找出其他算法的缺陷，或者验证算法提高点。

人机比较
In addition to the learned agents, we also report scores for an expert human game player and a policy that selects actions uniformly at random.

和 evolutionary policy search 比较
We also include a comparison to the evolutionary policy search approach from [8] in the last three rows of table.
Conclusion
 贡献
This paper introduced a new deep learning model for reinforcement learning, and demonstrated its ability to master difficult control policies for Atari 2600 computer games, using only raw pixels as input.
We also presented a variant of online Qlearning that combines stochastic minibatch up dates with experience replay memory to ease the training of deep networks for RL.