Reading Notes: Socially Aware Motion Planning with Deep Reinforcement Learning

[toc]

Socially Aware Motion Planning with Deep Reinforcement Learning

Paper download link


Brief review: earlier methods used feature-matching techniques to describe and imitate pedestrian trajectories, but those features vary from person to person, so the generated trajectories are not ideal. This paper points out that while it is difficult to specify what a robot should do when interacting with people during navigation (the precise mechanisms of human navigation), it is straightforward to specify what it should not do (violations of social norms). In particular, using deep reinforcement learning, the paper proposes a time-efficient navigation policy that respects common social norms.

This work notes that while it is challenging to directly specify the details of what to do (precise mechanisms of human navigation), it is straightforward to specify what not to do (violations of social norms). Specifically, using deep reinforcement learning, this work develops a time-efficient navigation policy that respects common social norms.

In addition, this paper builds on the authors' earlier work on multiagent collision avoidance with deep reinforcement learning, introducing socially aware behaviors into multiagent systems. The main contribution is how to introduce and integrate social behaviors into CADRL. Roughly speaking, SA-CADRL = SA (socially aware) + CADRL (collision avoidance with deep reinforcement learning).

This work extends the collision avoidance with deep reinforcement learning framework (CADRL) to characterize and induce socially aware behaviors in multiagent systems.


INTRODUCTION

Evolution of Human-Robot Interaction Approaches

  1. Treat pedestrians as dynamic obstacles with simple kinematics and apply specific reactive collision-avoidance rules.

    A common approach treats pedestrians as dynamic obstacles with simple kinematics, and employs specific reactive rules for avoiding collision.

    • Drawback: these methods do not model human behavior, so they can produce unsafe or unnatural motions, especially when the robot moves at close to human walking speed.

    Since these methods do not capture human behaviors, they sometimes generate unsafe/unnatural movements, particularly when the robot operates near human walking speed.

  2. Use more sophisticated motion models to reason about nearby pedestrians' hidden intents and produce a set of predicted trajectories; then use classical path planning algorithms to generate a collision-free path for the robot.

    More sophisticated motion models have been proposed, which would reason about the nearby pedestrians’ hidden intents to generate a set of predicted paths. Subsequently, classical path planning algorithms would be employed to generate a collision-free path for the robot.

    • Drawback: splitting navigation into disjoint prediction and planning steps can cause the freezing robot problem, where the robot cannot find any feasible action because the predicted trajectories mark most of the space as untraversable.

    Separating the navigation problem into disjoint prediction and planning steps can lead to the freezing robot problem, in which the robot fails to find any feasible action because the predicted paths could mark a large portion of the space untraversable.

  • My comment (MC): although the authors consider this approach unreasonable, it is currently the popular practice in industry. Splitting navigation into layered modules keeps the interfaces between upstream and downstream transparent, which makes the system easy to port and debug.
  3. Cooperation

To address the problems above, the authors propose accounting for cooperation, i.e., modeling/anticipating the impact of the robot's motion on nearby pedestrians.

A key to resolving this problem is to account for cooperation, that is, to model/anticipate the impact of the robot’s motion on the nearby pedestrians.

  • Existing research on cooperation-based social navigation falls into two categories: model-based approaches and learning-based approaches.

    Existing work on cooperative, socially compliant navigation can be broadly classified into two categories, namely model-based and learning-based.

    • Model-based approaches are typically extensions of multiagent collision avoidance algorithms, with extra parameters added to account for social interactions.

    Model-based approaches are typically extensions of multiagent collision avoidance algorithms, with additional parameters introduced to account for social interactions.

    Drawbacks of model-based methods: it is unclear whether pedestrians actually follow the assumed geometric models; potential-field parameters must be tuned for different pedestrians; and the planned trajectories can be oscillatory.

    • Learning-based approaches aim to develop a policy by matching feature statistics.

    Learning-based approaches aim to develop a policy that emulates human behaviors by matching feature statistics. In particular, Inverse Reinforcement Learning (IRL) has been applied to learn a cost function from human demonstration (teleoperation), and a probability distribution over the set of joint trajectories with nearby pedestrians.

    Learning-based methods track human behavior more closely than model-based ones, but at a higher computational cost. Moreover, the feature statistics vary significantly from person to person, which raises concerns about generalization to different settings.

In short, existing methods try to model or replicate the detailed mechanisms of social compliance, which remain difficult to quantify due to the stochasticity of pedestrian behavior.

In short, existing works are mostly focused on modeling and replicating the detailed mechanisms of social compliance, which remains difficult to quantify due to the stochasticity in people’s behaviors.

The authors argue instead that humans follow a set of simple social norms, such as passing on the right. They characterize these properties in a reinforcement learning framework and show that human-like navigation conventions emerge from solving a cooperative collision avoidance problem.

Building on a recent paper, we characterize these properties in a reinforcement learning framework, and show that human-like navigation conventions emerge from solving a cooperative collision avoidance problem.

(Figure: Symmetries in multiagent collision avoidance)

BACKGROUND

Collision Avoidance with Deep Reinforcement Learning

First, multiagent collision avoidance can be formulated as a sequential decision making problem in a reinforcement learning framework.

A multiagent collision avoidance problem can be formulated as a sequential decision making problem in a reinforcement learning framework.

  • Reinforcement learning problem formulation

The theoretical analysis in this part is excellent; I suggest reading it several times to fully appreciate it.

  1. To capture the uncertainty in nearby pedestrians' intents, the state vector is split into an observable part and an unobservable part. The observable part includes a pedestrian's position, velocity, and size; the unobservable part includes the goal position, preferred speed, and heading.

  2. The objective is thus to develop a policy that minimizes the expected time to reach the goal while avoiding collisions with nearby pedestrians.

  3. On this basis, the problem is expressed in the RL framework in terms of the joint configuration with nearby pedestrians. A reward function rewards the agent for reaching its goal and penalizes it for colliding with others (a minimal reward sketch appears after this list).

    In particular, a reward function can be specified to reward the agent for reaching its goal and penalize the agent for colliding with others.

  • The state-transition model accounts for the other agents' hidden intents, and therefore indirectly captures the uncertainty in their motion.

    The unknown state-transition model takes into account the uncertainty in the other agent’s motion due to its hidden intents.

  4. Solving this RL problem then amounts to finding the optimal value function, which encodes an estimate of the expected time to goal; the optimal policy can be retrieved from the value function by a one-step lookahead.

Solving the RL problem amounts to finding the optimal value function that encodes an estimate of the expected time to goal.
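To make the formulation concrete, below is a minimal sketch of such a goal/collision reward in Python; the constants and the shape of the discomfort term are illustrative assumptions, not the authors' exact values.

```python
import numpy as np

def cadrl_style_reward(pos, goal, other_pos, radius, other_radius,
                       goal_tol=0.1, discomfort_dist=0.2):
    """Sketch of a CADRL-style reward: reach the goal (+), collide (-),
    small penalty for getting uncomfortably close. Constants are illustrative."""
    gap = np.linalg.norm(pos - other_pos) - radius - other_radius
    if gap < 0.0:                                  # the two agents overlap: collision
        return -0.25
    if np.linalg.norm(pos - goal) < goal_tol:      # reached the goal position
        return 1.0
    if gap < discomfort_dist:                      # too close: penalty grows as gap shrinks
        return -0.1 * (discomfort_dist - gap) / discomfort_dist
    return 0.0                                     # otherwise neutral
```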

The main challenge in finding the optimal value function is that the joint state is a continuous, high-dimensional vector, which makes it impractical to discretize and enumerate the state space.

A major challenge in finding the optimal value function is that the joint state s^jn is a continuous, high-dimensional vector, making it impractical to discretize and enumerate the state space.

More recently, deep neural networks have been used to solve this RL problem by representing the value function over the high-dimensional state space, reaching human-level performance.

Recent advances in reinforcement learning address this issue by using deep neural networks to represent value functions in high-dimensional spaces, and have demonstrated human-level performance on various complex tasks.
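For completeness, here is a hedged sketch of how the policy is read off a learned value network by one-step lookahead; `propagate`, `value_net`, and `reward` are placeholder callables I introduce for illustration, and the discounting detail is simplified relative to the paper.

```python
import numpy as np

def select_action(joint_state, actions, reward, propagate, value_net,
                  gamma=0.97, dt=0.25):
    """One-step lookahead on a learned value function (sketch): pick the action
    maximizing immediate reward plus the discounted value of the forward-simulated
    joint state."""
    best_action, best_value = None, -np.inf
    for a in actions:
        next_state = propagate(joint_state, a, dt)   # simulate one time step ahead
        q = reward(joint_state, a) + gamma ** dt * value_net(next_state)
        if q > best_value:
            best_action, best_value = a, q
    return best_action
```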

MC: so far, the authors have been reviewing their prior work, the collision avoidance with deep reinforcement learning framework (CADRL); next, they build on it to introduce socially aware behaviors among multiple agents.

Finally, the authors' open-source implementation: mit-acl/cadrl_ros

Characterization of Social Norms

Rather than quantifying human behaviors directly, this paper argues that complex normative motion patterns can arise from a set of simple local interactions. MC: I agree with this view; decomposing a complex problem into smaller ones makes it easier to solve.

Rather than trying to quantify human behaviors directly, this work notes that the complex normative motion patterns can be a consequence of simple local interactions.

The paper therefore further conjectures that, rather than being a set of precisely defined procedural rules, social norms emerge from a time-efficient, reciprocal collision avoidance mechanism.

Thus, we conjecture that rather than a set of precisely defined procedural rules, social norms are the emergent behaviors from a time-efficient, reciprocal collision avoidance mechanism.

Reciprocity implicitly encodes a model of the other agents’ behavior, which is the key for enabling cooperation without explicit communication.

Somewhat philosophical: the reciprocity principle in local collision avoidance gives rise to what we call social norms of navigation. The authors further show experimentally that CADRL, with no rules imposed, already exhibits certain navigation conventions. (This can serve as a research hypothesis.)

While no behavioral rules were imposed in the problem formulation, CADRL policy exhibits certain navigation conventions.

On this basis, the authors argue that the navigation norms humans already follow can be acquired by learning multiagent collision avoidance.

Prior work reports that human navigation tends to be cooperative and time-efficient. Building on CADRL, these two properties are encoded through the min-time reward function and the reciprocity assumption (each agent is assumed to adopt the same learned optimal behavior).

Existing works have reported that human navigation (or teleoperation of a robot) tends to be cooperative and time-efficient. This work notes that these two properties are encoded in the CADRL formulation through using the min-time reward function and the reciprocity assumption.

The authors also point out that the cooperative behaviors emerging from a CADRL solution are not consistent with human interpretation, and they address this issue next.

However, the cooperative behaviors emerging from a CADRL solution are not consistent with human interpretation. The next section will address this issue and present a method to induce behaviors that respect human social norms.

APPROACH

This section first describes how to shape normative behaviors for a two-agent system in the RL framework, and then generalizes the method to multiagent scenarios.

We first describe a strategy for shaping normative behaviors for a two-agent system in the RL framework, and then generalize the method to multiagent scenarios.

Inducing Social Norms

MC: this is basically consistent with my own solution, except that I did not use a neural network.

Existing social norms are only one of many ways to resolve a symmetrical collision avoidance scenario. To induce a particular norm, a small bias is introduced into the RL training that favors one set of behaviors over others.

This work notes that social norms are one of the many ways to resolve a symmetrical collision avoidance scenario. To induce a particular norm, a small bias can be introduced in the RL training process in favor of one set of behaviors over others.

As the authors note, the advantage of this approach is that violations of a particular behavior are usually easy to identify, and the specification need not be precise. The added penalty breaks the symmetry of the collision avoidance problem, so the learned policy favors behaviors that respect the desired social norm.

The advantage of this approach is that violations of a particular social norm are usually easy to specify; and this specification need not be precise. This is because the addition of a penalty breaks the symmetry in the collision avoidance problem, thereby favoring behaviors respecting the desired social norm.
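As an illustration of how such an imprecise "what not to do" penalty could look, here is a hypothetical sketch for a right-handed passing norm; the geometric test, distance threshold, and penalty magnitude are my own assumptions, not the paper's penalty-set definition.

```python
import numpy as np

def right_hand_pass_penalty(pos, vel, other_pos, other_vel,
                            penalty=-0.05, near_dist=3.0):
    """Hypothetical norm-inducing penalty (sketch): when two agents meet roughly
    head-on and the other agent sits on the ego agent's right, the ego agent is
    passing on the 'wrong' side under a right-handed norm."""
    rel = other_pos - pos
    if np.linalg.norm(rel) > near_dist:
        return 0.0                                  # only shape nearby interactions
    approaching = np.dot(vel, other_vel) < 0.0      # velocities roughly opposed
    # z-component of vel x rel > 0 means the other agent is to the ego's left
    other_on_left = vel[0] * rel[1] - vel[1] * rel[0] > 0.0
    if approaching and not other_on_left:
        return penalty                              # small bias, not a hard rule
    return 0.0
```

The point is only to break the symmetry slightly; as the quote above notes, the specification need not be precise as long as the desired behaviors themselves are not penalized.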

Finally, the training results show that policies resembling human behavior are learned, such as the left-handed and right-handed norms.

As long as training converges, the penalty sets’ size does not have a major effect on the learned policy. This is expected because the desired behaviors are not in the penalty set.

Training a Multiagent Value Network

Because the training above involves only two agents, it is difficult to induce higher-order behaviors, such as those needed in multiagent environments. This part explains how to train on multiagent scenarios directly.

Since training was solely performed on a two-agent system, it was difficult to encode/induce higher order behaviors, such as accounting for the relations between nearby agents. This work addresses this problem by developing a method that allows for training on multiagent scenarios directly.

To capture the symmetry of the multiagent system, the paper uses a neural network with weight-sharing and max-pooling layers. The network handles four agents, and the observable states of the three nearby agents can be swapped without changing the output.

See the original paper for the detailed network design.

To capture the multiagent system’s symmetrical structure, a neural network with weight-sharing and max-pooling layers is employed.

(Figure: Network structure for multiagent scenarios)
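A minimal sketch of one way to realize the weight-sharing plus max-pooling idea, written in PyTorch (which I assume here; layer sizes, state dimensions, and the exact wiring differ from the paper's architecture): a shared MLP embeds each nearby agent's observable state, a max over the agent dimension makes the result order-invariant, and the pooled feature is combined with the ego agent's state to predict a value.

```python
import torch
import torch.nn as nn

class MultiagentValueNet(nn.Module):
    """Permutation-invariant value network (sketch): shared weights embed each
    nearby agent, max-pooling aggregates them, and an MLP maps the ego state plus
    the pooled feature to a scalar value. Dimensions are illustrative."""
    def __init__(self, ego_dim=9, other_dim=10, embed_dim=64, hidden=64):
        super().__init__()
        self.embed = nn.Sequential(                       # shared across neighbors
            nn.Linear(other_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU())
        self.value = nn.Sequential(
            nn.Linear(ego_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, ego_state, other_states):
        # ego_state: (batch, ego_dim); other_states: (batch, n_others, other_dim)
        embedded = self.embed(other_states)               # (batch, n_others, embed_dim)
        pooled, _ = torch.max(embedded, dim=1)            # symmetric in nearby agents
        return self.value(torch.cat([ego_state, pooled], dim=1))
```

Because the max is taken over the agent dimension, permuting the three nearby agents leaves the output unchanged, which is exactly the symmetry the paper wants the network to respect.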

During training, trajectories are generated first and then converted into experience sets.

The trajectories are then turned into state-value pairs and assimilated into the experience sets.

Differences in training between CADRL and SA-CADRL (see the sketch after this list):

  • Two experience sets are used to distinguish between trajectories that reached the goals and those that ended in a collision.
  • During the training process, trajectories generated by SA-CADRL are reflected in the x-axis with some probability.
    • This procedure exploits symmetry in the problem to explore different topologies more efficiently.
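A small sketch of the x-axis reflection step; the state layout [px, py, vx, vy, ...] is an assumption I make for illustration, and any other y components (e.g., the goal position) would need the same flip.

```python
import numpy as np

def reflect_about_x_axis(trajectory):
    """Reflect a planar trajectory about the x-axis (sketch).
    Assumes each row is [px, py, vx, vy, ...]; column indices are illustrative."""
    reflected = np.array(trajectory, dtype=float)  # copy, leave the input untouched
    reflected[:, 1] *= -1.0                        # flip y positions
    reflected[:, 3] *= -1.0                        # flip y velocities
    return reflected
```

Intuitively, a reflected trajectory swaps the left/right passing side, so the same experience helps cover both passing topologies.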

The authors build a switch into the network during training (a binary flag indicating whether the other agent is real or virtual), so an n-agent network can also be used in scenarios with p (p <= n) agents.

An n-agent network can be used to generate trajectories for scenarios with fewer agents.
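A hedged sketch of how the real/virtual flag might be used to evaluate an n-agent network with fewer agents; the flag position, dummy observation, and state layout are my assumptions, not the paper's specification.

```python
import numpy as np

def pad_with_virtual_agents(real_agent_states, n_slots, dummy_state):
    """Fill unused neighbor slots with a dummy observation plus a real/virtual
    flag appended as the last element (1 = real, 0 = virtual), so an n-agent
    value network can be queried with p <= n actual agents (sketch)."""
    rows = [np.append(s, 1.0) for s in real_agent_states]   # real nearby agents
    while len(rows) < n_slots:
        rows.append(np.append(dummy_state, 0.0))             # virtual placeholder
    return np.stack(rows)
```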

RESULTS

Computational Details (online performance and offline training)

The model shows good convergence and real-time performance, and produces time-efficient paths.

The size and connections in the multiagent network are tuned to obtain good performance (ensure convergence and produce time-efficient paths) while achieving real-time performance.

Simulation Results

Three comparison experiments: one without the social-norm reward, and two with the left-handed and right-handed norm-inducing rewards, respectively.

Three copies of four-agent SA-CADRL policies were trained, one without the norm inducing reward, one with the left-handed, and the other with the right-handed.

Hardware Experiment

Hardware setup

  • The differential-drive vehicle is outfitted with a Lidar for localization, three Intel Realsenses for free space detection, and four webcams for pedestrian detection.

A hardware demonstration video can be found here.

CONCLUSION

Contribution

  • In a reinforcement learning framework, a pair of simulated agents navigate around each other to learn a policy that respects human navigation norms, such as passing on the right and overtaking on the left in a right-handed system.
  • This approach is further generalized to multiagent (n > 2) scenarios through the use of a symmetrical neural network structure.
  • Moreover, SA-CADRL is implemented on robotic hardware, which enabled fully autonomous navigation at human walking speed in a dynamic environment with many pedestrians.

Future work