Deep Dyna-Q: Reading Notes

Technology, 2022-07-11

Paper reading: Deep Dyna-Q

Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

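Year: 2018

Author affiliations: Microsoft Research, The Chinese University of Hong Kong

Source code: https://github.com/MiuLab/DDQ

[Note: this code is quite old; it is implemented in Python 2 with plain NumPy.]

Reference: https://zhuanlan.zhihu.com/p/50223176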

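Summary: In task-oriented dialogue, RL-based dialogue policy learning (POL) requires the agent to interact with people. Previous work has relied on user simulators, but a simulator differs from real users and lacks the conversational complexity of real interaction, so the trained agent is inevitably shaped by the simulator's design. Moreover, there is no suitable metric for judging how good a user simulator is. The authors therefore propose a new strategy built on a small amount of real user interaction data, which is used in two ways: through supervised learning, to make the world model (i.e., the environment) behave more like a real user (world model learning); and directly, through RL, to improve the policy (direct RL). Second, the authors propose Deep Dyna-Q (DDQ) as the RL algorithm. Because the world model is used for planning (indirect RL on simulated experience), DDQ can be viewed as a model-based RL method; in the paper's words, it incorporates planning for task-completion dialogue policy learning.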

DDQ

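DDQ consists of five components:

1. LSTM-NLU: identifies user intents and extracts the associated slots

2. DSTer: tracks the dialogue state

3. POL: selects the next action based on the current dialogue state

4. Model-based NLG: converts the action into a response

5. World model: generates simulated user actions and simulated rewards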

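Algorithm:

Initialize the dialogue policy $Q$ and the world model $M$ (both trained on pre-collected human interaction data).

The three sub-processes are then trained in an iterative training procedure.

The pseudocode is hard to follow when read cold, so the three sub-processes are explained one by one first, and the pseudocode is revisited at the end.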

(1) Direct Reinforcement Learning

Goal: improve the policy directly from collected real experience.

Variables involved:

$s$: current dialogue state

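$a$: action to execute, chosen $\epsilon$-greedily from $Q(s, a^{'})$: with probability $\epsilon$ a random action, otherwise $a = \arg\max_{a^{'}} Q(s, a^{'})$. Here $a^{'}$ ranges over the candidate actions; it is not the previous action.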

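$r$: reward received

$a^u$: next user response

$s^{'}$: updated state

These five quantities form one experience, $(s, a, r, a^u, s^{'})$.

Experiences are stored in the replay buffer $D^u$.

The Q-function is trained in the standard DQN fashion. Note that the target Q-function is only periodically updated (that is, after all three sub-processes have finished).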
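To make the action selection and the DQN target concrete, here is a minimal sketch in plain Python. It is not taken from the authors' code: the function names, the `done` flag, and the discount `gamma=0.9` are assumptions made for illustration only.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else argmax_a' Q(s, a')."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def td_target(reward, next_q_target, done, gamma=0.9):
    """One-step DQN target computed from the (periodically updated) target network.

    next_q_target holds Q'(s', a') for every candidate action a'.
    gamma and the done flag are illustrative assumptions.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_target)

def squared_td_error(q_sa, target):
    """Per-sample loss that the Q-network would minimize."""
    return (target - q_sa) ** 2

# Hypothetical Q-values for three candidate agent actions in state s.
q_values = [0.1, 0.7, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
target = td_target(reward=-1, next_q_target=[0.5, 2.0], done=False, gamma=0.9)
loss = squared_td_error(q_values[greedy_action], target)
```

With `epsilon=0.0` the selection is purely greedy; in training, a nonzero `epsilon` is what lets new behavior enter the real experience pool.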

(2) Planning

Goal: the world model is employed to generate simulated experience that can be used to improve the dialogue policy.

In fact, the DDQ model uses two replay buffers:

$D^u$ for storing real experience

$D^s$ for simulated experience.

Simulated experience is used in exactly the same way as real experience in the first sub-process, i.e., for DQN updates. So only the generation of simulated experience is described here.

Before each dialogue starts, a user goal $G = (C, R)$ is initialized,

where $C$ is a set of constraints and $R$ is a set of requests. Both are essentially slots, although the request slots include a few requestable slots that the constraint slots do not.

The first user action $a^u$ can be either a request or an inform dialogue act.
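As a rough illustration of the user goal $G = (C, R)$ and of the first user action, the sketch below uses a small dataclass. The slot names, the dictionary layout of a dialogue act, and the 50/50 choice between request and inform are all hypothetical; the notes above only say that the first action can be either kind.

```python
import random
from dataclasses import dataclass, field

@dataclass
class UserGoal:
    # C: constraint slots with the values the user insists on.
    constraints: dict = field(default_factory=dict)
    # R: slots whose values the user wants the agent to provide.
    requests: list = field(default_factory=list)

def first_user_action(goal, rng=random):
    """Open the dialogue with either a request or an inform act.

    Assumes the goal has at least one constraint or one request.
    """
    can_request = bool(goal.requests)
    can_inform = bool(goal.constraints)
    if can_request and (not can_inform or rng.random() < 0.5):
        return {"act": "request", "slot": rng.choice(goal.requests)}
    slot = rng.choice(sorted(goal.constraints))
    return {"act": "inform", "slot": slot, "value": goal.constraints[slot]}

# Hypothetical movie-ticket goal; slot names are made up for this example.
goal = UserGoal(
    constraints={"moviename": "some movie", "numberofpeople": "2"},
    requests=["theater", "starttime"],
)
act = first_user_action(goal, random.Random(0))
```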

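In each turn of the dialogue, the world model $M$ takes the current state $s$ and the last agent action $a$ as input and generates $a^u$, $r$, and $t$ (where $t$ is a binary variable indicating whether the dialogue terminates). Producing these outputs is all that $M$ does.

The structure of $M$ is simple: an MLP. $s$ and $a$ are concatenated and passed through a shared linear layer to produce a hidden vector, which then feeds three separate linear output heads for $r$, $a^u$, and $t$. Here $r$ is a scalar, $a^u$ is a softmax vector, and $t$ is a binary classification output (a sigmoid in the paper).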
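The shared-layer-plus-three-heads structure can be sketched in a few lines of dependency-free Python. This is only a toy forward pass under stated assumptions: the dimensions, the uniform initialization, the `tanh` on the hidden layer, and the parameter dictionary layout are illustrative choices, not the authors' implementation.

```python
import math
import random

def linear(x, W, b):
    """Dense layer: returns W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary choice)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def world_model_forward(s, a_onehot, params):
    """Map (state, last agent action) to (user-action probs, reward, P(terminate))."""
    x = list(s) + list(a_onehot)                        # concatenate s and a
    h = [math.tanh(v) for v in linear(x, *params["shared"])]
    r = linear(h, *params["reward"])[0]                 # regression head
    a_u = softmax(linear(h, *params["user_action"]))    # classification head
    t = sigmoid(linear(h, *params["terminal"])[0])      # binary head
    return a_u, r, t

# Toy sizes, chosen only for the example.
rng = random.Random(0)
state_dim, n_agent_actions, hidden_dim, n_user_actions = 4, 3, 5, 6
params = {
    "shared": init_layer(hidden_dim, state_dim + n_agent_actions, rng),
    "reward": init_layer(1, hidden_dim, rng),
    "user_action": init_layer(n_user_actions, hidden_dim, rng),
    "terminal": init_layer(1, hidden_dim, rng),
}
a_u, r, t = world_model_forward([0.2, 0.0, 1.0, 0.5], [0, 1, 0], params)
```

During planning, a simulated user action would be drawn from (or taken as the argmax of) `a_u`, and the dialogue would end once `t` crosses a threshold or the turn limit is hit.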

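(3) World Model Learning

Goal: train $M$.

As noted above, $r$ is a scalar while $a^u$ and $t$ are classification outputs, so training $M$ is a multi-task problem (two classification tasks and one regression task). $M$ is optimized on real experience, i.e., on samples from the replay buffer $D^u$.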
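One way to picture the multi-task objective is as a sum of three per-sample losses. The sketch below assumes cross-entropy for $a^u$, squared error for $r$, binary cross-entropy for $t$, and equal weights on the three terms; the weighting and the function name are assumptions, not something stated in these notes.

```python
import math

def world_model_loss(a_u_probs, r_pred, t_prob, a_u_true, r_true, t_true, eps=1e-12):
    """Per-sample multi-task loss for the world model (equal weights assumed).

    a_u_probs: predicted distribution over user actions
    a_u_true:  index of the user action observed in the real experience
    t_true:    1 if the real dialogue terminated at this turn, else 0
    """
    ce = -math.log(a_u_probs[a_u_true] + eps)             # user-action classification
    mse = (r_pred - r_true) ** 2                          # reward regression
    bce = -(t_true * math.log(t_prob + eps)
            + (1 - t_true) * math.log(1.0 - t_prob + eps))  # termination classification
    return ce + mse + bce

# A prediction close to the real sample versus one far from it.
good = world_model_loss([0.05, 0.9, 0.05], 1.9, 0.95, a_u_true=1, r_true=2.0, t_true=1)
bad = world_model_loss([0.9, 0.05, 0.05], -3.0, 0.1, a_u_true=1, r_true=2.0, t_true=1)
```

Averaging this quantity over a minibatch drawn from $D^u$ gives the training signal for $M$.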

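(4) Summary: the algorithm pseudocode

It is now readable. The core trainable parameters are the DQN network of the RL algorithm and the network $M$ that generates simulated experience. In the initialization phase, both $Q$ and $M$ rely on pre-training. The replay buffer idea is classic; here it is warm-started with Replay Buffer Spiking (RBS). Initially the real experience pool is filled (with 100 dialogues of experience, as stated later) and the simulated experience pool is empty. Iterative training then begins. Each iteration runs the three sub-processes in order, for $N$ iterations in total. Step 1: following standard DQN, the agent acts $\epsilon$-greedily to add new experience to the real experience pool, then samples from that pool and performs one update of the Q-function. Step 2: sample from the real experience pool and update the model $M$. Step 3 (repeated $K$ times): use $M$ to generate a batch of simulated experience and store it in the simulated experience pool, then sample from that pool and update the Q-function once. These three steps make up one iteration, at the end of which the target Q-function is updated, and the new Q-function is used in the next iteration.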

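[Note: the hyperparameters involved are $N$, $K$, and $L$. $N$ is the number of iterations. $K$ is the number of times sub-process three (planning) runs within one iteration; if the world model is good enough to simulate the real environment accurately, $K$ can be set large to speed up learning. $L$ is the maximum number of turns allowed in a dialogue; when $L$ is reached, $t$ is forced to true.]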
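The control flow of one training run, as described in the summary above, can be sketched as a loop over callbacks. Everything here is a skeleton under assumptions: the callback names are hypothetical, the updates are stubs that only count calls, and the buffers are plain lists.

```python
def ddq_train(n_iterations, k_planning_steps, run_real_dialogue, run_simulated_dialogue,
              update_q, update_world_model, sync_target, real_buffer, sim_buffer):
    """Skeleton of the N-iteration loop; all callbacks are supplied by the caller."""
    for _ in range(n_iterations):
        # (1) Direct RL: collect real experience, then update Q on the real pool.
        real_buffer.extend(run_real_dialogue())
        update_q(real_buffer)
        # (2) World model learning: fit M on the real pool.
        update_world_model(real_buffer)
        # (3) Planning: K rounds of simulated experience, each followed by a Q update.
        for _ in range(k_planning_steps):
            sim_buffer.extend(run_simulated_dialogue())
            update_q(sim_buffer)
        # The target Q-network is refreshed once per iteration.
        sync_target()

def make_counter(counts, name):
    """Stub update function that only records how often it was called."""
    def stub(*args):
        counts[name] += 1
    return stub

counts = {"q": 0, "m": 0, "sync": 0}
real_pool, sim_pool = [], []
ddq_train(
    n_iterations=3,
    k_planning_steps=5,
    run_real_dialogue=lambda: [("s", "a", -1, "a_u", "s_next")],
    run_simulated_dialogue=lambda: [("s", "a", -1, "a_u", "s_next")],
    update_q=make_counter(counts, "q"),
    update_world_model=make_counter(counts, "m"),
    sync_target=make_counter(counts, "sync"),
    real_buffer=real_pool,
    sim_buffer=sim_pool,
)
```

With $N = 3$ and $K = 5$, the Q-function is updated $3 \times (1 + 5) = 18$ times while only three real dialogues are collected, which is the point of planning: more policy updates per unit of real interaction.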

Experiments and Results

1 Dataset

The task is movie-ticket booking, with 11 dialogue acts and 16 slots, and 280 annotated dialogues with an average length of 11 turns.

Details:

A dialogue is considered successful only when a movie ticket is booked successfully and when the information provided by the agent satisfies all the user's constraints. At the end of each dialogue, the agent receives a positive reward of 2 * L for success, or a negative reward of -L for failure, where L is the maximum dialogue length, set to 40. In each turn, the agent receives a reward of -1 to encourage shorter dialogues. User goals are generated from the annotated data and stored in a database, from which they are later drawn directly.
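The reward scheme above is simple arithmetic, sketched here as a small helper. The function name is made up, and summing the per-turn penalty over every turn of the dialogue (including the last one) is an assumption about how the pieces combine.

```python
def dialogue_return(success, n_turns, max_turns=40):
    """Total (undiscounted) reward of one dialogue under the scheme described above.

    Terminal reward: +2L on success, -L on failure (L = max_turns).
    Per-turn reward: -1 for each of the n_turns turns (assumed to include the last).
    """
    if not 0 < n_turns <= max_turns:
        raise ValueError("n_turns must be between 1 and max_turns")
    terminal = 2 * max_turns if success else -max_turns
    return terminal - n_turns

success_return = dialogue_return(success=True, n_turns=11)   # 80 - 11 = 69
failure_return = dialogue_return(success=False, n_turns=40)  # -40 - 40 = -80
```

A successful dialogue of average length therefore nets far more than any failure, while the per-turn penalty still separates a short success from a long one.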

2 Dialogue Agents for Comparison

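| No. | Variant | Explanation |
| 1 | DQN | Only sub-process one (direct RL); used to examine the role of the world model |
| 2 | DDQ(K) | The proposed algorithm |
| 3 | DDQ(K, rand-init $\theta_M$) | World model randomly initialized; used to examine the effect of pre-training the world model on human data |
| 4 | DDQ(K, fixed $\theta_M$) | World model parameters fixed and never updated; used to examine the effect of joint optimization on the world model |
| 5 | DQN(K) | DQN with K times more real experiences; effectively an upper bound for DDQ(K) |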

Variants 4 and 5 are evaluated in the simulation setting only.

Implementation Details

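1. Both buffers have a size of 5000

2. The parameters of $Q$ and $M$ are optimized with one-step ($Z=1$) updates on 16-tuple minibatches

3. The hyperparameter $L = 40$

4. RBS is used

5. At initialization, the real experience pool is filled with 100 dialogues of experience

The agents above are evaluated under two evaluation settings: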
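A bounded replay buffer with the sizes listed above might look like the following sketch. The FIFO eviction via `deque(maxlen=...)` and uniform sampling are assumptions; the notes do not say how the original code evicts or samples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest entries are dropped first (assumed)."""

    def __init__(self, capacity=5000):
        self.data = deque(maxlen=capacity)

    def add(self, experience):
        self.data.append(experience)

    def sample(self, batch_size=16, rng=random):
        """Uniformly sample a minibatch without replacement."""
        k = min(batch_size, len(self.data))
        return rng.sample(list(self.data), k)

    def __len__(self):
        return len(self.data)

# Two pools, as in DDQ: D^u for real and D^s for simulated experience.
real_buffer = ReplayBuffer(capacity=5000)
sim_buffer = ReplayBuffer(capacity=5000)
for i in range(6000):
    # Placeholder tuples standing in for (s, a, r, a_u, s').
    real_buffer.add((i, "a", -1, "a_u", i + 1))
batch = real_buffer.sample(batch_size=16, rng=random.Random(0))
```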

Simulated User Evaluation

Human-in-the-Loop Evaluation

3 Simulated User Evaluation

In this setting the dialogue agents are optimized by interacting with user simulators, instead of real users. Thus, the world model is learned to mimic user simulators.

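The authors therefore use a publicly available user simulator (2016). During training, the simulator provides the agent with a simulated user response in each turn and a reward signal at the end of the dialogue.

The simulation results are as follows: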

Evaluation metrics: success rate, average reward, and average number of turns
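For reference, the three metrics reduce to simple averages over evaluation dialogues. The record layout (`success`, `reward`, `turns`) and the sample values below are hypothetical and are not results from the paper.

```python
def summarize(dialogues):
    """Compute success rate, average reward, and average number of turns."""
    n = len(dialogues)
    if n == 0:
        raise ValueError("need at least one dialogue")
    return {
        "success_rate": sum(1 for d in dialogues if d["success"]) / n,
        "avg_reward": sum(d["reward"] for d in dialogues) / n,
        "avg_turns": sum(d["turns"] for d in dialogues) / n,
    }

# Two made-up evaluation dialogues, for illustration only.
episodes = [
    {"success": True, "reward": 69, "turns": 11},
    {"success": False, "reward": -80, "turns": 40},
]
stats = summarize(episodes)
```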

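Figure 4: learning curves.

The results show that DDQ consistently outperforms DQN. Without planning, the DQN agent needs about 180 training epochs to reach a 50% success rate, whereas DDQ(10) needs only 50. The paper also notes that the best value of $K$ has to balance the quality of the world model against the amount of simulated experience it is used to generate. This is a non-trivial optimization problem, because the agent and the world model are both updated continually during training, so the optimal $K$ has to be adjusted accordingly (left as future work).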

4 Human-in-the-Loop Evaluation

In each dialogue session, one of the agents was randomly picked to converse with a user

[Note: the purpose of this setting is to examine how well the agent can adapt its policy on the fly by interacting with real users via deep RL.]

Conclusion

Future research topics:

1. Exploration in planning

2. Handling the domain extension problem

3. The conflict between exploration and exploitation

Note: In a planning context, exploration means trying actions that may improve the world model, whereas exploitation means trying to behave in the optimal way given the current model. To this end, we want the agent to explore in the optimal way given the current model.
