This column follows the order of https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html in its summaries.
MADDPG: [ paper | code ]
From the perspective of a single agent, the environment is non-stationary, because the other agents' policies are updated rapidly and remain unknown. MADDPG is a redesigned actor-critic algorithm built specifically to handle this ever-changing environment and the interactions between agents.
The algorithm is a natural extension of DDPG to multi-agent systems and follows the centralized-training, decentralized-execution framework. The key improvement: when modeling the Q-value function, actions sampled from the other agents' current policies are fed in as additional input, which resolves the non-stationarity of the environment in the multi-agent setting.
Let $\vec{o} = o_1, \ldots, o_N$ and $\vec{\mu} = \mu_1, \ldots, \mu_N$, with the policies parameterized by $\vec{\theta} = \theta_1, \ldots, \theta_N$.
The critic in MADDPG learns a centralized action-value function $Q_i^{\vec{\mu}}(\vec{o}, a_1, \ldots, a_N)$ for the $i$-th agent (i.e. for every agent), where $a_1 \in \mathcal{A}_1, \ldots, a_N \in \mathcal{A}_N$ are the actions of all agents. Each $Q_i^{\vec{\mu}}$, $i = 1, \ldots, N$, is learned independently, so each agent can have an arbitrary reward function, including conflicting rewards in a competitive setting. At the same time, every agent's own actor explores independently and updates its policy parameters $\theta_i$ independently.
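A minimal sketch of such a centralized critic: it receives the concatenated observations and actions of all agents and outputs a scalar $Q_i$. The MLP architecture and the `hidden` size here are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn


class CentralizedCritic(nn.Module):
    """Q_i(o_1..o_N, a_1..a_N): sees every agent's observation and action."""

    def __init__(self, n_agents: int, dim_obs: int, dim_act: int, hidden: int = 128):
        super().__init__()
        in_dim = n_agents * (dim_obs + dim_act)   # joint observation + joint action
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # scalar Q-value for agent i
        )

    def forward(self, joint_obs: torch.Tensor, joint_act: torch.Tensor) -> torch.Tensor:
        # joint_obs: (batch, n_agents * dim_obs); joint_act: (batch, n_agents * dim_act)
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))


# Each agent i owns its own critic (and its own reward), so N critics are built:
# critics = [CentralizedCritic(n_agents, dim_obs, dim_act) for _ in range(n_agents)]
```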
Actor update:

$$
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\vec{o}, a \sim D}\Big[\nabla_{a_i} Q_i^{\vec{\mu}}(\vec{o}, a_1, \ldots, a_N)\, \nabla_{\theta_i} \mu_{\theta_i}(o_i) \Big\rvert_{a_i = \mu_{\theta_i}(o_i)}\Big]
$$
where $D$ is the experience replay buffer, containing a large number of transitions $(\vec{o}, a_1, \ldots, a_N, r_1, \ldots, r_N, \vec{o}')$: given the current joint observation $\vec{o}$, each agent executes its action $a_1, \ldots, a_N$, receives its own reward $r_1, \ldots, r_N$, and the system moves to the next joint observation $\vec{o}'$.
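In code, this gradient step amounts to replacing only agent $i$'s stored action with the output of its current policy and descending $-Q_i$. A sketch under the hypothetical `CentralizedCritic` above, a deterministic `actors[agent_i]` network, and a batch sampled from $D$ (names are illustrative, not the repository's API):

```python
import torch


def actor_update(agent_i, actors, critic_i, actor_optim_i, obs_batch, act_batch):
    """One MADDPG-style actor step for agent i (sketch).

    obs_batch: (batch, n_agents, dim_obs) joint observations sampled from D
    act_batch: (batch, n_agents, dim_act) joint actions stored in D
    """
    batch_size = obs_batch.shape[0]

    # Replace only agent i's stored action by what its *current* policy would do,
    # a_i = mu_{theta_i}(o_i); the other agents' actions stay as sampled from D.
    new_actions = act_batch.clone()
    new_actions[:, agent_i, :] = actors[agent_i](obs_batch[:, agent_i, :])

    joint_obs = obs_batch.reshape(batch_size, -1)
    joint_act = new_actions.reshape(batch_size, -1)

    # Ascending Q_i is implemented as descending -Q_i.
    loss = -critic_i(joint_obs, joint_act).mean()

    actor_optim_i.zero_grad()
    loss.backward()
    actor_optim_i.step()
    return loss.item()
```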
Critic update:
$$
\begin{aligned}
\mathcal{L}(\theta_i) &= \mathbb{E}_{\vec{o}, a_1, \ldots, a_N, r_1, \ldots, r_N, \vec{o}'}\big[(Q_i^{\vec{\mu}}(\vec{o}, a_1, \ldots, a_N) - y)^2\big] & \\
\text{where } y &= r_i + \gamma\, Q_i^{\vec{\mu}'}(\vec{o}', a'_1, \ldots, a'_N)\big\rvert_{a'_j = \mu'_{\theta_j}} & \scriptstyle{\text{; the TD target}}
\end{aligned}
$$
where $\vec{\mu}'$ are the target policies, whose parameters are soft-updated with a delay.
[Every time a critic updates its parameters, it needs $(s, a = \mu(s), r, s_{\text{next}}, a' = \mu'(s_{\text{next}}))$ from all actors, where $a'$ comes from each actor's target policy.]
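Putting these definitions together, a sketch of the critic step: the target actors $\mu'$ pick the next actions $a'_j$ and the target critic evaluates them to form the TD target $y$. This is a hypothetical helper consistent with the loss above, not the repository's exact code:

```python
import torch
import torch.nn.functional as F


def critic_update(agent_i, critic_i, critic_target_i, actors_target,
                  critic_optim_i, batch, gamma=0.95):
    """One MADDPG-style critic step for agent i (sketch).

    batch: dict of tensors
      "obs"      (B, n_agents, dim_obs)  joint observation  o
      "act"      (B, n_agents, dim_act)  joint action       a_1..a_N
      "rew"      (B, n_agents)           per-agent rewards  r_1..r_N
      "next_obs" (B, n_agents, dim_obs)  next joint observation o'
    """
    B = batch["obs"].shape[0]

    with torch.no_grad():
        # a'_j = mu'_{theta_j}(o'_j): every agent's *target* policy acts on o'.
        next_act = torch.stack(
            [actors_target[j](batch["next_obs"][:, j, :])
             for j in range(len(actors_target))], dim=1)
        # y = r_i + gamma * Q'_i(o', a'_1, ..., a'_N)  -- the TD target
        y = batch["rew"][:, agent_i].unsqueeze(1) + gamma * critic_target_i(
            batch["next_obs"].reshape(B, -1), next_act.reshape(B, -1))

    current_q = critic_i(batch["obs"].reshape(B, -1),
                         batch["act"].reshape(B, -1))
    loss = F.mse_loss(current_q, y)

    critic_optim_i.zero_grad()
    loss.backward()
    critic_optim_i.step()
    return loss.item()
```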
To mitigate the high variance caused by interactions between competing or cooperating agents in the environment, MADDPG introduces one more technique: policy ensembles.
Train $K$ policies for each agent; randomly pick one of them to collect trajectories; use the ensemble gradient of all $K$ policies for the parameter update (a small sketch follows the summary below). In summary, MADDPG adds three components on top of DDPG to make it fit multi-agent environments:
Centralized critics + decentralized actors; agents can learn with estimated policies of the other agents; policy ensembles effectively reduce variance. The model consists of multiple DDPG networks, each learning a policy $\pi$ (actor) and an action value $Q$ (critic); each also keeps target networks for off-policy Q-learning.
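As an illustration of the policy-ensemble component listed above: each agent keeps $K$ sub-policies, one is drawn uniformly at random per episode for acting, and the overall objective is the expected return over the ensemble (the paper additionally keeps a separate replay buffer per sub-policy so each is trained on its own data). A hypothetical sketch, not the repository's code:

```python
import random
import torch.nn as nn


class PolicyEnsemble(nn.Module):
    """K sub-policies for one agent (sketch of MADDPG's policy-ensemble trick)."""

    def __init__(self, make_actor, k: int):
        super().__init__()
        self.actors = nn.ModuleList([make_actor() for _ in range(k)])
        self.k = k
        self.active = 0   # index of the sub-policy used in the current episode

    def new_episode(self):
        # Pick uniformly at random which sub-policy acts during this episode.
        self.active = random.randrange(self.k)

    def forward(self, obs):
        return self.actors[self.active](obs)


# usage sketch (Actor is the repository's actor network):
# ensemble = PolicyEnsemble(lambda: Actor(dim_obs, dim_act), k=3)
# ensemble.new_episode()      # once per episode
# action = ensemble(obs_i)    # during the episode
```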
Training follows the centralized-training, decentralized-execution scheme: each agent chooses, with its own policy, the action to execute in its current state, interacts with the environment, and stores the resulting experience in its own replay buffer. Once all agents have interacted with the environment, each agent samples experiences at random from the buffer to train its own networks.
To speed up learning, the critic network's input also includes the other agents' observations and actions; the critic parameters are updated by minimizing this loss, after which the actor network's parameters are updated by gradient descent (on the negative Q-value).
A stubborn problem in multi-agent reinforcement learning is that, since every agent's policy keeps being updated, the environment is dynamically unstable from the viewpoint of any particular agent. This is especially severe in competitive tasks, where an agent often overfits a strong policy against its current opponents. Such a strong policy is very brittle and is not what we want, because as the opponents' policies change, it struggles to adapt to the new opponent policies.
Omitted.
For details, see: https://github.com/xuehy/pytorch-maddpg
```python
from model import Critic, Actor
import torch as th
from copy import deepcopy
from memory import ReplayMemory, Experience
from torch.optim import Adam
from randomProcess import OrnsteinUhlenbeckProcess
import torch.nn as nn
import numpy as np
from params import scale_reward


def soft_update(target, source, t):
    for target_param, source_param in zip(target.parameters(),
                                          source.parameters()):
        target_param.data.copy_(
            (1 - t) * target_param.data + t * source_param.data)


def hard_update(target, source):
    for target_param, source_param in zip(target.parameters(),
                                          source.parameters()):
        target_param.data.copy_(source_param.data)


class MADDPG:
    def __init__(self, n_agents, dim_obs, dim_act, batch_size,
                 capacity, episodes_before_train):
        self.actors = [Actor(dim_obs, dim_act) for i in range(n_agents)]
        self.critics = [Critic(n_agents, dim_obs,
                               dim_act) for i in range(n_agents)]
        self.actors_target = deepcopy(self.actors)
        self.critics_target = deepcopy(self.critics)

        self.n_agents = n_agents
        self.n_states = dim_obs
        self.n_actions = dim_act
        self.memory = ReplayMemory(capacity)
        self.batch_size = batch_size
        self.use_cuda = th.cuda.is_available()
        self.episodes_before_train = episodes_before_train

        self.GAMMA = 0.95
        self.tau = 0.01

        self.var = [1.0 for i in range(n_agents)]
        self.critic_optimizer = [Adam(x.parameters(),
                                      lr=0.001) for x in self.critics]
        self.actor_optimizer = [Adam(x.parameters(),
                                     lr=0.0001) for x in self.actors]

        if self.use_cuda:
            for x in self.actors:
                x.cuda()
            for x in self.critics:
                x.cuda()
            for x in self.actors_target:
                x.cuda()
            for x in self.critics_target:
                x.cuda()

        self.steps_done = 0
        self.episode_done = 0

    def update_policy(self):
        # do not train until exploration is enough
        if self.episode_done <= self.episodes_before_train:
            return None, None

        ByteTensor = th.cuda.ByteTensor if self.use_cuda else th.ByteTensor
        FloatTensor = th.cuda.FloatTensor if self.use_cuda else th.FloatTensor

        c_loss = []
        a_loss = []
        for agent in range(self.n_agents):
            transitions = self.memory.sample(self.batch_size)
            batch = Experience(*zip(*transitions))
            non_final_mask = ByteTensor(list(map(lambda s: s is not None,
                                                 batch.next_states)))
            # state_batch: batch_size x n_agents x dim_obs
            state_batch = th.stack(batch.states).type(FloatTensor)
            action_batch = th.stack(batch.actions).type(FloatTensor)
            reward_batch = th.stack(batch.rewards).type(FloatTensor)
            # : (batch_size_non_final) x n_agents x dim_obs
            non_final_next_states = th.stack(
                [s for s in batch.next_states
                 if s is not None]).type(FloatTensor)

            # critic update for the current agent: minimize (Q_i - y)^2
            whole_state = state_batch.view(self.batch_size, -1)
            whole_action = action_batch.view(self.batch_size, -1)
            self.critic_optimizer[agent].zero_grad()
            current_Q = self.critics[agent](whole_state, whole_action)

            # next actions come from every agent's *target* actor
            non_final_next_actions = [
                self.actors_target[i](non_final_next_states[:, i, :])
                for i in range(self.n_agents)]
            non_final_next_actions = th.stack(non_final_next_actions)
            non_final_next_actions = (
                non_final_next_actions.transpose(0, 1).contiguous())

            target_Q = th.zeros(self.batch_size).type(FloatTensor)
            target_Q[non_final_mask] = self.critics_target[agent](
                non_final_next_states.view(-1,
                                           self.n_agents * self.n_states),
                non_final_next_actions.view(-1,
                                            self.n_agents * self.n_actions)
            ).squeeze()
            # scale_reward: to scale reward in Q functions
            target_Q = (target_Q.unsqueeze(1) * self.GAMMA) + (
                reward_batch[:, agent].unsqueeze(1) * scale_reward)

            loss_Q = nn.MSELoss()(current_Q, target_Q.detach())
            loss_Q.backward()
            self.critic_optimizer[agent].step()

            # actor update for the current agent: descend -Q_i with only
            # this agent's action replaced by its current policy's output
            self.actor_optimizer[agent].zero_grad()
            state_i = state_batch[:, agent, :]
            action_i = self.actors[agent](state_i)
            ac = action_batch.clone()
            ac[:, agent, :] = action_i
            whole_action = ac.view(self.batch_size, -1)
            actor_loss = -self.critics[agent](whole_state, whole_action)
            actor_loss = actor_loss.mean()
            actor_loss.backward()
            self.actor_optimizer[agent].step()
            c_loss.append(loss_Q)
            a_loss.append(actor_loss)

        # soft-update the target networks every 100 steps
        if self.steps_done % 100 == 0 and self.steps_done > 0:
            for i in range(self.n_agents):
                soft_update(self.critics_target[i], self.critics[i], self.tau)
                soft_update(self.actors_target[i], self.actors[i], self.tau)

        return c_loss, a_loss

    def select_action(self, state_batch):
        # state_batch: n_agents x state_dim
        actions = th.zeros(self.n_agents, self.n_actions)
        FloatTensor = th.cuda.FloatTensor if self.use_cuda else th.FloatTensor
        for i in range(self.n_agents):
            sb = state_batch[i, :].detach()
            act = self.actors[i](sb.unsqueeze(0)).squeeze()

            # exploration noise: Gaussian with a decaying scale
            # (note: the noise dimension is hard-coded to 2 here)
            act += th.from_numpy(
                np.random.randn(2) * self.var[i]).type(FloatTensor)

            if self.episode_done > self.episodes_before_train and \
               self.var[i] > 0.05:
                self.var[i] *= 0.999998
            act = th.clamp(act, -1.0, 1.0)

            actions[i, :] = act
        self.steps_done += 1

        return actions
```
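Finally, a hypothetical driver loop around the `MADDPG` class above, mirroring the interact-store-sample-train cycle described earlier. The environment object `env`, its `reset()`/`step()` return shapes, the chosen dimensions, and the exact arguments of `ReplayMemory.push` are assumptions for illustration; see the repository for the actual training script.

```python
import torch as th

n_agents, dim_obs, dim_act = 3, 14, 2          # assumed dimensions
maddpg = MADDPG(n_agents, dim_obs, dim_act,
                batch_size=100, capacity=100000,
                episodes_before_train=50)

for episode in range(1000):
    obs = th.from_numpy(env.reset()).float()   # assumed shape: (n_agents, dim_obs)
    for step in range(200):
        actions = maddpg.select_action(obs)               # decentralized execution
        next_obs, rewards, done, _ = env.step(actions.numpy())
        next_obs = th.from_numpy(next_obs).float()
        # store the joint transition (o, a_1..a_N, o', r_1..r_N)
        maddpg.memory.push(obs, actions,
                           None if done else next_obs,
                           th.tensor(rewards).float())
        obs = next_obs
        c_loss, a_loss = maddpg.update_policy()           # centralized training
        if done:
            break
    maddpg.episode_done += 1
```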