Policy-Based Methods


    DEEP REINFORCEMENT LEARNING EXPLAINED — 18

    This is a new post in the “Deep Reinforcement Learning Explained” series, devoted to Policy-Based Methods. Here we will introduce a class of algorithms that allow us to approximate the policy function, π, instead of the value functions (V or Q). That is, instead of training a network that outputs action values, we will train a network to output (the probability of) actions, as we previewed with one example in Post 6.

    Policy-Based Methods

    Reinforcement learning is ultimately about learning an optimal policy, denoted by π*, from interacting with the Environment. So far, we have been looking at value-based methods, where we first find an estimate of the optimal action-value function q*, from which we obtain the optimal policy π*.

    For small state spaces, like the Frozen-Lake example introduced in Post 1, this optimal value function q* can be represented in a table, the Q-table, with one row for each state and one column for each action. At each time step, for a given state, we only need to pull its corresponding row from the table, and the optimal action is just the action with the largest value entry.

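    As a minimal sketch of this lookup (the table shape and the values below are hypothetical, not taken from the Frozen-Lake post), selecting the greedy action is just an argmax over the row of the current state:

    import numpy as np

    q_table = np.zeros((16, 4))            # hypothetical Q-table: 16 states x 4 actions
    q_table[3] = [0.1, 0.7, 0.2, 0.0]      # made-up values for state 3

    state = 3
    row = q_table[state]                   # pull the row corresponding to the state
    best_action = int(np.argmax(row))      # the action with the largest value entry
    print(best_action)                     # -> 1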

    But what about environments with much larger state spaces, like the Pong Environment introduced in the previous Posts? There’s a vast number of possible states, and that would make the table way too big to be useful in practice. So, we presented how to represent the optimal action-value function q* with a neural network. In this case, the neural network is fed with the Environment state as input, and it returns as output the value of each possible action.


    But it is important to note that in both cases, whether we used a table or a neural network, we had to first estimate the optimal action-value function before we could tackle the optimal policy π*. An interesting question then arises: can we directly find the optimal policy without first having to deal with a value function? The answer is yes, and the class of algorithms that accomplishes this is known as policy-based methods.

    With value-based methods, the Agent uses its experience with the Environment to maintain an estimate of the optimal action-value function. The optimal policy is then obtained from this action-value estimate (e.g., using an ϵ-greedy strategy).
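    As a concrete illustration (the value estimates and the epsilon below are made up), deriving an ϵ-greedy policy from an action-value estimate can be sketched as follows:

    import numpy as np

    def epsilon_greedy(q_row, epsilon=0.1):
        # With probability epsilon explore; otherwise exploit the current estimate
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_row))
        return int(np.argmax(q_row))

    # q_row is a hypothetical row of action values for the current state
    action = epsilon_greedy(np.array([0.1, 0.7, 0.2, 0.0]), epsilon=0.1)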

    Value-based methods approach

    Instead, Policy-based methods directly learn the optimal policy from the interactions with the environment, without having to maintain a separate value function estimate.


    Policy-based methods approach

    An example of a policy-based method was already introduced at the beginning of this series, when the cross-entropy method was presented in Post 6. We introduced that a policy, denoted by π(a|s), says which action the Agent should take for every state observed. In practice, the policy is usually represented as a probability distribution over the actions (that the Agent can take at a given state), with the number of classes equal to the number of actions we can carry out. We refer to it as a stochastic policy because it returns a probability distribution over actions rather than a single deterministic action.

    Policy-based methods offer a few advantages over value-prediction methods like the DQN presented in the previous three posts. One is that, as we already discussed, we no longer have to worry about devising an action-selection strategy such as an ϵ-greedy policy; instead, we directly sample actions from the policy. And this is important: remember that we spent a lot of time on fixes to improve the stability of training our DQN. For instance, we had to use experience replay and target networks, and there are several other methods in the academic literature that help. A policy network tends to simplify some of that complexity.

    Policy Function Approximation with a Neural Network

    In Deep Reinforcement Learning, it is common to represent the policy with a neural network (as we did for the first time in Post 6). Let's consider the Cart-Pole balancing problem from Post 12 of this series as an example to introduce how we can represent a policy with a neural network.

    Remember that a cart is positioned on a frictionless track along the horizontal axis in this example, and a pole is anchored to the top of the cart. The objective is to keep the pole from falling over by moving the cart either left or right, and without falling off the track.


    The system is controlled by applying a force of +1 (left) or -1 (right) to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time-step that the pole remains upright, including the episode’s final step. The episode ends when the pole is more than 15 degrees from the vertical, or the cart moves more than 2.4 units from the center.

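    To get a feel for this reward structure, here is a quick illustrative rollout that simply takes random actions, using the same classic Gym API as the rest of this post (the exact return will vary from run to run; this is not part of the method presented below):

    import gym

    env = gym.make('CartPole-v0')
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()          # push the cart left or right at random
        state, reward, done, _ = env.step(action)   # +1 for every step the pole stays up
        total_reward += reward
    print("Return of a random policy:", total_reward)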

    The observation space for this Environment at each time point is an array of 4 numbers. At every time step, you can observe its position, velocity, angle, and angular velocity. These are the observable states of this world. You can look up what each of these numbers represents in this document. Notice the minimum (-Inf) and maximum (Inf) values for both Cart Velocity and the Pole Velocity at Tip. Since the entry in the array corresponding to each of these indices can be any real number, that means the state space is infinite!


    At any state, the cart only has two possible actions: move to the left or move to the right. In other words, the state-space of the Cart-Pole has four dimensions of continuous values, and the action-space has one dimension of two discrete values.

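    We can confirm these dimensions directly from the Environment (the exact printed form depends on the Gym version):

    import gym

    env = gym.make('CartPole-v0')
    print(env.observation_space)        # Box with shape (4,): four continuous values
    print(env.action_space)             # Discrete(2): move left or move right
    print(env.observation_space.low)    # the velocity bounds are effectively unbounded (±Inf in the docs)
    print(env.observation_space.high)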

    We can construct a neural network that approximates the policy that takes a state as input. In this example, the output layer will have two nodes that return, respectively, the probability for each action. In general, if the Environment has discrete action space, as in this example, the output layer has a node for each possible action and contains the probability that the Agent should select each possible action.


    torres.ai

    The way to use the network is that the Agent feeds the current Environment state into it, and then samples from the returned action probabilities (left or right in this case) to select its next action.
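    For illustration, given a vector of action probabilities returned by such a network (the values below are made up), the sampling step can be done with np.random.choice:

    import numpy as np

    probs = np.array([0.7, 0.3])                     # hypothetical network output, one entry per action
    action = np.random.choice(len(probs), p=probs)   # index of the sampled action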

    Then, the objective is to determine appropriate values for the network weights, represented by θ (theta). θ encodes the policy: for each state that we pass into the network, it returns action probabilities under which the optimal action is the most likely to be selected. The actions chosen influence the rewards obtained, which in turn determine the return.

    Remember that the Agent’s goal is always to maximize expected return. In our case, let’s denote the expected return as J. The main idea is that it is possible to write the expected return J as a function of θ. Later we will see how we can express this relationship, J(θ), in a more “mathematical” way to find the values for the weights that maximize the expected return.

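    As a quick preview, using standard notation rather than anything specific to this series, this relationship is usually written as the expected return obtained when trajectories are generated by the policy with weights θ:

    J(θ) = E_{τ ∼ π_θ}[ G(τ) ]

    where τ denotes a trajectory (an episode) produced by following π_θ and G(τ) is its return; maximizing J over θ is exactly the goal described above.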

    Derivative-Free Methods

    In the previous section, we have seen how a neural network can represent a policy. The weights in this neural network are initially set to random values. Then, the Agent updates the weights as it interacts with the Environment. This section gives an overview of one family of approaches we can take to optimize these weights: derivative-free methods, also known as zero-order methods.

    Derivative-free methods directly search the parameter space for the vector of weights that maximizes the return obtained by a policy; they evaluate only some points of the parameter space, without computing any gradients. Let's explain the most straightforward algorithm in this category, Hill Climbing, which will also help us understand later how policy gradient methods work.

    Hill Climbing

    Hill Climbing is an iterative algorithm that can be used to find the weights θ for an optimal policy. It is a relatively simple algorithm that the Agent can use to gradually improve the weights θ in its policy network while interacting with the Environment.


    As the name indicates, intuitively, we can visualize the algorithm as a strategy to reach the highest point of a hill, where θ indicates the coordinates of our current position and G indicates the altitude at that point:

    θ = (θ1, θ2).

    This visual example represents a function of two parameters, but the same idea extends to more than two parameters. The algorithm begins with an initial guess for the value of θ (random set of weights). We collect a single episode with the policy that corresponds to those weights θ and then record the return G.


    This return is an estimate of what the surface looks like at that value of θ. It will not be a perfect estimate, because the return we just collected is unlikely to be equal to the expected return: due to randomness in the Environment (and in the policy, if it is stochastic), collecting a second episode with the same values of θ will likely yield a different value for the return G. In practice, though, even if the sampled return is not a perfect estimate of the expected return, it often turns out to be good enough.

    At each iteration, we slightly perturb the values (add a little bit of random noise) of the current best estimate for the weights θ​, to yield a new set of candidate weights we can try. These new weights are then used to collect an episode. To see how good those new weights are, we’ll use the policy that they give us to again interact with the Environment for an episode and add up the return.


    If the new weights give us more return than our current best estimate, we focus our attention on that new value, and then we just repeat: we iteratively propose new policies in the hope that they outperform the existing one. If they don't, we go back to our last best guess for the optimal policy and iterate until we end up with the optimal policy.

    Now that we have an intuitive understanding of how the hill climbing algorithm should work, we can summarize it in the following pseudocode:


    1. Initialize the policy π with random weights θ

    2. Initialize θbest (our best guess for the weights θ)

    3. Initialize Gbest (the highest return G we have obtained so far)

    4. Collect a single episode with θ, and record the return G

    5. If G > Gbest, then θbest ← θ and Gbest ← G

    6. Add a little bit of random noise to θbest, to get a new set of weights θ

    7. Repeat steps 4–6 until the Environment is solved.

    In our example, we assumed a surface with only one maximum, a situation for which hill climbing is well suited. Note that the algorithm is not guaranteed to always yield the weights of the optimal policy on a surface with more than one local maximum: if it begins in a poor location, it may converge to a lower, local maximum.
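    One simple mitigation, listed later among the possible improvements, is random restarts: run hill climbing several times from different random initializations and keep the best outcome. A minimal sketch, assuming a hypothetical run_hill_climbing callable that starts from fresh random weights and returns (weights, return); this is not the exact hill_climbing() function used below:

    import numpy as np

    def random_restarts(run_hill_climbing, n_restarts=5):
        # Keep the best weights found across several independent hill-climbing runs
        best_weights, best_return = None, -np.inf
        for _ in range(n_restarts):
            weights, ret = run_hill_climbing()
            if ret > best_return:
                best_weights, best_return = weights, ret
        return best_weights, best_return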

    Coding Hill Climbing

    This section explores an implementation of Hill Climbing applied to the Cart-Pole Environment, based on the previous pseudocode. The neural network model here is so simple that it uses only a single weight matrix of shape [4x2] (state_space x action_space) and does not need tensors (no PyTorch required, nor even a GPU).

    The code presented in this section can be found on GitHub (and can be run as a Google Colab notebook using this link).

    Since this code repeats many of the things that we have already been using, from this post on we will not describe it in detail; I think it is quite self-explanatory.

    As always, we will start by importing the required packages and create the Environment:


    import gym
    import numpy as np
    from collections import deque
    import matplotlib.pyplot as plt
    from IPython import display   # used by watch_agent() below to render frames in the notebook

    env = gym.make('CartPole-v0')

    The policy π (and its initialization with random weights θ) can be coded as:


    class Policy():
        def __init__(self, s_size=4, a_size=2):
            # 1. Initialize policy π with random weights
            self.θ = 1e-4 * np.random.rand(s_size, a_size)

        def forward(self, state):
            x = np.dot(state, self.θ)
            return np.exp(x) / sum(np.exp(x))

        def act(self, state):
            probs = self.forward(state)
            action = np.argmax(probs)   # deterministic policy
            return action

    To visualize the effect of training, we print the weights θ before and after training and render how the Agent applies the policy:

    def watch_agent():
        env = gym.make('CartPole-v0')
        state = env.reset()
        rewards = []
        img = plt.imshow(env.render(mode='rgb_array'))
        for t in range(2000):
            action = policy.act(state)
            img.set_data(env.render(mode='rgb_array'))
            plt.axis('off')
            display.display(plt.gcf())
            display.clear_output(wait=True)
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                print("Reward:", sum([r for r in rewards]))
                break
        env.close()

    policy = Policy()
    print("Policy weights θ before train:\n", policy.θ)
    watch_agent()

    Policy weights θ before train:
     [[6.30558674e-06 2.13219853e-05]
     [2.32801200e-05 5.86359967e-05]
     [1.33454380e-05 6.69857175e-05]
     [9.39527443e-05 6.65193884e-05]]
    Reward: 9.0

    Run the code in the Colab notebook and you will be able to see how the Agent handles the cart-pole.

    The following code defines the function that trains the Agent:


    def hill_climbing(n_episodes=10000, gamma=1.0, noise=1e-2):
        """Implementation of hill climbing.

        Params
        ======
            n_episodes (int): maximum number of training episodes
            gamma (float): discount rate
            noise (float): standard deviation of additive noise
        """
        scores_deque = deque(maxlen=100)
        scores = []
        # 2. Initialize θbest (our best guess for the weights θ)
        θbest = policy.θ
        # 3. Initialize Gbest (the highest return obtained so far)
        Gbest = -np.Inf
        for i_episode in range(1, n_episodes+1):
            rewards = []
            state = env.reset()
            # 4. Collect a single episode with θ, and record the return G
            while True:
                action = policy.act(state)
                state, reward, done, _ = env.step(action)
                rewards.append(reward)
                if done:
                    break
            scores_deque.append(sum(rewards))
            scores.append(sum(rewards))
            discounts = [gamma**i for i in range(len(rewards)+1)]
            G = sum([a*b for a, b in zip(discounts, rewards)])
            # 5. If G > Gbest then θbest ← θ and Gbest ← G
            if G >= Gbest:
                Gbest = G
                θbest = policy.θ
            # 6. Add a little bit of random noise to θbest
            policy.θ = θbest + noise * np.random.rand(*policy.θ.shape)
            if i_episode % 10 == 0:
                print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            # 7. Repeat steps 4-6 until the Environment is solved
            if np.mean(scores_deque) >= env.spec.reward_threshold:
                print('Environment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
                policy.θ = θbest
                break
        return scores

    The code does not require much explanation, as it is quite explicit and is annotated with the corresponding pseudocode steps. Still, a few details are worth noting. For instance, the algorithm seeks to maximize the cumulative discounted return, which in Python looks as follows:

    discounts = [gamma**i for i in range(len(rewards)+1)]
    G = sum([a*b for a, b in zip(discounts, rewards)])

    Remember that Hill Climbing is a simple gradient-free algorithm (i.e., we do not use gradient ascent/descent). We try to climb to the top of the curve by changing only the argument of the objective function G, the weight matrix θ that defines the neural network underlying our model, using a certain amount of noise:

    policy.θ = θbest + noise * np.random.rand(*policy.θ.shape)

    As in some previous examples, we consider the Environment solved once we exceed a certain threshold. For CartPole-v0, this threshold score is 195, given by env.spec.reward_threshold. In the run we used to write this post, only 215 episodes were needed to solve the Environment:

    scores = hill_climbing(gamma=0.9)

    Episode 10    Average Score: 59.50
    Episode 20    Average Score: 95.45
    Episode 30    Average Score: 122.37
    Episode 40    Average Score: 134.60
    Episode 50    Average Score: 145.60
    Episode 60    Average Score: 149.38
    Episode 70    Average Score: 154.33
    Episode 80    Average Score: 160.04
    Episode 90    Average Score: 163.56
    Episode 100   Average Score: 166.87
    Episode 110   Average Score: 174.70
    Episode 120   Average Score: 168.54
    Episode 130   Average Score: 170.92
    Episode 140   Average Score: 173.79
    Episode 150   Average Score: 174.83
    Episode 160   Average Score: 178.00
    Episode 170   Average Score: 179.60
    Episode 180   Average Score: 179.58
    Episode 190   Average Score: 180.41
    Episode 200   Average Score: 180.74
    Episode 210   Average Score: 186.96
    Environment solved in 215 episodes!   Average Score: 195.65

    With the following code, we can plot the scores obtained in each episode during training:


    fig = plt.figure()
    plt.plot(np.arange(1, len(scores)+1), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

    Now, after training, we print the weights θ again and render how the Agent applies this policy; it now behaves much more competently:

    print("Policy weights θ after train:\n", policy.θ)
    watch_agent()

    Policy weights θ after train:
     [[0.83126272 0.83426041]
     [0.83710884 0.86015151]
     [0.84691878 0.89171965]
     [0.80911446 0.87010399]]
    Reward: 200.0

    Run the code in the Colab notebook and you will be able to see how the Agent handles the cart-pole.

    Although in this example, we have coded a deterministic policy for simplicity, Policy-based methods can learn either stochastic or deterministic policies, and they can be used to solve Environments with either finite or continuous action spaces.

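    As a sketch of what the stochastic alternative would look like (a hypothetical variant of the Policy class above, not the code used in this post), act would sample from the probabilities instead of taking the argmax:

    class StochasticPolicy(Policy):
        def act(self, state):
            probs = self.forward(state)                    # probability of each action
            return np.random.choice(len(probs), p=probs)   # sample instead of argmax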

    Beyond Hill Climbing

    The Hill Climbing algorithm does not require the objective function to be differentiable or even continuous, but because it takes random steps, it may not find the most efficient path up the hill. The literature offers many improvements to this approach: adaptive noise scaling, steepest-ascent hill climbing, random restarts, simulated annealing, evolution strategies, or the cross-entropy method (which iteratively suggests a small number of neighboring policies and uses a small percentage of the best-performing ones to compute a new estimate, as already presented in Post 6).
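    As an example of the first of these ideas, here is a minimal sketch of adaptive noise scaling (illustrative only; the shrink/grow factor of 2 and the bounds are assumptions, and this is not the code used in this post). The idea is to narrow the search radius when a candidate improves on the best return and to widen it when it does not:

    import numpy as np

    def perturb(theta_best, noise_scale):
        # Propose new candidate weights around the current best estimate
        return theta_best + noise_scale * np.random.rand(*theta_best.shape)

    def update_noise(noise_scale, improved, min_scale=1e-3, max_scale=2.0):
        # Adaptive noise scaling: exploit (shrink) when improving, explore (grow) when stuck
        if improved:
            return max(min_scale, noise_scale / 2)
        return min(max_scale, noise_scale * 2)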

    However, the usual solution to this problem is Policy Gradient Methods, which estimate the weights of an optimal policy through gradient ascent. Policy gradient methods are a subclass of policy-based methods that we will present in the next post.

    Post Summary

    In this post, we introduced the concept of Policy-Based Methods. There are several reasons to consider policy-based methods instead of the value-based methods that, as we saw in the previous posts, work so well. Primarily, policy-based methods get directly to the problem at hand (estimating the optimal policy) without having to store additional data, i.e., the action values, which may not be useful. A further advantage over value-based methods is that policy-based methods are well suited for continuous action spaces. And, as we will see in future posts, unlike value-based methods, policy-based methods can learn truly stochastic policies.

    Source: https://towardsdatascience.com/policy-based-methods-8ae60927a78d
