An Introductory Reinforcement Learning Project: Learning Tic-Tac-Toe via Self-Play Tabular Q-Learning

    In this post, I’ll walk through an introductory project on tabular Q-learning. We’ll train a simple RL agent to evaluate tic-tac-toe positions and return the best move, by having it play against itself for many games.

    First, let’s import the required libraries:
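
    Only a handful of libraries are needed here; assuming NumPy for the board and q-table, matplotlib for the plots, and Python’s random module for the exploration policy:

        import random

        import numpy as np
        import matplotlib.pyplot as plt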

    Note that tabular Q-learning only works for environments that can be represented by a reasonable number of actions and states. Tic-tac-toe has 9 squares, each of which can be an X, an O, or empty, so there are at most 3⁹ = 19683 states (and 9 actions, of course). That gives us a table with 19683 x 9 = 177147 cells. This is not small, but it is certainly feasible for tabular Q-learning. In fact, we could exploit the fact that tic-tac-toe is unchanged by rotations of the board: there are far fewer unique states if you consider rotations and reflections of a particular board configuration to be the same. I won’t get into deep Q-learning, because this is intended to be an introductory project.

    Next, we initialize our q-table with the aforementioned shape:
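
    A minimal sketch of that initialization, assuming every estimate simply starts at zero:

        n_states = 3 ** 9      # 19683 possible board encodings
        n_actions = 9          # one action per square
        q_table = np.zeros((n_states, n_actions))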

    Now, let’s set some hyperparameters for training:
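
    The names and values below are illustrative rather than the exact ones from the original notebook; they are the ones assumed in the sketches that follow:

        num_episodes = 500_000      # self-play games to simulate
        learning_rate = 0.1         # alpha in the Q-learning update
        gamma = 0.9                 # discount factor for future reward
        initial_epsilon = 1.0       # start out fully random
        min_epsilon = 0.01          # never stop exploring entirely
        decay_rate = 1e-5           # speed of the epsilon decay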

    Now, we need to set up an exploration strategy. Assuming you understand exploration vs. exploitation in RL, the exploration strategy is the way we will gradually decrease epsilon (the probability of taking a random action). We need to play at least semi-randomly at first in order to properly explore the environment (the possible tic-tac-toe board configurations). But we cannot take random actions forever, because RL is an iterative process that relies on the assumption that the evaluation of future reward gets better over time. If we simply played random games forever, we would be trying to associate a random list of actions with some final game result that has no actual dependency on any particular action we took.
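
    One common schedule, and the one assumed here, is an exponential decay of epsilon toward a small floor:

        def get_epsilon(episode):
            # Decay epsilon from initial_epsilon toward min_epsilon as episodes go by.
            return min_epsilon + (initial_epsilon - min_epsilon) * np.exp(-decay_rate * episode)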

    Now, let’s create a graph of epsilon vs. episodes (number of games simulated) with matplotlib, saving the figure to an image file:
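
    A sketch of that plot, assuming the get_epsilon schedule above (the file name is arbitrary):

        episodes = np.arange(num_episodes)
        plt.plot(episodes, get_epsilon(episodes))   # get_epsilon works element-wise on arrays
        plt.xlabel("Episode")
        plt.ylabel("Epsilon")
        plt.title("Epsilon vs. episodes")
        plt.savefig("epsilon_decay.png")
        plt.close()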

    When we start to simulate games, we need to set some restrictions so that the agent can’t make illegal moves. In tic-tac-toe, occupied squares are no longer available, so we need a function that returns the legal moves for a given board configuration. We will represent the board as a 3x3 NumPy array, where unoccupied squares are 0, X’s are 1, and O’s are -1. We can use NumPy’s np.argwhere to retrieve the indices of the 0 elements.
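
    One way to write that helper is to return flat indices 0–8, so they can directly index a row of the q-table (the exact signature is my assumption):

        def legal_moves(board):
            # Flat indices (0-8) of the empty squares on the 3x3 board.
            return np.argwhere(board.flatten() == 0).flatten()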

    We also need a helper function to convert between a 3x3 board representation and an integer state. We’re storing the future reward estimations in a q-table, so we need to be able to index any particular board configuration with ease. My algorithm for converting a board in the format I previously described into an integer works by repeatedly partitioning the remaining range of possible states into three equal sections, one per possible cell value (see the sketch after the list). For each cell in the board:

    If the cell is -1, you don’t change state

    If the cell is 0, you change state by one-third of the window size

    If the cell is 1, you change state by two-thirds of the window size.
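
    A sketch of that conversion as described above (the helper name is mine):

        def board_to_state(board):
            # Map a 3x3 board of {-1, 0, 1} to a unique integer in [0, 3**9).
            # Each cell narrows the current window of states to a third of its size,
            # which is equivalent to reading the board as a base-3 number.
            state = 0
            window = 3 ** 9
            for cell in board.flatten():
                window //= 3
                state += (cell + 1) * window   # -1 adds nothing, 0 adds window, 1 adds 2 * window
            return int(state)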

    Finally, we need one last helper function to determine when the game has reached a terminal state. This function also needs to return the result of the game if it is indeed over. My implementation checks the rows, columns, and diagonals for three consecutive 1’s or three consecutive -1’s by summing the board array along each axis. This produces three row sums and three column sums. If any of these sums is -3, that line must be all -1’s, indicating that the player corresponding to -1 won, and vice versa for +3. The diagonals work just the same, except there are only 2 diagonals, while there are 3 rows and 3 columns. My original implementation was a bit naive; I found a much better one online that is shorter and slightly faster.
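
    A compact version of that check (not necessarily the exact one the author found online) could look like this:

        def game_result(board):
            # Return 1 or -1 if that player has three in a row, 0 for a draw,
            # and None if the game is still in progress.
            sums = np.concatenate([
                board.sum(axis=0),               # column sums
                board.sum(axis=1),               # row sums
                [np.trace(board)],               # main diagonal
                [np.trace(np.fliplr(board))],    # anti-diagonal
            ])
            if 3 in sums:
                return 1
            if -3 in sums:
                return -1
            if not (board == 0).any():
                return 0
            return None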

    Now, let’s initialize some lists to record training metrics.

    past_results will store the result of each simulated game, with 0 representing a tie, 1 indicating a win for the player corresponding to the positive integer, and -1 a win for the other player.

    win_probs will store a list of percentages, updated after each episode. Each value tells the fraction of games up to the current episode in which either player has won. draw_probs also records percentages, but corresponding to the fraction of games in which a draw occurred.
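
    For reference, this bookkeeping can simply be three Python lists (plus a running draw counter used later in the training loop sketch):

        past_results = []    # -1, 0, or 1 for each finished game
        win_probs = []       # running fraction of games won by either player
        draw_probs = []      # running fraction of games ending in a draw
        draws = 0            # running count of drawn games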

    After training, if we were to graph win_probs and draw_probs, they should demonstrate the following behavior.

    Early in training, the win probability will be high, while the draw probability will be low. This is because when both opponents are taking random actions in a game like tic-tac-toe, there will more often be wins than draws, simply because there are more win states than draw states.

    Mid-way through training, when the agent begins to play according to its table’s policy, the win and draw probabilities will fluctuate with symmetry across the 50% line. Once the agent starts playing competitively against itself, it will encounter more draws, as both sides are playing according to the same strategic policy. Each time the agent discovers a new offensive strategy, there will be a fluctuation in the graph, for the agent is able to trick its opponent (itself) for a short period of time.

    After fluctuating for a while, the draw probability should approach 100%. If the agent were truly playing optimally against itself, it would always encounter a draw, for it is attempting to maximize reward according to a table of expected future rewards… the same table being used by the opponent (itself).

    Let’s write the training script. For each episode, we begin at a non-terminal state: an empty 3x3 board filled with 0’s. At each move, with some probability epsilon, the agent takes a random action from the list of available squares. Otherwise, it looks up the row of the q-table corresponding to the current state and selects the action which maximizes the expected future reward. The integer representation of the new board state is computed, and we record the triple (s, a, s’). We will need to correlate each state-action pair we observe with the final game result, which is yet to be determined while the game is in progress. So, once the game ends, we refer back to each recorded state-action pair and update the corresponding cell of the q-table according to the following:

    Q(s, a) ← Q(s, a) + α · [R(s, a) + γ · max_a' Q(s', a') − Q(s, a)]

    Q-learning update rule

    In the above update formula, s is the integer representation of the state, a is the integer representation of the action the agent took at state s, alpha (α) is the learning rate, gamma (γ) is the discount factor, R(s, a) is the reward (in our case, the end result of the game in which the pair (s, a) was observed), Q is the q-table, and the term involving max represents the maximum expected reward for the resulting state. Say the board configuration was:

    [[0, 0, 0], [0, 0, 0], [0, 0, 1]]

    and we took action 3, corresponding to the cell at coordinate (1, 0); the resulting state would be:

    [[0, 0, 0], [-1, 0, 0], [0, 0, 1]]

    The max term in the update formula refers to the maximum expected reward for any of the actions we could take from here, according to the policy defined by our current q-table. Therefore, s' is the second state I just described, and a' ranges over all of the actions we could theoretically take from this state (0–8), although in reality some are illegal (but this is irrelevant).

    At the end of every 1000 episodes, I save the lists of training metrics along with a plot of them. At the very end, I save the q-table and the lists storing these training metrics.
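
    Putting the pieces together, a condensed sketch of the self-play loop might look like the following. It reuses the helpers sketched above, sets the reward for every recorded pair to the final game result exactly as described, and uses illustrative file names; details such as how the second player’s perspective is handled are my simplifications rather than a line-for-line copy of the original script.

        for episode in range(num_episodes):
            board = np.zeros((3, 3), dtype=int)
            epsilon = get_epsilon(episode)
            player = 1                    # 1 moves first, then -1, alternating
            history = []                  # (s, a, s') triples observed this game
            result = None

            while result is None:
                state = board_to_state(board)
                moves = legal_moves(board)
                if random.random() < epsilon:
                    action = int(random.choice(moves))                      # explore
                else:
                    action = int(moves[np.argmax(q_table[state, moves])])   # exploit
                board[action // 3, action % 3] = player
                history.append((state, action, board_to_state(board)))
                result = game_result(board)
                player = -player

            # Credit every recorded (s, a, s') with the final result of the game.
            for state, action, next_state in history:
                q_table[state, action] += learning_rate * (
                    result + gamma * np.max(q_table[next_state]) - q_table[state, action]
                )

            # Bookkeeping for the win/draw plots described above.
            past_results.append(result)
            draws += int(result == 0)
            draw_probs.append(draws / (episode + 1))
            win_probs.append(1.0 - draw_probs[-1])

            if (episode + 1) % 1000 == 0:
                np.save("training_metrics.npy", np.array([win_probs, draw_probs]))

        np.save("q_table.npy", q_table)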

    Results

    I trained mine with Google Colab’s online GPU, but you can train yours locally if you’d like; you don’t necessarily have to train all the way to convergence to see great results.

    Just as I previously mentioned, the relationship between games terminating in a win/loss and those terminating in a draw should work as follows:

    Earlier in training, an unskilled, randomly-playing agent will frequently encounter win/loss scenarios.

    Each time the agent discovers a new strategy, there will be fluctuations.

    Towards the end of training, near convergence, the agent will almost always encounter a draw, as it is playing optimally against itself.

    Therefore, the larger fluctuations in the graph indicate moments when the agent learned to evaluate a particular board configuration very well, which for a short while allowed it to prevent draws.

    We can see this is clearly demonstrated in the resulting graph:

    Win/Loss-Draw Ratio Throughout Training

    Throughout the middle of training, it frequently appears as if the q-table will converge, only to quickly change entirely. These are the aforementioned moments when a significant strategy was exploited for the first time.

    Also, as you can see, the fluctuations occur more rarely as training progresses. This is because there are fewer yet-to-be-discovered tactics the further along you are. Theoretically, if the agent fully converged, there would never be any more large fluctuations like this: draws would occur 100% of the time, and after the rapid rise in the draw percentage, it would not fall back down again.

    I decided it would be a good idea to visualize the change in the q-values over time, so I retrained the agent while recording the sum of the absolute values of the q-table at each episode. Regardless of whether a particular q-value is positive or negative, recording the sum of all absolute q-values shows us when convergence is occurring (the rate of change of the q-values decreases as we approach convergence).
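
    Recording that quantity is cheap; one extra list and one line inside the training loop are enough (the list name is my own):

        q_sums = []   # sum of |Q| after each episode

        # Inside the training loop, right after the q-table update:
        q_sums.append(float(np.abs(q_table).sum()))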

    Win/Loss-Draw Ratio Throughout Training + Sum of Absolute Values of the Q-table Throughout Training

    You can visit the full code on Google Colab here:

    Or on GitHub here:

    Experimenting with the exploration strategy will influence training. You can change the parameters relating to epsilon, as well as how it decays, to get different results.

    Translated from: https://towardsdatascience.com/an-introductory-reinforcement-learning-project-learning-tic-tac-toe-via-self-play-tabular-b8b845e18fe
