Masking with WordCloud in Python: The 500 Most Frequently Used Words in German

Technology, 2022-08-03

A word cloud is a visual representation of text data. It displays a list of words, with the importance of each indicated by font size or color, so the words that deserve emphasis in a text are easy to spot.

A summary of the important parameters:

background_color: the background color on which the words are displayed. Many colors are available, but white and black are the most common choices. After all, we use a WordCloud for better readability.

colormap: the color scheme applied to the words; in effect, a color palette to draw from. Matplotlib offers many nice colormaps, and you can pick one to your taste from the source below.

Source: https://www.codecademy.com/articles/seaborn-design-ii

collocations: controls whether two-word phrases (bigrams) are included when the cloud is generated from raw text. Set it to False to keep the same word from appearing both on its own and inside a pair, a kind of duplication that is easy to overlook.

width/height: the dimensions of the WordCloud canvas in pixels. You can set them to any width and height you prefer.

plt.figure(figsize=(X, Y)): to make the image clearer and more pleasant, add plt.figure(figsize=(X, Y)), where X and Y are the figure width and height in inches. Keep in mind that the image is stretched to fit the figure, so the default width and height can produce a blurry image.

Well then, we can start our project now:

Finding and creating the data set

We are looking for the 500 most common German words used in daily life. We can find this information at: https://www.thegermanprofessor.com/top-500-german-words/

However, since the data is not available as an xml or csv file, we will create it manually. Let's build the rows and columns ourselves in a new Excel file and save it:

Data optimization

In the word list, we recorded each word's ranking as the value next to it. Now we can clear the additional information next to the words and then move on to the Python part.

1. Importing the libraries we will use:

import os
from os import path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Jupyter notebook magic; omit this line in a plain Python script.
%matplotlib inline
print("Imported libraries")

2. We will perform a series of operations to work with the Excel file

a) Assign the Excel file name to "file":

file = "german_w.xlsx"

b) Load the Excel file:

xl = pd.ExcelFile(file)

c) Print the sheet names:

print(xl.sheet_names)
# Output: ['Sheet1']

d) Load this sheet into a DataFrame named german_words:

german_words = xl.parse("Sheet1")

3. We can now preview the data with head():

4. Let's check whether our list contains any NaN values:
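A minimal sketch of steps 3 and 4, using a small stand-in DataFrame in place of the real Excel sheet. The column name word is an assumption for illustration; ranking matches the column referred to later in the article.

```python
import pandas as pd

# Stand-in for the sheet loaded from german_w.xlsx; the column name
# "word" is an assumption for illustration.
german_words = pd.DataFrame(
    {"word": ["der", "und", "sein", "in", "ein"], "ranking": [1, 2, 3, 4, 5]}
)

# Step 3: preview the first rows.
print(german_words.head())

# Step 4: count missing values per column; all zeros means nothing to fill in.
nan_counts = german_words.isnull().sum()
print(nan_counts)
```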

Since there are no missing values to fill in, we can proceed with creating the WordCloud.

5. Counting the requested values

At this stage, our work would be easier if the data set were a plain txt file. With csv or xml / xlsx files, some manual handling is needed. Depending on what we want the WordCloud to be generated from, we can do one of the following:

First way: use value_counts().index to collect the distinct values in the rows of the column we want. For example:

wordcloud = WordCloud(background_color="white", width=1024, height=1024).generate(
    " ".join(data.column_name.value_counts().index)
)
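To see what this expression passes to generate, here is a small sketch with a hypothetical data DataFrame and a hypothetical column called column_name. Note that value_counts().index holds each distinct value once, so the joined string keeps the vocabulary but not the counts.

```python
import pandas as pd

# Hypothetical data: one value per row, with repeats.
data = pd.DataFrame(
    {"column_name": ["Hund", "Katze", "Hund", "Vogel", "Hund", "Katze"]}
)

# Distinct values with their counts, most frequent first.
counts = data.column_name.value_counts()
print(counts)

# The index lists each distinct value exactly once.
text = " ".join(counts.index)
print(text)
```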

For an example of this approach, you can examine the following project:

Second way: use WordCloud.generate_from_frequencies to pass precomputed word frequencies directly. The WordCloud is then sized according to the value paired with each row in the words column. It makes sense here to create tuples first:

tuples = [tuple(x) for x in german_words.values]
wordcloud = WordCloud(
    background_color="black",
    width=3000,
    height=2000,
    max_words=300,
    random_state=1,
    colormap="Set2",
    collocations=False,
).generate_from_frequencies(dict(tuples))
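To make the data shape explicit: each row of the two-column DataFrame becomes a (word, value) tuple, and dict(tuples) turns those into the word-to-weight mapping that generate_from_frequencies expects. A sketch on a stand-in DataFrame (the column name word is an assumption):

```python
import pandas as pd

# Stand-in for the real sheet; "word" is an assumed column name.
german_words = pd.DataFrame(
    {"word": ["der", "und", "sein"], "ranking": [1, 2, 3]}
)

# One tuple per row: (word, value).
tuples = [tuple(x) for x in german_words.values]
frequencies = dict(tuples)
print(frequencies)

# An equivalent construction that names the columns explicitly.
same = dict(zip(german_words["word"], german_words["ranking"]))
```

The zip form is arguably safer when the DataFrame might gain extra columns, since dict(tuples) only works for exactly two columns.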

In our example, we will use the second way because there is exactly one word on each line. The output then looks like this:

6. Obtaining the requested data

However, as can be seen, the word "Spiel", which sits in 500th place in the ranking column, is displayed largest, because generate_from_frequencies treats the larger number as the more frequent word. We need the opposite order. We can manually add a new column named "frequency of use" in Excel whose values run inversely to the ranking.
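The article adds this column by hand in Excel. As an alternative sketch, the same inversion can be done in pandas. The formula below (highest rank plus one, minus the rank) is an assumption, chosen so that rank 1 receives the largest weight; the column name word is also assumed.

```python
import pandas as pd

# Stand-in for the real sheet; "word" is an assumed column name.
german_words = pd.DataFrame(
    {"word": ["der", "und", "sein", "Spiel"], "ranking": [1, 2, 3, 500]}
)

# Invert the ranking: the best rank (1) gets the largest weight,
# the worst rank gets weight 1.
max_rank = german_words["ranking"].max()
german_words["frequency of use"] = max_rank + 1 - german_words["ranking"]

print(german_words)
```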

Now, let's import the regenerated Excel file (with the frequency of use column added) as we did in the beginning. Then, let's get rid of the ranking column, which we will not use, with drop() as below.

german_words.drop("ranking", axis=1, inplace=True)
german_words

As we have emphasized before, each word exists exactly once in the list. Counting the values for 20 words confirms this:

german_words.value_counts()[0:20]

7. Increase resolution with a true word cloud

As a result, we decided to use the frequency value as the count. We set max_words=500 here because we want to see all of the words. And here is the WordCloud for the 500 most frequently used words in German:

Yes, "Spiel" and the like became ghosts after the new setting :)

Now that the data is in the shape we want, we can move on to making the mask. First of all, I recommend that you click the link below and examine the examples.

Mask your wordcloud into any shape of your choice

Before starting the masking process, the first step is to check whether our picture is suitable for masking.

Source for goethe_mask: https://de.wikipedia.org/wiki/Datei:Silhouette_of_Johann_Wolfgang_von_Goethe.svg


We can understand this better when we look at the image values with np.array:

As can be seen, the background values are all 255, so our background is already pure white. WordCloud leaves pure-white mask pixels empty and draws words only in the remaining area, which means that we do not actually need to transform this image.
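A sketch of this inspection. It builds a small synthetic silhouette in place of the Goethe image (an assumption, so that the example needs no file); with a real file, the array would come from np.array(Image.open(...)). The to_white helper is hypothetical and illustrates the kind of transformation that would be needed if a background were stored as 0 rather than 255, since WordCloud leaves only pure-white mask pixels empty.

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for the silhouette: white background (255) with a
# black rectangle (0) as the shape.
img = Image.new("L", (300, 200), color=255)
img.paste(0, (100, 50, 200, 150))

goethe_mask = np.array(img)
print(goethe_mask.shape, goethe_mask.dtype)
print(np.unique(goethe_mask))  # distinct pixel values in the mask

# The corner pixel belongs to the background; 255 means pure white.
background_is_white = goethe_mask[0, 0] == 255


def to_white(mask):
    """Hypothetical helper: map the value 0 to 255 and keep everything else."""
    return np.where(mask == 0, 255, mask).astype(np.uint8)
```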

wc = WordCloud(
    background_color="black",
    collocations=False,
    relative_scaling=1,
    random_state=1,
    mask=goethe_mask,
    width=3000,
    height=2000,
    colormap="Set3",
).generate_from_frequencies(dict(tuples))
plt.figure(figsize=[15, 10])
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

And here is the famous German legend, genius and philosopher J.W. Goethe:

Johann Wolfgang von Goethe (mask by author)

Mask your wordcloud into any color pattern of your choice

Now let's show the WordCloud on the German flag and finish the article. We want the colors of the German flag to determine the colors in the WordCloud. For this we will use ImageColorGenerator.

Thus, our project has come to an end. Along the way, we covered:

Creating Excel data and visualizing it with pandas ✔️

Deleting columns for data optimization, etc. ✔️

Taking the data we want to count from the existing data and building frequencies for the word cloud ✔️

Creating a WordCloud with masking ✔️

Creating a WordCloud with the color palette of an image ✔️

Translated from: https://medium.com/swlh/masking-with-wordcloud-in-python-500-most-frequently-used-words-in-german-c0e865e911bb
