Beginner's Guide to Natural Language Processing
In this article, we'll try to understand some basic concepts related to Natural Language Processing (NLP). I will be focusing on the theoretical aspects rather than programming practices.
Why should one pre-process text, anyway? Because computers are best at understanding numerical data. So we convert strings into numerical form and then pass this numerical data into models to make them work.
We'll be looking into techniques like tokenization, normalization, stemming, lemmatization, corpus, stop words, part of speech, bag of words, n-grams, and word embeddings. These techniques are enough to make a computer understand text data.
Tokenization is the process of converting long strings of text into smaller pieces, or tokens, hence the name.
Suppose we have a string like, “Tokenize this sentence for the testing purposes.”
In this case, after tokenization the sentence would look like {“Tokenize”, “this”, “sentence”, “for”, “the”, “testing”, “purposes”, “.”}.
This is an example of word tokenization; we can perform character tokenization similarly.
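As a quick sketch of how this might look in Python (assuming NLTK is installed and its tokenizer data has been downloaded), word tokenization can be done like this:

```python
# A minimal word-tokenization sketch using NLTK.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

sentence = "Tokenize this sentence for the testing purposes."
tokens = word_tokenize(sentence)
print(tokens)
# ['Tokenize', 'this', 'sentence', 'for', 'the', 'testing', 'purposes', '.']

# Character tokenization is simply splitting the string into individual characters.
chars = list(sentence)
```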
Normalization is the process of generalizing all words by converting them to the same case, removing punctuation, expanding contractions, or converting words to their equivalents.
Normalization would get rid of the punctuation and case-sensitivity in the aforementioned example, and our sentence would then look like {“tokenize”, “this”, “sentence”, “for”, “the”, “testing”, “purposes”}.
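A minimal sketch of this kind of normalization, using only the Python standard library:

```python
# Lowercase every token and drop punctuation tokens.
import string

tokens = ["Tokenize", "this", "sentence", "for", "the", "testing", "purposes", "."]
normalized = [t.lower() for t in tokens if t not in string.punctuation]
print(normalized)
# ['tokenize', 'this', 'sentence', 'for', 'the', 'testing', 'purposes']
```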
Stemming is the process of removing affixes (suffixes, prefixes, infixes, circumfixes) from a word. For example, running will be converted to run. So after stemming, our sentence would look like {“tokenize”, “this”, “sentence”, “for”, “the”, “test”, “purpose”}.
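A minimal stemming sketch with NLTK's Porter stemmer; note that real stemmers are fairly crude, so their output can differ slightly from the hand-written example above (for instance, “purposes” becomes “purpos”):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "testing", "purposes"]
print([stemmer.stem(w) for w in words])
# ['run', 'test', 'purpos']
```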
Lemmatization is the process of capturing canonical forms based on a word's lemma. In simple terms, for uniformity in the corpus, we use the base forms of words. For example, the word better will be converted to good.
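A minimal lemmatization sketch using NLTK's WordNet lemmatizer (assuming the WordNet corpus has been downloaded):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))   # 'good'  (pos="a" marks an adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
```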
Corpus, Latin for “body”, is a collection of text. It refers to the collection generated from our text data. You might also see corpora in some places, which is the plural form of corpus. It is the dictionary for our NLP models. Computers work with numbers instead of strings, so all these strings are represented in numerical form as follows: {“tokenize”: 1, “this”: 2, “sentence”: 3, “for”: 4, “the”: 5, “test”: 6, “purpose”: 7}
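A minimal sketch of building such a word-to-index mapping from a list of tokens:

```python
# Map each token to a numeric index, starting at 1.
tokens = ["tokenize", "this", "sentence", "for", "the", "test", "purpose"]
vocab = {word: idx for idx, word in enumerate(tokens, start=1)}
print(vocab)
# {'tokenize': 1, 'this': 2, 'sentence': 3, 'for': 4, 'the': 5, 'test': 6, 'purpose': 7}
```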
Some words in a sentence play no part in its context or meaning. These words are called stop words. Before passing data as input, we remove them from the corpus. Stop words include words like “the”, “a”, and “and”, which tend to occur frequently in a sentence.
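A minimal stop-word removal sketch using NLTK's built-in English stop-word list (assuming the stopwords corpus has been downloaded):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stop-word lists
stop_words = set(stopwords.words("english"))

tokens = ["tokenize", "this", "sentence", "for", "the", "testing", "purposes"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# ['tokenize', 'sentence', 'testing', 'purposes']
```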
POS tagging consists of assigning a category tag to the tokenized parts of the sentence, such that all the words fall under one of these categories: nouns, verbs, adjectives, etc. This helps in understanding the role of a word in the sentence.
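A minimal POS-tagging sketch with NLTK (assuming its tagger model has been downloaded; the exact resource name can vary between NLTK versions):

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = ["Tokenize", "this", "sentence", "for", "the", "testing", "purposes"]
print(nltk.pos_tag(tokens))
# A list of (word, tag) pairs, e.g. ('sentence', 'NN'), ('purposes', 'NNS')
```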
Bag of words is a representation of sentences that a machine learning model can understand. Here, the main focus is on the occurrences of words as opposed to their order. So the generated dictionary for our sentence looks like this: {“tokenize”: 1, “this”: 1, “sentence”: 1, “for”: 1, “test”: 1, “purpose”: 1}. There are many limitations to this algorithm.
It fails to convey the meaning of a sentence. As it only focuses on the number of occurrences, words with high occurrence counts dominate the model. We then have to rely on other algorithms to overcome these limitations.
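A minimal bag-of-words sketch using scikit-learn's CountVectorizer (assuming a recent scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "tokenize this sentence for the test purpose",
    "this is another test sentence",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # per-document word counts
```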
Instead of storing the number of occurrences, we can focus on extracting sequences of N consecutive items from the text. This is much more useful for preserving the context of sentences. Here, N can be any number of consecutive words. For example, trigrams contain 3 consecutive words:
{“tokenize this sentence”, “this sentence for”, “sentence for test”, “for test purpose”}
Even for humans, this seems more appropriate, as it conveys information about the order in which words occur.
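A minimal n-gram sketch, building trigrams from a list of tokens with NLTK:

```python
from nltk.util import ngrams

tokens = ["tokenize", "this", "sentence", "for", "test", "purpose"]
trigrams = [" ".join(gram) for gram in ngrams(tokens, 3)]
print(trigrams)
# ['tokenize this sentence', 'this sentence for', 'sentence for test', 'for test purpose']
```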
tf-idf stands for term frequency-inverse document frequency. In this vectorizer, for each word we count how often it appears in a given document (its term frequency) and divide that by how many documents in the corpus contain the word (its document frequency). Loosely, it can be represented as term frequency / document frequency.
This vectorizer works well even without removing stop words, as it gives low importance to words that occur across many documents. NLP vectorizing of text most commonly uses the tf-idf vectorizer.
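A minimal tf-idf sketch using scikit-learn's TfidfVectorizer (again assuming a recent scikit-learn is installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "tokenize this sentence for the test purpose",
    "this is another test sentence",
    "a completely unrelated sentence",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # rows = documents, columns = tf-idf weights per word
```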
Here, we went through most of the terms used for Natural Language Processing in layman's terms. You can try working with these concepts using Python libraries like NLTK and spaCy.
Do check out this article if you’re interested in learning to create a Neural Network with NLP.
For more about programming, follow me and Aubergine Solutions, so you’ll get notified when we write new posts.
Translated from: https://medium.com/aubergine-solutions/beginners-guide-to-natural-language-processing-e4981866cb64