Classifying documents with a convolutional neural network (CNN)


Dealing with text data in deep learning with the help of CNNs and word embeddings.

Table of contents:

What is document classification
Preprocessing
Word Embedding
Keras Embedding layer
Tokenizer API
GloVe: global vectors for word representation
Model Creation
Model Summary

Document classification

Document classification is a machine learning task in which we classify text based on its content.

There are two broad categories of machine learning techniques that can be used for it.

Supervised learning: where we already know the category to which each document belongs. Our model parses the data during training and learns a mapping function from it.

Categories are predefined, and documents within the training dataset are manually tagged with one or more category labels. After training, the model is able to categorize new documents it is given.

Unsupervised learning: where documents have no class labels attached, and we use ML algorithms to cluster documents of the same type.

Refer to the diagram below for a better understanding.

Preprocessing

Let's suppose we have millions of emails and we need to classify which class each of these emails belongs to.

In the real world, the data we are given is never perfect. We need to preprocess it so as to extract maximum knowledge from it, without confusing our model with extraneous information.

Take out the subject line, remove extra details from it, and put it in a DataFrame.

Extract all the email IDs mentioned in the mail and put them in a DataFrame.

Extract the given text data, preprocess it, and put it in a DataFrame.

Combine all of these, and the desired text is ready to feed to our model.

For more information and code, visit my GitHub. The link is at the end of this blog.

Word embedding

Word embedding is a representation of text where words that have the same meaning have a similar representation. In other words, it places words in a coordinate system where related words, based on a corpus of relationships, sit closer together.

It is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to dense vectors of real numbers.

It is an improvement over traditional encoding schemes such as bag-of-words, where each word is represented by a large sparse vector whose size depends on the vocabulary being dealt with.

In contrast, in an embedding, a word is represented by a dense vector that represents its projection into a continuous vector space.

    The position of a word within the vector space is learned from text and is based on the neighboring words.


Keras Embedding layer

Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API, also provided with Keras.

This layer can be represented as follows.

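A minimal sketch, assuming the tf.keras API (the numbers here are illustrative, chosen to match the arguments described below):

```python
from tensorflow.keras.layers import Embedding

# A vocabulary of 50 unique words, 100-dimensional word vectors,
# input sequences of 4 integers each, and frozen (non-trainable) weights.
embedding_layer = Embedding(input_dim=50,
                            output_dim=100,
                            input_length=4,
                            trainable=False)
```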

A few important arguments are:

input_dim: specifies the size of the vocabulary in the text data; input_dim=50, as above, means there are 50 unique words in your data.

output_dim: specifies the size of the word vectors produced as the output of the embedding layer.

input_length: the length of the input sequences. For example, if the maximum length of a sentence in a document is 100, then its input_length is 100.

trainable: specifies whether the embedding layer's weights should be updated during training.

The embedding layer can be used in different ways:

We can use it as part of a deep learning model, so that the layer learns along with the model itself. In this scenario we set the trainable parameter to True.

We can use already pretrained vectors, trained on large datasets, to represent our words. In this scenario we set the trainable parameter to False.

We will focus on how to use pretrained vectors to represent our words and train our complete dataset on top of them.

Let us take an example to understand this more deeply.

Suppose we have a dataset that contains a few remarks, and we need to classify which class each remark belongs to.

A label of 1 signifies that the remark is good, whereas 0 signifies that it is bad.

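The original post shows the given set of data as an image; a small hypothetical stand-in with the same shape might look like this:

```python
# Hypothetical remarks labelled 1 (good) or 0 (bad).
docs = ['Well done', 'Good work', 'Great effort', 'Nice work', 'Excellent',
        'Weak', 'Poor effort', 'Not good', 'Poor work', 'Very bad']
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```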

The Keras deep learning library provides some basic tools to help us prepare our text data. Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models.

For this purpose, we use the Tokenizer API.

Tokenizer API

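A minimal sketch of fitting the Tokenizer on the hypothetical docs above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Build the vocabulary and assign a unique integer index to each word.
t = Tokenizer()
t.fit_on_texts(docs)
print(t.word_index)
```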

Sometimes we want some special punctuation to be part of our analysis; in that case, we can specify only the filters (characters) we want removed.

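For example, an illustrative filter set that keeps '!' as part of the tokens by dropping it from Keras's default filter string (a separate tokenizer is used here so the one above is unaffected):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Default filters minus '!', so exclamation marks survive tokenization.
t2 = Tokenizer(filters='"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
t2.fit_on_texts(['Well done!', 'Poor effort!'])
print(t2.word_index)  # 'done!' and 'effort!' keep their punctuation
```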

Our vocabulary size is the total number of unique words present in the dataset, and the Tokenizer represents each word with a unique integer.


So the total vocab size in this case will be 13.

Now we need to encode our complete data in integer form using texts_to_sequences, and then apply padding so that all sequences have the same length.

We can pad our data in two forms, as shown in the sketch after this list:

Padding post: zeros are appended at the end of each sequence.
Padding pre: zeros are prepended at the beginning (the Keras default).
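Continuing with the tokenizer fitted above (maxlen=4 is illustrative; choose it to cover the longest document):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encode each document as a sequence of word indices.
encoded_docs = t.texts_to_sequences(docs)

# Post-padding: zeros appended after each sequence.
padded_post = pad_sequences(encoded_docs, maxlen=4, padding='post')

# Pre-padding (the default): zeros prepended before each sequence.
padded_pre = pad_sequences(encoded_docs, maxlen=4, padding='pre')
print(padded_post)
```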

Now that our words are ready, let's discuss pretrained vectors for word representation and where we can download them from.

GloVe: global vectors for word representation

GloVe stands for global vectors for word representation. It is an unsupervised learning algorithm developed at Stanford for generating word embeddings by aggregating a global word-word co-occurrence matrix from a corpus.

We can download it and seed the Keras Embedding layer with weights from the pretrained embedding for the words in our training dataset.

    GloVe: Global Vectors for Word Representation


We can download any of the zip files depending on the requirement. After unzipping, the file I used is "glove.6B.100d.txt".

If we look inside this file, each word has a vector associated with it.

Each line of the file contains a word followed by the components of its vector.

Now we need to load this entire file and fetch each word and its corresponding vector representation.

Once this is done, we need to create the embedding matrix for the vocabulary we built from the training dataset; each word's vector representation is taken from the GloVe file.

Model Creation

Once our embedding matrix is ready, we use the predefined Keras Embedding layer and pass our embedding matrix as the weights parameter.

This embedding layer comes just after the input layer. We can add any number of layers after it, as per our model's requirements.


Now we can fit the model using the fit method.

Then we can evaluate it and validate it on our test data.

For more details, visit my GitHub.

References:

Translated from: https://medium.com/swlh/classification-of-documents-using-convolutional-neural-network-cnn-e0768bb81aad
