BERT vs. Simple Logistic Regression for Natural Language Processing


    BERT is a very interesting multilayer deep learning model that is currently considered the state of the art for natural language processing. It has been pre-trained on Wikipedia and BooksCorpus, so it does a good job on many natural language processing tasks. The special thing about this model is that it provides a very rich representation of words, one that captures the context each word was used in. This is really important in NLP applications where the context can change what a word refers to (e.g. in the phrase “I’m so full that my stomach is about to explode”, the word “explode” would usually refer to a dangerous situation, but in that context it’s just a metaphor).


    BERT is able to gain these insights because it is a deeply bidirectional model. Bidirectional means that it learns information from both the left and the right side of a token’s context during the training phase. To learn the context of a sentence, BERT uses two main training objectives:


    Masked Language Modeling: Masked language modeling means predicting words that have been hidden in a sequence, using the surrounding words as context (traditional language models, by contrast, only predict the next word from the words that come before it). The way MLM works in BERT is that before word sequences are fed into BERT, 15% of the words get masked (hidden). The model then, as part of its training, attempts to predict the masked words (the token values of the hidden words) using the context provided by the other, non-masked words.

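    To make the masking step concrete, here is a minimal sketch of how roughly 15% of the tokens in a sequence could be hidden. This is my own illustration, not BERT’s actual implementation (which also sometimes keeps the original token or swaps in a random one), and the token list is made up:

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        # Randomly hide ~15% of the tokens; the model's job is to predict the originals.
        masked = list(tokens)
        targets = {}
        for i in range(len(masked)):
            if random.random() < mask_prob:
                targets[i] = masked[i]   # remember the original word as the training label
                masked[i] = mask_token
        return masked, targets

    tokens = "my stomach is about to explode".split()
    masked_tokens, labels = mask_tokens(tokens)
    print(masked_tokens)  # e.g. ['my', 'stomach', 'is', '[MASK]', 'to', 'explode']
    print(labels)         # e.g. {3: 'about'}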

    Next Sentence Prediction: The model receives pairs of sentences and tries to predict whether the second sentence follows the first one in the original dataset. During training, 50% of the input pairs are real consecutive sentence pairs from the dataset, while the other 50% are random pairs. For the model to be able to do that, the training data needs some processing first:


    First, a token “CLS” (which stands for classification) is inserted at the beginning of the input and a token “SEP” (which stands for separation) is inserted at the end of each sentence. These are used to mark where the sentences begin and end.

    A sentence embedding is also added to each token to indicate whether it belongs to sentence A or sentence B.

    A positional embedding is also added to each token to indicate its position in the sequence (a small example of this input format is sketched below).
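    Here is a minimal sketch of what this input format looks like in practice. It is my own illustration rather than part of BERT’s training code, and it assumes a recent version of the Hugging Face transformers library, where the tokenizer can be called directly on a sentence pair:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Encoding a pair of sentences: the tokenizer inserts [CLS] at the start and [SEP] after
    # each sentence, and token_type_ids plays the role of the sentence A / sentence B embedding.
    encoded = tokenizer("I am so full.", "My stomach is about to explode.")

    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    # ['[CLS]', 'i', 'am', 'so', 'full', '.', '[SEP]', 'my', 'stomach', ..., '[SEP]']
    print(encoded["token_type_ids"])  # 0 for sentence A tokens, 1 for sentence B tokens

    The positional embeddings are added inside the model itself, so they do not show up in the tokenizer output.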

    BERT has been trained on very large datasets of sentences and words using the two training objectives mentioned above, and the pre-trained model has been released as open source, so it can be imported and fine-tuned if necessary.


    After the model is trained, the input to the model should be processed in the following way so that the model can use it:


    First, the sentence gets split into tokens.

    Second, “CLS” and “SEP” get added to the list of tokens (“CLS” at the beginning and “SEP” at the end); these tell the model where the sentence starts and ends.

    Third, the tokenizer of the model replaces each token with its unique ID.

    Fourth, all token vectors should be padded to the same length (a quick sketch of these four steps follows this list).

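    To make these four steps concrete, here is a small sketch (my own illustration, not from the original notebook) using the distilbert-base-uncased tokenizer that gets loaded later in the post; the example sentence and the padding length of 16 are arbitrary:

    from transformers import DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

    sentence = "Forest fire near La Ronge"

    tokens = tokenizer.tokenize(sentence)          # 1. split the sentence into tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]        # 2. add the special tokens
    ids = tokenizer.convert_tokens_to_ids(tokens)  # 3. replace each token with its unique ID
    max_len = 16                                   # 4. pad every vector to the same length
    ids = ids + [0] * (max_len - len(ids))

    print(ids)  # e.g. [101, ..., 102, 0, 0, ...] -- 101/102 are the [CLS]/[SEP] IDs, 0 is the padding ID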

    Now we can see how BERT performs on an NLP dataset that is context-sensitive. I’m going to work on a dataset provided by Kaggle. The dataset contains 10,000 different tweets recorded from real accounts on Twitter. Some of these tweets talk about real disasters that happened in the past; others don’t talk about disasters at all but use words that are often used in the context of disasters. For example, the following are samples from the dataset which talk about disasters:


    And the following are examples of tweets that are not about disasters:


    We can see that some of these are completely irrelevant to disasters, while others use words like “wrecked”, “explode”, and “bang” that might refer to disasters but actually don’t in this context.


    Let’s check the distribution of the two classes in the data (fake vs. real):


    sns.set()
    real_disasters = len(kaggle_train[kaggle_train['target'] == 1])
    fake_disasters = len(kaggle_train[kaggle_train['target'] == 0])
    all_tweets = len(kaggle_train)

    plt.bar(["Real", "Fake"], [real_disasters, fake_disasters])
    print(f"Real Percentage: {real_disasters/all_tweets} and Fake Percentage: {fake_disasters/all_tweets}")

    The figure shows that the data is slightly imbalanced, with about 57% of the tweets being fake and 43% being real. This should be kept in mind when evaluating the accuracy of the model, because a 57% accuracy might just mean that the model always classifies tweets as fake, which would be very misleading. For this reason, I’m going to evaluate my model using two metrics: accuracy and F1 score. The F1 score considers both the precision and the recall; it is the harmonic mean of the two. Using these two metrics, I will be able to tell whether my model is improving as I fine-tune its parameters.

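    To see why accuracy alone could be misleading here, consider a tiny made-up example of a lazy classifier that always predicts “fake” on imbalanced labels (the label vectors below are invented for illustration, not taken from the dataset):

    from sklearn.metrics import accuracy_score, f1_score

    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 60% fake (0), 40% real (1), roughly like this dataset
    y_pred = [0] * 10                        # a model that always predicts "fake"

    print(accuracy_score(y_true, y_pred))  # 0.6 -- looks respectable
    print(f1_score(y_true, y_pred))        # 0.0 -- no real tweet was ever detected (sklearn also warns that precision is undefined)

    # F1 = 2 * (precision * recall) / (precision + recall), i.e. the harmonic mean of the two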

    Also, note that the labels “fake” and “real” are a bit of a misnomer: the tweets that don’t talk about disasters are not trying to fake news about disasters, as the name suggests; they just use words that might sound disaster-related. I’m going to keep the same naming convention used by the producers of the dataset.


    Before I start using BERT on this dataset, I need a baseline model that I can compare BERT to. Logistic regression is one of the simplest and easiest models to implement, so I will use it as my baseline model.


    First, I’m going to start by cleaning the data. From printing a few lines of the data, I saw that some tweets contain emojis, HTML, and other artifacts that would make it hard to build a model. I’m going to clean the data by doing the following:


    Remove emojis from every text

    Remove all punctuation

    Make all text lower case

    Remove all stopwords

    Change contractions into a more formal format (e.g. “what’s” becomes “what is”)

    Remove all symbols that are not alphabetical (e.g. “+”, “-”, “=”, etc.)

    Remove infrequent words (words that show up fewer than two times).

    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem import SnowballStemmer

    def clean_text(text):
        # Remove emojis
        emoji_pattern = re.compile("["
            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
            u"\U00002702-\U000027B0"
            u"\U000024C2-\U0001F251"
            "]+", flags=re.UNICODE)
        text = emoji_pattern.sub(r'', text)

        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        # Convert words to lower case and split them
        text = text.lower().split()

        # Remove stop words and very short words
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops and len(w) >= 3]
        text = " ".join(text)

        # Clean the text
        text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
        text = re.sub(r"what's", "what is ", text)
        text = re.sub(r"\'s", " ", text)
        text = re.sub(r"\'ve", " have ", text)
        text = re.sub(r"n't", " not ", text)
        text = re.sub(r"i'm", "i am ", text)
        text = re.sub(r"\'re", " are ", text)
        text = re.sub(r"\'d", " would ", text)
        text = re.sub(r"\'ll", " will ", text)
        text = re.sub(r",", " ", text)
        text = re.sub(r"\.", " ", text)
        text = re.sub(r"!", " ! ", text)
        text = re.sub(r"\/", " ", text)
        text = re.sub(r"\^", " ^ ", text)
        text = re.sub(r"\+", " + ", text)
        text = re.sub(r"\-", " - ", text)
        text = re.sub(r"\=", " = ", text)
        text = re.sub(r"'", " ", text)
        text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
        text = re.sub(r":", " : ", text)
        text = re.sub(r" e g ", " eg ", text)
        text = re.sub(r" b g ", " bg ", text)
        text = re.sub(r" u s ", " american ", text)
        text = re.sub(r"\0s", "0", text)
        text = re.sub(r" 9 11 ", "911", text)
        text = re.sub(r"e - mail", "email", text)
        text = re.sub(r"j k", "jk", text)
        text = re.sub(r"\s{2,}", " ", text)

        # Stem the remaining words
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)

        return text

    Now, to make the data suitable for logistic regression, I will use the simplest natural language processing technique: turning the texts into a bag of words and vectorizing them. I split the training data into train and test sets and used cross-validation with a logistic regression model to measure performance on the training data:


    x_train, x_test, y_train, y_test = train_test_split(kaggle_train['text'], kaggle_train['target'], test_size=0.2, random_state=1)

    # I'm going to use cross validation to get unbiased performance of the model on the training data
    count_vectorizer = feature_extraction.text.CountVectorizer()  # Using the common bag of words technique
    train_vectors = count_vectorizer.fit_transform(x_train)

    baseline_model = linear_model.LogisticRegression()
    f1_scores = model_selection.cross_val_score(baseline_model, train_vectors, y_train, cv=3, scoring="f1")
    accuracy_scores = model_selection.cross_val_score(baseline_model, train_vectors, y_train, cv=3, scoring="accuracy")
    print("Cross Validation Accuracy Scores:", accuracy_scores)
    print("Cross Validation F1 Scores:", f1_scores)

    The logistic regression seems to do well. Let’s test it on the test data:


    baseline_model.fit(train_vectors, y_train)

    # Getting the scores on the test set:
    test_vectors = count_vectorizer.transform(x_test)
    baseline_predict_test = baseline_model.predict(test_vectors)
    print("Accuracy:", accuracy_score(y_test, baseline_predict_test))
    print("F1_score:", f1_score(y_test, baseline_predict_test))

    I see that the F1 score is always lower than the accuracy, and I think this is because the data is slightly imbalanced.


    Now, I’m going to use BERT and compare it to that baseline.


    Google has trained BERT on a large dataset of sentences. If I import that model and use it to extract richer features for my data, I will then be able to use any simple classification model to do classification.


    # Importing a pretrained BERT model so that I can use it to extract richer features from my dataset:
    import transformers as ppb

    model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

    # Load the pretrained model/tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

    # Tokenizing the data
    tokenized = kaggle_train['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

    The BERT model only works when all vectors have the same length. Because some tweets are longer than others, I have to make all vectors the same length by appending zeroes to the shorter ones (this shouldn’t affect the features of the sentence).


    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)

    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])  # padding with zeroes

    Now, I build the attention mask. It marks which positions contain real tokens (1) and which are just padding (0), so the model can ignore the padding (note that this is different from the 15% masking used during pre-training):


    attention_mask = np.where(padded != 0, 1, 0)
    attention_mask.shape

    Now, I pass the data to the model to extract the richer features (I will technically take the output of the last layer):


    input_ids = torch.tensor(padded)  # .to(torch.int64)
    attention_mask = torch.tensor(attention_mask)

    with torch.no_grad():
        last_hidden_states = model(input_ids, attention_mask=attention_mask)

    I will only take the CLS vector (its index is 0) because it represents the entire sentence:


    features = last_hidden_states[0][:, 0, :].numpy()
    labels = kaggle_train['target']

    So using BERT, I was able to obtain a much richer representation of the dataset that accounts for the context. Now, I can apply any simple classification model on these features. I will go with SVC:


    train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

    bert_svc = SVC()
    bert_svc.fit(train_features, train_labels)

    Now, checking the performance of the model on the test data:


    # Checking the performance of the model on the test set:
    bert_svc_test_predict = bert_svc.predict(test_features)
    print("BERT+SVC Accuracy Score:", accuracy_score(test_labels, bert_svc_test_predict))
    print("BERT+SVC F1 Score:", f1_score(test_labels, bert_svc_test_predict))

    Although this score isn’t too bad, I think the model will perform much better if I fine-tune the parameters of BERT instead of just using the pre-trained model to extract features. The last layer of the BERT model outputs a vector representation for each token (word) that was passed in. The first token in every sentence is the “CLS” token, as I described above. Its vector is a representation of the entire sentence, and that is the one that should be used to classify each sentence. So I’m going to add one last layer on top of the BERT model. This layer will take only that “CLS” vector and classify it as either 0 or 1, so I’m going to use a sigmoid activation function in the layer (with 1 output) and a threshold of 0.5, meaning the output of the last layer gets rounded to give the classification. I will split my data into train and test sets and then train the model on the train data to fine-tune its parameters.


    # I will use the tokenization script provided by google:
    !wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
    import tokenization

    def text_encoder(texts, tokenizer, max_len=512):
        # This function takes the texts and the tokenizer provided by google and turns them into the
        # three inputs BERT expects: token ids, attention masks, and segment ids
        all_tokens = []    # token ids for each tweet
        all_masks = []     # attention masks: 1 for real tokens, 0 for padding
        all_segments = []  # segment ids (all zeros here, since each tweet is treated as a single sentence)
        for text in texts:
            text = tokenizer.tokenize(text)  # Turning the text into tokens
            text = text[:max_len - 2]        # Truncating so there is room for "[CLS]" and "[SEP]"
            input_sequence = ["[CLS]"] + text + ["[SEP]"]  # "CLS": classification token, "SEP": sentence separation
            # These tell the model where the sentence starts and ends
            pad_len = max_len - len(input_sequence)  # BERT takes a padded array as input
            tokens = tokenizer.convert_tokens_to_ids(input_sequence)  # turning each token into its ID in the BERT vocabulary
            # All token vectors should have the same length, so I'm adding 0s to the shorter vectors.
            # Adding 0 doesn't affect the features because 0 is not the ID of any word in the BERT vocabulary
            tokens += [0] * pad_len
            pad_masks = [1] * len(input_sequence) + [0] * pad_len
            segment_ids = [0] * max_len
            all_tokens.append(tokens)
            all_masks.append(pad_masks)
            all_segments.append(segment_ids)
        return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

    def build_model(bert_layer, max_len=512):
        # The BERT model is not mainly designed for classification, so I will add one layer on top to classify
        input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")  # the token ids
        input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")          # the attention masks (real tokens vs padding)
        segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")        # the segment ids
        _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])  # extracting the features
        clf_output = sequence_output[:, 0, :]
        # BERT returns one vector per token; the vector at index 0 corresponds to the "[CLS]" token and
        # represents the whole sentence, so I'm building the classification on that vector and ignoring the rest
        out = Dense(1, activation='sigmoid')(clf_output)  # Since I have only two classes, one sigmoid unit does the classification
        # Putting everything together:
        model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
        model.compile(Adam(lr=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
        # I'm using the same parameters used in the original paper: the Adam optimizer and binary cross-entropy,
        # which works on maximizing the likelihood
        return model

    #%%time
    # loading the pretrained model:
    module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
    bert_layer = hub.KerasLayer(module_url, trainable=True)

    vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
    do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
    tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

    # processing the data using the functions above
    train_input = text_encoder(x_train, tokenizer, max_len=160)
    test_input = text_encoder(x_test, tokenizer, max_len=160)
    train_labels = y_train

    # finally training the model
    model = build_model(bert_layer, max_len=160)
    model.fit(train_input, train_labels, validation_split=0.2, epochs=3, batch_size=16)

    I can now use the model:


    y = model.predict(test_input)
    # Sigmoid returns a number between 0 and 1, so I'm rounding with a threshold of 0.5
    print("Accuracy score of fine tuned BERT:", accuracy_score(y_test, y.round().astype(int)))
    print("F1 score of fine tuned BERT:", f1_score(y_test, y.round().astype(int)))

    Finally, the model performance improved and surpassed the logistic regression. This is expected, since this model captures the context and should do a good job at these kinds of tasks that require understanding context. I think it could improve even more if I ran it for more epochs; I had to limit it to 3 epochs because it was taking too much time.


    Now I can finally take the test file provided by Kaggle and see how the model performs on it:

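    The submission code isn’t reproduced in the post, so here is a minimal sketch of what that last step could look like. It assumes the competition’s test.csv and sample_submission.csv files and simply reuses the text_encoder, tokenizer, and model defined above:

    import pandas as pd

    # Assumed Kaggle competition files
    kaggle_test = pd.read_csv("test.csv")
    submission = pd.read_csv("sample_submission.csv")

    # Encode the test tweets the same way as the training data and predict
    kaggle_test_input = text_encoder(kaggle_test['text'], tokenizer, max_len=160)
    test_pred = model.predict(kaggle_test_input)

    submission['target'] = test_pred.round().astype(int)
    submission.to_csv("submission.csv", index=False)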

    The model scored 0.82719. There are much better scores than that, and I think they can be reached by fine-tuning the model for a few more epochs.


    You can find the notebook here.


    Translated from: https://medium.com/swlh/bert-state-of-the-art-vs-simple-logistic-regression-for-natural-langauge-processing-c4ddd7428207
