Finetune DistilBERT for a multi-label text classification task
In one of my previous blog posts, How to fine-tune BERT on a text classification task, I explained fine-tuning BERT for a multi-class text classification task. In this post, I will explain how to fine-tune DistilBERT for a multi-label text classification task. I have also made a GitHub repo containing the complete code, which is explained below. You can visit the link below to see it, fork it, and use it.
https://github.com/DhavalTaunk08/Transformers_scripts
The DistilBERT model (https://arxiv.org/pdf/1910.01108.pdf) was released by Huggingface.co; it is a distilled version of BERT, which was released by Google (https://arxiv.org/pdf/1810.04805.pdf).
According to the authors:
They leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.
So let’s start with the details and the process to fine-tune the model.
First of all, it is important to understand the difference between multi-class and multi-label classification. Multi-class classification means classifying a sample into one of three or more available classes, while in multi-label classification, one sample can belong to more than one class. Let me explain it more clearly with an example:
Multi-class classification: Let's say we have 10 fruits. Each can belong to one of three classes: 'apple', 'mango', and 'banana'. If we are asked to classify the fruits into these given classes, each fruit can belong to only one class. Therefore, it is a multi-class classification problem.
Multi-label classification: Let's say we have a few movie names, and our task is to classify these movies into the genres they belong to, like 'action', 'comedy', 'horror', 'sci-fi', 'drama', etc. A movie can belong to more than one genre. For example, 'The Matrix' movie series belongs to the 'action' as well as the 'sci-fi' category. This is called multi-label classification.
First of all, there is a need to format the data. The data should contain two columns: one column containing the text to be classified, and another column containing the labels for that sample. The image below is an example of such a data frame:
The above example shows that we have six different classes and a sample can belong to any number of them.
But the question is: how do we convert the labels into this format? Here, scikit-learn comes to the rescue!
Below is an example of how to convert these labels to the required format.
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
array([[0, 1, 1],
       [1, 0, 0]])
>>> list(mlb.classes_)
['comedy', 'sci-fi', 'thriller']

Also, you can refer to the link below to get more details about it.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
Now let's get to the code: the required libraries, how to write the DataLoader, and the model class for this task.
transformers==3.0.2
torch
scikit-learn
numpy
pandas
tqdm
These can be installed with the 'pip install' command, e.g. pip install transformers==3.0.2 torch scikit-learn numpy pandas tqdm.
The next step is to set up the device so the model can run on a GPU when one is available.
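A minimal sketch of the usual PyTorch device setup (the exact lines in the repo may differ):

import torch
from torch import cuda

# Run on the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if cuda.is_available() else 'cpu'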
These parameters can be tuned according to one's needs; a sketch of typical values follows after this note. But there is one important point to be noted here:
DistilBERT accepts a max_sequence_length of 512 tokens.
We cannot give a max_sequence_length larger than this. If you want to use a sequence length of more than 512 tokens, you can try the Longformer model (https://arxiv.org/pdf/2004.05150).
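For concreteness, here is a sketch of the hyperparameter block used in the rest of this post. The names match the code below, but the values themselves are assumptions you should tune for your own dataset and hardware:

# Assumed hyperparameter values; tune these for your own setup
MAX_LEN = 512            # DistilBERT's maximum sequence length
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05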
Next, we call the tokenizer and load the dataset. Here, train_dataset and test_dataset are the training and testing datasets in pandas data frame format, with the column names ['text', 'labels'].
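The MultiLabelDataset class used below lives in the repo; here is a sketch of what such a class typically looks like, assuming the pre-trained 'distilbert-base-uncased' tokenizer and a 'labels' column that already holds the binarized label vectors from MultiLabelBinarizer:

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

class MultiLabelDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = dataframe['text'].tolist()
        self.targets = dataframe['labels'].tolist()
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        # Tokenize one sample, padding/truncating to a fixed length
        inputs = self.tokenizer.encode_plus(
            str(self.text[index]),
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
        )
        return {
            'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
            # Multi-hot label vector as floats, as required by BCEWithLogitsLoss
            'targets': torch.tensor(self.targets[index], dtype=torch.float),
        }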
training_set = MultiLabelDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = MultiLabelDataset(test_dataset, tokenizer, MAX_LEN)

train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': True,
               'num_workers': 0
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

The above step converts the data into the required format using the MultiLabelDataset class and PyTorch's DataLoader. You can read more about DataLoader by visiting the link below:
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
Here, I have used 2 linear layers on top of the DistilBERT model, with a dropout unit and ReLU as the activation function. num_classes is the number of classes available in your dataset. The model returns the logit scores for each class.
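The full class definition is in the repo; the sketch below is one way to write it, assuming the 'distilbert-base-uncased' checkpoint (hidden size 768), a dropout probability of 0.3, and the six classes from the earlier example (all values you would adjust for your own data):

import torch
from transformers import DistilBertModel

num_classes = 6  # number of classes in your dataset (six in the example above)

class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.pre_classifier = torch.nn.Linear(768, 768)       # first linear layer
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, num_classes)   # second linear layer

    def forward(self, input_ids, attention_mask):
        output = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output[0]         # (batch_size, seq_len, 768)
        pooled = hidden_state[:, 0]      # representation of the first token
        pooled = torch.relu(self.pre_classifier(pooled))
        pooled = self.dropout(pooled)
        return self.classifier(pooled)   # raw logit scores, one per class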
The class can be instantiated and moved to the device as follows:

model = DistilBERTClass()
model.to(device)

Here, BCEWithLogitsLoss is used, which is the loss generally used for multi-label classification. You can read more by visiting the link below:
https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
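A minimal sketch of the loss function and optimizer; the choice of Adam and the LEARNING_RATE value are assumptions carried over from the hyperparameter block above:

import torch

def loss_fn(outputs, targets):
    # BCEWithLogitsLoss applies a sigmoid internally, so the model's
    # raw logits can be passed in directly
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)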
A train function is then used for training the model for the specified number of epochs.
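A sketch of such a train function, wiring together the loader, model, loss function, and optimizer defined above:

def train(epoch):
    model.train()
    for _, data in enumerate(training_loader):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        targets = data['targets'].to(device)

        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, targets)

        # Standard PyTorch backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

for epoch in range(EPOCHS):
    train(epoch)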
Here I have used accuracy and f1_score for now. But usually, the Hamming loss and Hamming score are better metrics for calculating loss and accuracy for multi-label classification tasks. I will discuss that in my next post.
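For the accuracy and F1 computation, here is a sketch of the evaluation step: run the model over the test loader, apply a sigmoid to the logits, threshold the probabilities at 0.5 (an assumed threshold), and score the predictions with scikit-learn:

import numpy as np
import torch
from sklearn import metrics

def validation():
    model.eval()
    fin_targets, fin_outputs = [], []
    with torch.no_grad():
        for _, data in enumerate(testing_loader):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            targets = data['targets'].to(device)
            outputs = model(input_ids, attention_mask)
            fin_targets.extend(targets.cpu().numpy().tolist())
            # Sigmoid turns logits into independent per-class probabilities
            fin_outputs.extend(torch.sigmoid(outputs).cpu().numpy().tolist())
    return fin_outputs, fin_targets

outputs, targets = validation()
predictions = (np.array(outputs) >= 0.5).astype(int)
print('Accuracy:', metrics.accuracy_score(targets, predictions))
print('Micro F1:', metrics.f1_score(targets, predictions, average='micro'))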
So this is it for now. Stay tuned for the next post for more details on the Hamming loss, Hamming score, and other metrics. If you want to read more, you can visit my profile for more posts.