Multi-Label Classification Example with MultiOutputClassifier and XGBoost in Python

    Technology, 2025-03-20

    The scikit-learn API provides the MultiOutputClassifier class, which helps to classify multi-output data. In this tutorial, we’ll learn how to classify multi-output (multi-label) data with this class in Python. Multi-output data contains more than one y label for a given X input. The tutorial covers:

    Preparing the data

    Defining the model

    Predicting and accuracy check

    Source code listing

    We’ll start by loading the required libraries for this tutorial.

    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import classification_report
    from sklearn.datasets import make_multilabel_classification
    from xgboost import XGBClassifier
    from sklearn.model_selection import KFold
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline

    Preparing the data

    We can generate multi-output data with the make_multilabel_classification function. The target dataset contains 20 features (x), 5 classes (y), and 10000 samples. We’ll set these values through the parameters of the function.

    x, y = make_multilabel_classification(n_samples=10000, n_features=20, n_classes=5, random_state=88)

    The first five rows of the generated data are shown below. Each row has 20 feature values and 5 binary labels.

    for i in range(5):
        print(x[i]," =====> ", y[i])
    ----------------------------------------------------------------------------------
    [5. 4. 0. 4. 3. 0. 1. 1. 0. 3. 0. 1. 6. 0. 0. 2. 0. 1. 6. 1.] =====> [1 0 0 0 0]
    [2. 2. 0. 1. 5. 1. 2. 0. 7. 4. 1. 0. 2. 1. 5. 2. 0. 4. 0. 6.] =====> [0 0 0 0 1]
    [3. 4. 2. 1. 4. 5. 2. 2. 4. 1. 1. 2. 3. 5. 2. 3. 0. 4. 5. 2.] =====> [0 1 0 1 0]
    [0. 5. 2. 3. 2. 3. 7. 4. 4. 1. 3. 0. 5. 5. 2. 1. 3. 3. 2. 3.] =====> [0 0 0 0 0]
    [3. 6. 2. 3. 2. 0. 1. 3. 2. 4. 0. 0. 3. 4. 1. 6. 0. 5. 0. 8.] =====> [1 0 0 0 1]
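Beyond printing a few rows, it can help to look at the shape of the arrays and at how the labels are distributed. The sketch below uses the same generator call as the tutorial; the summary statistics are an addition for illustration and are not part of the original article.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# Same call as in the tutorial
x, y = make_multilabel_classification(n_samples=10000, n_features=20,
                                      n_classes=5, random_state=88)

print(x.shape, y.shape)            # feature matrix and label-indicator matrix
print(y.sum(axis=0))               # number of samples that carry each label
print(np.bincount(y.sum(axis=1)))  # number of samples with 0, 1, 2, ... labels
```

A row such as [0 0 0 0 0] in the printed sample is valid: in multi-label data a sample may carry no label at all.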

    Next, we’ll split the data into train and test parts.

    xtrain, xtest, ytrain, ytest=train_test_split(x, y, train_size=0.8, random_state=88)
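As a quick sanity check, the shapes of the four arrays can be printed after the split. This is a small sketch that repeats the tutorial's calls; with train_size=0.8 and 10000 samples, 8000 rows go to training and 2000 to testing, which matches the support total of 2000 in the classification report later on.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

x, y = make_multilabel_classification(n_samples=10000, n_features=20,
                                      n_classes=5, random_state=88)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8,
                                                random_state=88)

# Features and labels are split along the same rows
print(xtrain.shape, ytrain.shape)
print(xtest.shape, ytest.shape)
```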

    Defining the model

    We’ll define the model with the MultiOutputClassifier class of sklearn. As the base estimator we’ll use XGBClassifier, which we pass to the MultiOutputClassifier constructor.

    We can check the parameters of the model with the print command.

    classifier = MultiOutputClassifier(XGBClassifier())
    clf = Pipeline([('classify', classifier)])
    print(clf)
    ------------------------------------------------
    Pipeline(steps=[('classify',
        MultiOutputClassifier(estimator=XGBClassifier(base_score=None, booster=None,
            colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None,
            gamma=None, gpu_id=None, importance_type='gain',
            interaction_constraints=None, learning_rate=None, max_delta_step=None,
            max_depth=None, min_child_weight=None, missing=nan,
            monotone_constraints=None, n_estimators=100, n_jobs=None,
            num_parallel_tree=None, random_state=None, reg_alpha=None,
            reg_lambda=None, scale_pos_weight=None, subsample=None,
            tree_method=None, validate_parameters=None, verbosity=None)))])
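To make the wrapper's behaviour concrete: MultiOutputClassifier fits one independent copy of the base estimator per label column. The sketch below illustrates this with a DecisionTreeClassifier as a stand-in for XGBClassifier (an assumption made only to keep the example dependency-free and deterministic) and a smaller sample size.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

x, y = make_multilabel_classification(n_samples=2000, n_features=20,
                                      n_classes=5, random_state=88)

# Stand-in base estimator (the tutorial itself uses XGBClassifier)
base = DecisionTreeClassifier(max_depth=5, random_state=0)

wrapped = MultiOutputClassifier(base).fit(x, y)
print(len(wrapped.estimators_))  # one fitted estimator per label column

# Fit one clone per label column by hand and stack the predictions
manual = np.column_stack([
    clone(base).fit(x, y[:, i]).predict(x) for i in range(y.shape[1])
])
same = np.array_equal(wrapped.predict(x), manual)
print(same)
```

Because the labels are modelled independently, this approach does not exploit correlations between labels.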

    We’ll fit the model with training data and check the training accuracy.

    clf.fit(xtrain, ytrain)
    print(clf.score(xtrain, ytrain))
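Note that for a multi-output classifier the score method reports subset accuracy: a sample counts as correct only if all of its labels are predicted correctly. The sketch below reproduces that number by hand and compares it with per-label accuracy; it uses a DecisionTreeClassifier as a stand-in for XGBClassifier (an assumption, chosen to keep the example light) and a smaller dataset.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

x, y = make_multilabel_classification(n_samples=2000, n_features=20,
                                      n_classes=5, random_state=88)

# Stand-in base estimator (the tutorial itself uses XGBClassifier)
clf = MultiOutputClassifier(DecisionTreeClassifier(max_depth=5, random_state=0))
clf.fit(x, y)
pred = clf.predict(x)

subset_acc = clf.score(x, y)                    # all five labels must match
manual_subset_acc = np.mean(np.all(pred == y, axis=1))
per_label_acc = (pred == y).mean(axis=0)        # accuracy of each label column

print(subset_acc, manual_subset_acc)
print(per_label_acc)
```

Subset accuracy can never exceed the accuracy of the weakest label, so it is usually noticeably lower than the per-label figures.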

    Predicting and accuracy check

    We’ll predict the test data and check several accuracy metrics for the prediction.

    yhat = clf.predict(xtest)

    Remember that ytest and yhat each contain five output labels, so we need to evaluate them column by column.

    First, we’ll check the area under the ROC curve with the roc_auc_score function.

    auc_y1 = roc_auc_score(ytest[:,0],yhat[:,0])
    auc_y2 = roc_auc_score(ytest[:,1],yhat[:,1])
    auc_y3 = roc_auc_score(ytest[:,2],yhat[:,2])
    auc_y4 = roc_auc_score(ytest[:,3],yhat[:,3])
    auc_y5 = roc_auc_score(ytest[:,4],yhat[:,4])
    print("ROC AUC y1: %.4f, y2: %.4f, y3: %.4f, y4: %.4f, y5: %.4f" % (auc_y1, auc_y2, auc_y3, auc_y4, auc_y5))
    -------------------------------------------------------
    ROC AUC y1: 0.8230, y2: 0.8025, y3: 0.8091, y4: 0.8005, y5: 0.8086
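The scores above are computed from hard 0/1 predictions. ROC AUC is usually computed from predicted probabilities instead, since it measures how well the model ranks samples. The sketch below shows one possible way to do that with predict_proba, which for MultiOutputClassifier returns a list with one probability array per label. RandomForestClassifier is a stand-in for XGBClassifier here (an assumption for illustration), so the numbers will differ from those in the article.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

x, y = make_multilabel_classification(n_samples=2000, n_features=20,
                                      n_classes=5, random_state=88)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8,
                                                random_state=88)

# Stand-in base estimator (the tutorial itself uses XGBClassifier)
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=50,
                                                   random_state=0))
clf.fit(xtrain, ytrain)

yhat = clf.predict(xtest)         # hard 0/1 predictions, shape (n_samples, 5)
proba = clf.predict_proba(xtest)  # list of 5 arrays, each (n_samples, 2)

auc_hard, auc_soft = [], []
for i in range(ytest.shape[1]):
    auc_hard.append(roc_auc_score(ytest[:, i], yhat[:, i]))
    # column 1 holds the probability of the positive class
    auc_soft.append(roc_auc_score(ytest[:, i], proba[i][:, 1]))
    print("y%d  AUC from labels: %.4f  AUC from probabilities: %.4f"
          % (i + 1, auc_hard[i], auc_soft[i]))
```

Looping over the label columns also avoids keeping five separate variables.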

    The second method is to check the confusion matrices.

    cm_y1 = confusion_matrix(ytest[:,0],yhat[:,0])
    cm_y2 = confusion_matrix(ytest[:,1],yhat[:,1])
    cm_y3 = confusion_matrix(ytest[:,2],yhat[:,2])
    cm_y4 = confusion_matrix(ytest[:,3],yhat[:,3])
    cm_y5 = confusion_matrix(ytest[:,4],yhat[:,4])
    print(cm_y1)
    ---------------
    [[1053  140]
     [ 191  616]]
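scikit-learn also offers multilabel_confusion_matrix, which returns all per-label 2x2 matrices in a single call. The sketch below checks that it agrees with the column-by-column approach; the DecisionTreeClassifier is a stand-in for XGBClassifier (an assumption for illustration), and the dataset is smaller than in the tutorial.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

x, y = make_multilabel_classification(n_samples=2000, n_features=20,
                                      n_classes=5, random_state=88)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8,
                                                random_state=88)

# Stand-in base estimator (the tutorial itself uses XGBClassifier)
clf = MultiOutputClassifier(DecisionTreeClassifier(max_depth=5, random_state=0))
yhat = clf.fit(xtrain, ytrain).predict(xtest)

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]]
mcm = multilabel_confusion_matrix(ytest, yhat)
print(mcm.shape)

# Compare with the per-column matrices used in the tutorial
matches = [
    np.array_equal(mcm[i],
                   confusion_matrix(ytest[:, i], yhat[:, i], labels=[0, 1]))
    for i in range(ytest.shape[1])
]
print(matches)
```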

    Finally, we’ll check the classification report with the classification_report function.

    cr_y1 = classification_report(ytest[:,0],yhat[:,0])
    cr_y2 = classification_report(ytest[:,1],yhat[:,1])
    cr_y3 = classification_report(ytest[:,2],yhat[:,2])
    cr_y4 = classification_report(ytest[:,3],yhat[:,3])
    cr_y5 = classification_report(ytest[:,4],yhat[:,4])
    print(cr_y1)
    -----------------------------
                  precision    recall  f1-score   support

               0       0.85      0.88      0.86      1193
               1       0.81      0.76      0.79       807

        accuracy                           0.83      2000
       macro avg       0.83      0.82      0.83      2000
    weighted avg       0.83      0.83      0.83      2000
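classification_report also accepts the full label-indicator matrices, in which case it prints one row per label plus micro, macro, weighted, and samples averages. The sketch below shows this variant; the label names y1 to y5 are made up for display, and the DecisionTreeClassifier is a stand-in for XGBClassifier (both are assumptions for illustration).

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

x, y = make_multilabel_classification(n_samples=2000, n_features=20,
                                      n_classes=5, random_state=88)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8,
                                                random_state=88)

# Stand-in base estimator (the tutorial itself uses XGBClassifier)
clf = MultiOutputClassifier(DecisionTreeClassifier(max_depth=5, random_state=0))
yhat = clf.fit(xtrain, ytrain).predict(xtest)

# Hypothetical display names for the five label columns
names = ["y1", "y2", "y3", "y4", "y5"]

# zero_division=0 avoids warnings if a label is never predicted
report = classification_report(ytest, yhat, target_names=names,
                               zero_division=0)
print(report)
```

In this multi-label form each row describes the positive class of one label, whereas the per-column reports above also show the metrics of the negative class.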

    In this tutorial, we’ve briefly learned how to classify multi-label data with MultiOutputClassifier and XGBoost in Python.

    Source code listing

    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import classification_report
    from sklearn.datasets import make_multilabel_classification
    from xgboost import XGBClassifier
    from sklearn.model_selection import KFold
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline

    # Generate the multi-label dataset
    x, y = make_multilabel_classification(n_samples=10000, n_features=20, n_classes=5, random_state=88)
    for i in range(5):
        print(x[i]," =====> ", y[i])

    # Split into train and test parts
    xtrain, xtest, ytrain, ytest=train_test_split(x, y, train_size=0.8, random_state=88)
    print(len(xtest))

    # Define and fit the model
    classifier = MultiOutputClassifier(XGBClassifier())
    clf = Pipeline([('classify', classifier)])
    print(clf)

    clf.fit(xtrain, ytrain)
    print(clf.score(xtrain, ytrain))

    # Predict the test data
    yhat = clf.predict(xtest)

    # ROC AUC per label
    auc_y1 = roc_auc_score(ytest[:,0],yhat[:,0])
    auc_y2 = roc_auc_score(ytest[:,1],yhat[:,1])
    auc_y3 = roc_auc_score(ytest[:,2],yhat[:,2])
    auc_y4 = roc_auc_score(ytest[:,3],yhat[:,3])
    auc_y5 = roc_auc_score(ytest[:,4],yhat[:,4])
    print("ROC AUC y1: %.4f, y2: %.4f, y3: %.4f, y4: %.4f, y5: %.4f" % (auc_y1, auc_y2, auc_y3, auc_y4, auc_y5))

    # Confusion matrix per label
    cm_y1 = confusion_matrix(ytest[:,0],yhat[:,0])
    cm_y2 = confusion_matrix(ytest[:,1],yhat[:,1])
    cm_y3 = confusion_matrix(ytest[:,2],yhat[:,2])
    cm_y4 = confusion_matrix(ytest[:,3],yhat[:,3])
    cm_y5 = confusion_matrix(ytest[:,4],yhat[:,4])
    print(cm_y1)

    # Classification report per label
    cr_y1 = classification_report(ytest[:,0],yhat[:,0])
    cr_y2 = classification_report(ytest[:,1],yhat[:,1])
    cr_y3 = classification_report(ytest[:,2],yhat[:,2])
    cr_y4 = classification_report(ytest[:,3],yhat[:,3])
    cr_y5 = classification_report(ytest[:,4],yhat[:,4])
    print(cr_y1)

    Translated from: https://medium.com/the-innovation/multi-label-classification-example-with-multioutputclassifier-and-xgboost-in-python-98c84c7d379f
