The scikit-learn API provides a MultiOutputClassifier class for classifying multi-output data. In this tutorial, we’ll learn how to classify multi-output (multi-label) data with this class in Python. In multi-output data, each X input has more than one y label. The tutorial covers:
1. Preparing the data
2. Defining the model
3. Predicting and accuracy check
4. Source code listing

We can generate multi-output data with the make_multilabel_classification function. The target dataset contains 20 features (x), 5 classes (y), and 10000 samples.
We’ll define them in the parameters of the function.
    from sklearn.datasets import make_multilabel_classification

    x, y = make_multilabel_classification(n_samples=10000, n_features=20, n_classes=5, random_state=88)

The first five rows of the generated data are printed below. There are 20 features and 5 labels in this dataset.
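A quick shape check can confirm these dimensions before looking at individual rows. This is a minimal sketch that reuses the parameters from the text; the variable name labels_per_sample is introduced here only for illustration.

```python
from sklearn.datasets import make_multilabel_classification

# Same parameters as in the tutorial text.
x, y = make_multilabel_classification(n_samples=10000, n_features=20,
                                      n_classes=5, random_state=88)

print(x.shape)  # (10000, 20): one row per sample, one column per feature
print(y.shape)  # (10000, 5): one 0/1 indicator column per label

# Number of active labels per sample (hypothetical helper variable).
# A value of 0 means the sample carries no label at all.
labels_per_sample = y.sum(axis=1)
print(labels_per_sample.min(), labels_per_sample.max())
```

Each row of y is an indicator vector, so a sample can carry several labels at once or none.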
    for i in range(5):
        print(x[i], " =====> ", y[i])

    ----------------------------------------------------------------------------------
    [5. 4. 0. 4. 3. 0. 1. 1. 0. 3. 0. 1. 6. 0. 0. 2. 0. 1. 6. 1.]  =====>  [1 0 0 0 0]
    [2. 2. 0. 1. 5. 1. 2. 0. 7. 4. 1. 0. 2. 1. 5. 2. 0. 4. 0. 6.]  =====>  [0 0 0 0 1]
    [3. 4. 2. 1. 4. 5. 2. 2. 4. 1. 1. 2. 3. 5. 2. 3. 0. 4. 5. 2.]  =====>  [0 1 0 1 0]
    [0. 5. 2. 3. 2. 3. 7. 4. 4. 1. 3. 0. 5. 5. 2. 1. 3. 3. 2. 3.]  =====>  [0 0 0 0 0]
    [3. 6. 2. 3. 2. 0. 1. 3. 2. 4. 0. 0. 3. 4. 1. 6. 0. 5. 0. 8.]  =====>  [1 0 0 0 1]

We’ll define the model with the MultiOutputClassifier class of sklearn. As the base estimator, we’ll use XGBClassifier, and then wrap it in the MultiOutputClassifier class.
We can check the parameters of the model by printing it.
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    classifier = MultiOutputClassifier(XGBClassifier())
    clf = Pipeline([('classify', classifier)])
    print(clf)

    ------------------------------------------------
    Pipeline(steps=[('classify',
        MultiOutputClassifier(estimator=XGBClassifier(base_score=None, booster=None,
            colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None,
            gamma=None, gpu_id=None, importance_type='gain',
            interaction_constraints=None, learning_rate=None, max_delta_step=None,
            max_depth=None, min_child_weight=None, missing=nan,
            monotone_constraints=None, n_estimators=100, n_jobs=None,
            num_parallel_tree=None, random_state=None, reg_alpha=None,
            reg_lambda=None, scale_pos_weight=None, subsample=None,
            tree_method=None, validate_parameters=None, verbosity=None)))])

We’ll check several accuracy metrics for this prediction. Here ytest holds the true labels of the held-out test set and yhat holds the model’s predictions for it. Both have five label columns, so we evaluate each column separately.
First, we’ll check the area under the ROC curve with the roc_auc_score function.
    from sklearn.metrics import roc_auc_score

    auc_y1 = roc_auc_score(ytest[:,0], yhat[:,0])
    auc_y2 = roc_auc_score(ytest[:,1], yhat[:,1])
    auc_y3 = roc_auc_score(ytest[:,2], yhat[:,2])
    auc_y4 = roc_auc_score(ytest[:,3], yhat[:,3])
    auc_y5 = roc_auc_score(ytest[:,4], yhat[:,4])
    print("ROC AUC y1: %.4f, y2: %.4f, y3: %.4f, y4: %.4f, y5: %.4f" % (auc_y1, auc_y2, auc_y3, auc_y4, auc_y5))

    -------------------------------------------------------
    ROC AUC y1: 0.8230, y2: 0.8025, y3: 0.8091, y4: 0.8005, y5: 0.8086

The second method is to check the confusion matrices.
    from sklearn.metrics import confusion_matrix

    cm_y1 = confusion_matrix(ytest[:,0], yhat[:,0])
    cm_y2 = confusion_matrix(ytest[:,1], yhat[:,1])
    cm_y3 = confusion_matrix(ytest[:,2], yhat[:,2])
    cm_y4 = confusion_matrix(ytest[:,3], yhat[:,3])
    cm_y5 = confusion_matrix(ytest[:,4], yhat[:,4])
    print(cm_y1)

    ---------------
    [[1053  140]
     [ 191  616]]

Finally, we’ll check the classification report with the classification_report function.
    from sklearn.metrics import classification_report

    cr_y1 = classification_report(ytest[:,0], yhat[:,0])
    cr_y2 = classification_report(ytest[:,1], yhat[:,1])
    cr_y3 = classification_report(ytest[:,2], yhat[:,2])
    cr_y4 = classification_report(ytest[:,3], yhat[:,3])
    cr_y5 = classification_report(ytest[:,4], yhat[:,4])
    print(cr_y1)

    -----------------------------
                  precision    recall  f1-score   support

               0       0.85      0.88      0.86      1193
               1       0.81      0.76      0.79       807

        accuracy                           0.83      2000
       macro avg       0.83      0.82      0.83      2000
    weighted avg       0.83      0.83      0.83      2000

In this tutorial, we’ve briefly learned how to classify multi-label data with MultiOutputClassifier and XGBoost in Python.
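The per-label calls in the evaluation steps repeat the same pattern five times. As a compact alternative, scikit-learn's metrics can take the whole label matrix at once. The sketch below uses small made-up arrays in place of the tutorial's ytest and yhat, so its numbers are unrelated to the results reported above.

```python
import numpy as np
from sklearn.metrics import (classification_report,
                             multilabel_confusion_matrix, roc_auc_score)

# Made-up stand-ins for ytest and yhat: 6 samples, 3 labels.
ytest = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 1]])
yhat = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [1, 1, 0],
                 [0, 1, 0]])

# average=None returns one ROC AUC value per label column.
auc_per_label = roc_auc_score(ytest, yhat, average=None)
print(auc_per_label)

# Shape (n_labels, 2, 2); entry i matches confusion_matrix(ytest[:, i], yhat[:, i]).
cm_per_label = multilabel_confusion_matrix(ytest, yhat)
print(cm_per_label[0])

# For multi-label input the report has one row per label (positive class only).
print(classification_report(ytest, yhat, zero_division=0))
```

One difference from the per-column calls: for multi-label input, classification_report lists only the positive class of each label, whereas the per-column version also reports class 0.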
Translated from: https://medium.com/the-innovation/multi-label-classification-example-with-multioutputclassifier-and-xgboost-in-python-98c84c7d379f