ROC Curves and AUC
The receiver operating characteristic (ROC) curve is frequently used for evaluating the performance of binary classification algorithms. It provides a graphical representation of a classifier’s performance, rather than a single value like most other metrics.
First, let’s establish that in binary classification, there are four possible outcomes for a test prediction: true positive, false positive, true negative, and false negative.
[Figure: Confusion matrix structure for binary classification problems]

The ROC curve is produced by calculating and plotting the true positive rate against the false positive rate for a single classifier at a variety of thresholds. For example, in logistic regression, the threshold would be the predicted probability of an observation belonging to the positive class. Normally in logistic regression, if an observation is predicted to be positive at > 0.5 probability, it is labeled as positive. However, we could really choose any threshold between 0 and 1 (0.1, 0.3, 0.6, 0.99, etc.), and ROC curves help us visualize how these choices affect classifier performance.
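To make the idea of a variable threshold concrete, here is a minimal sketch (with made-up predicted probabilities, separate from the data set used later) of how the same scores produce different labels at different cutoffs:

import numpy as np

# Hypothetical predicted probabilities for five observations
probs = np.array([0.15, 0.40, 0.55, 0.70, 0.95])

# The same probabilities yield different positive/negative labels
# depending on the chosen threshold
for threshold in [0.3, 0.5, 0.8]:
    labels = (probs > threshold).astype(int)
    print(f'threshold={threshold}: {labels}')

Each threshold produces one (true positive rate, false positive rate) pair; sweeping the threshold traces out the ROC curve.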
The true positive rate, or sensitivity, can be represented as:

TPR = TP / (TP + FN)
where TP is the number of true positives and FN is the number of false negatives. The true positive rate is a measure of the probability that an actual positive instance will be classified as positive.
The false positive rate, or 1 − specificity, can be written as:

FPR = FP / (FP + TN)
where FP is the number of false positives and TN is the number of true negatives. The false positive rate is essentially a measure of how often a “false alarm” will occur — or, how often an actual negative instance will be classified as positive.
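For a quick worked example: suppose a classifier produces TP = 90, FN = 10, FP = 20, and TN = 80. Then TPR = 90 / (90 + 10) = 0.9 and FPR = 20 / (20 + 80) = 0.2; that is, the classifier catches 90% of actual positives while raising a false alarm on 20% of actual negatives.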
Figure 1 demonstrates how some theoretical classifiers would plot on an ROC curve. The gray dotted line represents a classifier that is no better than random guessing — this will plot as a diagonal line. The purple line represents a perfect classifier — one with a true positive rate of 100% and a false positive rate of 0%. Nearly all real-world examples will fall somewhere between these two lines — not perfect, but providing more predictive power than random guessing. Typically, what we’re looking for is a classifier that maintains a high true positive rate while also having a low false positive rate — this ideal classifier would “hug” the upper left corner of Figure 1, much like the purple line.
[Fig. 1 — Some theoretical ROC curves]

While it is useful to visualize a classifier's ROC curve, in many cases we can boil this information down to a single metric: the AUC.
AUC stands for area under the (ROC) curve. Generally, the higher the AUC score, the better a classifier performs for the given task.
Figure 2 shows that for a classifier with no predictive power (i.e., random guessing), AUC = 0.5, and for a perfect classifier, AUC = 1.0. Most classifiers fall between 0.5 and 1.0, the rare exception being a classifier that performs worse than random guessing (AUC < 0.5).

[Fig. 2 — Theoretical ROC curves with AUC scores]
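To see these two extremes numerically, here is a minimal sketch using scikit-learn's roc_auc_score on small made-up label/score arrays (the values are arbitrary):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]

# A perfect scorer ranks every positive above every negative: AUC = 1.0
perfect_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
print(roc_auc_score(y_true, perfect_scores))  # 1.0

# Constant scores carry no ranking information: AUC = 0.5
uninformative_scores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(roc_auc_score(y_true, uninformative_scores))  # 0.5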
One advantage of ROC curves is that they help us find a classification threshold that suits our specific problem.
For example, if we were evaluating an email spam classifier, we would want the false positive rate to be really, really low. We wouldn't want someone to lose an important email to the spam filter just because our algorithm was too aggressive. We would probably even let a fair number of actual spam emails slip through the filter (missed detections, i.e., false negatives) just to make sure that no important emails were lost.
On the other hand, if our classifier is predicting whether someone has a terminal illness, we might be ok with a higher number of false positives (incorrectly diagnosing the illness), just to make sure that we don’t miss any true positives (people who actually have the illness).
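One practical way to act on these trade-offs is scikit-learn's roc_curve function, which returns the candidate thresholds alongside the two rates. Below is a minimal sketch on made-up labels and scores; the 0.05 cap on the false positive rate is an arbitrary example of a spam-filter-style constraint:

import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.35, 0.4, 0.6, 0.65, 0.8, 0.2, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Among thresholds that keep the false positive rate at or below 0.05,
# pick the one with the highest true positive rate
mask = fpr <= 0.05
best = np.argmax(tpr[mask])
print(thresholds[mask][best])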
Additionally, ROC curves and AUC scores also allow us to compare the performance of different classifiers for the same problem.
To demonstrate how the ROC curve is constructed in practice, I’m going to work with the Heart Disease UCI data set in Python. The data set has 14 attributes, 303 observations, and is typically used to predict whether a patient has heart disease based on the other 13 attributes, which include age, sex, cholesterol level, and other measurements.
Imports & Loading Data
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('heart.csv')
df.head()

Train-Test Split
For this analysis, I'll use a standard 75%/25% train-test split.
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=56)

Logistic Regression Classifier
Before I write a function to calculate false positive and true positive rate, I’ll fit a vanilla Logistic Regression classifier on the training data, and make predictions on the test set.
# Fit a vanilla Logistic Regression classifier and make predictions
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred_test = clf.predict(X_test)

Calculating True Positive Rate and False Positive Rate
Now that I have test predictions, I can write a function to calculate the true positive rate and false positive rate. This is a critical step, as these are the two variables needed to produce the ROC curve.
# Function to calculate True Positive Rate and False Positive Rate
def calc_TP_FP_rate(y_true, y_pred):

    # Convert predictions to series with index matching y_true
    y_pred = pd.Series(y_pred, index=y_true.index)

    # Instantiate counters
    TP = 0
    FP = 0
    TN = 0
    FN = 0

    # Determine whether each prediction is TP, FP, TN, or FN
    for i in y_true.index:
        if y_true[i] == y_pred[i] == 1:
            TP += 1
        if y_pred[i] == 1 and y_true[i] != y_pred[i]:
            FP += 1
        if y_true[i] == y_pred[i] == 0:
            TN += 1
        if y_pred[i] == 0 and y_true[i] != y_pred[i]:  # fixed: compare against y_true, not the global y_test
            FN += 1

    # Calculate true positive rate and false positive rate
    tpr = TP / (TP + FN)
    fpr = FP / (FP + TN)

    return tpr, fpr

# Test function
calc_TP_FP_rate(y_test, y_pred_test)

(0.6923076923076923, 0.1891891891891892)

The test shows that the function appears to be working: a true positive rate of 69% and a false positive rate of 19% are perfectly reasonable results.
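As an optional sanity check (not part of the original walkthrough), the same counts can be pulled from scikit-learn's confusion_matrix, whose ravel() output for binary labels is ordered (TN, FP, FN, TP):

from sklearn.metrics import confusion_matrix

# Should reproduce the rates returned by calc_TP_FP_rate above
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()
print(tp / (tp + fn), fp / (fp + tn))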
Exploring Varying Thresholds
To obtain the ROC curve, I need more than one pair of true positive/false positive rates. I need to vary the threshold probability that the Logistic Regression classifier uses to predict whether a patient has heart disease (target=1) or doesn't (target=0). Remember, while Logistic Regression is used to assign a class label, what it's actually doing is determining the probability that an observation belongs to a specific class. In a typical binary classification problem, an observation must have a probability of > 0.5 to be assigned to the positive class. However, in this case, I will vary that threshold probability value incrementally from 0 to 1. This will produce the range of true positive and false positive rates needed to build the ROC curve.
In the code blocks below, I obtain these true positive rates and false positive rates across a range of threshold probability values. For comparison, I use logistic regression with (1) no regularization and (2) L2 regularization.
# LOGISTIC REGRESSION (NO REGULARIZATION)

# Fit and predict test class probabilities
# (note: in scikit-learn >= 1.2, use penalty=None instead of penalty='none')
lr = LogisticRegression(max_iter=1000, penalty='none')
lr.fit(X_train, y_train)
y_test_probs = lr.predict_proba(X_test)[:,1]

# Containers for true positive / false positive rates
lr_tp_rates = []
lr_fp_rates = []

# Define probability thresholds to use, between 0 and 1
probability_thresholds = np.linspace(0, 1, num=100)

# Find true positive / false positive rate for each threshold
for p in probability_thresholds:
    y_test_preds = []
    for prob in y_test_probs:
        if prob > p:
            y_test_preds.append(1)
        else:
            y_test_preds.append(0)
    tp_rate, fp_rate = calc_TP_FP_rate(y_test, y_test_preds)
    lr_tp_rates.append(tp_rate)
    lr_fp_rates.append(fp_rate)

# LOGISTIC REGRESSION (L2 REGULARIZATION)

# Fit and predict test class probabilities
lr_l2 = LogisticRegression(max_iter=1000, penalty='l2')
lr_l2.fit(X_train, y_train)
y_test_probs = lr_l2.predict_proba(X_test)[:,1]

# Containers for true positive / false positive rates
l2_tp_rates = []
l2_fp_rates = []

# Find true positive / false positive rate for each threshold
# (reusing the same probability_thresholds as above)
for p in probability_thresholds:
    y_test_preds = []
    for prob in y_test_probs:
        if prob > p:
            y_test_preds.append(1)
        else:
            y_test_preds.append(0)
    tp_rate, fp_rate = calc_TP_FP_rate(y_test, y_test_preds)
    l2_tp_rates.append(tp_rate)
    l2_fp_rates.append(fp_rate)

Plotting the ROC Curves
# Plot ROC curves
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(lr_fp_rates, lr_tp_rates, label='Logistic Regression')
ax.plot(l2_fp_rates, l2_tp_rates, label='L2 Logistic Regression')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.legend();

Both versions of the logistic regression classifier seem to do a pretty good job, but the L2 regularized version appears to perform slightly better.
Calculating AUC Scores
sklearn has an auc() function, which I'll use here to calculate the AUC scores for both versions of the classifier. auc() takes the false positive rates (as x) and true positive rates (as y) we previously calculated, and returns the AUC score computed via the trapezoidal rule.
# Get AUC scores
from sklearn.metrics import auc

print(f'Logistic Regression (No reg.) AUC {auc(lr_fp_rates, lr_tp_rates)}')
print(f'Logistic Regression (L2 reg.) AUC {auc(l2_fp_rates, l2_tp_rates)}')

Logistic Regression (No reg.) AUC 0.902979902979903
Logistic Regression (L2 reg.) AUC 0.9116424116424116

As expected, the classifiers both have similar AUC scores, with the L2 regularized version performing slightly better.
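For a quick cross-check, roc_auc_score can compute the AUC directly from the predicted probabilities, skipping the manual threshold sweep entirely; it should land very close to the trapezoidal values above (small differences can come from the 100-point threshold grid):

from sklearn.metrics import roc_auc_score

# AUC straight from class probabilities, no manual thresholds needed
print(roc_auc_score(y_test, lr_l2.predict_proba(X_test)[:, 1]))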
Now that we've had fun plotting these ROC curves from scratch, you'll be relieved to know that there is a much, much easier way. sklearn's plot_roc_curve() function can plot an ROC curve using only a fitted classifier and test data as input, and conveniently includes the AUC score in the plot. (Note that plot_roc_curve was removed in scikit-learn 1.2; its replacement, RocCurveDisplay.from_estimator, takes the same inputs.)
# Use sklearn to plot ROC curves
# (in scikit-learn >= 1.2, use RocCurveDisplay.from_estimator instead)
from sklearn.metrics import plot_roc_curve

plot_roc_curve(lr, X_test, y_test, name='Logistic Regression')
plot_roc_curve(lr_l2, X_test, y_test, name='L2 Logistic Regression');

Closing
If you’ve made it this far, thanks for reading! I found it a valuable exercise to inefficiently create my own ROC curves in Python, and I hope you gained something from following along.
Translated from: https://towardsdatascience.com/understanding-the-roc-curve-and-auc-dd4f9a192ecb