Credit Card Fraud Detection with Different Sampling Techniques
Photo by Bermix Studio on Unsplash

Credit card fraud is a plague that puts all financial institutions at risk. Fraud detection in general is very challenging, because fraudsters keep coming up with new and innovative ways of committing fraud in this digital world, so it is difficult to find a stable pattern to detect. For example, in the diagram below all the icons look the same, but one icon is slightly different from the rest and we have to pick that one. Can you spot it?
Photo by Author

Here it is:
Photo by Author

In very simple terms, fraud detection works in a similar way: we have to identify the patterns that differ from the rest. In the example above it may have been easy to spot the anomalous icon, the 'odd one out', but in real life it is far more difficult.
In most cases, because fraudulent transactions are rare, we have to work with data that typically contains far more non-fraud cases than fraud cases. In technical terms, such data is called 'imbalanced data'. It is nevertheless essential to detect the fraud cases, because even a single fraudulent transaction can cause millions in losses to banks/financial institutions.
With this background, let me now lay out the plan for today and what you will learn in the context of our use case, 'credit card fraud detection with imbalanced data'. We will discuss the following:
1. What is data imbalance
2. Possible causes of data imbalance
3. Why class imbalance is a problem in machine learning
4. A quick refresher on the Random Forest algorithm
5. Different sampling methods to deal with data imbalance
6. A comparison of which method works well in our context, with a practical demonstration in Python
7. Business insight on which model to choose and why
We will be considering the credit card fraud dataset from https://www.kaggle.com/mlg-ulb/creditcardfraud
First, what is data imbalance? Formally, it means that the distribution of samples across the different classes is unequal. In our case of binary classification, there are 2 classes:
a) Majority class: the non-fraudulent/genuine transactions
b) Minority class: the fraudulent transactions
In the dataset considered, the class distribution is as follows:
Table by Author

Class            Count     % of total
Non-fraud (0)    284,315   99.83%
Fraud (1)        492       0.17%

As we can observe, the dataset is highly imbalanced, with only 0.17% of the observations falling in the fraudulent category.
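As a quick sketch (assuming the Kaggle CSV has been downloaded locally as creditcard.csv; in this dataset the Class column is 1 for fraud and 0 otherwise), the class distribution can be checked with pandas. This also gives us the x and y used in the models later on:

import pandas as pd

# Load the Kaggle credit card fraud dataset (creditcard.csv assumed to be in the working directory)
df = pd.read_csv('creditcard.csv')

# 'Class' is 1 for fraud, 0 for genuine transactions
print(df['Class'].value_counts())
print(df['Class'].value_counts(normalize=True) * 100)  # class percentages

# Features and target used in the models below
x = df.drop('Class', axis=1)
y = df['Class']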
There can be 2 main causes of data imbalance:
a) Biased sampling/measurement errors: this is due to samples being collected from only one class or one particular region, or to samples being misclassified. It can be resolved by improving the sampling methods.
b) Use case/domain characteristic: a more pertinent cause, as in our case, is the prediction of a rare event, which automatically introduces skew towards the majority class because the minority class simply does not occur often in practice.
Class imbalance is a problem because most machine learning algorithms focus on learning from the occurrences/observation points that appear frequently, i.e. the majority class. This is called the frequency bias. So on imbalanced datasets these algorithms may not work well. The few techniques that typically do work well are tree-based algorithms and anomaly detection algorithms. Traditionally, business-rule-based methods have often been used in fraud detection problems. Tree-based methods work well because a tree creates a rule-based hierarchy that can separate the two classes. Decision trees tend to over-fit the data, and to eliminate this possibility we will go with an ensemble method. For our use case, we will use the Random Forest algorithm today.
Random Forest works by building multiple decision tree predictors; the mode of the classes predicted by these individual trees becomes the final selected class or output. It is like voting for the most popular class. For example: if 2 trees predict that Rule 1 indicates fraud while another tree predicts that Rule 1 indicates non-fraud, then according to the Random Forest algorithm the final prediction will be fraud.
Formal definition: A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, …}, where the {Θk} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. (Source)
Each tree depends on a random vector that is sampled independently, and all trees have a similar distribution. The generalization error converges as the number of trees increases. In its splitting criterion, Random Forest searches for the best feature among a random subset of the features; we can also compute variable importance and do feature selection accordingly. The trees can be grown using the bagging technique, where observations are randomly selected (without replacement) from the training set. The other method is random split selection, where a split is chosen at random from the K best splits at each node.
You can read more about it here.
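To make the voting concrete, here is a minimal sketch (using sklearn on a made-up toy dataset, purely for illustration) that compares the per-tree predictions with the forest's final prediction. Note that sklearn's RandomForestClassifier technically averages the trees' predicted probabilities, which for hard class outputs usually coincides with a majority vote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A made-up, imbalanced toy dataset (95% class 0, 5% class 1), for illustration only
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
forest.fit(X, y)

# Each fitted tree is exposed via forest.estimators_
sample = X[:1]
tree_votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print('Individual tree votes:', tree_votes)
print('Forest prediction:', forest.predict(sample)[0])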
We will now illustrate 3 sampling methods that can take care of data imbalance.
a) Random under-sampling: Random draws are taken from the non-fraud observations (the majority class) to match the number of fraud observations (the minority class). This means we are throwing away some of the information in the dataset, which may not always be ideal. The figure below illustrates this methodology.

Image by Author
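As a minimal sketch of what this does to the class counts (reusing the x and y defined earlier, and assuming the imbalanced-learn package is installed):

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Down-sample the majority class until both classes are the same size
rus = RandomUnderSampler(random_state=0)
x_res, y_res = rus.fit_resample(x, y)
print('Before:', Counter(y))      # e.g. {0: 284315, 1: 492}
print('After: ', Counter(y_res))  # e.g. {0: 492, 1: 492}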
b) Random over-sampling: In this case we do the exact opposite of under-sampling, i.e. we duplicate the minority class (the fraudulent observations) at random to increase its size until we get a balanced dataset. The possible limitation here is that this method creates a lot of duplicates. You can refer to the figure below for a pictorial representation of the method.

Image by Author
c) SMOTE (Synthetic Minority Over-sampling TEchnique): SMOTE is another method, one that creates synthetic data using k-nearest neighbours (KNN) instead of duplicating existing data. Each minority class example is considered together with its k nearest neighbours, and synthetic examples are created along the line segments that join the minority class examples to their k nearest neighbours. This is illustrated in the figure below:

Image by Author
With plain over-sampling the decision boundary becomes smaller, while with SMOTE we can create larger decision regions, thereby improving the chance of capturing the minority class.
One possible limitation is that if the minority class (the fraudulent observations) is spread throughout the data rather than forming distinct clusters, then using nearest neighbours to create more fraud cases introduces noise into the data, and this can lead to misclassification.
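The core interpolation step behind SMOTE can be sketched in a few lines of numpy. This is a simplified illustration of the idea, not the library implementation; the two points below are made up:

import numpy as np

rng = np.random.default_rng(0)

# A minority-class sample and one of its k nearest minority-class neighbours (made-up points)
x_i = np.array([1.0, 2.0])
x_nn = np.array([2.0, 3.0])

# SMOTE places the synthetic sample at a random point on the segment joining them:
# x_new = x_i + lam * (x_nn - x_i), with lam drawn uniformly from [0, 1]
lam = rng.uniform(0, 1)
x_new = x_i + lam * (x_nn - x_i)
print('Synthetic sample:', x_new)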
Some of the metrics that are useful for judging the performance of a model are listed below. These metrics give a view of how accurately the model is able to predict/classify the target variable. We will use these concepts later to illustrate model performance and to choose the best model for our use case.
Table by Author

                    Predicted: fraud    Predicted: non-fraud
Actual: fraud              TP                    FN
Actual: non-fraud          FP                    TN

TP (true positive)/TN (true negative) are the cases of correct predictions, i.e. predicting fraud cases as fraud (TP) and predicting non-fraud cases as non-fraud (TN)
FP (false positive) are those cases that are actually non-fraud but the model predicted as fraud
FN (false negative) are those cases that are actually fraud but the model predicted as non-fraud
Precision = TP / (TP + FP): Precision measures how accurately the model captures fraud, i.e. out of the total predicted fraud cases, how many actually turned out to be fraud.
Recall = TP / (TP + FN): Recall measures, out of all the actual fraud cases, how many the model could correctly predict as fraud. This is an important metric here.
Accuracy = (TP + TN) / (TP + FP + FN + TN): Measures how many of the majority as well as the minority class observations could be correctly classified.
F-score = 2*TP / (2*TP + FP + FN) = 2*Precision*Recall / (Precision + Recall); this is a balance between precision and recall. Note that precision and recall are inversely related, hence the F-score is a good measure for achieving a balance between the two.
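As a quick worked sketch (the confusion-matrix counts below are made up for illustration), these metrics can be computed directly; note how high the accuracy looks even though almost one in five frauds is missed:

# Made-up confusion-matrix counts, for illustration only
tp, fn = 80, 18     # actual frauds: caught vs. missed
fp, tn = 20, 56845  # genuine transactions: wrongly flagged vs. correctly passed

precision = tp / (tp + fp)                  # 0.80
recall = tp / (tp + fn)                     # ~0.82
accuracy = (tp + tn) / (tp + fp + fn + tn)  # ~0.9993
f_score = 2 * precision * recall / (precision + recall)
print(f'precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.4f} f1={f_score:.2f}')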
First, we will train the random forest model with some default parameters. After that, we train the model using under-sampling, over-sampling and then SMOTE. Please note that optimizing the model with feature selection or cross-validation is kept out of scope here for the sake of simplicity. The tables below show the confusion matrix along with the precision, recall and accuracy metrics for each method.
Confusion matrix for the different methods: Table by Author

Comparison of the different sampling methods: Table by Author

The code below can be used to train the default random forest model with no sampling.
# Split the data into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Train the model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# Predict y on the test set
y_pred = classifier.predict(x_test)

# Obtain the results from the classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print('Classification report:\n', classification_report(y_test, y_pred))
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print('Confusion matrix:\n', conf_mat)

No-sampling result interpretation: Without any sampling we are able to capture 76 fraudulent transactions. Though the overall accuracy is 97%, the recall is 75%. This means that there are quite a few fraudulent transactions that our model is not able to capture.
For under-sampling, the following code snippet can be used. We will use the pipeline module from the imblearn library, which helps combine the resampling method with the random forest model.
# This is the pipeline module we need from imblearn for under-sampling
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Define which resampling method and which ML model to use in the pipeline
resampling = RandomUnderSampler()
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline and combine the sampling method with the RF model
pipeline = Pipeline([('RandomUnderSampler', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

Under-sampling result interpretation: With under-sampling, though the model is able to capture 90 fraud cases with a significant improvement in recall, the accuracy and precision fall drastically. This is because the false positives have increased phenomenally and the model is penalizing a lot of genuine transactions.
For over-sampling, the previous code with a few tweaks will work for us. Below is the code for your reference. Only the steps that change are shown; the rest of the code is the same as the under-sampling code given above.
# This is the module we need from imblearn for random over-sampling
from imblearn.over_sampling import RandomOverSampler

# Define which resampling method and which ML model to use in the pipeline
resampling = RandomOverSampler()
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline and combine the sampling method with the model
pipeline = Pipeline([('RandomOverSampler', resampling), ('RF', model)])

Over-sampling result interpretation: The over-sampling method has the highest precision and accuracy, and the recall is also good at 81%. We are able to capture 6 more fraud cases, and the false positives are pretty low as well. Overall, from the perspective of all the metrics, this is a good model.
Finally, we will implement the SMOTE method with the random forest model. Note that the resampling methods must be applied to the training data only, never the test data; hence we have to first split the dataset into training and test samples and then train the model. (The imblearn pipeline takes care of this: the resampling step is applied only during fit.) The code is given below.
# This is the module we need from imblearn for SMOTE
from imblearn.over_sampling import SMOTE

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(sampling_strategy='auto', random_state=0)
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline and combine the sampling method with the model
pipeline = Pipeline([('SMOTE', resampling), ('RF', model)])

SMOTE further improves on the over-sampling method, with 3 more frauds caught in the net, and though the false positives increase a bit, the recall is pretty healthy at 84%.
The full code with the details can be accessed here.
Summary:
In our fraud detection use case, the single most important metric is recall. This is because banks/financial institutions are most concerned about catching as many of the fraud cases as possible: fraud is expensive and they might lose a lot of money over it. Hence, even a few false positives, i.e. flagging genuine customers as fraudulent, may not be too cumbersome, because this only means blocking some transactions. However, blocking too many genuine transactions is not a feasible solution either, so depending on the risk appetite of the financial institution we can go with either the simple over-sampling method or SMOTE. We can also tune the parameters of the model with a grid search to further improve the results.
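As a sketch of that last point (the parameter grid below is illustrative, not a tuned recommendation), a grid search scored on recall over the SMOTE pipeline defined above could look like this:

from sklearn.model_selection import GridSearchCV

# The 'RF__' prefix targets the random forest step inside the imblearn pipeline
param_grid = {
    'RF__n_estimators': [10, 50, 100],
    'RF__max_depth': [None, 10, 20],
}

# Score on recall, since catching frauds matters most in this use case
grid = GridSearchCV(pipeline, param_grid, scoring='recall', cv=5)
grid.fit(x_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV recall:', grid.best_score_)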
References:
[1] Bartosz Krawczyk, Learning from imbalanced data: open challenges and future directions (2016), Springer
[2] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research
[3] Leo Breiman, Random Forests (2001), stat.berkeley.edu
[4] Jeremy Jordan, Learning from imbalanced data (2018)
Translated from: https://towardsdatascience.com/credit-card-fraud-detection-with-different-sampling-techniques-3d869becac67