Oversampling and Undersampling

    Introduction

    The imbalanced classification problem is what we face when there is a severe skew in the class distribution of our training data. Okay, the skew may not be extremely severe (it can vary), but the reason we identify imbalanced classification as a problem is that it can influence the performance of our Machine Learning algorithms.

    One way the imbalance may affect our Machine Learning algorithm is when our algorithm completely ignores the minority class. The reason this is an issue is that the minority class is often the class that we are most interested in. For instance, when building a classifier to classify fraudulent and non-fraudulent transactions from various observations, the data is likely to have more non-fraudulent transactions than fraudulent ones. I mean, think about it: it would be very worrying if we had an equal amount of fraudulent and non-fraudulent transactions.

    Figure 1: Example of class distribution for a fraud detection problem

    An approach to combat this challenge is Random Sampling. There are two main ways to perform random resampling, both of which have their pros and cons:

    Oversampling — Duplicating samples from the minority class

    Undersampling — Deleting samples from the majority class.

    In other words, both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data, or likely to develop if a purely random sample were taken (Source: Wikipedia).

    We define Random Sampling as a naive technique because when performed it assumes nothing of the data. It involves creating a new transformed version of our data with a new class distribution, which reduces the influence of the imbalance on our Machine Learning algorithm.

    Note: We refer to Random Resampling as naive because when performed it makes no assumptions of the data.

    In this article we will be leveraging the imbalanced-learn framework, which was initiated in 2014 with the main focus on the implementation of SMOTE (another technique for imbalanced data). Over the years, additional oversampling and undersampling methods have been implemented, and the framework has been made compatible with the popular machine learning library scikit-learn. Visit Imbalanced-Learn for guides on installation and the full documentation.

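    If the library isn't installed yet, it can be installed from PyPI (the package is named imbalanced-learn, but is imported as imblearn):

    pip install -U imbalanced-learn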

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    from collections import Counter

    # defining the dataset
    X, y = make_classification(n_samples=10000, weights=[.99])

    # class distribution
    print(Counter(y))
    # Counter({0: 9844, 1: 156})

    For the full code you may visit my Github.

    Random Oversampling

    Random Oversampling involves selecting random examples from the minority class, with replacement, and supplementing the training data with multiple copies of these instances; hence it is possible that a single instance may be selected multiple times.

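    To make "with replacement" concrete, here is a minimal sketch of the mechanic (not imbalanced-learn's actual implementation), assuming the X, y, and Counter from the make_classification example above. It draws minority-class indices with replacement using NumPy, so the same row can be copied many times:

    import numpy as np

    minority_idx = np.where(y == 1)[0]  # 156 minority examples
    majority_idx = np.where(y == 0)[0]  # 9844 majority examples

    # draw minority indices WITH replacement until the classes balance out;
    # the same index (and hence the same row) can be drawn repeatedly
    extra_idx = np.random.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)

    X_balanced = np.vstack([X, X[extra_idx]])
    y_balanced = np.concatenate([y, y[extra_idx]])
    print(Counter(y_balanced))  # both classes now have 9844 examples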

    “the random oversampling may increase the likelihood of overfitting occurring, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.” — Page 83, Learning from Imbalanced Data Sets, 2018.

    For Machine Learning algorithms affected by a skewed distribution, such as artificial neural networks and SVMs, this is a highly effective technique. However, tuning the target class distribution is advised in many scenarios, as seeking a balanced distribution for a severely imbalanced dataset can lead to the algorithm overfitting the minority class, in turn resulting in an increase in our generalization error.

    Another thing we ought to be aware of is the increased computational cost. Increasing the number of examples in the minority class (especially for a severely skewed data set) may result in an increased computational cost when we train our model, and considering the model is seeing the same examples multiple times, this isn't a good thing.

    Nonetheless, Oversampling is a pretty decent solution and should be tested. Here is how we can implement it in Python…

    # instantiating the random over sampler
    ros = RandomOverSampler()

    # resampling X, y
    X_ros, y_ros = ros.fit_resample(X, y)

    # new class distribution
    print(Counter(y_ros))
    # Counter({0: 9844, 1: 9844})
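
    As noted above, we may not always want a perfectly balanced result. Passing a float to RandomOverSampler's sampling_strategy parameter sets the desired ratio of minority to majority examples after resampling (the 0.3 below is just an illustrative value):

    # oversample the minority class to roughly 30% of the majority class size
    ros_tuned = RandomOverSampler(sampling_strategy=0.3)
    X_tuned, y_tuned = ros_tuned.fit_resample(X, y)
    print(Counter(y_tuned))
    # roughly Counter({0: 9844, 1: 2953}), since 0.3 * 9844 ≈ 2953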

    Random Undersampling

    Random Undersampling is the opposite of Random Oversampling. This method seeks to randomly select and remove samples from the majority class, consequently reducing the number of examples in the majority class in the transformed data.

    “In random under-sampling (potentially), vast quantities of data are discarded. […] This can be highly problematic, as the loss of such data can make the decision boundary between the minority and majority instances harder to learn, resulting in a loss in classification performance.” — Page 45, Imbalanced Learning: Foundations, Algorithms and Applications, 2013

    The result of undersampling is a transformed data set with fewer examples in the majority class — this process may be repeated until the number of examples in each class is equal.

    Using this approach is effective in situations where the minority class has a sufficient number of examples despite the severe imbalance. On the other hand, it is always important to consider the prospect of valuable information being deleted as we randomly remove examples from our data set, since we have no way to detect or preserve the examples in the majority class that are information rich.

    To better understand this method, here is a python implementation…

    # instantiating the random undersampler
    rus = RandomUnderSampler()

    # resampling X, y
    X_rus, y_rus = rus.fit_resample(X, y)

    # new class distribution
    print(Counter(y_rus))
    # Counter({0: 156, 1: 156})
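
    To limit how much data is thrown away, we can again pass a float to sampling_strategy, the desired ratio of minority to majority examples: rather than cutting the majority class all the way down to 156 examples, we keep a partial reduction (the 0.1 below is just an illustrative value):

    # keep roughly ten majority examples per minority example
    rus_partial = RandomUnderSampler(sampling_strategy=0.1)
    X_partial, y_partial = rus_partial.fit_resample(X, y)
    print(Counter(y_partial))
    # roughly Counter({0: 1560, 1: 156}), since 156 / 0.1 = 1560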

    Combining Both Random Sampling Techniques

    Combining both random sampling methods can occasionally result in overall improved performance in comparison to performing either method in isolation.

    The concept is that we can apply a modest amount of oversampling to the minority class, which improves the bias toward the minority class examples, whilst we also perform a modest amount of undersampling on the majority class to reduce the bias toward the majority class examples.

    To implement this in Python, leveraging the imbalanced-learn framework, we may use the sampling_strategy parameter of our oversampling and undersampling techniques.

    # instantiating the over- and undersamplers
    over = RandomOverSampler(sampling_strategy=0.5)
    under = RandomUnderSampler(sampling_strategy=0.8)

    # first performing oversampling on the minority class
    X_over, y_over = over.fit_resample(X, y)
    print(f"Oversampled: {Counter(y_over)}")
    # Oversampled: Counter({0: 9844, 1: 4922})

    # now combining with undersampling
    X_combined_sampling, y_combined_sampling = under.fit_resample(X_over, y_over)
    print(f"Combined Random Sampling: {Counter(y_combined_sampling)}")
    # Combined Random Sampling: Counter({0: 6152, 1: 4922})
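
    To see where those numbers come from: sampling_strategy is the desired ratio of minority to majority examples after resampling. Oversampling with 0.5 grows the minority class to 0.5 × 9844 = 4922 examples, and undersampling with 0.8 then shrinks the majority class to 4922 / 0.8 ≈ 6152 examples, matching the counts printed above.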

    Wrap Up

    In this guide we discussed oversampling and undersampling for imbalanced classification. There are various occasions where we may be confronted with an imbalanced dataset, and applying random sampling may provide us with a very good way to overcome this problem during training while still maintaining a model that generalizes well to new examples.

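    As a final sketch, both samplers can be chained together with a model using imbalanced-learn's Pipeline, which (unlike scikit-learn's) accepts samplers as intermediate steps, so resampling is applied only to the training folds during cross-validation. The DecisionTreeClassifier below is just an assumed placeholder model:

    from imblearn.pipeline import Pipeline
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    pipeline = Pipeline(steps=[
        ('over', RandomOverSampler(sampling_strategy=0.5)),
        ('under', RandomUnderSampler(sampling_strategy=0.8)),
        ('model', DecisionTreeClassifier()),
    ])

    # F1 focuses the evaluation on the minority (positive) class
    scores = cross_val_score(pipeline, X, y, scoring='f1', cv=5)
    print(f"Mean F1: {scores.mean():.3f}")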

    Let’s continue the conversation on LinkedIn…

    Translated from: https://towardsdatascience.com/oversampling-and-undersampling-5e2bbaf56dcf
