Understanding 8 Types of Cross-Validation
Cross-validation, also referred to as out-of-sample testing, is an essential element of a data science project. It is a resampling procedure used to evaluate machine learning models and to assess how a model will perform on an independent test dataset.
In this article, you can read about 8 different cross-validation techniques, each with its pros and cons, listed below:
Leave-p-out cross-validation
Leave-one-out cross-validation
Holdout cross-validation
Repeated random subsampling validation
k-fold cross-validation
Stratified k-fold cross-validation
Time Series cross-validation
Nested cross-validation
Before coming to the cross-validation techniques, let us first understand why cross-validation should be used in a data science project.
To develop a machine learning model, we often randomly split the dataset into training data and test data. The training data is used to train the ML model, and the same model is then tested on the independent test data to evaluate its performance.
With a change in the random state of the split, the accuracy of the model also changes, so we are not able to achieve a fixed accuracy for the model. The test data should be kept independent of the training data so that no data leakage occurs. During the development of an ML model using the training data, the model's performance still needs to be evaluated. This is where cross-validation comes into the picture.
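As a minimal sketch of this problem (the dataset and classifier here are illustrative choices, assuming scikit-learn is installed), the same model scored on differently seeded splits yields different accuracies:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# The same model evaluated on differently seeded splits gives different scores
for random_state in [0, 1, 42]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state)
    model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
    print(f"random_state={random_state}: accuracy={model.score(X_test, y_test):.3f}")
```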
The data needs to be split into:
Training data: Used for model development
Validation data: Used for validating the performance of the same model
(Image by Author), Validation split
In simple terms, cross-validation allows us to utilize our data even better. Below, you can read about the working and implementation of each of these cross-validation techniques.
Leave-p-out cross-validation (LpOCV) is an exhaustive cross-validation technique that involves using p observations as validation data, while the remaining data is used to train the model. This is repeated over all possible ways of cutting the original sample into a validation set of p observations and a training set.
A variant of LpOCV with p=2, known as leave-pair-out cross-validation, has been recommended as a nearly unbiased method for estimating the area under the ROC curve of a binary classifier.
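As a sketch (the toy data here is illustrative), scikit-learn's LeavePOut splitter enumerates every possible cut; with 4 samples and p=2 there are C(4, 2) = 6 train/validation splits:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)  # 4 samples, 2 features
lpo = LeavePOut(p=2)            # every pair of rows becomes a validation set

for train_index, test_index in lpo.split(X):
    print("train:", train_index, "validation:", test_index)
```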
Leave-one-out cross-validation (LOOCV) is an exhaustive cross-validation technique. It is a special case of LpOCV with p=1.
(Source), LOOCV operations
For a dataset having n rows, the 1st row is selected for validation and the remaining (n-1) rows are used to train the model. For the next iteration, the 2nd row is selected for validation and the rest are used for training. The process is repeated similarly for n steps or the desired number of operations.
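A minimal sketch with scikit-learn's LeaveOneOut splitter (the dataset and model are illustrative): each of the n rows is validated exactly once, and the n scores are averaged:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()  # n splits for n rows: each row is validated once

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(len(scores), scores.mean())  # n individual scores, averaged at the end
```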
Both of the above cross-validation techniques are types of exhaustive cross-validation. Exhaustive cross-validation methods are those that learn and test on all possible ways of dividing the original sample. They share the same pros and cons: they evaluate the model on every possible split, but the number of iterations grows rapidly with the dataset size, making them computationally expensive.
The holdout technique is a non-exhaustive cross-validation method that randomly splits the dataset into train and test data depending on the data analysis.
(Image by Author), 70:30 split of data into training and validation data respectively
In holdout cross-validation, the dataset is randomly split into training and validation data. Generally, the training split is larger than the test split. The training data is used to induce the model, and the validation data is used to evaluate its performance.
The more data that is used to train the model, the better the model tends to be. A drawback of the holdout method is that a good amount of data is isolated from training.
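A minimal sketch of a 70:30 holdout split using scikit-learn's train_test_split (the dataset and model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70% of the rows induce the model; 30% are held out for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```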
In k-fold cross-validation, the original dataset is equally partitioned into k subparts or folds. Out of the k folds or groups, for each iteration, one group is selected as validation data and the remaining (k-1) groups are used as training data.
(Source), k-fold cross-validation
The process is repeated k times, until each group has been treated as validation data with the remaining groups as training data.
(Image by Author), k-fold cross-validation
The final accuracy of the model is computed by taking the mean of the k models' validation accuracies.
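A minimal sketch using scikit-learn's KFold together with cross_val_score (the dataset and model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5 equal folds

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)                            # one validation score per fold
print("mean accuracy:", scores.mean())  # final accuracy = mean over folds
```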
LOOCV is a variant of k-fold cross-validation where k=n.
Repeated random subsampling validation, also referred to as Monte Carlo cross-validation, splits the dataset randomly into training and validation sets. Unlike k-fold cross-validation, the dataset is not divided into fixed groups or folds; instead, the split is made at random on each iteration.
The number of iterations is not fixed and is decided through analysis. The results are then averaged over the splits.
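scikit-learn's ShuffleSplit implements this idea; a minimal sketch (the dataset, model, and choice of 10 iterations are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 independent random 70:30 splits; the iteration count is a free choice
mc = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=mc)
print("mean accuracy over splits:", scores.mean())
```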
(Image by Author), Repeated random subsampling validation
All the cross-validation techniques discussed above may not work well with an imbalanced dataset. Stratified k-fold cross-validation solves the problem of an imbalanced dataset.
In stratified k-fold cross-validation, the dataset is partitioned into k groups or folds such that the validation data contains approximately the same proportion of instances of each target class label as the full dataset. This ensures that no particular class is over-represented in the validation or training data, which matters especially when the dataset is imbalanced.
(Image by Author), Stratified k-fold cross-validation, each fold has equal instances of the target class
The final score is computed by taking the mean of the scores of each fold.
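A minimal sketch with scikit-learn's StratifiedKFold (the dataset and model are illustrative); each fold preserves the class proportions of the full dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each fold keeps roughly the same class ratio as the whole dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, cv=skf)
print("mean score over folds:", scores.mean())
```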
The order of the data is very important for time-series-related problems. For a time-related dataset, a random split or k-fold split of the data into train and validation sets may not yield good results.
For a time-series dataset, the split of data into train and validation sets is made according to time; this is also referred to as the forward-chaining method or rolling cross-validation. For a particular iteration, the instance that follows the training data can be treated as validation data.
(Image by Author), Time Series cross-validation
As shown in the diagram above, for the 1st iteration the first 3 rows are considered as training data and the next instance, T4, is the validation data. The choice of training and validation data is then rolled forward for further iterations.
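scikit-learn's TimeSeriesSplit implements this rolling scheme; a minimal sketch on 8 time-ordered toy samples (the data is illustrative), where the first iteration trains on rows T1-T3 and validates on T4:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(16).reshape(8, 2)  # 8 time-ordered samples T1..T8
tscv = TimeSeriesSplit(n_splits=5)

# Each iteration trains on a growing prefix and validates on what follows
for train_index, test_index in tscv.split(X):
    print("train:", train_index, "validation:", test_index)
```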
With k-fold and stratified k-fold cross-validation, we can get a poor (optimistically biased) estimate of the error if the same folds are used both to tune hyperparameters and to evaluate the model; in the earlier methods, hyperparameter tuning is done separately. When cross-validation is needed simultaneously for tuning the hyperparameters and for generalizing the error estimate, nested cross-validation is required.
Nested cross-validation is applicable to both the k-fold and stratified k-fold variants. Read the article below to learn more about nested cross-validation and its implementation:
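A minimal sketch of nested cross-validation with scikit-learn (the model, parameter grid, and fold counts are illustrative): an inner loop tunes the hyperparameters and an outer loop estimates the generalization error:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)  # tunes C
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)  # estimates error

# The inner search runs inside every outer training fold, so the outer
# score is not biased by the hyperparameter tuning
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV accuracy:", scores.mean())
```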
Cross-validation is used to compare and evaluate the performance of ML models. In this article, we have covered 8 cross-validation techniques along with their pros and cons. k-fold and stratified k-fold cross-validation are the most commonly used techniques. Time-series cross-validation works best for time-series-related problems.
Implementations of these cross-validation techniques can be found in the sklearn package. Read the sklearn documentation for more details.
Translated from: https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d