So far, when implementing all of our regression models in Python, we have been using all of our data to construct our model:
This, however, often leads to models that overfit our data, making it very difficult to evaluate the model and improve it.
To address this problem, before creating our model, we split our data into two sections: training data and test data. We can then evaluate the model on the test data and make improvements such as:

- Changing the hyper-parameters: α, λ (explained in ep4.2 and ep5)
- Adjusting the number of features/variables in our model
- Changing the number of layers in a neural network
The process of using test data to evaluate our model is called cross-validation.
But how exactly do we split our data into these two sections?
Train-Test Split
K-Fold Cross Validation
The train-test split method is where we split our data, usually in the ratio 80:20, between training and test data.
This method is best suited when we are working with a lot of data, often with 1,000+ rows of data.
When working with 100,000+ rows of data we can use a ratio of 90:10, and with 1,000,000+ rows, 99:1.
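As a sketch, this split can be done with scikit-learn's `train_test_split`; the feature matrix `X` and target `y` below are made-up illustrative data, not from the article:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 rows, 3 features (values are hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# 80:20 train-test split; rows are shuffled before splitting so the
# test set is a random sample rather than the last 20% of rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 3) (200, 3)
```

For the larger data sets mentioned above, the same call with `test_size=0.1` or `test_size=0.01` gives the 90:10 and 99:1 ratios.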
In general, when working with more data, we can use a smaller percentage of test data, since we have sufficient training data to build a reasonably accurate model.

We can then look at the cost function or mean squared error of our test data:
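The test-set mean squared error referred to here can be written as follows (a standard formulation, with h denoting the model's hypothesis; the original figure showing this equation did not survive extraction):

```latex
\mathrm{MSE}_{\text{test}}
  = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}}
    \left( h\!\left(x^{(i)}_{\text{test}}\right) - y^{(i)}_{\text{test}} \right)^{2}
```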
m_test denotes the number of examples in our test data, which is 4 in this case.
This allows us to evaluate the performance of our model and make adjustments accordingly.
There are many evaluation metrics, such as MSE, RMSE and others.
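As a minimal sketch of computing these metrics with scikit-learn (the target and prediction arrays are made-up examples, with m_test = 4 as above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical test-set targets and model predictions (m_test = 4)
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

mse = mean_squared_error(y_test, y_pred)  # mean of squared residuals
rmse = np.sqrt(mse)                       # same units as the target

print(mse, rmse)  # 0.25 0.5
```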
The train-test split method has some drawbacks:

- It often leads to high-bias models when working on small data sets
- There is a possibility of selecting test data with similar values (non-random), resulting in an inaccurate evaluation of model performance

The k-fold cross validation method works by first splitting our data into k folds, each usually consisting of around 10-20% of our data.
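A sketch of how such folds can be produced with scikit-learn's `KFold` (the data here is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 illustrative rows

# 5 folds of 2 rows each; shuffling avoids folds made up of
# similar, adjacent values
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {i}: train={len(train_idx)} rows, test={len(test_idx)} rows")
```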
Here we have split our data into 5 (k) folds.
We then use 4 (k-1) folds as our training data to build our model and the remaining 1 fold as our test (also known as cross-validation) data.
We then calculate the mean squared error or cost function of our test data.
We repeat the above process:

- Using a different fold as our test data
- Using the remaining 4 (k-1) folds as our training data to build our model
- Calculating the mean squared error for each test fold

Lastly, we average the mean squared error or cost function calculated for each fold to give an overall performance metric for our model.
As we are using different training folds to construct our model upon each iteration, the parameters produced in each model may differ slightly. We also average the model parameters generated in each case to produce a final model.
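The whole procedure above can be sketched as follows, assuming scikit-learn and a simple linear regression; averaging both the per-fold MSE and the fitted coefficients, as the article describes (the data is illustrative, with a known linear relationship):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative data: y ≈ 2*x1 - 1*x2 plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_mses, fold_coefs = [], []

for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_mses.append(mean_squared_error(y[test_idx], y_pred))
    fold_coefs.append(model.coef_)

# Average the per-fold errors for an overall performance metric,
# and average the per-fold parameters for a final model
print("average MSE:", np.mean(fold_mses))
print("average coefficients:", np.mean(fold_coefs, axis=0))
```

Note that in practice a final model is more commonly refit on all the data once the evaluation is done; averaging coefficients is the procedure described in this article.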
Translated from: https://medium.com/ai-in-plain-english/cross-validation-explained-6496e68d62a7