So far, when implementing all of our regression models in Python, we have been using all of our data to construct our model:
This, however, often leads to models that overfit our data, making it very difficult to evaluate the model and improve it.
To address this problem, before creating our model, we split our data into two sections: training data and test data. We can then evaluate the model on the test data and make improvements such as:

- Changing the hyper-parameters: α, λ (explained in ep4.2 and ep5)
- Adjusting the number of features/variables in our model
- Changing the number of layers in a neural network
The process of using test data to evaluate our model is called cross-validation.
But how exactly do we split our data into these two sections?
Train-Test Split
K-Fold Cross Validation
The train-test split method is where we split our data, usually in the ratio 80:20, between training and test data.
This method is best suited when we are working with a lot of data, often with 1,000+ rows of data.
When working with 100,000+ rows of data we can use a ratio of 90:10, and with 1,000,000+ rows, 99:1.
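As a sketch, this split can be done with scikit-learn's `train_test_split`; the feature matrix `X` and target `y` below are made-up illustrative data, not from the article:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 rows, 3 features (values are hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# 80:20 train-test split; rows are shuffled before splitting so the
# test set is a random sample rather than the last 20% of rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (800, 3) (200, 3)
```

For the larger data sets mentioned above, the same call with `test_size=0.1` or `test_size=0.01` gives the 90:10 and 99:1 ratios.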
In general, when working with more data, we can use a smaller percentage of test data, since we have sufficient training data to build a reasonably accurate model.

We can then look at the cost function or mean squared error of our test data:
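The test-set mean squared error referred to here can be written as follows (a standard formulation, with h denoting the model's hypothesis; the original figure showing this equation did not survive extraction):

```latex
\mathrm{MSE}_{\text{test}}
  = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}}
    \left( h\!\left(x^{(i)}_{\text{test}}\right) - y^{(i)}_{\text{test}} \right)^{2}
```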
m_test denotes the number of examples in our test data, which is 4 in this case.
This allows us to evaluate the performance of our model and make adjustments accordingly.
There are many evaluation metrics, such as MSE, RMSE and others.
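As a minimal sketch of computing these metrics with scikit-learn (the target and prediction arrays are made-up examples, with m_test = 4 as above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical test-set targets and model predictions (m_test = 4)
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

mse = mean_squared_error(y_test, y_pred)  # mean of squared residuals
rmse = np.sqrt(mse)                       # same units as the target

print(mse, rmse)  # 0.25 0.5
```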
The train-test split method has some drawbacks:

- It often leads to high-bias models when working on small data sets
- There is a possibility of selecting test data with similar values (non-random), resulting in an inaccurate evaluation of model performance

The k-fold cross validation method works by first splitting our data into k folds, each usually consisting of around 10-20% of our data.
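A sketch of how such folds can be produced with scikit-learn's `KFold` (the data here is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 illustrative rows

# 5 folds of 2 rows each; shuffling avoids folds made up of
# similar, adjacent values
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {i}: train={len(train_idx)} rows, test={len(test_idx)} rows")
```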
Here we have split our data into 5 (k) folds.
We then use 4 (k-1) folds as our training data to build our model and the remaining 1 fold as our test (also known as cross-validation) data.
We then calculate the mean squared error or cost function of our test data.
We repeat the above process:

- Using a different fold as our test data
- Using the remaining 4 (k-1) folds as our training data to build our model
- Calculating the mean squared error for each test fold

Lastly, we average the mean squared error or cost function calculated for each fold to give an overall performance metric for our model.
As we are using different training folds to construct our model upon each iteration, the parameters produced in each model may differ slightly. We also average the model parameters generated in each case to produce a final model.
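The whole procedure above can be sketched as follows, assuming scikit-learn and a simple linear regression; averaging both the per-fold MSE and the fitted coefficients, as the article describes (the data is illustrative, with a known linear relationship):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative data: y ≈ 2*x1 - 1*x2 plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_mses, fold_coefs = [], []

for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_mses.append(mean_squared_error(y[test_idx], y_pred))
    fold_coefs.append(model.coef_)

# Average the per-fold errors for an overall performance metric,
# and average the per-fold parameters for a final model
print("average MSE:", np.mean(fold_mses))
print("average coefficients:", np.mean(fold_coefs, axis=0))
```

Note that in practice a final model is more commonly refit on all the data once the evaluation is done; averaging coefficients is the procedure described in this article.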
Translated from: https://medium.com/ai-in-plain-english/cross-validation-explained-6496e68d62a7