Author’s Note: This is the completion report for my appliedaicourse[1] capstone project. All work is original; feel free to use, expand upon, or disseminate it. [Numbers in brackets are citations to the sources listed in the references section.]
(a, b, c): Loading the data, converting categorical data into numerical form, and missing-value analysis.
(d, e): Data visualization and analysis of the data.
(f): Reducing the dimensionality of the data substantially by detecting multicollinearity with the Variance Inflation Factor (VIF), which reduces model complexity and the computational power required.
(g): Implementing the Gavish-Donoho method to find the optimal value of ‘k’, and plotting the singular-value curves to visualize the concept in practice.
(h): Finding the most important features with the RFECV and RFE methods.
(i): Adding new features using dimensionality-reduction techniques.
(j): Generating new features from the top features using two-way and three-way feature interactions.
Tuning various models to find the best hyperparameters, fitting models with those hyperparameters, analyzing how well the feature engineering worked, and comparing the final results of all the models.

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.
To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines.
Daimler is challenging Kagglers to tackle the Curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.
The motivation behind the problem is that an accurate model would be able to reduce the total time spent testing vehicles, by allowing cars with similar testing configurations to be run successively along different paths of the vehicle testing layout, as shown in the figure below.
Vehicle Testing Layout

Examples of custom features: 4WD, added air suspension, a head-up display, etc.
This problem is an example of a machine-learning / deep-learning regression task: predicting a continuous target variable (the duration of the test).
The data is downloaded from the Mercedes-Benz Greener Manufacturing Kaggle competition[2] and unzipped.
Thankfully this is not a big dataset, so it is added to Google Drive and unzipped directly in Google Colab, as sketched below.
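A minimal sketch of this step (the Drive path and archive name are illustrative assumptions):

```python
# Mount Google Drive inside the Colab session and unzip the competition files.
from google.colab import drive

drive.mount('/content/drive')

# The path below is illustrative; adjust it to wherever the zip was uploaded.
!unzip -o "/content/drive/My Drive/mercedes-benz-greener-manufacturing.zip" -d data/
```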
But if the dataset is too big, it is better to use CurlWget (a Chrome extension) to import the data.
This dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.
The ground truth is labeled ‘y’ and represents the time (in seconds) that the car took to pass testing for each variable.
Finding the optimal value of ‘k’ in TSVD, with reference[3] to the paper published by Gavish and Donoho[4], “The Optimal Hard Threshold for Singular Values is 4/√3”.
After converting the categorical columns into numerical form (one-hot encoding), we identify the encoded columns that are not common to the train and test data frames; a minimal sketch of this step follows below.

output: {'X0_aa', 'X0_ab', 'X0_ac', 'X0_q', 'X2_aa', 'X2_ar', 'X2_c', 'X2_l', 'X2_o', 'X5_u', 'y'}

These non-common columns arise because some category levels occur in only one of the two files (and the target 'y' exists only in train). We then align the data frames by taking an inner join on their columns:

output: (4209, 554) (4209, 554)
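A minimal sketch of this step; the file paths and frame names (train_df, test_df, and so on) are illustrative:

```python
import pandas as pd

train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# Convert the categorical columns (X0, X1, ...) into numerical form via one-hot encoding.
train_enc = pd.get_dummies(train_df)
test_enc = pd.get_dummies(test_df)

# Encoded columns that are not common to both frames: category levels seen in
# only one of the two files, plus the target 'y', which exists only in train.
print(set(train_enc.columns).symmetric_difference(set(test_enc.columns)))

# Keep the target aside, then align the frames on their common columns (inner join).
y = train_enc['y']
train_aligned, test_aligned = train_enc.drop(columns=['y']).align(
    test_enc, join='inner', axis=1)
print(train_aligned.shape, test_aligned.shape)
```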
Rows with missing data can be deleted, or they can be filled using the data imputation techniques mentioned in this link.[5]
In the case of multivariate analysis, if there is a large number of missing values, it can be better to drop those cases rather than to impute and replace them. On the other hand, in univariate analysis, imputation can decrease the amount of bias in the data if the values are missing at random.[6]
But our dataset doesn’t have any missing values.

First, let’s take only the class variable, plot it on the y-axis, and plot its reset indices on the x-axis.
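A minimal sketch of this plot; the target series y comes from the earlier sketch, and sorting the values before plotting is an assumption:

```python
import matplotlib.pyplot as plt

# Sort the test times and reset the index, so the x-axis is a continuous
# 0..n-1 range rather than the non-continuous ID column.
y_sorted = y.sort_values().reset_index(drop=True)

# Small marker size to limit overplotting on ~4200 points.
plt.scatter(y_sorted.index, y_sorted.values, s=2)
plt.xlabel('index (after reset)')
plt.ylabel('y, test time in seconds')
plt.show()
```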
We use the reset indices because the IDs are not continuous units. Overplotting is one of the most common problems in data visualization: when the dataset is big, the dots of a scatter plot tend to overlap, so we reduced the size of the dots to accommodate more of them per unit area.

From the resulting plot we can see that the class label (y, the test time) lies roughly on a line, apart from a small portion of points at the ends, and there is a single point whose time is above 250 seconds, which is an outlier. Because not all the class labels lie on a line, the R² metric won’t reach large values; it is very sensitive to outliers, since SS_res increases. The best possible R² value is 1.0.

Plotting the PDF, CDF, and BoxPlot of the class variable:
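A minimal sketch of these plots (using seaborn and matplotlib is an assumption):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# PDF (kernel density estimate) of the test time.
sns.kdeplot(y)
plt.title('PDF of y')
plt.show()

# Empirical CDF of the test time.
y_np = np.sort(y.values)
plt.plot(y_np, np.arange(1, len(y_np) + 1) / len(y_np))
plt.title('CDF of y')
plt.show()

# Box plot of the test time.
sns.boxplot(y=y)
plt.title('Box plot of y')
plt.show()
```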
From the PDF and CDF, we can see that:
Almost all data points have a class variable below 140, so the points having a class label of more than 140 can be considered outliers.

The box plot drawn for the class label shows the distribution of the data very clearly through the five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”), and here values larger than the “maximum” whisker can safely be considered outliers.

The outlier data points are dropped (a minimal sketch follows).
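A minimal sketch of dropping the outliers; the 140-second cutoff comes from the discussion above, and the frame names are assumptions carried over from the earlier sketches:

```python
# Keep only the rows whose test time is at or below the ~140 s cutoff.
mask = y <= 140
train_aligned = train_aligned[mask].reset_index(drop=True)
y = y[mask].reset_index(drop=True)
print(train_aligned.shape)
```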
Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.[7][8]
This means that one independent variable can be predicted from another independent variable in a regression model. This is a problem because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable. Multicollinearity may not affect the accuracy of the model much, but we might lose reliability in determining the effects of individual features on the model, which is a problem for interpretability. This is a bivariate analysis.

VIF determines the strength of the correlation between the independent variables. It is estimated by taking a variable and regressing it against every other variable; in other words, the VIF score of an independent variable represents how well that variable is explained by the other independent variables. Formally, VIF_i = 1 / (1 - R_i²), where R_i² is obtained by regressing the i-th feature on all the others.
VIF — Conclusion: 1 = no multicollinearity, 4–5 = moderate, 10 or greater = severe.
Generally, a VIF value greater than 10 is considered severe, whereas in our dataset we even have features with an infinite VIF value, and others with 3-digit values. We have dropped all the features that have a VIF score of infinity, excluding the top_20_features (the details of the top_20_features will be explained in the 7th section). A minimal sketch of the VIF computation follows.
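A minimal sketch of the VIF computation with statsmodels (the frame name is an assumption, and the loop is slow for hundreds of columns):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the others.
X = train_aligned.astype(float)
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif.sort_values('VIF', ascending=False).head(20))
```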
Why the Gavish-Donoho method? What does it explain? What are the proven conclusions derived from the paper?
Truncated SVD is a matrix factorization technique that factors a matrix W into three matrices U, S, and Vᵀ. Typically it is used to find the principal components of a matrix.
Truncated SVD is different from regular SVD. Given an n×n matrix, SVD will produce matrices with n columns, whereas Truncated SVD produces matrices with a specified number of columns.

We need to truncate the SVD because we want the matrix with the optimal number of columns ‘k’ that accommodates the maximum information. For example, if we take the rank ‘k’ = n, the maximum it can be, then the complete information is preserved (noise included), the accuracy might or might not improve, and the complexity of the model will be high. If we take the rank ‘k’ very low, information might be lost and the model might be less accurate, but it will not be too complex.

So, we need to find the sweet spot, the optimal ‘k’, where we retain most of the information in W without overfitting to noise or to small features we don’t care about.

This can be done in many ways by analyzing the singular values and looking for an elbow or knee, but those heuristics don’t work unless there is a sharp drop-off in the singular values. Hence, the Gavish-Donoho method is the best way to find the optimal rank ‘k’, given some assumptions on the data.

Our data X can be written as X = X_true + gamma * X_noise, the sum of the true low-rank data signal (X_true) and noise (X_noise) that is assumed to be normally distributed with zero mean and variance 1 (Gaussian noise); its contribution can be large or small depending on the magnitude of gamma.

In the singular-value plot described next, the orange curve corresponds to the Gaussian noise matrix and the green curve corresponds to our actual high-dimensional data. Gavish and Donoho realized that when the singular values from the SVD of high-dimensional data are plotted, the curve (the green one) looks like the curve of the singular values from the SVD of the best-fit Gaussian noise matrix, and at some point it deviates from it; that level is called the noise floor.
This noise floor separates the signal from the noise.
The first singular value that is larger than the biggest singular value of the noise matrix is the threshold, and values below it are truncated. The application of this method is explained below for the two possible cases.

Case 1: X is a square matrix and gamma is known.
Truncate all the singular values below the threshold (tau):

tau = (4/√3) · √n · gamma

where n is the dimension of the square matrix X and gamma is the (known) amount of noise.

Case 2: X is a rectangular matrix and gamma is unknown.
In this case, all we have are the measured singular values. Based on the median singular value and the aspect ratio of the rectangular matrix, we can infer the best-fit noise distribution.

Conclusion:
We have data that has structure and noise. Even if we don’t know how much noise was added, we can estimate it from the median singular value and then infer the optimal tau (threshold); truncating the singular values below tau gives the optimal rank ‘r’.

Code:

After applying all the preprocessing steps above, the data is stored in a pandas data frame with the name x_filtered.
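A minimal sketch of this step, following Case 2 (rectangular matrix, unknown noise) with the omega(beta) approximation from the paper[3][4]; x_filtered holding the preprocessed features is carried over from the text:

```python
import numpy as np
import matplotlib.pyplot as plt

X = x_filtered.values.astype(float)
n, m = X.shape

# Singular values of the data matrix and of a Gaussian noise matrix of the same shape.
sigma_data = np.linalg.svd(X, compute_uv=False)
sigma_noise = np.linalg.svd(np.random.randn(n, m), compute_uv=False)

# Optimal hard threshold for unknown noise: tau = omega(beta) * median singular value,
# with omega(beta) approximated as in Gavish-Donoho.
beta = min(n, m) / max(n, m)
omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
tau = omega * np.median(sigma_data)

# Singular-value curves and the noise floor.
plt.semilogy(sigma_data, label='data')
plt.semilogy(sigma_noise, label='Gaussian noise')
plt.axhline(tau, linestyle='--', label='tau')
plt.legend()
plt.show()

print('optimal k =', int(np.sum(sigma_data > tau)))
```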
From the code sketch above we get the singular values of the data matrix and of the Gaussian noise matrix. Plotting them, together with the horizontal line at y = tau, gives the curves described above. Hence, we have decided to take the value of ‘k’ as 2 for truncation.
sklearn’s RFECV automatically selects an optimal number of important features, and RFE returns the top n features according to our demand.
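A minimal sketch of the RFECV step; the estimator settings and cross-validation choices are illustrative, and x_filtered / y come from the earlier steps:

```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor

# RFECV recursively removes the weakest features and uses cross-validation
# to pick the number of features that maximizes the R^2 score.
selector = RFECV(RandomForestRegressor(n_estimators=100, random_state=42),
                 step=1, cv=5, scoring='r2')
selector.fit(x_filtered, y)
print(x_filtered.columns[selector.support_])
```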
output: Index([‘X314’], dtype=’object’)
RFECV using RandomForestRegressor with the best parameters, which are obtained by tuning it on the dataset.

output: Index(['X29', 'X314', 'X315'], dtype='object')
RFECV using the default XGBRegressor.

output: Index(['X314'], dtype='object')
RFECV using DecisionTreeRegressor with the best max_depth, which is found by tuning the model on the dataset.

From the output of the above three cells, we have learned that X314, X315, and X29 are the most important features, and that X314 is more important than X315 and X29.

Using recursive feature elimination (RFE), we will find the top 20 important features and perform bivariate analysis on them.

output: Index(['ID', 'X29', 'X48', 'X54', 'X64', 'X76', 'X118', 'X119', 'X127', 'X136', 'X189', 'X232', 'X263', 'X279', 'X311', 'X314', 'X315', 'X1_aa', 'X6_g', 'X6_j'], dtype='object')
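A minimal sketch of the RFE step (the estimator settings are illustrative):

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# RFE keeps exactly the requested number of features, here the top 20.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=42),
          n_features_to_select=20)
rfe.fit(x_filtered, y)
top_20_features = x_filtered.columns[rfe.support_]
print(top_20_features)
```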
RFE using RandomForestRegressor to output the top_20_features.

This set of top_20_features is a superset of the important features obtained by RFECV.
TSVD:
As found from the Gavish-Donoho method, we are using 2 components of Truncated SVD.
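A minimal sketch (the frame name x_filtered is carried over from the earlier steps):

```python
from sklearn.decomposition import TruncatedSVD

# k = 2 components, as suggested by the Gavish-Donoho threshold above.
tsvd = TruncatedSVD(n_components=2, random_state=42)
tsvd_train = tsvd.fit_transform(x_filtered)
print(tsvd_train.shape)
```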
output: (4194, 2)
We also generate 2 features each from PCA and ICA, feature-reduction techniques available in sklearn.decomposition, and see whether they are useful or not.

PCA:
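A minimal sketch (same assumptions as above):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
pca_train = pca.fit_transform(x_filtered)
print(pca_train.shape)
```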
output: (4194, 2)
ICA:
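A minimal sketch (same assumptions as above):

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=2, random_state=42)
ica_train = ica.fit_transform(x_filtered)
print(ica_train.shape)
```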
output: (4194, 2)
Adding all the new features generated through the dimensionality-reduction techniques to the data frames (a minimal sketch follows):
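A minimal sketch (column names are illustrative; the same columns would be added to the test frame from the corresponding transform outputs):

```python
# Append the 2 TSVD, 2 PCA, and 2 ICA components as new columns.
for i in range(2):
    x_filtered[f'tsvd_{i}'] = tsvd_train[:, i]
    x_filtered[f'pca_{i}'] = pca_train[:, i]
    x_filtered[f'ica_{i}'] = ica_train[:, i]
print(x_filtered.shape)
```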
output: (4194, 127) (4209, 127)
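Next, the models are tuned. A minimal sketch of the hyperparameter search that produced the output below; using RandomizedSearchCV over a RandomForestRegressor is an assumption, and the grid values are illustrative:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    'n_estimators': [200, 500, 800],
    'max_depth': [10, 40, 70, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [10, 40, 70],
    'min_samples_split': [10, 110, 210],
    'bootstrap': [True, False],
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_distributions=param_grid,
                            n_iter=50, cv=3, scoring='r2',
                            n_jobs=-1, random_state=42)
search.fit(x_filtered, y)
print(search.best_params_)
```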
output: {‘bootstrap’: True, ‘max_depth’: 70, ‘max_features’: ‘auto’, ‘min_samples_leaf’: 40, ‘min_samples_split’: 110, ‘n_estimators’: 500}
Printing the best parameters, we then initialize a model with the best hyperparameters and fit it to the dataset.

Plotting bar plots of the relative importance of this model’s features in predicting the class label, we can see that the feature ‘X314+X315’, generated by the two-way feature interaction, played an important role in predicting the class label (a sketch of how such interaction features can be generated is shown below).

As with RandomForestRegressor, the same procedure is carried out for the other models below; check out the code in my GitHub.
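For reference, a minimal sketch of the two-way and three-way feature-interaction step (j); top_20_features holds the RFE output, and combining the binary top features by addition (matching the ‘X314+X315’ naming in the importance plot) is an assumption:

```python
from itertools import combinations

# Two-way interactions: pairwise sums of the top features.
for f1, f2 in combinations(top_20_features, 2):
    x_filtered[f'{f1}+{f2}'] = x_filtered[f1] + x_filtered[f2]

# Three-way interactions: sums of triples of the top features
# (in practice only a subset may be kept, to limit the number of new columns).
for f1, f2, f3 in combinations(top_20_features, 3):
    x_filtered[f'{f1}+{f2}+{f3}'] = x_filtered[f1] + x_filtered[f2] + x_filtered[f3]
```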
References:

[1] https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
[2] Mercedes-Benz Greener Manufacturing, Kaggle: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data
[3] Gavish, M. and Donoho, D., “The Optimal Hard Threshold for Singular Values is 4/√3”: https://arxiv.org/pdf/1305.5870.pdf
[4] https://ieeexplore.ieee.org/document/6846297
[5] https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
[6] https://web.stanford.edu/class/stats202/content/lec25-cond.pdf
[7] https://www.sigmamagic.com/blogs/what-is-variance-inflation-factor/
[8] https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
[9] http://www.pyrunner.com/weblog/2016/08/01/optimal-svht/