Linear Regression Assumptions
Linear regression is one of the first machine learning tools that aspiring data scientists learn to use. It’s easy to implement, its results are easy to interpret, and the math behind it is easy to understand. Moreover, linear regression is common in data science projects, either as a baseline model or as a part of Exploratory Data Analysis (EDA). It’s no wonder that a new article on linear regression is published seemingly every week.

However, while these articles discuss the math and assumptions behind linear regression, very few discuss what happens when you break these assumptions.

This article has three goals:
1. Explain the six assumptions of linear regression
2. Explain what happens when each of the six assumptions is broken
3. Explain what you can do when any of the six assumptions is broken
I assume a general understanding of linear regression and its assumptions. Therefore, the main emphasis of this article will be on the second and third goals.
Below is a simple regression model, where Y is the target variable, X is the independent variable, and epsilon is the error term (randomness not captured by the model):

Y = B0 + B1*X + epsilon
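As a minimal sketch (the coefficients and data here are made up for illustration), fitting this simple model by ordinary least squares recovers B0 and B1 from simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the population model Y = B0 + B1*X + epsilon
# with (made-up) true values B0 = 2 and B1 = 3
X = rng.uniform(0, 10, 500)
epsilon = rng.normal(0, 1, 500)   # randomness not captured by the model
Y = 2.0 + 3.0 * X + epsilon

# Estimate B0 and B1 by ordinary least squares
A = np.column_stack([np.ones_like(X), X])
b0_hat, b1_hat = np.linalg.lstsq(A, Y, rcond=None)[0]
print(b0_hat, b1_hat)   # close to 2 and 3
```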
The first assumption is that the model is linear in its parameters. What we mean by ‘linear in its parameters’ is that the population model may have a mathematical transformation (a square root, a logarithm, a quadratic) on the target variable or the independent variables, but not on the parameters. Thus, changes in our independent variables will have the same marginal effect regardless of their value.
The second assumption is random sampling. This assumption assures us that our sample is representative of the population. More specifically, it assures us that the sampling method does not affect the characteristics of our sample.
The third assumption is that there is no perfect collinearity: the independent variables do not share a perfect, linear relationship. They can be related in some fashion (indeed, we would not include variables in our regression model if they were entirely unrelated); however, we should not be able to write one variable as a linear combination of the other variables.
The fourth assumption is the zero conditional mean: the error term, epsilon, conditional on the independent variables, equals zero on average. That is, the error term is unrelated to our independent variables.
The fifth assumption is homoskedasticity. In statistics parlance, homoskedasticity is when the variance of a random variable is constant. Linear regression assumes that the error term, conditional on the independent variables, is homoskedastic. That is, Var(epsilon | X) = sigma^2, a constant.
The sixth assumption is normality: the error term is normally distributed with zero mean and a constant variance.
What happens if the linearity assumption is broken? Suppose the true model was of the form

Y = B0 + B1*X^2 + epsilon

(a quadratic, for example), but we estimated a linear model. Our parameter estimates would be biased, and our model would make poor predictions.
There are two ways to determine if you should use a linear function or a non-linear function to model the relationship in the population.

You could create a scatter plot between the two variables and see if the relationship between them is linear or non-linear. You can then compare the performance between a linear regression and a non-linear regression, and choose the function that performs best.

A second method is to fit the data with a linear regression, and then plot the residuals. If there is no obvious pattern in the residual plot, then the linear regression was likely the correct model. However, if the residuals look non-random, then perhaps a non-linear regression would be the better choice.
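The second method can be sketched as follows (using simulated data where the true relationship is quadratic, so the residuals of a straight-line fit show an obvious pattern):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 500)
Y = 1.0 + 0.5 * X**2 + rng.normal(0, 1, 500)   # true relationship is quadratic

# Fit a straight line, then inspect the residuals
A = np.column_stack([np.ones_like(X), X])
coefs, *_ = np.linalg.lstsq(A, Y, rcond=None)
resid = Y - A @ coefs

# A clear non-random pattern: residuals are positive at the edges
# of the X range and negative in the middle, so a straight line
# is the wrong functional form
print(resid[X < 2].mean())              # positive
print(resid[(X > 4) & (X < 6)].mean())  # negative
```

In practice you would plot `resid` against `X` and look for curvature, rather than compute summaries like these.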
What happens if the random sampling assumption is broken? Say that you want to test if lockdowns have impacted consumer spending decisions. You happened to have a survey of online shoppers prior to the pandemic, and you decide to take a second survey of online shoppers three months into the pandemic.

This is an example of non-representative sampling. Not everyone likes to shop online, and people who do shop online may have traits that are absent from people who refuse to shop online. As a researcher, it is impossible for you to know if it is these unobserved traits, or if it is the lockdowns, that changed consumer spending decisions.

A better researcher would have selected a random group of n people prior to the lockdowns, and then selected another random group of n people several months into the lockdown. If the selection process was perfectly random, then the two groups would be more or less identical. Thus, you could be certain that the pandemic resulted in changes in their spending.

Of course, perfect random sampling is impossible. Thus, it is good practice to do some EDA prior to building a regression model to confirm that the two groups are not drastically different. (There are also different types of regression techniques you can use to simulate randomness; I may discuss these in a separate article.)

(Note: governments excel at creating randomized data sets. For example, Canada’s Labour Force Survey (LFS), which is widely used by labor economists, selects a random group of people. If you are seeking a random sample, then seek out data sets from government organizations.)
What happens if the no-perfect-collinearity assumption is broken? Suppose that our model contains three independent variables, X1, X2, and X3, and that

X3 = X1 + X2

In that case, our third variable is a linear combination of the first two variables.

Recall that multiple linear regression estimates the effect of one variable by holding all other variables constant. However, this all-else-equal assumption is impossible in the above regression model. If we change one variable, the first variable, for example, then that changes the third variable. Similarly, if we change the third variable, then that changes the first variable or the second variable (or both).

The solution to perfect collinearity is to drop one of the variables (in the above example, we would drop the third variable).
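A quick numeric sketch of why this fails (the variables here are simulated for illustration): with X3 = X1 + X2, the design matrix is rank-deficient, so OLS has no unique solution, and dropping X3 restores full rank:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = X1 + X2                       # a perfect linear combination of X1 and X2

# Design matrix with an intercept column: 4 columns, but only rank 3,
# so OLS cannot identify a unique coefficient for each variable
A = np.column_stack([np.ones(n), X1, X2, X3])
print(np.linalg.matrix_rank(A))    # 3

# Dropping the redundant variable gives a full-rank design matrix
A_fixed = np.column_stack([np.ones(n), X1, X2])
print(np.linalg.matrix_rank(A_fixed))   # 3, equal to its number of columns
```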
What happens if the zero conditional mean assumption is broken? This occurs if our regression model differs from the true model. For example, we might think that the true model is

Y = B0 + B1*X1 + epsilon

when the true model is, in fact,

Y = B0 + B1*X1 + B2*X2 + B3*X3 + epsilon

The exclusion of the second and third independent variables causes omitted variable bias. Our slope estimate, B1, will either be larger or smaller, on average, than the true value of B1.
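A small simulation (with made-up coefficients) shows the bias. Here X2 is correlated with X1; omitting it inflates the estimate of B1 well above its true value of 2, while including it recovers the truth:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)            # X2 is correlated with X1
Y = 1.0 + 2.0 * X1 + 3.0 * X2 + rng.normal(size=n)

# Omitting X2: the X1 coefficient absorbs part of X2's effect
short = np.column_stack([np.ones(n), X1])
b1_biased = np.linalg.lstsq(short, Y, rcond=None)[0][1]

# Including X2: the X1 coefficient is close to the true value of 2
full = np.column_stack([np.ones(n), X1, X2])
b1_ok = np.linalg.lstsq(full, Y, rcond=None)[0][1]
print(b1_biased, b1_ok)   # roughly 4.4 vs roughly 2.0
```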
There are two solutions. First, if you know the variables that should be included in the true model, then you can add these variables to the model you are building. This is the best solution; however, it is also unrealistic, because we can never truly know what variables are in the true model.

The second solution is to conduct a Randomized Control Trial (RCT). In an RCT, the researcher randomly allocates participants to the treatment group or the control group. Because the treatment is given randomly, the relationship between the error term and the independent variables is equal to zero.

Of course, unless you are a PhD student, you likely do not have the funds to perform an RCT. Thankfully, there are methods to simulate an RCT (regression discontinuity, difference-in-differences, etc.), which I may discuss in a separate article.
What happens if the homoskedasticity assumption is broken? Heteroskedasticity causes the variance estimates of our parameters to be biased. Thus, the OLS standard errors are biased as well (since they are the square roots of those variances). Because our confidence intervals and hypothesis tests are based on the OLS standard errors, if the standard errors are biased, then our confidence intervals and hypothesis tests are inaccurate.

The solution is to use robust standard errors. Without going into the math behind them, robust standard errors correct the OLS standard errors so that they remain valid when the error term is heteroskedastic.
In statsmodels, you can specify robust standard errors as an argument in the fit method.
OLS(...).fit(cov_type='HC1')

What happens if the normality assumption is broken? Similar to what occurs if assumption five is violated, the results of our hypothesis tests and confidence intervals will be inaccurate.
One solution is to transform your target variable so that it becomes approximately normal. This can have the effect of making the errors normal as well. The log transform and the square root transform are most common. If you want to get fancy, then you can also use a Box-Cox transformation, which I discuss here.
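As a sketch with simulated right-skewed data (income-like values, made up for illustration), the log transform makes the target far closer to normal, as measured by sample skewness:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.lognormal(mean=3.0, sigma=0.8, size=2000)   # right-skewed target

def skewness(a):
    """Sample skewness: near zero for a symmetric (e.g. normal) sample."""
    d = a - a.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5

log_y = np.log(y)   # the log transform pulls in the long right tail
print(skewness(y))      # large and positive
print(skewness(log_y))  # near zero
```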
The above assumptions only hold true if we are working with cross-sectional data. Linear regression requires different assumptions if we have panel data or time series data.

Now you know the six assumptions of linear regression, the consequences of violating these assumptions, and what to do if these assumptions are violated. Now you can build more accurate linear regression models, and you can impress a recruiter with your newfound knowledge.
Translated from: https://towardsdatascience.com/what-happens-when-you-break-the-assumptions-of-linear-regression-f78f2fe90f3a