Using Regression with Correlated Data
While regression models are easy to run given their short, simple syntax, this accessibility also makes it easy to use regression inappropriately. These models have several key assumptions that need to be met in order for their output to be valid, but your code will typically run whether or not these assumptions have been met.
For linear regression (used with a continuous outcome), these assumptions are as follows:
Independence: All observations are independent of each other, residuals are uncorrelated
Linearity: The relationship between X and Y is linear
Homoscedasticity: Constant variance of residuals at different values of X
Normality: Data should be normally distributed around the regression line
For logistic regression (used with a binary or ordinal categorical outcome), these assumptions are as follows:
Independence: All observations are independent of each other, residuals are uncorrelated
Linearity in the logit: The relationship between X and the logit of Y is linear
Model is correctly specified, including lack of multicollinearity
In both kinds of simple regression models, independent observations are absolutely necessary to fit a valid model. If your data points are correlated, this assumption of independence is violated. Fortunately, there are still ways to produce a valid regression model with correlated data.
Correlation in data occurs primarily through multiple measurements (e.g. two measurements are taken on each participant 1 week apart, and data points within individuals are not independent) or if there is clustering in the data (e.g. a survey is conducted among students attending different schools, and data points from students within a given school are not independent).
The result is that the outcome has been measured at the level of an individual observation, but there is a second level, either an individual (in the case of multiple time points) or a cluster, within which data points can be correlated. Ignoring this correlation means that standard errors cannot be accurately computed, and in most cases they will be artificially low.
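To make the "artificially low" point concrete, here is a small simulated sketch in base R. It is not based on either study dataset; the number of clusters, the cluster size, and the effect sizes are all made-up assumptions. The predictor is shared within each cluster, so a model that treats every row as independent can be compared with a model fitted to one (independent) row of means per cluster.

```r
set.seed(42)

n_clusters <- 50    # hypothetical number of clusters (e.g. schools)
cluster_size <- 20  # hypothetical number of observations per cluster

# Cluster-level predictor plus a shared cluster effect that induces correlation
x_cluster <- rnorm(n_clusters)
u_cluster <- rnorm(n_clusters, sd = 1)

cluster <- rep(seq_len(n_clusters), each = cluster_size)
x <- x_cluster[cluster]
y <- 0.5 * x + u_cluster[cluster] + rnorm(n_clusters * cluster_size)

# Naive model: treats all 1,000 rows as independent observations
naive_se <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]

# Cluster-means model: one row per cluster, so rows really are independent
y_bar <- as.vector(tapply(y, cluster, mean))
means_se <- summary(lm(y_bar ~ x_cluster))$coefficients["x_cluster", "Std. Error"]

c(naive = naive_se, cluster_means = means_se)
```

In this setup the naive standard error should come out noticeably smaller than the one from the cluster-means model, because the naive model counts correlated rows as if each carried fully independent information.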
The best way to know if your data is correlated is simply through familiarity with your data and the collection process that produced it. If you know that you have repeated measures from the same individuals or have data on participants who can be grouped into families or schools, you can assume that your data points are probably not independent. You can also investigate your data for possible correlation by calculating the ICC (intraclass correlation coefficient) to determine how correlated data points are within possible groups, or by looking for correlation in your residuals.
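As a rough illustration of the ICC check mentioned above, the sketch below estimates a one-way ICC from ANOVA mean squares on simulated, balanced data. The data, group count, and group size are assumptions made for the example, and the formula used is the common one-way estimator for equal group sizes; dedicated packages offer more general versions.

```r
set.seed(1)

n_groups <- 40  # hypothetical number of groups (e.g. families or schools)
k <- 5          # hypothetical observations per group (balanced design)

group <- factor(rep(seq_len(n_groups), each = k))
group_effect <- rnorm(n_groups, sd = 1)

# Equal between-group and within-group variance, so the true ICC is 0.5
y <- group_effect[as.integer(group)] + rnorm(n_groups * k, sd = 1)

# One-way ANOVA: first mean square is between groups, second is within groups
fit <- aov(y ~ group)
mean_sq <- summary(fit)[[1]][["Mean Sq"]]
msb <- mean_sq[1]
msw <- mean_sq[2]

# One-way ICC estimator for balanced groups of size k
icc <- (msb - msw) / (msb + (k - 1) * msw)
icc
```

An ICC near 0 suggests observations within a group are no more alike than observations from different groups, while values well above 0 indicate clustering that a simple regression would ignore.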
As previously mentioned, simple regression will produce inaccurate standard errors with correlated data and therefore should not be used.
Instead, you want to use models that can account for the correlation that is present in your data. If the correlation is due to some grouping variable (e.g. school) or repeated measures over time, then you can choose between Generalized Estimating Equations or Multilevel Models. These modeling techniques can handle either binary or continuous outcome variables, so can be used to replace either logistic or linear regression when the data are correlated.
Generalized estimating equations (GEE) will give you beta estimates that are the same as or similar to those produced by simple regression, but with appropriate standard errors. Generalized estimating equations are particularly useful when you have repeated measures for the same individuals or units. This modeling technique tends to work well when you have many small clusters, which is often the result of having a few measurements on a large number of participants. GEE also allows the user to specify one of numerous correlation structures, which can be a useful feature depending on your data.
Multilevel modeling (MLM) also provides appropriate standard errors when data points are not independent. It is typically the best modeling approach when the user is interested in relationships both within and between clustered groups, and is not simply looking to account for the effect of correlation in standard error estimates. MLM has the additional advantage of being able to handle more than two levels in the response variable. The primary drawback of multilevel models is that they require larger sample sizes within each cluster, so they may not work well when clusters are small.
Both GEE and MLM are fairly easy to use in R. Below, I will walk through examples with the two most common kinds of correlated data: data with repeated measures from individuals and data collected from individuals with an important grouping variable (in this case, country). I will fit simple regression, GEE, and MLM models with each dataset, and will discuss which modeling technique is best for these different data types.
The data that I will be working with first comes from Years 9 and 15 of the Princeton University Fragile Families & Child Wellbeing Study, which follows the families of selected children born between 1998 and 2000 in major US cities. Data are publicly available, and can be accessed by submitting a brief request on the Fragile Families Data and Documentation page. Since this study follows up with the same families year after year, data points from the same family units at different time points are not independent.
This dataset contains dozens of variables representing the health and wellbeing of participating children and their parents. Being in psychiatric epidemiology, I am primarily interested in examining the children’s mental wellbeing. Participating children are asked if they frequently feel sad, and I will be using answers to this “often feeling sad” question as my outcome. Since substance use is tied to poorer mental wellbeing among adolescents, I will be using variables representing alcohol and tobacco use as predictors*.
*Note: Models created in this article are for demonstration purposes only and should not be considered to be meaningful. I have not considered confounding, mediation, other model assumptions, or other possible data issues in the construction of these models.
First, let’s load the packages that we’ll be using. I’ve loaded “tidyverse” to clean our data, “haven” because the data we’ll be reading in comes in SAS format, “geepack” to run our GEE model, and “lme4” to run our multilevel model:

    library(tidyverse)
    library(haven)
    library(geepack)
    library(lme4)

Now let’s do some data cleaning to get these data ready for modeling!
Data from Years 9 and 15 are housed in separate SAS files (identifiable by the .sas7bdat extension), so we have one code chunk to read in and clean each file. This cleaning has to be done separately because variable names and coding differ slightly between study years (see the Data and Documentation page for codebooks).
There are hundreds of variables included in the datasets, so we first select those that will be used in our model and assign meaningful variable names that are consistent across data frames. Next, we filter the data to only include individuals with complete data for our variables of interest (the code below excludes individuals with missing data for these variables as well as those who refused to answer).
We then recode our variables in the standard 1 = “yes”, 0 = “no” format. For the “feel_sad” variable, this also means dichotomizing a variable with 4 levels which represent varying degrees of sadness. We end up with a binary variable where 1 = “sad” and 0 = “not sad.” Some regression techniques can handle multiple levels in your response variable (MLM included), but I have binarized it here for simplicity. Finally, we create a “time_order” variable indicating if the observation comes from the first or second round of the study.

    # Year 9: select the ID, outcome, and predictors, and give them meaningful names
    year_9 = read_sas("./data/FF_wave5_2020v2_SAS.sas7bdat") %>%
      select(idnum, k5g2g, k5f1l, k5f1j) %>%
      rename("feel_sad" = "k5g2g", "tobacco" = "k5f1l", "alcohol" = "k5f1j") %>%
      # Keep only valid responses (drops missing values and refusals)
      filter(
        tobacco == 1 | tobacco == 2,
        alcohol == 1 | alcohol == 2,
        feel_sad == 0 | feel_sad == 1 | feel_sad == 2 | feel_sad == 3
      ) %>%
      # Recode to 1 = "yes", 0 = "no" and mark the study round
      mutate(
        tobacco = ifelse(tobacco == 1, 1, 0),
        alcohol = ifelse(alcohol == 1, 1, 0),
        feel_sad = ifelse(feel_sad == 0, 0, 1),
        time_order = 1
      )

    # Year 15: same steps, with this year's variable names and coding
    year_15 = read_sas("./data/FF_wave6_2020v2_SAS.sas7bdat") %>%
      select(idnum, k6d2n, k6d40, k6d48) %>%
      rename("feel_sad" = "k6d2n", "tobacco" = "k6d40", "alcohol" = "k6d48") %>%
      filter(
        tobacco == 1 | tobacco == 2,
        alcohol == 1 | alcohol == 2,
        feel_sad == 1 | feel_sad == 2 | feel_sad == 3 | feel_sad == 4
      ) %>%
      mutate(
        tobacco = ifelse(tobacco == 1, 1, 0),
        alcohol = ifelse(alcohol == 1, 1, 0),
        feel_sad = ifelse(feel_sad == 4, 0, 1),
        time_order = 2
      )

We then combine data from Years 9 and 15 by stacking our two cleaned data frames using rbind(). The rbind() function works well here because both data frames now share all variable names. We next transform the “idnum” variable (which identifies unique family units) into a numeric variable so that it can be properly used to sort the data in the final code chunk. This step is necessary because the geeglm() function that we will be using to run the GEE model assumes that the data frame is sorted first by a unique identifier (in this case, “idnum”), and next by the order of observations (indicated here by the new “time_order” variable).
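
    # Stack both years and make the family ID numeric so it sorts correctly
    fragile_families = rbind(year_9, year_15) %>%
      mutate(
        idnum = as.numeric(idnum)
      )

    # Sort by family ID, then by study round, as geeglm() expects
    fragile_families = fragile_families[
      with(fragile_families, order(idnum, time_order)),
    ]

The above code produces a cleaned data frame with one row per family per study round and the columns idnum, feel_sad, tobacco, alcohol, and time_order, which is now ready to be used for regression modeling.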
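The stack-then-sort step can be tried on a toy example before running it on the real files. The sketch below uses two small made-up data frames (the IDs and values are invented, and only base R is used) to show what the sorted result should look like: each family's rows sit next to each other, ordered by study round.

```r
# Two hypothetical study rounds with character IDs in different row orders
wave_1 <- data.frame(idnum = c("0003", "0001", "0002"),
                     feel_sad = c(0, 1, 0),
                     time_order = 1)
wave_2 <- data.frame(idnum = c("0002", "0003", "0001"),
                     feel_sad = c(1, 0, 0),
                     time_order = 2)

# Stack the rounds and convert the ID to numeric for sorting
stacked <- rbind(wave_1, wave_2)
stacked$idnum <- as.numeric(stacked$idnum)

# Sort by ID first, then by study round within each ID
sorted <- stacked[order(stacked$idnum, stacked$time_order), ]
rownames(sorted) <- NULL
sorted
```

After sorting, the idnum column reads 1, 1, 2, 2, 3, 3 and time_order alternates 1, 2 within each ID, which is the layout the GEE fitting function relies on.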
Let’s fit our models:
1. Simple Logistic Regression
First, we use the glm() function to fit a simple logistic regression model using the “fragile_families” data. Since we have a binary outcome variable, “family = binomial” is used to specify that logistic regression should be used. We also use tidy() from the “broom” package to clean up the model output. We are creating this model for comparison purposes only — as indicated before, the independence assumption has been violated and the standard errors associated with this model will not be valid!

    glm(formula = feel_sad ~ tobacco + alcohol,
        family = binomial,
        data = fragile_families) %>%
      broom::tidy()

The output of the above code is the baseline that the subsequent modeling approaches will be compared to. Tobacco and alcohol use both appear to be significant predictors of sadness in participating children.
2. Generalized Estimating Equations
The syntax used to specify a GEE model using the geeglm() function from the “geepack” package is fairly similar to that used with the standard glm() function. The “formula,” “family,” and “data” arguments are exactly the same for both functions. What’s new are the “id,” “waves,” and “corstr” arguments (see package documentation for all available arguments). The unique identifier that links observations from the same subject is specified in the “id” argument. In this case the ID is “idnum,” the unique identifier assigned to each family participating in the study. The “time_order” variable created during data cleaning comes into play in the “waves” argument, where it indicates the order in which observations were made. Finally, “corstr” can be used to specify the within-subject correlation structure. “Independence” is actually the default input for this argument, and it makes sense in this context because it is useful when clusters are small. However, “exchangeable” can be specified when all observations within a subject can be considered to be equally correlated, and “ar1” is best when the correlation between observations weakens as they get further apart in time. Information on choosing the right correlation structure can be found here and here.
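To give a feel for what the three “corstr” options assume, the sketch below builds the corresponding working correlation matrices by hand in base R for a hypothetical subject with three observations. The cluster size and the correlation parameter alpha are made-up values for illustration; in a real GEE fit the parameter is estimated from the data rather than supplied.

```r
cluster_size <- 3  # hypothetical number of observations per subject
alpha <- 0.4       # hypothetical correlation parameter

# Independence: observations within a subject are treated as uncorrelated
independence <- diag(cluster_size)

# Exchangeable: every pair of observations shares the same correlation
exchangeable <- matrix(alpha, nrow = cluster_size, ncol = cluster_size)
diag(exchangeable) <- 1

# AR(1): correlation decays as alpha^|time gap| between observations
time_gap <- abs(outer(seq_len(cluster_size), seq_len(cluster_size), "-"))
ar1 <- alpha^time_gap

list(independence = independence, exchangeable = exchangeable, ar1 = ar1)
```

With only two observations per subject, as in the Fragile Families data, the exchangeable and AR(1) matrices coincide, since there is a single within-subject pair.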

    geeglm(formula = feel_sad ~ tobacco + alcohol,
           family = binomial,
           id = idnum,
           data = fragile_families,
           waves = time_order,
           corstr = "independence") %>%
      broom::tidy()

Our GEE model returns a table of estimates, standard errors, test statistics, and p-values.
As you can see, our beta estimates are exactly the same as those produced using glm(), but standard error differs slightly now that the correlations in the data have been accounted for. While tobacco and alcohol are still significant predictors of sadness, the p-values are somewhat different**. If these p-values were closer to 0.05, having accurate standard error measurements could easily push a p-value over or under the level of significance.
**Note: The test statistics for GEE and logistic regression look drastically different, but this is only because the test statistic provided in the logistic regression output is a Z-statistic and the test statistic provided in the GEE output is a Wald statistic. The Z-statistic is calculated by dividing the estimate by the standard error, while the Wald statistic is calculated by squaring the result of dividing the estimate by the standard error. The two values are therefore mathematically related, and by taking the square root of the values in the GEE “statistic” column you will see a much more moderate change from the initial Z-statistics.
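The relationship between the two test statistics can be checked with a few lines of base R. The estimate and standard error below are made-up numbers, not values from the models in this article; the point is only that the Wald statistic is the squared Z-statistic and that both lead to the same two-sided p-value.

```r
estimate <- 0.50   # hypothetical coefficient estimate
std_error <- 0.20  # hypothetical standard error

# Z-statistic (as reported for glm) and Wald statistic (as reported for geeglm)
z_stat <- estimate / std_error
wald_stat <- (estimate / std_error)^2

# Two-sided p-value from the normal and from the chi-squared (1 df) distribution
p_from_z <- 2 * pnorm(-abs(z_stat))
p_from_wald <- pchisq(wald_stat, df = 1, lower.tail = FALSE)

c(z = z_stat, wald = wald_stat, sqrt_wald = sqrt(wald_stat))
c(p_from_z = p_from_z, p_from_wald = p_from_wald)
```

Taking the square root of a reported Wald statistic therefore recovers the absolute value of the Z-statistic, which puts the GEE and logistic regression outputs on the same scale.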
With the geeglm() function, it is also important to verify that your clusters have been properly recognized. You can do this by running the above code without the broom::tidy() step, so:

    geeglm(formula = feel_sad ~ tobacco + alcohol,
           family = binomial,
           id = idnum,
           data = fragile_families,
           waves = time_order,
           corstr = "independence")

This code prints the full model object rather than the tidied coefficient table. You want to look to the last line of the output, where “Number of clusters” and “Maximum cluster size” are described. We had 2 observations for several thousand individuals, so these values make sense in the context of our data and indicate that clusters were registered correctly by the function. If, however, the number of clusters is equal to the number of rows in your dataset, something is not working properly (most likely the sorting of your data is off).
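It can help to work out the cluster count and maximum cluster size you expect before comparing them with the model output. The sketch below does this on a small made-up data frame in base R; the column names mirror the ones used above, but the values are invented. It also checks that each ID occupies one contiguous block of rows, which is what the sorting step is meant to guarantee.

```r
# Hypothetical sorted data: families 1 and 2 have two rounds, family 3 has one
toy <- data.frame(idnum = c(1, 1, 2, 2, 3),
                  time_order = c(1, 2, 1, 2, 1))

# Expected number of clusters and largest cluster
cluster_sizes <- table(toy$idnum)
n_clusters <- length(cluster_sizes)
max_cluster_size <- max(cluster_sizes)

# Each ID should appear as a single run of consecutive rows
contiguous <- !any(duplicated(rle(toy$idnum)$values))

c(n_clusters = n_clusters, max_cluster_size = max_cluster_size)
contiguous
```

If the numbers reported by the fitted model differ from these expectations, the ordering of the rows is the first thing to re-check.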
3. Multilevel Modeling
Next, let’s fit a multilevel model using glmer() from the lme4 package. Again, the required code is almost identical to that used for logistic regression. The only required change is specifying the random effects in the formula argument. This is done with the “(1 | idnum)” bit of code, which follows the following structure: (random-effect terms | grouping variable). The grouping variable, in this case “idnum,” is specified to the right of the |, and the “1” to the left stands for the intercept, so each family gets its own random intercept. Because no predictor appears to the left of the |, there are no random slopes, meaning that we don’t want the predictors’ effects to vary across groups. A useful blog post by Rense Nieuwenhuis provides various examples of this glmer() syntax.
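For reference, the two most common forms of this formula syntax are sketched below as plain R formula objects, which can be created without loading lme4. The random-slope version is a hypothetical variant shown only to illustrate the syntax; it is not fitted anywhere in this article, and with two observations per family it would likely be difficult to estimate.

```r
# Random intercept only: each family (idnum) gets its own baseline
random_intercept <- feel_sad ~ tobacco + alcohol + (1 | idnum)

# Hypothetical variant: random intercept plus a family-specific tobacco effect
random_slope <- feel_sad ~ tobacco + alcohol + (1 + tobacco | idnum)

# Variables referenced by the random-intercept formula
all.vars(random_intercept)
```

Either formula object could then be passed to glmer() through its formula argument, together with the data and family arguments.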
The lme4 package is not compatible with the broom package, so instead we pull the model’s coefficients after creating a list with a summary of the model’s output.

    # Fit the multilevel model and keep its summary
    mlm = summary(glmer(formula = feel_sad ~ tobacco + alcohol + (1 | idnum),
                        data = fragile_families,
                        family = binomial))

    # Coefficient table: estimates, standard errors, Z-values, p-values
    mlm$coefficients

Again, the output is similar to that of the simple logistic regression model, and both tobacco and alcohol use are still significant predictors of sadness. Estimates vary slightly from those produced using the glm() and geeglm() functions because groupings in the data are no longer ignored or treated as an annoyance to be addressed by correcting standard error; instead, they are now incorporated as an important part of the model. Standard error estimates are higher for all estimates in comparison to those produced through logistic regression, and Z- and p-values remain similar but reflect these important changes in the estimate and standard error values.
The second dataset that we will walk through comes from the WHO’s Global School-Based Student Health Survey (GSHS). This survey is conducted among schoolchildren aged 13–17 with the goals of helping countries to determine health priorities, establishing the prevalences of health-related behaviors, and facilitating direct comparison of these prevalences across nations. We will be using data from two countries, Indonesia and Bangladesh, which can be downloaded directly from these countries’ respective descriptive pages.
The data are cross-sectional: an identical survey was conducted one time among schoolchildren in both nations. I am interested in using variables from this dataset to describe the relationship between whether or not a child has friends, whether or not the child is bullied (my predictors) and whether or not the child has seriously contemplated suicide (my outcome). It is likely that these relationships differ between the two countries and that children are more similar to other children from the same country. Therefore, knowing whether a child is from Indonesia or Bangladesh provides important information about that child’s responses and the assumption of independent observations is violated.
Let’s load packages again:

    library(tidyverse)
    library(haven)
    library(lme4)
    library(gee)

Note that the “geepack” package has been replaced with the “gee” package. The “gee” package is easier to use (in my opinion) with data that is clustered by a grouping variable such as country rather than within an individual who has multiple observations.
Next, let’s load in the data (which is also in SAS format, so we use the “haven” package again) and conduct some basic cleaning. Data cleaning here follows a similar structure to the procedure used with the Fragile Families & Child Wellbeing Study data: important variables are selected and assigned meaningful, consistent names, and a new variable is created to indicate which cluster an observation belongs to (in this case the new “country” variable).

    # Indonesia: select and rename the variables of interest, then tag the country
    indonesia = read_sas("./data/IOH2007_public_use.sas7bdat") %>%
      select(q21, q25, q27) %>%
      rename(
        "bullied" = "q21",
        "suicidal_thoughts" = "q25",
        "friends" = "q27"
      ) %>%
      mutate(
        country = 1
      )

    # Bangladesh: same steps, with this survey's question numbers
    bangladesh = read_sas("./data/bdh2014_public_use.sas7bdat") %>%
      select(q20, q24, q27) %>%
      rename(
        "bullied" = "q20",
        "suicidal_thoughts" = "q24",
        "friends" = "q27"
      ) %>%
      mutate(
        country = 2
      )

Again, the two data frames are stacked together. Since variables were coded consistently during collection in both countries, some cleaning can be conducted only once using this combined dataset. Missing data is eliminated, and all variables are converted from string format to numeric. Finally, variables are mutated into a consistent, binarized format.

    survey = rbind(indonesia, bangladesh) %>%
      mutate(
        # Convert survey responses from strings to numbers
        suicidal_thoughts = as.numeric(suicidal_thoughts),
        friends = as.numeric(friends),
        bullied = as.numeric(bullied),
        # Recode each variable into a binary 0/1 format
        suicidal_thoughts = ifelse(suicidal_thoughts == 1, 1, 0),
        friends = ifelse(friends == 1, 0, 1),
        bullied = ifelse(bullied == 1, 0, 1)
      ) %>%
      # Drop rows with missing values
      drop_na()

Our cleaned data frame now has one row per student and the columns bullied, suicidal_thoughts, friends, and country.
Let’s fit our models:
1. Simple Logistic Regression
With the exception of variable names and the data specified, the glm() code remains identical to that used with the Fragile Families study data.

    glm(formula = suicidal_thoughts ~ bullied + friends,
        family = binomial,
        data = survey) %>%
      broom::tidy()

Unsurprisingly, whether or not a child has friends and whether or not a child is bullied are both significant predictors of the presence of suicidal thoughts in this sample.
2. Generalized Estimating Equations
The gee() function in the gee package allows us to easily use GEE with our survey data. This function is a better fit than the previously used geeglm() function as data are not correlated over time, but rather by a separate variable that can be indicated with the “id” argument (in this case, “country”). The formula and family arguments remain identical to those used with the glm() function, and the “corstr” argument used with the geeglm() function is the same here as well. However, unlike the geepack package, the gee package is not compatible with the broom::tidy() function so output is viewed using the summary() function instead.

    # Store the fit under a name that does not mask the gee() function
    gee_fit = gee(suicidal_thoughts ~ bullied + friends,
                  data = survey,
                  id = country,
                  family = binomial,
                  corstr = "exchangeable")

    summary(gee_fit)

One of the reasons that I particularly like the gee() function is that the naive standard error and Z-test statistics are actually included in the output. Naive means that these values are produced by regression where clustering is not accounted for; you’ll see that these are exactly the same as those produced by the glm() function above. You’ll notice drastic changes in the standard errors and Z-test statistics produced using GEE (“Robust”), although both of our predictors remain significant. It appears that accounting for within-country correlation has produced much lower standard errors.
3. Multilevel Modeling***
***Note: As noted above, models are for demonstration purposes only and are not necessarily valid. In this case, we would want more groups than two for our MLM model (meaning data from additional countries). If you are really only using two groups with MLM models, you should consider a small sample size correction.
Finally, we try MLM with the survey dataset. The code is exactly the same as that used with the Fragile Families study data, but with the new formula, grouping variable, and dataset specified.

    # Fit the multilevel model with a random intercept for country
    mlm = summary(glmer(formula = suicidal_thoughts ~ bullied + friends + (1 | country),
                        data = survey,
                        family = binomial))

    mlm$coefficients

Again, beta estimates and standard error estimates are now adjusted slightly from those produced using glm(). Z- and p-values associated with the “bullied” and “friends” variables are slightly smaller, although bullying and having friends remain significant predictors of suicidal thoughts.
Data from Princeton University’s Fragile Families & Child Wellbeing Study would be best represented using GEE. This is due to the maximum cluster size of 2 observations, the fact that individual families have multiple data points over time, and the fact that we were more interested in accounting for grouping in the standard error estimates than actually assessing differences between families.
Multilevel modeling is most appropriate for data from the Global School-Based Student Health Survey (GSHS) because the data were collected cross-sectionally and can be divided into two large clusters. Additionally, the output could be further explored to determine both within- and between-group variances, and we might be interested in relationships both within and across countries.
How you account for violations of the independent observations assumption will depend on the structure of your data and your general knowledge of the data collection process, as well as whether or not you consider the correlation to be an annoyance to adjust for or something meaningful to explore.
In conclusion, regression is flexible and certain regression models can handle correlated data. However, it is always important to check the assumptions of a given technique and to make sure that your analytic strategy is appropriate for your data.
Translated from: https://towardsdatascience.com/using-regression-with-correlated-data-5845a2eed3d2