    Maximum Likelihood (ML) vs REML

    Mathematical Statistics and Machine Learning for Life Sciences

    This is the nineteenth article in the column Mathematical Statistics and Machine Learning for Life Sciences, where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. This is the final article in the series dedicated to the Linear Mixed Model (LMM). Previously, we talked about How Linear Mixed Model Works and how to derive and program a Linear Mixed Model from Scratch in R using the Maximum Likelihood (ML) principle. Today we will discuss the concept of Restricted Maximum Likelihood (REML), why it is useful, and how to apply it to Linear Mixed Models.

    Biased Variance Estimator by Maximum Likelihood

    The idea of Restricted Maximum Likelihood (REML) comes from the realization that the variance estimator given by Maximum Likelihood (ML) is biased. What is an estimator, and in what way is it biased? An estimator is simply an approximation / estimate of model parameters. Assuming that the statistical observations follow a Normal distribution, there are two parameters to estimate if one wants to summarize the observations: μ (mean) and σ² (variance). It turns out that the variance estimator given by Maximum Likelihood (ML) is biased, i.e. the value we obtain from the ML model over- or under-estimates the true variance, see the figure below.

    Illustration of biased vs. unbiased estimators. Image by Author

    In practice, when we solve e.g. a Linear Regression model using ML, we rarely think about the bias in the variance estimator, since we are usually interested in the coefficients of the linear model, i.e. the mean, and often do not even realize that in parallel we estimate one more fitting parameter, the variance. In this case the variance is considered to be a so-called nuisance parameter that is not of primary interest.

    To demonstrate that ML indeed gives a biased variance estimator, consider a simple one-dimensional case with a variable y = (y1,y2,…,yN) following e.g. the Normal distribution.

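    For reference, the likelihood of such a sample and the estimators obtained by maximizing it, referred to as Eqs. (1) and (2) in what follows, read in standard form:

        \mathcal{L}(y \mid \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right) \qquad (1)

        \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{\mu})^2 \qquad (2)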

    Maximization of the likelihood, Eq. (1), leads to the estimators for the mean and variance, Eq. (2); for the derivation please check the notebook at my github. To show that the variance estimator in Eq. (2) is biased, we will derive the expected value of the variance estimator and prove that it is not equal to the true value of the variance, σ². For this purpose, we first rearrange the variance estimator, Eq. (2), by explicitly including the unknown true mean μ in the equation:

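    Expanding the square and using the fact that \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu) = \hat{\mu} - \mu, so that the cross term collapses, we get:

        \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \big( (y_i - \mu) - (\hat{\mu} - \mu) \big)^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2 - (\hat{\mu} - \mu)^2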

    Finally, let us compute the expected value of the variance estimator:

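    Using E[(y_i - \mu)^2] = \sigma^2 and E[(\hat{\mu} - \mu)^2] = \sigma^2 / N, the expectation of the rearranged estimator gives:

        E[\hat{\sigma}^2] = \frac{1}{N} \sum_{i=1}^{N} E[(y_i - \mu)^2] - E[(\hat{\mu} - \mu)^2] = \sigma^2 - \frac{\sigma^2}{N} = \frac{N-1}{N} \, \sigma^2 \qquad (8)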

    Here we can see that the expected value of the ML variance estimator is not equal to the true variance σ², although it approaches the true variance at large sample sizes. Thus, the variance estimator given by ML is biased downwards, i.e. it underestimates the true variance. When N >> 1, the bias seems to be negligible, until we realize that Eq. (8) was obtained for one-dimensional data. Working with high-dimensional data, which is typical for real-world problems, we can get severely biased estimates of the variance, because it can be derived (check for example the wonderful tutorial here) that for k-dimensional data the expected value of the variance estimator takes the form:

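    In standard form, for a model with k mean parameters fitted to N observations:

        E[\hat{\sigma}^2] = \frac{N-k}{N} \, \sigma^2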

    Therefore, the problem of underestimating the true variance by ML becomes especially acute when the number of dimensions, k, approaches the number of samples / statistical observations, N. We conclude that in high-dimensional space the Maximum Likelihood (ML) principle works only in the limit k << N, while biased results can be obtained when k ≈ N. This bias needs to be somehow taken into account, and this is exactly where REML comes into play.

    Linear Mixed Model Derived from REML

    The problem with the biased variance estimator by ML appears to be due to the fact that we used an unknown estimator of the mean when computing the variance estimator. Instead, if we make sure that the log-likelihood function does not contain any information about the mean, we can optimize it with respect to the variance components and get an unbiased variance estimator. This is essentially what Restricted Maximum Likelihood (REML) does. In this case the mean (not the variance, as for ML) is considered to be a nuisance parameter that should be somehow removed from the equation. A way to get rid of the information about the mean in the log-likelihood function is to compute a marginal probability, i.e. to integrate the likelihood over the mean. In the previous post, LMM from Scratch, we saw that for multivariate analysis working with high-dimensional data, the extension of Eq. (1) is given by the multivariate Gaussian distribution:

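    With X denoting the design matrix and β the vector of mean (fixed-effects) parameters:

        \mathcal{L}(y \mid \beta, \Sigma_y) = \frac{1}{\sqrt{(2\pi)^N |\Sigma_y|}} \exp\left( -\frac{1}{2} (y - X\beta)^T \Sigma_y^{-1} (y - X\beta) \right)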

    where Σy is the variance-covariance matrix, which is a generalization of the simple residual variance from Eq. (1). Here we are going to take the logarithm of the multivariate Gaussian distribution (the log-likelihood) and integrate out the mean, β, in order to get an unbiased estimate for the variance components. Therefore, we need to compute the following integral:

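    Writing out the logarithm of the marginal likelihood using the multivariate Gaussian above, the objective splits into three terms, the third one containing the integral over β:

        \log \mathcal{L}_{REML}(\Sigma_y) = -\frac{N}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma_y| + \log \int \exp\left( -\frac{1}{2} (y - X\beta)^T \Sigma_y^{-1} (y - X\beta) \right) \, d\beta \qquad (9)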

    To do this we will use the saddle point approach (Laplace approximation). In this approach, since the exponential function under the integral in the third term of Eq. (9) decreases very quickly, it is enough to compute the integral around the maximum of the function f(β) in the exponent, exp(f(β)), which gives the greatest contribution to the exponent, and therefore to the integral in Eq. (9), and hence to the likelihood. Denoting the function in the exponent by f(β), we can approximate it via a Taylor series expansion in the proximity of the mean estimator point:

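    In standard form, with \nabla f and H denoting the gradient and Hessian of f evaluated at the estimator \hat{\beta}:

        f(\beta) \approx f(\hat{\beta}) + (\beta - \hat{\beta})^T \, \nabla f(\hat{\beta}) + \frac{1}{2} (\beta - \hat{\beta})^T \, H(\hat{\beta}) \, (\beta - \hat{\beta})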

    Here, the linear term is zero because of the extremum condition. We assume that in reality the likelihood is maximal at the true mean; however, the estimator is not far from the true mean, so the Taylor series expansion can be performed. Coming back to the third term in Eq. (9), the Taylor series expansion around the mean estimator gives:

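    For f(β) = -\frac{1}{2} (y - X\beta)^T \Sigma_y^{-1} (y - X\beta) the Hessian is H = -X^T \Sigma_y^{-1} X, and evaluating the resulting Gaussian integral yields, up to an additive constant, the negative log-likelihood to be minimized:

        -\log \mathcal{L}_{REML} = \frac{1}{2} \log|\Sigma_y| + \frac{1}{2} (y - X\hat{\beta})^T \Sigma_y^{-1} (y - X\hat{\beta}) + \frac{1}{2} \log\left| X^T \Sigma_y^{-1} X \right| + \mathrm{const} \qquad (12)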

    where |…| is the notation for the determinant. The first two terms in Eq. (12) are the ML solution that we also obtained in LMM from Scratch (Eq. (10) in that article). In contrast, the third term comes from the REML approach. One can think about this additional term as a penalty (or a bias) in a penalized model (Ridge / Lasso / Elastic Net), where we put a constraint on the coefficients of the linear regression (or LMM) model. Let us check how this additional term coming from REML affects the solution of the Linear Mixed Model (LMM) for the toy data set that was introduced in the LMM from Scratch post.

    LMM via REML for Toy Data Set

    To recap, for simplicity we were considering only 4 data points: 2 originating from Individual #1 and the other 2 coming from Individual #2. Further, the 4 points are spread between two conditions, untreated and treated, please see the figure below. In the Treat column, 0 means untreated and 1 means treated.

    We fit the data using an LMM with fixed effects for slopes and intercepts and random effects for intercepts, using the lmer function from the lme4 R package. To include random-effects intercepts that account for the grouping factor Ind (the individual ID), we need to use a special syntax, (1 | Ind), in the lmer formula. Now we can fit the LMM using the Restricted Maximum Likelihood (REML) approach; for this purpose, we specify REML = TRUE:

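    As a minimal sketch of the call described above (the Resp values below are placeholders chosen to be consistent with the fixed-effects estimates β1 = 6 and β2 = 15.5 found previously, but not with the variance components; the actual toy data set is defined in the LMM from Scratch post):

        # A sketch of the lmer fit described above; the Resp values are
        # placeholders, not the actual toy data set.
        library(lme4)

        df <- data.frame(
          Ind   = factor(c(1, 1, 2, 2)),  # grouping factor: individual ID
          Treat = c(0, 1, 0, 1),          # 0 = untreated, 1 = treated
          Resp  = c(1, 18, 11, 25)        # placeholder data points y11, y12, y21, y22
        )

        # fixed effects for intercept and slope, random intercepts per individual,
        # fitted by Restricted Maximum Likelihood
        fit <- lmer(Resp ~ Treat + (1 | Ind), data = df, REML = TRUE)
        summary(fit)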

    Please notice the shared and residual standard deviations, 8.155 and 6.0, respectively, which we denoted as σs and σ in the previous post. We will later reproduce these values when implementing the REML solution for the LMM. As we saw in the previous LMM_from_Scratch tutorial, knowing the coordinates of the data points y11, y12, y21 and y22, the first two terms in Eq. (12) can be computed as:

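    As a sketch (the numeric expression follows by plugging the four data points into these formulas), with Σy block-diagonal over the two individuals so that its determinant factorizes:

        \frac{1}{2} \log|\Sigma_y| + \frac{1}{2} (y - X\hat{\beta})^T \Sigma_y^{-1} (y - X\hat{\beta}), \qquad |\Sigma_y| = \left( \sigma^2 (\sigma^2 + 2\sigma_s^2) \right)^2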

    The third term in Eq. (12) can also be derived analytically, since we know both X, the design matrix, and the inverse variance-covariance matrix, Σy⁻¹. Working out the 2 x 2 matrix X^T Σy⁻¹ X in Maple software shows that the third term in Eq. (12) takes the form:

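    As a sketch of that computation, assuming the random-intercept covariance structure from the previous post (diagonal entries σs² + σ², off-diagonal entries σs² within each individual, zero between individuals), the determinant simplifies to:

        \left| X^T \Sigma_y^{-1} X \right| = \frac{4}{\sigma^2 \left( \sigma^2 + 2\sigma_s^2 \right)}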

    Thus, taking the logarithm, the third term in Eq. (12) has the following simple expression:

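        \frac{1}{2} \log \left| X^T \Sigma_y^{-1} X \right| = \frac{1}{2} \log \frac{4}{\sigma^2 \left( \sigma^2 + 2\sigma_s^2 \right)}

    This is the REML correction that will be added to the ML terms in the numerical minimization below.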

    Therefore, we can now minimize the log-likelihood function in the Restricted Maximum Likelihood (REML) approximation, i.e. with the log-likelihood function, Eq. (12), containing no information about the mean β: it is no longer a parameter of the optimization but takes the fixed / estimated values β1 = 6 and β2 = 15.5 that were found previously. Now everything is ready for performing the numerical minimization of the log-likelihood function, Eq. (12), with respect to σs and σ in the REML approximation:

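    As a minimal sketch of this minimization (again with placeholder data points; with the actual toy data set the routine converges to the values reported below):

        # A sketch of minimizing the negative REML log-likelihood, Eq. (12),
        # with respect to sigma_s and sigma, keeping beta fixed at the values
        # found previously. The vector y holds placeholder data points
        # y11, y12, y21, y22, not the actual toy data set.
        y    <- c(1, 18, 11, 25)                    # placeholder responses
        X    <- cbind(1, c(0, 1, 0, 1))             # design matrix: intercept + Treat
        Z    <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1)) # random-intercept design, 2 individuals
        beta <- c(6, 15.5)                          # fixed effects beta1, beta2 (not optimized)

        neg_log_lik_reml <- function(par) {
          sigma_s <- par[1]                         # shared (random-effects) standard deviation
          sigma   <- par[2]                         # residual standard deviation
          Sigma_y <- sigma_s^2 * Z %*% t(Z) + sigma^2 * diag(4)
          r <- y - X %*% beta
          drop(0.5 * log(det(Sigma_y)) +                          # first term of Eq. (12)
               0.5 * t(r) %*% solve(Sigma_y) %*% r +              # second term of Eq. (12)
               0.5 * log(det(t(X) %*% solve(Sigma_y) %*% X)))     # REML correction term
        }

        opt <- optim(c(5, 5), neg_log_lik_reml)     # numerical minimization
        opt$par                                     # estimates of (sigma_s, sigma)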

    From the minimization of the log-likelihood function we obtain σ = 6.00 and σs = 8.155, exactly the standard deviations that we also obtained with the lmer function with REML = TRUE. We have thus reproduced the Random Effects residual variance, σ, and the variance shared across data points, σs, for both Maximum Likelihood (REML = FALSE), in the previous post, and Restricted Maximum Likelihood (REML = TRUE), in this post. And we have derived and coded it from scratch using R, well done!

    Summary

    In this article, we have learnt that the Maximum Likelihood (ML) variance estimator is biased, especially for high-dimensional data, due to using an unknown mean estimator. Restricted Maximum Likelihood (REML) fixes this issue by first removing all the information about the mean estimator prior to minimizing the log-likelihood function. We have successfully reproduced the variance components reported by lmer with REML = TRUE, and we derived and coded REML from scratch using R.

    In the comments below, let me know which analytical techniques from Life Sciences seem especially mysterious to you and I will try to cover them in future posts. Check the code from this post on my Github. Follow me at Medium Nikolay Oskolkov, on Twitter @NikolayOskolkov, and connect on Linkedin. In the next post, we will cover how to cluster in UMAP space, stay tuned.

    Translated from: https://towardsdatascience.com/maximum-likelihood-ml-vs-reml-78cf79bef2cf
