This post is a set of reading notes on *Linear Algebra and Its Applications*.
The adjective "least-squares" arises from the fact that $\|\boldsymbol b - A\boldsymbol x\|$ is the square root of a sum of squares.
Apply the Best Approximation Theorem in Section 6.3 to the subspace $\mathrm{Col}\,A$, and let $\hat{\boldsymbol b} = \mathrm{proj}_{\mathrm{Col}\,A}\,\boldsymbol b$.
Because $\hat{\boldsymbol b}$ is in $\mathrm{Col}\,A$, the equation $A\boldsymbol x = \hat{\boldsymbol b}$ is consistent, and there is an $\hat{\boldsymbol x}$ in $\mathbb{R}^n$ such that

$$A\hat{\boldsymbol x} = \hat{\boldsymbol b} \quad (1)$$
Since $\hat{\boldsymbol b}$ is the closest point in $\mathrm{Col}\,A$ to $\boldsymbol b$, a vector $\hat{\boldsymbol x}$ is a least-squares solution of $A\boldsymbol x = \boldsymbol b$ if and only if $\hat{\boldsymbol x}$ satisfies (1). See Figure 2. [There are many solutions of (1) if the equation has free variables.]
Suppose $\hat{\boldsymbol x}$ satisfies $A\hat{\boldsymbol x} = \hat{\boldsymbol b}$. By the Orthogonal Decomposition Theorem in Section 6.3, $\boldsymbol b - \hat{\boldsymbol b}$ is orthogonal to $\mathrm{Col}\,A$, so $\boldsymbol b - A\hat{\boldsymbol x}$ is orthogonal to each column of $A$. If $\boldsymbol a_j$ is any column of $A$, then $\boldsymbol a_j \cdot (\boldsymbol b - A\hat{\boldsymbol x}) = 0$, that is, $\boldsymbol a_j^T(\boldsymbol b - A\hat{\boldsymbol x}) = 0$. Since each $\boldsymbol a_j^T$ is a row of $A^T$,

$$A^T(\boldsymbol b - A\hat{\boldsymbol x}) = \boldsymbol 0 \quad (2)$$

(This equation also follows from Theorem 3 in Section 6.1.) Thus

$$\begin{aligned}A^T\boldsymbol b - A^TA\hat{\boldsymbol x} &= \boldsymbol 0 \\ A^TA\hat{\boldsymbol x} &= A^T\boldsymbol b\end{aligned}$$
These calculations show that each least-squares solution of $A\boldsymbol x = \boldsymbol b$ satisfies the equation

$$A^TA\boldsymbol x = A^T\boldsymbol b \quad (3)$$

This system of equations is called the **normal equations** for $A\boldsymbol x = \boldsymbol b$. A solution of (3) is often denoted by $\hat{\boldsymbol x}$.
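To make this concrete, here is a minimal numpy sketch (the data is illustrative, not taken from the book) that forms the normal equations, solves them, and checks that the residual is orthogonal to $\mathrm{Col}\,A$:

```python
import numpy as np

# A small overdetermined system Ax = b (illustrative data).
A = np.array([[4.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
b = np.array([2.0, 0.0, 11.0])

# Form and solve the normal equations  A^T A x = A^T b.
AtA = A.T @ A
Atb = A.T @ b
x_hat = np.linalg.solve(AtA, Atb)

print(x_hat)                   # least-squares solution, here [1. 2.]
print(A.T @ (b - A @ x_hat))   # A^T (b - A x_hat) ~ 0, as equation (2) requires
```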
The next theorem gives useful criteria for determining when there is only one least-squares solution of $A\boldsymbol x = \boldsymbol b$. (Of course, the orthogonal projection $\hat{\boldsymbol b}$ is always unique.)

THEOREM 14 Let $A$ be an $m\times n$ matrix. The following statements are logically equivalent:

a. The equation $A\boldsymbol x = \boldsymbol b$ has a unique least-squares solution for each $\boldsymbol b$ in $\mathbb{R}^m$.
b. The columns of $A$ are linearly independent.
c. The matrix $A^TA$ is invertible.

When these statements are true, the least-squares solution $\hat{\boldsymbol x}$ is given by

$$\hat{\boldsymbol x} = (A^TA)^{-1}A^T\boldsymbol b \quad (4)$$
Formula (4) for $\hat{\boldsymbol x}$ is useful mainly for theoretical purposes and for hand calculations when $A^TA$ is a $2\times 2$ invertible matrix.
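For a small well-conditioned problem, formula (4) can also be applied verbatim; a short sketch with made-up data (not the book's):

```python
import numpy as np

A = np.array([[1.0,  3.0],
              [1.0, -1.0],
              [1.0,  1.0]])
b = np.array([5.0, 1.0, 0.0])

# Formula (4): x_hat = (A^T A)^{-1} A^T b.  Explicitly inverting A^T A is
# fine here, but solving the normal equations is preferable in general.
x_hat = np.linalg.inv(A.T @ A) @ (A.T @ b)
print(x_hat)   # [1. 1.]
```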
PROOF If $A\boldsymbol x = \boldsymbol 0$, then $A^TA\boldsymbol x = \boldsymbol 0$, so $\mathrm{Nul}\,A \subseteq \mathrm{Nul}\,A^TA$. Conversely, if $A^TA\boldsymbol x = \boldsymbol 0$, then $\boldsymbol x^TA^TA\boldsymbol x = (A\boldsymbol x)^TA\boldsymbol x = \|A\boldsymbol x\|^2 = 0$, so $A\boldsymbol x = \boldsymbol 0$ and therefore $\mathrm{Nul}\,A^TA \subseteq \mathrm{Nul}\,A$. Hence $\mathrm{Nul}\,A = \mathrm{Nul}\,A^TA$, and by the Rank Theorem,

$$\mathrm{rank}\,A^TA = n - \dim \mathrm{Nul}\,A^TA = n - \dim \mathrm{Nul}\,A = \mathrm{rank}\,A$$
So when $A$ has $n$ linearly independent columns, $\mathrm{rank}\,A^TA = \mathrm{rank}\,A = n$, which means the $n\times n$ matrix $A^TA$ is invertible.
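The rank identity is easy to check numerically; a quick sketch with random matrices (my own illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 6x3 matrix almost surely has linearly independent columns,
# so rank(A^T A) = rank(A) = 3 and A^T A is invertible.
A = rng.standard_normal((6, 3))
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T @ A))   # 3 3

# With a duplicated column, both ranks drop together.
B = np.column_stack([A[:, 0], A[:, 0], A[:, 1]])
print(np.linalg.matrix_rank(B), np.linalg.matrix_rank(B.T @ B))   # 2 2
```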
The next example shows how to find a least-squares solution of $A\boldsymbol x = \boldsymbol b$ when the columns of $A$ are orthogonal. Such matrices often appear in linear regression problems.
EXAMPLE 4 Find a least-squares solution of $A\boldsymbol x = \boldsymbol b$ for
SOLUTION Because the columns $\boldsymbol a_1$ and $\boldsymbol a_2$ of $A$ are orthogonal, the orthogonal projection of $\boldsymbol b$ onto $\mathrm{Col}\,A$ is given by

$$\hat{\boldsymbol b} = \frac{\boldsymbol b\cdot\boldsymbol a_1}{\boldsymbol a_1\cdot\boldsymbol a_1}\,\boldsymbol a_1 + \frac{\boldsymbol b\cdot\boldsymbol a_2}{\boldsymbol a_2\cdot\boldsymbol a_2}\,\boldsymbol a_2 \quad (5)$$
Now that $\hat{\boldsymbol b}$ is known, we can solve $A\hat{\boldsymbol x} = \hat{\boldsymbol b}$. But this is trivial, since we already know what weights to place on the columns of $A$ to produce $\hat{\boldsymbol b}$. It is clear from (5) that

$$\hat{\boldsymbol x} = \begin{bmatrix} \dfrac{\boldsymbol b\cdot\boldsymbol a_1}{\boldsymbol a_1\cdot\boldsymbol a_1} \\[1.5ex] \dfrac{\boldsymbol b\cdot\boldsymbol a_2}{\boldsymbol a_2\cdot\boldsymbol a_2} \end{bmatrix}$$
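Here is the same idea in numpy, with a small matrix of my own whose columns are orthogonal (not the book's Example 4 data); each entry of $\hat{\boldsymbol x}$ is just an independent projection weight:

```python
import numpy as np

a1 = np.array([1.0, 1.0, 1.0, 1.0])
a2 = np.array([-1.0, 1.0, -1.0, 1.0])    # a1 . a2 = 0: orthogonal columns
A = np.column_stack([a1, a2])
b = np.array([3.0, 1.0, 4.0, 2.0])

# Weights from (5): (b . a_j) / (a_j . a_j) for each column.
x_hat = np.array([b @ a1 / (a1 @ a1), b @ a2 / (a2 @ a2)])
print(x_hat)                                  # [ 2.5 -1. ]

# Agrees with the normal-equations solution.
print(np.linalg.solve(A.T @ A, A.T @ b))      # [ 2.5 -1. ]
```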
In some cases, the normal equations for a least-squares problem can be ill-conditioned; that is, small errors in the calculation of the entries of $A^TA$ can sometimes cause relatively large errors in the solution $\hat{\boldsymbol x}$. If the columns of $A$ are linearly independent, the least-squares solution can often be computed more reliably through a $QR$ factorization of $A$ (described in Section 6.4).
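A sketch of the $QR$ route, using synthetic nearly-collinear columns of my own choosing: since $A = QR$ with $Q^TQ = I$, the normal equations reduce to the triangular system $R\boldsymbol x = Q^T\boldsymbol b$, which avoids forming $A^TA$ (whose condition number is roughly the square of that of $A$):

```python
import numpy as np

# Two nearly collinear columns make A^T A poorly conditioned.
t = np.linspace(0.0, 1.0, 50)
noise = 1e-6 * np.random.default_rng(1).standard_normal(50)
A = np.column_stack([t, t + noise])
b = t + 0.001

Q, R = np.linalg.qr(A)               # reduced QR: Q is 50x2, R is 2x2
x_qr = np.linalg.solve(R, Q.T @ b)   # solve R x = Q^T b (R is upper triangular)

print(np.linalg.cond(A.T @ A))       # ~ cond(A)**2: far worse than cond(A)
print(x_qr)
```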