
3.2 The normal equations

We now solve the general linear least-squares problem in Definition 3.1.1. That is, given $\mathbf{A}\in\mathbb{R}^{m \times n}$ and $\mathbf{b}\in\mathbb{R}^m$, with $m>n$, find the $\mathbf{x}\in\mathbb{R}^n$ that minimizes $\| \mathbf{b} - \mathbf{A}\mathbf{x} \|_2$.

There is a concise explicit solution. In the following proof we make use of the elementary algebraic fact that for two vectors $\mathbf{u}$ and $\mathbf{v}$,

$$(\mathbf{u}+\mathbf{v})^T(\mathbf{u}+\mathbf{v}) = \mathbf{u}^T\mathbf{u} + \mathbf{u}^T\mathbf{v} + \mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v} = \mathbf{u}^T\mathbf{u} + 2\mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v}.$$
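As a quick numerical sanity check of this identity (a throwaway example, not part of the proof), one can verify it for random vectors in Julia:

```julia
using LinearAlgebra

# Check (u+v)'(u+v) == u'u + 2 v'u + v'v for random vectors.
u, v = randn(5), randn(5)
lhs = dot(u + v, u + v)
rhs = dot(u, u) + 2 * dot(v, u) + dot(v, v)
@show lhs ≈ rhs    # true, up to floating-point roundoff
```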

The normal equations have a geometric interpretation, as shown in Figure 3.2.1. The vector in the range (column space) of $\mathbf{A}$ that lies closest to $\mathbf{b}$ makes the vector difference $\mathbf{A}\mathbf{x}-\mathbf{b}$ perpendicular to the range. Thus for any $\mathbf{z}$, we must have $(\mathbf{A} \mathbf{z})^T(\mathbf{A}\mathbf{x}-\mathbf{b})=0$, which is satisfied if $\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$.


Figure 3.2.1: Geometry of the normal equations. The smallest residual is orthogonal to the range of the matrix $\mathbf{A}$.
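The orthogonality of the smallest residual is easy to observe numerically. The following sketch uses an arbitrary small example (the matrix and vector are made up for illustration) and checks that $\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})$ is essentially zero:

```julia
using LinearAlgebra

A = [1.0 2; 3 4; 5 6; 7 9]   # any full-rank tall matrix will do
b = [1.0, -1, 2, 0]
x = A \ b                    # least-squares solution
r = A * x - b                # residual vector
@show norm(A' * r)           # ≈ 0: the residual is orthogonal to range(A)
```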

3.2.1 Pseudoinverse and definiteness

If we group the left-hand side of the normal equations as $(\mathbf{A}^T\mathbf{A})\,\mathbf{x}$, we recognize (3.2.3) as a square $n\times n$ linear system to solve for $\mathbf{x}$.

Mathematically, the overdetermined least-squares problem $\mathbf{A}\mathbf{x}\approx \mathbf{b}$ has the solution $\mathbf{x}=\mathbf{A}^+\mathbf{b}$, where $\mathbf{A}^+=(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ is the pseudoinverse of $\mathbf{A}$.

Computationally we can generalize the observation about Julia from Chapter 2: backslash is equivalent mathematically to left-multiplication by the inverse (in the square case) or pseudoinverse (in the rectangular case) of a matrix. One can also compute the pseudoinverse directly using pinv, but as with matrix inverses, this is rarely necessary in practice.
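For instance, on a small made-up line-fitting problem (values chosen only for illustration), backslash and pinv return the same least-squares solution:

```julia
using LinearAlgebra

A = [1.0 0; 1 1; 1 2; 1 3]        # fit a straight line to four data points
b = [2.0, 3, 5, 4]
x_backslash = A \ b               # preferred: uses a stable factorization
x_pinv = pinv(A) * b              # mathematically equivalent, rarely needed
@show norm(x_backslash - x_pinv)  # near machine precision
```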

The matrix $\mathbf{A}^T\mathbf{A}$ appearing in the pseudoinverse has some important properties.
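A brief experiment (on a random tall matrix, purely illustrative) previews the properties asserted in Theorem 3.2.2, namely that $\mathbf{A}^T\mathbf{A}$ is symmetric and positive definite when $\mathbf{A}$ has full column rank:

```julia
using LinearAlgebra

A = randn(6, 3)                # random tall matrix; full rank with probability 1
N = A' * A
@show issymmetric(N)           # true
@show isposdef(N)              # true: a Cholesky factorization exists
@show minimum(eigvals(N)) > 0  # equivalently, all eigenvalues are positive
```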

3.2.2 Implementation

The definition of the pseudoinverse involves taking the inverse of a matrix, so it is not advisable to use the pseudoinverse computationally. Instead, we use the normal equations to set up a linear system, which we already know how to solve. In summary, the steps for solving the linear least-squares problem $\mathbf{A}\mathbf{x}\approx\mathbf{b}$ are:

  1. Compute $\mathbf{N}=\mathbf{A}^T\mathbf{A}$.

  2. Compute $\mathbf{z} = \mathbf{A}^T\mathbf{b}$.

  3. Solve the $n\times n$ linear system $\mathbf{N}\mathbf{x} = \mathbf{z}$ for $\mathbf{x}$.

Steps 1 and 3 of Algorithm 3.2.1 dominate the flop count.

In the last step we can exploit the fact, proved in Theorem 3.2.2, that $\mathbf{N}$ is symmetric and positive definite, and use Cholesky factorization as in Exploiting matrix structure. This detail is included in Function 3.2.2.
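A minimal sketch of this approach follows. It assumes a solver named lsnormal, echoing the text's Function 3.2.2, though the details of the actual function may differ:

```julia
using LinearAlgebra

"""
    lsnormal(A, b)

Solve the least-squares problem `A*x ≈ b` via the normal equations
(a sketch of the steps above, not necessarily the text's exact code).
"""
function lsnormal(A, b)
    N = A' * A                    # step 1: normal-equations matrix
    z = A' * b                    # step 2: right-hand side
    R = cholesky(Symmetric(N)).U  # step 3: N = RᵀR with R upper triangular
    w = R' \ z                    # forward substitution
    return R \ w                  # backward substitution
end

# Usage: agrees with backslash for a well-conditioned problem.
A = randn(100, 4); b = randn(100)
@show norm(lsnormal(A, b) - A \ b)
```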

3.2.3 Conditioning and stability

We have already used A\b as the native way to solve the linear least-squares problem $\mathbf{A}\mathbf{x}\approx\mathbf{b}$ in Julia. The algorithm employed by backslash does not proceed through the normal equations, because of instability.

The conditioning of the linear least-squares problem relates changes in the solution $\mathbf{x}$ to those in the data, $\mathbf{A}$ and $\mathbf{b}$. A full accounting of the condition number is too messy to present here, but we can get the main idea. We start by generalizing our previous definition of the matrix condition number: for a rectangular matrix of full rank, $\kappa(\mathbf{A}) = \|\mathbf{A}\|_2 \, \|\mathbf{A}^+\|_2$.
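As a quick check of this generalized definition (using an arbitrary example matrix), the product of norms agrees with Julia's built-in cond, which handles rectangular matrices via singular values:

```julia
using LinearAlgebra

A = [1.0 2; 3 4; 5 7]                     # any full-rank tall matrix
κ_from_def = opnorm(A) * opnorm(pinv(A))  # ‖A‖₂ ⋅ ‖A⁺‖₂
@show κ_from_def
@show cond(A)                             # same value, from singular values
```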

Provided that the minimum residual norm $\|\mathbf{b}-\mathbf{A}\mathbf{x}\|$ is relatively small, the conditioning of the linear least-squares problem is close to $\kappa(\mathbf{A})$.

As an algorithm, the normal equations begin by computing the data for the $n\times n$ system $(\mathbf{A}^T\mathbf{A})\mathbf{x} = \mathbf{A}^T \mathbf{b}$. When these equations are solved, perturbations to the data can be amplified by a factor $\kappa(\mathbf{A}^T\mathbf{A})$.

The following can be proved using results in Chapter 7: if $\mathbf{A}$ has full column rank, then $\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2$.

This squaring of the condition number in the normal equations is the cause of instability. If $\kappa(\mathbf{A})$ is large, the squaring of it can destabilize the normal equations: while the solution of the least-squares problem is sensitive, finding it via the normal equations makes it doubly so.
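A small experiment illustrates both effects. The monomial-basis matrix below is deliberately ill conditioned (the example is invented for illustration); the condition number of $\mathbf{A}^T\mathbf{A}$ is roughly the square of that of $\mathbf{A}$, and the normal-equations solution typically loses far more accuracy than backslash:

```julia
using LinearAlgebra

t = range(0, 1, length=400)
A = [s^j for s in t, j in 0:9]     # ill-conditioned monomial-basis matrix
@show cond(A)
@show cond(A' * A)                 # roughly cond(A)^2

x_exact = ones(10)
b = A * x_exact
x_backslash = A \ b                # stable; does not form the normal equations
x_normal = (A' * A) \ (A' * b)     # normal-equations solution
@show norm(x_backslash - x_exact)  # small error
@show norm(x_normal - x_exact)     # typically much larger error
```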

3.2.4 Exercises

  1. ✍ Work out the least-squares solution when

    $$\mathbf{A} = \begin{bmatrix} 2 & -1 \\ 0 & 1 \\ -2 & 2 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 1\\-5\\6 \end{bmatrix}.$$
  2. ✍ Use (3.2.4) to find the pseudoinverse $\mathbf{A}^+$ of the matrix $\mathbf{A}=\begin{bmatrix}1&-2&3\end{bmatrix}^T$.

  3. ✍ Prove the first statement of Theorem 3.2.2: $\mathbf{A}^T\mathbf{A}$ is symmetric for any $m\times n$ matrix $\mathbf{A}$ with $m \ge n$.

  4. ✍ Prove that if $\mathbf{A}$ is an invertible square matrix, then $\mathbf{A}^+=\mathbf{A}^{-1}$.

  5. (a) ✍ Show that for any $m\times n$ matrix $\mathbf{A}$ with $m>n$ for which $\mathbf{A}^T\mathbf{A}$ is nonsingular, $\mathbf{A}^+\mathbf{A}$ is the $n\times n$ identity.

    (b) ⌨ Show using an example in Julia that $\mathbf{A}\mathbf{A}^+$ is not an identity matrix. (This matrix has rank no greater than $n$, so it can’t be an $m\times m$ identity.)

  6. ✍ Prove that the vector $\mathbf{A}\mathbf{A}^+\mathbf{b}$ is the vector in the column space (i.e., range) of $\mathbf{A}$ that is closest to $\mathbf{b}$ in the sense of the 2-norm.

  7. ✍ Show that the flop count for Function 3.2.2 is asymptotically $\sim 2mn^2 + \tfrac{1}{3}n^3$. (In finding the asymptotic count you can ignore terms like $mn$ whose total degree is less than 3.)

  8. ⌨ Let $t_1,\ldots,t_m$ be $m$ equally spaced points in $[0,2\pi]$. In this exercise, use $m=500$.

    (a) Let $\mathbf{A}_\beta$ be the matrix in (3.1.2) that corresponds to fitting data with the function $c_1 + c_2 \sin(t) + c_3 \cos(\beta t)$. Using the identity (3.2.7), make a table of the condition numbers of $\mathbf{A}_\beta$ for $\beta = 2, 1.1, 1.01, \ldots, 1+10^{-8}$.

    (b) Repeat part (a) using the fitting function $c_1 + c_2 \sin^2(t) + c_3 \cos^2(\beta t)$.

    (c) Why does it make sense that $\kappa\bigl(\mathbf{A}_\beta\bigr)\to \infty$ as $\beta\to 1$ in part (b) but not in part (a)?

  9. ✍ ⌨ When $\mathbf{A}$ is $m\times n$ with rank less than $n$, the pseudoinverse is still defined and can be computed using pinv from LinearAlgebra. However, the behavior in this case is not always intuitive. Let

    $$\mathbf{A}_s = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 0 & s \end{bmatrix}.$$

    Then $\mathbf{A}_0$ has rank equal to 1. Demonstrate experimentally that $\mathbf{A}_0^+ \neq \lim_{s\to 0} \mathbf{A}_s^+$.