The normal equations

We seek to solve the general linear least squares problem: Given \(\mathbf{A}\in\mathbb{R}^{m \times n}\) and \(\mathbf{b}\in\mathbb{R}^m\), with \(m>n\), find \(\mathbf{x}\in\mathbb{R}^n\) such that \(\| \mathbf{b} - \mathbf{A}\mathbf{x} \|_2\) is minimized. There is a concise explicit solution to the problem.

In the following proof we make use of the elementary algebraic fact that for two vectors \(\mathbf{u}\) and \(\mathbf{v}\),

\[ (\mathbf{u}+\mathbf{v})^T(\mathbf{u}+\mathbf{v}) = \mathbf{u}^T\mathbf{u} + \mathbf{u}^T\mathbf{v} + \mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v} = \mathbf{u}^T\mathbf{u} + 2\mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v}.\]
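This identity is easy to confirm numerically. Here is a throwaway sketch in Julia with arbitrary random vectors (not part of the development that follows):

using LinearAlgebra
u, v = randn(4), randn(4)
dot(u + v, u + v) ≈ dot(u, u) + 2*dot(v, u) + dot(v, v)    # true, up to roundoff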
Theorem 22

If \(\mathbf{x}\) satisfies \(\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}\), then \(\mathbf{x}\) solves the linear least squares problem, i.e., \(\mathbf{x}\) minimizes \(\| \mathbf{b}-\mathbf{A}\mathbf{x} \|_2\).

Proof

Let \(\mathbf{y}\in \mathbb{R}^n\) be any vector. Then \(\mathbf{A}(\mathbf{x}+\mathbf{y})-\mathbf{b}=\mathbf{A}\mathbf{x}-\mathbf{b}+\mathbf{A}\mathbf{y}\), and

\[\begin{split}
\| \mathbf{A}(\mathbf{x}+\mathbf{y})-\mathbf{b} \|_2^2
&= [(\mathbf{A}\mathbf{x}-\mathbf{b})+(\mathbf{A}\mathbf{y})]^T[(\mathbf{A}\mathbf{x}-\mathbf{b})+(\mathbf{A}\mathbf{y})]\\
&= (\mathbf{A}\mathbf{x}-\mathbf{b})^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + 2(\mathbf{A}\mathbf{y})^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + (\mathbf{A}\mathbf{y})^T(\mathbf{A}\mathbf{y})\\
&= \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2 + 2\mathbf{y}^T \mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + \| \mathbf{A}\mathbf{y} \|_2^2 \\
&= \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2 + \| \mathbf{A}\mathbf{y} \|_2^2 \\
&\ge \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2.
\end{split}\]

The condition \(\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}\) is often written as

(63)\[\mathbf{A}^T\mathbf{A}\mathbf{x}=\mathbf{A}^T\mathbf{b},\]

called the normal equations. They have a straightforward geometric interpretation, as shown in Fig. 2. The vector in the range (column space) of \(\mathbf{A}\) that lies closest to \(\mathbf{b}\) makes the vector difference \(\mathbf{A}\mathbf{x}-\mathbf{b}\) perpendicular to the range. Thus for any \(\mathbf{z}\), we must have \((\mathbf{A} \mathbf{z})^T(\mathbf{A}\mathbf{x}-\mathbf{b})=0\), which is satisfied if \(\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}\).

Fig. 2: Geometry of the normal equations.

If we group the left-hand side of the normal equations as \((\mathbf{A}^T\mathbf{A})\,\mathbf{x}\), we recognize (63) as a square \(n\times n\) linear system to solve for \(\mathbf{x}\).
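For instance, here is a minimal sketch (the matrix and vector below are arbitrary) confirming numerically that the least squares solution returned by backslash satisfies (63):

using LinearAlgebra
A = [1.0 2; 3 4; 5 6; 7 9]      # 4×2 matrix with independent columns
b = [1.0, -1, 2, 0]
x = A \ b                        # least squares solution
A' * (A*x - b)                   # ≈ zero vector, as the normal equations require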

Pseudoinverse and definiteness

The \(n\times m\) matrix

(64)\[\mathbf{A}^+ = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\]

is called the pseudoinverse of \(\mathbf{A}\). Mathematically, the overdetermined least squares problem \(\mathbf{A}\mathbf{x}\approx \mathbf{b}\) has the solution \(\mathbf{x}=\mathbf{A}^+\mathbf{b}\). Hence we can generalize our earlier observation: backslash is equivalent mathematically to left-multiplication by the inverse (square case) or pseudoinverse (rectangular case) of a matrix. One may also compute the pseudoinverse directly using pinv, but as with matrix inverses, this is rarely necessary in practice.
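A brief sketch of this equivalence, using an arbitrary small fitting matrix: backslash, the explicit pseudoinverse, and the normal equations all give the same least squares solution up to roundoff.

using LinearAlgebra
A = [ones(4) (0:3.0)]            # fit a straight line at t = 0, 1, 2, 3
b = [0.0, 1, 1, 2]
x1 = A \ b                        # recommended in practice
x2 = pinv(A) * b                  # explicit pseudoinverse
x3 = (A'*A) \ (A'*b)              # normal equations
x1 ≈ x2 ≈ x3                      # all three agree up to roundoff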

The matrix \(\mathbf{A}^T\mathbf{A}\) appearing in the pseudoinverse has some important properties.

Theorem 23 (AtA)

For any real \(m\times n\) matrix \(\mathbf{A}\) with \(m\ge n\), the following are true:

  1. \(\mathbf{A}^T\mathbf{A}\) is symmetric.

  2. \(\mathbf{A}^T \mathbf{A}\) is singular if and only if the columns of \(\mathbf{A}\) are linearly dependent. (Equivalently, the rank of \(\mathbf{A}\) is less than \(n\).)

  3. If \(\mathbf{A}^T\mathbf{A}\) is nonsingular, then it is positive definite.

Proof

The first part is left as an exercise. For the second part, recall that \(\mathbf{A}^T\mathbf{A}\) is singular if and only if \(\mathbf{A}^T\mathbf{A}\mathbf{z}=\boldsymbol{0}\) for some nonzero \(\mathbf{z}\). So suppose \(\mathbf{A}^T\mathbf{A}\mathbf{z}=\boldsymbol{0}\). Left-multiplying by \(\mathbf{z}^T\), we find that

\[0 = \mathbf{z}^T\mathbf{A}^T\mathbf{A}\mathbf{z}=(\mathbf{A}\mathbf{z})^T(\mathbf{A}\mathbf{z}) = \| \mathbf{A}\mathbf{z} \|_2^2,\]

which is equivalent to \(\mathbf{A}\mathbf{z}=\boldsymbol{0}\). Hence a nonzero such \(\mathbf{z}\) exists if and only if the columns of \(\mathbf{A}\) are linearly dependent.

Finally, we can repeat the manipulations above to show that for any nonzero \(n\)-vector \(\mathbf{v}\), \(\mathbf{v}^T(\mathbf{A}^T\mathbf{A})\mathbf{v}=\| \mathbf{A}\mathbf{v} \|_2^2\ge 0\); when \(\mathbf{A}^T\mathbf{A}\) is nonsingular, equality is impossible, thanks to the second part of the theorem.
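These properties are easy to observe numerically. Here is a brief sketch with a random matrix (the sizes and data are arbitrary):

using LinearAlgebra
A = randn(6, 3)              # random matrix; almost surely has independent columns
N = A' * A
N ≈ N'                        # symmetric, as in part 1
isposdef(N)                   # true, consistent with part 3

c = A[:, 1] + A[:, 2]         # a column that depends linearly on the others
B = [A c]
rank(B' * B)                  # 3, less than 4, so B'B is singular (part 2)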

The definition of the pseudoinverse involves taking the inverse of a matrix and is therefore not advisable to use computationally. Instead, we simply use the definition of the normal equations to set up a linear system, which we already know how to solve. In summary, the steps for solving the linear least squares problem \(\mathbf{A}\mathbf{x}\approx\mathbf{b}\) are:

  1. Compute \(\mathbf{N}=\mathbf{A}^T\mathbf{A}\).

  2. Compute \(\mathbf{z} = \mathbf{A}^T\mathbf{b}\).

  3. Solve the \(n\times n\) linear system \(\mathbf{N}\mathbf{x} = \mathbf{z}\) for \(\mathbf{x}\).

In the last step we can exploit the fact, proved in the AtA theorem, that \(\mathbf{N}\) is symmetric and positive definite, and use Cholesky factorization as in Symmetric positive definite matrices. (The backslash command does this automatically.)

Function 24 (lsnormal)
"""
lsnormal(A,b)

Solve a linear least squares problem by the normal equations.
Returns the minimizer of ||b-Ax||.
"""
function lsnormal(A,b)

N = A'*A;  z = A'*b;                   # assemble the normal equations (63)
R = cholesky(N).U                      # upper Cholesky factor, so N = R'R
w = forwardsub(R',z)                   # solve R'w = z
x = backsub(R,w)                       # solve Rx = w

return x
end
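A quick usage sketch (assuming lsnormal and the forwardsub and backsub routines from earlier sections are in scope; the data below is arbitrary):

using LinearAlgebra                   # for cholesky, used inside lsnormal
A = [ones(5) (1:5.0) (1:5.0).^2]      # quadratic fitting matrix
b = [1.0, 2, 5, 9, 14]
x = lsnormal(A, b)
x ≈ A \ b                             # agrees with backslash for this well-conditioned A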

Conditioning and stability

We have already used A\b as the native way to solve the linear least squares problem \(\mathbf{A}\mathbf{x}\approx\mathbf{b}\) in Julia. The algorithm employed by the backslash does not proceed through the normal equations, because of instability.

The conditioning of the linear least-squares problem relates changes in the solution \(\mathbf{x}\) to those in the data, \(\mathbf{A}\) and \(\mathbf{b}\). A full accounting of the condition number is too messy to present here, but we can generalize from the linear system problem \(\mathbf{A} \mathbf{x} =\mathbf{b}\), where \(m=n\). Recall that the condition number of solving \(\mathbf{A} \mathbf{x}=\mathbf{b}\) is \(\kappa(\mathbf{A})=\|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\|\).

Provided that the residual norm \(\|\mathbf{b}-\mathbf{A}\mathbf{x}\|\) at the least squares solution is relatively small, the conditioning of the linear least squares problem is similar to that of the square case. The condition number of the problem generalizes via the pseudoinverse:

(65)\[\kappa(\mathbf{A}) = \|\mathbf{A}\| \cdot \|\mathbf{A}^{+}\|.\]

These are rectangular matrices, but the induced matrix norm is defined by (42) just as in the square case. If \(\mathbf{A}\) has rank less than \(n\), then \(\kappa(\mathbf{A})=\infty\). The Julia function cond computes condition numbers of rectangular matrices in the 2-norm.
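For example, a small sketch with an arbitrary straight-line fitting matrix:

using LinearAlgebra
t = 0:0.5:3
A = [t.^0 t]                            # 7×2 matrix for fitting a straight line
cond(A)                                  # 2-norm condition number of a rectangular matrix
cond(A) ≈ opnorm(A) * opnorm(pinv(A))    # matches the definition (65)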

As an algorithm, the normal equations begin by computing the \(n\times n\) system \((\mathbf{A}^T\mathbf{A})\mathbf{x} = \mathbf{A}^T \mathbf{b}\). When these equations are solved, perturbations to the data can be amplified by a factor \(\kappa(\mathbf{A}^T\mathbf{A})\). It turns out that

(66)\[\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2\]

in the 2-norm. If \(\kappa(\mathbf{A})\) is large, the squaring of it can destabilize the normal equations: while the solution of the least squares problem is sensitive, finding it via the normal equations makes it doubly so.
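The squaring effect in (66) is easy to see numerically. This sketch uses an arbitrary cubic fitting matrix:

using LinearAlgebra
t = range(0, 1, length=100)
A = [t.^0 t t.^2 t.^3]        # cubic fitting matrix
κ = cond(A)
(κ^2, cond(A' * A))            # the two values agree closely, illustrating (66)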

Exercises

  1. ✍ Work out the least squares solution when

    \[\begin{split}\mathbf{A} = \begin{bmatrix} 2 & -1 \\ 0 & 1 \\ -2 & 2 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 1\\-5\\6 \end{bmatrix}.\end{split}\]
  2. ✍ Find the pseudoinverse \(\mathbf{A}^+\) of the matrix \(\mathbf{A}=\begin{bmatrix}1&-2&3\end{bmatrix}^T\).

  3. ✍ Prove the first statement of the AtA theorem: \(\mathbf{A}^T\mathbf{A}\) is symmetric for any \(m\times n\) matrix \(\mathbf{A}\) with \(m \ge n\).

  4. ✍ Prove that if \(\mathbf{A}\) is a nonsingular square matrix, then \(\mathbf{A}^+=\mathbf{A}^{-1}\).

  5. (a) ✍ Show that for any \(m\times n\) matrix \(\mathbf{A}\) (with \(m>n\)) for which \(\mathbf{A}^T\mathbf{A}\) is nonsingular, \(\mathbf{A}^+\mathbf{A}\) is the \(n\times n\) identity.

    (b) ⌨ Show using an example in Julia that \(\mathbf{A}\mathbf{A}^+\) is not an identity matrix. (This matrix has rank no greater than \(n\), so it can’t be an \(m\times m\) identity.)

  6. ✍ Prove that the vector \(\mathbf{A}\mathbf{A}^+\mathbf{b}\) is the vector in the column space (i.e., range) of \(\mathbf{A}\) that is closest to \(\mathbf{b}\), in the sense of the 2-norm.

  7. ✍ Show that the flop count for lsnormal is asymptotically \(\sim 2m n^2 + \tfrac{1}{3}n^3\). (In finding the asymptotic count you can ignore terms like \(mn\) whose total degree is less than three.)

  8. ⌨ Let \(t_1,\ldots,t_m\) be \(m\) equally spaced points in \([0,2\pi]\).

    (a) Let \(\mathbf{A}_\beta\) be the matrix in (58) that corresponds to fitting data with the function \(c_1 + c_2 \sin(t) + c_3 \cos(\beta t)\). Using the identity (66), make a table of the condition numbers of \(\mathbf{A}_\beta\) for \(\beta = 2,1.1,1.01,\ldots,1+10^{-8}\).

    (b) Repeat part (a) using the fitting function \(c_1 + c_2 \sin^2(t) + c_3 \cos^2(\beta t).\)

    (c) Why does it make sense that \(\kappa\bigl(\mathbf{A}_\beta\bigr)\to \infty\) as \(\beta\to 1\) in part (b) but not in part (a)?

  9. ✍ ⌨ When \(\mathbf{A}\) is \(m\times n\) with rank less than \(n\), the pseudoinverse is still defined and can be computed using pinv from LinearAlgebra. However, the behavior in this case is not always intuitive. Let

    \[\begin{split}\mathbf{A}_s = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 0 & s \end{bmatrix}.\end{split}\]

    Then \(\mathbf{A}_0\) has rank equal to one. Demonstrate experimentally that \(\mathbf{A}_0^+\neq \lim_{s\to 0} \mathbf{A}_s^+\).