# The normal equations

We seek to solve the general linear least squares problem: Given $$\mathbf{A}\in\mathbb{R}^{m \times n}$$ and $$\mathbf{b}\in\mathbb{R}^m$$, with $$m>n$$, find $$\mathbf{x}\in\mathbb{R}^n$$ such that $$\| \mathbf{b} - \mathbf{A}\mathbf{x} \|_2$$ is minimized. There is a concise explicit solution to the problem.

In the following proof we make use of the elementary algebraic fact that for two vectors $$\mathbf{u}$$ and $$\mathbf{v}$$,

$(\mathbf{u}+\mathbf{v})^T(\mathbf{u}+\mathbf{v}) = \mathbf{u}^T\mathbf{u} + \mathbf{u}^T\mathbf{v} + \mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v} = \mathbf{u}^T\mathbf{u} + 2\mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v}.$
Theorem 22

If $$\mathbf{x}$$ satisfies $$\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$$, then $$\mathbf{x}$$ solves the linear least squares problem, i.e., $$\mathbf{x}$$ minimizes $$\| \mathbf{b}-\mathbf{A}\mathbf{x} \|_2$$.

Proof

Let $$\mathbf{y}\in \mathbb{R}^n$$ be any vector. Then $$\mathbf{A}(\mathbf{x}+\mathbf{y})-\mathbf{b}=\mathbf{A}\mathbf{x}-\mathbf{b}+\mathbf{A}\mathbf{y}$$, and

$\begin{split} \| \mathbf{A}(\mathbf{x}+\mathbf{y})-\mathbf{b} \|_2^2 &= [(\mathbf{A}\mathbf{x}-\mathbf{b})+(\mathbf{A}\mathbf{y})]^T[(\mathbf{A}\mathbf{x}-\mathbf{b})+(\mathbf{A}\mathbf{y})]\\ &= (\mathbf{A}\mathbf{x}-\mathbf{b})^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + 2(\mathbf{A}\mathbf{y})^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + (\mathbf{A}\mathbf{y})^T(\mathbf{A}\mathbf{y})\\ &= \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2 + 2\mathbf{y}^T \mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + \| \mathbf{A}\mathbf{y} \|_2^2 \\ &= \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2 + \| \mathbf{A}\mathbf{y} \|_2^2 \\ & \ge \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2, \end{split}$

where the cross term vanishes in the fourth line because of the hypothesis $$\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$$.

The condition $$\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$$ is often written as

(63)$\mathbf{A}^T\mathbf{A}\mathbf{x}=\mathbf{A}^T\mathbf{b},$

called the normal equations. They have a straightforward geometric interpretation, as shown in Fig. 2. The vector in the range (column space) of $$\mathbf{A}$$ that lies closest to $$\mathbf{b}$$ makes the vector difference $$\mathbf{A}\mathbf{x}-\mathbf{b}$$ perpendicular to the range. Thus for any $$\mathbf{z}$$, we must have $$(\mathbf{A} \mathbf{z})^T(\mathbf{A}\mathbf{x}-\mathbf{b})=0$$, which is satisfied if $$\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$$.

Fig. 2 Geometry of the normal equations.

If we group the left-hand side of the normal equations as $$(\mathbf{A}^T\mathbf{A})\,\mathbf{x}$$, we recognize (63) as a square $$n\times n$$ linear system to solve for $$\mathbf{x}$$.
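For instance, with an arbitrary small system (the numbers below are illustrative only, not from the text), solving (63) produces the same minimizer as backslash:

```julia
using LinearAlgebra

# Arbitrary small overdetermined system, for illustration only.
A = [1.0 2.0; 3.0 4.0; 5.0 7.0]
b = [1.0, 2.0, 3.0]

x_ne = (A'*A) \ (A'*b)    # solve the square normal equations (63)
x_bs = A \ b              # backslash solves the same least squares problem

@show norm(x_ne - x_bs)   # agreement up to rounding
```

The residual of the computed solution is also orthogonal to the columns of $$\mathbf{A}$$, as the geometric interpretation above requires.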

## Pseudoinverse and definiteness

The $$n\times m$$ matrix

(64)$\mathbf{A}^+ = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$

is called the pseudoinverse of $$\mathbf{A}$$. Mathematically, the overdetermined least squares problem $$\mathbf{A}\mathbf{x}\approx \mathbf{b}$$ has the solution $$\mathbf{x}=\mathbf{A}^+\mathbf{b}$$. Hence we can generalize our earlier observation: backslash is equivalent mathematically to left-multiplication by the inverse (square case) or pseudoinverse (rectangular case) of a matrix. One may also compute the pseudoinverse directly using pinv, but as with matrix inverses, this is rarely necessary in practice.
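As a quick check (with arbitrary data), formula (64) agrees with pinv whenever the columns of $$\mathbf{A}$$ are independent. The explicit inverse below is written out only to verify the formula; it is not a recommended computation:

```julia
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 1.0 2.0]   # full column rank (arbitrary data)
b = [0.5, 1.0, 3.0]

Aplus = inv(A'*A) * A'    # formula (64), for verification only
```

Here `Aplus ≈ pinv(A)`, and `Aplus*b` matches the least squares solution from `A\b`.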

The matrix $$\mathbf{A}^T\mathbf{A}$$ appearing in the pseudoinverse has some important properties.

Theorem 23 (AtA)

For any real $$m\times n$$ matrix $$\mathbf{A}$$ with $$m\ge n$$, the following are true:

1. $$\mathbf{A}^T\mathbf{A}$$ is symmetric.

2. $$\mathbf{A}^T \mathbf{A}$$ is singular if and only if the columns of $$\mathbf{A}$$ are linearly dependent. (Equivalently, the rank of $$\mathbf{A}$$ is less than $$n$$.)

3. If $$\mathbf{A}^T\mathbf{A}$$ is nonsingular, then it is positive definite.

Proof

The first part is left as an exercise. For the second part, note that $$\mathbf{A}^T\mathbf{A}$$ is singular if and only if $$\mathbf{A}^T\mathbf{A}\mathbf{z}=\boldsymbol{0}$$ for some nonzero vector $$\mathbf{z}$$. So suppose that $$\mathbf{A}^T\mathbf{A}\mathbf{z}=\boldsymbol{0}$$. Left-multiplying by $$\mathbf{z}^T$$, we find that

$0 = \mathbf{z}^T\mathbf{A}^T\mathbf{A}\mathbf{z}=(\mathbf{A}\mathbf{z})^T(\mathbf{A}\mathbf{z}) = \| \mathbf{A}\mathbf{z} \|_2^2,$

which is equivalent to $$\mathbf{A}\mathbf{z}=\boldsymbol{0}$$. Such a $$\mathbf{z}$$ can be nonzero if and only if the columns of $$\mathbf{A}$$ are linearly dependent.

Finally, we can repeat the manipulations above to show that for any nonzero $$n$$-vector $$\mathbf{v}$$, $$\mathbf{v}^T(\mathbf{A}^T\mathbf{A})\mathbf{v}=\| \mathbf{A}\mathbf{v} \|_2^2\ge 0$$, and equality is not possible thanks to the second part of the theorem.
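These properties are easy to confirm numerically. In the sketch below the matrix is an arbitrary full-rank choice; issymmetric and isposdef come from Julia's LinearAlgebra, the latter testing definiteness by attempting a Cholesky factorization:

```julia
using LinearAlgebra

A = [1.0 2.0; 0.0 1.0; 4.0 3.0; 2.0 2.0]   # arbitrary 4x2 matrix of rank 2
N = A'*A

@show issymmetric(N)   # part 1: N is symmetric
@show rank(A) == 2     # columns independent, so N is nonsingular (part 2)
@show isposdef(N)      # part 3: N is positive definite
```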

The definition of the pseudoinverse involves taking the inverse of a matrix and is therefore not advisable to use computationally. Instead, we simply use the definition of the normal equations to set up a linear system, which we already know how to solve. In summary, the steps for solving the linear least squares problem $$\mathbf{A}\mathbf{x}\approx\mathbf{b}$$ are:

1. Compute $$\mathbf{N}=\mathbf{A}^T\mathbf{A}$$.

2. Compute $$\mathbf{z} = \mathbf{A}^T\mathbf{b}$$.

3. Solve the $$n\times n$$ linear system $$\mathbf{N}\mathbf{x} = \mathbf{z}$$ for $$\mathbf{x}$$.

In the last step we can exploit the fact, proved in the AtA theorem, that $$\mathbf{N}$$ is symmetric and positive definite, and use Cholesky factorization as in Symmetric positive definite matrices. (The backslash command does this automatically.)

Function 24 (lsnormal)
```julia
"""
    lsnormal(A,b)

Solve a linear least squares problem by the normal equations.
Returns the minimizer of ||b-Ax||.
"""
function lsnormal(A,b)
    N = A'*A;  z = A'*b
    R = cholesky(N).U
    w = forwardsub(R',z)    # solve R'w = z
    x = backsub(R,w)        # solve Rx = w
    return x
end
```
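Function 24 calls forwardsub and backsub, which were defined earlier in the book for triangular systems. The self-contained sketch below substitutes minimal stand-ins built on Julia's triangular matrix types, repeats the function for completeness, and fits a straight line to arbitrary illustrative data:

```julia
using LinearAlgebra

# Minimal stand-ins for the book's forwardsub/backsub triangular solvers.
forwardsub(L, b) = LowerTriangular(Matrix(L)) \ b
backsub(U, b)    = UpperTriangular(Matrix(U)) \ b

# Function 24, repeated here so the demonstration is self-contained.
function lsnormal(A, b)
    N = A'*A;  z = A'*b
    R = cholesky(N).U
    w = forwardsub(R', z)    # solve R'w = z
    x = backsub(R, w)        # solve Rx = w
    return x
end

A = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]   # model c1 + c2*t at t = 1,2,3,4
b = [0.9, 2.1, 2.9, 4.2]
x = lsnormal(A, b)           # agrees with A\b up to rounding
```

Since the Cholesky factor satisfies $$\mathbf{R}^T\mathbf{R}=\mathbf{N}$$, the two triangular solves apply $$\mathbf{N}^{-1}$$ without ever forming an inverse matrix.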

## Conditioning and stability

We have already used A\b as the native way to solve the linear least squares problem $$\mathbf{A}\mathbf{x}\approx\mathbf{b}$$ in Julia. The algorithm employed by backslash does not proceed through the normal equations, because of an instability we now describe.

The conditioning of the linear least-squares problem relates changes in the solution $$\mathbf{x}$$ to those in the data, $$\mathbf{A}$$ and $$\mathbf{b}$$. A full accounting of the condition number is too messy to present here, but we can generalize from the linear system problem $$\mathbf{A} \mathbf{x} =\mathbf{b}$$, where $$m=n$$. Recall that the condition number of solving $$\mathbf{A} \mathbf{x}=\mathbf{b}$$ is $$\kappa(\mathbf{A})=\|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\|$$.

Provided that the residual norm $$\|\mathbf{b}-\mathbf{A}\mathbf{x}\|$$ at the least-squares solution is relatively small, the conditioning of the linear least squares problem is similar. The condition number of the problem generalizes via the pseudoinverse:

(65)$\kappa(\mathbf{A}) = \|\mathbf{A}\| \cdot \|\mathbf{A}^{+}\|.$

These are rectangular matrices, but the induced matrix norm is defined by (42) just as in the square case. If $$\mathbf{A}$$ has rank less than $$n$$, then $$\kappa(\mathbf{A})=\infty$$. The Julia function cond computes condition numbers of rectangular matrices in the 2-norm.

As an algorithm, the normal equations begin by computing the $$n\times n$$ system $$(\mathbf{A}^T\mathbf{A})\mathbf{x} = \mathbf{A}^T \mathbf{b}$$. When these equations are solved, perturbations to the data can be amplified by a factor $$\kappa(\mathbf{A}^T\mathbf{A})$$. It turns out that

(66)$\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2$

in the 2-norm. If $$\kappa(\mathbf{A})$$ is large, this squaring can destabilize the normal equations: even when the least squares problem itself is only moderately sensitive to perturbations in the data, solving it through the normal equations can be much more so.
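The squaring in (66) is easy to observe numerically. The matrix below is a Vandermonde-type matrix, an arbitrary but typically ill-conditioned choice:

```julia
using LinearAlgebra

t = range(0, 1, length=20)
A = [ti^j for ti in t, j in 0:6]   # 20x7 Vandermonde-type matrix

@show cond(A)       # 2-norm condition number, as in (65)
@show cond(A'*A)    # observed to be about cond(A)^2, as in (66)
```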

## Exercises

1. ✍ Work out the least squares solution when

$\begin{split}\mathbf{A} = \begin{bmatrix} 2 & -1 \\ 0 & 1 \\ -2 & 2 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 1\\-5\\6 \end{bmatrix}.\end{split}$
2. ✍ Find the pseudoinverse $$\mathbf{A}^+$$ of the matrix $$\mathbf{A}=\begin{bmatrix}1&-2&3\end{bmatrix}^T$$.

3. ✍ Prove the first statement of the AtA theorem: $$\mathbf{A}^T\mathbf{A}$$ is symmetric for any $$m\times n$$ matrix $$\mathbf{A}$$ with $$m \ge n$$.

4. ✍ Prove that if $$\mathbf{A}$$ is a nonsingular square matrix, then $$\mathbf{A}^+=\mathbf{A}^{-1}$$.

5. (a) ✍ Show that for any $$m\times n$$ $$\mathbf{A}$$ ($$m>n$$) for which $$\mathbf{A}^T\mathbf{A}$$ is nonsingular, $$\mathbf{A}^+\mathbf{A}$$ is the $$n\times n$$ identity.

(b) ⌨ Show using an example in Julia that $$\mathbf{A}\mathbf{A}^+$$ is not an identity matrix. (This matrix has rank no greater than $$n$$, so it can’t be an $$m\times m$$ identity.)

6. ✍ Prove that the vector $$\mathbf{A}\mathbf{A}^+\mathbf{b}$$ is the vector in the column space (i.e., range) of $$\mathbf{A}$$ that is closest to $$\mathbf{b}$$, in the sense of the 2-norm.

7. ✍ Show that the flop count for lsnormal is asymptotically $$\sim 2m n^2 + \tfrac{1}{3}n^3$$. (In finding the asymptotic count you can ignore terms like $$mn$$ whose total degree is less than three.)

8. ⌨ Let $$t_1,\ldots,t_m$$ be $$m$$ equally spaced points in $$[0,2\pi]$$.

(a) Let $$\mathbf{A}_\beta$$ be the matrix in (58) that corresponds to fitting data with the function $$c_1 + c_2 \sin(t) + c_3 \cos(\beta t)$$. Using the identity (66), make a table of the condition numbers of $$\mathbf{A}_\beta$$ for $$\beta = 2,1.1,1.01,\ldots,1+10^{-8}$$.

(b) Repeat part (a) using the fitting function $$c_1 + c_2 \sin^2(t) + c_3 \cos^2(\beta t).$$

(c) Why does it make sense that $$\kappa\bigl(\mathbf{A}_\beta\bigr)\to \infty$$ as $$\beta\to 1$$ in part (b) but not in part (a)?

9. ✍ ⌨ When $$\mathbf{A}$$ is $$m\times n$$ with rank less than $$n$$, the pseudoinverse is still defined and can be computed using pinv from LinearAlgebra. However, the behavior in this case is not always intuitive. Let

$\begin{split}\mathbf{A}_s = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 0 & s \end{bmatrix}.\end{split}$

Then $$\mathbf{A}_0$$ has rank equal to one. Demonstrate experimentally that $$\mathbf{A}_0^+\neq \lim_{s\to 0} \mathbf{A}_s^+$$.