
4.6 Quasi-Newton methods

Newton’s method is a foundation for algorithms to solve equations and minimize quantities. But it is not ideal in its straightforward or pure form. Specifically, its least appealing features are the programming nuisance and computational expense of evaluating the Jacobian matrix, and the tendency of the iteration to diverge from many starting points. There are different quasi-Newton methods that modify the basic idea in an attempt to overcome these issues.

4.6.1 Jacobian by finite differences

In the scalar case, we found an easy alternative to a direct evaluation of the derivative. In retrospect, we may interpret the secant formula (4.4.2) as the Newton formula (4.3.2) with $f'(x_k)$ replaced by the difference quotient

$$\frac{f(x_k)-f(x_{k-1})}{x_k-x_{k-1}}. \tag{4.6.1}$$

If the sequence of $x_k$ values converges to a root $r$, then this quotient converges to $f'(r)$.

In the system case, replacing the Jacobian evaluation is more complicated: derivatives are needed with respect to $n$ variables, not just one. From (4.5.4), we note that the $j$th column of the Jacobian is

$$\mathbf{J}(\mathbf{x}) \mathbf{e}_j = \begin{bmatrix} \frac{\partial{f_1}}{\partial x_j} \\[2mm] \frac{\partial{f_2}}{\partial x_j} \\ \vdots \\ \frac{\partial{f_n}}{\partial x_j} \end{bmatrix}. \tag{4.6.2}$$

(As always, $\mathbf{e}_j$ represents the $j$th column of the identity matrix, here in $n$ dimensions.) Inspired by (4.6.1), we can replace the differentiation with a quotient involving a change in only $x_j$ while the other variables remain fixed:

$$\mathbf{J}(\mathbf{x}) \mathbf{e}_j \approx \frac{\mathbf{f}(\mathbf{x}+\delta \mathbf{e}_j) - \mathbf{f}(\mathbf{x})}{\delta}, \qquad j=1,\ldots,n. \tag{4.6.3}$$

For reasons explained in Chapter 5, $\delta$ is usually chosen close to $\sqrt{\epsilon}$, where $\epsilon$ represents the expected noise or uncertainty level in the evaluation of $\mathbf{f}$. If the only source of noise is floating-point roundoff, then $\delta \approx \sqrt{\epsilon_\text{mach}}$.

The finite-difference formula (4.6.3) is implemented by Function 4.6.1.
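To make (4.6.3) concrete, here is a minimal NumPy sketch of a forward-difference Jacobian. It is only an illustration, not the book's Function 4.6.1; the name `fd_jacobian`, its signature, and the scaling of $\delta$ by $\max(\|\mathbf{x}\|,1)$ are choices made for this example.

```python
import numpy as np

def fd_jacobian(f, x, y=None):
    """Approximate the Jacobian of f at x by forward differences, as in (4.6.3).

    f : function mapping an n-vector to an n-vector
    x : evaluation point (n-vector)
    y : optional precomputed value of f(x), saving one evaluation
    """
    x = np.asarray(x, dtype=float)
    if y is None:
        y = f(x)
    n = x.size
    J = np.empty((n, n))
    # delta near sqrt(eps_mach); scaling by the size of x is one common choice
    delta = np.sqrt(np.finfo(float).eps) * max(np.linalg.norm(x), 1.0)
    for j in range(n):
        xj = x.copy()
        xj[j] += delta                  # perturb only the jth component
        J[:, j] = (f(xj) - y) / delta   # jth column of the Jacobian estimate
    return J

# Example: f(x) = [x1^2 - x2, x1 + x2^3]; the exact Jacobian at (1, 2) is [[2, -1], [1, 12]]
f = lambda x: np.array([x[0]**2 - x[1], x[0] + x[1]**3])
print(fd_jacobian(f, [1.0, 2.0]))
```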

4.6.2 Broyden’s update

The finite-difference Jacobian is easy to conceive and use. But, as you can see from (4.6.3), it requires $n$ additional evaluations of the system function at each iteration, which can be unacceptably slow in some applications. Conceptually these function evaluations seem especially wasteful given that the root estimates, and thus presumably the Jacobian matrix, are supposed to change little as the iteration converges. This is a good time to step in with the principle of approximate approximation, which suggests looking for a shortcut in the form of a cheap-but-good-enough way to update the Jacobian from one iteration to the next.

Recall that the Newton iteration is derived by solving the linear model implied by (4.5.3):

$$\mathbf{f}(\mathbf{x}_{k+1}) \approx \mathbf{f}(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)\,(\mathbf{x}_{k+1}-\mathbf{x}_k) = \boldsymbol{0}. \tag{4.6.4}$$

Let $\mathbf{s}_k=\mathbf{x}_{k+1}-\mathbf{x}_k$ be the Newton step. Let $\mathbf{y}_k=\mathbf{f}(\mathbf{x}_k)$, and now we replace $\mathbf{J}(\mathbf{x}_k)$ by a matrix $\mathbf{A}_{k}$ that is meant to approximate the Jacobian. Hence the Newton step is considered to be defined, as in Algorithm 4.5.1, by

$$\mathbf{A}_k \mathbf{s}_k = -\mathbf{y}_k. \tag{4.6.5}$$

Once $\mathbf{x}_{k+1}$ is obtained, we should update the approximate Jacobian to a new $\mathbf{A}_{k+1}$. If we think one-dimensionally for a moment, the secant method would assume that $A_{k+1}=(f_{k+1}-f_k)/(x_{k+1}-x_k)$. It’s not easy to generalize a fraction to vectors, but we can do it if we instead write it as

$$\mathbf{y}_{k+1}-\mathbf{y}_k = \mathbf{A}_{k+1} (\mathbf{x}_{k+1}-\mathbf{x}_k) = \mathbf{A}_{k+1} \mathbf{s}_k. \tag{4.6.6}$$

This is used to justify the following requirement:

$$\mathbf{A}_{k+1} \mathbf{s}_k = \mathbf{y}_{k+1}-\mathbf{y}_k. \tag{4.6.7}$$

This isn’t enough to uniquely determine $\mathbf{A}_{k+1}$. However, if we also require that $\mathbf{A}_{k+1}-\mathbf{A}_k$ is a matrix of rank 1, then one arrives at the Broyden update formula,

$$\mathbf{A}_{k+1} = \mathbf{A}_k + \frac{1}{\mathbf{s}_k^T \mathbf{s}_k}\,\bigl(\mathbf{y}_{k+1}-\mathbf{y}_k-\mathbf{A}_k\mathbf{s}_k\bigr)\,\mathbf{s}_k^T. \tag{4.6.8}$$

Observe that $\mathbf{A}_{k+1}-\mathbf{A}_k$ is proportional to the outer product of two vectors, and that computing it requires no extra evaluations of $\mathbf{f}$. Remarkably, under reasonable assumptions, the sequence of $\mathbf{x}_k$ resulting when Broyden updates are used converges superlinearly, even though the matrices $\mathbf{A}_k$ do not necessarily converge to the Jacobian of $\mathbf{f}$.

In practice, one typically uses finite differences to initialize the Jacobian at iteration $k=1$. If for some $k$ the step computed by the update formula fails to make enough improvement in the residual, then $\mathbf{A}_k$ is reinitialized by finite differences and the step is recalculated.
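As a concrete illustration of the rank-1 update in (4.6.8), here is a small NumPy sketch. The helper name `broyden_update` is invented for this example and is not part of the book's code.

```python
import numpy as np

def broyden_update(A, s, dy):
    """Broyden rank-1 update, as in (4.6.8): return A_new satisfying A_new @ s == dy.

    A  : current approximate Jacobian
    s  : the step x_{k+1} - x_k just taken
    dy : the change in residual, f(x_{k+1}) - f(x_k)
    """
    return A + np.outer(dy - A @ s, s) / np.dot(s, s)

# Quick check on arbitrary data: the secant condition (4.6.7) holds,
# and the change to A has rank 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
s = rng.standard_normal(3)
dy = rng.standard_normal(3)
A_new = broyden_update(A, s, dy)
print(np.allclose(A_new @ s, dy))          # True
print(np.linalg.matrix_rank(A_new - A))    # 1
```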

4.6.3 Levenberg’s method

The most difficult part of many rootfinding problems is finding a starting point that will lead to convergence. The linear model implicitly constructed during a Newton iteration—whether we use an exact, finite-difference, or iteratively updated Jacobian matrix—becomes increasingly inaccurate as one ventures farther from the most recent root estimate, eventually failing to resemble the exact function much at all.

Although one could imagine trying to do a detailed accuracy analysis of each linear model as we go, in practice simple strategies are valuable here. Suppose, after computing the step suggested by the linear model, we ask a binary question: Would taking that step improve our situation? Since we are trying to find a root of $\mathbf{f}$, we have a quantitative way to pose this question: Does the backward error $\|\mathbf{f}\|$ decrease? If not, we should reject the step and find an alternative.

There are several ways to find alternatives to the standard step, but we will consider just one of them, based on the parameterized equation

$$(\mathbf{A}_k^T \mathbf{A}_k + \lambda \mathbf{I})\,\mathbf{s}_k = -\mathbf{A}_k^T \mathbf{f}_k. \tag{4.6.9}$$

Some justification of (4.6.9) comes from considering extreme cases for $\lambda$. If $\lambda=0$, then

$$\mathbf{A}_k^T \mathbf{A}_k \mathbf{s}_k = -\mathbf{A}_k^T \mathbf{f}_k, \tag{4.6.10}$$

which is equivalent to the definition of the usual linear model (i.e., Newton or quasi-Newton) step (4.6.5). On the other hand, as $\lambda\to\infty$, Equation (4.6.9) approaches

$$\lambda \mathbf{s}_k = - \mathbf{A}_k^T \mathbf{f}_k. \tag{4.6.11}$$

To interpret this equation, define the scalar residual function

$$\phi(\mathbf{x})=\mathbf{f}(\mathbf{x})^T\mathbf{f}(\mathbf{x}) = \|\mathbf{f}(\mathbf{x})\|^2. \tag{4.6.12}$$

Finding a root of $\mathbf{f}$ is equivalent to minimizing $\phi$. A calculation shows that the gradient of $\phi$ is

$$\nabla \phi(\mathbf{x}) = 2 \mathbf{J}(\mathbf{x})^T \mathbf{f}(\mathbf{x}). \tag{4.6.13}$$

Hence, if $\mathbf{A}_k=\mathbf{J}(\mathbf{x}_k)$, then $\mathbf{s}_k$ from (4.6.11) is in the opposite direction from the gradient vector. In vector calculus you learn that this direction is the one of most rapid decrease, or steepest descent. A small enough step in this direction is guaranteed in all but pathological cases to decrease $\phi$, which is exactly what we want from a backup plan.

In effect, the $\lambda$ parameter in (4.6.9) allows a smooth transition between the pure Newton step, for which convergence is very rapid near a root, and a small step in the gradient descent direction, which guarantees progress for the iteration when we are far from a root.
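The following short NumPy sketch shows one way to compute the proposed step from (4.6.9); the function name `levenberg_step` is chosen here for illustration and is not the book's Function 4.6.3. For better conditioning one might instead solve an equivalent linear least-squares problem rather than forming $\mathbf{A}_k^T\mathbf{A}_k$ explicitly (see Exercise 7).

```python
import numpy as np

def levenberg_step(A, fx, lam):
    """Solve (A^T A + lam*I) s = -A^T fx for the proposed step, as in (4.6.9)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), -A.T @ fx)

# Demo on a linear residual f(x) = A x - b, whose root is x = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
fx = A @ np.zeros(2) - b              # residual at x = 0
print(levenberg_step(A, fx, 0.0))     # lam = 0: the full Newton step, landing on the root
print(levenberg_step(A, fx, 1e6))     # large lam: a tiny step along -A^T fx (steepest descent)
```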

4.6.4 Implementation

To a large extent, the decisions to use finite differences, Jacobian updates, and the Levenberg step are independent of one another. Function 4.6.3 shows how they might be combined. This function is one of the most logically complex we have encountered so far.

Each pass through the loop starts by using (4.6.9) to propose a step $\mathbf{s}_k$. The function then asks whether using this step would decrease the value of $\|\mathbf{f}\|$ from its present value. If so, we accept the new root estimate, we decrease $\lambda$ in order to get more Newton-like (since things have gone well), and we apply the Broyden formula to get a cheap update of the Jacobian. If the proposed step is not successful, we increase $\lambda$ to get more gradient-like (since we just failed) and, if the current Jacobian was the result of a cheap update, use finite differences to reevaluate it.
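The logic just described might be sketched as follows in NumPy. This is only an illustration under simple assumptions (fixed factors for adjusting $\lambda$, a crude stopping test), not a reproduction of Function 4.6.3; the function name `quasi_newton_levenberg` and all parameter defaults are invented for this example.

```python
import numpy as np

def quasi_newton_levenberg(f, x0, maxiter=40, tol=1e-12):
    """Levenberg-style quasi-Newton iteration for f(x) = 0 (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)

    def fdjac(x, y):
        # finite-difference Jacobian, as in (4.6.3)
        delta = np.sqrt(np.finfo(float).eps) * max(np.linalg.norm(x), 1.0)
        J = np.empty((x.size, x.size))
        for j in range(x.size):
            xj = x.copy()
            xj[j] += delta
            J[:, j] = (f(xj) - y) / delta
        return J

    y = f(x)
    A = fdjac(x, y)             # initialize the Jacobian by finite differences
    jac_is_fresh = True
    lam = 10.0
    for _ in range(maxiter):
        # propose a step from (4.6.9)
        s = np.linalg.solve(A.T @ A + lam * np.eye(x.size), -A.T @ y)
        y_new = f(x + s)
        if np.linalg.norm(y_new) < np.linalg.norm(y):
            # success: accept, become more Newton-like, apply the cheap Broyden update
            lam /= 10
            A = A + np.outer(y_new - y - A @ s, s) / np.dot(s, s)
            jac_is_fresh = False
            x, y = x + s, y_new
            if np.linalg.norm(s) < tol or np.linalg.norm(y) < tol:
                break
        else:
            # failure: become more gradient-like; refresh the Jacobian if it is stale
            lam *= 4
            if not jac_is_fresh:
                A = fdjac(x, y)
                jac_is_fresh = True
    return x

# Example: intersect the unit circle with the line y = x, starting from (1, 0.1).
f = lambda x: np.array([x[0]**2 + x[1]**2 - 1, x[1] - x[0]])
print(quasi_newton_levenberg(f, [1.0, 0.1]))   # approximately [0.7071, 0.7071]
```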

In some cases our simple logic in Function 4.6.3 can make $\lambda$ oscillate between small and large values; several better but more complicated strategies for controlling $\lambda$ are known. In addition, the linear system (4.6.9) is usually modified to get the well-known Levenberg–Marquardt algorithm, which does a superior job in some problems as $\lambda\to \infty$.

4.6.5 Exercises

  1. ⌨ (Variation on Exercise 4.5.2.) Two curves in the $(u,v)$ plane are defined implicitly by the equations $u\log u + v \log v = -0.3$ and $u^4 + v^2 = 1$.

    (a) ✍ Write the intersection of these curves in the form $\mathbf{f}(\mathbf{x}) = \boldsymbol{0}$ for two-dimensional $\mathbf{f}$ and $\mathbf{x}$.

    (b) ⌨ Use Function 4.6.3 to find an intersection point starting from $u=1$, $v=0.1$.

    (c) ⌨ Use Function 4.6.3 to find an intersection point starting from $u=0.1$, $v=1$.

  2. ⌨ (Variation on Exercise 4.5.4) Two elliptical orbits $(x_1(s),y_1(s))$ and $(x_2(t),y_2(t))$ are described by the equations

    $$\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = \begin{bmatrix} -5+10\cos(t) \\ 6\sin(t) \end{bmatrix}, \qquad \begin{bmatrix} x_2(t)\\y_2(t) \end{bmatrix} = \begin{bmatrix} 8\cos(t) \\ 1+12\sin(t) \end{bmatrix},$$

    where $t$ represents time.

    (a) ✍ Write out a $2\times 2$ nonlinear system of equations that describes an intersection of these orbits. (Note: An intersection is not the same as a collision—they don’t have to occupy the same point at the same time.)

    (b) ⌨ Use Function 4.6.3 to find all of the unique intersections.

  3. ⌨ (Variation on Exercise 4.5.5) Suppose one wants to find the points on the ellipsoid $x^2/25 + y^2/16 + z^2/9 = 1$ that are closest to and farthest from the point $(5,4,3)$. The method of Lagrange multipliers implies that any such point satisfies

    $$\begin{split} x-5 &= \frac{\lambda x}{25}, \\[1mm] y-4 &= \frac{\lambda y}{16}, \\[1mm] z-3 &= \frac{\lambda z}{9}, \\[1mm] 1 &= \frac{1}{25}x^2 + \frac{1}{16}y^2 + \frac{1}{9}z^2 \end{split}$$

    for an unknown value of $\lambda$.

    (a) Write out this system in the form $\mathbf{f}(\mathbf{u}) = \boldsymbol{0}$. (Note that the system has four variables to go with the four equations.)

    (b) Use Function 4.6.3 with different initial guesses to find the two roots of this system. Which is the closest point to $(5,4,3)$, and which is the farthest?

  4. ✍ The Broyden update formula (4.6.8) is just one instance of so-called rank-1 updating. Verify the Sherman–Morrison formula,

    $$(\mathbf{A}+\mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\frac{\mathbf{u}\mathbf{v}^T}{1+\mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}}\mathbf{A}^{-1},$$

    which is valid whenever $\mathbf{A}$ is invertible and the denominator above is nonzero. (Hint: Show that $\mathbf{A}+\mathbf{u}\mathbf{v}^T$ times the matrix above simplifies to the identity matrix.)

  5. ✍ Derive Equation (4.6.13).

  6. ⌨ (See also Exercise 4.5.11.) Suppose that

    $$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} x_1x_2+x_2^2-1 \\[1mm] x_1x_2^3 + x_1^2x_2^2 + 1 \end{bmatrix}.$$

    Let $\mathbf{x}_1=[-2,1]^T$ and let $\mathbf{A}_1=\mathbf{J}(\mathbf{x}_1)$ be the exact Jacobian.

    (a) Solve (4.6.9) for $\mathbf{s}_1$ with $\lambda=0$; this is the “pure” Newton step. Show numerically that $\|\mathbf{f}(\mathbf{x}_1+\mathbf{s}_1)\| > \|\mathbf{f}(\mathbf{x}_1)\|$. (Thus, the Newton step made us go to a point seemingly farther from a root than where we started.)

    (b) Now repeat part (a) with $\lambda=0.01j$ for $j=1,2,3,\ldots$. What is the smallest value of $j$ such that $\|\mathbf{f}(\mathbf{x}_1+\mathbf{s}_1)\| < \|\mathbf{f}(\mathbf{x}_1)\|$?

  7. ✍ Show that Equation (4.6.9) is equivalent to the linear least-squares problem

    $$\min_{\mathbf{v}} \Bigl( \bigl\|\mathbf{A}_k\mathbf{v} + \mathbf{f}_k\bigr\|_2^2 + \lambda^2 \bigl\| \mathbf{v} \bigr\|_2^2 \Bigr).$$

    (Hint: Express the minimized quantity using block matrix notation, such that (4.6.9) becomes the normal equations for it.)

    Thus, another interpretation of Levenberg’s method is that it is the Newton step plus a penalty, weighted by $\lambda$, for taking large steps.