Newton’s method is a foundation for algorithms to solve equations and minimize quantities. But it is not ideal in its straightforward or pure form. Specifically, its least appealing features are the programming nuisance and computational expense of evaluating the Jacobian matrix, and the tendency of the iteration to diverge from many starting points. There are different quasi-Newton methods that modify the basic idea in an attempt to overcome these issues.
In the scalar case, we found an easy alternative to a direct evaluation of the derivative. In retrospect, we may interpret the secant formula (4.4.2) as the Newton formula (4.3.2) with f′(xk) replaced by the difference quotient
If the sequence of xk values converges to a root r, then this quotient converges to f′(r).
In the system case, replacing the Jacobian evaluation is more complicated: derivatives are needed with respect to n variables, not just one. From (4.5.4), we note that the jth column of the Jacobian is
(As always, ej represents the jth column of the identity matrix, here in n dimensions.) Inspired by (4.6.1), we can replace the differentiation with a quotient involving a change in only xj while the other variables remain fixed:
For reasons explained in Chapter 5, δ is usually chosen close to ϵ, where ε represents the expected noise or uncertainty level in evaluation of f. If the only source of noise is floating-point roundoff, then δ≈ϵmach.
The finite-difference Jacobian is easy to conceive and use. But, as you can see from (4.6.3), it requires n additional evaluations of the system function at each iteration, which can be unacceptably slow in some applications. Conceptually these function evaluations seem especially wasteful given that the root estimates, and thus presumably the Jacobian matrix, are supposed to change little as the iteration converges. This is a good time to step in with the principle of approximate approximation, which suggests looking for a shortcut in the form of a cheap-but-good-enough way to update the Jacobian from one iteration to the next.
Recall that the Newton iteration is derived by solving the linear model implied by (4.5.3):
Let sk=xk+1−xk be the Newton step. Let yk=f(xk), and now we replace J(xk) by a matrix Ak that is meant to approximate the Jacobian. Hence the Newton step is considered to be defined, as in Algorithm 4.5.1, by
Once xk+1 is obtained, we should update the approximate Jacobian to a new Ak+1. If we think one-dimensionally for a moment, the secant method would assume that Ak+1=(fk+1−fk)/(xk+1−xk). It’s not easy to generalize a fraction to vectors, but we can do it if we instead write it as
This isn’t enough to uniquely determine Ak+1. However, if we also require that Ak+1−Ak is a matrix of rank 1, then one arrives at the following.
Observe that Ak+1−Ak is proportional to the outer product of two vectors, and that computing it requires no extra evaluations of f. Remarkably, under reasonable assumptions, the sequence of xk resulting when Broyden updates are used converges superlinearly, even though the matrices Ak do not necessarily converge to the Jacobian of f.
In practice, one typically uses finite differences to initialize the Jacobian at iteration k=1. If for some k the step computed by the update formula fails to make enough improvement in the residual, then Ak is reinitialized by finite differences and the step is recalculated.
The most difficult part of many rootfinding problems is finding a starting point that will lead to convergence. The linear model implicitly constructed during a Newton iteration—whether we use an exact, finite-difference, or iteratively updated Jacobian matrix—becomes increasingly inaccurate as one ventures farther from the most recent root estimate, eventually failing to resemble the exact function much at all.
Although one could imagine trying to do a detailed accuracy analysis of each linear model as we go, in practice simple strategies are valuable here. Suppose, after computing the step suggested by the linear model, we ask a binary question: Would taking that step improve our situation? Since we are trying to find a root of f, we have a quantitative way to pose this question: Does the backward error ∥f∥ decrease? If not, we should reject the step and find an alternative.
There are several ways to find alternatives to the standard step, but we will consider just one of them, based on the parameterized equation
which is equivalent to the definition of the usual linear model (i.e., Newton or quasi-Newton) step (4.6.5). On the other hand, as λ→∞, Equation (4.6.9) approaches
Hence, if Ak=J(xk), then sk from (4.6.11) is in the opposite direction from the gradient vector. In vector calculus you learn that this direction is the one of most rapid decrease or steepest descent. A small enough step in this direction is guaranteed in all but pathological cases to decrease ϕ, which is exactly what we want from a backup plan.
In effect, the λ parameter in (4.6.9) allows a smooth transition between the pure Newton step, for which convergence is very rapid near a root, and a small step in the gradient descent direction, which guarantees progress for the iteration when we are far from a root.
To a large extent, the incorporation of finite differences, Jacobian updates, and Levenberg step are independent decisions. Function 4.6.3 shows how they might be combined. This function is one of the most logically complex we have encountered so far.
Each pass through the loop starts by using (4.6.9) to propose a step sk. The function then asks whether using this step would decrease the value of ∥f∥ from its present value. If so, we accept the new root estimate, we decrease λ in order to get more Newton-like (since things have gone well), and we apply the Broyden formula to get a cheap update of the Jacobian. If the proposed step is not successful, we increase λ to get more gradient-like (since we just failed) and, if the current Jacobian was the result of a cheap update, use finite differences to reevaluate it.
In some cases our simple logic in Function 4.6.3 can make λ oscillate between small and large values; several better but more complicated strategies for controlling λ are known. In addition, the linear system (4.6.9) is usually modified to get the well-known Levenberg–Marquardt algorithm, which does a superior job in some problems as λ→∞.
(a) ✍ Write out a 2×2 nonlinear system of equations that describes an intersection of these orbits. (Note: An intersection is not the same as a collision—they don’t have to occupy the same point at the same time.)
(b) ⌨ Use Function 4.6.3 to find all of the unique intersections.
⌨ (Variation on Exercise 4.5.5) Suppose one wants to find the points on the ellipsoid x2/25+y2/16+z2/9=1 that are closest to and farthest from the point (5,4,3). The method of Lagrange multipliers implies that any such point satisfies
(a) Write out this system in the form f(u)=0. (Note that the system has four variables to go with the four equations.)
(b) Use Function 4.6.3 with different initial guesses to find the two roots of this system. Which is the closest point to (5,4,3), and which is the farthest?
✍ The Broyden update formula (4.6.8) is just one instance of so-called rank-1 updating. Verify the Sherman–Morrison formula,
which is valid whenever A is invertible and the denominator above is nonzero. (Hint: Show that A+uvT times the matrix above simplifies to the identity matrix.)
Let x1=[−2,1]T and let A1=J(x1) be the exact Jacobian.
(a) Solve (4.6.9) for s1 with λ=0; this is the “pure” Newton step. Show numerically that ∥f(x1+s1)∥>∥f(x1)∥. (Thus, the Newton step made us go to a point seemingly farther from a root than where we started.)
(b) Now repeat part (a) with λ=0.01j for j=1,2,3,…. What is the smallest value of j such that ∥f(x1+s1)∥<∥f(x1)∥?
✍ Show that Equation (4.6.9) is equivalent to the linear least-squares problem