
4.6 Quasi-Newton methods

Newton’s method is a foundation for algorithms to solve equations and minimize quantities. But it is not ideal in its straightforward or pure form. Specifically, its least appealing features are the programming nuisance and computational expense of evaluating the Jacobian matrix, and the tendency of the iteration to diverge from many starting points. There are different quasi-Newton methods that modify the basic idea in an attempt to overcome these issues.

4.6.1 Jacobian by finite differences

In the scalar case, we found an easy alternative to a direct evaluation of the derivative. In retrospect, we may interpret the secant formula (4.4.2) as the Newton formula (4.3.2) with $f'(x_k)$ replaced by the difference quotient

$$\frac{f(x_k)-f(x_{k-1})}{x_k-x_{k-1}}. \tag{4.6.1}$$

If the sequence of $x_k$ values converges to a root $r$, then this quotient converges to $f'(r)$.

In the system case, replacing the Jacobian evaluation is more complicated: derivatives are needed with respect to $n$ variables, not just one. From (4.5.4), we note that the $j$th column of the Jacobian is

$$\mathbf{J}(\mathbf{x}) \mathbf{e}_j = \begin{bmatrix} \frac{\partial{f_1}}{\partial x_j} \\[2mm] \frac{\partial{f_2}}{\partial x_j} \\ \vdots \\ \frac{\partial{f_n}}{\partial x_j} \end{bmatrix}. \tag{4.6.2}$$

(As always, $\mathbf{e}_j$ represents the $j$th column of the identity matrix, here in $n$ dimensions.) Inspired by (4.6.1), we can replace the differentiation with a quotient involving a change in only $x_j$ while the other variables remain fixed:

$$\mathbf{J}(\mathbf{x}) \mathbf{e}_j \approx \frac{\mathbf{f}(\mathbf{x}+\delta \mathbf{e}_j) - \mathbf{f}(\mathbf{x})}{\delta}, \qquad j=1,\ldots,n. \tag{4.6.3}$$

For reasons explained in Chapter 5, $\delta$ is usually chosen close to $\sqrt{\epsilon}$, where $\epsilon$ represents the expected noise or uncertainty level in the evaluation of $\mathbf{f}$. If the only source of noise is floating-point roundoff, then $\delta \approx \sqrt{\epsilon_\text{mach}}$.

The finite-difference formula (4.6.3) is implemented by Function 4.6.1.
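To make (4.6.3) concrete, here is a minimal NumPy sketch of a forward-difference Jacobian. It is only an illustration, not the book's Function 4.6.1; the name `fd_jacobian`, its signature, and the scaling of $\delta$ by $\max(\|\mathbf{x}\|,1)$ are choices made for this example.

```python
import numpy as np

def fd_jacobian(f, x, y=None):
    """Approximate the Jacobian of f at x by forward differences, as in (4.6.3).

    f : function mapping an n-vector to an n-vector
    x : evaluation point (n-vector)
    y : optional precomputed value of f(x), saving one evaluation
    """
    x = np.asarray(x, dtype=float)
    if y is None:
        y = f(x)
    n = x.size
    J = np.empty((n, n))
    # delta near sqrt(eps_mach); scaling by the size of x is one common choice
    delta = np.sqrt(np.finfo(float).eps) * max(np.linalg.norm(x), 1.0)
    for j in range(n):
        xj = x.copy()
        xj[j] += delta                  # perturb only the jth component
        J[:, j] = (f(xj) - y) / delta   # jth column of the Jacobian estimate
    return J

# Example: f(x) = [x1^2 - x2, x1 + x2^3]; the exact Jacobian at (1, 2) is [[2, -1], [1, 12]]
f = lambda x: np.array([x[0]**2 - x[1], x[0] + x[1]**3])
print(fd_jacobian(f, [1.0, 2.0]))
```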

4.6.2 Broyden’s update

The finite-difference Jacobian is easy to conceive and use. But, as you can see from (4.6.3), it requires $n$ additional evaluations of the system function at each iteration, which can be unacceptably slow in some applications. Conceptually these function evaluations seem especially wasteful given that the root estimates, and thus presumably the Jacobian matrix, are supposed to change little as the iteration converges. This is a good time to step in with the principle of approximate approximation, which suggests looking for a shortcut in the form of a cheap-but-good-enough way to update the Jacobian from one iteration to the next.

Recall that the Newton iteration is derived by solving the linear model implied by (4.5.3):

$$\mathbf{f}(\mathbf{x}_{k+1}) \approx \mathbf{f}(\mathbf{x}_k) + \mathbf{J}(\mathbf{x}_k)\,(\mathbf{x}_{k+1}-\mathbf{x}_k) = \boldsymbol{0}. \tag{4.6.4}$$

Let $\mathbf{s}_k=\mathbf{x}_{k+1}-\mathbf{x}_k$ be the Newton step. Let $\mathbf{y}_k=\mathbf{f}(\mathbf{x}_k)$, and now we replace $\mathbf{J}(\mathbf{x}_k)$ by a matrix $\mathbf{A}_{k}$ that is meant to approximate the Jacobian. Hence the Newton step is considered to be defined, as in Algorithm 4.5.1, by

$$\mathbf{A}_k \mathbf{s}_k = -\mathbf{y}_k. \tag{4.6.5}$$

Once $\mathbf{x}_{k+1}$ is obtained, we should update the approximate Jacobian to a new $\mathbf{A}_{k+1}$. If we think one-dimensionally for a moment, the secant method would assume that $A_{k+1}=(f_{k+1}-f_k)/(x_{k+1}-x_k)$. It’s not easy to generalize a fraction to vectors, but we can do it if we instead write it as

$$\mathbf{y}_{k+1}-\mathbf{y}_k = \mathbf{A}_{k+1} (\mathbf{x}_{k+1}-\mathbf{x}_k) = \mathbf{A}_{k+1} \mathbf{s}_k. \tag{4.6.6}$$

This is used to justify the following requirement:

$$\mathbf{A}_{k+1} \mathbf{s}_k = \mathbf{y}_{k+1}-\mathbf{y}_k. \tag{4.6.7}$$

This isn’t enough to uniquely determine $\mathbf{A}_{k+1}$. However, if we also require that $\mathbf{A}_{k+1}-\mathbf{A}_k$ is a matrix of rank 1, then one arrives at the Broyden update formula,

$$\mathbf{A}_{k+1} = \mathbf{A}_k + \frac{1}{\mathbf{s}_k^T \mathbf{s}_k}\,\bigl(\mathbf{y}_{k+1}-\mathbf{y}_k-\mathbf{A}_k\mathbf{s}_k\bigr)\,\mathbf{s}_k^T. \tag{4.6.8}$$

Observe that $\mathbf{A}_{k+1}-\mathbf{A}_k$ is proportional to the outer product of two vectors, and that computing it requires no extra evaluations of $\mathbf{f}$. Remarkably, under reasonable assumptions, the sequence of $\mathbf{x}_k$ resulting when Broyden updates are used converges superlinearly, even though the matrices $\mathbf{A}_k$ do not necessarily converge to the Jacobian of $\mathbf{f}$.

In practice, one typically uses finite differences to initialize the Jacobian at iteration $k=1$. If for some $k$ the step computed by the update formula fails to make enough improvement in the residual, then $\mathbf{A}_k$ is reinitialized by finite differences and the step is recalculated.
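As a concrete illustration of the rank-1 update in (4.6.8), here is a small NumPy sketch. The helper name `broyden_update` is invented for this example and is not part of the book's code.

```python
import numpy as np

def broyden_update(A, s, dy):
    """Broyden rank-1 update, as in (4.6.8): return A_new satisfying A_new @ s == dy.

    A  : current approximate Jacobian
    s  : the step x_{k+1} - x_k just taken
    dy : the change in residual, f(x_{k+1}) - f(x_k)
    """
    return A + np.outer(dy - A @ s, s) / np.dot(s, s)

# Quick check on arbitrary data: the secant condition (4.6.7) holds,
# and the change to A has rank 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
s = rng.standard_normal(3)
dy = rng.standard_normal(3)
A_new = broyden_update(A, s, dy)
print(np.allclose(A_new @ s, dy))          # True
print(np.linalg.matrix_rank(A_new - A))    # 1
```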

4.6.3 Levenberg’s method

The most difficult part of many rootfinding problems is finding a starting point that will lead to convergence. The linear model implicitly constructed during a Newton iteration—whether we use an exact, finite-difference, or iteratively updated Jacobian matrix—becomes increasingly inaccurate as one ventures farther from the most recent root estimate, eventually failing to resemble the exact function much at all.

Although one could imagine trying to do a detailed accuracy analysis of each linear model as we go, in practice simple strategies are valuable here. Suppose, after computing the step suggested by the linear model, we ask a binary question: Would taking that step improve our situation? Since we are trying to find a root of $\mathbf{f}$, we have a quantitative way to pose this question: Does the backward error $\|\mathbf{f}\|$ decrease? If not, we should reject the step and find an alternative.

There are several ways to find alternatives to the standard step, but we will consider just one of them, based on the parameterized equation

$$(\mathbf{A}_k^T \mathbf{A}_k + \lambda \mathbf{I})\,\mathbf{s}_k = -\mathbf{A}_k^T \mathbf{f}_k. \tag{4.6.9}$$

Some justification of (4.6.9) comes from considering extreme cases for $\lambda$. If $\lambda=0$, then

$$\mathbf{A}_k^T \mathbf{A}_k \mathbf{s}_k = -\mathbf{A}_k^T \mathbf{f}_k, \tag{4.6.10}$$

which is equivalent to the definition of the usual linear model (i.e., Newton or quasi-Newton) step (4.6.5). On the other hand, as $\lambda\to\infty$, Equation (4.6.9) approaches

$$\lambda \mathbf{s}_k = - \mathbf{A}_k^T \mathbf{f}_k. \tag{4.6.11}$$

To interpret this equation, define the scalar residual function

$$\phi(\mathbf{x})=\mathbf{f}(\mathbf{x})^T\mathbf{f}(\mathbf{x}) = \|\mathbf{f}(\mathbf{x})\|^2. \tag{4.6.12}$$

Finding a root of $\mathbf{f}$ is equivalent to minimizing $\phi$. A calculation shows that the gradient of $\phi$ is

$$\nabla \phi(\mathbf{x}) = 2 \mathbf{J}(\mathbf{x})^T \mathbf{f}(\mathbf{x}). \tag{4.6.13}$$

Hence, if $\mathbf{A}_k=\mathbf{J}(\mathbf{x}_k)$, then $\mathbf{s}_k$ from (4.6.11) is in the opposite direction from the gradient vector. In vector calculus you learn that this direction is the one of most rapid decrease, or steepest descent. A small enough step in this direction is guaranteed in all but pathological cases to decrease $\phi$, which is exactly what we want from a backup plan.

In effect, the $\lambda$ parameter in (4.6.9) allows a smooth transition between the pure Newton step, for which convergence is very rapid near a root, and a small step in the gradient descent direction, which guarantees progress for the iteration when we are far from a root.
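The following short NumPy sketch shows one way to compute the proposed step from (4.6.9); the function name `levenberg_step` is chosen here for illustration and is not the book's Function 4.6.3. For better conditioning one might instead solve an equivalent linear least-squares problem rather than forming $\mathbf{A}_k^T\mathbf{A}_k$ explicitly (see Exercise 7).

```python
import numpy as np

def levenberg_step(A, fx, lam):
    """Solve (A^T A + lam*I) s = -A^T fx for the proposed step, as in (4.6.9)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), -A.T @ fx)

# Demo on a linear residual f(x) = A x - b, whose root is x = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
fx = A @ np.zeros(2) - b              # residual at x = 0
print(levenberg_step(A, fx, 0.0))     # lam = 0: the full Newton step, landing on the root
print(levenberg_step(A, fx, 1e6))     # large lam: a tiny step along -A^T fx (steepest descent)
```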

4.6.4 Implementation

To a large extent, the decisions to use finite differences, Jacobian updates, and the Levenberg step are independent of one another. Function 4.6.3 shows how they might be combined. This function is one of the most logically complex we have encountered so far.

Each pass through the loop starts by using (4.6.9) to propose a step $\mathbf{s}_k$. The function then asks whether using this step would decrease the value of $\|\mathbf{f}\|$ from its present value. If so, we accept the new root estimate, we decrease $\lambda$ in order to get more Newton-like (since things have gone well), and we apply the Broyden formula to get a cheap update of the Jacobian. If the proposed step is not successful, we increase $\lambda$ to get more gradient-like (since we just failed) and, if the current Jacobian was the result of a cheap update, use finite differences to reevaluate it.
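The logic just described might be sketched as follows in NumPy. This is only an illustration under simple assumptions (fixed factors for adjusting $\lambda$, a crude stopping test), not a reproduction of Function 4.6.3; the function name `quasi_newton_levenberg` and all parameter defaults are invented for this example.

```python
import numpy as np

def quasi_newton_levenberg(f, x0, maxiter=40, tol=1e-12):
    """Levenberg-style quasi-Newton iteration for f(x) = 0 (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)

    def fdjac(x, y):
        # finite-difference Jacobian, as in (4.6.3)
        delta = np.sqrt(np.finfo(float).eps) * max(np.linalg.norm(x), 1.0)
        J = np.empty((x.size, x.size))
        for j in range(x.size):
            xj = x.copy()
            xj[j] += delta
            J[:, j] = (f(xj) - y) / delta
        return J

    y = f(x)
    A = fdjac(x, y)             # initialize the Jacobian by finite differences
    jac_is_fresh = True
    lam = 10.0
    for _ in range(maxiter):
        # propose a step from (4.6.9)
        s = np.linalg.solve(A.T @ A + lam * np.eye(x.size), -A.T @ y)
        y_new = f(x + s)
        if np.linalg.norm(y_new) < np.linalg.norm(y):
            # success: accept, become more Newton-like, apply the cheap Broyden update
            lam /= 10
            A = A + np.outer(y_new - y - A @ s, s) / np.dot(s, s)
            jac_is_fresh = False
            x, y = x + s, y_new
            if np.linalg.norm(s) < tol or np.linalg.norm(y) < tol:
                break
        else:
            # failure: become more gradient-like; refresh the Jacobian if it is stale
            lam *= 4
            if not jac_is_fresh:
                A = fdjac(x, y)
                jac_is_fresh = True
    return x

# Example: intersect the unit circle with the line y = x, starting from (1, 0.1).
f = lambda x: np.array([x[0]**2 + x[1]**2 - 1, x[1] - x[0]])
print(quasi_newton_levenberg(f, [1.0, 0.1]))   # approximately [0.7071, 0.7071]
```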

In some cases our simple logic in Function 4.6.3 can make $\lambda$ oscillate between small and large values; several better but more complicated strategies for controlling $\lambda$ are known. In addition, the linear system (4.6.9) is usually modified to get the well-known Levenberg–Marquardt algorithm, which does a superior job in some problems as $\lambda\to \infty$.

4.6.5 Exercises

  1. ⌨ (Variation on Exercise 4.5.2.) Two curves in the $(u,v)$ plane are defined implicitly by the equations $u\log u + v \log v = -0.3$ and $u^4 + v^2 = 1$.

    (a) ✍ Write the intersection of these curves in the form $\mathbf{f}(\mathbf{x}) = \boldsymbol{0}$ for two-dimensional $\mathbf{f}$ and $\mathbf{x}$.

    (b) ⌨ Use Function 4.6.3 to find an intersection point starting from $u=1$, $v=0.1$.

    (c) ⌨ Use Function 4.6.3 to find an intersection point starting from $u=0.1$, $v=1$.

  2. ⌨ (Variation on Exercise 4.5.4) Two elliptical orbits $(x_1(s),y_1(s))$ and $(x_2(t),y_2(t))$ are described by the equations

    $$\begin{bmatrix} x_1(t) \\ y_1(t) \end{bmatrix} = \begin{bmatrix} -5+10\cos(t) \\ 6\sin(t) \end{bmatrix}, \qquad \begin{bmatrix} x_2(t)\\y_2(t) \end{bmatrix} = \begin{bmatrix} 8\cos(t) \\ 1+12\sin(t) \end{bmatrix},$$

    where $t$ represents time.

    (a) ✍ Write out a $2\times 2$ nonlinear system of equations that describes an intersection of these orbits. (Note: An intersection is not the same as a collision—they don’t have to occupy the same point at the same time.)

    (b) ⌨ Use Function 4.6.3 to find all of the unique intersections.

  3. ⌨ (Variation on Exercise 4.5.5) Suppose one wants to find the points on the ellipsoid $x^2/25 + y^2/16 + z^2/9 = 1$ that are closest to and farthest from the point $(5,4,3)$. The method of Lagrange multipliers implies that any such point satisfies

    $$\begin{split} x-5 &= \frac{\lambda x}{25}, \\[1mm] y-4 &= \frac{\lambda y}{16}, \\[1mm] z-3 &= \frac{\lambda z}{9}, \\[1mm] 1 &= \frac{1}{25}x^2 + \frac{1}{16}y^2 + \frac{1}{9}z^2 \end{split}$$

    for an unknown value of $\lambda$.

    (a) Write out this system in the form $\mathbf{f}(\mathbf{u}) = \boldsymbol{0}$. (Note that the system has four variables to go with the four equations.)

    (b) Use Function 4.6.3 with different initial guesses to find the two roots of this system. Which is the closest point to $(5,4,3)$, and which is the farthest?

  4. ✍ The Broyden update formula (4.6.8) is just one instance of so-called rank-1 updating. Verify the Sherman–Morrison formula,

    $$(\mathbf{A}+\mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\frac{\mathbf{u}\mathbf{v}^T}{1+\mathbf{v}^T\mathbf{A}^{-1}\mathbf{u}}\mathbf{A}^{-1},$$

    which is valid whenever $\mathbf{A}$ is invertible and the denominator above is nonzero. (Hint: Show that $\mathbf{A}+\mathbf{u}\mathbf{v}^T$ times the matrix above simplifies to the identity matrix.)

  5. ✍ Derive Equation (4.6.13).

  6. ⌨ (See also Exercise 4.5.11.) Suppose that

    $$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} x_1x_2+x_2^2-1 \\[1mm] x_1x_2^3 + x_1^2x_2^2 + 1 \end{bmatrix}.$$

    Let $\mathbf{x}_1=[-2,1]^T$ and let $\mathbf{A}_1=\mathbf{J}(\mathbf{x}_1)$ be the exact Jacobian.

    (a) Solve (4.6.9) for $\mathbf{s}_1$ with $\lambda=0$; this is the “pure” Newton step. Show numerically that $\|\mathbf{f}(\mathbf{x}_1+\mathbf{s}_1)\| > \|\mathbf{f}(\mathbf{x}_1)\|$. (Thus, the Newton step made us go to a point seemingly farther from a root than where we started.)

    (b) Now repeat part (a) with $\lambda=0.01j$ for $j=1,2,3,\ldots$. What is the smallest value of $j$ such that $\|\mathbf{f}(\mathbf{x}_1+\mathbf{s}_1)\| < \|\mathbf{f}(\mathbf{x}_1)\|$?

  7. ✍ Show that Equation (4.6.9) is equivalent to the linear least-squares problem

    $$\min_{\mathbf{v}} \Bigl( \bigl\|\mathbf{A}_k\mathbf{v} + \mathbf{f}_k\bigr\|_2^2 + \lambda^2 \bigl\| \mathbf{v} \bigr\|_2^2 \Bigr).$$

    (Hint: Express the minimized quantity using block matrix notation, such that (4.6.9) becomes the normal equations for it.)

    Thus, another interpretation of Levenberg’s method is that it is the Newton step plus a penalty, weighted by $\lambda$, for taking large steps.