Nonlinear least squares

After the solution of square linear systems, we generalized to the case of having more constraints to satisfy than available variables. Our next step is to do the same for nonlinear equations, thus filling out this table:







\(\min\, \bigl\|\mathbf{A}\mathbf{x} - \mathbf{b}\bigr\|_2\)

\(\min\, \bigl\|\mathbf{f}(\mathbf{x}) \bigr\|_2\)

We now define the nonlinear least squares problem.

Definition 41 (Nonlinear least squares problem)

Given a function \(\mathbf{f}(\mathbf{x})\) mapping from \(\real^n\) to \(\real^m\), find \(\mathbf{x}\in\real^n\) such that \(\bigl\|\mathbf{f}(\mathbf{x})\bigr\|_2\) is minimized.

As in the linear case, we consider only overdetermined problems, where \(m>n\). Minimizing a positive quantity is equivalent to minimizing its square, so we could also define the result as minimizing \(\phi(\mathbf{x})=\mathbf{f}(\mathbf{x})^T\mathbf{f}(\mathbf{x})\).

You should not be surprised to learn that we can formulate an algorithm by substituting a linear model function for \(\mathbf{f}\). At a current estimate \(\mathbf{x}_k\) we define

\[ \mathbf{q}(\mathbf{x}) = \mathbf{f}(\mathbf{x}_k) + \mathbf{A}_k(\mathbf{x}-\mathbf{x}_k),\]

where \(\mathbf{A}_k\) might be the exact \(m\times n\) Jacobian matrix, \(\mathbf{J}(\mathbf{x}_k)\), or a finite-difference or Broyden approximation of it as described in Quasi-Newton methods.

In the square case we solved \(\mathbf{q}=\boldsymbol{0}\) to define the new value for \(\mathbf{x}\), leading to the condition \(\mathbf{A}_k\mathbf{s}_k=-\mathbf{f}_k\), where \(\mathbf{s}_k=\mathbf{x}_{k+1}-\mathbf{x}_k\). Now, with \(m>n\), we cannot expect to solve \(\mathbf{q}=\boldsymbol{0}\), so instead we define \(\mathbf{x}_{k+1}\) as the value that minimizes \(\| \mathbf{q} \|_2\):

(102)\[ \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{s}_k, \text{ where } \| \mathbf{A}_k \mathbf{s}_k + \mathbf{f}(\mathbf{x}_k)\|_2 \text{ is minimized.}\]

We have just described the Gauss–Newton method. Gauss–Newton solves a series of linear least-squares problems in order to solve a nonlinear least squares problem.

Here we reap an amazing benefit from Julia: The functions newtonsys and levenberg, which were introduced for the case of \(m=n\) nonlinear equations, work without modification as the Gauss–Newton method for the overdetermined case! The reason is that the backslash operator applies equally well to the linear system and linear least squares problems, and nothing else in the function was written with explicit reference to \(n\).


In the multidimensional Newton method for a nonlinear system, we expect quadratic convergence to a solution in the typical case. For the Gauss–Newton method, the picture is more complicated. As always in least squares problems, the residual \(\mathbf{f}(\mathbf{x})\) will not necessarily be zero when \(\|\mathbf{f}\|\) is minimized. Suppose that the minimum value of \(\|\mathbf{f}\|\) is \(R>0\). In general, we might observe quadratic-like convergence until the iterate \(\|\mathbf{x}_k\|\) is within distance \(R\) of a true minimizer, and linear convergence thereafter. When \(R\) is not sufficiently small, the convergence can be quite slow.

Nonlinear data fitting

In Fitting functions to data we saw how to fit functions to data values, provided that the set of candidate fitting functions depends linearly on the undetermined coefficients. We now have the tools to generalize that process to fitting functions that depend nonlinearly on unknown parameters. Suppose that \((t_i,y_i)\) for \(i=1,\ldots,m\) are given points. We wish to model the data by a function \(g(t;\mathbf{c})\) that depends on unknown parameters \(c_1,\ldots,c_n\) in an arbitrary way. A standard approach is to minimize the discrepancy between the model and the observations, in a least squares sense:

\[ \min_{\mathbf{c}\in\mathbb{R}^n} \sum_{i=1}^m \bigl[ g(t_i;\mathbf{c})-y_i \bigr]^2 = \min_{\mathbf{c}\in\mathbb{R}^n} \bigl\| \mathbf{f}(\mathbf{c}) \bigr\|^2,\]

where \(\mathbf{f}(\mathbf{c})\) is the vector of values \(g(t_i;\mathbf{c})-y_i\). We call \(\mathbf{f}\) a misfit function: the smaller the norm of the misfit, the better the fit.

The form of \(g\) is up to the modeler. There may be compelling theoretical choices, or you may just be looking for enough algebraic power to express the data well. Naturally, in the special case where the dependence on \(\mathbf{c}\) is linear, i.e.,

\[ g(t;\mathbf{c}) = c_1 g_1(t) + c_2 g_2(t) + \cdots + c_m g_m(t),\]

then the misfit function is also linear in \(\mathbf{c}\) and the fitting problem reduces to linear least squares.


  1. ✍ Define \(\mathbf{f}(x)=[ x-8, \; x^2-4 ]^T\).

    (a) Write out the linear model of \(\mathbf{f}\) at \(x=2\).

    (b) Find the estimate produced by one step of the Gauss–Newton method, starting at \(x=2\).

  2. ✍ (continuation of preceding problem) The Gauss–Newton method replaces \(\mathbf{f}(\mathbf{x})\) by a linear model and minimizes the residual norm of it. An alternative is to replace \(\| \mathbf{f}(\mathbf{x}) \|_2^2\) by a scalar quadratic model \(q(\mathbf{x})\), and minimize that.

    (a) Using \(\mathbf{f}(x) = [ x-8, \; x^2-4 ]^T\), let \(q(x)\) be defined by the first three terms in the Taylor series for \(\| \mathbf{f}(x) \|_2^2\) at \(x=2\).

    (b) Find the unique \(x\) that minimizes \(q(x)\). Is the result the same as the estimate produced by Gauss–Newton?

  3. ⌨ A famous result by Kermack and McKendrick in 1927 [KMW27] suggests that in epidemics that kill only a small fraction of a susceptible population, the death rate as a function of time is well modeled by

    \[w'(t) = A \operatorname{sech}^2[B(t-C)],\]

    for constant values of the parameters \(A,B,C\). Since the maximum of sech is \(\operatorname{sech}(0)=1\), \(A\) is the maximum death rate and \(C\) is the time of peak deaths. You will use this model to fit the deaths per week from plague recorded in Mumbai in a period during 1906:

    5, 10, 17, 22, 30, 50, 51, 90, 120, 180, 292, 395, 445, 775, 780,
    700, 698, 880, 925, 800, 578, 400, 350, 202, 105, 65, 55, 40, 30, 20

    (a) Use levenberg to find the best least-squares fit to the data using the \(\operatorname{sech}^2\) model. Make a plot of the model fit superimposed on the data. What are \(A\) and \(C\)?

    (b) In practice, one would like a model to predict the full course of the epidemic before it has reached its peak and subsided. Redo the fitting from part~(a) using only the first 12 data values. Add this model to your plot and report \(A\) and \(C\). Is this model a useful predictor of the value and timing of the maximum death rate?

    (c) Repeat part (b) using the first 13 data values.

  4. ⌨ The Rosenbrock banana function is defined as \(\| \mathbf{f}(\mathbf{x}) \|_2^2\), where

    \[\begin{split}\mathbf{f}(\mathbf{x}) = \begin{bmatrix} 10(x_2-x_1^2) \\ 1-x_1 \end{bmatrix}.\end{split}\]

    Use newtonsys to find a minimizer of the banana function starting from \((-1.4,5.1)\). (If you’re curious about its name, make a contour plot of the residual over \(-2\le x_1 \le 3\), \(-1\le x_2 \le 4\).) Show all the digits of the final result.

  5. ⌨ In this problem you are to fit a function of the form

    \[P(t) = c_1 + c_2 e^{c_3 t}\]

    to US census data for the twentieth century. Starting in 1900, the population in millions every ten years was:

    76.0, 92.0, 105.7, 122.8, 131.7, 150.7, 179.0, 205.0, 226.5, 248.7

    Use nonlinear least squares to determine the unknown parameters \(c_1\), \(c_2\), \(c_3\) in \(P\). To aid convergence, rescale the data using the time variable \(t = (\text{year}-1900)/100\) and divide the population numbers above by \(100\). Using your model \(P(t)\), predict the result of the 2000 census, and compare it to the exact figure (which can be found easily on the internet).

  6. ⌨ The position of the upper lid during an eye blink can be measured from high speed video [WBS+14], and it may be possible to classify blinks based in part on fits to the lid position [BWB+17]. The lid position functions proposed to fit blinks is a product of a monomial or polynomial multiplying a decaying exponential [BM98]. In this problem, you will generate representative data, add a small amount of noise to it, and then perform nonlinear least squares fits to the data.

    (a) Consider the function \(y(\mathbf{a}) = a_1 \exp \left( -a_2 t^{a_3} \right)\), using the vector of coefficients \(\mathbf{a} = [a_1 \ a_2\ a_3]\), and create eyelid position data as follows:

    N = 20;                            % number of time values
    t = linspace(1,N,N)'/N;            % equally spaced to t=1
    a = [10, 10, 2]';                  % baseline values
    y = a(1)*t.^2.*exp(-a(2)*t.^a(3)); % ideal data
    ym = y;                            % vector for data
    ir = 1:N-1;                        % range to add noise
    rng(13);                           % set seed for rand
    noise = 0.03;                      % amplitude of noise
    ym(ir) = y(ir) + noise*rand(size(y(ir)));  % add noise

    (b) Using the data (t,ym), find the nonlinear least squares fit using levenberg.

    (c) Plot the fits using np = 100 points over t=(1:np)/np together with symbols for the N measured data points ym.

    (d) Increase the noise to 5% and 10%. You may have to increase the number of measured points N and/or the maximum number of iterations. How close are the coefficients? Plot the data and the resulting fit for each case.

  7. ⌨ Repeat the previous problem using the fitting function \(y(\mathbf{a}) = (a_1+a_2 t + a_3 t^2) t^2 \exp \left( -a_4 t^{a_5} \right)\), using the vector of coefficients \(\mathbf{a} = [a_1 \ a_2\ a_3\ a_4\ a_5]\). (This was the choice used in Brosch et al [BWB+17].) Use a = [20, -10, -8, 7, 2] to create the data and as an initial guess for the coefficients for the fit to the noisy data.