We have seen before that certain matrix properties can be exploited to improve the solution of linear algebra problems. One of the most important of these is the case $A^*=A$, i.e., $A$ is hermitian. The Arnoldi iteration has a particularly useful specialization to this case. In this section we describe the resulting algorithms, but we do not derive them in full detail or present robust implementations.
Left-multiplying the Arnoldi relation $A Q_m = Q_{m+1} H_m$ by $Q_m^*$ gives

$$Q_m^* A Q_m = Q_m^* Q_{m+1} H_m = \tilde{H}_m,$$

where $\tilde{H}_m$ is rows 1 through $m$ of $H_m$. If $A$ is hermitian, then so is the left side of this equation, hence $\tilde{H}_m$ is hermitian too. But it is also upper Hessenberg, meaning that the $(i,j)$ element is zero if $i>j+1$. By symmetry, the elements are zero when $j>i+1$ as well, so $\tilde{H}_m$ is tridiagonal.
Equation (8.4.6) of the Arnoldi iteration now simplifies to a much shorter expression, the three-term recurrence

$$A q_m = H_{m-1,m}\, q_{m-1} + H_{m,m}\, q_m + H_{m+1,m}\, q_{m+1}.$$
As in the derivation of the Arnoldi iteration, given the first $m$ vectors we can solve for the entries in column $m$ of $H$ and then for $q_{m+1}$. The resulting process is known as the Lanczos iteration. Its most important practical advantage is that while Arnoldi needs $O(m)$ vector operations to get $q_{m+1}$ from the previous vectors, Lanczos needs only $O(1)$, so restarting isn't required for symmetric matrices.[1]
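Although we do not develop a practical implementation, the recurrence is short enough to sketch. The following is a bare-bones sketch in Python/NumPy, not a robust implementation: it applies the three-term recurrence directly and performs no reorthogonalization, so the orthogonality of the computed basis can degrade in floating point (see the footnote).

```python
import numpy as np

def lanczos(A, u, m):
    """Illustrative Lanczos sketch: m steps of the three-term recurrence.
    Returns Q (n by m+1) with nominally orthonormal columns and the
    (m+1) by m tridiagonal matrix H satisfying A @ Q[:, :m] = Q @ H
    up to roundoff. No reorthogonalization is performed."""
    n = len(u)
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = u / np.linalg.norm(u)
    for j in range(m):
        v = A @ Q[:, j]
        if j > 0:
            H[j - 1, j] = H[j, j - 1]        # symmetry of the tridiagonal matrix
            v -= H[j - 1, j] * Q[:, j - 1]
        H[j, j] = Q[:, j] @ v
        v -= H[j, j] * Q[:, j]
        beta = np.linalg.norm(v)
        if beta == 0:                        # breakdown: Krylov subspace is invariant
            break
        H[j + 1, j] = beta
        Q[:, j + 1] = v / beta
    return Q, H

# Quick check on a random symmetric matrix (illustrative values only):
rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = B + B.T
Q, H = lanczos(A, rng.standard_normal(200), 30)
print(np.linalg.norm(A @ Q[:, :30] - Q @ H))   # near machine precision
```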
When $A$ is hermitian and the Arnoldi iteration is reduced to Lanczos, the analog of GMRES is known as MINRES. Like GMRES, MINRES minimizes the residual $\|b-Ax\|$ over Krylov subspaces of increasing dimension.
MINRES is also more theoretically tractable than GMRES. The following result relies on some advanced approximation theory. Recall that the eigenvalues of a hermitian matrix are real.
The upper bound (8.6.4) implies that the residual decreases at a linear rate. As the product $\kappa_+\kappa_-$ grows, the rate of this convergence approaches 1. Hence eigenvalues close to the origin (relative to the eigenvalues of largest magnitude) are expected to slow the convergence. The bound for a definite matrix is better, as the next theorem shows.
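As a quick numerical illustration (a sketch using SciPy's minres on a made-up indefinite diagonal matrix; any comparable solver would serve), one can watch the residual norm decay at a roughly linear rate:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import minres

# Made-up hermitian indefinite example: eigenvalues filling [-10,-1] and [1,10].
d = np.concatenate([np.linspace(-10, -1, 100), np.linspace(1, 10, 100)])
A = diags(d)
b = np.full(200, 200**-0.5)               # a unit vector

resid = []
minres(A, b, rtol=1e-14, maxiter=80,      # rtol was named tol in older SciPy versions
       callback=lambda xk: resid.append(np.linalg.norm(b - A @ xk)))
print(np.array(resid[::10]))              # decreases by a roughly constant ratio
```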
Given positive definiteness in addition to symmetry, we arrive at perhaps the most famous Krylov subspace method for Ax=b, called conjugate gradients.
Suppose now that $A$ is hermitian and positive definite (HPD). Then $A$ has a Cholesky factorization, which in the complex case is $A=R^*R$. Therefore, for any vector $u$,

$$u^* A u = (Ru)^*(Ru) = \|Ru\|^2 \ge 0.$$

This quantity is zero only when $u=0$, since $R$ is nonsingular, so $\|u\|_A = \sqrt{u^*Au}$ defines a vector norm, called the A-norm.
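This identity is easy to verify numerically. Here is a small NumPy sketch with made-up data (note that numpy.linalg.cholesky returns a lower triangular factor, so $R$ is taken to be its transpose):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B.T @ B + 6 * np.eye(6)        # a made-up SPD matrix
R = np.linalg.cholesky(A).T        # A = R.T @ R with R upper triangular (real case)
u = rng.standard_normal(6)

# u'Au equals ||Ru||^2, hence is nonnegative; its square root is the A-norm of u.
print(u @ A @ u, np.linalg.norm(R @ u) ** 2)
```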
The convergence of CG and MINRES depends on the eigenvalues of $A$. In the HPD case the eigenvalues are real and positive, and they equal the singular values. Hence the condition number $\kappa$ is equal to the ratio of the largest eigenvalue to the smallest one. The following theorem suggests that MINRES and CG are not so different in convergence.
Theorem 8.6.2 characterizes the convergence of MINRES and CG similarly, differing only in whether the measurement is of the residual or the A-norm of the error, respectively. While these are different quantities, in practice one may not find a consistent advantage for one method over the other.
As in the indefinite case with MINRES, a larger condition number is associated with slower convergence in the positive definite case. Specifically, making the bound in (8.6.10) no larger than a tolerance $\epsilon$ requires

$$2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^{m} \le \epsilon, \qquad \text{i.e.,} \qquad m \ge \frac{\ln(\epsilon/2)}{\ln\left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)}.$$

For large $\kappa$ the denominator is approximately $-2/\sqrt{\kappa}$, so the number of iterations needed grows like $O\bigl(\sqrt{\kappa}\bigr)$.
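As a quick check of this estimate, with made-up values $\kappa=10^4$ and $\epsilon=10^{-6}$ the exact and asymptotic iteration counts nearly coincide:

```python
import numpy as np

kappa, eps = 1e4, 1e-6                         # made-up illustrative values
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
m_exact = np.log(eps / 2) / np.log(rho)        # solve 2*rho**m <= eps for m
m_approx = 0.5 * np.sqrt(kappa) * abs(np.log(eps / 2))
print(int(np.ceil(m_exact)), int(np.ceil(m_approx)))   # both 726
```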
✍ For each part, the eigenvalues of $A$ are given. Suppose MINRES is applied to solve $Ax=b$. Use (8.6.4) or (8.6.10), whichever is most appropriate, to determine a lower bound on $m$ that guarantees reduction of the residual norm by a factor of $10^{-4}$.
(a) $-100, -99, \ldots, -1, 1, 2, \ldots, 100$
(b) $-100, 1, 2, \ldots, 100$
(c) $1, 2, \ldots, 100$
⌨ Let $b$ be a random unit vector of length 202. Define a diagonal matrix $A$ with diagonal entries $d_k$ given by
(a) Apply 120 iterations of MINRES to solve Ax=b. Compute the relative error of the answer, and plot the norm of the residual as a function of m on a log-linear scale.
(b) Add to your graph the line representing the upper bound (8.6.4). (Ignore the rounding in the exponent.) This line should stay on or above the residual curve.
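One possible starting point for this exercise, assuming SciPy's minres is used (the diagonal entries below are placeholders to be replaced by the $d_k$ defined in the statement):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import minres

rng = np.random.default_rng()
b = rng.standard_normal(202)
b /= np.linalg.norm(b)                     # random unit vector

d = np.arange(1, 203, dtype=float)         # placeholder: replace with the d_k above
A = diags(d)

resid = [np.linalg.norm(b)]
x, _ = minres(A, b, rtol=1e-14, maxiter=120,
              callback=lambda xk: resid.append(np.linalg.norm(b - A @ xk)))
relative_error = np.linalg.norm(x - b / d) / np.linalg.norm(b / d)
# Plot resid against iteration number on a semilogy scale (e.g., with matplotlib).
```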
⌨ Let $b$ be a random unit vector of length 501. Define a sparse diagonal matrix $A$ with diagonal entries $d_k$ given by
(a) Apply 100 iterations of MINRES to solve $Ax=b$. Compute the relative error of the answer. Plot the norm of the residual as a function of $m$.
(b) Add to your graph the line representing the upper bound (8.6.10). This line should stay on or above the convergence curve.
(c) Add a convergence curve for 100 iterations of cg.
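A sketch of the setup, again assuming SciPy (with placeholder diagonal entries standing in for the $d_k$ from the statement):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, minres

rng = np.random.default_rng()
b = rng.standard_normal(501)
b /= np.linalg.norm(b)                     # random unit vector

d = np.arange(1, 502, dtype=float)         # placeholder: replace with the d_k above
A = diags(d)

res_mr, res_cg = [], []
minres(A, b, rtol=1e-14, maxiter=100,
       callback=lambda xk: res_mr.append(np.linalg.norm(b - A @ xk)))
cg(A, b, rtol=1e-14, maxiter=100,
   callback=lambda xk: res_cg.append(np.linalg.norm(b - A @ xk)))
# Plot res_mr and res_cg together on a semilogy scale, along with the bound (8.6.10).
```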
✍ Suppose a family of SPD matrices $A$ is parameterized by $t$, and that the condition numbers of the matrices scale like $O(t^2)$ as $t\to\infty$. Given that CG takes 60 iterations to reach a certain reduction in the error of a linear system when $t=200$, estimate the number of iterations CG will need to reach the same accuracy at $t=300$.
✍ Given a real $n\times n$ symmetric matrix $A$ and a vector $b=Ax$, we can define the scalar-valued function

$$\varphi(u) = u^T A u - 2 u^T b.$$
(a) Expand and simplify the expression φ(x+v)−φ(x), keeping in mind that Ax=b.
(b) Using the result of part (a), prove that if A is an SPD matrix, φ has a global minimum at x.
(c) Show that for any vector $u$, $\|u-x\|_A^2 - \varphi(u)$ is constant.
(d) Using the result of part (c), prove that CG minimizes φ(u) over Krylov subspaces.
⌨ Let $n=50$. Define the matrix $A_k$ as FNC.poisson(n) minus $k^2$ times the $n^2\times n^2$ identity matrix, and define the vector $b$ as $n^2$ copies of $-1$. The linear system $A_k x = b$ arises from the Helmholtz equation for wave propagation at a single frequency $k$.
(a) Apply both MINRES and CG to the Helmholtz system for $k=1.3$, solving to a relative residual tolerance of $10^{-5}$. Plot their convergence curves together.
(b) Repeat part (a) for k=8.
(c) Use eigs on the matrix from part (b) to show that it is indefinite. (Hint: Use additional arguments to get the eigenvalues with smallest and largest real parts.) This helps explain why the CG convergence curve for this matrix looks rather strange.
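One way the computation might be set up, assuming SciPy and that FNC.poisson(n) returns the sparse $n^2\times n^2$ matrix described above:

```python
import numpy as np
from scipy.sparse import eye
from scipy.sparse.linalg import cg, eigs, minres

import FNC                                 # the book's package, assumed to provide poisson(n)

n, k = 50, 1.3                             # repeat with k = 8 for parts (b)-(c)
A = FNC.poisson(n) - k**2 * eye(n**2)
b = -np.ones(n**2)

res_mr, res_cg = [], []
minres(A, b, rtol=1e-5,
       callback=lambda xk: res_mr.append(np.linalg.norm(b - A @ xk)))
cg(A, b, rtol=1e-5,
   callback=lambda xk: res_cg.append(np.linalg.norm(b - A @ xk)))
# Plot both residual histories together on a semilogy scale.

# Part (c): the extreme eigenvalues (smallest/largest real part) reveal indefiniteness.
print(eigs(A, 1, which="SR", return_eigenvectors=False),
      eigs(A, 1, which="LR", return_eigenvectors=False))
```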
In principle, the implementation of the Lanczos iteration is a minor change from Arnoldi, but maintaining numerical stability requires some extra analysis and effort. We do not present the details here.