Given that matrix-vector multiplication is fast for sparse matrices, let’s see what we might accomplish with only that at our disposal.
Example 8.2.1 (Power iteration; Julia version)
Here we choose a random 5×5 matrix and a random 5-vector.
A = rand(1.0:9.0, 5, 5)
A = A ./ sum(A, dims=1)
x = randn(5)
Applying matrix-vector multiplication once doesn’t do anything recognizable.
Repeating the multiplication still doesn’t do anything obvious.
But if we keep repeating the matrix-vector multiplication, something remarkable happens: $\mathbf{A}\mathbf{x} \approx \mathbf{x}$.
for j in 1:8
    x = A * x
end
[x A * x]
This phenomenon seems to occur regardless of the starting vector.
x = randn(5)
for j in 1:8
    x = A * x
end
[x A * x]
Example 8.2.1 (MATLAB version). Here we choose a magic 5×5 matrix (scaled so that its dominant eigenvalue is 1) and a random 5-vector.
A = magic(5) / 65;
x = randn(5, 1);
Applying matrix-vector multiplication once doesn’t do anything recognizable.
Repeating the multiplication still doesn’t do anything obvious.
But if we keep repeating the matrix-vector multiplication, something remarkable happens: $\mathbf{A}\mathbf{x} \approx \mathbf{x}$.
for j = 1:8
    x = A * x;
end
[x, A * x]
This phenomenon seems to occur regardless of the starting vector.
x = randn(5, 1);
for j = 1:8
    x = A * x;
end
[x, A * x]
Example 8.2.1 (Python version). Here we choose a random 5×5 matrix and a random 5-vector.
A = random.choice(range(10), (5, 5))
A = A / sum(A, 0)
x = random.randn(5)
print(x)
Applying matrix-vector multiplication once doesn’t do anything recognizable.
Repeating the multiplication still doesn’t do anything obvious.
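As a minimal sketch of those first steps (using the A and x defined above; these lines are not part of the original demo), the products look like unstructured noise:

print(A @ x)        # one application: nothing recognizable
print(A @ (A @ x))  # two applications: still nothing obvious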
But if we keep repeating the matrix-vector multiplication, something remarkable happens: $\mathbf{A}\mathbf{x} \approx \mathbf{x}$.
x = random.randn(5)
for j in range(6):
    x = A @ x
print(x)
print(A @ x)
This phenomenon is unlikely to be a coincidence!
There was a little cheating in Demo 8.2.1 to make the story come out neatly (specifically, the line A = A ./ sum(A, dims=1)). But it illustrates an important general fact that we investigate now.
8.2.1 Dominant eigenvalue

Analysis of matrix powers is most straightforward in the diagonalizable case. Let $\mathbf{A}$ be any diagonalizable $n\times n$ matrix having eigenvalues $\lambda_1,\ldots,\lambda_n$ and corresponding linearly independent eigenvectors $\mathbf{v}_1,\ldots,\mathbf{v}_n$. Furthermore, suppose the eigenvalues are such that

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|. \tag{8.2.1}$$

Given (8.2.1), we say that $\lambda_1$ is the dominant eigenvalue. This was the case with $\lambda_1=1$ for $\mathbf{A}$ in Demo 8.2.1.
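As a quick numerical check of that claim (a sketch, not part of the original demos; the matrix here is an arbitrary positive random matrix), normalizing the columns of a positive matrix to sum to 1 produces a dominant eigenvalue equal to 1:

import numpy as np
rng = np.random.default_rng(1)
A = rng.uniform(1, 9, (5, 5))          # strictly positive entries
A = A / A.sum(axis=0)                  # scale each column to sum to 1
ev = np.linalg.eigvals(A)
print(sorted(abs(ev), reverse=True))   # largest magnitude is 1; the rest are smaller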
Now let $\mathbf{x}$ be an $n$-vector, let $k$ be a positive integer, and refer to (7.2.10):

$$\mathbf{A}^k \mathbf{x} = \mathbf{V}\mathbf{D}^k\mathbf{V}^{-1}\mathbf{x}. \tag{8.2.2}$$

Let $\mathbf{z}=\mathbf{V}^{-1}\mathbf{x}$, and recall that $\mathbf{D}$ is a diagonal matrix of eigenvalues. Then
$$
\begin{split}
\mathbf{A}^k\mathbf{x} &= \mathbf{V}\mathbf{D}^k \mathbf{z} = \mathbf{V}\begin{bmatrix} \lambda_1^k z_1 \\ \lambda_2^k z_2 \\ \vdots \\ \lambda_n^k z_n \end{bmatrix} \\
&= \lambda_1^k \left[ z_1 \mathbf{v}_{1} + z_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k \mathbf{v}_{2} + \cdots + z_n \left(\frac{\lambda_n}{\lambda_1}\right)^k \mathbf{v}_{n} \right].
\end{split} \tag{8.2.3}
$$

Since $\lambda_1$ is dominant, we conclude that if $z_1\neq 0$,
$$
\left\| \frac{\mathbf{A}^k\mathbf{x}}{\lambda_1^k} - z_1\mathbf{v}_1 \right\| \le |z_2|\cdot\left|\frac{\lambda_2}{\lambda_1}\right|^k \|\mathbf{v}_{2}\| + \cdots + |z_n|\cdot\left|\frac{\lambda_n}{\lambda_1}\right|^k \|\mathbf{v}_{n}\| \rightarrow 0 \text{ as } k\rightarrow\infty. \tag{8.2.4}
$$

That is, $\mathbf{A}^k\mathbf{x}$ eventually is a scalar multiple of the dominant eigenvector.[1]
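The limit (8.2.4) is easy to observe numerically. The following sketch (not from the text; the matrix and starting vector are arbitrary) shows the normalized power $\mathbf{A}^k\mathbf{x}/\lambda_1^k$ lining up with the dominant eigenvector as $k$ grows:

import numpy as np
rng = np.random.default_rng(2)
A = rng.uniform(1, 9, (5, 5))      # positive entries ensure a simple dominant eigenvalue
ev, V = np.linalg.eig(A)
i = np.argmax(abs(ev))
lam1, v1 = ev[i].real, V[:, i].real
vn = v1 / np.linalg.norm(v1)
x = rng.standard_normal(5)
for k in [4, 8, 16]:
    w = np.linalg.matrix_power(A, k) @ x / lam1**k
    wn = w / np.linalg.norm(w)
    # distance to the dominant eigenvector direction (up to sign) shrinks as k grows
    print(k, min(np.linalg.norm(wn - vn), np.linalg.norm(wn + vn)))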
For algorithmic purposes, it is important to interpret $\mathbf{A}^k\mathbf{x}$ as $\mathbf{A}\bigl(\cdots\bigl(\mathbf{A}(\mathbf{A}\mathbf{x})\bigr)\cdots\bigr)$, i.e., as repeated applications of $\mathbf{A}$ to a vector. Doing so allows us to fully exploit sparsity of $\mathbf{A}$, something which is not preserved by taking the matrix power $\mathbf{A}^k$ explicitly before the multiplication with $\mathbf{x}$ (see Demo 8.1.2).
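To make the contrast concrete, here is a minimal sketch (under assumed sizes, not taken from the text) of the repeated-application approach for a sparse matrix; only matrix-vector products are ever formed:

import numpy as np
import scipy.sparse as sp

n = 40000
A = sp.random(n, n, density=1e-4, format="csr")   # about 160,000 nonzeros
x = np.random.randn(n)

# Repeated application: six sparse matrix-vector products, each costing O(nnz(A)).
y = x.copy()
for _ in range(6):
    y = A @ y

# Forming A**6 explicitly would instead create substantial fill-in,
# losing the sparsity that makes each product cheap.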
8.2.2 Power iteration

An important technicality separates us from an algorithm: unless $|\lambda_1|=1$, the factor $\lambda_1^k$ tends to make $\|\mathbf{A}^k\mathbf{x}\|$ either very large or very small. Nor can we easily normalize by $\lambda_1^k$, as in (8.2.4), unless we know $\lambda_1$ in advance.
To make a practical algorithm, we alternate matrix-vector multiplication with a renormalization of the vector. In the following, we use $x_{k,m}$ and $y_{k,m}$ to mean the $m$th component of the vectors $\mathbf{x}_k$ and $\mathbf{y}_k$.
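The formal statement of the method (Algorithm 8.2.1) is not reproduced here, but each pass of the iteration implemented in Function 8.2.2 below can be summarized as follows, starting from a vector $\mathbf{x}_1$ with $\|\mathbf{x}_1\|_\infty=1$:

$$
\begin{split}
\mathbf{y}_k &= \mathbf{A}\mathbf{x}_k, \\
m &= \operatorname{arg\,max}_j\, |y_{k,j}|, \qquad \alpha_k = \frac{1}{y_{k,m}}, \\
\beta_k &= \frac{y_{k,m}}{x_{k,m}}, \qquad \mathbf{x}_{k+1} = \alpha_k \mathbf{y}_k.
\end{split}
$$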
Note that by definition, $\|\mathbf{x}_{k+1}\|_\infty=1$. Also, we can write

$$\mathbf{x}_{k} = (\alpha_1 \alpha_2 \cdots \alpha_k)\, \mathbf{A}^k \mathbf{x}_{1}. \tag{8.2.5}$$

Thus Algorithm 8.2.1 modifies (8.2.3) and (8.2.4) only slightly.
Finally, if $\mathbf{x}_k$ is nearly a dominant eigenvector of $\mathbf{A}$, then $\mathbf{A}\mathbf{x}_k$ is nearly $\lambda_1\mathbf{x}_k$, and we can take the ratio $\beta_k = y_{k,m}/x_{k,m}$ as an eigenvalue estimate. In fact, revisiting (8.2.3), the extra $\alpha_j$ normalization factors cancel in the ratio, and, after some simplification, we get

$$\beta_k = \frac{y_{k,m}}{x_{k,m}} = \lambda_1 \frac{1+r_2^{k+1} b_2 + \cdots + r_n^{k+1} b_n}{1+r_2^{k} b_2 + \cdots + r_n^{k} b_n}, \tag{8.2.6}$$

where $r_j=\lambda_j/\lambda_1$ and the $b_j$ are constants. By assumption (8.2.1), each $r_j$ satisfies $|r_j|<1$, so we see that $\beta_k\rightarrow \lambda_1$ as $k\rightarrow\infty$.
Function 8.2.2 is our implementation of power iteration.
Power iteration (MATLAB version):
function [gamma, x] = poweriter(A, numiter)
% POWERITER   Power iteration for the dominant eigenvalue.
% Input:
%   A         square matrix
%   numiter   number of iterations
% Output:
%   gamma     sequence of eigenvalue approximations (vector)
%   x         final eigenvector approximation
n = length(A);
x = randn(n, 1);
x = x / norm(x, inf);
for k = 1:numiter
    y = A * x;
    [normy, m] = max(abs(y));
    gamma(k) = y(m) / x(m);
    x = y / y(m);
end
Power iteration (Python version):
import numpy as np

def poweriter(A, numiter):
    """
    poweriter(A, numiter)

    Perform numiter power iterations with the matrix A, starting from a random vector,
    and return a vector of eigenvalue estimates and the final eigenvector approximation.
    """
    n = A.shape[0]
    x = np.random.randn(n)
    x = x / np.linalg.norm(x, np.inf)   # scale so that the inf-norm is 1
    gamma = np.zeros(numiter)
    for k in range(numiter):
        y = A @ x
        m = np.argmax(abs(y))           # index of the largest-magnitude component
        gamma[k] = y[m] / x[m]          # eigenvalue estimate
        x = y / y[m]                    # renormalize for the next pass
    return gamma, x
Observe that the only use of $\mathbf{A}$ is to find the matrix-vector product $\mathbf{A}\mathbf{x}$, which makes exploitation of the sparsity of $\mathbf{A}$ trivial.
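For example, the Python version above works unchanged when A is a SciPy sparse matrix, since only the product A @ x is ever computed. The following is a hypothetical usage sketch (the test matrix is arbitrary, chosen so that a dominant eigenvalue is likely):

import numpy as np
import scipy.sparse as sp

n = 10000
# nonnegative sparse part plus a positive diagonal shift, so the largest
# eigenvalue is real and (generically) dominant
A = sp.random(n, n, density=2e-4, format="csr") + 10 * sp.eye(n, format="csr")
gamma, x = poweriter(A, 200)
print(gamma[-1])    # estimate of the dominant eigenvalue of A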
8.2.3 Convergence

Let's examine the terms in the numerator and denominator of (8.2.6) more carefully:

$$
\begin{split}
r_2^{k} b_2 + \cdots + r_n^{k} b_n &= r_2^k \left[ b_2 + \left( \frac{r_3}{r_2} \right)^k b_3 + \cdots + \left( \frac{r_n}{r_2} \right)^k b_n \right] \\
&= r_2^k \left[ b_2 + \left( \frac{\lambda_3}{\lambda_2} \right)^k b_3 + \cdots + \left( \frac{\lambda_n}{\lambda_2} \right)^k b_n \right].
\end{split} \tag{8.2.7}
$$

At this point we'll introduce an additional assumption,

$$|\lambda_2| > |\lambda_3| \ge \cdots \ge |\lambda_n|. \tag{8.2.8}$$

This condition isn't strictly necessary, but it simplifies the following statements considerably, because now it's clear that the quantity in (8.2.7) approaches $b_2 r_2^k$ as $k\rightarrow \infty$.
Next we estimate (8.2.6) for large $k$, using a geometric series expansion for the denominator to get

$$
\begin{split}
\beta_k & \to \lambda_1 \left( 1+b_2 r_2^{k+1} \right) \left( 1 - b_2 r_2^{k} + O(r_2^{2k}) \right), \\
\beta_k - \lambda_1 &\to \lambda_1 b_2 (r_2 - 1) r_2^{k}.
\end{split} \tag{8.2.9}
$$

This is linear convergence with factor $r_2$:

$$\frac{\beta_{k+1} - \lambda_1}{\beta_{k}-\lambda_1} \rightarrow r_2 = \frac{\lambda_2}{\lambda_1} \quad \text{as } k\rightarrow \infty. \tag{8.2.10}$$

Example 8.2.2 (Convergence of power iteration; Julia version)
We will experiment with the power iteration on a 5×5 matrix with prescribed eigenvalues and dominant eigenvalue at 1.
λ = [1, -0.75, 0.6, -0.4, 0]
# Make a triangular matrix with eigenvalues on the diagonal.
A = triu(ones(5, 5), 1) + diagm(λ)
We run the power iteration 60 times. The best estimate of the dominant eigenvalue is the last entry of the first output.
β, x = FNC.poweriter(A, 60)
eigval = β[end]
We check for linear convergence using a log-linear plot of the error.
err = @. 1 - β
plot(0:59, abs.(err), m=:o, title="Convergence of power iteration",
xlabel=L"k", yaxis=(L"|\lambda_1-\beta_k|", :log10, [1e-10, 1]))
The asymptotic trend seems to be a straight line, consistent with linear convergence. To estimate the convergence rate, we look at the ratio of two consecutive errors in the linear part of the convergence curve. The ratio of the first two eigenvalues should match the observed rate.
@show theory = λ[2] / λ[1];
@show observed = err[40] / err[39];
Note that the error is supposed to change sign on each iteration. The effect of these alternating signs is that estimates oscillate around the exact value.
In practical situations, we don’t know the exact eigenvalue that the algorithm is supposed to find. In that case we would base errors on the final β that was found, as in the following plot.
err = @. β[end] - β[1:end-1]
plot(0:58, abs.(err), m=:o, title="Convergence of power iteration",
xlabel=L"k", yaxis=(L"|\beta_{60}-\beta_k|", :log10, [1e-10, 1]))
The results are very similar until the last few iterations, when the limited accuracy of the reference value begins to show. That is, while it is a good estimate of λ 1 \lambda_1 λ 1 , it is less good as an estimate of the error in nearby estimates.
Example 8.2.2 (MATLAB version). We will experiment with the power iteration on a 5×5 matrix with prescribed eigenvalues and dominant eigenvalue at 1.
ev = [1, -0.75, 0.6, -0.4, 0];
% Make a triangular matrix with eigenvalues on the diagonal.
A = triu(ones(5, 5), 1) + diag(ev);
We run the power iteration 60 times. The best estimate of the dominant eigenvalue is the last entry of the first output.
[beta, x] = poweriter(A, 60);
format long
beta(1:12)
We check for linear convergence using a log-linear plot of the error.
err = 1 - beta;
clf, semilogy(abs(err), '.-')
title('Convergence of power iteration')
xlabel('k'), ylabel('|\lambda_1 - \beta_k|');
The asymptotic trend seems to be a straight line, consistent with linear convergence. To estimate the convergence rate, we look at the ratio of two consecutive errors in the linear part of the convergence curve. The ratio of the first two eigenvalues should match the observed rate.
theory = ev(2) / ev(1)
observed = err(40) / err(39)
Note that the error is supposed to change sign on each iteration. The effect of these alternating signs is that estimates oscillate around the exact value.
In practical situations, we don’t know the exact eigenvalue that the algorithm is supposed to find. In that case we would base errors on the final β that was found, as in the following plot.
err = beta(end) - beta(1:end-1);
semilogy(abs(err), '.-')
title('Convergence of power iteration')
xlabel('k'), ylabel('|\beta_{60} - \beta_k|');
The results are very similar until the last few iterations, when the limited accuracy of the reference value begins to show. That is, while it is a good estimate of λ 1 \lambda_1 λ 1 , it is less good as an estimate of the error in nearby estimates.
Example 8.2.2 (Python version). We will experiment with the power iteration on a 5×5 matrix with prescribed eigenvalues and dominant eigenvalue at 1.
ev = [1, -0.75, 0.6, -0.4, 0]
A = triu(ones([5, 5]), 1) + diag(ev) # triangular matrix, eigs on diagonal
We run the power iteration 60 times. The first output should be a sequence of estimates converging to the dominant eigenvalue, which in this case we set up to be 1.
beta, x = FNC.poweriter(A, 60)
print(beta)
We check for linear convergence using a log-linear plot of the error.
err = 1 - beta
semilogy(arange(60), abs(err), "-o")
ylim(1e-10, 1)
xlabel("$k$")
ylabel("$|\\lambda_1 - \\beta_k|$")
title("Convergence of power iteration");
The asymptotic trend seems to be a straight line, consistent with linear convergence. To estimate the convergence rate, we look at the ratio of two consecutive errors in the linear part of the convergence curve. The ratio of the first two eigenvalues should match the observed rate.
print(f"theory: {ev[1] / ev[0]:.5f}")
print(f"observed: {err[40] / err[39]:.5f}")
Note that the error is supposed to change sign on each iteration. The effect of these alternating signs is that estimates oscillate around the exact value.
In practical situations, we don’t know the exact eigenvalue that the algorithm is supposed to find. In that case we would base errors on the final β that was found, as in the following plot.
err = beta[-1] - beta[:-1]
semilogy(arange(59), abs(err), "-o")
ylim(1e-10, 1), xlabel("$k$")
ylabel("$|\\beta_{60} - \\beta_k|$")
title("Convergence of power iteration");
The results are very similar until the last few iterations, when the limited accuracy of the reference value begins to show. That is, while it is a good estimate of λ 1 \lambda_1 λ 1 , it is less good as an estimate of the error in nearby estimates.
The practical utility of (8.2.10) is limited because if we knew λ 1 \lambda_1 λ 1 and λ 2 \lambda_2 λ 2 , we wouldn’t be running the power iteration in the first place! Sometimes it’s possible to find estimates of or bounds on the ratio. If nothing else, though, it is useful to know that linear convergence is expected at a rate based solely on the dominant eigenvalues.
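For instance, with $\lambda_2/\lambda_1 = -0.75$ as in Example 8.2.2, each iteration shrinks the error by a factor of about 0.75 in magnitude, so driving the error down to about $10^{-10}$ takes roughly $\log(10^{-10})/\log(0.75) \approx 80$ iterations.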
8.2.4 Exercises

⌨ Use Function 8.2.2 to perform 20 power iterations for the following matrices. Quantitatively compare the observed convergence to the prediction in (8.2.10).

(a) $\mathbf{A} = \begin{bmatrix} 1.1 & 1 \\ 0.1 & 2.4 \end{bmatrix}$

(b) $\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}$

(c) $\mathbf{A} = \begin{bmatrix} 6 & 5 & 4 \\ 5 & 4 & 3 \\ 4 & 3 & 2 \end{bmatrix}$

(d) $\mathbf{A} = \begin{bmatrix} 8 & -14 & 0 & -14 \\ -8 & 1 & 1 & 1 \\ -4 & -2 & 0 & 2 \\ 8 & -7 & -1 & -7 \end{bmatrix}$
✍ Describe what happens during power iteration using the matrix $\mathbf{A}= \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ and initial vector $\mathbf{x}=\begin{bmatrix} 0.4 \\ 0.7 \end{bmatrix}$. Does the algorithm converge to an eigenvector? How does this relate to (8.2.3)?
⌨ In Exercise 2.3.5 we considered a mass-lumped model of a hanging string that led to a tridiagonal system of linear equations. Then, in Exercise 7.2.6, we found that eigenvectors of the same matrix correspond to vibrational modes of the string. The same setup can be applied to a membrane hanging from a square frame. Lumping the mass onto a Cartesian grid, each node interacts with its four neighbors to the north, south, east, and west. If $n$ masses are used in each coordinate direction, we get an $n^2\times n^2$ sparse matrix $\mathbf{A}$ that can be constructed by FNC.poisson(n).

(a) Let $n=10$ and make a spy plot of $\mathbf{A}$. What is the density of $\mathbf{A}$? Most rows have the same number of nonzeros; find this number.

(b) Find the dominant $\lambda_1$ using eigs for $n=10,15,20,25$.

(c) For each $n$ in part (b), apply 100 steps of Function 8.2.2. On one graph, plot the four convergence curves $|\beta_k-\lambda_1|$ using a semi-log scale. (They will not be smooth curves, because this matrix has many repeated eigenvalues that complicate the convergence analysis.)
⌨ Copy the instructions from Exercise 8.1.5 to obtain a large, sparse matrix $\mathbf{A}$. Use Function 8.2.2 to find the leading eigenvalue of $\mathbf{A}^T\mathbf{A}$ to at least six significant digits.
⌨ For symmetric matrices, the Rayleigh quotient (7.4.5) converts an $O(\epsilon)$ eigenvector estimate into an $O(\epsilon^2)$ eigenvalue estimate. Duplicate Function 8.2.2 and rename it powersym. Modify the new function to use the Rayleigh quotient to produce the entries of β. Your function should not introduce any additional matrix-vector multiplications. Apply both the original Function 8.2.2 and the new powersym to the matrix matrixdepot("fiedler", 100), plotting the convergence curves on one graph.