The SVD has another important property that proves very useful in a variety of applications. Let $\mathbf{A}$ be a real $m\times n$ matrix with SVD $\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T$ and (momentarily) $m\ge n$. Another way of writing the thin form of the SVD is

$$
\begin{split}
\mathbf{A} = \hat{\mathbf{U}}\hat{\mathbf{S}}\mathbf{V}^T &=
\begin{bmatrix}
\mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n
\end{bmatrix}
\begin{bmatrix}
\sigma_1 & & \\
& \ddots & \\
& & \sigma_n
\end{bmatrix}
\begin{bmatrix}
\mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T
\end{bmatrix} \\
&=
\begin{bmatrix}
\sigma_1\mathbf{u}_1 & \cdots & \sigma_n\mathbf{u}_n
\end{bmatrix}
\begin{bmatrix}
\mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T
\end{bmatrix} \\
&= \sigma_1 \mathbf{u}_{1}\mathbf{v}_{1}^T + \cdots + \sigma_r \mathbf{u}_{r}\mathbf{v}_{r}^T
= \sum_{i=1}^r \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T,
\end{split}
\tag{7.5.1}
$$

where $r$ is the rank of $\mathbf{A}$. The final formula also holds for the case $m<n$.
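To make the outer-product form concrete, here is a minimal sketch (our addition, not from the text) that rebuilds a small random matrix from the rank-1 layers of its thin SVD; the matrix size is an arbitrary choice.

using LinearAlgebra

# Reconstruct a matrix from the rank-1 outer products of its thin SVD.
A = rand(6, 4)
U, σ, V = svd(A)                  # thin SVD: U is 6×4, σ has 4 entries
r = rank(A)                       # 4 for a random matrix, almost surely

B = sum(σ[i] * U[:, i] * V[:, i]' for i in 1:r)
@show norm(A - B)                 # difference should be at roundoff level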
Each outer product $\mathbf{u}_{i}\mathbf{v}_{i}^T$ is a rank-1 matrix of unit 2-norm. Thanks to the ordering of the singular values, Equation (7.5.1) expresses $\mathbf{A}$ as a sum of decreasingly important contributions. This motivates the definition, for $1\le k \le r$,

$$
\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T,
\tag{7.5.2}
$$

where $\mathbf{U}_k$ and $\mathbf{V}_k$ are the first $k$ columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, and $\mathbf{S}_k$ is the upper-left $k\times k$ submatrix of $\mathbf{S}$.
The rank of a sum of matrices is always less than or equal to the sum of the ranks, so $\mathbf{A}_k$ has rank at most $k$; it is a rank-$k$ approximation to $\mathbf{A}$. It turns out that $\mathbf{A}_k$ is the best rank-$k$ approximation of $\mathbf{A}$, as measured in the matrix 2-norm, and that the approximation error is $\|\mathbf{A}-\mathbf{A}_k\|_2=\sigma_{k+1}$.
Proof (part 1 only). Note that (7.5.2) is identical to (7.5.1) with $\sigma_{k+1},\ldots,\sigma_r$ all set to zero. This implies that

$$
\mathbf{A} - \mathbf{A}_k = \mathbf{U}(\mathbf{S}-\hat{\mathbf{S}})\mathbf{V}^T,
$$

where $\hat{\mathbf{S}}$ is $\mathbf{S}$ with those same values of $\sigma_i$ replaced by zero. But that makes the right-hand side an SVD of $\mathbf{A} - \mathbf{A}_k$, with singular values $0,\ldots,0,\sigma_{k+1},\ldots,\sigma_r$, the largest of which is $\sigma_{k+1}$. That proves the first claim.
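As a quick numerical sanity check (our addition, not part of the text), the following sketch builds $\mathbf{A}_k$ for an arbitrary random matrix and confirms that the 2-norm error equals $\sigma_{k+1}$; the sizes and the value of $k$ are arbitrary, and `opnorm` computes the matrix 2-norm.

using LinearAlgebra

# Verify ‖A - A_k‖₂ = σ_{k+1} on a small random example.
A = rand(8, 5)
U, σ, V = svd(A)
k = 2
Ak = U[:, 1:k] * Diagonal(σ[1:k]) * V[:, 1:k]'   # rank-k truncation
@show opnorm(A - Ak)    # matrix 2-norm of the error
@show σ[k+1]            # should match the line above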
7.5.1 Compression
If the singular values of $\mathbf{A}$ decrease sufficiently rapidly, then $\mathbf{A}_{k}$ may capture the most significant behavior of the matrix for a reasonably small value of $k$.
Image compression (Julia)
We make an image from some text, then reload it as a matrix.
using Plots, Images

# Render some text into a figure and save it as a PNG file.
plot(annotations=(0.5, 0.5, text("Hello world", 44, :center, :center)),
    grid=:none, frame=:none, size=(400, 150))
savefig("hello.png")

# Reload the image and convert its pixels to grayscale intensities in [0, 1].
img = load("hello.png")
A = @. Float64(Gray(img))
Gray.(A)
Next we show that the singular values decrease until they reach zero (more precisely, until they are about $\epsilon_\text{mach}$ times the norm of the matrix) at around $k=45$.
using LinearAlgebra, LaTeXStrings

# Compute the SVD and plot the singular values on a log scale.
U, σ, V = svd(A)
scatter(σ, xaxis=(L"i"), yaxis=(:log10, L"\sigma_i"),
    title="Singular values")
The rapid decrease suggests that we can get fairly good low-rank approximations.
plt = plot(layout=(2, 2), frame=:none, aspect_ratio=1, titlefontsize=10)
for i in 1:4
    k = 3i
    Ak = U[:, 1:k] * diagm(σ[1:k]) * V[:, 1:k]'   # rank-k approximation of A
    plot!(Gray.(Ak), subplot=i, title="rank = $k")
end
plt
Consider how little data is needed to reconstruct these images. For rank-9, for instance, we have 9 left and right singular vectors plus 9 singular values, for a compression ratio of better than 12:1.
m, n = size(A)
compression = m * n / (9 * (m + n + 1))   # entries in A versus entries in the rank-9 factors
Image compression (MATLAB)
We make an image from some text, then reload it as a matrix.
clf
tobj = text(0, 0,'Hello world','fontsize',44);
ex = get(tobj, 'extent');
axis([ex(1) ex(1) + ex(3) ex(2) ex(2) + ex(4)]), axis off
exportgraphics(gca, 'hello.png', resolution=300)
A = imread('hello.png');
A = double(im2gray(A));
size_A = size(A)
Next we show that the singular values decrease until they reach zero (more precisely, until they are about $\epsilon_\text{mach}$ times the norm of the matrix) at around $k=100$.
[U, S, V] = svd(A);
sigma = diag(S);
semilogy(sigma, '.')
title('singular values'), axis tight
xlabel('i'), ylabel('\sigma_i')
r = find(sigma / sigma(1) > 10*eps, 1, 'last')
The rapid decrease suggests that we can get fairly good low-rank approximations.
for i = 1:4
    subplot(2, 2, i)
    k = 2*i;
    Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';
    imshow(Ak, [0, 255])
    title(sprintf('rank = %d', k))
end
Consider how little data is needed to reconstruct these images. For rank-8, for instance, we have 8 left and right singular vectors plus 8 singular values.
[m, n] = size(A);
full_size = m * n;
compressed_size = 8 * (m + n + 1);
fprintf("compression ratio: %.1f", full_size / compressed_size)
Image compression (Python)
We make an image from some text, then reload it as a matrix.
from numpy import *
from matplotlib.pyplot import *
from skimage.color import rgb2gray   # assumed source of rgb2gray

# Render some text into a figure and save it as a PNG file.
text(
    0.5,
    0.5,
    "Hello world",
    dict(fontsize=44),
    horizontalalignment="center",
    verticalalignment="center",
)
axis("off")
savefig("hello.png")

# Reload the image (dropping any alpha channel) and convert it to grayscale.
img = imread("hello.png")[:, :, :3]
A = rgb2gray(img)
print(f"image of size {A.shape}")
Next we show that the singular values decrease until they reach zero (more precisely, until they are about $\epsilon_\text{mach}$ times the norm of the matrix) at around index 38.
from numpy.linalg import svd
U, sigma, Vt = svd(A)
semilogy(sigma, "o")
title("Singular values")
xlabel("$i$"), ylabel("$\\sigma_i$");
significant = sigma / sigma[0] > 10 * 2**-52
print(f"last significant singular value at index {max(where(significant)[0])}")
The rapid decrease suggests that we can get fairly good low-rank approximations.
for k in range(4):
    r = 2 + 2 * k
    Ak = U[:, :r] @ diag(sigma[:r]) @ Vt[:r, :]
    subplot(2, 2, k + 1)
    imshow(Ak, cmap="gray", clim=(0.0, 1.0))
    title(f"rank = {r}")
    xticks([]), yticks([])
Consider how little data is needed to reconstruct these images. For rank-8, for instance, we have 8 left and right singular vectors plus 8 singular values.
m, n = A.shape
full_size = m * n
compressed_size = 8 * (m + n + 1)
print(f"compression ratio: {full_size / compressed_size:.1f}")
7.5.2 Capturing major trends
The use of dimension reduction offered by low-rank SVD approximation goes well beyond simply reducing computation time. By isolating the most important contributions to the matrix, dimension reduction can uncover deep connections and trends that are otherwise obscured by weaker effects and noise.
One useful way to quantify the decay in the singular values is to compute
$$
s_k = \sum_{i=1}^k \sigma_i^2, \qquad \tau_k = \frac{s_k}{s_r}, \qquad k=1,\ldots,r.
\tag{7.5.4}
$$

Clearly $0\le \tau_k \le 1$, and $\tau_k$ is non-decreasing as a function of $k$. We can think of $\tau_k$ as the fraction of energy (or, in statistical terms, variance) contained in the singular values up to and including the $k$th.[1]
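As a small illustration (our addition, not from the text), the following sketch computes $\tau_k$ for a test matrix whose singular values decay quickly and finds the smallest $k$ capturing 99% of the energy; the test matrix and the 99% threshold are arbitrary choices.

using LinearAlgebra

# Compute τ_k for a matrix whose singular values decay rapidly.
A = [1 / (i + j) for i in 1:40, j in 1:40]
σ = svdvals(A)
τ = cumsum(σ.^2) / sum(σ.^2)

# Smallest k that captures at least 99% of the energy.
k99 = findfirst(τ .>= 0.99)
println("99% of the energy is captured by k = $k99")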
Dimension reduction in voting records (Julia)
This matrix describes the votes on bills in the 111th session of the United States Senate. (The data set was obtained from https://voteview.com.) Each row is one senator, and each column is a vote item.
If we visualize the votes (yellow is “yea,” blue is “nay”), we can see great similarity between many rows, reflecting party unity.
using JLD2
@load "voting.jld2" A    # assumes the votes matrix is stored in voting.jld2
heatmap(A, color=:viridis,
    title="Votes in 111th U.S. Senate", xlabel="bill", ylabel="senator")
We use (7.5.4) to quantify the decay rate of the values.
U, σ, V = svd(A)
τ = cumsum(σ .^ 2) / sum(σ .^ 2)
scatter(τ[1:16], xaxis=("k"), yaxis=(L"\tau_k"),
title="Fraction of singular value energy")
The first and second singular triples contain about 58% and 17%, respectively, of the energy of the matrix. All others have far less effect, suggesting that the information is primarily two-dimensional. The first left and right singular vectors also contain interesting structure.
scatter(U[:, 1], label="", layout=(1, 2),
xlabel="senator", title="left singular vector")
scatter!(V[:, 1], label="", subplot=2,
xlabel="bill", title="right singular vector")
Both vectors have values greatly clustered near $\pm C$ for a constant $C$. These can be roughly interpreted as how partisan a particular senator or bill was, and for which political party. Projecting the senators’ vectors into the first two $\mathbf{V}$-coordinates gives a particularly nice way to reduce them to two dimensions. Political scientists label these dimensions partisanship and bipartisanship. Here we color them by actual party affiliation (also given in the data file): red for Republican, blue for Democrat, and yellow for independent.
x1 = A * V[:, 1];
x2 = A * V[:, 2];
@load "voting.jld2" Rep Dem Ind
Rep = vec(Rep);
Dem = vec(Dem);
Ind = vec(Ind);
scatter(x1[Dem], x2[Dem], color=:blue, label="D",
xaxis=("partisanship"), yaxis=("bipartisanship"), title="111th US Senate by voting record")
scatter!(x1[Rep], x2[Rep], color=:red, label="R")
scatter!(x1[Ind], x2[Ind], color=:yellow, label="I")
Dimension reduction in voting records (MATLAB)
This matrix describes the votes on bills in the 111th session of the United States Senate. (The data set was obtained from https://voteview.com.) Each row is one senator, and each column is a vote item.
If we visualize the votes (yellow is “yea,” blue is “nay”), we can see great similarity between many rows, reflecting party unity.
load voting.mat    % assumed data file containing A and the index vectors Dem, Rep, Ind
clf
imagesc(A)
colormap parula
title('Votes in 111th U.S. Senate')
ylabel('senator'), xlabel('bill')
We use (7.5.4) to quantify the decay rate of the values.
[U, S, V] = svd(A);
sigma = diag(S);
tau = cumsum(sigma.^2) / sum(sigma.^2);
plot(tau(1:16), 'o')
xlabel('k'), ylabel('\tau_k')
title('Fraction of singular value energy')
The first and second singular triples contain about 58% and 17%, respectively, of the energy of the matrix. All others have far less effect, suggesting that the information is primarily two-dimensional. The first left and right singular vectors also contain interesting structure.
subplot(211), plot(U(:, 1), '.')
xlabel('senator number'), title('left singular vector')
subplot(212), plot(V(:, 1), '.')
xlabel('bill number'), title('right singular vector')
Both vectors have values greatly clustered near $\pm C$ for a constant $C$. These can be roughly interpreted as how partisan a particular senator or bill was, and for which political party. Projecting the senators’ vectors into the first two $\mathbf{V}$-coordinates gives a particularly nice way to reduce them to two dimensions. Political scientists label these dimensions partisanship and bipartisanship. Here we color them by actual party affiliation (also given in the data file): red for Republican, blue for Democrat, and yellow for independent.
clf
x1 = V(:, 1)'*A'; x2 = V(:, 2)'*A';
scatter(x1(Dem), x2(Dem), 20, 'b'), hold on
scatter(x1(Rep), x2(Rep), 20, 'r')
scatter(x1(Ind), x2(Ind), 20, 'k')
xlabel('partisanship'), ylabel('bipartisanship')
title('111th US Senate in 2D')
Dimension reduction in voting records (Python)
This matrix describes the votes on bills in the 111th session of the United States Senate. (The data set was obtained from https://voteview.com.) Each row is one senator, and each column is a vote item.
from scipy.io import loadmat
vars = loadmat("voting.mat")
A = vars["A"]
m, n = A.shape
print("size:", (m, n))
If we visualize the votes (yellow is “yea,” blue is “nay”), we can see great similarity between many rows, reflecting party unity.
imshow(A, cmap="viridis")
xlabel("bill")
ylabel("senator")
title("Votes in 111th U.S. Senate");
We use (7.5.4) to quantify the decay rate of the values.
U, sigma, Vt = svd(A)
tau = cumsum(sigma**2) / sum(sigma**2)
plot(range(1, 17), tau[:16], "o")
xlabel("$k$")
ylabel(r"$\tau_k$")    # raw string so that \t is not read as a tab character
title("Fraction of singular value energy");
The first and second singular triples contain about 58% and 17%, respectively, of the energy of the matrix. All others have far less effect, suggesting that the information is primarily two-dimensional. The first left and right singular vectors also contain interesting structure.
subplot(1, 2, 1)
plot(U[:, 0], "o")
xlabel("senator"),title("left singular vector")
subplot(1, 2, 2)
plot(Vt[0, :], "o")
xlabel("bill"), title("right singular vector");
Both vectors have values greatly clustered near $\pm C$ for a constant $C$. These can be roughly interpreted as how partisan a particular senator or bill was, and for which political party. Projecting the senators’ vectors into the first two $\mathbf{V}$-coordinates gives a particularly nice way to reduce them to two dimensions. Political scientists label these dimensions partisanship and bipartisanship. Here we color them by actual party affiliation (also given in the data file): red for Republican, blue for Democrat, and yellow for independent.
x1 = sigma[0] * U[:, 0]
x2 = sigma[1] * U[:, 1]
Rep = vars["Rep"] - 1
Dem = vars["Dem"] - 1
Ind = vars["Ind"] - 1
scatter(x1[Dem], x2[Dem], color="blue", label="D")
scatter(x1[Rep], x2[Rep], color="red", label="R")
scatter(x1[Ind], x2[Ind], color="darkorange", label="I")
xlabel("partisanship")
ylabel("bipartisanship")
legend()
title("111th US Senate in 2D");
Not all data sets can be reduced effectively to a small number of dimensions, but as Demo 7.5.2 illustrates, in some cases reduction reveals information that corresponds to real-world understanding.
7.5.3 Exercises
✍ Suppose that $\mathbf{A}$ is an $n\times n$ matrix. Explain why $\sigma_n$ is the distance (in 2-norm) from $\mathbf{A}$ to the set of all singular matrices.
✍ Suppose $\mathbf{A}$ is a $7\times 4$ matrix and the eigenvalues of $\mathbf{A}^*\mathbf{A}$ are 3, 4, 7, and 10. How close is $\mathbf{A}$ in the 2-norm to (a) a rank-3 matrix? (b) a rank-2 matrix?
(a) ⌨ Find the rank-1 matrix closest to

$$
\mathbf{A}=\begin{bmatrix} 1 & 5 \\ 5 & 1 \end{bmatrix},
$$

as measured in the 2-norm.

(b) ⌨ Repeat part (a) for

$$
\mathbf{A}=\begin{bmatrix} 1 & 5 \\ 0 & 1 \end{bmatrix}.
$$

✍ Find the rank-1 matrix closest to

$$
\mathbf{A}=\begin{bmatrix} 1 & b \\ b & 1 \end{bmatrix},
$$

as measured in the 2-norm, where $b>0$.
⌨ Following Demo 7.5.1 as a guide, load the “mandrill” test image and convert it to a matrix of floating-point pixel grayscale intensities. Using the SVD, display as images the best approximations of rank 5, 10, 15, and 20.