
Commit 6e53bf3

Tom's March 10 edits of svd lecture
1 parent 6913d5a commit 6e53bf3

File tree

1 file changed: +65 -54 lines changed


lectures/svd_intro.md

Lines changed: 65 additions & 54 deletions
@@ -32,15 +32,15 @@ import pandas as pd
3232
## Overview
3333

3434
The **singular value decomposition** is a work-horse in applications of least squares projection that
35-
form the backbone of important parts of modern machine learning methods.
35+
form a foundation for some important machine learning methods.
3636

3737
This lecture describes the singular value decomposition and two of its uses:
3838

3939
* principal components analysis (PCA)
4040

4141
* dynamic mode decomposition (DMD)
4242

43-
Each of these can be thought of as data-reduction methods that are designed to capture salient patterns in data by projecting data onto a limited set of factors.
43+
Each of these can be thought of as a data-reduction procedure designed to capture salient patterns by projecting data onto a limited set of factors.
4444

4545
## The Setup
4646

@@ -64,7 +64,7 @@ We'll be interested in two cases
6464

6565
We'll apply a **singular value decomposition** of $X$ in both situations.
6666

67-
In the first case in which there are many more observations $n$ than random variables $m$, we learn about the joint distribution of the random variables by taking averages across observations of functions of the observations.
67+
In the first case in which there are many more observations $n$ than random variables $m$, we learn about a joint distribution by taking averages across observations of functions of the observations.
6868

6969
Here we'll look for **patterns** by using a **singular value decomposition** to do a **principal components analysis** (PCA).
7070

@@ -102,42 +102,42 @@ $U_{ij}^T$ is the complex conjugate of $U_{ji}$.
102102

103103
* Similarly, when $V$ is a complex valued matrix, $V^T$ denotes the **conjugate-transpose** or **Hermitian-transpose** of $V$
104104

105-
The shapes of $U$, $\Sigma$, and $V$ are $\left(m, m\right)$, $\left(m, n\right)$, $\left(n, n\right)$, respectively.
105+
In what is called a **full** SVD, the shapes of $U$, $\Sigma$, and $V$ are $\left(m, m\right)$, $\left(m, n\right)$, $\left(n, n\right)$, respectively.
106106

107-
Below, we shall assume these shapes.
108107

109-
The above description corresponds to a standard shape convention often called a **full** SVD.
110108

111-
There is an alternative shape convention called **economy** or **reduced** SVD that we could have used, and will sometimes use below.
109+
There is also an alternative shape convention called an **economy** or **reduced** SVD.
112110

113111
Thus, note that because we assume that $A$ has rank $r$, there are only $r$ nonzero singular values, where $r=\textrm{rank}(A)\leq\min\left(m, n\right)$.
114112

115-
Therefore, we could also write $U$, $\Sigma$, and $V$ as matrices with shapes $\left(m, r\right)$, $\left(r, r\right)$, $\left(r, n\right)$.
113+
A **reduced** SVD uses this fact to express $U$, $\Sigma$, and $V$ as matrices with shapes $\left(m, r\right)$, $\left(r, r\right)$, $\left(r, n\right)$.
116114

117-
Sometimes, we will choose the former convention.
115+
Sometimes, we will use a full SVD.
118116

119-
At other times, we'll use the latter convention in which $\Sigma$ is an $r \times r$ diagonal matrix.
120-
121-
Also, when we discuss the **dynamic mode decomposition** below, we'll use a special case of the latter convention in which it is understood that
122-
$r$ is just a pre-specified small number of leading singular values that we think capture the most interesting dynamics.
117+
At other times, we'll use a reduced SVD in which $\Sigma$ is an $r \times r$ diagonal matrix.
123118

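To fix ideas, here is a small sketch (the rank-2 matrix is an illustration of our own choosing) that counts the positive singular values returned by `np.linalg.svd` and forms the reduced factors by keeping only the leading $r$ columns of $U$, entries of $\Sigma$, and rows of $V^T$:

```{code-cell} ipython3
import numpy as np

# build a 5 x 4 matrix of rank r = 2 as a sum of two outer products
np.random.seed(0)
A = np.outer(np.random.randn(5), np.random.randn(4)) \
    + np.outer(np.random.randn(5), np.random.randn(4))

U, S, Vt = np.linalg.svd(A)                       # full SVD: U is 5 x 5, Vt is 4 x 4
r = int(np.sum(S > 1e-10))                        # number of positive singular values = rank
print("rank r =", r)

# reduced SVD: keep the r leading columns of U, entries of S, and rows of Vt
U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r, :]
print(U_r.shape, S_r.shape, Vt_r.shape)           # (5, 2) (2,) (2, 4)
print(np.allclose(A, U_r @ np.diag(S_r) @ Vt_r))  # True: the reduced SVD reproduces A
```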
124119
## Digression: Polar Decomposition
125120

126-
Through the following identities, the singular value decomposition (SVD) is related to the **polar decomposition** of $X$
121+
A singular value decomposition (SVD) is related to the **polar decomposition** of $X$
122+
123+
$$
124+
X = SQ
125+
$$
126+
127+
where
127128

128129
\begin{align*}
129-
X & = SQ \cr
130-
S & = U\Sigma U^T \cr
130+
S & = U\Sigma U^T \cr
131131
Q & = U V^T
132132
\end{align*}
133133

134-
where $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
134+
and $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
135135

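As a quick numerical check, the sketch below uses an illustrative square $X$ (so that the shape of $U\Sigma U^T$ conforms) to build $S$ and $Q$ from a full SVD and verify that $X = SQ$ with $S$ symmetric and $Q$ orthogonal:

```{code-cell} ipython3
import numpy as np

np.random.seed(1)
X = np.random.randn(4, 4)                  # an illustrative square X

U, sig, Vt = np.linalg.svd(X)
S = U @ np.diag(sig) @ U.T                 # symmetric factor S = U Sigma U^T
Q = U @ Vt                                 # orthogonal factor Q = U V^T

print(np.allclose(X, S @ Q))               # X = S Q
print(np.allclose(S, S.T))                 # S is symmetric
print(np.allclose(Q @ Q.T, np.eye(4)))     # Q is orthogonal
```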
136136
## Principal Components Analysis (PCA)
137137

138138
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.
139139

140-
The data matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m > > n $ to be discussed later.
140+
The matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.
141141

142142
We regard $X$ as an $m \times n$ matrix of **data**:
143143

@@ -151,7 +151,7 @@ In a **time series** setting, we would think of columns $j$ as indexing differen
151151

152152
In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **random variables**.
153153

154-
The number of singular values equals the rank of matrix $X$.
154+
The number of positive singular values equals the rank of matrix $X$.
155155

156156
Arrange the singular values in decreasing order.
157157

@@ -189,15 +189,21 @@ $$ (eq:PCA2)
189189
Here is how we would interpret the objects in the matrix equation {eq}`eq:PCA2` in
190190
a time series context:
191191
192-
* $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th principal component
192+
* $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th **principal component**
193193
194194
* $U_k = \begin{bmatrix}U_{1k}\\U_{2k}\\\ldots\\U_{mk}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, m$
195-
is a vector of loadings of variables $X_i$ on the $k$th principle component, $i=1, \ldots, m$
195+
is a vector of **loadings** of variables $X_i$ on the $k$th principal component, $i=1, \ldots, m$
196196
197197
* $\sigma_k$ for each $k=1, \ldots, r$ is the strength of the $k$th **principal component**
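To make these objects concrete, here is a small sketch with an illustrative de-meaned data matrix (3 variables, 8 observations; the numbers are arbitrary) that reads off the first principal component's series, loadings, and strength from a reduced SVD:

```{code-cell} ipython3
import numpy as np

np.random.seed(2)
m, n = 3, 8
X = np.random.randn(m, n)
X = X - X.mean(axis=1, keepdims=True)            # de-mean each variable (row)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

k = 0                                            # first principal component
print("strength  sigma_1 =", sigma[k])
print("loadings  U_1     =", U[:, k])            # how each variable loads on component 1
print("series    V_1^T   =", Vt[k, :])           # the time series of component 1
```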
198198
199199
## Reduced Versus Full SVD
200200
201+
Earlier, we mentioned **full** and **reduced** SVD's.
202+
203+
204+
You can read about reduced and full SVD here
205+
<https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html>
206+
201207
In a **full** SVD
202208
203209
* $U$ is $m \times m$
@@ -210,10 +216,10 @@ In a **reduced** SVD
210216
* $\Sigma$ is $r \times r$
211217
* $V$ is $n \times r$
212218
213-
You can read about reduced and full SVD here
214-
<https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html>
215219
216-
Let's do a couple of small experiments to see the difference
220+
Let's do some small exercises to compare **full** and **reduced** SVD's.
221+
222+
First, let's study a case in which $m = 5 > n = 2$.
217223
218224
```{code-cell} ipython3
219225
import numpy as np
@@ -238,7 +244,7 @@ an optimal reduced rank approximation of a matrix, in the sense of minimizing t
238244
norm of the discrepancy between the approximating matrix and the matrix being approximated.
239245
Optimality in this sense is established in the celebrated Eckart–Young theorem. See <https://en.wikipedia.org/wiki/Low-rank_approximation>.
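Here is a sketch of that idea with an illustrative matrix: truncating the SVD at rank $k$ and comparing the Frobenius-norm error with that of another, arbitrarily chosen, rank-$k$ matrix:

```{code-cell} ipython3
import numpy as np

np.random.seed(3)
A = np.random.randn(6, 5)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]         # rank-k truncation of the SVD
B = np.random.randn(6, k) @ np.random.randn(k, 5)   # some other rank-k matrix

print("SVD truncation error:", np.linalg.norm(A - A_k))  # equals np.sqrt((S[k:]**2).sum())
print("random rank-k error :", np.linalg.norm(A - B))    # generally larger
```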
240246
241-
Let's do another experiment now.
247+
Let's do another exercise, but now we'll set $m = 2 < 5 = n$.
242248
243249
```{code-cell} ipython3
244250
import numpy as np
@@ -265,7 +271,7 @@ Let $X_{m \times n}$ be our $m \times n$ data matrix.
265271
266272
Let's assume that sample means of all variables are zero.
267273
268-
We can make sure that this is true by **pre-processing** the data by substracting sample means appropriately.
274+
We can ensure this by **pre-processing** the data to subtract sample means.
269275
270276
Define the sample covariance matrix $\Omega$ as
271277
@@ -294,13 +300,13 @@ $$
294300
where
295301
296302
$$
297-
\epsilon\epsilon^T=\Lambda
303+
\epsilon\epsilon^T=\Lambda .
298304
$$
299305
300306
We can verify that
301307
302308
$$
303-
XX^T=P\Lambda P^T
309+
XX^T=P\Lambda P^T .
304310
$$
305311
306312
It follows that we can represent the data matrix as
@@ -314,20 +320,20 @@ X=\begin{bmatrix}X_1|X_2|\ldots|X_m\end{bmatrix} =\begin{bmatrix}P_1|P_2|\ldots|
314320
where
315321
316322
$$
317-
\epsilon\epsilon^T=\Lambda
323+
\epsilon\epsilon^T=\Lambda .
318324
$$
319325
320326
To reconcile the preceding representation with the PCA that we obtained through the SVD above, we first note that $\epsilon_j^2=\lambda_j\equiv\sigma^2_j$.
321327
322-
Now define $\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}$
328+
Now define $\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}$,
323329
which evidently implies that $\tilde{\epsilon}_j\tilde{\epsilon}_j^T=1$.
324330
325331
Therefore
326332
327333
$$
328334
\begin{aligned}
329335
X&=\sqrt{\lambda_1}P_1\tilde{\epsilon_1}+\sqrt{\lambda_2}P_2\tilde{\epsilon_2}+\ldots+\sqrt{\lambda_m}P_m\tilde{\epsilon_m}\\
330-
&=\sigma_1P_1\tilde{\epsilon_2}+\sigma_2P_2\tilde{\epsilon_2}+\ldots+\sigma_mP_m\tilde{\epsilon_m}
336+
&=\sigma_1P_1\tilde{\epsilon_1}+\sigma_2P_2\tilde{\epsilon_2}+\ldots+\sigma_mP_m\tilde{\epsilon_m} ,
331337
\end{aligned}
332338
$$
333339
@@ -345,7 +351,7 @@ provided that we set
345351
346352
Since there are several possible ways of computing $P$ and $U$ for a given data matrix $X$, depending on the algorithms used, we might encounter sign differences or different orderings of the eigenvectors.
347353
348-
We resolve such ambiguities about $U$ and $P$ by
354+
We can resolve such ambiguities about $U$ and $P$ by
349355
350356
1. sorting eigenvalues and singular values in descending order
351357
2. imposing positive diagonals on $P$ and $U$ and adjusting signs in $V^T$ accordingly
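One way to implement step 2 for the SVD factors is sketched below (the helper name `fix_signs` is ours, and the same trick applies to $P$): flip the sign of any column of $U$ whose diagonal entry is negative and compensate in the corresponding row of $V^T$ so that $U \Sigma V^T$ is unchanged.

```{code-cell} ipython3
import numpy as np

def fix_signs(U, Vt):
    """Flip signs so that the diagonal of U is nonnegative, adjusting rows of Vt
    so that the product U @ diag(S) @ Vt is unchanged."""
    flip = np.sign(np.diag(U))
    flip[flip == 0] = 1
    return U * flip, Vt * flip[:, None]

np.random.seed(4)
X = np.random.randn(3, 6)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U2, Vt2 = fix_signs(U, Vt)

print(np.all(np.diag(U2) >= 0))                 # nonnegative diagonal
print(np.allclose(U2 @ np.diag(S) @ Vt2, X))    # the product is unchanged
```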
@@ -562,7 +568,7 @@ def compare_pca_svd(da):
562568
563569
We turn to the case in which $m >> n$, so that an $m \times n$ data matrix $\tilde X$ contains many more random variables $m$ than observations $n$.
564570
565-
This is the **tall and skinny** case associated with **Dynamic Mode Decomposition**.
571+
This **tall and skinny** case is associated with **Dynamic Mode Decomposition**.
566572
567573
You can read about Dynamic Mode Decomposition here {cite}`DMD_book`.
568574
@@ -593,11 +599,11 @@ $$
593599

594600
Here $'$ does not denote matrix transposition but instead is part of the name of the matrix $X'$.
595601

596-
In forming $ X$ and $X'$, we have in each case dropped a column from $\tilde X$, in the case of $X$ the last column, and in the case of $X'$ the first column.
602+
In forming $ X$ and $X'$, we have in each case dropped a column from $\tilde X$, the last column in the case of $X$, and the first column in the case of $X'$.
597603

598604
Evidently, $ X$ and $ X'$ are both $m \times \tilde n$ matrices where $\tilde n = n - 1$.
599605

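Concretely, here is a sketch of this construction with an illustrative tall-and-skinny data matrix; the last line previews the matrix $A = X' X^{+}$ of least-squares regression coefficients discussed below:

```{code-cell} ipython3
import numpy as np

np.random.seed(6)
m, n = 10, 6
X_tilde = np.random.randn(m, n)          # an illustrative m x n data matrix

X = X_tilde[:, :-1]                      # drop the last column
Xprime = X_tilde[:, 1:]                  # drop the first column
print(X.shape, Xprime.shape)             # both are m x (n - 1)

print(np.linalg.matrix_rank(X))          # p = n - 1 for generic data

A = Xprime @ np.linalg.pinv(X)           # least-squares coefficients A = X' X^+
print(A.shape)                           # (m, m)
```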
600-
We denote the rank of $X$ as $p \neq \min(m, \tilde n) = \tilde n$.
606+
We denote the rank of $X$ as $p \leq \min(m, \tilde n) = \tilde n$.
601607

602608
We start with a system consisting of $m$ least squares regressions of **everything** on one lagged value of **everything**:
603609

@@ -624,11 +630,11 @@ Consider the (reduced) singular value decomposition
624630
625631
626632
627-
where $U$ is $m \times p$, $\Sigma$ is a $p \times p$ diagonal matrix, and $ V^T$ is a $p \times \tilde n$ matrix.
633+
where $U$ is $m \times p$, $\Sigma$ is a $p \times p$ diagonal matrix, and $V^T$ is a $p \times \tilde n$ matrix.
628634
629635
Here $p$ is the rank of $X$, where necessarily $p \leq \tilde n$.
630636
631-
(We have described and illustrated a reduced singular value decomposition above, and compared it with a full singular value decomposition.)
637+
(We described and illustrated a **reduced** singular value decomposition above, and compared it with a **full** singular value decomposition.)
632638
633639
We could construct the generalized inverse $X^+$ of $X$ by using
634640
a singular value decomposition $X = U \Sigma V^T$ to compute
@@ -647,12 +653,11 @@ where $r < p$.
647653
648654
The idea behind **dynamic mode decomposition** is to construct this low rank approximation to $A$ that
649655
650-
* sidesteps computing the generalized inverse $X^{+}$
651656
652657
* constructs an $m \times r$ matrix $\Phi$ that captures effects on all $m$ variables of $r \leq p$ **modes** that are associated with the $r$ largest eigenvalues of $A$
653658
654659
655-
* uses $\Phi$ and powers of the $r$ largest eigenvalues of $A$ to forecast *future* $X_t$'s
660+
* uses $\Phi$, the current value of $X_t$, and powers of the $r$ largest eigenvalues of $A$ to forecast *future* $X_{t+j}$'s
656661
657662
658663
An important property of the DMD algorithm that we shall describe soon is that
@@ -680,21 +685,26 @@ $$
680685
A = X' V \Sigma^{-1} U^T
681686
$$ (eq:Aformbig)
682687
683-
where $V$ is an $\tilde n \times p$ matrix, $\Sigma^{-1}$ is a $p \times p$ matrix, and $U$ is a $p \times m$ matrix,
684-
and where $U^T U = I_p$ and $V V^T = I_m $.
688+
where $V$ is an $\tilde n \times p$ matrix, $\Sigma^{-1}$ is a $p \times p$ matrix, $U$ is an $m \times p$ matrix,
689+
and $U^T U = I_p$ and $V^T V = I_p$.
685690
686691
We use the $p$ columns of $U$, and thus the $p$ rows of $U^T$, to define a $p \times 1$ vector $\tilde X_t$ to be used in a lower-dimensional description of the evolution of the system:
687692
688693
689694
$$
690695
\tilde X_t = U^T X_t .
691-
$$
696+
$$ (eq:tildeXdef2)
692697
693-
Since $U^T U$ is a $p \times p$ identity matrix, we can recover $X_t$ from $\tilde X_t$ by using
698+
Since $U^T U$ is a $p \times p$ identity matrix, it follows from equation {eq}`eq:tildeXdef2` that we can recover $X_t$ from $\tilde X_t$ by using
694699
695700
$$
696701
X_t = U \tilde X_t .
697-
$$
702+
$$ (eq:Xdecoder)
703+
704+
705+
* Equation {eq}`eq:tildeXdef2` serves as an **encoder** that summarizes the $m \times 1$ vector $X_t$ with the $p \times 1$ vector $\tilde X_t$
706+
707+
* Equation {eq}`eq:Xdecoder` serves as a **decoder** that recovers the $m \times 1$ vector $X_t$ from the $p \times 1$ vector $\tilde X_t$
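Here is a sketch of the encoder-decoder pair, using the $U$ from a reduced SVD of an illustrative data matrix whose columns play the roles of the $X_t$'s:

```{code-cell} ipython3
import numpy as np

np.random.seed(5)
m, n_tilde = 10, 4                        # tall and skinny: many more variables than snapshots
X = np.random.randn(m, n_tilde)           # columns play the roles of X_1, ..., X_{n tilde}

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # U is m x p, here p = n_tilde
X_t = X[:, 0]                             # one m x 1 snapshot

X_tilde_t = U.T @ X_t                     # encoder: a p x 1 summary of X_t
X_t_back = U @ X_tilde_t                  # decoder: recover the m x 1 vector

print(X_tilde_t.shape)                    # (p,)
print(np.allclose(X_t_back, X_t))         # True: X_t lies in the column space of U
```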
698708
699709
The following $p \times p$ transition matrix governs the motion of $\tilde X_t$:
700710
@@ -712,10 +722,10 @@ Notice that if we multiply both sides of {eq}`eq:xtildemotion` by $U$
712722
we get
713723
714724
$$
715-
U \tilde X_t = U \tilde A \tilde X_t = U^T \tilde A U^T X_t
725+
U \tilde X_{t+1} = U \tilde A \tilde X_t = U \tilde A U^T X_t
716726
$$
717727
718-
which gives
728+
which by virtue of the decoder equation {eq}`eq:Xdecoder` recovers
719729
720730
$$
721731
X_{t+1} = A X_t .
@@ -728,9 +738,6 @@ $$
728738
### Lower Rank Approximations
729739
730740
731-
An attractive feature of **dynamic mode decomposition** is that we avoid computing the huge matrix $A = X' X^{+}$ of regression coefficients, while with low computational effort, we possibly acquire a good low-rank approximation of $A$.
732-
733-
734741
Instead of using formula {eq}`eq:Aformbig`, we'll compute the $r$ largest singular values of $X$ and form matrices $\tilde V, \tilde U$ corresponding to those $r$ singular values.
735742
736743
We'll then construct a reduced-order system of dimension $r$ by forming an $r \times r$ transition matrix
@@ -740,6 +747,8 @@ $$
740747
\tilde A = \tilde U^T A \tilde U
741748
$$ (eq:tildeA_1)
742749
750+
Here we use $\tilde U$ instead of the $U$ that we used earlier.
751+
743752
This redefined $\tilde A$ matrix governs the dynamics of a redefined $r \times 1$ vector $\tilde X_t $
744753
according to
745754
@@ -751,7 +760,7 @@ where an approximation $\check X_t$ to the original $m \times 1$ vector $X_t$
751760
the columns of $\tilde U$:
752761
753762
$$
754-
\check X_t = \tilde U \tilde X_t
763+
\check X_t = \tilde U \tilde X_t .
755764
$$
756765
757766
We'll provide a formula for $\tilde X_t$ soon.
@@ -764,7 +773,7 @@ $$
764773
$$ (eq:tildeAform)
765774
766775
767-
Next, we'll Construct an eigencomposition of $\tilde A$
776+
Next, we'll construct an eigendecomposition of the $\tilde A$ defined in equation {eq}`eq:tildeA_1`:
768777
769778
$$
770779
\tilde A W = W \Lambda
@@ -793,17 +802,17 @@ We can construct an $r \times m$ matrix generalized inverse $\Phi^{+}$ of $\Ph
793802
794803
795804
796-
We define an $ r \times 1$ initial vector $b$ of dominant modes by
805+
We define an $r \times 1$ vector $b$ of amplitudes of the $r$ modes associated with the $r$ largest singular values:
797806
798807
$$
799808
b= \Phi^{+} X_1
800809
$$ (eq:bphieqn)
801810
802811
803812
804-
**Proof of Eigenvector Sharing**
813+
**Proposition:** The $r$ columns of $\Phi$ are eigenvectors of $A$ that correspond to the $r$ largest eigenvalues of $A$.
805814
806-
From formula {eq}`eq:Phiformula` we have
815+
**Proof:** From formula {eq}`eq:Phiformula` we have
807816
808817
$$
809818
\begin{aligned}
@@ -831,6 +840,8 @@ $$
831840
832841
Thus, $\phi_i$ is an eigenvector of $A$ that corresponds to eigenvalue $\lambda_i$ of $A$.
833842
843+
This concludes the proof.
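To pull the steps together, here is a sketch of the whole procedure on an illustrative data matrix. It assumes the standard DMD formulas $\tilde A = \tilde U^T X' \tilde V \tilde \Sigma^{-1}$ for {eq}`eq:tildeAform` and $\Phi = X' \tilde V \tilde \Sigma^{-1} W$ for {eq}`eq:Phiformula`; all variable names below are ours.

```{code-cell} ipython3
import numpy as np

np.random.seed(7)
m, n = 12, 8
X_tilde = np.random.randn(m, n)               # an illustrative m x n data matrix

X, Xprime = X_tilde[:, :-1], X_tilde[:, 1:]   # drop the last / first column
r = 3                                         # number of modes to keep

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_r, S_r, V_r = U[:, :r], S[:r], Vt[:r, :].T  # keep the r largest singular values

A_tilde = U_r.T @ Xprime @ V_r @ np.diag(1 / S_r)   # r x r reduced-order transition matrix
Lam, W = np.linalg.eig(A_tilde)                     # eigenvalues and eigenvectors of A_tilde

Phi = Xprime @ V_r @ np.diag(1 / S_r) @ W     # m x r matrix of modes
b = np.linalg.pinv(Phi) @ X[:, 0]             # r x 1 vector b = Phi^+ X_1

X2_hat = Phi @ (Lam * b)                      # forecast of the second snapshot, Phi Lambda b
print(X2_hat.shape)                           # (m,) and generally complex-valued
```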
844+
834845
835846
836847
