Commit c9c8af9

Tom's April 3 edits of SVD lecture
1 parent 4471731 commit c9c8af9

File tree

1 file changed: +117 -108 lines changed


lectures/svd_intro.md

Lines changed: 117 additions & 108 deletions
@@ -147,91 +147,6 @@ When we study Dynamic Mode Decomposition below, we shall want to remember this c
147147

148148

149149

150-
151-
152-
153-
154-
155-
## Digression: Polar Decomposition
156-
157-
A singular value decomposition (SVD) is related to the **polar decomposition** of $X$
158-
159-
$$
160-
X = SQ
161-
$$
162-
163-
where
164-
165-
\begin{align*}
166-
S & = U\Sigma U^T \cr
167-
Q & = U V^T
168-
\end{align*}
169-
170-
and $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
171-
172-
## Principal Components Analysis (PCA)
173-
174-
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.
175-
176-
The matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.
177-
178-
We regard $X$ as an $m \times n$ matrix of **data**:
179-
180-
$$
181-
X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_n\end{bmatrix}
182-
$$
183-
184-
where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}$ is a vector of observations on variables $\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}$.
185-
186-
In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.
187-
188-
In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **random variables**.
189-
190-
The number of positive singular values equals the rank of matrix $X$.
191-
192-
Arrange the singular values in decreasing order.
193-
194-
Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ and collect them in a vector $\sigma_R$.
195-
196-
Set all other entries of $\Sigma$ to zero.
197-
198-
## Relationship of PCA to SVD
199-
200-
To relate an SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
201-
202-
$$
203-
X = U \Sigma V^T = \sigma_1 U_1 V_1^T + \sigma_2 U_2 V_2^T + \cdots + \sigma_r U_r V_r^T
204-
$$ (eq:PCA1)
205-
206-
where
207-
208-
$$
209-
U=\begin{bmatrix}U_1|U_2|\ldots|U_m\end{bmatrix}
210-
$$
211-
212-
$$
213-
V^T = \begin{bmatrix}V_1^T\\V_2^T\\\vdots\\V_n^T\end{bmatrix}
214-
$$
215-
216-
In equation {eq}`eq:PCA1`, each of the $m \times n$ matrices $U_{j}V_{j}^T$ is evidently
217-
of rank $1$.
218-
219-
Thus, we have
220-
221-
$$
222-
X = \sigma_1 \begin{pmatrix}U_{11}V_{1}^T\\U_{21}V_{1}^T\\\cdots\\U_{m1}V_{1}^T\\\end{pmatrix} + \sigma_2\begin{pmatrix}U_{12}V_{2}^T\\U_{22}V_{2}^T\\\cdots\\U_{m2}V_{2}^T\\\end{pmatrix}+\ldots + \sigma_r\begin{pmatrix}U_{1r}V_{r}^T\\U_{2r}V_{r}^T\\\cdots\\U_{mr}V_{r}^T\\\end{pmatrix}
223-
$$ (eq:PCA2)
224-
225-
Here is how we would interpret the objects in the matrix equation {eq}`eq:PCA2` in
226-
a time series context:
227-
228-
* $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th **principal component**
229-
230-
* $U_k = \begin{bmatrix}U_{1k}\\U_{2k}\\\vdots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m$
231-
is a vector of **loadings** of variables $X_i$ on the $k$th principal component, $i=1, \ldots, m$
232-
233-
* $\sigma_k $ for each $k=1, \ldots, r$ is the strength of the $k$th **principal component**
234-
235150
## Reduced Versus Full SVD
236151

237152
Earlier, we mentioned **full** and **reduced** SVD's.
@@ -276,13 +191,6 @@ rr = np.linalg.matrix_rank(X)
276191
print('rank of X =', rr)
277192
```
278193

279-
**Remark:** The cells above illustrate application of the `full_matrices=True` and `full_matrices=False` options.
280-
Using `full_matrices=False` returns a reduced singular value decomposition. This option implements
281-
an optimal reduced rank approximation of a matrix, in the sense of minimizing the Frobenius
282-
norm of the discrepancy between the approximating matrix and the matrix being approximated.
283-
Optimality in this sense is established in the celebrated Eckart–Young theorem. See <https://en.wikipedia.org/wiki/Low-rank_approximation>.
284-
285-
When we study Dynamic Mode Decompositions below, it will be important for us to remember the following properties of full and reduced SVD's in such tall-skinny cases.
286194

287195
**Properties:**
288196

@@ -299,14 +207,25 @@ print('UUT, UTU = '), UUT, UTU
299207

300208

301209
```{code-cell} ipython3
302-
UTUhat = Uhat.T@Uhat
303-
UUThat = Uhat@Uhat.T
304-
print('UUThat, UTUhat= '), UUThat, UTUhat
210+
UhatTUhat = Uhat.T@Uhat
211+
UhatUhatT = Uhat@Uhat.T
212+
print('UhatUhatT, UhatTUhat = ', UhatUhatT, UhatTUhat)
305213
```
306214

307215

308216

309217

218+
**Remark:** The cells above illustrate application of the `full_matrices=True` and `full_matrices=False` options.
219+
Using `full_matrices=False` returns a reduced singular value decomposition. This option implements
220+
an optimal reduced rank approximation of a matrix, in the sense of minimizing the Frobenius
221+
norm of the discrepancy between the approximating matrix and the matrix being approximated.
222+
Optimality in this sense is established in the celebrated Eckart–Young theorem. See <https://en.wikipedia.org/wiki/Low-rank_approximation>.
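Here is a minimal sketch (the example matrix and variable names are ours, not from the lecture) that illustrates the remark: truncating a reduced SVD at rank $r$ gives the best rank-$r$ approximation in the Frobenius norm, and the approximation error equals the square root of the sum of the squared discarded singular values.

```{code-cell} ipython3
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 2) @ np.random.randn(2, 6)   # a 5 x 6 matrix of rank 2

U, S, VT = np.linalg.svd(X, full_matrices=False)    # reduced SVD

r = 1                                               # keep only the largest singular value
X_r = U[:, :r] @ np.diag(S[:r]) @ VT[:r, :]         # rank-r truncation of X

# the Frobenius norm of the discrepancy equals the norm of the dropped singular values
print(np.linalg.norm(X - X_r, 'fro'), np.sqrt(np.sum(S[r:]**2)))
```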
223+
224+
When we study Dynamic Mode Decompositions below, it will be important for us to remember the following properties of full and reduced SVD's in such tall-skinny cases.
225+
226+
227+
228+
310229

311230
Let's do another exercise, but now we'll set $m = 2 < 5 = n $
312231

@@ -326,6 +245,85 @@ print('Uhat, Shat, Vhat = '), Uhat, Shat, Vhat
326245
rr = np.linalg.matrix_rank(X)
327246
print('rank X =', rr)
328247
```
248+
## Digression: Polar Decomposition
249+
250+
A singular value decomposition (SVD) is related to the **polar decomposition** of $X$
251+
252+
$$
253+
X = SQ
254+
$$
255+
256+
where
257+
258+
\begin{align*}
259+
S & = U\Sigma U^T \cr
260+
Q & = U V^T
261+
\end{align*}
262+
263+
and $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
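The following sketch (a square random matrix of our own choosing) verifies these relationships numerically: it builds $S$ and $Q$ from a full SVD and checks that $X = SQ$ and that $Q$ is orthogonal.

```{code-cell} ipython3
import numpy as np

np.random.seed(1)
X = np.random.randn(4, 4)                 # a square example matrix

U, Sig, VT = np.linalg.svd(X)             # full SVD: X = U diag(Sig) VT
S = U @ np.diag(Sig) @ U.T                # S = U Sigma U^T, symmetric
Q = U @ VT                                # Q = U V^T, orthogonal

print(np.allclose(X, S @ Q))              # X = S Q
print(np.allclose(Q @ Q.T, np.eye(4)))    # Q Q^T = I
```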
264+
265+
## Principal Components Analysis (PCA)
266+
267+
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.
268+
269+
The matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.
270+
271+
We regard $X$ as an $m \times n$ matrix of **data**:
272+
273+
$$
274+
X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_n\end{bmatrix}
275+
$$
276+
277+
where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}$ is a vector of observations on variables $\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}$.
278+
279+
In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.
280+
281+
In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **random variables**.
282+
283+
The number of positive singular values equals the rank of matrix $X$.
284+
285+
Arrange the singular values in decreasing order.
286+
287+
Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ and collect them in a vector $\sigma_R$.
288+
289+
Set all other entries of $\Sigma$ to zero.
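As a quick illustration (the example matrix is ours), the sketch below builds a short-and-fat data matrix of known rank and confirms that `np.linalg.svd` returns the singular values in decreasing order and that the number of positive singular values equals the rank.

```{code-cell} ipython3
import numpy as np

np.random.seed(2)
m, n = 3, 8                                          # many more observations than variables
X = np.random.randn(m, 2) @ np.random.randn(2, n)    # a data matrix of rank 2

sigma = np.linalg.svd(X, compute_uv=False)           # singular values, in decreasing order
r = np.sum(sigma > 1e-10)                            # number of positive singular values

print(sigma)
print(r, np.linalg.matrix_rank(X))                   # both equal the rank of X
```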
290+
291+
## Relationship of PCA to SVD
292+
293+
To relate an SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
294+
295+
$$
296+
X = U \Sigma V^T = \sigma_1 U_1 V_1^T + \sigma_2 U_2 V_2^T + \cdots + \sigma_r U_r V_r^T
297+
$$ (eq:PCA1)
298+
299+
where
300+
301+
$$
302+
U=\begin{bmatrix}U_1|U_2|\ldots|U_m\end{bmatrix}
303+
$$
304+
305+
$$
306+
V^T = \begin{bmatrix}V_1^T\\V_2^T\\\vdots\\V_n^T\end{bmatrix}
307+
$$
308+
309+
In equation {eq}`eq:PCA1`, each of the $m \times n$ matrices $U_{j}V_{j}^T$ is evidently
310+
of rank $1$.
311+
312+
Thus, we have
313+
314+
$$
315+
X = \sigma_1 \begin{pmatrix}U_{11}V_{1}^T\\U_{21}V_{1}^T\\\cdots\\U_{m1}V_{1}^T\\\end{pmatrix} + \sigma_2\begin{pmatrix}U_{12}V_{2}^T\\U_{22}V_{2}^T\\\cdots\\U_{m2}V_{2}^T\\\end{pmatrix}+\ldots + \sigma_r\begin{pmatrix}U_{1r}V_{r}^T\\U_{2r}V_{r}^T\\\cdots\\U_{mr}V_{r}^T\\\end{pmatrix}
316+
$$ (eq:PCA2)
317+
318+
Here is how we would interpret the objects in the matrix equation {eq}`eq:PCA2` in
319+
a time series context:
320+
321+
* $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th **principal component**
322+
323+
* $U_k = \begin{bmatrix}U_{1k}\\U_{2k}\\\vdots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m$
324+
is a vector of **loadings** of variables $X_i$ on the $k$th principal component, $i=1, \ldots, m$
325+
326+
* $\sigma_k $ for each $k=1, \ldots, r$ is the strength of the $k$th **principal component** (see the code sketch below)
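Here is a minimal sketch of this decomposition (the data matrix and tolerance are our own choices): it reconstructs $X$ from the rank-one terms $\sigma_k U_k V_k^T$ in {eq}`eq:PCA1` and pulls out the time series, loadings, and strength associated with the first principal component.

```{code-cell} ipython3
import numpy as np

np.random.seed(3)
m, n = 3, 10
X = np.random.randn(m, 2) @ np.random.randn(2, n)    # short and fat data matrix

U, sigma, VT = np.linalg.svd(X, full_matrices=False)
r = np.sum(sigma > 1e-10)                            # number of positive singular values

# X as a sum of rank-one matrices sigma_k U_k V_k^T, as in {eq}`eq:PCA1`
X_sum = sum(sigma[k] * np.outer(U[:, k], VT[k, :]) for k in range(r))
print(np.allclose(X, X_sum))

k = 0
print('time series of principal component', k, ':', VT[k, :])   # V_k^T
print('loadings:', U[:, k])                                      # how each variable loads on component k
print('strength:', sigma[k])                                     # sigma_k
```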
329327
330328
## PCA with Eigenvalues and Eigenvectors
331329
@@ -762,6 +760,14 @@ $$ (eq:hatAversion0)
762760
763761
This is the case that we are interested in here.
764762
763+
If we use formula {eq}`eq:hatAversion0` to calculate $\hat A X$ we find that
764+
765+
$$
766+
\hat A X = X'
767+
$$
768+
769+
so that the regression equation **fits perfectly**, the usual outcome in an **underdetermined least-squares** model.
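A small numerical check of this claim, with matrices of our own choosing: when $m$ exceeds the number of observations and $X$ has full column rank, $X^+ X = I$, so $\hat A X = X'$ exactly.

```{code-cell} ipython3
import numpy as np

np.random.seed(4)
m, n_tilde = 10, 4                          # many variables, few observations
X = np.random.randn(m, n_tilde)             # stacked X_t
Xprime = np.random.randn(m, n_tilde)        # stacked X_{t+1}

A_hat = Xprime @ np.linalg.pinv(X)          # hat A = X' X^+
print(np.allclose(A_hat @ X, Xprime))       # the regression fits perfectly
```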
770+
765771
766772
Thus, we want to fit equation {eq}`eq:VARfirstorder` in a situation in which we have a number $n$ of observations that is small relative to the number $m$ of
767773
variables that appear in the vector $X_t$.
@@ -781,14 +787,14 @@ $$
781787
\hat A = X' X^{+}
782788
$$ (eq:hatAform)
783789
784-
where the (possibly huge) $ \tilde n \times m $ matrix $ X^{+} = (X^T X)^{-1} X^T$ is again the pseudo-inverse of $ X $.
790+
where the (possibly huge) $ \tilde n \times m $ matrix $ X^{+} = (X^T X)^{-1} X^T$ is again a pseudo-inverse of $ X $.
785791
786-
For some situations that we are interested in, $X^T X $ can be close to singular, a situation that can lead some numerical algorithms to be error-prone.
792+
For some situations that we are interested in, $X^T X $ can be close to singular, a situation that can make some numerical algorithms error-prone.
787793
788-
To confront that situationa, we'll use efficient algorithms for computing and for constructing reduced rank approximations of $\hat A$ in formula {eq}`eq:hatAversion0`.
794+
To confront that possibility, we'll use efficient algorithms both for computing $\hat A$ in formula {eq}`eq:hatAversion0` and for constructing reduced rank approximations of it.
789795
790796
791-
The $ i $th row of $ \hat A $ is an $ m \times 1 $ vector of pseudo-regression coefficients of $ X_{i,t+1} $ on $ X_{j,t}, j = 1, \ldots, m $.
797+
The $ i $th row of $ \hat A $ is a $ 1 \times m $ vector of regression coefficients of $ X_{i,t+1} $ on $ X_{j,t}, j = 1, \ldots, m $.
792798
793799
An efficient way to compute the pseudo-inverse $X^+$ is to start with the (reduced) singular value decomposition
794800
@@ -800,7 +806,7 @@ $$ (eq:SVDDMD)
800806
801807
where $ U $ is $ m \times p $, $ \Sigma $ is a $ p \times p $ diagonal matrix, and $ V^T $ is a $ p \times \tilde n $ matrix.
802808
803-
Here $ p $ is the rank of $ X $, where necessarily $ p \leq \tilde n $ because we are in the case in which $m > > \tilde n$.
809+
Here $ p $ is the rank of $ X $, where necessarily $ p \leq \tilde n $ because we are in a situation in which $m >> \tilde n$.
804810
805811
806812
Since we are in the $m >> \tilde n$ case, we can use the singular value decomposition {eq}`eq:SVDDMD` efficiently to construct the pseudo-inverse $X^+$
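The sketch below (matrix sizes are our own choices) forms the pseudo-inverse from a reduced SVD as $X^+ = V \Sigma^{-1} U^T$, assuming $X$ has full column rank so that $\Sigma$ is invertible, and checks the result against `np.linalg.pinv`.

```{code-cell} ipython3
import numpy as np

np.random.seed(5)
m, n_tilde = 50, 5                                      # tall and skinny: m >> n_tilde
X = np.random.randn(m, n_tilde)
Xprime = np.random.randn(m, n_tilde)

U, Sig, VT = np.linalg.svd(X, full_matrices=False)      # reduced SVD: U is m x p
X_plus = VT.T @ np.diag(1 / Sig) @ U.T                  # X^+ = V Sigma^{-1} U^T

print(np.allclose(X_plus, np.linalg.pinv(X)))           # agrees with numpy's pseudo-inverse
A_hat = Xprime @ X_plus                                 # hat A = X' X^+ without forming X^T X
```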
@@ -847,8 +853,9 @@ Next, we describe some alternative __reduced order__ representations of our firs
847853
848854
## Representation 1
849855
856+
In constructing this representation, and also whenever we use it, we rely on a **full** SVD of $X$.
850857
851-
We use the $p$ columns of $U$, and thus the $p$ rows of $U^T$, to define a $p \times 1$ vector $\tilde X_t$ as follows
858+
We use the $p$ columns of $U$, and thus the $p$ rows of $U^T$, to define a $p \times 1$ vector $\tilde b_t$ as follows
852859
853860
854861
$$
@@ -863,7 +870,9 @@ $$ (eq:Xdecoder)
863870
864871
(Here we use the notation $b$ to remind ourselves that we are creating a **b**asis vector.)
865872
866-
Since $U U^T$ is an $m \times m$ identity matrix, it follows from equation {eq}`eq:tildeXdef2` that we can reconstruct $X_t$ from $\tilde b_t$ by using
873+
Since we are using a **full** SVD, $U U^T$ is an $m \times m$ identity matrix.
874+
875+
So it follows from equation {eq}`eq:tildeXdef2` that we can reconstruct $X_t$ from $\tilde b_t$ by using
867876
868877
869878
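A minimal numeric check of this encoding and decoding, assuming that {eq}`eq:tildeXdef2` defines $\tilde b_t = U^T X_t$ (the example matrix is ours): with a full SVD, $U U^T = I_m$, so $X_t = U \tilde b_t$ recovers the data exactly.

```{code-cell} ipython3
import numpy as np

np.random.seed(6)
m, n_tilde = 6, 3
X = np.random.randn(m, n_tilde)            # columns are X_1, ..., X_{n_tilde}

U, Sig, VT = np.linalg.svd(X)              # full SVD: U is m x m
b_tilde = U.T @ X                          # column t is b_tilde_t = U^T X_t
X_rebuilt = U @ b_tilde                    # decode: X_t = U b_tilde_t

print(np.allclose(U @ U.T, np.eye(m)))     # U U^T = I_m with a full SVD
print(np.allclose(X, X_rebuilt))
```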
@@ -910,8 +919,8 @@ This representation is the one originally proposed by {cite}`schmid2010`.
910919
It can be regarded as an intermediate step to a related and perhaps more useful representation 3.
911920
912921
922+
As with Representation 1, we continue to
913923
914-
To work it requires that we
915924
916925
* use all $p$ singular values of $X$
917926
* use a **full** SVD and **not** a reduced SVD
@@ -937,7 +946,7 @@ where $\Lambda$ is a diagonal matrix of eigenvalues and $W$ is a $p \times p$
937946
matrix whose columns are eigenvectors corresponding to rows (eigenvalues) in
938947
$\Lambda$.
939948
940-
Note that when $U U^T = I_{p \times p}$, as is true with a full SVD of X (but **not** true with a reduced SVD)
949+
Note that when $U U^T = I_{m \times m}$, as is true with a full SVD of $X$ (but **not** true with a reduced SVD)
941950
942951
$$
943952
\hat A = U \tilde A U^T = U W \Lambda W^{-1} U^T
@@ -1065,12 +1074,12 @@ We also have the following
10651074
{eq}`eq:Atilde0`, define it as the following $r \times r$ counterpart
10661075
10671076
$$
1068-
\tilde A = U^T \hat A U
1077+
\tilde A = \tilde U^T \hat A \tilde U
10691078
$$ (eq:Atilde10)
10701079
1071-
where in equation {eq}`eq:Atilde10` $U$ is now the $m \times r$ matrix consisting of the eigevectors of $X X^T$ corresponding to the $r$
1080+
where in equation {eq}`eq:Atilde10` $\tilde U$ is now the $m \times r$ matrix consisting of the eigenvectors of $X X^T$ corresponding to the $r$
10721081
largest singular values of $X$.
1073-
The conclusions of the proposition remain true with this altered definition of $U$. (**Beware:** We have **recycled** notation here by temporarily redefining $U$ as being just $r$ columns instead of $p$ columns as we have up to now.)
1082+
The conclusions of the proposition remain true when we replace $U$ by $\tilde U$.
10741083
10751084
10761085
Also see {cite}`DDSE_book` (p. 238)
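As a hedged sketch of this altered definition (the matrix sizes and the choice of $r$ are ours), we can take $\tilde U$ to be the first $r$ left singular vectors of $X$, which are eigenvectors of $X X^T$, and form the $r \times r$ matrix $\tilde A$ of {eq}`eq:Atilde10`.

```{code-cell} ipython3
import numpy as np

np.random.seed(7)
m, n_tilde, r = 8, 4, 2
X = np.random.randn(m, n_tilde)
Xprime = np.random.randn(m, n_tilde)

A_hat = Xprime @ np.linalg.pinv(X)           # the m x m least-squares estimate hat A
U, Sig, VT = np.linalg.svd(X, full_matrices=False)
U_tilde = U[:, :r]                           # eigenvectors of X X^T for the r largest singular values

A_tilde = U_tilde.T @ A_hat @ U_tilde        # the r x r matrix in {eq}`eq:Atilde10`
print(A_tilde.shape)
```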
@@ -1100,7 +1109,7 @@ X_t & = \Phi \check b_t
11001109
$$
11011110
11021111
1103-
There is a better way to compute the $r \times 1$ vector $\check b_t$
1112+
But there is a better way to compute the $r \times 1$ vector $\check b_t$.
11041113
11051114
In particular, the following argument from {cite}`DDSE_book` (page 240) provides a computationally efficient way
11061115
to compute $\check b_t$.
