lectures/svd_intro.md: 65 additions & 54 deletions
@@ -32,15 +32,15 @@ import pandas as pd
## Overview

The **singular value decomposition** is a work-horse in applications of least squares projection that
- form the backbone of important parts of modern machine learning methods.
+ form a foundation for some important machine learning methods.

This lecture describes the singular value decomposition and two of its uses:

* principal components analysis (PCA)

* dynamic mode decomposition (DMD)

- Each of these can be thought of as data-reduction methods that are designed to capture salient patterns in data by projecting data onto a limited set of factors.
+ Each of these can be thought of as a data-reduction procedure designed to capture salient patterns by projecting data onto a limited set of factors.

## The Setup
@@ -64,7 +64,7 @@ We'll be interested in two cases
We'll apply a **singular value decomposition** of $X$ in both situations.

- In the first case in which there are many more observations $n$ than random variables $m$, we learn about the joint distribution of the random variables by taking averages across observations of functions of the observations.
+ In the first case in which there are many more observations $n$ than random variables $m$, we learn about a joint distribution by taking averages across observations of functions of the observations.

Here we'll look for **patterns** by using a **singular value decomposition** to do a **principal components analysis** (PCA).
@@ -102,42 +102,42 @@ $U_{ij}^T$ is the complex conjugate of $U_{ji}$.
* Similarly, when $V$ is a complex valued matrix, $V^T$ denotes the **conjugate-transpose** or **Hermitian-transpose** of $V$

- The shapes of $U$, $\Sigma$, and $V$ are $\left(m, m\right)$, $\left(m, n\right)$, $\left(n, n\right)$, respectively.
+ In what is called a **full** SVD, the shapes of $U$, $\Sigma$, and $V$ are $\left(m, m\right)$, $\left(m, n\right)$, $\left(n, n\right)$, respectively.

- Below, we shall assume these shapes.

- The above description corresponds to a standard shape convention often called a **full** SVD.

- There is an alternative shape convention called **economy** or **reduced** SVD that we could have used, and will sometimes use below.
+ There is also an alternative shape convention called an **economy** or **reduced** SVD.

Thus, note that because we assume that $A$ has rank $r$, there are only $r$ nonzero singular values, where $r=\textrm{rank}(A)\leq\min\left(m, n\right)$.

- Therefore, we could also write $U$, $\Sigma$, and $V$ as matrices with shapes $\left(m, r\right)$, $\left(r, r\right)$, $\left(r, n\right)$.
+ A **reduced** SVD uses this fact to express $U$, $\Sigma$, and $V$ as matrices with shapes $\left(m, r\right)$, $\left(r, r\right)$, $\left(r, n\right)$.

- Sometimes, we will choose the former convention.
+ Sometimes, we will use a full SVD.

- At other times, we'll use the latter convention in which $\Sigma$ is an $r \times r$ diagonal matrix.
- Also, when we discuss the **dynamic mode decomposition** below, we'll use a special case of the latter convention in which it is understood that
- $r$ is just a pre-specified small number of leading singular values that we think capture the most interesting dynamics.
+ At other times, we'll use a reduced SVD in which $\Sigma$ is an $r \times r$ diagonal matrix.
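To make the two shape conventions concrete, here is a minimal sketch (the matrix and its dimensions are invented for this illustration) comparing what `numpy.linalg.svd` returns with `full_matrices=True` (full SVD) and `full_matrices=False` (reduced SVD):

```python
import numpy as np

# an invented 5 x 2 matrix, so m = 5 and n = 2
X = np.random.randn(5, 2)

# full SVD: U is (m, m), the singular values form a min(m, n) vector, V^T is (n, n)
U, s, VT = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, VT.shape)            # (5, 5) (2,) (2, 2)

# reduced (economy) SVD: U is (m, r) and V^T is (r, n) with r = min(m, n) here
Uhat, shat, VThat = np.linalg.svd(X, full_matrices=False)
print(Uhat.shape, shat.shape, VThat.shape)   # (5, 2) (2,) (2, 2)

# either version reassembles X
assert np.allclose(Uhat @ np.diag(shat) @ VThat, X)
```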
## Digression: Polar Decomposition
- Through the following identities, the singular value decomposition (SVD) is related to the **polar decomposition** of $X$
+ A singular value decomposition (SVD) is related to the **polar decomposition** of $X$
+
+ $$
+ X = SQ
+ $$
+
+ where

\begin{align*}
- X & = SQ \cr
- S & = U\Sigma U^T \cr
+ S & = U\Sigma U^T \cr
Q & = U V^T
\end{align*}

- where $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
+ and $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
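As a quick numerical check of these identities, here is a small sketch with an invented matrix; it uses a reduced SVD so that $U^T U = I$:

```python
import numpy as np

X = np.random.randn(4, 3)                         # invented 4 x 3 matrix

U, s, VT = np.linalg.svd(X, full_matrices=False)  # reduced SVD, so U^T U = I

S = U @ np.diag(s) @ U.T                          # symmetric factor S = U Σ U^T
Q = U @ VT                                        # factor Q = U V^T

assert np.allclose(S @ Q, X)                      # X = S Q
assert np.allclose(S, S.T)                        # S is symmetric
assert np.allclose(Q.T @ Q, np.eye(3))            # columns of Q are orthonormal
```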
## Principal Components Analysis (PCA)
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.

- The data matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m > > n $ to be discussed later.
+ The matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.

We regard $X$ as an $m \times n$ matrix of **data**:
@@ -151,7 +151,7 @@ In a **time series** setting, we would think of columns $j$ as indexing differen
In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **random variables**.

- The number of singular values equals the rank of matrix $X$.
+ The number of positive singular values equals the rank of matrix $X$.

Arrange the singular values in decreasing order.
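A minimal numerical illustration of the claim that the number of positive singular values equals the rank, using an invented matrix built to have rank 2:

```python
import numpy as np

# an invented 4 x 6 matrix of rank 2
X = np.random.randn(4, 2) @ np.random.randn(2, 6)

s = np.linalg.svd(X, compute_uv=False)    # singular values, in decreasing order
print(np.sum(s > 1e-10))                  # number of positive singular values: 2
print(np.linalg.matrix_rank(X))           # rank of X: 2
```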
@@ -189,15 +189,21 @@ $$ (eq:PCA2)
Here is how we would interpret the objects in the matrix equation {eq}`eq:PCA2` in a time series context:

- * $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th principal component
+ * $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th **principal component**
To reconcile the preceding representation with the PCA that we obtained through the SVD above, we first note that $\epsilon_j^2=\lambda_j\equiv\sigma^2_j$.

- Now define $\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}$
+ Now define $\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}$,

which evidently implies that $\tilde{\epsilon}_j\tilde{\epsilon}_j^T=1$.

Since there are several possible ways of computing $P$ and $U$ for a given data matrix $X$, depending on the algorithms used, we might have sign differences or different orders between eigenvectors.

- We resolve such ambiguities about $U$ and $P$ by
+ We can resolve such ambiguities about $U$ and $P$ by

1. sorting eigenvalues and singular values in descending order
2. imposing positive diagonals on $P$ and $U$ and adjusting signs in $V^T$ accordingly
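Here is a minimal sketch of one way to implement these two steps; the function and variable names are illustrative, and it assumes a reduced SVD of a short-and-fat $X$, so that $U$ is $m \times m$ and $V^T$ is $m \times n$:

```python
import numpy as np

def normalize(eig_vals, P, U, VT):
    """Sort eigenvalues in descending order and impose positive diagonals on
    P and U, adjusting the rows of V^T so that U @ Σ @ V^T is unchanged."""
    # 1. sort eigenvalues and the corresponding eigenvectors in descending order
    order = np.argsort(eig_vals)[::-1]
    eig_vals, P = eig_vals[order], P[:, order]

    # 2. flip eigenvector signs so that the diagonal of P is positive
    P = P * np.where(np.diag(P) < 0, -1, 1)

    # flip columns of U the same way, compensating in the rows of V^T
    flips = np.where(np.diag(U) < 0, -1, 1)
    U, VT = U * flips, flips[:, None] * VT

    return eig_vals, P, U, VT
```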
@@ -562,7 +568,7 @@ def compare_pca_svd(da):
We turn to the case in which $m >> n$, where an $m \times n$ data matrix $\tilde X$ contains many more random variables $m$ than observations $n$.

- This is the **tall and skinny** case associated with **Dynamic Mode Decomposition**.
+ This **tall and skinny** case is associated with **Dynamic Mode Decomposition**.

You can read about Dynamic Mode Decomposition here {cite}`DMD_book`.
@@ -593,11 +599,11 @@ $$
Here $'$ does not denote matrix transposition but instead is part of the name of the matrix $X'$.

- In forming $X$ and $X'$, we have in each case dropped a column from $\tilde X$, in the case of $X$ the last column, and in the case of $X'$ the first column.
+ In forming $X$ and $X'$, we have in each case dropped a column from $\tilde X$, the last column in the case of $X$, and the first column in the case of $X'$.

Evidently, $X$ and $X'$ are both $m \times \tilde n$ matrices where $\tilde n = n - 1$.
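In code, forming $X$ and $X'$ from $\tilde X$ amounts to dropping one column from each end; here is a minimal sketch with an invented snapshot matrix:

```python
import numpy as np

m, n = 10, 6                       # invented sizes: many variables, few snapshots
X_tilde = np.random.randn(m, n)    # columns are the snapshots X_1, ..., X_n

X  = X_tilde[:, :-1]               # drop the last column:  X_1, ..., X_{n-1}
Xp = X_tilde[:, 1:]                # drop the first column: X_2, ..., X_n

print(X.shape, Xp.shape)           # both are (m, n - 1)
```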
- We denote the rank of $X$ as $p \neq \min(m, \tilde n) = \tilde n$.
+ We denote the rank of $X$ as $p \leq \min(m, \tilde n) = \tilde n$.

We start with a system consisting of $m$ least squares regressions of **everything** on one lagged value of **everything**:
@@ -624,11 +630,11 @@ Consider the (reduced) singular value decomposition
- where $U$ is $m \times p$, $\Sigma$ is a $p \times p$ diagonal matrix, and $ V^T$ is a $p \times \tilde n$ matrix.
+ where $U$ is $m \times p$, $\Sigma$ is a $p \times p$ diagonal matrix, and $V^T$ is a $p \times \tilde n$ matrix.

Here $p$ is the rank of $X$, where necessarily $p \leq \tilde n$.

- (We have described and illustrated a reduced singular value decomposition above, and compared it with a full singular value decomposition.)
+ (We described and illustrated a **reduced** singular value decomposition above, and compared it with a **full** singular value decomposition.)

We could construct the generalized inverse $X^+$ of $X$ by using
a singular value decomposition $X = U \Sigma V^T$ to compute
@@ -647,12 +653,11 @@ where $r < p$.
The idea behind **dynamic mode decomposition** is to construct this low rank approximation to $A$ that

- * sidesteps computing the generalized inverse $X^{+}$

* constructs an $m \times r$ matrix $\Phi$ that captures effects on all $m$ variables of $r \leq p$ **modes** that are associated with the $r$ largest eigenvalues of $A$

- * uses $\Phi$and powers of the $r$ largest eigenvalues of $A$ to forecast *future* $X_t$'s
+ * uses $\Phi$, the current value of $X_t$, and powers of the $r$ largest eigenvalues of $A$ to forecast *future* $X_{t+j}$'s
An important property of the DMD algorithm that we shall describe soon is that
@@ -680,21 +685,26 @@ $$
A = X' V \Sigma^{-1} U^T
$$ (eq:Aformbig)

- where $V$ is an $\tilde n \times p$ matrix, $\Sigma^{-1}$ is a $p \times p$ matrix, and $U$ is a $p \times m$ matrix,
- and where $U^T U = I_p$ and $V V^T = I_m $.
+ where $V$ is an $\tilde n \times p$ matrix, $\Sigma^{-1}$ is a $p \times p$ matrix, $U$ is an $m \times p$ matrix,
+ and $U^T U = I_p$ and $V^T V = I_p$.
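As a sanity check on formula {eq}`eq:Aformbig`, here is a small sketch with invented snapshot matrices that compares $X' V \Sigma^{-1} U^T$ with $X' X^{+}$ computed via `numpy.linalg.pinv`:

```python
import numpy as np

m, n_tilde = 10, 5                   # invented dimensions with m >> n_tilde
X  = np.random.randn(m, n_tilde)     # lagged snapshots
Xp = np.random.randn(m, n_tilde)     # led snapshots

# reduced SVD of X: U is (m, p), sigma holds the p singular values, V^T is (p, n_tilde)
U, sigma, VT = np.linalg.svd(X, full_matrices=False)

A_svd  = Xp @ VT.T @ np.diag(1 / sigma) @ U.T   # A = X' V Σ^{-1} U^T
A_pinv = Xp @ np.linalg.pinv(X)                 # A = X' X^+

assert np.allclose(A_svd, A_pinv)
```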
We use the $p$ columns of $U$, and thus the $p$ rows of $U^T$, to define a $p \times 1$ vector $\tilde X_t$ to be used in a lower-dimensional description of the evolution of the system:

$$
\tilde X_t = U^T X_t .
- $$
+ $$ (eq:tildeXdef2)

- Since $U^T U$ is a $p \times p$ identity matrix, we can recover $X_t$ from $\tilde X_t$ by using
+ Since $U^T U$ is a $p \times p$ identity matrix, it follows from equation {eq}`eq:tildeXdef2` that we can recover $X_t$ from $\tilde X_t$ by using

$$
X_t = U \tilde X_t .
- $$
+ $$ (eq:Xdecoder)
+
+ * Equation {eq}`eq:tildeXdef2` serves as an **encoder** that summarizes the $m \times 1$ vector $X_t$ by the $p \times 1$ vector $\tilde X_t$
+
+ * Equation {eq}`eq:Xdecoder` serves as a **decoder** that recovers the $m \times 1$ vector $X_t$ from the $p \times 1$ vector $\tilde X_t$
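Here is a minimal sketch of the encoder/decoder pair, using an invented snapshot matrix; recovery is exact because each column of $X$ lies in the column space of $U$:

```python
import numpy as np

m, n_tilde = 10, 5
X = np.random.randn(m, n_tilde)                  # invented snapshot matrix
U, sigma, VT = np.linalg.svd(X, full_matrices=False)

X_t = X[:, [0]]                                  # an m x 1 snapshot taken from the data

X_tilde_t = U.T @ X_t                            # encoder: p x 1 summary of X_t
X_t_back  = U @ X_tilde_t                        # decoder: back to an m x 1 vector

assert np.allclose(X_t_back, X_t)                # exact recovery
```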
The following $p \times p$ transition matrix governs the motion of $\tilde X_t$:
@@ -712,10 +722,10 @@ Notice that if we multiply both sides of {eq}`eq:xtildemotion` by $U$
we get

$$
- U \tilde X_t = U \tilde A \tilde X_t = U^T \tilde A U^T X_t
+ U \tilde X_{t+1} = U \tilde A \tilde X_t = U \tilde A U^T X_t
$$

- which gives
+ which by virtue of decoder equation {eq}`eq:Xdecoder` recovers

$$
X_{t+1} = A X_t .
@@ -728,9 +738,6 @@ $$
### Lower Rank Approximations
- An attractive feature of **dynamic mode decomposition** is that we avoid computing the huge matrix $A = X' X^{+}$ of regression coefficients, while with low computational effort, we possibly acquire a good low-rank approximation of $A$.

Instead of using formula {eq}`eq:Aformbig`, we'll compute the $r$ largest singular values of $X$ and form matrices $\tilde V, \tilde U$ corresponding to those $r$ singular values.
We'll then construct a reduced-order system of dimension $r$ by forming an $r \times r$ transition matrix
@@ -740,6 +747,8 @@ $$
\tilde A = \tilde U^T A \tilde U
$$ (eq:tildeA_1)

+ Here we use $\tilde U$ rather than the $U$ that we used earlier.

This redefined $\tilde A$ matrix governs the dynamics of a redefined $r \times 1$ vector $\tilde X_t$
according to
@@ -751,7 +760,7 @@ where an approximation $\check X_t$ to the original $m \times 1$ vector $X_t$
the columns of $\tilde U$:

$$
- \check X_t = \tilde U \tilde X_t
+ \check X_t = \tilde U \tilde X_t .
$$
We'll provide a formula for $\tilde X_t$ soon.
@@ -764,7 +773,7 @@ $$
$$ (eq:tildeAform)

- Next, we'll Construct an eigencomposition of $\tilde A$
+ Next, we'll construct an eigendecomposition of $\tilde A$ defined in equation {eq}`eq:tildeA_1`:

$$
\tilde A W = W \Lambda
@@ -793,17 +802,17 @@ We can construct an $r \times m$ matrix generalized inverse $\Phi^{+}$ of $\Ph
- We define an $ r \times 1$ initial vector $b$ of dominant modes by
+ We define an $r \times 1$ vector $b$ of the $r$ modes associated with the $r$ largest singular values:

$$
b= \Phi^{+} X_1
$$ (eq:bphieqn)
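Putting the steps together, here is a minimal sketch of the reduced-order DMD computations described above, using an invented snapshot matrix and an arbitrary choice of $r$; it assumes that {eq}`eq:Phiformula` defines the modes as $\Phi = X' \tilde V \tilde \Sigma^{-1} W$, consistent with the proposition below:

```python
import numpy as np

def dmd(X_tilde, r):
    """Sketch of the reduced-order DMD steps: rank-r SVD of X, the r x r
    matrix A_tilde, its eigendecomposition, the modes Phi, and amplitudes b."""
    X, Xp = X_tilde[:, :-1], X_tilde[:, 1:]

    # rank-r truncation of the reduced SVD of X
    U, sigma, VT = np.linalg.svd(X, full_matrices=False)
    U_t, sigma_t, V_t = U[:, :r], sigma[:r], VT[:r, :].T

    # r x r transition matrix: A_tilde = U_t' X' V_t Σ_t^{-1}
    A_tilde = U_t.T @ Xp @ V_t @ np.diag(1 / sigma_t)

    # eigendecomposition A_tilde W = W Λ and modes Φ = X' V_t Σ_t^{-1} W
    Lam, W = np.linalg.eig(A_tilde)
    Phi = Xp @ V_t @ np.diag(1 / sigma_t) @ W

    # amplitudes b = Φ^+ X_1
    b = np.linalg.pinv(Phi) @ X_tilde[:, 0]
    return Lam, Phi, b

# invented data: m = 20 variables, n = 8 snapshots, keep r = 3 modes
Lam, Phi, b = dmd(np.random.randn(20, 8), r=3)

# forecast: the DMD approximation to X_t is Φ Λ^{t-1} b
X_check_5 = (Phi * Lam**4) @ b
```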
- **Proof of Eigenvector Sharing**
+ **Proposition** The $r$ columns of $\Phi$ are eigenvectors of $A$ that correspond to the largest $r$ eigenvalues of $A$.

- From formula {eq}`eq:Phiformula` we have
+ **Proof:** From formula {eq}`eq:Phiformula` we have

$$
\begin{aligned}
@@ -831,6 +840,8 @@ $$
Thus, $\phi_i$ is an eigenvector of $A$ that corresponds to eigenvalue $\lambda_i$ of $A$.