lectures/svd_intro.md
Lines changed: 73 additions & 33 deletions
@@ -40,20 +40,22 @@ This lecture describes the singular value decomposition and two of its uses:
* dynamic mode decomposition (DMD)
- Each of these can be thought of as data-reduction methods that are designed to capture principal patterns in data by projecting data onto a limited set of factors.
+ Each of these can be thought of as data-reduction methods that are designed to capture salient patterns in data by projecting data onto a limited set of factors.
## The Setup
Let $X$ be an $m \times n$ matrix of rank $r$.
+ Necessarily, $r \leq \min(m,n)$.
In this lecture, we'll think of $X$ as a matrix of **data**.
* each column is an **individual** -- a time period or person, depending on the application
* each row is a **random variable** measuring an attribute of a time period or a person, depending on the application
- We'll be interested in two distinct cases
+ We'll be interested in two cases
* A **short and fat** case in which $m << n$, so that there are many more columns than rows.
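As a concrete illustration of the setup in the hunk above, here is a minimal `numpy` sketch of a **short and fat** data matrix; the sizes and the random data are placeholder assumptions, not taken from the lecture.

```python
import numpy as np

# A "short and fat" data matrix: m random variables observed n times, with m << n.
m, n = 3, 200
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))    # row = a random variable, column = an individual/observation

r = np.linalg.matrix_rank(X)
print(X.shape, r, r <= min(m, n))  # the rank r necessarily satisfies r <= min(m, n)
```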
@@ -64,7 +66,7 @@ We'll apply a **singular value decomposition** of $X$ in both situations.
In the first case in which there are many more observations $n$ than random variables $m$, we learn about the joint distribution of the random variables by taking averages across observations of functions of the observations.
- Here we'll look for **patterns** by using a **singular value decomosition** to do a **principal components analysis** (PCA).
+ Here we'll look for **patterns** by using a **singular value decomposition** to do a **principal components analysis** (PCA).
In the second case in which there are many more random variables $m$ than observations $n$, we'll proceed in a different way.
@@ -91,7 +93,7 @@ where
* $V$ is an $n \times n$ matrix whose columns are eigenvectors of $X^T X$
- * $\Sigma$ is an $m \times r$ matrix in which the first $r$ places on its main diagonal are positive numbers $\sigma_1, \sigma_2, \ldots, \sigma_r$ called **singular values**; remaining entries of $\Sigma$ are all zero
+ * $\Sigma$ is an $m \times n$ matrix in which the first $r$ places on its main diagonal are positive numbers $\sigma_1, \sigma_2, \ldots, \sigma_r$ called **singular values**; remaining entries of $\Sigma$ are all zero
* The $r$ singular values are square roots of the eigenvalues of the $m \times m$ matrix $X X^T$ and the $n \times n$ matrix $X^T X$
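A quick numerical check of the relationship stated in the hunk above, a sketch with an arbitrary random matrix: the singular values of $X$ are the square roots of the (nonzero) eigenvalues of $X X^T$ and of $X^T X$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
X = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(X)                            # full SVD: U is (m, m), Vt is (n, n)

eig_XXT = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]   # eigenvalues of X X^T, descending
eig_XTX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]   # eigenvalues of X^T X, descending

r = len(s)                                             # = min(m, n) = m for generic data
print(np.allclose(s**2, eig_XXT[:r]))                  # squared singular values match eigenvalues of X X^T
print(np.allclose(s**2, eig_XTX[:r]))                  # ...and the nonzero eigenvalues of X^T X
```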
@@ -104,13 +106,15 @@ The shapes of $U$, $\Sigma$, and $V$ are $\left(m, m\right)$, $\left(m, n\right)
Below, we shall assume these shapes.
- However, though we chose not to, there is an alternative shape convention that we could have used.
+ The above description corresponds to a standard shape convention often called a **full** SVD.
+ There is an alternative shape convention called **economy** or **reduced** SVD that we could have used, and will sometimes use below.
Thus, note that because we assume that $X$ has rank $r$, there are only $r$ nonzero singular values, where $r=\textrm{rank}(X)\leq\min\left(m, n\right)$.
Therefore, we could also write $U$, $\Sigma$, and $V$ as matrices with shapes $\left(m, r\right)$, $\left(r, r\right)$, $\left(r, n\right)$.
- Sometimes, we will choose the former one to be consistent with what is adopted by `numpy`.
+ Sometimes, we will choose the former convention.
At other times, we'll use the latter convention in which $\Sigma$ is an $r \times r$ diagonal matrix.
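A small `numpy` sketch of the two shape conventions discussed in the hunk above (the matrix is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 2
X = rng.standard_normal((m, n))

# Full SVD: U is (m, m), the singular values would sit in an (m, n) Sigma, V^T is (n, n)
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)                      # (5, 5) (2,) (2, 2)

# Economy/reduced SVD: U is (m, r), Sigma is (r, r), V^T is (r, n), here with r = min(m, n)
Uhat, shat, Vthat = np.linalg.svd(X, full_matrices=False)
print(Uhat.shape, np.diag(shat).shape, Vthat.shape)    # (5, 2) (2, 2) (2, 2)

print(np.allclose(Uhat @ np.diag(shat) @ Vthat, X))    # the reduced factors reproduce X
```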
@@ -133,7 +137,7 @@ where $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.
- The data matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later in this lecture.
+ The data matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.
We regard $X$ as an $m \times n$ matrix of **data**:
@@ -194,10 +198,22 @@ is a vector of loadings of variables $X_i$ on the $k$th principal component, $i
- We now turn to using the eigen decomposition of a sample covariance matrix to do PCA.
+ We now use an eigen decomposition of a sample covariance matrix to do PCA.
Let $X_{m \times n}$ be our $m \times n$ data matrix.
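A minimal sketch of PCA via an eigen decomposition of the sample covariance matrix, in the spirit of the hunk above; the random data and the de-meaning step are illustrative assumptions rather than the lecture's own code.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 500
X = rng.standard_normal((m, n))

X_demeaned = X - X.mean(axis=1, keepdims=True)   # subtract each variable's sample mean
Omega = X_demeaned @ X_demeaned.T / n            # m x m sample covariance matrix

eigvals, P = np.linalg.eigh(Omega)               # eigh returns ascending eigenvalues, orthonormal P
order = np.argsort(eigvals)[::-1]                # re-order so variances come out descending
eigvals, P = eigvals[order], P[:, order]

components = P.T @ X_demeaned                    # row k holds scores on the k-th principal component
print(eigvals)                                   # variance captured by each component
```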
@@ -311,9 +345,7 @@ provided that we set
Since there are several possible ways of computing $P$ and $U$ for a given data matrix $X$, depending on the algorithms used, we might have sign differences or different orderings of eigenvectors.
- We want a way that leads to the same $U$ and $P$.
- In the following, we accomplish this by
+ We resolve such ambiguities about $U$ and $P$ by
1. sorting eigenvalues and singular values in descending order
2. imposing positive diagonals on $P$ and $U$ and adjusting signs in $V^T$ accordingly
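The two normalization steps listed in the hunk above can be sketched as follows; the helper `flip_signs` is a hypothetical illustration, not a function from the lecture, and it assumes the reduced shapes $U$ of size $(m, r)$ and $V^T$ of size $(r, n)$.

```python
import numpy as np

def flip_signs(U, Vt):
    """Make the diagonal of U nonnegative, flipping matching rows of V^T so U @ diag(s) @ V^T is unchanged."""
    signs = np.sign(np.diag(U))
    signs[signs == 0] = 1                     # leave any zero diagonal entries alone
    return U * signs, Vt * signs[:, None]     # flip columns of U and the corresponding rows of V^T

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

order = np.argsort(s)[::-1]                   # step 1: descending order (np.linalg.svd already does this)
U, s, Vt = U[:, order], s[order], Vt[order, :]

U, Vt = flip_signs(U, Vt)                     # step 2: positive diagonal of U, signs pushed into V^T
print(np.allclose(U @ np.diag(s) @ Vt, X))    # the product is unchanged
print(np.diag(U) >= 0)
```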
@@ -528,7 +560,7 @@ def compare_pca_svd(da):
## Dynamic Mode Decomposition (DMD)
- We now turn to the case in which $m >> n$ so that there are many more random variables $m$ than observations $n$.
+ We now turn to the case in which $m >> n$, so that an $m \times n$ data matrix $\tilde X$ contains many more random variables $m$ than observations $n$.
This is the **tall and skinny** case associated with **Dynamic Mode Decomposition**.
@@ -545,7 +577,7 @@ where for $t = 1, \ldots, n$, the $m \times 1 $ vector $X_t$ is
- where $T$ denotes transposition and $X_{i,t}$ is an observation on variable $i$ at time $t$.
+ where $T$ again denotes complex transposition and $X_{i,t}$ is an observation on variable $i$ at time $t$.
From $\tilde X$, form two matrices
@@ -561,10 +593,12 @@ $$
Here $'$ does not denote matrix transposition but instead is part of the name of the matrix $X'$.
- In forming $X$ and $X'$, we have in each case dropped a column from $\tilde X$.
+ In forming $X$ and $X'$, we have in each case dropped a column from $\tilde X$: in the case of $X$ the last column, and in the case of $X'$ the first column.
Evidently, $X$ and $X'$ are both $m \times \tilde n$ matrices where $\tilde n = n - 1$.
+ We now let the rank of $X$ be $p \leq \min(m, \tilde n) = \tilde n$.
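In code, the two shifted matrices can be formed from $\tilde X$ by slicing off one column each; a minimal sketch with placeholder random data:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 50, 8                         # tall and skinny: many variables, few observations
X_tilde = rng.standard_normal((m, n))

X = X_tilde[:, :-1]                  # drop the last column:  X_1, ..., X_{n-1}
Xprime = X_tilde[:, 1:]              # drop the first column: X_2, ..., X_n

n_tilde = n - 1
p = np.linalg.matrix_rank(X)
print(X.shape, Xprime.shape, p <= min(m, n_tilde))   # both are (m, n_tilde); rank p <= n_tilde
```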
We start with a system consisting of $m$ least squares regressions of **everything** on one lagged value of **everything**:
$$
@@ -577,10 +611,12 @@ $$
A = X' X^{+}
$$
- and where the (huge) $\tilde n \times m$ matrix $X^{+}$ is the Moore-Penrose generalized inverse of $X$.
+ and where the (possibly huge) $\tilde n \times m$ matrix $X^{+}$ is the Moore-Penrose generalized inverse of $X$.
+ The $i$th row of $A$ is an $m \times 1$ vector of regression coefficients of $X_{i,t+1}$ on $X_{j,t}, j = 1, \ldots, m$.
- Think about the singular value decomposition
+ Think about the (reduced) singular value decomposition
$$
X = U \Sigma V^T
@@ -591,8 +627,8 @@ where $U$ is $m \times p$, $\Sigma$ is a $p \times p$ diagonal matrix, and $ V^
Here $p$ is the rank of $X$, where necessarily $p \leq \tilde n$.
- We could compute the generalized inverse $X^+$ by using
- as
+ We could construct the generalized inverse $X^+$ of $X$ by using
+ a singular value decomposition $X = U \Sigma V^T$ to compute
$$
X^{+} = V \Sigma^{-1} U^T
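A sketch checking this pseudoinverse formula numerically and using it to form the regression coefficient matrix $A = X' X^{+}$; the matrices below are random placeholders, not the lecture's data.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n_tilde = 40, 7
X = rng.standard_normal((m, n_tilde))
Xprime = rng.standard_normal((m, n_tilde))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # reduced SVD; here p = n_tilde generically
X_pinv = Vt.T @ np.diag(1 / s) @ U.T               # X^+ = V Sigma^{-1} U^T, an (n_tilde, m) matrix

print(np.allclose(X_pinv, np.linalg.pinv(X)))      # agrees with numpy's Moore-Penrose pseudoinverse
A = Xprime @ X_pinv                                # the (m, m) matrix of least squares coefficients
print(A.shape)
```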
@@ -604,12 +640,12 @@ The idea behind **dynamic mode decomposition** is to construct an approximation
* sidesteps computing the generalized inverse $X^{+}$
- * constructs an $m \times r$ matrix $\Phi$ that captures effects on all $m$ variables of $r << p$ dynamic modes that are associated with the $r$ largest singular values
+ * constructs an $m \times r$ matrix $\Phi$ that captures effects on all $m$ variables of $r << p$ **modes** that are associated with the $r$ largest singular values
* uses $\Phi$ and powers of $r$ eigenvalues to forecast *future* $X_t$'s
- The magic of **dynamic mode decomposition** is that we accomplish this without ever computing the regression coefficients $A = X' X^{+}$.
+ The beauty of **dynamic mode decomposition** is that we accomplish this without ever computing the regression coefficients $A = X' X^{+}$.
To construct a DMD, we deploy the following steps:
@@ -624,7 +660,7 @@ To construct a DMD, we deploy the following steps:
But we won't do that.
- We'll first compute the $r$ largest singular values of $X$.
+ We'll compute the $r$ largest singular values of $X$.
We'll form matrices $\tilde V, \tilde U$ corresponding to those $r$ singular values.
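A sketch of this truncation step with `numpy`, keeping only the $r$ largest singular values (the sizes and the choice $r = 3$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n_tilde = 60, 9
X = rng.standard_normal((m, n_tilde))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values arrive in descending order

r = 3                                              # keep the r largest singular values
U_tilde = U[:, :r]                                 # (m, r)
Sigma_tilde = np.diag(s[:r])                       # (r, r)
V_tilde = Vt[:r, :].T                              # (n_tilde, r)

print(U_tilde.shape, Sigma_tilde.shape, V_tilde.shape)
```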
@@ -645,12 +681,14 @@ To construct a DMD, we deploy the following steps:
\tilde X_{t+1} = \tilde A \tilde X_t
$$
- where an approximation to (i.e., a projection of) the original $m \times 1$ vector $X_t$ can be acquired from inverting
+ where an approximation $\check X_t$ to (i.e., a projection of) the original $m \times 1$ vector $X_t$ can be acquired from
$$
- X_t = \tilde U \tilde X_t
+ \check X_t = \tilde U \tilde X_t
$$
+ We'll provide a formula for $\tilde X_t$ soon.
From equations {eq}`eq:tildeA_1` and {eq}`eq:bigAformula` it follows that
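Putting the last few hunks together, here is a self-contained sketch of the reduced-order matrix and of projecting a full state vector down to $r$ coordinates and back; it assumes the standard DMD definition $\tilde A = \tilde U^T X' \tilde V \tilde \Sigma^{-1}$ and takes the orthogonal projection $\tilde X_t = \tilde U^T X_t$, both illustrative assumptions made ahead of the lecture's own formula for $\tilde X_t$.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 60, 10
X_tilde = rng.standard_normal((m, n))
X, Xprime = X_tilde[:, :-1], X_tilde[:, 1:]

r = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_tilde, Sigma_tilde, V_tilde = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

# Reduced-order transition matrix, an r x r object
A_tilde = U_tilde.T @ Xprime @ V_tilde @ np.linalg.inv(Sigma_tilde)

# Project X_t onto the span of U_tilde and reconstruct the m x 1 approximation X_check_t
X_t = X_tilde[:, 0]
X_t_small = U_tilde.T @ X_t            # r reduced coordinates (orthogonal projection coefficients)
X_check_t = U_tilde @ X_t_small        # approximation of X_t living in the span of U_tilde
print(A_tilde.shape, np.linalg.norm(X_t - X_check_t))
```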
@@ -683,12 +721,9 @@ To construct a DMD, we deploy the following steps:
We can construct an $r \times m$ matrix generalized inverse $\Phi^{+}$ of $\Phi$.
- * We interrupt the flow with a digression at this point
+ * It will be helpful below to notice that from formula {eq}`eq:Phiformula`, we have
- * notice that from formula {eq}`eq:Phiformula`, we have
- $$
+ $$
\begin{aligned}
A \Phi & = (X' \tilde V \tilde \Sigma^{-1} \tilde U^T) (X' \tilde V \tilde \Sigma^{-1} W) \cr
& = X' \tilde V \tilde \Sigma^{-1} \tilde A W \cr
@@ -702,17 +737,16 @@ To construct a DMD, we deploy the following steps:
- * Define an initial vector $b$ of dominant modes by
+ * Define an $r \times 1$ initial vector $b$ of dominant modes by
$$
b = \Phi^{+} X_1
$$ (eq:bphieqn)
- where evidently $b$ is an $r \times 1$ vector.
(Since it involves smaller matrices, formula {eq}`eq:beqnsmall` below is a computationally more efficient way to compute $b$)
- * Then define _projected data_ $\hat X_1$ by
+ * Then define _projected data_ $\tilde X_1$ by
$$
\tilde X_1 = \Phi b
@@ -777,6 +811,12 @@ $$
\check X_{t+j} = \Phi \Lambda^j \Phi^{+} X_t
$$
+ or
+ $$
+ \check X_{t+j} = \Phi \Lambda^j (W \Lambda)^{-1} \tilde X_t
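To close the loop on the DMD construction spelled out in these hunks, here is a self-contained sketch that builds the mode matrix, the amplitude vector $b = \Phi^{+} X_1$, and the forecast $\check X_{t+j} = \Phi \Lambda^j \Phi^{+} X_t$. The data are random placeholders, the variable names are illustrative, and $\Phi$ is built as $X' \tilde V \tilde \Sigma^{-1} W$, consistent with the $A \Phi$ calculation in an earlier hunk; this is a sketch, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 60, 10
X_tilde = rng.standard_normal((m, n))
X, Xprime = X_tilde[:, :-1], X_tilde[:, 1:]

r = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_tilde, Sigma_inv, V_tilde = U[:, :r], np.diag(1 / s[:r]), Vt[:r, :].T

A_tilde = U_tilde.T @ Xprime @ V_tilde @ Sigma_inv   # r x r reduced-order matrix
eigvals, W = np.linalg.eig(A_tilde)                  # A_tilde W = W Lambda (possibly complex)
Lam = np.diag(eigvals)

Phi = Xprime @ V_tilde @ Sigma_inv @ W               # m x r matrix of modes
Phi_pinv = np.linalg.pinv(Phi)                       # its r x m generalized inverse

b = Phi_pinv @ X_tilde[:, 0]                         # r x 1 vector of initial mode amplitudes

t, j = 0, 3                                          # forecast j steps ahead from time t
X_check = Phi @ np.linalg.matrix_power(Lam, j) @ Phi_pinv @ X_tilde[:, t]
print(Phi.shape, b.shape, X_check.shape)
```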
0 commit comments