Commit c9c8af9

Tom's April 3 edits of SVD lecture
1 parent 4471731 commit c9c8af9

File tree

1 file changed: +117 -108 lines changed


lectures/svd_intro.md

Lines changed: 117 additions & 108 deletions
@@ -147,91 +147,6 @@ When we study Dynamic Mode Decomposition below, we shall want to remember this c
147147

148148

149149

150-
151-
152-
153-
154-
155-
## Digression: Polar Decomposition
156-
157-
A singular value decomposition (SVD) is related to the **polar decomposition** of $X$
158-
159-
$$
160-
X = SQ
161-
$$
162-
163-
where
164-
165-
\begin{align*}
166-
S & = U\Sigma U^T \cr
167-
Q & = U V^T
168-
\end{align*}
169-
170-
and $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
171-
172-
## Principal Components Analysis (PCA)
173-
174-
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.
175-
176-
The matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.
177-
178-
We regard $X$ as an $m \times n$ matrix of **data**:
179-
180-
$$
181-
X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_n\end{bmatrix}
182-
$$
183-
184-
where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}$ is a vector of observations on variables $\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}$.
185-
186-
In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.
187-
188-
In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **random variables**.
189-
190-
The number of positive singular values equals the rank of matrix $X$.
191-
192-
Arrange the singular values in decreasing order.
193-
194-
Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ and collect them in a vector $\sigma_R$.
195-
196-
Set all other entries of $\Sigma$ to zero.
197-
198-
## Relationship of PCA to SVD
199-
200-
To relate an SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
201-
202-
$$
203-
X = U \Sigma V^T = \sigma_1 U_1 V_1^T + \sigma_2 U_2 V_2^T + \cdots + \sigma_r U_r V_r^T
204-
$$ (eq:PCA1)
205-
206-
where
207-
208-
$$
209-
U=\begin{bmatrix}U_1|U_2|\ldots|U_m\end{bmatrix}
210-
$$
211-
212-
$$
213-
V^T = \begin{bmatrix}V_1^T\\V_2^T\\\vdots\\V_n^T\end{bmatrix}
214-
$$
215-
216-
In equation {eq}`eq:PCA1`, each of the $m \times n$ matrices $U_{j}V_{j}^T$ is evidently
217-
of rank $1$.
218-
219-
Thus, we have
220-
221-
$$
222-
X = \sigma_1 \begin{pmatrix}U_{11}V_{1}^T\\U_{21}V_{1}^T\\\cdots\\U_{m1}V_{1}^T\\\end{pmatrix} + \sigma_2\begin{pmatrix}U_{12}V_{2}^T\\U_{22}V_{2}^T\\\cdots\\U_{m2}V_{2}^T\\\end{pmatrix}+\ldots + \sigma_r\begin{pmatrix}U_{1r}V_{r}^T\\U_{2r}V_{r}^T\\\cdots\\U_{mr}V_{r}^T\\\end{pmatrix}
223-
$$ (eq:PCA2)
224-
225-
Here is how we would interpret the objects in the matrix equation {eq}`eq:PCA2` in
226-
a time series context:
227-
228-
* $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th **principal component**
229-
230-
* $U_k = \begin{bmatrix}U_{1k}\\U_{2k}\\\vdots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m$
231-
is a vector of **loadings** of variables $X_i$ on the $k$th principal component, $i=1, \ldots, m$
232-
233-
* $\sigma_k $ for each $k=1, \ldots, r$ is the strength of the $k$th **principal component**
234-
235150
## Reduced Versus Full SVD
236151

237152
Earlier, we mentioned **full** and **reduced** SVD's.
@@ -276,13 +191,6 @@ rr = np.linalg.matrix_rank(X)
276191
print('rank of X =', rr)
277192
```
278193

279-
**Remark:** The cells above illustrate application of the `full_matrices=True` and `full_matrices=False` options.
280-
Using `full_matrices=False` returns a reduced singular value decomposition. This option implements
281-
an optimal reduced rank approximation of a matrix, in the sense of minimizing the Frobenius
282-
norm of the discrepancy between the approximating matrix and the matrix being approximated.
283-
Optimality in this sense is established in the celebrated Eckart–Young theorem. See <https://en.wikipedia.org/wiki/Low-rank_approximation>.
284-
285-
When we study Dynamic Mode Decompositions below, it will be important for us to remember the following properties of full and reduced SVD's in such tall-skinny cases.
286194

287195
**Properties:**
288196

@@ -299,14 +207,25 @@ print('UUT, UTU = '), UUT, UTU
299207

300208

301209
```{code-cell} ipython3
302-
UTUhat = Uhat.T@Uhat
303-
UUThat = Uhat@Uhat.T
304-
print('UUThat, UTUhat= '), UUThat, UTUhat
210+
UhatTUhat = Uhat.T@Uhat
211+
UhatUhatT = Uhat@Uhat.T
212+
print('UhatUhatT, UhatTUhat = ', UhatUhatT, UhatTUhat)
305213
```
306214

307215

308216

309217

218+
**Remark:** The cells above illustrate application of the `full_matrices=True` and `full_matrices=False` options.
219+
Using `full_matrices=False` returns a reduced singular value decomposition. This option implements
220+
an optimal reduced rank approximation of a matrix, in the sense of minimizing the Frobenius
221+
norm of the discrepancy between the approximating matrix and the matrix being approximated.
222+
Optimality in this sense is established in the celebrated Eckart–Young theorem. See <https://en.wikipedia.org/wiki/Low-rank_approximation>.
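Here is a minimal sketch (the example matrix and variable names are ours, not from the lecture) that illustrates the remark: truncating a reduced SVD at rank $r$ gives the best rank-$r$ approximation in the Frobenius norm, and the approximation error equals the square root of the sum of the squared discarded singular values.

```{code-cell} ipython3
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 2) @ np.random.randn(2, 6)   # a 5 x 6 matrix of rank 2

U, S, VT = np.linalg.svd(X, full_matrices=False)    # reduced SVD

r = 1                                               # keep only the largest singular value
X_r = U[:, :r] @ np.diag(S[:r]) @ VT[:r, :]         # rank-r truncation of X

# the Frobenius norm of the discrepancy equals the norm of the dropped singular values
print(np.linalg.norm(X - X_r, 'fro'), np.sqrt(np.sum(S[r:]**2)))
```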
223+
224+
When we study Dynamic Mode Decompositions below, it will be important for us to remember the following properties of full and reduced SVD's in such tall-skinny cases.
225+
226+
227+
228+
310229

311230
Let's do another exercise, but now we'll set $m = 2 < 5 = n $
312231

@@ -326,6 +245,85 @@ print('Uhat, Shat, Vhat = '), Uhat, Shat, Vhat
326245
rr = np.linalg.matrix_rank(X)
327246
print('rank X =', rr)
328247
```
248+
## Digression: Polar Decomposition
249+
250+
A singular value decomposition (SVD) is related to the **polar decomposition** of $X$
251+
252+
$$
253+
X = SQ
254+
$$
255+
256+
where
257+
258+
\begin{align*}
259+
S & = U\Sigma U^T \cr
260+
Q & = U V^T
261+
\end{align*}
262+
263+
and $S$ is evidently a symmetric matrix and $Q$ is an orthogonal matrix.
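The following sketch (a square random matrix of our own choosing) verifies these relationships numerically: it builds $S$ and $Q$ from a full SVD and checks that $X = SQ$ and that $Q$ is orthogonal.

```{code-cell} ipython3
import numpy as np

np.random.seed(1)
X = np.random.randn(4, 4)                 # a square example matrix

U, Sig, VT = np.linalg.svd(X)             # full SVD: X = U diag(Sig) VT
S = U @ np.diag(Sig) @ U.T                # S = U Sigma U^T, symmetric
Q = U @ VT                                # Q = U V^T, orthogonal

print(np.allclose(X, S @ Q))              # X = S Q
print(np.allclose(Q @ Q.T, np.eye(4)))    # Q Q^T = I
```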
264+
265+
## Principal Components Analysis (PCA)
266+
267+
Let's begin with a case in which $n >> m$, so that we have many more observations $n$ than random variables $m$.
268+
269+
The matrix $X$ is **short and fat** in an $n >> m$ case as opposed to a **tall and skinny** case with $m >> n$ to be discussed later.
270+
271+
We regard $X$ as an $m \times n$ matrix of **data**:
272+
273+
$$
274+
X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_n\end{bmatrix}
275+
$$
276+
277+
where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}$ is a vector of observations on variables $\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}$.
278+
279+
In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.
280+
281+
In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **random variables**.
282+
283+
The number of positive singular values equals the rank of matrix $X$.
284+
285+
Arrange the singular values in decreasing order.
286+
287+
Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ and collect them in a vector $\sigma_R$.
288+
289+
Set all other entries of $\Sigma$ to zero.
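As a quick illustration (the example matrix is ours), the sketch below builds a short-and-fat data matrix of known rank and confirms that `np.linalg.svd` returns the singular values in decreasing order and that the number of positive singular values equals the rank.

```{code-cell} ipython3
import numpy as np

np.random.seed(2)
m, n = 3, 8                                          # many more observations than variables
X = np.random.randn(m, 2) @ np.random.randn(2, n)    # a data matrix of rank 2

sigma = np.linalg.svd(X, compute_uv=False)           # singular values, in decreasing order
r = np.sum(sigma > 1e-10)                            # number of positive singular values

print(sigma)
print(r, np.linalg.matrix_rank(X))                   # both equal the rank of X
```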
290+
291+
## Relationship of PCA to SVD
292+
293+
To relate an SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
294+
295+
$$
296+
X = U \Sigma V^T = \sigma_1 U_1 V_1^T + \sigma_2 U_2 V_2^T + \cdots + \sigma_r U_r V_r^T
297+
$$ (eq:PCA1)
298+
299+
where
300+
301+
$$
302+
U=\begin{bmatrix}U_1|U_2|\ldots|U_m\end{bmatrix}
303+
$$
304+
305+
$$
306+
V^T = \begin{bmatrix}V_1^T\\V_2^T\\\vdots\\V_n^T\end{bmatrix}
307+
$$
308+
309+
In equation {eq}`eq:PCA1`, each of the $m \times n$ matrices $U_{j}V_{j}^T$ is evidently
310+
of rank $1$.
311+
312+
Thus, we have
313+
314+
$$
315+
X = \sigma_1 \begin{pmatrix}U_{11}V_{1}^T\\U_{21}V_{1}^T\\\cdots\\U_{m1}V_{1}^T\\\end{pmatrix} + \sigma_2\begin{pmatrix}U_{12}V_{2}^T\\U_{22}V_{2}^T\\\cdots\\U_{m2}V_{2}^T\\\end{pmatrix}+\ldots + \sigma_r\begin{pmatrix}U_{1r}V_{r}^T\\U_{2r}V_{r}^T\\\cdots\\U_{mr}V_{r}^T\\\end{pmatrix}
316+
$$ (eq:PCA2)
317+
318+
Here is how we would interpret the objects in the matrix equation {eq}`eq:PCA2` in
319+
a time series context:
320+
321+
* $ V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n $ is a time series $\lbrace V_{kj} \rbrace_{j=1}^n$ for the $k$th **principal component**
322+
323+
* $U_k = \begin{bmatrix}U_{1k}\\U_{2k}\\\vdots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m$
324+
is a vector of **loadings** of variables $X_i$ on the $k$th principal component, $i=1, \ldots, m$
325+
326+
* $\sigma_k $ for each $k=1, \ldots, r$ is the strength of the $k$th **principal component** (see the code sketch below)
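Here is a minimal sketch of this decomposition (the data matrix and tolerance are our own choices): it reconstructs $X$ from the rank-one terms $\sigma_k U_k V_k^T$ in {eq}`eq:PCA1` and pulls out the time series, loadings, and strength associated with the first principal component.

```{code-cell} ipython3
import numpy as np

np.random.seed(3)
m, n = 3, 10
X = np.random.randn(m, 2) @ np.random.randn(2, n)    # short and fat data matrix

U, sigma, VT = np.linalg.svd(X, full_matrices=False)
r = np.sum(sigma > 1e-10)                            # number of positive singular values

# X as a sum of rank-one matrices sigma_k U_k V_k^T, as in {eq}`eq:PCA1`
X_sum = sum(sigma[k] * np.outer(U[:, k], VT[k, :]) for k in range(r))
print(np.allclose(X, X_sum))

k = 0
print('time series of principal component', k, ':', VT[k, :])   # V_k^T
print('loadings:', U[:, k])                                      # how each variable loads on component k
print('strength:', sigma[k])                                     # sigma_k
```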
329327
330328
## PCA with Eigenvalues and Eigenvectors
331329
@@ -762,6 +760,14 @@ $$ (eq:hatAversion0)
762760
763761
This is the case that we are interested in here.
764762
763+
If we use formula {eq}`eq:hatAversion0` to calculate $\hat A X$ we find that
764+
765+
$$
766+
\hat A X = X'
767+
$$
768+
769+
so that the regression equation **fits perfectly**, the usual outcome in an **underdetermined least-squares** model.
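A small numerical check of this claim, with matrices of our own choosing: when $m$ exceeds the number of observations and $X$ has full column rank, $X^+ X = I$, so $\hat A X = X'$ exactly.

```{code-cell} ipython3
import numpy as np

np.random.seed(4)
m, n_tilde = 10, 4                          # many variables, few observations
X = np.random.randn(m, n_tilde)             # stacked X_t
Xprime = np.random.randn(m, n_tilde)        # stacked X_{t+1}

A_hat = Xprime @ np.linalg.pinv(X)          # hat A = X' X^+
print(np.allclose(A_hat @ X, Xprime))       # the regression fits perfectly
```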
770+
765771
766772
Thus, we want to fit equation {eq}`eq:VARfirstorder` in a situation in which we have a number $n$ of observations that is small relative to the number $m$ of
767773
variables that appear in the vector $X_t$.
@@ -781,14 +787,14 @@ $$
781787
\hat A = X' X^{+}
782788
$$ (eq:hatAform)
783789
784-
where the (possibly huge) $ \tilde n \times m $ matrix $ X^{+} = (X^T X)^{-1} X^T$ is again the pseudo-inverse of $ X $.
790+
where the (possibly huge) $ \tilde n \times m $ matrix $ X^{+} = (X^T X)^{-1} X^T$ is again a pseudo-inverse of $ X $.
785791
786-
For some situations that we are interested in, $X^T X $ can be close to singular, a situation that can lead some numerical algorithms to be error-prone.
792+
For some situations that we are interested in, $X^T X $ can be close to singular, a situation that can make some numerical algorithms error-prone.
787793
788-
To confront that situationa, we'll use efficient algorithms for computing and for constructing reduced rank approximations of $\hat A$ in formula {eq}`eq:hatAversion0`.
794+
To confront that possibility, we'll use efficient algorithms both for computing $\hat A$ in formula {eq}`eq:hatAversion0` and for constructing reduced rank approximations of it.
789795
790796
791-
The $ i $th row of $ \hat A $ is an $ m \times 1 $ vector of pseudo-regression coefficients of $ X_{i,t+1} $ on $ X_{j,t}, j = 1, \ldots, m $.
797+
The $ i $th row of $ \hat A $ is a $ 1 \times m $ vector of regression coefficients of $ X_{i,t+1} $ on $ X_{j,t}, j = 1, \ldots, m $.
792798
793799
An efficient way to compute the pseudo-inverse $X^+$ is to start with the (reduced) singular value decomposition
794800
@@ -800,7 +806,7 @@ $$ (eq:SVDDMD)
800806
801807
where $ U $ is $ m \times p $, $ \Sigma $ is a $ p \times p $ diagonal matrix, and $ V^T $ is a $ p \times \tilde n $ matrix.
802808
803-
Here $ p $ is the rank of $ X $, where necessarily $ p \leq \tilde n $ because we are in the case in which $m > > \tilde n$.
809+
Here $ p $ is the rank of $ X $, where necessarily $ p \leq \tilde n $ because we are in a situation in which $m >> \tilde n$.
804810
805811
806812
Since we are in the $m >> \tilde n$ case, we can use the singular value decomposition {eq}`eq:SVDDMD` efficiently to construct the pseudo-inverse $X^+$
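The sketch below (matrix sizes are our own choices) forms the pseudo-inverse from a reduced SVD as $X^+ = V \Sigma^{-1} U^T$, assuming $X$ has full column rank so that $\Sigma$ is invertible, and checks the result against `np.linalg.pinv`.

```{code-cell} ipython3
import numpy as np

np.random.seed(5)
m, n_tilde = 50, 5                                      # tall and skinny: m >> n_tilde
X = np.random.randn(m, n_tilde)
Xprime = np.random.randn(m, n_tilde)

U, Sig, VT = np.linalg.svd(X, full_matrices=False)      # reduced SVD: U is m x p
X_plus = VT.T @ np.diag(1 / Sig) @ U.T                  # X^+ = V Sigma^{-1} U^T

print(np.allclose(X_plus, np.linalg.pinv(X)))           # agrees with numpy's pseudo-inverse
A_hat = Xprime @ X_plus                                 # hat A = X' X^+ without forming X^T X
```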
@@ -847,8 +853,9 @@ Next, we describe some alternative __reduced order__ representations of our firs
847853
848854
## Representation 1
849855
856+
In constructing this representation, and also whenever we use it, we rely on a **full** SVD of $X$.
850857
851-
We use the $p$ columns of $U$, and thus the $p$ rows of $U^T$, to define a $p \times 1$ vector $\tilde X_t$ as follows
858+
We use the $p$ columns of $U$, and thus the $p$ rows of $U^T$, to define a $p \times 1$ vector $\tilde b_t$ as follows
852859
853860
854861
$$
@@ -863,7 +870,9 @@ $$ (eq:Xdecoder)
863870
864871
(Here we use the notation $b$ to remind ourselves that we are creating a **b**asis vector.)
865872
866-
Since $U U^T$ is an $m \times m$ identity matrix, it follows from equation {eq}`eq:tildeXdef2` that we can reconstruct $X_t$ from $\tilde b_t$ by using
873+
Since we are using a **full** SVD, $U U^T$ is an $m \times m$ identity matrix.
874+
875+
So it follows from equation {eq}`eq:tildeXdef2` that we can reconstruct $X_t$ from $\tilde b_t$ by using
867876
868877
869878
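A minimal numeric check of this encoding and decoding, assuming that {eq}`eq:tildeXdef2` defines $\tilde b_t = U^T X_t$ (the example matrix is ours): with a full SVD, $U U^T = I_m$, so $X_t = U \tilde b_t$ recovers the data exactly.

```{code-cell} ipython3
import numpy as np

np.random.seed(6)
m, n_tilde = 6, 3
X = np.random.randn(m, n_tilde)            # columns are X_1, ..., X_{n_tilde}

U, Sig, VT = np.linalg.svd(X)              # full SVD: U is m x m
b_tilde = U.T @ X                          # column t is b_tilde_t = U^T X_t
X_rebuilt = U @ b_tilde                    # decode: X_t = U b_tilde_t

print(np.allclose(U @ U.T, np.eye(m)))     # U U^T = I_m with a full SVD
print(np.allclose(X, X_rebuilt))
```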
@@ -910,8 +919,8 @@ This representation is the one originally proposed by {cite}`schmid2010`.
910919
It can be regarded as an intermediate step to a related and perhaps more useful representation 3.
911920
912921
922+
As with Representation 1, we continue to
913923
914-
To work it requires that we
915924
916925
* use all $p$ singular values of $X$
917926
* use a **full** SVD and **not** a reduced SVD
@@ -937,7 +946,7 @@ where $\Lambda$ is a diagonal matrix of eigenvalues and $W$ is a $p \times p$
937946
matrix whose columns are eigenvectors corresponding to rows (eigenvalues) in
938947
$\Lambda$.
939948
940-
Note that when $U U^T = I_{p \times p}$, as is true with a full SVD of X (but **not** true with a reduced SVD)
949+
Note that when $U U^T = I_{m \times m}$, as is true with a full SVD of $X$ (but **not** true with a reduced SVD)
941950
942951
$$
943952
\hat A = U \tilde A U^T = U W \Lambda W^{-1} U^T
@@ -1065,12 +1074,12 @@ We also have the following
10651074
{eq}`eq:Atilde0`, define it as the following $r \times r$ counterpart
10661075
10671076
$$
1068-
\tilde A = U^T \hat A U
1077+
\tilde A = \tilde U^T \hat A \tilde U
10691078
$$ (eq:Atilde10)
10701079
1071-
where in equation {eq}`eq:Atilde10` $U$ is now the $m \times r$ matrix consisting of the eigevectors of $X X^T$ corresponding to the $r$
1080+
where in equation {eq}`eq:Atilde10` $\tilde U$ is now the $m \times r$ matrix consisting of the eigenvectors of $X X^T$ corresponding to the $r$
10721081
largest singular values of $X$.
1073-
The conclusions of the proposition remain true with this altered definition of $U$. (**Beware:** We have **recycled** notation here by temporarily redefining $U$ as being just $r$ columns instead of $p$ columns as we have up to now.)
1082+
The conclusions of the proposition remain true when we replace $U$ by $\tilde U$.
10741083
10751084
10761085
Also see {cite}`DDSE_book` (p. 238)
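As a hedged sketch of this altered definition (the matrix sizes and the choice of $r$ are ours), we can take $\tilde U$ to be the first $r$ left singular vectors of $X$, which are eigenvectors of $X X^T$, and form the $r \times r$ matrix $\tilde A$ of {eq}`eq:Atilde10`.

```{code-cell} ipython3
import numpy as np

np.random.seed(7)
m, n_tilde, r = 8, 4, 2
X = np.random.randn(m, n_tilde)
Xprime = np.random.randn(m, n_tilde)

A_hat = Xprime @ np.linalg.pinv(X)           # the m x m least-squares estimate hat A
U, Sig, VT = np.linalg.svd(X, full_matrices=False)
U_tilde = U[:, :r]                           # eigenvectors of X X^T for the r largest singular values

A_tilde = U_tilde.T @ A_hat @ U_tilde        # the r x r matrix in {eq}`eq:Atilde10`
print(A_tilde.shape)
```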
@@ -1100,7 +1109,7 @@ X_t & = \Phi \check b_t
11001109
$$
11011110
11021111
1103-
There is a better way to compute the $r \times 1$ vector $\check b_t$
1112+
But there is a better way to compute the $r \times 1$ vector $\check b_t$.
11041113
11051114
In particular, the following argument from {cite}`DDSE_book` (page 240) provides a computationally efficient way
11061115
to compute $\check b_t$.
