0%

Linear Dimensionality Reduction


主要包括主成分分析(PCA)和经典相关分析(CCA)的推导

Outline:

  • Introduction
  • Principal Component Analysis
  • Canonical Variate and Correlation Analysis

7.1 Introduction

Target: find a projection that project data onto a lower-dimensional subspace without losing important information.

  • One possible way to accomplish this is through feature selection.
  • Another way is by creating a reduced set of transformations(linear or nonlinear) of the input variables, which is often referred to as feature extraction.

Principal Component Analysis(PCA) and Cononical Variate and Correlation Analysis(CVA/CCA) are linear methods that are most popular in use today. Both these mothods are eigenvalue-eigenvector problems and can be viewed as special cases of multivariate reduced-rank regression.

7.2 Principal Component Analysis

PCA was introduced as technique for deriving a reduced set of orthogonal linear projections of a single collection of correlated variables $X=(X_1,…,X_r)^T$, where the projections are oredered by decreasing variances. It’s also a method of decorrelating $X$.

Note: Variance is a second-order property of a random variable and is an important measurement of the amount of information in that variable.

How to make use of the principal component scores (in decreasing order):

  • The first few ones: revealing whether most of the data actually live on a linear subspace of $\mathbb{R}^r$ and identifying outliers, distributional peculiarities and clusters of points.
  • The last few ones: detect collinearity and outliers that pop up and alter the perceived dimensionality of the data (which has near-zero variance)

7.2.1 Population Principal Components

Assume that the ramdom $r$-vector $X=(X_1,…,X_r)^T$ has mean $\mu_X$ and $(r\times r)$ covariance matrix $\Sigma_{XX}$.

PCA: set of $r$ variables $X_r$ (unordered and correlated) $\rightarrow$ set of $t$ variables $\xi_t$(ordered and uncorrelated), where $t<r$

First t principal components:

where we minimize the loss of information due to replacement.(7.2.2,7.2.3)

We here interpreted “information” with the total variation of the original input variables:

From the spectral decomposition theorem, we can write

where the diagonal matrix $\Lambda$ has diagonal elements the eigenvalues $\left\{\lambda_j\right\}$ of $\Sigma_{XX}$, and the columns of $U$ are the eigenvectors of $\Sigma_{XX}$.

Thus, we obtain the total variation by

The $j$th coefficient vector $b_j=(b_{1j},…,b_{rj})^T$ is chosen so that:

  • The first $t$ linear projections $\xi_j,j=1,2,…,t$ of $X$ are ranked in importance through their variances $\left\{var(\xi_j)\right\}$ that $var(\xi_1)\geq var(\xi_2)\geq …\geq var(\xi_t)$
  • $\xi_j$ is uncorrelated with all $\xi_k,k<j$

There are two popular derivations of the set of principal components of $X$:

  • Least-Squares Optimality crieterion (analogous with regression)
  • Variance-maximizing technique

7.2.2 Least-Squares Optimality of PCA

Let

be a $(t\times r)$-matrix of weights ($t\leq r$), then the linear projections can be written as

where $\xi=(\xi_1,…,\xi_t)^T\in\mathbb{R}^{t\times n}$. Note that $X\in\mathbb{R}^{r\times n}$ is the observations matrix with variables as rows and individual observations as columns.

We want to find a $r$-vector $\mu$ and an ($r\times t$)-matrix $A$ such that

(用 $X$ 对 $\xi$ 回归,类似“压缩”后“还原”的思想,通过“还原”后与原本数据的误差来评估“压缩器”的好坏)

Here we use least-squares error criterion to find the optimum $\mu,A$:

recall that we have $\xi=BX$, we can write

When $t=1$, the least-squares problem is equivalent to

where $X=(X_1,…,X_r)^T,\mu=(\mu_1,…,\mu_r)^T,A=a_1=(a_{11},…,a_{r1})^T$, and $B=b_1^T$

For $t>1$, the criterion is similar to Weighted Sum-of-Squares Criterion with $Y\equiv X,s=r,\Gamma=I_r$. Hence, the minimum solution is obtained by reduced-rank regression

where $v_j=v_j(\Sigma_{XX})$ is the eigenvector associated with the $j$th largest eigenvalue of $\Sigma_{XX}$ (i.e.$\lambda_j$).

Thus, our best rank-$t$ approximation to the original $X$ is given by

where

is the reduced-rank regression coefficient matrix with rank $t$ for the principal components case.

The minumum of $E[(X-\mu-ABX)^T(X-\mu-ABX)]$ is given by $\sum_{j=t+1}^r \lambda_j$.

The first $t$ principal components of $X$ are given by

and the covariance between $\xi_i$ and $\xi_j$ is

7.2.3 PCA as a Variance-Maximization Technique

In the original derivation of principal components, the coefficient vectors

were chosen in a sequential manner so that:

  • $var(\xi_j)=b_j^T\Sigma_{XX}b_j$ are arranged in descending order subject to the normalizations $b_j^Tb_j=1,j=1,…,t$
  • $cov(\xi_i,\xi_j)=b_i^T\Sigma_{XX}b_j=0,i<j$

The first principal component $\xi_1$ is obtained by choosing the $r$ coefficients $b_1$ for the linear projection $\xi_1$ (subject to normalization). A unique choice of $\left\{\xi_j\right\}$ is obtained through the normalization constraint $b_j^Tb_j=1,j=1,…,t$

where $\lambda_1$ is a Lagrangian multiplier.

We require $b_1\neq{0}$, so $|\Sigma_{XX}-\lambda_1I_r|=0$.

Note that when $\lambda_1$ is the largest eigenvalue of $\Sigma_{XX}$, let $b_1$ be the relevant eigenvector. We have $\Sigma_{XX}b_1=\lambda_1b_1$ hence $|\Sigma_{XX}-\lambda_1I_r|=0$.

Why has to be the largest?

  • If not, $var(\xi_1)=b_1^T\Sigma_{XX}b_1=\lambda_1b_1^Tb_1=\lambda_1$ is not the largest case.

Similarly, the second principal component $\xi_2$ and coefficients $b_2$ are chosen. That is, maximize $var(\xi_2)=b_2^T\Sigma_{XX}b_2$ subject to $b_2^Tb_2=1$ and $b_1^Tb_2=0$

where $\lambda_2,\mu$ are Lagrangian multipliers.

Premultiplying $b_1^T$ then

Premultiplying $(\Sigma_{XX}-\lambda_1I_r)b_1=0$ by $b_2^T$ yields $b_2^T\Sigma_{XX}b_1=0$, whence $\mu=0$.

This means that $\lambda_2$ is the second largest eigenvalue of $\Sigma_{XX}$ and $b_2$ is the relevant eigenvector.

In this sequential manner, we obtain the remaining sets of coefficients for the principal components.

Matrix form:

7.2.4 Sample Principal Components

We estimate the principal components of $X$ using a random sample with observed values $\left\{x_i\right\}$

Let $x_{ci}=x_i-\bar{x}$ and $X_c=(x_{c1},…,x_{cn})$ we estimate $\Sigma_{XX}$ by

The ordered eigenvalues of $\hat{\Sigma}_{XX}$ are denoted by $\hat{\lambda}_1\geq\hat{\lambda}_2\geq…\geq\hat{\lambda}_r\geq 0$ and the relevent eigenvector $\hat{v}_j$.

Then we estimate $A^{(t)}$ and $B^{(t)}$ by

The best rank-$t$ reconstruction of $X$ is given by

where

is the reduced-rank regression coefficient matrix.

The $j$th sample PC score is given by

7.2.5 Selection of the number of components

Scree Plot: The sample eigenvalues from a PCA are ordered from largest to smallest. (Find the “elbow”)

PC Rank Trace: Obtain a useful estimate of the rank of the regression coefficient matrix $C$.

A plot of $\Delta\hat{\Sigma}_{\epsilon\epsilon}^{(t)}$ against $\Delta\hat{C}^{(t)}$ is the PC rank trace plot.(Find the “elbow”)

Kaiser’s Rule: Only those eigenvalues exceed 1 would be retained. (too controversial)

7.2.6 Invariance and Scaling

Note that the principal components are NOT invariant under rescalings of the initial variables. (sensitive to the units of measurement of the different input variables)

One possible way to solve this is to rescale the input data

which is equivalent to carrying out PCA using the correlation matrix (rather than the covariance matrix).

7.2.7 What can be gained from using PCA

  • decorrelate the original variables
  • carry out data compression
  • reconstruct the original input data using a reduced number of variables according to a least-squares criterion
  • identify potential clusters in the data.

PCA can be misleading when there are outliers in the data, or the linearity of PCA fits badly to the data.

7.3 Canonical Variate and Correlation Analysis

7.3.1 Canonical Variates and Canonical Correlations

Assume that

is a collection of $r+s$ variables partitioned into two disjoint subcollections, where $X$ and $Y$ are jointly distributed with mean vector and covariance matrix given by

CVA seeks to replace the two sets of correlated variables by $t$ pairs of new variables

where

The $j$th pairs of coefficient vectors are chosen so that

  • the pairs are ranked in importance through their correlations that $\rho_1\geq…\geq\rho_t$
  • $cov(\xi_j,\xi_k)=g_j^T\Sigma_{XX}g_k=0,k<j.$
  • $cov(\omega_j,\omega_k)=h_j^T\Sigma_{XX}h_k=0,k<j.$

The paris are known as the first $t$ pairs of canonical variates of $X$ and $Y$ and their correlations as the $t$ largest canonical correlations.

7.3.2 Least-Squares Optimality of CVA

Let the $(t\times r)$-matrix $G$ and the $(t\times s$-matrix $H$, with $1\leq t\leq\min(r,s)$ be such that $X$ and $Y$ are linearly projected

Consider the problem of finding $\nu,G,H$ so that

which minimize the $(t\times t)$-matrix

where we assume that the covariance matrix of $\omega$ is $\Sigma_{\omega\omega}=H\Sigma_{YY}H^T=I_t$

According to the Poincaré Separation Theorem, the cirterion is minimize when the coloumns of $U$ are the eigenvectors relevent to the first $t$ eigenvalues of $R$. (see Section 3.2.10)

Some conditions when the inequalities become equalities:

To summarize:

where $u_j$ is the eigenvector relevent to the $j$th largest eigenvalue $\lambda_j^2$ of

and $v_j$ is the eigenvector relevent to the $j$th largest eigenvalue $\lambda_j^2$ of

The $jth$ pair of canonical variates socres is given by

where

7.3.3 CVA as a Correlation-Maximization Technique

Consider the linear poojections

Assume that $E(X)=\mu_X=0,E(Y)=\mu_Y=0$ and $g^T\Sigma_{XX}g=1,h^T\Sigma_{YY}h=1$.

We try to maximize the correlation

The problem is equivalent to maximize

where $\lambda,\mu$ are Lagrangian multipliers.

First, premultiplying $g^T$ and $h^T$ respectively, we have

whence, the correlation between $\xi$ and $\omega$ satisfies

Hence, the original problem is equivalent to maximize the Lagrangrian multipliers.

Eigen-decomposition method:

Premultiplying $\Sigma_{XX}^{-1}$ and $\Sigma_{YY}^{-1}$ respectively

Rearrange the two equalities

It can be shown that $g$ and $h$ are respectively chosen by the eigenvectors of

To maximize $\lambda$, choose the eigenvectors relevant to the largest $t$ eigenvalues of $N,N^*$.

Least-square form:

Premultiplying $\Sigma_{YX}\Sigma_{XX}^{-1}$ to $\Sigma_{XY}h-\lambda\Sigma_{XX}g=0$ we have

Recall that $\Sigma_{YX}g-\mu\Sigma_{YY}h=0$, we have

For there to be a nontrival solution, the following determinant has to be zero

It can be shown that the determinant above is a polynomial in $\lambda^2$ of degree $s$, having $s$ real roots $\lambda_1^2\geq\lambda_2^2\geq…\geq\lambda_s^2\geq 0$, which are the ordered eigenvalues of

with relevant eigenvectors $v_1,…,v_s$.

The resultant choice of $g$ and $h$ are given by

Then the pairs are

and their correlation is

Then we choose the other projections in this sequential manner by adding constraint

Appendix A: Weighted Sum-of-Squares Criterion

#TODO