An explanation and application of Principal component analysis (PCA)¶
PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
- Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)
Definition and Derivation of Principal Components¶
Defining PCs¶
Suppose that $X$ is a vector of $p$ random variables, $X = (x_1,\dots,x_p)$, and that the variances of the $p$ random variables and the structure of the covariances or correlations between them are of interest. Unless $p$ is small, or the structure is very simple, it will often not be very helpful to simply look at the $p$ variances and all of the $\binom{p}{2} = \frac{p!}{2!(p-2)!} = p(p-1)/2$ correlations or covariances. An alternative approach is to look for a few $(\ll p)$ derived variables that preserve most of the information given by these variances and correlations or covariances.
Although PCA does not ignore covariances and correlations, it concentrates on variances. The first step is to look for a linear function $\alpha_1^T X$ of the elements of $X$ having maximum variance, where $\alpha_1$ is a vector of $p$ constants $\alpha_{11}, \alpha_{12}\dots,\alpha_{1p}$, and $T$ denotes transpose, so that
$$\alpha_1^T X=\alpha_{11} x_1 + \alpha_{12} x_2 +\cdots +\alpha_{1p} x_p =\sum_{j=1}^p\alpha_{1j} x_j $$
Next, look for a linear function $\alpha_2^T X$, uncorrelated with $\alpha_1^T X$, having maximum variance, and so on, so that at the $k$th stage a linear function $\alpha_k^T X$ is found that has maximum variance subject to being uncorrelated with $\alpha_1^T X, \alpha_2^T X,\dots, \alpha_{k-1}^T X$. The $k$th derived variable, $\alpha_k^T X$, is the $k$th PC. Up to $p$ PCs could be found, but it is hoped, in general, that most of the variation in $X$ will be accounted for by $m$ PCs, where $m < p$.
How to find them¶
Consider, for the moment, the case where the vector of random variables $X$ has a known covariance matrix $\Sigma$. This is the matrix whose $(i, j)$th element is the (known) covariance between the $i$th and $j$th elements of $X$ when $i \neq j$, and the variance of the $j$th element of $X$ when $i = j$. The more realistic case, where $\Sigma$ is unknown, follows by replacing $\Sigma$ by a sample covariance matrix $S$. It turns out that for $k = 1,2,··· ,p$, the $k$th PC is given by $z_k = \alpha_k^T X$ where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. Furthermore, if $\alpha_k$ is chosen to have unit length ($\alpha_k^T\alpha_k=1$), then $var(z_k) = \lambda_k$, where $var(z_k)$ denotes the variance of $z_k$.
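This characterization is easy to verify numerically. The sketch below uses synthetic data (the sample matrix `X` and the mixing matrix are made up for illustration) and checks that the variance of each derived variable $z_k$ equals the corresponding eigenvalue of the sample covariance matrix $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic sample: 500 observations of p = 3 variables
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])

S = np.cov(X.T)                        # sample covariance matrix, playing the role of Sigma
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: suited to symmetric matrices
order = np.argsort(eigvals)[::-1]      # sort eigenvalues largest-first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs                        # z_k = alpha_k^T X for each observation
# The sample variance of the k-th PC equals the k-th largest eigenvalue
print(np.allclose(np.var(Z, axis=0, ddof=1), eigvals))
```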
Derive the form of the PCs¶
Consider first $\alpha_1^TX$; the vector $\alpha_1$ maximizes $var[\alpha_1^TX] = \alpha_1^T\Sigma \alpha_1$. It is clear that, as it stands, the maximum will not be achieved for finite $\alpha_1$, so a normalization constraint must be imposed. The constraint used in the derivation is $\alpha_1^T\alpha_1=1$, that is, the sum of squares of elements of $\alpha_1$ equals 1 ($\sum_{i=1}^p \alpha_{1i}^2=1$). Other constraints, for example $\text{Max}_j |\alpha_j | = 1$, may be more useful in other circumstances, and can easily be substituted later on. However, the use of constraints other than $\alpha_1^T\alpha_1 = constant$ in the derivation leads to a more difficult optimization problem, and it will produce a set of derived variables different from the PCs.
A short review of the variance of a random vector¶
- Definition 1: The expectation $E[X]$ of a random vector $X$ is given by $E[X]=(E[x_1],E[x_2],\cdots, E[x_p])^T$.
- Definition 2: The variance-covariance matrix (or simply the covariance matrix) of a random vector $X$ is given by $Cov(X) = E[(X- E[X])(X- E[X])^T]$.
- Proposition 1: $Cov(X)=E[XX^T]-E[X]E[X]^T$
- Proposition 2: $$Cov(X)= \begin{bmatrix} Var(x_1) & Cov(x_1,x_2) & \cdots & Cov(x_1,x_p)\\ Cov(x_2,x_1) & Var(x_2) & \cdots & Cov(x_2,x_p)\\ \vdots&\vdots&\ddots&\vdots\\ Cov(x_p,x_1) & Cov(x_p,x_2) & \cdots & Var(x_p) \end{bmatrix} $$
Thus, $Cov(X)$ is a symmetric matrix, since $Cov(x_i,x_j)=Cov(x_j,x_i)$.
- Proposition 3: Let $\alpha \in \mathbb{R}^p$, then $$Var(\alpha^T X)= \alpha^T Cov(X)\alpha$$
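Proposition 3 can be checked on a sample: the sample variance of the projected data $\alpha^T X$ equals $\alpha^T S \alpha$ exactly, where $S$ is the sample covariance matrix. A minimal sketch, with made-up data and an arbitrary $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))           # synthetic sample, 4 variables
X[:, 1] += 0.5 * X[:, 0]                 # introduce some correlation

alpha = np.array([1.0, -2.0, 0.5, 3.0])  # arbitrary coefficient vector
S = np.cov(X.T)                          # sample estimate of Cov(X)

lhs = np.var(X @ alpha, ddof=1)          # Var(alpha^T X), estimated from the sample
rhs = alpha @ S @ alpha                  # alpha^T Cov(X) alpha
print(np.isclose(lhs, rhs))
```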
Maximizing $\alpha_1^T \Sigma \alpha_1$¶
Now, to maximize $\alpha_1^T \Sigma \alpha_1$ subject to $\alpha^T_1\alpha_1=1$, the standard approach is to use the technique of Lagrange multipliers. Maximize
$$\alpha^T_1\Sigma \alpha_1 - \lambda(\alpha_1^T \alpha_1 -1)$$where $\lambda$ is a Lagrange multiplier. Differentiation with respect to $\alpha_1$ gives
$$\Sigma \alpha_1 - \lambda \alpha_1 = 0$$, or $(\Sigma - \lambda I_p)\alpha_1=0$,
where $I_p$ is the ($p\times p $) identity matrix. Thus, $\lambda$ is an eigenvalue of $\Sigma$ and $\alpha_1$ is the corresponding eigenvector. To decide which of the p eigenvectors gives $\alpha_1^T X$ with maximum variance, note that the quantity to be maximized is
$\alpha_1^T\Sigma \alpha_1 = \alpha_1^T \lambda \alpha_1= \lambda \alpha_1^T \alpha_1 = \lambda$
so $\lambda$ must be as large as possible. Thus, $\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$, and $var(\alpha_1^T X) = \alpha_1^T \Sigma \alpha_1 = \lambda_1$, the largest eigenvalue.
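As a sanity check on this maximization, the top eigenvector can be compared against many random unit vectors; none should achieve a larger value of $\alpha^T S \alpha$. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
S = np.cov(X.T)

eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
alpha_1 = eigvecs[:, -1]                 # last column = eigenvector of the largest eigenvalue
lambda_1 = eigvals[-1]

# Draw many random unit vectors; none should beat the top eigenvector
trials = rng.normal(size=(10000, 5))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = max(a @ S @ a for a in trials)

print(best_random <= lambda_1 + 1e-12)   # alpha_1 attains the maximum variance
```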
In general, the $k$th PC of $X$ is $\alpha_k^TX$ and $var(\alpha_k^T X) = \lambda_k$, where $\lambda_k$ is the $k$th largest eigenvalue of $\Sigma$, and $\alpha_k$ is the corresponding eigenvector. This will now be proved for $k = 2$; the proof for $k \ge 3$ is slightly more complicated, but very similar.
Maximizing $\alpha_2^T \Sigma \alpha_2$¶
The second PC, $\alpha_2^T X$, maximizes $\alpha_2^T \Sigma \alpha_2$ subject to being uncorrelated with $\alpha_1^T X$, or equivalently subject to $cov[\alpha_1^T X,\alpha_2^T X] = 0$. But
$cov[\alpha_1^T X,\alpha_2^T X]=\alpha_1^T \Sigma \alpha_2 =\alpha_2^T \Sigma \alpha_1 = \alpha_2^T \lambda_1 \alpha_1 = \lambda_1 \alpha_2^T \alpha_1 =\lambda_1 \alpha_1^T \alpha_2 $
Thus, any of the equations
$\alpha_1^T \Sigma \alpha_2 =0$, $\alpha_2^T \Sigma \alpha_1=0$, $\alpha_1^T\alpha_2 = 0$, $\alpha_2^T\alpha_1 = 0$
could be used to specify zero correlation between $\alpha_1^TX$ and $\alpha_2^TX$. Choosing the last of these (an arbitrary choice), and noting that a normalization constraint is again necessary, the quantity to be maximized is
$\alpha_2^T \Sigma \alpha_2 - \lambda(\alpha_2^T\alpha_2-1)-\phi\alpha_2^T\alpha_1$,
where $\lambda$, $\phi$ are Lagrange multipliers. Differentiation with respect to $\alpha_2$ gives
$\Sigma \alpha_2 -\lambda \alpha_2 -\phi\alpha_1 = 0$
and multiplication of this equation on the left by $\alpha_1^T$ gives
$\alpha_1^T\Sigma \alpha_2 -\lambda \alpha_1^T \alpha_2 -\phi\alpha_1^T\alpha_1 = 0$
which, since the first two terms are zero and $\alpha_1^T \alpha_1=1$, reduces to $\phi= 0$. Therefore, $\Sigma \alpha_2 -\lambda \alpha_2 = 0$, or equivalently $(\Sigma -\lambda I_p )\alpha_2 = 0$, so $\lambda$ is once more an eigenvalue of $\Sigma$, and $\alpha_2$ the corresponding eigenvector.
Again, $\lambda = \alpha_2^T \Sigma \alpha_2$, so $\lambda$ is to be as large as possible. Assuming that $\Sigma$ does not have repeated eigenvalues, $\lambda$ cannot equal $\lambda_1$: if it did, $\alpha_2$ would have to equal $\alpha_1$, violating the constraint $\alpha_1^T \alpha_2= 0$. Hence $\lambda$ is the second largest eigenvalue of $\Sigma$, and $\alpha_2$ is the corresponding eigenvector.
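The two conclusions of this derivation, $cov(z_1, z_2) = 0$ and $var(z_2) = \lambda_2$, can be verified numerically on a sample covariance matrix (synthetic data again):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
X[:, 2] += X[:, 0]                        # correlate two of the coordinates

S = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(S)      # ascending eigenvalue order
a1, a2 = eigvecs[:, -1], eigvecs[:, -2]   # top two eigenvectors

z1, z2 = X @ a1, X @ a2
C = np.cov(np.stack([z1, z2]))            # 2x2 covariance of the first two PC scores
print(np.isclose(C[0, 1], 0, atol=1e-10)) # the two PCs are uncorrelated
print(np.isclose(C[1, 1], eigvals[-2]))   # var(z2) equals the second largest eigenvalue
```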
Now with $\alpha_k^T \Sigma \alpha_k$¶
As stated above, it can be shown that for the third, fourth, ..., $p$th PCs, the vectors of coefficients $\alpha_3, \alpha_4, \dots , \alpha_p$ are the eigenvectors of $\Sigma$ corresponding to $\lambda_3,\lambda_4,\dots,\lambda_p$, the third largest, fourth largest, ..., and smallest eigenvalues, respectively. Furthermore,
$var[\alpha_k^T X] = \lambda_k$ for $k = 1,2,...,p.$
Application of PCA to a data set¶
- How do you do a principal component analysis?
- Standardize the range of continuous initial variables
- Compute the covariance matrix to identify correlations
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Create a feature vector to decide which principal components to keep
- Recast the data along the principal components axes
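The five steps above can be sketched as a single NumPy function (a minimal illustration, not the notebook's actual pipeline; the helper name `pca` and the demo data are made up):

```python
import numpy as np

def pca(data, n_components):
    """The five steps above, in NumPy (data: samples x features)."""
    # 1. Standardize each variable
    scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardized variables
    cov = np.cov(scaled.T)
    # 3. Eigenvectors / eigenvalues (eigh: cov is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the eigenvectors with the largest eigenvalues (the feature vector)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Recast the data along the principal component axes
    return scaled @ components

rng = np.random.default_rng(4)
demo = rng.normal(size=(100, 6))
print(pca(demo, 2).shape)                 # two components kept: (100, 2)
```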
Exploration of the data¶
The data is in the file gym_crowdedness.csv; it records the number of people in a gym, sampled every 10 minutes over one year.
Import the libraries and load the CSV:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#from sklearn.preprocessing import StandardScaler
df = pd.read_csv('gym_crowdedness.csv')
df.head()
| number_people | date | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2015-08-14 17:00:11-07:00 | 61211 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 1 | 45 | 2015-08-14 17:20:14-07:00 | 62414 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 2 | 40 | 2015-08-14 17:30:15-07:00 | 63015 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 3 | 44 | 2015-08-14 17:40:16-07:00 | 63616 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
| 4 | 45 | 2015-08-14 17:50:17-07:00 | 64217 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
We take a general glance at the data:
df.dtypes
number_people             int64
date                     object
timestamp                 int64
day_of_week               int64
is_weekend                int64
is_holiday                int64
temperature             float64
is_start_of_semester      int64
is_during_semester        int64
month                     int64
hour                      int64
dtype: object
df.describe()
| number_people | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 |
| mean | 29.072543 | 45799.437958 | 2.982504 | 0.282870 | 0.002573 | 58.557108 | 0.078831 | 0.660218 | 7.439824 | 12.236460 |
| std | 22.689026 | 24211.275891 | 1.996825 | 0.450398 | 0.050660 | 6.316396 | 0.269476 | 0.473639 | 3.445069 | 6.717631 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.140000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 9.000000 | 26624.000000 | 1.000000 | 0.000000 | 0.000000 | 55.000000 | 0.000000 | 0.000000 | 5.000000 | 7.000000 |
| 50% | 28.000000 | 46522.500000 | 3.000000 | 0.000000 | 0.000000 | 58.340000 | 0.000000 | 1.000000 | 8.000000 | 12.000000 |
| 75% | 43.000000 | 66612.000000 | 5.000000 | 1.000000 | 0.000000 | 62.280000 | 0.000000 | 1.000000 | 10.000000 | 18.000000 |
| max | 145.000000 | 86399.000000 | 6.000000 | 1.000000 | 1.000000 | 87.170000 | 1.000000 | 1.000000 | 12.000000 | 23.000000 |
df.shape
(62184, 11)
Then we check whether any column contains null values:
df.isnull().any()
number_people           False
date                    False
timestamp               False
day_of_week             False
is_weekend              False
is_holiday              False
temperature             False
is_start_of_semester    False
is_during_semester      False
month                   False
hour                    False
dtype: bool
In this example we will drop the column 'date' for simplicity
gym = df.drop('date',axis = 1)
1. Standardize the range of continuous initial variables¶
mean = gym.mean()
standard_deviation = gym.std()
scaled = (gym - mean) / standard_deviation
scaled
| number_people | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.349396 | 0.636545 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
| 1 | 0.701989 | 0.686232 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
| 2 | 0.481619 | 0.711056 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
| 3 | 0.657915 | 0.735879 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
| 4 | 0.701989 | 0.760702 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 62179 | -0.267642 | 0.890022 | 1.010352 | 1.592215 | -0.05079 | 0.397836 | -0.292532 | 0.717386 | -1.288748 | 0.857972 |
| 62180 | -0.355791 | 0.915093 | 1.010352 | 1.592215 | -0.05079 | 0.397836 | -0.292532 | 0.717386 | -1.288748 | 0.857972 |
| 62181 | -0.179494 | 0.940081 | 1.010352 | 1.592215 | -0.05079 | -0.292431 | -0.292532 | 0.717386 | -1.288748 | 1.006834 |
| 62182 | -0.488013 | 0.965152 | 1.010352 | 1.592215 | -0.05079 | -0.292431 | -0.292532 | 0.717386 | -1.288748 | 1.006834 |
| 62183 | -0.267642 | 0.990099 | 1.010352 | 1.592215 | -0.05079 | -0.292431 | -0.292532 | 0.717386 | -1.288748 | 1.006834 |
62184 rows × 10 columns
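The commented-out StandardScaler import would perform the same standardization, with one caveat: sklearn's StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() uses the sample standard deviation (ddof=1), so the two results differ by a constant factor of $\sqrt{n/(n-1)}$ (negligible here with 62184 rows). A small sketch with a hypothetical frame standing in for gym (sklearn itself is not needed to see the effect):

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `gym`
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 20.0, 40.0]})

scaled_sample = (demo - demo.mean()) / demo.std()          # ddof=1, as in the text
scaled_pop    = (demo - demo.mean()) / demo.std(ddof=0)    # what StandardScaler computes

n = len(demo)
# The two standardizations differ only by a constant factor
print(np.allclose(scaled_pop, scaled_sample * np.sqrt(n / (n - 1))))
```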
2. Compute the covariance matrix to identify correlations¶
covariance_matrix = np.cov(scaled.T)
covariance_matrix
array([[ 1.00000000e+00, 5.50218370e-01, -1.62061859e-01,
-1.73957812e-01, -4.82493487e-02, 3.73327299e-01,
1.82682899e-01, 3.35350361e-01, -9.78535478e-02,
5.52049444e-01],
[ 5.50218370e-01, 1.00000000e+00, -1.79319084e-03,
-5.08807521e-04, 2.85073776e-03, 1.84849490e-01,
9.55090525e-03, 4.46758988e-02, -2.32210763e-02,
9.99077439e-01],
[-1.62061859e-01, -1.79319084e-03, 1.00000000e+00,
7.91338197e-01, -7.58620382e-02, 1.11687310e-02,
-1.17820251e-02, -4.82362857e-03, 1.55586861e-02,
-1.91427432e-03],
[-1.73957812e-01, -5.08807521e-04, 7.91338197e-01,
1.00000000e+00, -3.18988342e-02, 2.06733408e-02,
-1.66457755e-02, -3.61271915e-02, 8.46234643e-03,
-5.17288766e-04],
[-4.82493487e-02, 2.85073776e-03, -7.58620382e-02,
-3.18988342e-02, 1.00000000e+00, -8.85265918e-02,
-1.48579083e-02, -7.07984358e-02, -9.49422885e-02,
2.84316486e-03],
[ 3.73327299e-01, 1.84849490e-01, 1.11687310e-02,
2.06733408e-02, -8.85265918e-02, 1.00000000e+00,
9.32418634e-02, 1.52475895e-01, 6.31245806e-02,
1.85120732e-01],
[ 1.82682899e-01, 9.55090525e-03, -1.17820251e-02,
-1.66457755e-02, -1.48579083e-02, 9.32418634e-02,
1.00000000e+00, 2.09862098e-01, -1.37159611e-01,
1.00907232e-02],
[ 3.35350361e-01, 4.46758988e-02, -4.82362857e-03,
-3.61271915e-02, -7.07984358e-02, 1.52475895e-01,
2.09862098e-01, 1.00000000e+00, 9.65556768e-02,
4.55808573e-02],
[-9.78535478e-02, -2.32210763e-02, 1.55586861e-02,
8.46234643e-03, -9.49422885e-02, 6.31245806e-02,
-1.37159611e-01, 9.65556768e-02, 1.00000000e+00,
-2.36235024e-02],
[ 5.52049444e-01, 9.99077439e-01, -1.91427432e-03,
-5.17288766e-04, 2.84316486e-03, 1.85120732e-01,
1.00907232e-02, 4.55808573e-02, -2.36235024e-02,
1.00000000e+00]])
plt.figure(figsize =(10,10))
sns.set(font_scale = 1.5)
sns.heatmap(covariance_matrix,
cbar = True,
annot = True,
square = True,
fmt = '.2f',
annot_kws={'size':12}
)
- If positive: the two variables increase or decrease together (correlated)
- If negative: one increases when the other decreases (inversely correlated)
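Since the variables were standardized first, the matrix above is also the correlation matrix of the original variables: its diagonal is 1 and each off-diagonal entry is a Pearson correlation. A quick check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=(200, 3))
data[:, 1] += data[:, 0]                  # introduce some correlation

scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
cov_scaled = np.cov(scaled.T)             # covariance of the standardized variables
corr_raw = np.corrcoef(data.T)            # correlation of the original variables

print(np.allclose(cov_scaled, corr_raw))  # the two matrices coincide
```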
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components¶
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
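One caveat worth making explicit: unlike np.linalg.eigh, np.linalg.eig neither exploits the symmetry of the covariance matrix nor returns the eigenvalues in any particular order, which is why the largest components in the bar chart of the next step need not come first. A sketch, on a synthetic stand-in for covariance_matrix, of sorting them largest-first:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(50, 10))
C = np.cov(((A - A.mean(0)) / A.std(0, ddof=1)).T)   # synthetic 10x10 covariance matrix

eigen_values, eigen_vectors = np.linalg.eig(C)
eigen_values = eigen_values.real                     # symmetric matrix: eigenvalues are real
order = np.argsort(eigen_values)[::-1]               # indices in descending eigenvalue order
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors.real[:, order]         # reorder the eigenvectors to match

print(np.all(np.diff(eigen_values) <= 0))            # now sorted largest-first
```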
4. Create a feature vector to decide which principal components to keep¶
variance_explained = []
for i in eigen_values:
variance_explained.append(i/sum(eigen_values)*100)
x = [i for i in range(len(variance_explained))]
plt.bar(x,variance_explained, width = 0.5, color = ['red', 'blue'])
plt.xticks(np.arange(10),('pca_1','pca_2', 'pca_3', 'pca_4','pca_5','pca_6','pca_7','pca_8','pca_9','pca_10'),rotation = 45)
plt.title('Variance by principal component')
plt.xlabel('Principal component')
plt.ylabel('Percentage of variance explained')
plt.show()
From the previous graphic we can select, for example, PCA_1, PCA_2, PCA_6, PCA_7 and PCA_10; together they capture 78.79% of the variance.
variance_captured = variance_explained[0]+variance_explained[1]+variance_explained[5]+variance_explained[6]+variance_explained[9]
variance_captured
78.7952503836204
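With the eigenvalues sorted, the usual way to decide how many components to keep is the cumulative percentage of variance explained. A sketch with hypothetical eigenvalues (not the ones computed from the gym data):

```python
import numpy as np

# Hypothetical eigenvalues, in the unsorted order np.linalg.eig might return them
eigen_values = np.array([3.1, 2.0, 0.2, 0.3, 0.1, 1.4, 1.1, 0.4, 0.5, 0.9])

explained = eigen_values / eigen_values.sum() * 100   # percentage per component
order = np.argsort(explained)[::-1]                   # largest-first
cumulative = np.cumsum(explained[order])

k = np.searchsorted(cumulative, 80) + 1   # smallest k capturing at least 80% of variance
print(k, round(cumulative[k - 1], 2))
```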
Now we build the feature vector
feature_vector = [eigen_vectors.T[0],eigen_vectors.T[1],eigen_vectors.T[5],eigen_vectors.T[6],eigen_vectors.T[9]]
feature_vector=np.array(feature_vector)
5. Recast the data along the principal components axes¶
scaled.shape
(62184, 10)
feature_vector.T.shape
(10, 5)
gym_recast = np.dot(scaled,feature_vector.T)
gym_recast = pd.DataFrame(gym_recast)
sns.pairplot(gym_recast)
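A useful sanity check on this last step (sketched on synthetic data, since it depends only on the algebra, not on the csv): the recast columns should be mutually uncorrelated, which is exactly the defining property of the PCs derived earlier.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=(500, 10))
scaled = (data - data.mean(0)) / data.std(0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(np.cov(scaled.T))
order = np.argsort(eigvals)[::-1][:5]
feature_vector = eigvecs[:, order].T               # 5 x 10, rows are kept components

recast = scaled @ feature_vector.T                 # same recasting step as above
C = np.cov(recast.T)
off_diag = C - np.diag(np.diag(C))
print(np.allclose(off_diag, 0, atol=1e-10))        # PC scores are uncorrelated
```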