An explanation and application of Principal component analysis (PCA)¶

PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.

  • Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)

Definition and Derivation of Principal Components¶

Defining PCs¶

Suppose that $ X $ is a vector of $p$ random variables, $$X= (x_1,\dots,x_p)$$, and that the variances of the $p$ random variables and the structure of the covariances or correlations between the $p$ variables are of interest. Unless $p$ is small, or the structure is very simple, it will often not be very helpful to simply look at the $p$ variances and all of the $\binom{p}{2} = \frac{p!}{2!(p-2)!}=p(p - 1)/2$ correlations or covariances. An alternative approach is to look for a few ($\ll p$) derived variables that preserve most of the information given by these variances and correlations or covariances.

Although PCA does not ignore covariances and correlations, it concentrates on variances. The first step is to look for a linear function $\alpha_1^T X$ of the elements of $X$ having maximum variance, where $\alpha_1$ is a vector of $p$ constants $\alpha_{11}, \alpha_{12}\dots,\alpha_{1p}$, and $T$ denotes transpose, so that

$$\alpha_1^T X=\alpha_{11} x_1 + \alpha_{12} x_2 +\cdots +\alpha_{1p} x_p =\sum_{j=1}^p\alpha_{1j} x_j $$

Next, look for a linear function $\alpha_2^T X$, uncorrelated with $\alpha_1^T X$ having maximum variance, and so on, so that at the $k$th stage a linear function $\alpha_k^T X$ is found that has maximum variance subject to being uncorrelated with $\alpha_1^T X, \alpha_2^T X,\dots, \alpha_{k-1}^T X$. The $k$th derived variable, $\alpha_k^T X$ is the $k$th PC. Up to $p$ PCs could be found, but it is hoped, in general, that most of the variation in $X$ will be accounted for by $m$ PCs, where $m <p$.

How to find them¶

Consider, for the moment, the case where the vector of random variables $X$ has a known covariance matrix $\Sigma$. This is the matrix whose $(i, j)$th element is the (known) covariance between the $i$th and $j$th elements of $X$ when $i \neq j$, and the variance of the $j$th element of $X$ when $i = j$. The more realistic case, where $\Sigma$ is unknown, follows by replacing $\Sigma$ by a sample covariance matrix $S$. It turns out that for $k = 1,2,··· ,p$, the $k$th PC is given by $z_k = \alpha_k^T X$ where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. Furthermore, if $\alpha_k$ is chosen to have unit length ($\alpha_k^T\alpha_k=1$), then $var(z_k) = \lambda_k$, where $var(z_k)$ denotes the variance of $z_k$.
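As a quick numerical check of this result, here is a minimal numpy sketch. The matrix $\Sigma$ below is a small made-up covariance matrix, not one from any real data; `np.linalg.eigh` is used because $\Sigma$ is symmetric.

```python
import numpy as np

# A small example covariance matrix Sigma (symmetric, positive semi-definite)
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh exploits symmetry and returns eigenvalues in ascending order,
# so reverse both outputs to get the largest eigenvalue first
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

# The k-th PC direction alpha_k is the k-th column; it has unit length,
# and its variance alpha_k^T Sigma alpha_k equals lambda_k
alpha_1 = eigenvectors[:, 0]
var_z1 = alpha_1 @ Sigma @ alpha_1
print(np.isclose(var_z1, eigenvalues[0]))   # True: var(z_1) = lambda_1
```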

Derive the form of the PCs¶

Consider first $\alpha_1^TX$; the vector $\alpha_1$ maximizes $var[\alpha_1^TX] = \alpha_1^T\Sigma \alpha_1$. It is clear that, as it stands, the maximum will not be achieved for finite $\alpha_1$, so a normalization constraint must be imposed. The constraint used in the derivation is $\alpha_1^T\alpha_1=1$, that is, the sum of squares of elements of $\alpha_1$ equals 1, ($\sum_{i=1}^p \alpha_{1i}^2=1$). Other constraints, for example $\max_j |\alpha_{1j}| = 1$, may be more useful in other circumstances, and can easily be substituted later on. However, the use of constraints other than $\alpha_1^T\alpha_1 = constant$ in the derivation leads to a more difficult optimization problem, and it will produce a set of derived variables different from the PCs.

A short review about variance of a random vector¶
  • Definition 1: The expectation $E[X]$ of a random vector $X$ is given by $E[X]=(E[x_1],E[x_2],\cdots, E[x_p])^T$.
  • Definition 2: The variance-covariance matrix (or simply the covariance matrix) of a random vector $X$ is given by $Cov (X) = E[(X- E[X])(X- E[X])^T]$.

  • Proposition 1: $Cov(X)=E[XX^T]-E[X]E[X]^T$
  • Proposition 2: $Cov(X)=$
$$ \begin{bmatrix} Var(x_1) & Cov(x_1,x_2) & \cdots & Cov(x_1,x_p)\\ Cov(x_2,x_1) & Var(x_2) & \cdots & Cov(x_2,x_p)\\ \vdots&\vdots&\ddots&\vdots\\ Cov(x_p,x_1) & Cov(x_p,x_2) & \cdots & Var(x_p) \end{bmatrix} $$

Thus, $Cov(X)$ is a symmetric matrix, since $Cov(x_i,x_j)=Cov(x_j,x_i)$.

  • Proposition 3: Let $\alpha \in \mathbb{R}^p$, then $$Var(\alpha^T X)= \alpha^T Cov(X)\alpha$$
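Proposition 3 is easy to verify numerically. The sketch below uses synthetic data; the array shapes, the injected correlation, and the vector $\alpha$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))    # 10,000 samples of a 3-dimensional vector
X[:, 1] += 0.5 * X[:, 0]           # introduce some correlation between columns

alpha = np.array([1.0, -2.0, 0.5])
cov = np.cov(X.T)                  # sample covariance matrix of X

# Var(alpha^T X) computed directly vs. via alpha^T Cov(X) alpha
direct = np.var(X @ alpha, ddof=1)
via_cov = alpha @ cov @ alpha
print(np.isclose(direct, via_cov))  # True
```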

Maximizing $\alpha_1^T \Sigma \alpha_1$¶

Now, to maximize $\alpha_1^T \Sigma \alpha_1$ subject to $\alpha^T_1\alpha_1=1$, the standard approach is to use the technique of Lagrange multipliers. Maximize

$$\alpha^T_1\Sigma \alpha_1 - \lambda(\alpha_1^T \alpha_1 -1)$$

where $\lambda$ is a Lagrange multiplier. Differentiation with respect to $\alpha_1$ gives

$$\Sigma \alpha_1 - \lambda \alpha_1 = 0, \quad \text{or} \quad (\Sigma - \lambda I_p)\alpha_1=0,$$

where $I_p$ is the ($p\times p$) identity matrix. Thus, $\lambda$ is an eigenvalue of $\Sigma$ and $\alpha_1$ is the corresponding eigenvector. To decide which of the $p$ eigenvectors gives $\alpha_1^T X$ with maximum variance, note that the quantity to be maximized is

$\alpha_1^T\Sigma \alpha_1 = \alpha_1^T \lambda \alpha_1= \lambda \alpha_1^T \alpha_1 = \lambda$

so $\lambda$ must be as large as possible. Thus, $\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$, and $var(\alpha_1^T X) = \alpha_1^T \Sigma \alpha_1 = \lambda_1$, the largest eigenvalue.

In general, the $k$th PC of $X$ is $\alpha_k^TX$ and $var(\alpha_k^T X) = \lambda_k$, where $\lambda_k$ is the $k$th largest eigenvalue of $\Sigma$, and $\alpha_k$ is the corresponding eigenvector. This will now be proved for $k = 2$; the proof for $k \geq 3$ is slightly more complicated, but very similar.

Maximizing $\alpha_2^T \Sigma \alpha_2$¶

The second PC, $\alpha_2^T X$, maximizes $\alpha_2^T \Sigma \alpha_2$ subject to being uncorrelated with $\alpha_1^T X$, or equivalently subject to $cov[\alpha_1^T X,\alpha_2^T X] = 0$. But

$cov[\alpha_1^T X,\alpha_2^T X]=\alpha_1^T \Sigma \alpha_2 =\alpha_2^T \Sigma \alpha_1 = \alpha_2^T \lambda_1 \alpha_1 = \lambda_1 \alpha_2^T \alpha_1 =\lambda_1 \alpha_1^T \alpha_2 $

Thus, any of the equations

$\alpha_1^T \Sigma \alpha_2 =0$, $\alpha_2^T \Sigma \alpha_1=0$, $\alpha_1^T\alpha_2 = 0$, $\alpha_2^T\alpha_1 = 0$

could be used to specify zero correlation between $\alpha_1^TX$ and $\alpha_2^TX$. Choosing the last of these (an arbitrary choice), and noting that a normalization constraint is again necessary, the quantity to be maximized is

$\alpha_2^T \Sigma \alpha_2 - \lambda(\alpha_2^T\alpha_2-1)-\phi\alpha_2^T\alpha_1$,

where $\lambda$, $\phi$ are Lagrange multipliers. Differentiation with respect to $\alpha_2$ gives

$\Sigma \alpha_2 -\lambda \alpha_2 -\phi\alpha_1 = 0$

and multiplication of this equation on the left by $\alpha_1^T$ gives

$\alpha_1^T\Sigma \alpha_2 -\lambda \alpha_1^T \alpha_2 -\phi\alpha_1^T\alpha_1 = 0$

which, since the first two terms are zero and $\alpha_1^T \alpha_1=1$, reduces to $\phi= 0$. Therefore, $\Sigma \alpha_2 -\lambda \alpha_2 = 0$, or equivalently $(\Sigma -\lambda I_p )\alpha_2 = 0$, so $\lambda$ is once more an eigenvalue of $\Sigma$, and $\alpha_2$ the corresponding eigenvector.

Again, $\lambda = \alpha_2^T \Sigma \alpha_2$, so $\lambda$ is to be as large as possible. Assuming that $\Sigma$ does not have repeated eigenvalues, $\lambda$ cannot equal $\lambda_1$: if it did, then $\alpha_2$ would equal $\alpha_1$, violating the constraint $\alpha_1^T \alpha_2= 0$. Hence $\lambda$ is the second largest eigenvalue of $\Sigma$, and $\alpha_2$ is the corresponding eigenvector.

Now with $\alpha_k^T \Sigma \alpha_k$¶

As stated above, it can be shown that for the third, fourth, ..., $p$th PCs, the vectors of coefficients $\alpha_3, \alpha_4, \dots, \alpha_p$ are the eigenvectors of $\Sigma$ corresponding to $\lambda_3,\lambda_4,\dots,\lambda_p$, the third largest, fourth largest, ..., and the smallest eigenvalue, respectively. Furthermore,

$var[\alpha_k^T X] = \lambda_k$ for $k = 1,2,...,p.$
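The full result can be checked numerically: projecting data onto all $p$ eigenvectors yields scores whose covariance matrix is diagonal, with the eigenvalues on the diagonal (the PCs are uncorrelated). A minimal sketch, using synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4)) @ rng.normal(size=(4, 4))  # correlated data

S = np.cov(X.T)
lam, A = np.linalg.eigh(S)         # eigh: ascending eigenvalues
lam, A = lam[::-1], A[:, ::-1]     # largest first

Z = X @ A                          # PC scores: z_k = alpha_k^T x
C = np.cov(Z.T)

# Diagonal of Cov(Z) equals the eigenvalues; off-diagonals are ~0
print(np.allclose(np.diag(C), lam))                   # True
print(np.allclose(C - np.diag(lam), 0, atol=1e-10))   # True
```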

Application of PCA to a data set¶

  • How do you do a principal component analysis?
  1. Standardize the range of continuous initial variables
  2. Compute the covariance matrix to identify correlations
  3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
  4. Create a feature vector to decide which principal components to keep
  5. Recast the data along the principal components axes
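The five steps above can be sketched end-to-end in a few lines of numpy. The data here is synthetic, and the 90% variance threshold is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # synthetic data

# 1. Standardize each variable
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix (equals the correlation matrix after standardizing)
S = np.cov(Xs.T)

# 3. Eigenvalues and eigenvectors, sorted with the largest eigenvalue first
lam, vec = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, vec = lam[order], vec[:, order]

# 4. Feature vector: keep the components explaining, say, 90% of variance
explained = lam / lam.sum()
k = np.searchsorted(np.cumsum(explained), 0.90) + 1
W = vec[:, :k]

# 5. Recast the data onto the principal axes
Z = Xs @ W
print(Z.shape)   # (200, k) with k <= 5
```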

Exploration of the data¶

The data is in the file gym_crowdedness.csv: it records the number of people in a campus gym every 10 minutes over one year.

We import the libraries and load the csv. The dataset is available at:

https://www.kaggle.com/datasets/nsrose7224/crowdedness-at-the-campus-gym

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#from sklearn.preprocessing import StandardScaler
df = pd.read_csv('gym_crowdedness.csv')
df.head()
Out[2]:
number_people date timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 37 2015-08-14 17:00:11-07:00 61211 4 0 0 71.76 0 0 8 17
1 45 2015-08-14 17:20:14-07:00 62414 4 0 0 71.76 0 0 8 17
2 40 2015-08-14 17:30:15-07:00 63015 4 0 0 71.76 0 0 8 17
3 44 2015-08-14 17:40:16-07:00 63616 4 0 0 71.76 0 0 8 17
4 45 2015-08-14 17:50:17-07:00 64217 4 0 0 71.76 0 0 8 17

We take a quick glance at the data.

In [3]:
df.dtypes
Out[3]:
number_people             int64
date                     object
timestamp                 int64
day_of_week               int64
is_weekend                int64
is_holiday                int64
temperature             float64
is_start_of_semester      int64
is_during_semester        int64
month                     int64
hour                      int64
dtype: object
In [4]:
df.describe()
Out[4]:
number_people timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
count 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000
mean 29.072543 45799.437958 2.982504 0.282870 0.002573 58.557108 0.078831 0.660218 7.439824 12.236460
std 22.689026 24211.275891 1.996825 0.450398 0.050660 6.316396 0.269476 0.473639 3.445069 6.717631
min 0.000000 0.000000 0.000000 0.000000 0.000000 38.140000 0.000000 0.000000 1.000000 0.000000
25% 9.000000 26624.000000 1.000000 0.000000 0.000000 55.000000 0.000000 0.000000 5.000000 7.000000
50% 28.000000 46522.500000 3.000000 0.000000 0.000000 58.340000 0.000000 1.000000 8.000000 12.000000
75% 43.000000 66612.000000 5.000000 1.000000 0.000000 62.280000 0.000000 1.000000 10.000000 18.000000
max 145.000000 86399.000000 6.000000 1.000000 1.000000 87.170000 1.000000 1.000000 12.000000 23.000000
In [5]:
df.shape
Out[5]:
(62184, 11)

Then we check whether there are any null values.

In [6]:
df.isnull().any()
Out[6]:
number_people           False
date                    False
timestamp               False
day_of_week             False
is_weekend              False
is_holiday              False
temperature             False
is_start_of_semester    False
is_during_semester      False
month                   False
hour                    False
dtype: bool

In this example we will drop the column 'date' for simplicity

In [7]:
gym = df.drop('date',axis = 1)

1. Standardize the range of continuous initial variables¶

In [8]:
mean = gym.mean()
standard_deviation = gym.std()
scaled = (gym - mean) / standard_deviation
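The commented-out import above hints at the alternative of using scikit-learn's StandardScaler. One subtlety worth knowing: pandas' .std() uses the sample standard deviation (ddof=1), while StandardScaler and numpy's default use the population one (ddof=0), so the two standardizations differ by a constant factor. A small sketch on a made-up DataFrame (df_demo and its values are illustrative only):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                        'b': [10.0, 20.0, 10.0, 40.0]})

# pandas' .std() uses the sample standard deviation (ddof=1) ...
manual = (df_demo - df_demo.mean()) / df_demo.std()

# ... while ddof=0 gives the population standard deviation (what
# sklearn's StandardScaler uses); the results differ by sqrt(n/(n-1))
pop = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

n = len(df_demo)
print(np.allclose(manual * np.sqrt(n / (n - 1)), pop))  # True
```

Either convention is fine for PCA, since rescaling every variable by the same factor does not change the directions of the principal components.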
In [9]:
scaled
Out[9]:
number_people timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 0.349396 0.636545 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
1 0.701989 0.686232 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
2 0.481619 0.711056 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
3 0.657915 0.735879 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
4 0.701989 0.760702 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
... ... ... ... ... ... ... ... ... ... ...
62179 -0.267642 0.890022 1.010352 1.592215 -0.05079 0.397836 -0.292532 0.717386 -1.288748 0.857972
62180 -0.355791 0.915093 1.010352 1.592215 -0.05079 0.397836 -0.292532 0.717386 -1.288748 0.857972
62181 -0.179494 0.940081 1.010352 1.592215 -0.05079 -0.292431 -0.292532 0.717386 -1.288748 1.006834
62182 -0.488013 0.965152 1.010352 1.592215 -0.05079 -0.292431 -0.292532 0.717386 -1.288748 1.006834
62183 -0.267642 0.990099 1.010352 1.592215 -0.05079 -0.292431 -0.292532 0.717386 -1.288748 1.006834

62184 rows × 10 columns

2. Compute the covariance matrix to identify correlations¶

In [10]:
covariance_matrix = np.cov(scaled.T)
covariance_matrix
Out[10]:
array([[ 1.00000000e+00,  5.50218370e-01, -1.62061859e-01,
        -1.73957812e-01, -4.82493487e-02,  3.73327299e-01,
         1.82682899e-01,  3.35350361e-01, -9.78535478e-02,
         5.52049444e-01],
       [ 5.50218370e-01,  1.00000000e+00, -1.79319084e-03,
        -5.08807521e-04,  2.85073776e-03,  1.84849490e-01,
         9.55090525e-03,  4.46758988e-02, -2.32210763e-02,
         9.99077439e-01],
       [-1.62061859e-01, -1.79319084e-03,  1.00000000e+00,
         7.91338197e-01, -7.58620382e-02,  1.11687310e-02,
        -1.17820251e-02, -4.82362857e-03,  1.55586861e-02,
        -1.91427432e-03],
       [-1.73957812e-01, -5.08807521e-04,  7.91338197e-01,
         1.00000000e+00, -3.18988342e-02,  2.06733408e-02,
        -1.66457755e-02, -3.61271915e-02,  8.46234643e-03,
        -5.17288766e-04],
       [-4.82493487e-02,  2.85073776e-03, -7.58620382e-02,
        -3.18988342e-02,  1.00000000e+00, -8.85265918e-02,
        -1.48579083e-02, -7.07984358e-02, -9.49422885e-02,
         2.84316486e-03],
       [ 3.73327299e-01,  1.84849490e-01,  1.11687310e-02,
         2.06733408e-02, -8.85265918e-02,  1.00000000e+00,
         9.32418634e-02,  1.52475895e-01,  6.31245806e-02,
         1.85120732e-01],
       [ 1.82682899e-01,  9.55090525e-03, -1.17820251e-02,
        -1.66457755e-02, -1.48579083e-02,  9.32418634e-02,
         1.00000000e+00,  2.09862098e-01, -1.37159611e-01,
         1.00907232e-02],
       [ 3.35350361e-01,  4.46758988e-02, -4.82362857e-03,
        -3.61271915e-02, -7.07984358e-02,  1.52475895e-01,
         2.09862098e-01,  1.00000000e+00,  9.65556768e-02,
         4.55808573e-02],
       [-9.78535478e-02, -2.32210763e-02,  1.55586861e-02,
         8.46234643e-03, -9.49422885e-02,  6.31245806e-02,
        -1.37159611e-01,  9.65556768e-02,  1.00000000e+00,
        -2.36235024e-02],
       [ 5.52049444e-01,  9.99077439e-01, -1.91427432e-03,
        -5.17288766e-04,  2.84316486e-03,  1.85120732e-01,
         1.00907232e-02,  4.55808573e-02, -2.36235024e-02,
         1.00000000e+00]])
In [11]:
plt.figure(figsize =(10,10))
sns.set(font_scale = 1.5)
sns.heatmap(covariance_matrix,
cbar = True,
annot = True,
square = True,
fmt = '.2f',
annot_kws={'size':12}
)
Out[11]:
<AxesSubplot: >
[Figure: annotated heatmap of the covariance matrix]
  • If positive: the two variables increase or decrease together (correlated)
  • If negative: one increases when the other decreases (inversely correlated)

3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components¶

In [12]:
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
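Since the covariance matrix is symmetric, np.linalg.eigh is usually preferable to np.linalg.eig here: it exploits the symmetry and returns the eigenvalues sorted in ascending order, whereas eig guarantees no order at all. A small sketch with a stand-in symmetric matrix (not the covariance matrix of this dataset):

```python
import numpy as np

# A small symmetric matrix standing in for the covariance matrix
S = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])

# eigh returns ascending eigenvalues; reverse to put the largest first
lam, vec = np.linalg.eigh(S)
lam, vec = lam[::-1], vec[:, ::-1]

print(np.all(np.diff(lam) <= 0))                       # True: sorted descending
print(np.allclose(S @ vec[:, 0], lam[0] * vec[:, 0]))  # True: an eigenpair
```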

4. Create a feature vector to decide which principal components to keep¶

In [23]:
variance_explained = []
for i in eigen_values:
    variance_explained.append(i/sum(eigen_values)*100)
In [29]:
x = [i for i in range(len(variance_explained))]
plt.bar(x,variance_explained, width = 0.5, color = ['red', 'blue'])
plt.xticks(np.arange(10),('pca_1','pca_2', 'pca_3', 'pca_4','pca_5','pca_6','pca_7','pca_8','pca_9','pca_10'),rotation = 45)
plt.title('Variance by principal component')
plt.xlabel('Principal component')
plt.ylabel('Percentage of variance explained')
plt.show()
[Figure: bar chart of the percentage of variance explained by each principal component]

From the previous graphic, we can select, for example, PCA_1, PCA_2, PCA_6, PCA_7 and PCA_10, capturing 78.79% of the variance.

In [31]:
variance_captured = variance_explained[0]+variance_explained[1]+variance_explained[5]+variance_explained[6]+variance_explained[9]
variance_captured
Out[31]:
78.7952503836204
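Rather than reading the components off the bar chart by eye, the same selection can be automated by sorting the eigenvalues. The sketch below uses stand-in eigenvalues (made up to mimic the unsorted order np.linalg.eig can return, not the real ones from this dataset):

```python
import numpy as np

# Stand-in eigenvalues in an arbitrary (unsorted) order
eigen_values = np.array([2.9, 2.2, 0.05, 0.4, 0.8, 1.2, 1.1, 0.7, 0.75, 0.95])

order = np.argsort(eigen_values)[::-1]   # indices of the largest eigenvalues
top5 = order[:5]                         # the five components to keep
captured = eigen_values[top5].sum() / eigen_values.sum() * 100
print(sorted(top5.tolist()))             # e.g. [0, 1, 5, 6, 9]
```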

Now we build the feature vector

In [38]:
feature_vector = [eigen_vectors.T[0],eigen_vectors.T[1],eigen_vectors.T[5],eigen_vectors.T[6],eigen_vectors.T[9]]
feature_vector=np.array(feature_vector)

5. Recast the data along the principal components axes¶

In [35]:
scaled.shape
Out[35]:
(62184, 10)
In [40]:
feature_vector.T.shape
Out[40]:
(10, 5)
In [41]:
gym_recast = np.dot(scaled,feature_vector.T)
In [44]:
gym_recast = pd.DataFrame(gym_recast)
In [48]:
sns.pairplot(gym_recast)
Out[48]:
<seaborn.axisgrid.PairGrid at 0x7fc18c0dc040>
[Figure: pair plot of the five retained principal components]