An explanation and application of Principal component analysis (PCA)¶

PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.

  • Jolliffe, I. T. Principal Component Analysis, 2nd ed., Springer, 2002.

Definition and Derivation of Principal Components¶

Defining PCs¶

Suppose that $X$ is a vector of $p$ random variables, $X = (x_1, \ldots, x_p)^T$, and that the variances of the $p$ random variables and the structure of the covariances or correlations between them are of interest. Unless $p$ is small, or the structure is very simple, it will often not be very helpful to simply look at the $p$ variances and all of the $\binom{p}{2} = \frac{p!}{2!(p-2)!} = p(p-1)/2$ correlations or covariances. An alternative approach is to look for a few ($\ll p$) derived variables that preserve most of the information given by these variances and correlations or covariances.

Although PCA does not ignore covariances and correlations, it concentrates on variances. The first step is to look for a linear function $\alpha_1^T X$ of the elements of $X$ having maximum variance, where $\alpha_1$ is a vector of $p$ constants $\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1p}$, and $^T$ denotes transpose, so that

$$\alpha_1^T X = \alpha_{11} x_1 + \alpha_{12} x_2 + \cdots + \alpha_{1p} x_p = \sum_{j=1}^{p} \alpha_{1j} x_j$$

Next, look for a linear function $\alpha_2^T X$, uncorrelated with $\alpha_1^T X$, having maximum variance, and so on, so that at the $k$th stage a linear function $\alpha_k^T X$ is found that has maximum variance subject to being uncorrelated with $\alpha_1^T X, \alpha_2^T X, \ldots, \alpha_{k-1}^T X$. The $k$th derived variable, $\alpha_k^T X$, is the $k$th PC. Up to $p$ PCs could be found, but it is hoped, in general, that most of the variation in $X$ will be accounted for by $m$ PCs, where $m < p$.

How to find them¶

Consider, for the moment, the case where the vector of random variables $X$ has a known covariance matrix $\Sigma$. This is the matrix whose $(i,j)$th element is the (known) covariance between the $i$th and $j$th elements of $X$ when $i \neq j$, and the variance of the $j$th element of $X$ when $i = j$. The more realistic case, where $\Sigma$ is unknown, follows by replacing $\Sigma$ with a sample covariance matrix $S$. It turns out that for $k = 1, 2, \ldots, p$, the $k$th PC is given by $z_k = \alpha_k^T X$, where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. Furthermore, if $\alpha_k$ is chosen to have unit length ($\alpha_k^T \alpha_k = 1$), then $\mathrm{var}(z_k) = \lambda_k$, where $\mathrm{var}(z_k)$ denotes the variance of $z_k$.
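
To make this concrete, here is a minimal sketch (synthetic data and illustrative names, not part of the original derivation) of the recipe just described: compute a sample covariance matrix, take its eigendecomposition, and project the data onto the eigenvectors; the variances of the resulting scores match the eigenvalues.

import numpy as np

# Minimal sketch with synthetic data: the k-th PC is the eigenvector of the
# sample covariance matrix with the k-th largest eigenvalue, and the variance
# of the k-th score equals that eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.5]])   # correlated variables

S = np.cov(X, rowvar=False)                 # sample covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)        # eigh, since S is symmetric
order = np.argsort(eigvals)[::-1]           # sort eigenvalues, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

z = X @ eigvecs                             # scores z_k = alpha_k^T x for each observation
print(eigvals)                              # lambda_1 >= lambda_2 >= lambda_3 ...
print(z.var(axis=0, ddof=1))                # ... equal to var(z_1), var(z_2), var(z_3)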

Derive the form of the PCs¶

Consider first $\alpha_1^T X$; the vector $\alpha_1$ maximizes $\mathrm{var}[\alpha_1^T X] = \alpha_1^T \Sigma \alpha_1$. It is clear that, as it stands, the maximum will not be achieved for finite $\alpha_1$, so a normalization constraint must be imposed. The constraint used in the derivation is $\alpha_1^T \alpha_1 = 1$, that is, the sum of squares of the elements of $\alpha_1$ equals 1 ($\sum_{i=1}^{p} \alpha_{1i}^2 = 1$). Other constraints, for example $\max_j |\alpha_{1j}| = 1$, may be more useful in other circumstances, and can easily be substituted later on. However, the use of constraints other than $\alpha_1^T \alpha_1 = \text{constant}$ in the derivation leads to a more difficult optimization problem, and it will produce a set of derived variables different from the PCs.

A short review of the variance of a random vector¶
  • Definition 1: The expectation $E[X]$ of a random vector $X$ is given by $E[X] = (E[x_1], E[x_2], \ldots, E[x_p])^T$.
  • Definition 2: The variance-covariance matrix (or simply the covariance matrix) of a random vector $X$ is given by $\mathrm{Cov}(X) = E\left[(X - E[X])(X - E[X])^T\right]$.

  • Proposition 1: $\mathrm{Cov}(X) = E[XX^T] - E[X]E[X]^T$.
  • Proposition 2:
$$\mathrm{Cov}(X) = \begin{bmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_p) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_p) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_p, x_1) & \mathrm{Cov}(x_p, x_2) & \cdots & \mathrm{Var}(x_p) \end{bmatrix}$$

Thus, $\mathrm{Cov}(X)$ is a symmetric matrix, since $\mathrm{Cov}(x_i, x_j) = \mathrm{Cov}(x_j, x_i)$.

  • Proposition 3: Let $\alpha \in \mathbb{R}^p$; then $\mathrm{Var}(\alpha^T X) = \alpha^T \mathrm{Cov}(X)\,\alpha$ (verified numerically in the sketch below).
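
A quick numerical check of Proposition 3 (an illustrative sketch with synthetic data; all names are placeholders): the sample variance of the projected observations equals $\alpha^T \mathrm{Cov}(X)\,\alpha$ computed from the sample covariance matrix.

import numpy as np

# Check Var(alpha^T X) = alpha^T Cov(X) alpha on simulated data.
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0.0, 1.0, -2.0],
                            cov=[[2.0, 0.4, 0.1],
                                 [0.4, 1.0, 0.3],
                                 [0.1, 0.3, 0.5]],
                            size=100_000)
alpha = np.array([0.5, -1.0, 2.0])

lhs = np.var(X @ alpha, ddof=1)                 # Var(alpha^T X) from the projected samples
rhs = alpha @ np.cov(X, rowvar=False) @ alpha   # alpha^T Cov(X) alpha
print(lhs, rhs)                                 # the two numbers agree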

Maximizing $\alpha_1^T \Sigma \alpha_1$¶

Now, to maximize $\alpha_1^T \Sigma \alpha_1$ subject to $\alpha_1^T \alpha_1 = 1$, the standard approach is to use the technique of Lagrange multipliers. Maximize

$$\alpha_1^T \Sigma \alpha_1 - \lambda(\alpha_1^T \alpha_1 - 1),$$

where $\lambda$ is a Lagrange multiplier. Differentiation with respect to $\alpha_1$ gives

$$\Sigma \alpha_1 - \lambda \alpha_1 = 0, \quad \text{or} \quad (\Sigma - \lambda I_p)\,\alpha_1 = 0,$$

where $I_p$ is the $(p \times p)$ identity matrix. Thus, $\lambda$ is an eigenvalue of $\Sigma$ and $\alpha_1$ is the corresponding eigenvector. To decide which of the $p$ eigenvectors gives $\alpha_1^T X$ with maximum variance, note that the quantity to be maximized is

$$\alpha_1^T \Sigma \alpha_1 = \alpha_1^T \lambda \alpha_1 = \lambda \alpha_1^T \alpha_1 = \lambda,$$

so $\lambda$ must be as large as possible. Thus, $\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$, and $\mathrm{var}(\alpha_1^T X) = \alpha_1^T \Sigma \alpha_1 = \lambda_1$, the largest eigenvalue.
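
As a small illustration of this result (a sketch with an arbitrary synthetic $\Sigma$, unrelated to the dataset used later), the quadratic form $\alpha^T \Sigma \alpha$ evaluated at random unit vectors never exceeds the largest eigenvalue, which is attained by the corresponding eigenvector:

import numpy as np

# The leading eigenvector maximizes alpha^T Sigma alpha over unit vectors alpha.
rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
alpha_1 = eigvecs[:, np.argmax(eigvals)]                  # eigenvector of the largest eigenvalue

dirs = rng.normal(size=(10_000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)       # random unit vectors
quad_forms = np.einsum('ij,jk,ik->i', dirs, Sigma, dirs)  # alpha^T Sigma alpha for each

print(alpha_1 @ Sigma @ alpha_1, eigvals.max())           # equal: the maximum is lambda_1
print(quad_forms.max())                                   # random directions stay below lambda_1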

In general, the $k$th PC of $X$ is $\alpha_k^T X$ and $\mathrm{var}(\alpha_k^T X) = \lambda_k$, where $\lambda_k$ is the $k$th largest eigenvalue of $\Sigma$, and $\alpha_k$ is the corresponding eigenvector. This will now be proved for $k = 2$; the proof for $k \geq 3$ is slightly more complicated, but very similar.

Maximizing $\alpha_2^T \Sigma \alpha_2$¶

The second PC, $\alpha_2^T X$, maximizes $\alpha_2^T \Sigma \alpha_2$ subject to being uncorrelated with $\alpha_1^T X$, or equivalently subject to $\mathrm{cov}[\alpha_1^T X, \alpha_2^T X] = 0$. But

$$\mathrm{cov}[\alpha_1^T X, \alpha_2^T X] = \alpha_1^T \Sigma \alpha_2 = \alpha_2^T \Sigma \alpha_1 = \alpha_2^T \lambda_1 \alpha_1 = \lambda_1 \alpha_2^T \alpha_1 = \lambda_1 \alpha_1^T \alpha_2.$$

Thus, any of the equations

$$\alpha_1^T \Sigma \alpha_2 = 0, \qquad \alpha_2^T \Sigma \alpha_1 = 0, \qquad \alpha_1^T \alpha_2 = 0, \qquad \alpha_2^T \alpha_1 = 0$$

could be used to specify zero correlation between $\alpha_1^T X$ and $\alpha_2^T X$. Choosing the last of these (an arbitrary choice), and noting that a normalization constraint is again necessary, the quantity to be maximized is

$$\alpha_2^T \Sigma \alpha_2 - \lambda(\alpha_2^T \alpha_2 - 1) - \phi\, \alpha_2^T \alpha_1,$$

where $\lambda$ and $\phi$ are Lagrange multipliers. Differentiation with respect to $\alpha_2$ gives

$$\Sigma \alpha_2 - \lambda \alpha_2 - \phi \alpha_1 = 0,$$

and multiplication of this equation on the left by $\alpha_1^T$ gives

$$\alpha_1^T \Sigma \alpha_2 - \lambda \alpha_1^T \alpha_2 - \phi \alpha_1^T \alpha_1 = 0,$$

which, since the first two terms are zero and $\alpha_1^T \alpha_1 = 1$, reduces to $\phi = 0$. Therefore, $\Sigma \alpha_2 - \lambda \alpha_2 = 0$, or equivalently $(\Sigma - \lambda I_p)\,\alpha_2 = 0$, so $\lambda$ is once more an eigenvalue of $\Sigma$, and $\alpha_2$ the corresponding eigenvector.

Again, $\lambda = \alpha_2^T \Sigma \alpha_2$, so $\lambda$ is to be as large as possible. Assuming that $\Sigma$ does not have repeated eigenvalues, $\lambda$ cannot equal $\lambda_1$: if it did, it would follow that $\alpha_2 = \alpha_1$, violating the constraint $\alpha_1^T \alpha_2 = 0$. Hence $\lambda$ is the second largest eigenvalue of $\Sigma$, and $\alpha_2$ is the corresponding eigenvector.
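
Putting the two results together (again an illustrative sketch with a synthetic $\Sigma$): if the columns of $A$ are the unit-length eigenvectors of $\Sigma$, the covariance matrix of the PCs, $A^T \Sigma A$, is diagonal with the eigenvalues on the diagonal, so the PCs are uncorrelated and their variances are the $\lambda_k$.

import numpy as np

# The PCs are uncorrelated and their variances are the eigenvalues of Sigma.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
A = eigvecs[:, order]                        # columns are alpha_1, ..., alpha_p

print(np.round(A.T @ Sigma @ A, 10))         # diagonal matrix diag(lambda_1, ..., lambda_p)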

Now with $\alpha_k^T \Sigma \alpha_k$¶

As stated above, it can be shown that for the third, fourth, ..., $p$th PCs, the vectors of coefficients $\alpha_3, \alpha_4, \ldots, \alpha_p$ are the eigenvectors of $\Sigma$ corresponding to $\lambda_3, \lambda_4, \ldots, \lambda_p$, the third largest, fourth largest, ..., and the smallest eigenvalue, respectively. Furthermore,

$$\mathrm{var}[\alpha_k^T X] = \lambda_k \quad \text{for } k = 1, 2, \ldots, p.$$

Application of PCA to a data set¶

  • How do you do a principal component analysis? (A compact code sketch follows this list.)
  1. Standardize the range of continuous initial variables
  2. Compute the covariance matrix to identify correlations
  3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
  4. Create a feature vector to decide which principal components to keep
  5. Recast the data along the principal components axes
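
A compact sketch of the five steps (illustrative only; X_raw is a placeholder for any numeric data matrix with observations in rows and variables in columns, not the gym dataset used below):

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 4))                                # placeholder data

X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)     # 1. standardize
C = np.cov(X, rowvar=False)                                      # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)                             # 3. eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :2]                                               # 4. keep the top components
X_pca = X @ W                                                    # 5. recast the data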

Exploration of the data¶

The data is in the file gym_crowdedness.csv; it records the number of people in a campus gym every 10 minutes over one year.

We import the libraries and load the CSV (the dataset is available on Kaggle):

https://www.kaggle.com/datasets/nsrose7224/crowdedness-at-the-campus-gym

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#from sklearn.preprocessing import StandardScaler
df = pd.read_csv('gym_crowdedness.csv')
df.head()
Out[2]:
number_people date timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 37 2015-08-14 17:00:11-07:00 61211 4 0 0 71.76 0 0 8 17
1 45 2015-08-14 17:20:14-07:00 62414 4 0 0 71.76 0 0 8 17
2 40 2015-08-14 17:30:15-07:00 63015 4 0 0 71.76 0 0 8 17
3 44 2015-08-14 17:40:16-07:00 63616 4 0 0 71.76 0 0 8 17
4 45 2015-08-14 17:50:17-07:00 64217 4 0 0 71.76 0 0 8 17

We take a quick look at the data:

In [3]:
df.dtypes
Out[3]:
number_people             int64
date                     object
timestamp                 int64
day_of_week               int64
is_weekend                int64
is_holiday                int64
temperature             float64
is_start_of_semester      int64
is_during_semester        int64
month                     int64
hour                      int64
dtype: object
In [4]:
df.describe()
Out[4]:
number_people timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
count 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000 62184.000000
mean 29.072543 45799.437958 2.982504 0.282870 0.002573 58.557108 0.078831 0.660218 7.439824 12.236460
std 22.689026 24211.275891 1.996825 0.450398 0.050660 6.316396 0.269476 0.473639 3.445069 6.717631
min 0.000000 0.000000 0.000000 0.000000 0.000000 38.140000 0.000000 0.000000 1.000000 0.000000
25% 9.000000 26624.000000 1.000000 0.000000 0.000000 55.000000 0.000000 0.000000 5.000000 7.000000
50% 28.000000 46522.500000 3.000000 0.000000 0.000000 58.340000 0.000000 1.000000 8.000000 12.000000
75% 43.000000 66612.000000 5.000000 1.000000 0.000000 62.280000 0.000000 1.000000 10.000000 18.000000
max 145.000000 86399.000000 6.000000 1.000000 1.000000 87.170000 1.000000 1.000000 12.000000 23.000000
In [5]:
df.shape
Out[5]:
(62184, 11)

Then we check whether there are any null values:

In [6]:
df.isnull().any()
Out[6]:
number_people           False
date                    False
timestamp               False
day_of_week             False
is_weekend              False
is_holiday              False
temperature             False
is_start_of_semester    False
is_during_semester      False
month                   False
hour                    False
dtype: bool

In this example we will drop the column 'date' for simplicity.

In [7]:
gym = df.drop('date',axis = 1)

1. Standardize the range of continuous initial variables¶

In [8]:
mean = gym.mean()
standard_deviation = gym.std()
scaled = (gym - mean) / standard_deviation
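
Equivalently, the commented-out StandardScaler import above could be used (a sketch; note that StandardScaler divides by the population standard deviation, ddof = 0, so its values differ very slightly from the pandas version, which uses ddof = 1 by default):

from sklearn.preprocessing import StandardScaler

# Same standardization via scikit-learn, kept as a DataFrame for readability.
scaler = StandardScaler()
scaled_sk = pd.DataFrame(scaler.fit_transform(gym), columns=gym.columns)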
In [9]:
scaled
Out[9]:
number_people timestamp day_of_week is_weekend is_holiday temperature is_start_of_semester is_during_semester month hour
0 0.349396 0.636545 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
1 0.701989 0.686232 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
2 0.481619 0.711056 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
3 0.657915 0.735879 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
4 0.701989 0.760702 0.509557 -0.628046 -0.05079 2.090257 -0.292532 -1.393927 0.162602 0.709110
... ... ... ... ... ... ... ... ... ... ...
62179 -0.267642 0.890022 1.010352 1.592215 -0.05079 0.397836 -0.292532 0.717386 -1.288748 0.857972
62180 -0.355791 0.915093 1.010352 1.592215 -0.05079 0.397836 -0.292532 0.717386 -1.288748 0.857972
62181 -0.179494 0.940081 1.010352 1.592215 -0.05079 -0.292431 -0.292532 0.717386 -1.288748 1.006834
62182 -0.488013 0.965152 1.010352 1.592215 -0.05079 -0.292431 -0.292532 0.717386 -1.288748 1.006834
62183 -0.267642 0.990099 1.010352 1.592215 -0.05079 -0.292431 -0.292532 0.717386 -1.288748 1.006834

62184 rows × 10 columns

2. Compute the covariance matrix to identify correlations¶

In [10]:
covariance_matrix = np.cov(scaled.T)
covariance_matrix
Out[10]:
array([[ 1.00000000e+00,  5.50218370e-01, -1.62061859e-01,
        -1.73957812e-01, -4.82493487e-02,  3.73327299e-01,
         1.82682899e-01,  3.35350361e-01, -9.78535478e-02,
         5.52049444e-01],
       [ 5.50218370e-01,  1.00000000e+00, -1.79319084e-03,
        -5.08807521e-04,  2.85073776e-03,  1.84849490e-01,
         9.55090525e-03,  4.46758988e-02, -2.32210763e-02,
         9.99077439e-01],
       [-1.62061859e-01, -1.79319084e-03,  1.00000000e+00,
         7.91338197e-01, -7.58620382e-02,  1.11687310e-02,
        -1.17820251e-02, -4.82362857e-03,  1.55586861e-02,
        -1.91427432e-03],
       [-1.73957812e-01, -5.08807521e-04,  7.91338197e-01,
         1.00000000e+00, -3.18988342e-02,  2.06733408e-02,
        -1.66457755e-02, -3.61271915e-02,  8.46234643e-03,
        -5.17288766e-04],
       [-4.82493487e-02,  2.85073776e-03, -7.58620382e-02,
        -3.18988342e-02,  1.00000000e+00, -8.85265918e-02,
        -1.48579083e-02, -7.07984358e-02, -9.49422885e-02,
         2.84316486e-03],
       [ 3.73327299e-01,  1.84849490e-01,  1.11687310e-02,
         2.06733408e-02, -8.85265918e-02,  1.00000000e+00,
         9.32418634e-02,  1.52475895e-01,  6.31245806e-02,
         1.85120732e-01],
       [ 1.82682899e-01,  9.55090525e-03, -1.17820251e-02,
        -1.66457755e-02, -1.48579083e-02,  9.32418634e-02,
         1.00000000e+00,  2.09862098e-01, -1.37159611e-01,
         1.00907232e-02],
       [ 3.35350361e-01,  4.46758988e-02, -4.82362857e-03,
        -3.61271915e-02, -7.07984358e-02,  1.52475895e-01,
         2.09862098e-01,  1.00000000e+00,  9.65556768e-02,
         4.55808573e-02],
       [-9.78535478e-02, -2.32210763e-02,  1.55586861e-02,
         8.46234643e-03, -9.49422885e-02,  6.31245806e-02,
        -1.37159611e-01,  9.65556768e-02,  1.00000000e+00,
        -2.36235024e-02],
       [ 5.52049444e-01,  9.99077439e-01, -1.91427432e-03,
        -5.17288766e-04,  2.84316486e-03,  1.85120732e-01,
         1.00907232e-02,  4.55808573e-02, -2.36235024e-02,
         1.00000000e+00]])
In [11]:
plt.figure(figsize =(10,10))
sns.set(font_scale = 1.5)
sns.heatmap(covariance_matrix,
cbar = True,
annot = True,
square = True,
fmt = '.2f',
annot_kws={'size':12}
)
Out[11]:
<AxesSubplot: >
[Figure: heatmap of the covariance matrix]
  • If positive: the two variables increase or decrease together (correlated).
  • If negative: one increases when the other decreases (inversely correlated).

3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components¶

In [12]:
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
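
Note that np.linalg.eig makes no guarantee about the order of the returned eigenvalues, so the largest ones are not necessarily first; that is presumably why components other than the first five end up being selected below. If a sorted order is preferred, a small sketch (the rest of this walkthrough keeps the original, unsorted order):

order = np.argsort(eigen_values)[::-1]      # indices of the eigenvalues, largest first
sorted_values = eigen_values[order]
sorted_vectors = eigen_vectors[:, order]    # reorder the eigenvector columns to match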

4. Create a feature vector to decide which principal components to keep¶

In [23]:
variance_explained = []
for i in eigen_values:
    variance_explained.append(i/sum(eigen_values)*100)
In [29]:
x = [i for i in range(len(variance_explained))]
plt.bar(x,variance_explained, width = 0.5, color = ['red', 'blue'])
plt.xticks(np.arange(10),('pca_1','pca_2', 'pca_3', 'pca_4','pca_5','pca_6','pca_7','pca_8','pca_9','pca_10'),rotation = 45)
plt.title('Variance by principal component')
plt.xlabel('Principal component')
plt.ylabel('Percentage of variance explained')
plt.show()
[Figure: bar chart of the percentage of variance explained by each principal component]

From the previous plot we can select, for example, PCA_1, PCA_2, PCA_6, PCA_7 and PCA_10, capturing 78.79% of the variance (recall that np.linalg.eig does not sort the eigenvalues, so the largest components need not be the first ones).

In [31]:
variance_captured = variance_explained[0]+variance_explained[1]+variance_explained[5]+variance_explained[6]+variance_explained[9]
variance_captured
Out[31]:
78.7952503836204

Now we build the feature vector

In [38]:
feature_vector = [eigen_vectors.T[0],eigen_vectors.T[1],eigen_vectors.T[5],eigen_vectors.T[6],eigen_vectors.T[9]]
feature_vector=np.array(feature_vector)

5. Recast the data along the principal components axes¶

In [35]:
scaled.shape
Out[35]:
(62184, 10)
In [40]:
feature_vector.T.shape
Out[40]:
(10, 5)
In [41]:
gym_recast = np.dot(scaled,feature_vector.T)
In [44]:
gym_recast = pd.DataFrame(gym_recast)
In [48]:
sns.pairplot(gym_recast)
Out[48]:
<seaborn.axisgrid.PairGrid at 0x7fc18c0dc040>
[Figure: pair plot of the recast data]
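
As an optional sanity check (a sketch, assuming scikit-learn is available), sklearn's PCA fitted on the same standardized data should report the same percentages of explained variance, only sorted in decreasing order:

from sklearn.decomposition import PCA

# Fit PCA on the standardized data and compare with variance_explained above.
pca = PCA(n_components=10)
pca.fit(scaled)
print(np.round(pca.explained_variance_ratio_ * 100, 2))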