An explanation and application of Principal component analysis (PCA)¶
PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data.
- Jolliffe, I. *Principal Component Analysis*, 2nd ed., Springer, 2002.
Definition and Derivation of Principal Components¶
Defining PCs¶
Suppose that $\mathbf{x}$ is a vector of $p$ random variables, $x_1, x_2, \ldots, x_p$, and that the variances of the $p$ random variables and the structure of the covariances or correlations between the variables are of interest. Unless $p$ is small, or the structure is very simple, it will often not be very helpful to simply look at the $p$ variances and all of the $\tfrac{1}{2}p(p-1)$ correlations or covariances. An alternative approach is to look for a few ($\ll p$) derived variables that preserve most of the information given by these variances and correlations or covariances.

Although PCA does not ignore covariances and correlations, it concentrates on variances. The first step is to look for a linear function $\alpha_1'\mathbf{x}$ of the elements of $\mathbf{x}$ having maximum variance, where $\alpha_1$ is a vector of $p$ constants $\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1p}$, and $'$ denotes transpose, so that

$$\alpha_1'\mathbf{x} = \alpha_{11}x_1 + \alpha_{12}x_2 + \cdots + \alpha_{1p}x_p = \sum_{j=1}^{p}\alpha_{1j}x_j.$$

Next, look for a linear function $\alpha_2'\mathbf{x}$, uncorrelated with $\alpha_1'\mathbf{x}$, having maximum variance, and so on, so that at the $k$th stage a linear function $\alpha_k'\mathbf{x}$ is found that has maximum variance subject to being uncorrelated with $\alpha_1'\mathbf{x}, \alpha_2'\mathbf{x}, \ldots, \alpha_{k-1}'\mathbf{x}$. The $k$th derived variable, $\alpha_k'\mathbf{x}$, is the $k$th PC. Up to $p$ PCs could be found, but it is hoped, in general, that most of the variation in $\mathbf{x}$ will be accounted for by $m$ PCs, where $m \ll p$.
How to find them¶
Consider, for the moment, the case where the vector of random variables $\mathbf{x}$ has a known covariance matrix $\Sigma$. This is the matrix whose $(i,j)$th element is the (known) covariance between the $i$th and $j$th elements of $\mathbf{x}$ when $i \neq j$, and the variance of the $j$th element of $\mathbf{x}$ when $i = j$. The more realistic case, where $\Sigma$ is unknown, follows by replacing $\Sigma$ by a sample covariance matrix $S$. It turns out that for $k = 1, 2, \ldots, p$, the $k$th PC is given by $z_k = \alpha_k'\mathbf{x}$, where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. Furthermore, if $\alpha_k$ is chosen to have unit length ($\alpha_k'\alpha_k = 1$), then $\operatorname{var}(z_k) = \lambda_k$, where $\operatorname{var}(z_k)$ denotes the variance of $z_k$.
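A quick numerical illustration of this fact (a minimal sketch with synthetic data; the mixing matrix used to correlate the variables is arbitrary): projecting the data onto the eigenvectors of the sample covariance matrix yields variables whose variances are exactly the eigenvalues.
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                           [0.5, 1.0, 0.0],
                                           [0.3, 0.2, 0.5]])  # correlated synthetic data
S = np.cov(X, rowvar=False)                    # sample covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigh: for symmetric matrices, ascending order
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]  # sort descending
Z = X @ eigenvectors                           # the PCs z_k = alpha_k' x
print(np.var(Z, axis=0, ddof=1))               # variances of the PCs ...
print(eigenvalues)                             # ... equal the eigenvalues lambda_k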
Derive the form of the PCs¶
Consider first $\alpha_1'\mathbf{x}$; the vector $\alpha_1$ maximizes $\operatorname{var}[\alpha_1'\mathbf{x}] = \alpha_1'\Sigma\alpha_1$. It is clear that, as it stands, the maximum will not be achieved for finite $\alpha_1$, so a normalization constraint must be imposed. The constraint used in the derivation is $\alpha_1'\alpha_1 = 1$, that is, the sum of squares of the elements of $\alpha_1$ equals 1. Other constraints, for example $\max_j |\alpha_{1j}| = 1$, may be more useful in other circumstances, and can easily be substituted later on. However, the use of constraints other than $\alpha_1'\alpha_1 = 1$ in the derivation leads to a more difficult optimization problem, and it will produce a set of derived variables different from the PCs.
A short review of the variance of a random vector¶
- Definition 1: The expectation of a random vector $X = (X_1, X_2, \ldots, X_p)'$ is given by $E[X] = (E[X_1], E[X_2], \ldots, E[X_p])'$.
- Definition 2: The variance-covariance matrix (or simply the covariance matrix) $\Sigma$ of a random vector $X$ is given by $\Sigma = \operatorname{Var}(X) = E\big[(X - E[X])(X - E[X])'\big]$.
- Proposition 1: $\operatorname{Var}(X) = E[XX'] - E[X]\,E[X]'$.
- Proposition 2: The $(i,j)$th element of $\Sigma$ is $\sigma_{ij} = \operatorname{Cov}(X_i, X_j)$, and the $(i,i)$th element is $\sigma_{ii} = \operatorname{Var}(X_i)$. Thus, $\Sigma$ is a symmetric matrix, since $\operatorname{Cov}(X_i, X_j) = \operatorname{Cov}(X_j, X_i)$.
- Proposition 3: Let $a$ be a $p \times 1$ vector of constants; then $\operatorname{Var}(a'X) = a'\Sigma a$ (a numerical check follows this list).
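A quick numerical illustration of Proposition 3 (a minimal sketch with simulated data; the agreement holds up to sampling error):
import numpy as np
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                 # a known covariance matrix
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)
a = np.array([0.8, -0.5])                      # an arbitrary vector of constants
print(np.var(X @ a, ddof=1))                   # sample variance of a'X ...
print(a @ Sigma @ a)                           # ... is close to a' Sigma a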
Maximizing $\alpha_1'\Sigma\alpha_1$¶
Now, to maximize $\alpha_1'\Sigma\alpha_1$ subject to $\alpha_1'\alpha_1 = 1$, the standard approach is to use the technique of Lagrange multipliers. Maximize

$$\alpha_1'\Sigma\alpha_1 - \lambda(\alpha_1'\alpha_1 - 1),$$

where $\lambda$ is a Lagrange multiplier. Differentiation with respect to $\alpha_1$ gives

$$\Sigma\alpha_1 - \lambda\alpha_1 = 0, \quad \text{or} \quad (\Sigma - \lambda I_p)\alpha_1 = 0,$$

where $I_p$ is the $(p \times p)$ identity matrix. Thus, $\lambda$ is an eigenvalue of $\Sigma$ and $\alpha_1$ is the corresponding eigenvector. To decide which of the $p$ eigenvectors gives $\alpha_1'\mathbf{x}$ with maximum variance, note that the quantity to be maximized is

$$\alpha_1'\Sigma\alpha_1 = \alpha_1'\lambda\alpha_1 = \lambda\alpha_1'\alpha_1 = \lambda,$$

so $\lambda$ must be as large as possible. Thus, $\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$, and $\operatorname{var}(\alpha_1'\mathbf{x}) = \alpha_1'\Sigma\alpha_1 = \lambda_1$, the largest eigenvalue.
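To make the conclusion concrete, a small sketch (arbitrary synthetic covariance matrix, random unit vectors as competitors) showing that no unit vector gives a larger value of $\alpha'\Sigma\alpha$ than the eigenvector of the largest eigenvalue:
import numpy as np
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T                                  # a random symmetric positive semi-definite matrix
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
alpha_1 = eigenvectors[:, -1]                    # eigenvector of the largest eigenvalue
candidates = rng.normal(size=(10_000, 4))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)  # random unit vectors
# quadratic form a' Sigma a for every candidate vector
quad_forms = np.einsum('ij,jk,ik->i', candidates, Sigma, candidates)
print(quad_forms.max(), '<=', alpha_1 @ Sigma @ alpha_1)         # bounded by the largest eigenvalue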
In general, the $k$th PC of $\mathbf{x}$ is $\alpha_k'\mathbf{x}$ and $\operatorname{var}(\alpha_k'\mathbf{x}) = \lambda_k$, where $\lambda_k$ is the $k$th largest eigenvalue of $\Sigma$, and $\alpha_k$ is the corresponding eigenvector. This will now be proved for $k = 2$; the proof for $k \geq 3$ is slightly more complicated, but very similar.
Maximizing $\alpha_2'\Sigma\alpha_2$¶
The second PC, $\alpha_2'\mathbf{x}$, maximizes $\alpha_2'\Sigma\alpha_2$ subject to being uncorrelated with $\alpha_1'\mathbf{x}$, or equivalently subject to $\operatorname{cov}[\alpha_1'\mathbf{x}, \alpha_2'\mathbf{x}] = 0$. But

$$\operatorname{cov}[\alpha_1'\mathbf{x}, \alpha_2'\mathbf{x}] = \alpha_1'\Sigma\alpha_2 = \alpha_2'\Sigma\alpha_1 = \alpha_2'\lambda_1\alpha_1 = \lambda_1\alpha_2'\alpha_1 = \lambda_1\alpha_1'\alpha_2.$$

Thus, any of the equations

$$\alpha_1'\Sigma\alpha_2 = 0, \quad \alpha_2'\Sigma\alpha_1 = 0, \quad \alpha_1'\alpha_2 = 0, \quad \alpha_2'\alpha_1 = 0$$

could be used to specify zero correlation between $\alpha_1'\mathbf{x}$ and $\alpha_2'\mathbf{x}$. Choosing the last of these (an arbitrary choice), and noting that a normalization constraint is again necessary, the quantity to be maximized is

$$\alpha_2'\Sigma\alpha_2 - \lambda(\alpha_2'\alpha_2 - 1) - \phi\,\alpha_2'\alpha_1,$$

where $\lambda$, $\phi$ are Lagrange multipliers. Differentiation with respect to $\alpha_2$ gives

$$\Sigma\alpha_2 - \lambda\alpha_2 - \phi\alpha_1 = 0,$$

and multiplication of this equation on the left by $\alpha_1'$ gives

$$\alpha_1'\Sigma\alpha_2 - \lambda\alpha_1'\alpha_2 - \phi\alpha_1'\alpha_1 = 0,$$

which, since the first two terms are zero and $\alpha_1'\alpha_1 = 1$, reduces to $\phi = 0$. Therefore, $\Sigma\alpha_2 - \lambda\alpha_2 = 0$, or equivalently $(\Sigma - \lambda I_p)\alpha_2 = 0$, so $\lambda$ is once more an eigenvalue of $\Sigma$, and $\alpha_2$ the corresponding eigenvector.

Again, $\lambda = \alpha_2'\Sigma\alpha_2$, so $\lambda$ is to be as large as possible. Assuming that $\Sigma$ does not have repeated eigenvalues, $\lambda$ cannot equal $\lambda_1$. If it did, it follows that $\alpha_2 = \alpha_1$, violating the constraint $\alpha_1'\alpha_2 = 0$. Hence $\lambda$ is the second largest eigenvalue of $\Sigma$, and $\alpha_2$ is the corresponding eigenvector.
Now with $k = 3, 4, \ldots, p$¶
As stated above, it can be shown that for the third, fourth, ..., $p$th PCs, the vectors of coefficients $\alpha_3, \alpha_4, \ldots, \alpha_p$ are the eigenvectors of $\Sigma$ corresponding to $\lambda_3, \lambda_4, \ldots, \lambda_p$, the third and fourth largest, ..., and the smallest eigenvalue, respectively. Furthermore,

$$\operatorname{var}[\alpha_k'\mathbf{x}] = \lambda_k \quad \text{for } k = 1, 2, \ldots, p.$$
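One further standard fact is worth recording here, since it justifies the "percentage of variance explained" computed in the application below: the total variance of $\mathbf{x}$ is the trace of $\Sigma$, and the trace of a symmetric matrix equals the sum of its eigenvalues, so

$$\sum_{j=1}^{p}\operatorname{var}(x_j) = \operatorname{tr}(\Sigma) = \sum_{k=1}^{p}\lambda_k,$$

and the proportion of the total variance accounted for by the $k$th PC is $\lambda_k \big/ \sum_{j=1}^{p}\lambda_j$.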
Application of PCA to a data set¶
- How do you do a principal component analysis? (Each step is carried out on a real dataset below; a compact end-to-end sketch follows this list.)
- Standardize the range of continuous initial variables
- Compute the covariance matrix to identify correlations
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Create a feature vector to decide which principal components to keep
- Recast the data along the principal components axes
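Before applying the steps to the real dataset, here is a minimal end-to-end sketch of the same five steps in NumPy (synthetic data; all names are illustrative only):
import numpy as np
rng = np.random.default_rng(3)
data = rng.normal(size=(200, 4)) * [1.0, 3.0, 0.5, 2.0]    # synthetic data, 4 variables
# 1. Standardize
scaled_demo = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
# 2. Covariance matrix
cov = np.cov(scaled_demo, rowvar=False)
# 3. Eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)            # eigh: suited to symmetric matrices
order = np.argsort(eigenvalues)[::-1]                      # sort descending by eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# 4. Feature vector: keep the components explaining most of the variance (here, the first two)
feature_vec = eigenvectors[:, :2]
# 5. Recast the data along the principal component axes
recast = scaled_demo @ feature_vec
print(recast.shape)                                        # (200, 2)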
Exploration of the data¶
The data is in the file gym_crowdedness.csv; it records the number of people in a gym every 10 minutes over one year.
Import the libraries and load the CSV:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#from sklearn.preprocessing import StandardScaler
df = pd.read_csv('gym_crowdedness.csv')
df.head()
  | number_people | date | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour
---|---|---|---|---|---|---|---|---|---|---|---
0 | 37 | 2015-08-14 17:00:11-07:00 | 61211 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
1 | 45 | 2015-08-14 17:20:14-07:00 | 62414 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
2 | 40 | 2015-08-14 17:30:15-07:00 | 63015 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
3 | 44 | 2015-08-14 17:40:16-07:00 | 63616 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
4 | 45 | 2015-08-14 17:50:17-07:00 | 64217 | 4 | 0 | 0 | 71.76 | 0 | 0 | 8 | 17 |
Let's take a general look at the data:
df.dtypes
number_people              int64
date                      object
timestamp                  int64
day_of_week                int64
is_weekend                 int64
is_holiday                 int64
temperature              float64
is_start_of_semester       int64
is_during_semester         int64
month                      int64
hour                       int64
dtype: object
df.describe()
  | number_people | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour
---|---|---|---|---|---|---|---|---|---|---
count | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 | 62184.000000 |
mean | 29.072543 | 45799.437958 | 2.982504 | 0.282870 | 0.002573 | 58.557108 | 0.078831 | 0.660218 | 7.439824 | 12.236460 |
std | 22.689026 | 24211.275891 | 1.996825 | 0.450398 | 0.050660 | 6.316396 | 0.269476 | 0.473639 | 3.445069 | 6.717631 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.140000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
25% | 9.000000 | 26624.000000 | 1.000000 | 0.000000 | 0.000000 | 55.000000 | 0.000000 | 0.000000 | 5.000000 | 7.000000 |
50% | 28.000000 | 46522.500000 | 3.000000 | 0.000000 | 0.000000 | 58.340000 | 0.000000 | 1.000000 | 8.000000 | 12.000000 |
75% | 43.000000 | 66612.000000 | 5.000000 | 1.000000 | 0.000000 | 62.280000 | 0.000000 | 1.000000 | 10.000000 | 18.000000 |
max | 145.000000 | 86399.000000 | 6.000000 | 1.000000 | 1.000000 | 87.170000 | 1.000000 | 1.000000 | 12.000000 | 23.000000 |
df.shape
(62184, 11)
Then we check whether there are any null values:
df.isnull().any()
number_people           False
date                    False
timestamp               False
day_of_week             False
is_weekend              False
is_holiday              False
temperature             False
is_start_of_semester    False
is_during_semester      False
month                   False
hour                    False
dtype: bool
In this example we will drop the column 'date' for simplicity
gym = df.drop('date',axis = 1)
1. Standardize the range of continuous initial variables¶
# z-score standardization: subtract the mean and divide by the standard deviation
mean = gym.mean()
standard_deviation = gym.std()
scaled = (gym - mean) / standard_deviation
scaled
  | number_people | timestamp | day_of_week | is_weekend | is_holiday | temperature | is_start_of_semester | is_during_semester | month | hour
---|---|---|---|---|---|---|---|---|---|---
0 | 0.349396 | 0.636545 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
1 | 0.701989 | 0.686232 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
2 | 0.481619 | 0.711056 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
3 | 0.657915 | 0.735879 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
4 | 0.701989 | 0.760702 | 0.509557 | -0.628046 | -0.05079 | 2.090257 | -0.292532 | -1.393927 | 0.162602 | 0.709110 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
62179 | -0.267642 | 0.890022 | 1.010352 | 1.592215 | -0.05079 | 0.397836 | -0.292532 | 0.717386 | -1.288748 | 0.857972 |
62180 | -0.355791 | 0.915093 | 1.010352 | 1.592215 | -0.05079 | 0.397836 | -0.292532 | 0.717386 | -1.288748 | 0.857972 |
62181 | -0.179494 | 0.940081 | 1.010352 | 1.592215 | -0.05079 | -0.292431 | -0.292532 | 0.717386 | -1.288748 | 1.006834 |
62182 | -0.488013 | 0.965152 | 1.010352 | 1.592215 | -0.05079 | -0.292431 | -0.292532 | 0.717386 | -1.288748 | 1.006834 |
62183 | -0.267642 | 0.990099 | 1.010352 | 1.592215 | -0.05079 | -0.292431 | -0.292532 | 0.717386 | -1.288748 | 1.006834 |
62184 rows × 10 columns
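The commented-out StandardScaler import would do the same job; a sketch of the equivalent call (note that StandardScaler divides by the population standard deviation, ddof=0, so its values differ very slightly from the pandas ddof=1 version above):
from sklearn.preprocessing import StandardScaler
# Equivalent standardization with scikit-learn (population standard deviation, ddof=0)
scaled_sk = StandardScaler().fit_transform(gym)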
2. Compute the covariance matrix to identify correlations¶
covariance_matrix = np.cov(scaled.T)
covariance_matrix
array([[ 1.00000000e+00,  5.50218370e-01, -1.62061859e-01, -1.73957812e-01, -4.82493487e-02,  3.73327299e-01,  1.82682899e-01,  3.35350361e-01, -9.78535478e-02,  5.52049444e-01],
       [ 5.50218370e-01,  1.00000000e+00, -1.79319084e-03, -5.08807521e-04,  2.85073776e-03,  1.84849490e-01,  9.55090525e-03,  4.46758988e-02, -2.32210763e-02,  9.99077439e-01],
       [-1.62061859e-01, -1.79319084e-03,  1.00000000e+00,  7.91338197e-01, -7.58620382e-02,  1.11687310e-02, -1.17820251e-02, -4.82362857e-03,  1.55586861e-02, -1.91427432e-03],
       [-1.73957812e-01, -5.08807521e-04,  7.91338197e-01,  1.00000000e+00, -3.18988342e-02,  2.06733408e-02, -1.66457755e-02, -3.61271915e-02,  8.46234643e-03, -5.17288766e-04],
       [-4.82493487e-02,  2.85073776e-03, -7.58620382e-02, -3.18988342e-02,  1.00000000e+00, -8.85265918e-02, -1.48579083e-02, -7.07984358e-02, -9.49422885e-02,  2.84316486e-03],
       [ 3.73327299e-01,  1.84849490e-01,  1.11687310e-02,  2.06733408e-02, -8.85265918e-02,  1.00000000e+00,  9.32418634e-02,  1.52475895e-01,  6.31245806e-02,  1.85120732e-01],
       [ 1.82682899e-01,  9.55090525e-03, -1.17820251e-02, -1.66457755e-02, -1.48579083e-02,  9.32418634e-02,  1.00000000e+00,  2.09862098e-01, -1.37159611e-01,  1.00907232e-02],
       [ 3.35350361e-01,  4.46758988e-02, -4.82362857e-03, -3.61271915e-02, -7.07984358e-02,  1.52475895e-01,  2.09862098e-01,  1.00000000e+00,  9.65556768e-02,  4.55808573e-02],
       [-9.78535478e-02, -2.32210763e-02,  1.55586861e-02,  8.46234643e-03, -9.49422885e-02,  6.31245806e-02, -1.37159611e-01,  9.65556768e-02,  1.00000000e+00, -2.36235024e-02],
       [ 5.52049444e-01,  9.99077439e-01, -1.91427432e-03, -5.17288766e-04,  2.84316486e-03,  1.85120732e-01,  1.00907232e-02,  4.55808573e-02, -2.36235024e-02,  1.00000000e+00]])
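Because the variables were standardized first, this covariance matrix is also the correlation matrix of the original variables, so the same numbers can be obtained directly from pandas (a quick check):
# The correlation matrix of the raw data equals the covariance matrix of the standardized data
print(np.allclose(gym.corr().to_numpy(), covariance_matrix))   # True, up to floating-point error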
plt.figure(figsize =(10,10))
sns.set(font_scale = 1.5)
sns.heatmap(covariance_matrix,
cbar = True,
annot = True,
square = True,
fmt = '.2f',
annot_kws={'size':12}
)
<AxesSubplot: >
- If an entry is positive, the two variables increase or decrease together (correlated)
- If an entry is negative, one increases when the other decreases (inversely correlated)
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components¶
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
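A side note (an alternative to the call above, not what this notebook uses): since the covariance matrix is symmetric, np.linalg.eigh is the numerically preferred routine, and neither routine guarantees eigenvalues in descending order, so it is common to sort them explicitly:
# Alternative: eigh is designed for symmetric matrices and returns real eigenvalues
eigen_values_h, eigen_vectors_h = np.linalg.eigh(covariance_matrix)
order = np.argsort(eigen_values_h)[::-1]        # sort descending by eigenvalue
eigen_values_h = eigen_values_h[order]
eigen_vectors_h = eigen_vectors_h[:, order]     # eigenvectors are the columns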
4. Create a feature vector to decide which principal components to keep¶
# Percentage of the total variance accounted for by each eigenvalue
variance_explained = []
for i in eigen_values:
    variance_explained.append(i / sum(eigen_values) * 100)
x = [i for i in range(len(variance_explained))]
plt.bar(x,variance_explained, width = 0.5, color = ['red', 'blue'])
plt.xticks(np.arange(10),('pca_1','pca_2', 'pca_3', 'pca_4','pca_5','pca_6','pca_7','pca_8','pca_9','pca_10'),rotation = 45)
plt.title('Variance by principal component')
plt.xlabel('Principal component')
plt.ylabel('Percentage of variance explained')
plt.show()
From the previous plot we can select, for example, pca_1, pca_2, pca_6, pca_7 and pca_10, which together capture 78.79% of the variance.
variance_captured = variance_explained[0]+variance_explained[1]+variance_explained[5]+variance_explained[6]+variance_explained[9]
variance_captured
78.7952503836204
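The same total can be obtained without hard-coding the indices (a sketch; it assumes the five selected components are indeed the five tallest bars in the plot):
# Sum of the five largest percentages of explained variance
top_five = np.sort(variance_explained)[::-1][:5]
print(top_five.sum())   # should match the manual sum above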
Now we build the feature vector
# Each eigenvector is a column of eigen_vectors, so eigen_vectors.T[i] is the i-th eigenvector
feature_vector = [eigen_vectors.T[0], eigen_vectors.T[1], eigen_vectors.T[5], eigen_vectors.T[6], eigen_vectors.T[9]]
feature_vector = np.array(feature_vector)
5. Recast the data along the principal components axes¶
scaled.shape
(62184, 10)
feature_vector.T.shape
(10, 5)
gym_recast = np.dot(scaled,feature_vector.T)
gym_recast = pd.DataFrame(gym_recast)
sns.pairplot(gym_recast)
<seaborn.axisgrid.PairGrid at 0x7fc18c0dc040>
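For comparison (a sketch, not part of the original analysis), scikit-learn's PCA bundles steps 2-5 in a single estimator; it keeps the components sorted by decreasing explained variance, so the ordering and signs may differ from the hand-built feature vector above, although the retained variance should be comparable:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
projected = pca.fit_transform(scaled)           # 'scaled' is the standardized data from step 1
print(pca.explained_variance_ratio_.sum())      # fraction of total variance retained by 5 components
print(projected.shape)                          # (62184, 5)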