What do Principal Components Actually do Mathematically?

I have recently taken an interest in PCA after watching Professor Gilbert Strang’s PCA lecture. I must have watched at least 15 other videos and read 7 different blog posts on PCA since. They are all excellent resources, but I found myself somewhat unsatisfied. What most of them teach us is the following:

  • What the PCA promise is;

  • Why that promise is very useful in Data Science; and

  • How to extract these principal components. (Although I don't agree with how some of them do it by applying SVD on the covariance matrix, that can be saved for another post.)

Some of them go the extra mile to show how the promise is being fulfilled graphically. For example, a transformed vector can be shown to be still clustered with its original group in a plot.

Objective

To me, that plot is not visually striking enough, and the component-extraction part usually covers only the how. Therefore, the objective of this post is to shift our focus onto these two areas - to establish a more precise goal before we dive into extracting the components, and to end this post with a more striking visual.

Prerequisites

This post is for you if:

  • You have already seen the aforementioned plot - just a bonus actually;

  • You have a decent understanding of what the covariance matrix is about;

  • You have a good foundation in linear algebra; and

  • Your heart is longing to discover the principal components, instead of being told what they are!

How to Choose P?

After hearing my dissatisfaction, my friend Calvin recommended a paper by Jonathon Shlens to me - A Tutorial on Principal Component Analysis. It is by far the best resource I have come across on PCA. However, it is also a bit lengthier than your typical blog post, so the remainder of this post will focus on Section 5 of the paper, where Jonathon immediately establishes the following goal:

The [original] dataset is $X$, an $m \times n$ matrix. Find some orthonormal matrix $P$ in $Y = PX$ such that $C_Y \equiv \frac{1}{n}YY^T$ is a diagonal matrix.[1] The rows of $P$ shall be the principal components of $X$.

As you might have noticed, $C_Y$ here is the covariance matrix of our rotated dataset $Y$. Why do we want $C_Y$ to be diagonal? Before we address this question, let’s generate a dataset $X$ consisting of 4 features with some random values.

"""
Mostly helper functions.
Skip ahead unless you would like to follow the steps on your local machine.
"""
from IPython.display import Latex, display
from string import ascii_lowercase
import numpy as np
import pandas as pd

FEAT_NUM, SAMPLE_NUM = 4, 4

def covariance_matrix(dataset):
    # Follows the paper's (1/n) * X X^T convention.
    return dataset @ dataset.transpose() / SAMPLE_NUM

def tabulate(dataset, rotated=False):
    '''
    Label row(s) and column(s) of a matrix by wrapping it in a dataframe.
    '''
    if rotated:
        prefix = 'new_'
        feats = ascii_lowercase[FEAT_NUM:2 * FEAT_NUM]
    else:
        prefix = ''
        feats = ascii_lowercase[0:FEAT_NUM]
    return pd.DataFrame.from_records(dataset,
                                     columns=['sample{}'.format(num) for num in range(SAMPLE_NUM)],
                                     index=['{}feat_{}'.format(prefix, feat) for feat in feats])

def display_df(dataset, latex=False):
    rounded = dataset.round(15)
    if latex:
        display(Latex(rounded.to_latex()))
    else:
        display(rounded)
x = tabulate(np.random.rand(FEAT_NUM, SAMPLE_NUM))
display_df(x)

         sample0   sample1   sample2   sample3
feat_a  0.472612  0.453242  0.811147  0.237625
feat_b  0.728994  0.916212  0.202783  0.116406
feat_c  0.803590  0.967202  0.659594  0.726142
feat_d  0.771849  0.753178  0.153215  0.459026

$X$ above just looks like a normal dataset. Nothing special. What about its covariance matrix?

c_x = covariance_matrix(x)
display_df(c_x)

          feat_a    feat_b    feat_c    feat_d
feat_a  0.285804  0.237986  0.381435  0.234878
feat_b  0.237986  0.356387  0.422564  0.334312
feat_c  0.381435  0.422564  0.635896  0.445776
feat_d  0.234878  0.334312  0.445776  0.349302

Its covariance matrix $C_X$ doesn't look that interesting either. However, let us recall that a covariance matrix is always a symmetric matrix with the variances on its diagonal and the covariances off-diagonal, i.e., it has the following form:

$$\large \begin{bmatrix} var(a, a) & cov(a, b) & cov(a, c) \\ cov(b, a) & var(b, b) & cov(b, c) \\ cov(c, a) & cov(c, b) & var(c, c) \end{bmatrix}$$

Let's also recall that $cov(x, y)$ is zero if and only if features $x$ and $y$ are uncorrelated. The non-zero covariances in $C_X$ are an indication that there is quite a bit of redundancy among the features of $X$. What we are going to do here is feature extraction. We would like to rotate our dataset so that the change of basis gives us features that are uncorrelated with each other, i.e., a new covariance matrix that is diagonal.
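
To make this concrete, here is a minimal check (a sketch of mine, reusing x, c_x and SAMPLE_NUM from the code above) that each entry of $C_X$ is just a dot product of two feature rows divided by $n$, per the paper's $\frac{1}{n}XX^T$ convention (which coincides with the usual covariance when the features are mean-centered):

manual_cov_ab = np.dot(x.loc['feat_a'], x.loc['feat_b']) / SAMPLE_NUM  # cov(a, b) by hand
print(np.isclose(manual_cov_ab, c_x.loc['feat_a', 'feat_b']))  # True
print(manual_cov_ab)  # clearly non-zero, so feat_a and feat_b are far from uncorrelated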

Time to Choose

With a clearer goal now, let's figure out how we can achieve it.

$$\begin{array}{ccc} \text{Givens} & \text{Goal} & \text{Unknown} \\ \hline \begin{gathered}Y = PX \\ C_X \equiv \frac{1}{n}XX^T \\ C_Y \equiv \frac{1}{n}YY^T\end{gathered} & C_Y\text{ to be diagonal} & \text{How to Choose }P\text{?} \end{array}$$

From the givens above, we are able to derive the relationship between $C_Y$ and $C_X$ in terms of $P$:

$$C_Y = \frac{1}{n}YY^T = \frac{1}{n}(PX)(PX)^T = \frac{1}{n}PXX^TP^T$$
$$C_Y = PC_XP^T$$
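
Note that this identity holds for any matrix $P$, not just the one we are after. As a quick sanity check (a minimal sketch reusing x, c_x and covariance_matrix from above), we can try it with an arbitrary orthonormal matrix obtained from a QR decomposition:

random_p, _ = np.linalg.qr(np.random.rand(FEAT_NUM, FEAT_NUM))  # an arbitrary orthonormal matrix
lhs = covariance_matrix(random_p @ x.to_numpy())
rhs = random_p @ c_x.to_numpy() @ random_p.transpose()
print(np.allclose(lhs, rhs))  # True, although this particular C_Y is generally not diagonal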

Let's recall one more time that all covariance matrices are symmetric, and any symmetric matrix can be eigendecomposed as

$$Q{\Lambda}Q^T$$

where $Q$ is an orthogonal matrix whose columns are the eigenvectors of the symmetric matrix, and $\Lambda$ is a diagonal matrix whose entries are the corresponding eigenvalues. There is usually more than one way to choose $P$, but eigendecomposing $C_X$ will prove to make our life much easier. Let's see what we can do with it:

$$C_Y = PQ{\Lambda}Q^TP^T$$

Since we know $\Lambda$ is diagonal and $Q^TQ \equiv I$, what if we choose $P$ to be $Q^T$?

$$C_Y = Q^TQ{\Lambda}Q^TQ = I{\Lambda}I$$
$$C_Y = \Lambda$$

Voilà! By choosing the rows of $P$ to be the eigenvectors of $C_X$, we are able to transform $X$ into a $Y$ whose features are uncorrelated with each other!
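
Before running the full test, we can sanity-check the two eigendecomposition facts we relied on. This is a minimal sketch using np.linalg.eigh, which is designed for symmetric matrices, applied to our concrete $C_X$:

eigenvalues, q_check = np.linalg.eigh(c_x)  # eigendecomposition of the symmetric matrix C_X
lam = np.diag(eigenvalues)                  # Λ as a diagonal matrix
print(np.allclose(q_check @ lam @ q_check.transpose(), c_x))         # C_X == Q Λ Q^T
print(np.allclose(q_check.transpose() @ q_check, np.eye(FEAT_NUM)))  # Q^T Q == I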

Test it

Well, that was quite convenient, wasn't it? What's even better is that we can demonstrate it in a few lines of code:

_, q = np.linalg.eig(c_x)  # Eigendecomposition: the columns of q are the eigenvectors of C_X
p = q.transpose()          # choose P = Q^T, so the rows of p are the principal components
y = tabulate(p @ x, rotated=True)
display_df(y)

             sample0   sample1   sample2   sample3
new_feat_e  1.400186  1.576029  0.906121  0.830166
new_feat_f -0.162144 -0.225848  0.572904  0.076917
new_feat_g -0.042285 -0.086877 -0.091316  0.335921
new_feat_h  0.087761 -0.072164 -0.002497 -0.008295
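
As a side note, we can also confirm that our chosen $P$ is orthonormal, i.e., that it does nothing more than rotate $X$ (see footnote [1]). This is a small check on the p defined above:

print(np.allclose(p @ p.transpose(), np.eye(FEAT_NUM)))  # expect True: P P^T is (numerically) the identity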

The transformed dataset $Y$, with its newly extracted features $e$ to $h$, doesn't look like anything special either. What about its covariance matrix?

c_y = covariance_matrix(y)
display_df(c_y)

            new_feat_e  new_feat_f  new_feat_g  new_feat_h
new_feat_e    1.488654    0.000000    0.000000    0.000000
new_feat_f    0.000000    0.102858    0.000000   -0.000000
new_feat_g    0.000000    0.000000    0.032629   -0.000000
new_feat_h    0.000000   -0.000000   -0.000000    0.003246

Holy moly, isn't this exactly what we were aiming for from the beginning, achieved with just a few lines of code? From a dataset with redundant and less interesting features, we have extracted new features that are much more meaningful to look at, simply by diagonalizing its covariance matrix. Let's wrap this up with some side-by-side comparisons.

display_df(x, latex=True)
display_df(c_x, latex=True)
display_df(y, latex=True)
display_df(c_y, latex=True)
$$\footnotesize \begin{array}{cc} X & \text{Covariance Matrix of }X \\ \begin{array}{|l|rrrr|} \hline {} & sample0 & sample1 & sample2 & sample3 \\ \hline feat\_a & 0.472612 & 0.453242 & 0.811147 & 0.237625 \\ feat\_b & 0.728994 & 0.916212 & 0.202783 & 0.116406 \\ feat\_c & 0.803590 & 0.967202 & 0.659594 & 0.726142 \\ feat\_d & 0.771849 & 0.753178 & 0.153215 & 0.459026 \\ \hline \end{array} & \begin{array}{|l|rrrr|} \hline {} & feat\_a & feat\_b & feat\_c & feat\_d \\ \hline feat\_a & 0.285804 & 0.237986 & 0.381435 & 0.234878 \\ feat\_b & 0.237986 & 0.356387 & 0.422564 & 0.334312 \\ feat\_c & 0.381435 & 0.422564 & 0.635896 & 0.445776 \\ feat\_d & 0.234878 & 0.334312 & 0.445776 & 0.349302 \\ \hline \end{array} \\ \end{array}$$
$$\footnotesize \begin{array}{cc} Y & \text{Covariance Matrix of }Y \\ \begin{array}{|l|rrrr|} \hline {} & sample0 & sample1 & sample2 & sample3 \\ \hline new\_feat\_e & 1.400186 & 1.576029 & 0.906121 & 0.830166 \\ new\_feat\_f & -0.162144 & -0.225848 & 0.572904 & 0.076917 \\ new\_feat\_g & -0.042285 & -0.086877 & -0.091316 & 0.335921 \\ new\_feat\_h & 0.087761 & -0.072164 & -0.002497 & -0.008295 \\ \hline \end{array} & \begin{array}{|l|rrrr|} \hline {} & new\_feat\_e & new\_feat\_f & new\_feat\_g & new\_feat\_h \\ \hline new\_feat\_e & 1.488654 & 0.000000 & 0.000000 & 0.000000 \\ new\_feat\_f & 0.000000 & 0.102858 & 0.000000 & -0.000000 \\ new\_feat\_g & 0.000000 & 0.000000 & 0.032629 & -0.000000 \\ new\_feat\_h & 0.000000 & -0.000000 & -0.000000 & 0.003246 \\ \hline \end{array} \\ \end{array}$$

Look at this. Isn't it just beautiful?
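
One last check, since we derived $C_Y = \Lambda$: the diagonal of $C_Y$ should contain exactly the eigenvalues of $C_X$ (a minimal sketch reusing c_x and c_y from above):

eigenvalues, _ = np.linalg.eig(c_x)  # same call as in the test above
print(np.allclose(np.sort(np.diag(c_y)), np.sort(eigenvalues)))  # True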

[1]: The reason orthonormality is part of the goal is that we do not want to do anything more than rotations. We do not want to modify $X$. We only want to re-express $X$ by carefully choosing a change of basis.
