What Do Principal Components Actually Do Mathematically?
I recently took an interest in PCA after watching Professor Gilbert Strang's PCA lecture. Since then I must have watched at least 15 other videos and read 7 different blog posts on PCA. They are all excellent resources, but I found myself somewhat unsatisfied. What most of them do is teach us the following:
What the PCA promise is;
Why that promise is very useful in Data Science; and
How to extract these principal components. (Although I don't agree with how some of them do it, by applying SVD to the covariance matrix, that discussion can be saved for another post.)
Some of them go the extra mile to show graphically how the promise is fulfilled. For example, a transformed vector can be shown to still cluster with its original group in a plot.
Objective
To me, that plot does not provide a striking enough visual effect. The component-extraction part, on the other hand, mostly covers the how and not the why. The objective of this post is therefore to shift our focus onto these two areas: to establish a more precise goal before we dive into extracting the components, and to end the post with a more striking visual.
Prerequisites
This post is for you if:
You have already seen the aforementioned plot (this is just a bonus, actually);
You have a decent understanding of what the covariance matrix is about;
You have a good foundation in linear algebra; and
Your heart is longing to discover the principal components, instead of being told what they are!
How to Choose P?
After hearing my dissatisfaction, my friend Calvin recommended Jonathon Shlens's paper, A Tutorial on Principal Component Analysis, to me. It is by far the best resource I have come across on PCA. However, it is also a bit lengthier than your typical blog post, so the remainder of this post will focus on section 5 of the paper. There, Jonathon immediately establishes the following goal:

Find some orthonormal matrix $P$ in $Y = PX$ such that $C_Y \equiv \frac{1}{n} Y Y^{\mathsf{T}}$ is a diagonal matrix, where $n$ is the number of samples (columns) of $X$. The rows of $P$ are then the principal components of $X$.
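To make the goal concrete, the rest of the post works through a small example: a randomly generated dataset $X$ with 4 features and 4 samples, together with its covariance matrix $C_X$. Below is a minimal sketch of how such an example could be set up; the feature and sample names match the tables that follow, but since the random seed is not fixed, a fresh run will produce different numbers. The covariance is computed in the $\frac{1}{n} X X^{\mathsf{T}}$ form used by the tutorial, without mean-centering, which is the convention the tables below appear to follow.

```python
import numpy as np
import pandas as pd

# A small random dataset: 4 features (rows) x 4 samples (columns),
# drawn uniformly from [0, 1).
rng = np.random.default_rng()
X = pd.DataFrame(
    rng.random((4, 4)),
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
    columns=["sample0", "sample1", "sample2", "sample3"],
)

# Covariance matrix in the 1/n * X X^T form of the tutorial
# (no mean-centering, unlike pandas' DataFrame.cov()).
n = X.shape[1]
C_X = X @ X.T / n

print(X)
print(C_X)
```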
Our randomly generated dataset $X$ (rows are features, columns are samples):

| | sample0 | sample1 | sample2 | sample3 |
| --- | --- | --- | --- | --- |
| feat_a | 0.472612 | 0.453242 | 0.811147 | 0.237625 |
| feat_b | 0.728994 | 0.916212 | 0.202783 | 0.116406 |
| feat_c | 0.803590 | 0.967202 | 0.659594 | 0.726142 |
| feat_d | 0.771849 | 0.753178 | 0.153215 | 0.459026 |
And its covariance matrix $C_X$:

| | feat_a | feat_b | feat_c | feat_d |
| --- | --- | --- | --- | --- |
| feat_a | 0.285804 | 0.237986 | 0.381435 | 0.234878 |
| feat_b | 0.237986 | 0.356387 | 0.422564 | 0.334312 |
| feat_c | 0.381435 | 0.422564 | 0.635896 | 0.445776 |
| feat_d | 0.234878 | 0.334312 | 0.445776 | 0.349302 |
Time to Choose
With a clearer goal now, let's figure out how we can achieve it.
Let's recall one more time that every covariance matrix is symmetric, and that any symmetric matrix can be "eigendecomposed" as $A = E D E^{\mathsf{T}}$, where the columns of $E$ are orthonormal eigenvectors of $A$ (so $E^{\mathsf{T}} E = I$) and $D$ is a diagonal matrix holding the corresponding eigenvalues. In particular, our covariance matrix factors as $C_X = E D E^{\mathsf{T}}$.
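This immediately suggests a choice of $P$, following section 5 of Shlens's tutorial: take $P = E^{\mathsf{T}}$, so that each row of $P$ is an eigenvector of $C_X$. A quick check shows why this works:

$$
C_Y = \frac{1}{n} Y Y^{\mathsf{T}}
    = \frac{1}{n} (PX)(PX)^{\mathsf{T}}
    = P \left( \frac{1}{n} X X^{\mathsf{T}} \right) P^{\mathsf{T}}
    = P \, C_X \, P^{\mathsf{T}}
    = E^{\mathsf{T}} \left( E D E^{\mathsf{T}} \right) E
    = D .
$$

Since $D$ is diagonal, the covariance matrix of the transformed data is diagonal, which is exactly the goal we set out with: the rows of $P$, i.e. the eigenvectors of $C_X$, are the principal components.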
Test it
Well, that was quite convenient, wasn't it? What's even better is that we can demonstrate it in a few lines of code:
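Here is one way such a demonstration could look, as a sketch built on the setup above. The new feature names follow the tables below; since the random seed is not fixed, the exact numbers will differ, but the covariance of $Y$ comes out diagonal all the same.

```python
import numpy as np
import pandas as pd

# Recreate the random 4x4 dataset and its (non-centered, 1/n) covariance matrix.
rng = np.random.default_rng()
X = pd.DataFrame(
    rng.random((4, 4)),
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
    columns=["sample0", "sample1", "sample2", "sample3"],
)
n = X.shape[1]
C_X = X @ X.T / n

# Eigendecompose the symmetric covariance matrix: C_X = E D E^T.
eigenvalues, E = np.linalg.eigh(C_X)

# eigh returns eigenvalues in ascending order; flip them so the
# largest-variance component comes first, as in the tables below.
order = np.argsort(eigenvalues)[::-1]
E = E[:, order]

# Choose P = E^T: each row of P is a principal component of X.
P = E.T
Y = pd.DataFrame(
    P @ X.values,
    index=["new_feat_e", "new_feat_f", "new_feat_g", "new_feat_h"],
    columns=X.columns,
)

# The covariance matrix of the transformed data is (numerically) diagonal.
C_Y = Y @ Y.T / n
print(Y.round(6))
print(C_Y.round(6))
```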
The transformed dataset $Y = PX$:

| | sample0 | sample1 | sample2 | sample3 |
| --- | --- | --- | --- | --- |
| new_feat_e | 1.400186 | 1.576029 | 0.906121 | 0.830166 |
| new_feat_f | -0.162144 | -0.225848 | 0.572904 | 0.076917 |
| new_feat_g | -0.042285 | -0.086877 | -0.091316 | 0.335921 |
| new_feat_h | 0.087761 | -0.072164 | -0.002497 | -0.008295 |
And its covariance matrix $C_Y$, which is diagonal:

| | new_feat_e | new_feat_f | new_feat_g | new_feat_h |
| --- | --- | --- | --- | --- |
| new_feat_e | 1.488654 | 0.000000 | 0.000000 | 0.000000 |
| new_feat_f | 0.000000 | 0.102858 | 0.000000 | -0.000000 |
| new_feat_g | 0.000000 | 0.000000 | 0.032629 | -0.000000 |
| new_feat_h | 0.000000 | -0.000000 | -0.000000 | 0.003246 |
Holy moly, isn't this exactly what we were aiming for from the beginning, achieved with just a few lines of code? From a dataset with some redundant and less interesting features, we have extracted new features that are much more meaningful to look at, simply by diagonalizing its covariance matrix. Let's wrap this up with some side-by-side comparisons.
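One way to put the two covariance matrices side by side, as a sketch using matplotlib heatmaps; it assumes the `C_X` and `C_Y` DataFrames from the snippet above, and the original comparison may well have been drawn differently.

```python
import matplotlib.pyplot as plt

# Heatmaps of the covariance matrices before and after the transformation.
# Assumes C_X and C_Y (pandas DataFrames) from the earlier snippet.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, cov, title in zip(axes, [C_X, C_Y], ["Covariance of X", "Covariance of Y = PX"]):
    im = ax.imshow(cov, cmap="viridis")
    ax.set_xticks(range(len(cov.columns)))
    ax.set_xticklabels(cov.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(cov.index)))
    ax.set_yticklabels(cov.index)
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```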
Look at this. Isn't it just beautiful?