One of the biggest problems that data scientists and data analysts face early on in their careers is dealing with massive datasets. Not only do the datasets have a lot of rows, but they also have unending features. In such cases, doing analysis becomes significantly harder since it becomes difficult to visualize the vast number of features.
Luckily, techniques like Principal Component Analysis are just what you need in such situations. The goal of PCA is dimensionality reduction while distorting the original data as little as possible. And today, we’re going to implement PCA from scratch so you can have a sound idea of how it’s implemented and what’s actually going on behind the code.
While PCA offers many advantages, such as reducing the training time and removing noise, the main benefit most people seek is the ease of visualization and analysis that comes from the smaller number of transformed features. And that’s precisely what we’ll be looking at today.
If you want to dive into the code directly, here’s the link to the notebook.
How Does PCA Work?
You might be wondering how PCA makes visualization easier. Well, imagine you have to visualize a dataset that has two features. Just plot both the features along each axis in a graph – pretty simple, right? Now imagine you had ten features instead of two. How do you think you will visualize all these features? Not so easy now, is it?
Visualizing features is only easy with up to three features at a time. Anything more than that becomes quite challenging for the human mind to interpret and hence visualize. This is where PCA comes in. PCA reduces the number of dimensions (features) for you while retaining as much of the original information as possible, so it becomes easier to analyze and interpret the data.
PCA <> Feature Selection!
Also, it’s essential to always keep in mind that PCA is NOT a feature selection algorithm! I have seen many people confuse PCA with feature selection algorithms. Unlike those algorithms, PCA doesn’t return the top N original features; it builds new features (the principal components) as combinations of the original ones.
With that said, let’s move forward and start the tutorial.
The tutorial is structured as follows:
· Dataset and imports
o Data scaling
o Covariance Matrix
Let’s get started without any further ado.
Dataset and imports
I have chosen a real-life dataset for this demonstration so that the example stays relevant. I have also tried to keep it small, since the goal here is to understand PCA rather than to dive into complex datasets. We’ll be using a small Pizza dataset which you can find on my GitHub.
Let’s import the dataset and some common libraries that we’ll need.
Before we move to the PCA part, there’s one thing we need to do – separate the target variable from the rest of the dataset. In this case, we’ll store the brand variable into a separate dataframe and drop it from the main dataframe.
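As a rough sketch of that separation step, the snippet below uses a tiny made-up DataFrame rather than the real Pizza file. Only the `brand` column name comes from the text; the `feat_*` columns and all values are hypothetical stand-ins.

```python
import pandas as pd

# Tiny stand-in for the Pizza dataset. Only `brand` is named in the article;
# the feat_* columns and every value here are made up for illustration.
df = pd.DataFrame({
    "brand": ["A", "A", "B", "B", "C"],
    "feat_1": [27.8, 28.4, 30.1, 49.9, 50.3],
    "feat_2": [21.4, 21.3, 19.9, 13.2, 13.4],
    "feat_3": [44.9, 43.9, 40.2, 28.1, 27.6],
})

# Keep the target in its own DataFrame, then drop it from the features.
target = df[["brand"]].copy()
features = df.drop(columns=["brand"])

print(features.shape)  # (5, 3)
print(target.shape)    # (5, 1)
```

With the real data, the same two lines should work once `df` has been loaded from the CSV file.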
That’s all. Let’s move forward to the PCA part now.
PCA – Step by Step
There are three individual steps that we require here:
Data Scaling – We don’t want any feature to be perceived as more important just because of scale differences. This is a common step that puts all features on an equal footing no matter their original scale.
Covariance Matrix – Covariances between each pair of features.
Eigendecomposition – Breaking down matrices into eigenvalues and eigenvectors.
That’s all. Let’s start.
As mentioned above, data scaling is done to take out any difference in feature importance due to a mere difference of scales. Most of the machine learning tasks require this as a prerequisite.
Data scaling can be easily done using Scikit-learn.
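A minimal sketch of the scaling step, assuming Scikit-learn’s `StandardScaler` is the scaler in question (the text only says Scikit-learn). The array `X` is synthetic stand-in data, not the Pizza features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 100 rows, 3 features on very different scales.
X = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))

# Standardize each column to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# Quick check: column means should be ~0 and standard deviations ~1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```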
The output shows that our features were successfully scaled. Let’s move further.
Covariance shows how two variables vary with each other – for example, the height and weight of a person. When we create a covariance matrix, the variance for each feature is present along the diagonal of the matrix. In contrast, the covariances for each pair are present in the rest of the places.
Let’s create the covariance matrix using Numpy.
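Here’s roughly what that could look like with `np.cov`. The data is synthetic and the variable names are my own. Note that `np.cov` treats rows as variables by default, so `rowvar=False` is needed when features are in columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 rows, 4 features.
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # make two features correlated on purpose

# Standardize by hand (mean 0, population std 1), like a standard scaler does.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# rowvar=False: columns are features, rows are observations.
cov_matrix = np.cov(X_scaled, rowvar=False)

print(cov_matrix.shape)     # (4, 4)
print(np.diag(cov_matrix))  # variances: all equal after standardization
```

The diagonal values come out slightly above 1 (n / (n - 1), to be exact) because the scaling divides by n while `np.cov` divides by n - 1.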
As you can see, the diagonal elements are identical: because the features were standardized, every feature has the same variance. The other cells contain the covariances.
As mentioned above, we use eigendecomposition to break a square matrix down into eigenvalues and eigenvectors. The eigenvectors are unit vectors that give directions, and each eigenvalue gives the magnitude associated with its eigenvector, which here is the amount of variance along that direction.
We’ve seen above that covariance matrices are symmetric, and the eigenvectors of a symmetric matrix are orthogonal to each other. This means that the first principal axis is the direction that explains the most variance. The second axis is orthogonal to the first and explains most of the remaining variance, and so on. The procedure is repeated N times, where N stands for the number of original features.
The principal components are sorted by the percentage of variance they explain. The benefit of this method is that the first few components often explain most of the variance, so it might be sensible to retain only those components and move on.
That’s enough explanation for now and if you’re interested more, feel free to take a look here.
Let’s use Numpy again, this time to perform the eigendecomposition.
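One way to sketch this step is with `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix. That choice is mine; the text only says Numpy. Since `eigh` returns eigenvalues in ascending order, the sketch reorders them so the largest comes first. The data is again synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data, standardized as in the scaling step.
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back ascending.
values, vectors = np.linalg.eigh(cov_matrix)

# Reorder so the component with the largest variance comes first.
order = np.argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]  # each COLUMN is one eigenvector

print(values)
print(vectors[:, 0])  # first principal axis
```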
Just like that, you can view the vectors as well. Moving on, we can calculate the percentage of explained variance for every principal component using this. Let’s see.
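The calculation itself is just each eigenvalue divided by the sum of all eigenvalues. Here’s a small sketch with made-up eigenvalues (not the Pizza results), including a hypothetical 85% threshold to show how one might pick the number of components to keep.

```python
import numpy as np

# Made-up eigenvalues, already sorted from largest to smallest.
values = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

# Share of total variance explained by each component.
explained = values / values.sum()
cumulative = np.cumsum(explained)

print(explained.sum())  # 1.0
print(explained)
print(cumulative)

# Smallest number of components whose cumulative share reaches 85%
# (the threshold is an arbitrary choice for this illustration).
n_keep = int(np.searchsorted(cumulative, 0.85) + 1)
print(n_keep)  # 3
```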
On the top, you can see a value of 1 – it represents the sum of the individual percentages of variances per component and should always be equal to one. According to the output above, we can see that only the first three entries account for more than 93% of the total variance. Great!
Lucky for us, that’s more than enough. We can move forward to visualizations using only these three components, since they capture most of the information in the whole dataset.
We established earlier that it’s not a piece of cake for humans to comprehend anything more than three dimensions. And as it turns out, with the help of PCA, we have figured out a way to represent the original eight features of the Pizza dataset in just three dimensions, which will be much easier to visualize.
So, let’s go ahead and create a new Pandas Dataframe containing only the three components that we found enough for explaining almost all the data.
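A minimal sketch of that projection, assuming synthetic eight-feature data in place of the Pizza dataset. The column names `PC1` to `PC3` are my own labels. Each component is obtained by multiplying the scaled data by the corresponding eigenvector.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic stand-in: 150 rows, 8 features, standardized.
X = rng.normal(size=(150, 8))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix, largest eigenvalue first.
cov_matrix = np.cov(X_scaled, rowvar=False)
values, vectors = np.linalg.eigh(cov_matrix)
order = np.argsort(values)[::-1]
top3 = vectors[:, order[:3]]  # eigenvectors of the three largest eigenvalues

# Project the scaled data onto the three principal axes.
projected = X_scaled @ top3
pca_df = pd.DataFrame(projected, columns=["PC1", "PC2", "PC3"])

print(pca_df.shape)  # (150, 3)
print(pca_df.head())
```

With the real data, the target column stored earlier can be added to this DataFrame so each point keeps its class label for plotting.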
Let’s get to Python’s powerful visualization libraries now. First, we will visualize the dataset in 2D. For that, we’ll use the first two principal components and drop the third for now. Let’s get to it.
As you can see, the classes are pretty separable in the plot. While some classes like I and G are somewhat harder to separate, most of them are clearly separable and hence should be easy to predict with an ML algorithm. Finally, let’s try to visualize the data in 3D. Unfortunately, Seaborn doesn’t support 3D plots by default, so we’ll need a small workaround. Here’s how we can create a 3D plot.
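One possible workaround is to use Matplotlib’s 3D axes directly; this may differ from what the notebook does. The helper `plot_components_3d` and the random points and labels below are hypothetical stand-ins for the three principal components and the brand column.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_components_3d(points, labels):
    """Scatter three components in 3D, one color per class label."""
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(projection="3d")
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(points[mask, 0], points[mask, 1], points[mask, 2],
                   label=str(label))
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    ax.legend(title="brand")
    return fig


rng = np.random.default_rng(3)
points = rng.normal(size=(60, 3))                 # stand-in for PC1..PC3
labels = np.repeat(np.array(["A", "B", "C"]), 20)  # stand-in class labels

fig = plot_components_3d(points, labels)
# plt.show()  # uncomment when running in a notebook or with a display
```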
Sweet! The plot almost perfectly depicts the distribution of classes on a 3D scale. Now, just imagine what would have happened to your neurons if you had to visualize the original 8-dimensional data. Now you know why PCA is such an invaluable technique for most data scientists and other analysts.
Implementing PCA is relatively easy when all you’re doing is importing a bunch of libraries and calling their functions to get the job done. While I realize it’s the obvious practical solution and the go-to method for everyone, my understanding is that you should at least do it once manually to know what’s actually going on.
So today, we have seen how PCA works and how it manages to pack most of the information of a dataset into a small set of principal components. That’s all for today! I hope you enjoyed reading the article. Feel free to drop your comments below.