Developing machine learning models to predict outcomes is no longer a daunting task. All you need is a suitable dataset and one of the amazing libraries out there, such as scikit-learn, that let you train models with minimal coding. These libraries provide all the built-in functions you need; just import them and start predicting.
However, what distinguishes good ML engineers from average ones is how they prepare the data before fitting the model to it. Not only does this make a huge difference in terms of training, but it can also make or break the performance of the model.
So, to see the impact of data preprocessing, we will look at how it changes a model's performance. We will train a model, then train it again after preprocessing the data, so you can clearly see the effect of preprocessing. Let's start!
What is Preprocessing?
Data preprocessing refers to converting the raw data we collect into a form that's ready to be fed to the model. That's data preprocessing in a nutshell. It covers a range of procedures, such as data cleaning, data transformation, and data scaling.
Training a Model Without Preprocessing the Data
Before we can train a model, we need to choose an appropriate dataset. The dataset I have chosen for this tutorial is the US census data, which can be found in the UCI Machine Learning Repository. I've also uploaded it on my GitHub along with the source code for this tutorial here.
So, we will be using this dataset to predict whether an individual earns more than $50k/year. Let’s kick things off by importing the necessary libraries.
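A typical import block for this kind of workflow might look like the following; the exact set in the original notebook may differ slightly.

```python
# Libraries commonly used for this workflow; the exact set in the
# original notebook may differ slightly.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```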
After this, we will define the column names and assign them to the dataset manually, since no header row is present in the downloaded data. We will also drop the 'fnlwgt' column, since it's practically useless for this specific use case. Lastly, we will print the data head and the shape to get a first look at our dataset in Pandas.
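A sketch of this step is below. The column names come from the UCI "adult.names" file; two illustrative rows stand in here for the downloaded adult.data file, which you would normally pass to read_csv as a path.

```python
import io

import pandas as pd

# Column names from the UCI "adult.names" file; the raw CSV has no header row.
columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "income"]

# Two illustrative rows stand in for the downloaded adult.data file;
# in practice you would pass the file path to read_csv instead.
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
df = pd.read_csv(raw, header=None, names=columns, skipinitialspace=True)

# 'fnlwgt' is a sampling weight, not a predictive feature, so drop it.
df = df.drop(columns=["fnlwgt"])
print(df.head())
print(df.shape)
```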
As you can see, we have 14 columns and more than 32 thousand records in the dataset. So, that’s enough for adequately training our model. While there are various models that scikit-learn supports, I have chosen logistic regression for this example since it’s a straightforward one and we can easily demonstrate the use of preprocessing in it. Let’s create the instance of our model.
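Creating the model instance is a one-liner. The max_iter value below is my own choice, not necessarily the tutorial's: the default of 100 often hits the iteration limit on wide one-hot-encoded data.

```python
from sklearn.linear_model import LogisticRegression

# max_iter raised from the default 100, which often fails to converge
# on wide one-hot-encoded data; this parameter choice is mine.
model = LogisticRegression(max_iter=1000)
```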
We have the dataset, and we've created the model instance. Since we will not involve preprocessing in this section, there's not much work left. However, we still need some test data ready to evaluate the accuracy of our model once it's trained. Luckily, we don't need to split our dataset into training and testing sets, since the repository already provides separate testing data.
Now, let's split the data into a feature dataframe and a target variable dataframe, so our model knows which columns are features and which one it has to predict. Here's how it can be done. Make sure you do this for both the training and testing dataframes.
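A sketch of the split, using a tiny stand-in frame; in the census data the target is the income label.

```python
import pandas as pd

# Tiny stand-in for the loaded census dataframe; 'income' is the target.
df = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40, 13],
    "income": ["<=50K", ">50K"],
})

# Features (X) are every column except the target; y is the target alone.
X_train = df.drop(columns=["income"])
y_train = df["income"]
```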
Now, there's just one step left before fitting the model on our data – encoding the categorical variables. As you might already know, the model cannot interpret categorical variables directly, so we need a way to convert them into numerical features. I'll be using One-Hot encoding here, which is provided by scikit-learn. Remember to do this for both training and testing dataframes.
I’ll separate the categorical features, apply the encoding on them, and later combine them with the numerical features to get the original dataframe. Let’s see.
Now, let’s separate the numerical features and combine them back with the encoded features.
Now, let’s just repeat the same procedure for the testing data as well.
That's all done! Let's fit the model on the data and calculate the accuracy score. We can simply call scikit-learn's accuracy_score function here to do this.
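The fit-and-score step looks like this; the synthetic arrays below stand in for the encoded training and testing frames described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the encoded training and testing frames:
# the label simply depends on the sign of the first feature.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 5))
y_test = (X_test[:, 0] > 0).astype(int)

# Fit the model, predict on the held-out set, then score the predictions.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```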
As you can see, we're getting an accuracy score of 0.77, which means our model correctly predicts almost 77% of the instances in the testing dataset. This is a decent accuracy score, especially considering that we haven't used any kind of preprocessing yet, not even dealt with the missing values. But let's see if we can improve it any further.
Training the Model After Preprocessing the Data
We will start all over again and this time preprocess our data before fitting the model on it.
Just like we did in the previous section, we will import the dataset and put it in the form of a Pandas dataframe. After that, we will use the info function to dive into the details of the dataframe to see if we have any missing values.
The output above shows that there are no null values in the dataset, but frankly, that's too good to be true for a real-world dataset. Datasets sometimes mark missing values with a placeholder such as '?', which would explain why Pandas is not detecting them as nulls. Let's check for any '?' values and convert them to NaNs if they exist.
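A minimal sketch of the check-and-convert step, using a toy frame with the '?' placeholders the raw census file contains:

```python
import numpy as np
import pandas as pd

# Toy frame containing the '?' placeholders found in the raw file.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?"],
})

# Count the '?' markers per column, then turn them into real NaNs
# so that isnull() and dropna() can see them.
print((df == "?").sum())
df = df.replace("?", np.nan)
print(df.isnull().sum())
```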
We can see that the missing values make up a very small fraction of the data, so instead of devising a way to deal with them, we will simply remove them. This is feasible only because so few rows are affected; if there were more, we would impute them instead, for example by replacing them with column averages. Anyway, let's remove them and see how many entries we're left with.
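Dropping the affected rows and counting what's left is a one-liner each; a toy frame again stands in for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with one row carrying a missing value.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", np.nan, "Private"],
})

# The affected rows are a tiny fraction of the data, so drop them outright.
df = df.dropna()
print(len(df))  # entries remaining after removing missing values
```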
Moving on, we will check whether the data has any skewness, and if it does, we will apply a transformation to it, since skewness can reduce the accuracy of the model by violating its assumptions. Let's make a plot to view the skewness of the data.
Let’s view the skewness in numerical terms as well to confirm this.
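Both checks can be sketched as below; the toy columns imitate the census data, where 'capital-gain' is heavily right-skewed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd

# Toy columns shaped like the census data: 'capital-gain' is heavily
# right-skewed, while 'age' is roughly symmetric.
df = pd.DataFrame({
    "age": [25, 32, 38, 44, 51, 58],
    "capital-gain": [0, 0, 0, 0, 500, 15000],
})

# Histograms give the visual check...
df.hist(figsize=(8, 3))

# ...and .skew() puts a number on it: values far from 0 indicate skew.
print(df.skew())
```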
As can be seen, there's considerable skewness in 'capital-gain' and 'capital-loss'. So, we will use a logarithmic transformation to reduce the skewness as much as possible.
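One way to sketch this step is with np.log1p, i.e. log(x + 1), which handles the many zero values in these columns gracefully; the toy values below are illustrative.

```python
import numpy as np
import pandas as pd

# Toy skewed columns standing in for the real ones.
df = pd.DataFrame({
    "capital-gain": [0, 0, 500, 15000],
    "capital-loss": [0, 1902, 0, 0],
})

# log1p(x) = log(x + 1): zeros stay zero instead of becoming -inf,
# while large values are compressed, reducing the right skew.
skewed = ["capital-gain", "capital-loss"]
df[skewed] = df[skewed].apply(np.log1p)
print(df)
```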
Moving on, we will apply scaling to the numerical features. Scaling brings all feature values into a comparable range. It doesn't change the relationships between variables, but it equalizes the influence each feature has on the model, which matters especially for models that are sensitive to the magnitudes of the values.
We will be using the MinMax scaler here which can be imported from the scikit-learn library.
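A sketch of the scaling step; the important practical point is to fit the scaler on the training data and reuse the same fitted scaler on the test data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numerical columns standing in for the census features.
df = pd.DataFrame({
    "age": [25, 38, 51],
    "hours-per-week": [20, 40, 60],
})

# MinMaxScaler maps each column to the [0, 1] range. Fit it on the
# training data only, then reuse the same fitted scaler on the test set.
scaler = MinMaxScaler()
numerical = ["age", "hours-per-week"]
df[numerical] = scaler.fit_transform(df[numerical])
print(df)
```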
That's it. We have applied a logarithmic transformation to deal with the skewness, as well as scaling on the feature values. This is enough preprocessing for now, and our model should perform noticeably better than it did last time. Let's move on and fit the model on our dataframe.
Note: We still need to encode the categorical variables on the new feature values just like we did when we did not use data preprocessing. However, the process is exactly the same, and this time we just have to use the preprocessed data. So, I’m not including the code here.
Here’s the final step; we’ll fit the model on the data again so we can check the accuracy score and see how much it has improved after doing the data preprocessing. Let’s see.
As you can see, the performance of our model has improved significantly, solely due to adding the data preprocessing steps before fitting the model. The accuracy score might not seem like a big jump given all the effort involved, but trust me, there is a lot more preprocessing that could still be done. This tutorial only aimed to demonstrate how effective preprocessing is.
Data preprocessing is a vital step in developing any machine learning model. If you want your model to be as optimized as possible, never skip preprocessing, and always explore your data to find out what kind of preprocessing it requires.
Most aspiring data scientists pay little attention to preprocessing and spend their time fine-tuning models, when in reality preprocessing matters just as much, as we have seen throughout this article.
So, always make sure you put enough time into preprocessing your data. Happy coding!
The notebook used in this article can be found on my GitHub here.