
Housing Price Predictor Tutorial

Motivation

A housing price predictor plays a key role not only when someone is buying a house, but also when they are considering selling one. Models like these help people know in advance what value to expect for their property. Moreover, developing a price predictor lets a machine learning beginner experience all the major steps involved in building a custom solution and walk through the ML pipeline in detail. You can play around with the model however you like and get your hands dirty with plenty of coding aimed at solving a real-life problem. You can also look at and play with the code by clicking this link.

Introduction

Predicting housing prices has always been a popular exercise in data science. People were predicting housing prices long before the advent of data science, but it has made the task far more convenient and easier to understand. And while there are already plenty of models on the internet for predicting housing prices, I realized there's still room for more accuracy.

Many of the models being used to predict prices have grown old, and there are more accurate and advanced ways to do it now. Hence, I have put together this tutorial to cover everything a beginner needs to go through while developing a highly accurate housing price prediction model. I'll go through every step in detail while briefly explaining each chunk of code. By the end of the article, you'll have an in-depth understanding of each step and a firm grasp of the major concepts involved. So, let's get started without further ado.

The Dataset

Exploring the dataset should always be your first step. While you can explore the data using Pandas, I generally prefer to also have a good look at it manually to spot any readily visible trends.

So, let’s import some libraries needed for the model and read the data from our CSV file into a Pandas dataframe using read_csv().
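
In code, that looks like the following. The file name house_data.csv is just a stand-in; point read_csv() at wherever your data actually lives. I'm also importing everything we'll need later in the tutorial up front.

# Import everything we'll need throughout the tutorial.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Read the housing data from the CSV file into a dataframe.
# "house_data.csv" is a placeholder name -- use your own file path.
df = pd.read_csv("house_data.csv")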


Now, let's have a glimpse of what the dataset looks like using the df.head() function shown below.
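
It's a one-liner:

# Display the first five rows of the dataframe.
print(df.head())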


The output shows the first five rows of the dataframe, with one column for every field in the dataset.

Well, there sure are many columns. Can you count how many there are? You can, but you don't need to: Pandas can tell us how many records we have and how many fields each record has. To do this, we can check the df.shape attribute (note that shape is an attribute, not a method), which gives us the dimensions of our dataset.
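
Checking it looks like this:

# shape is an attribute, so no parentheses are needed.
print(df.shape)
# Prints (rows, columns); for this dataset, roughly 42,000 rows and 20 columns.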


As the output shows, we have over 42 thousand records of housing prices, each consisting of 20 features.

Feature Scaling

Now that we have explored our dataset, the next step is to apply feature scaling to the available features. Feature scaling is a popular method used to normalize the values in a dataset so that every feature lives on a comparable range; without it, features measured on large numeric scales can dominate those measured on small ones and bias the model.

Scaling can be done easily using scikit-learn. We already imported MinMaxScaler in the first step, so all we have to do now is create an instance of the scaler and fit our relevant data on that instance. Let's see how this can be done.

First, we'll get a list of the available columns, store it in a columnnames variable, and copy our original dataframe df into a separate one called scaled_features so we don't change the original.
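
That's just two lines:

# Remember the original column names and work on a copy,
# leaving the original dataframe df untouched.
columnnames = list(df.columns)
scaled_features = df.copy()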


Now, we will scale all the available features in the data as shown below.
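
Here's a sketch of the scaling step. Since MinMaxScaler only works on numbers, I'm assuming we scale just the numeric columns and leave the categorical ones (garage_type and city) for the encoding step later:

# Scale every numeric column to the 0-1 range.
# (If you'd rather keep the target in its original units,
# exclude 'price' from this list.)
numeric_columns = scaled_features.select_dtypes(include="number").columns
scaler = MinMaxScaler()
scaled_features[numeric_columns] = scaler.fit_transform(scaled_features[numeric_columns])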


Just like that, we have a new dataframe called scaled_features containing all the features we had originally, but in scaled form.

Preparing the Data

No dataset is perfect, and the one we have here is no different. After some brief research, I realized some features were either censored or had a lot of missing values, so they did not contribute much to our model. The best course of action here is to get rid of all the features that aren't of any use to us by simply deleting those columns from our dataframe with Python's del statement.
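
The column names below are only examples on my part; substitute whichever fields turned out to be censored or mostly empty in your copy of the data:

# Illustrative column names -- replace with the fields you found
# to be censored or full of missing values.
del scaled_features['house_number']
del scaled_features['street_name']
del scaled_features['unit_number']
del scaled_features['zip_code']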


Now, if you paid close attention to the dataframe, you'll have noticed that some columns consist of categorical data instead of the usual numerical data, e.g. garage_type and city. We cannot use categorical data directly while training our model, so we'll encode it into an equivalent numerical form.

However, there's no ordinal relationship between the categories in these fields; if there were, we could have used plain integer encoding here. If we used integer encoding anyway, the algorithm would assume an ordering between the values that doesn't exist, we'd get unexpected results, and our model would eventually be ruined.

The best solution here is to use one-hot encoding. Using it, we can easily convert the data into a numerical form that can later be used by our model. Here's how that can be done using Pandas' get_dummies function.
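
# One-hot encode the categorical columns into 0/1 indicator columns.
scaled_features = pd.get_dummies(scaled_features, columns=['garage_type', 'city'])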


Finally, we're done with all the data preparation steps. All that's left is to build the arrays that will hold the data for the final training of the model. However, before we use all the available data for training, we need to keep some aside for testing, since we don't have a separate dataset for the testing phase. That held-out portion will come in handy later when we evaluate the model's performance. For this, we will do a train-test split.

We will use two arrays: one for the features and one for the target value, i.e. the price. Moreover, we will reserve 30% of the data for testing, while the rest will be used for training.
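
Here's what that step looks like. The target column name price and the random_state value are my assumptions; match them to your data:

# Separate the features from the target value (the price column).
X = scaled_features.drop('price', axis=1)
y = scaled_features['price']

# Hold out 30% of the records for testing. random_state simply makes
# the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)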


Choosing the Model

There are a lot of regression models you could choose here. Linear regression is the most widespread, and some people use Lasso regression as well. However, we won't be using either of those. Instead, we will implement the model using the Gradient Boosting algorithm. Even though it's computationally expensive, the results are far better, since the algorithm fits models sequentially, with each new model correcting the mistakes of the previous ones.

Feel free to read more about boosting models here.
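
Creating the model is a single call. The hyperparameter values in this sketch are illustrative starting points rather than tuned results, so feel free to experiment with them:

# These hyperparameters are illustrative starting points, not tuned values.
model = GradientBoostingRegressor(
    n_estimators=1000,   # number of boosting stages to fit
    learning_rate=0.1,   # how much each stage contributes
    max_depth=6,         # depth of each individual tree
    min_samples_leaf=9,  # minimum samples required at a leaf
    max_features=0.1,    # fraction of features considered per split
    loss='huber',        # robust loss, less sensitive to price outliers
)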


Fitting and Testing the Model

Now we can finally start training our model using the final dataframe we have after all the pre-processing. Once we've fit the model on our dataset, we can easily see how it performs on the new, unseen testing data we set aside in our train-test split. To measure this, we will simply calculate the mean absolute error.

Another thing to keep in mind is that since we're using a boosting model, there's a real possibility of overfitting the training data. To check for this, we compare the error on the training data with the error on the testing data: if the training error is unusually low but the testing error isn't, the model is overfitting. So, let's fit the model first and see how it performs.
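
Fitting takes a single call, and printing the model afterwards shows every parameter it's using:

# Train the model on the training set.
model.fit(X_train, y_train)

# Printing the fitted model lists all of its parameters, including defaults.
print(model)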


As you can see, our Gradient Boosting Regressor is fully trained and ready to use. Don't be intimidated by all the parameters printed with the model above; you don't need to worry about them for now, as scikit-learn manages sensible defaults automatically. Let's move on and check the error rate of our model on the testing data.
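
We predict on the held-out testing set and compare the predictions against the true prices:

# Mean absolute error on the unseen testing data.
test_error = mean_absolute_error(y_test, model.predict(X_test))
print("Test set mean absolute error: %.4f" % test_error)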


So, the mean absolute error we get on the testing data is very good, and our model is performing great.

Now, let’s do the same operation on the training data as well.
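
The code is the mirror image of the previous step:

# Mean absolute error on the training data, for comparison.
train_error = mean_absolute_error(y_train, model.predict(X_train))
print("Training set mean absolute error: %.4f" % train_error)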


As we can see, the error on the training set is pretty similar to what we got on the testing data, so we can be confident there's no underfitting or overfitting in our model.

That's it! We have developed a fully functional housing price prediction model based on the Gradient Boosting Regressor, and the error rate we're getting is great. That's how easy scikit-learn makes model development for you.

Wrap Up

Throughout this article, we built a complete machine learning model for housing price prediction from scratch using scikit-learn. Each part of the process was thoroughly explained, from data pre-processing to measuring the model's accuracy. Along the way, we also got a complete picture of what a model development pipeline looks like.

Finally, I hope you've thoroughly understood all the concepts involved. If you're interested in making such models, give this one a try to clear up any remaining confusion. Do leave your feedback and a thumbs up if you liked the article!

If you want to get the complete source files for this project, don’t hesitate to grab them from my GitHub here.

August 7, 2021