Linear regression is among the simplest supervised learning techniques you will come across in machine learning. Multiple linear regression involves several input variables, whereas simple linear regression has only two coefficients – the slope and the intercept, as we’ll see in a bit.
Today, we will dive into it and see how to implement simple linear regression from scratch. This article marks the start of a series of from-scratch articles. Here’s how the blog will be organized:
- Introduction to Simple Linear Regression
- Mathematical Concepts Beneath Simple Linear Regression
- Implementation from Scratch
- Scikit-Learn Implementation
- Wrap-Up
Note: The notebook and dataset for this tutorial can be found on my GitHub here.
Introduction to Simple Linear Regression
The underlying concept of simple linear regression is as simple as it gets. It’s mostly used for teaching the basics of machine learning and has few practical uses, because only one independent variable is available to predict the dependent variable. To make a prediction, all you have to do is plug values into a formula.
Regression is generally used when the target is a continuous variable, such as weight or volume.
There are two types of linear regression:
- Simple Linear Regression – a single input variable is used to predict a single output. For example, using age to predict salary.
- Multiple Linear Regression – multiple inputs are used to predict a single output. For example, using height, weight, and age to predict BMI.
Today, we will focus only on simple linear regression; multiple linear regression is a story for another day, which, by the way, can be found here. Simple linear regression is rarely used in real-life scenarios, but it’s a good starting point for learning regression techniques and serves as a baseline against which to compare more complex algorithms.
Here are some of the requirements/assumptions of the algorithm:
- Linear Assumption – the relationship between the input and output variables is linear
- No Noise – the variables are assumed to be free of noise, so the model is highly sensitive to outliers
- Normal Distribution – predictions tend to be more accurate if the variables are normally distributed
- Rescaled Inputs – scaling the inputs can make the predictions more accurate
When you train a simple linear regression model, you try to figure out the best possible coefficients for the equation of the line of best fit that passes through the data points.
Mathematical Concepts Beneath Simple Linear Regression
The mathematical equation we’re dealing with here is quite simple: it is merely the equation of a line. Here’s the equation you need to figure out:
$$y = \beta_0 + \beta_1 x$$
Here, x is the input variable and y is the predicted output, so the only two unknowns left are the beta coefficients, $\beta_0$ (Beta 0) and $\beta_1$ (Beta 1).
Let’s start with Beta 1. It represents the slope of the line, and we can calculate it as follows:
$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Here, $x_i$ is the value of the input feature for the current observation, while $\bar{x}$ (X-bar) is the mean of the input variable over all observations. The same notation applies to the target variable, y.
Now comes the turn of Beta 0, the intercept. The formula you can use to calculate it is:
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
That’s all! That’s all the math there is behind simple linear regression. Amazing, right? That’s how easy it is to calculate the values of the coefficients and once you’re done with them, simply plug them into the equation and you’ll get the answer.
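To make the arithmetic concrete, here is a small worked example. The slope is the sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations; the intercept is the mean of y minus the slope times the mean of x. The five data points are toy values chosen for easy mental arithmetic and are not taken from the tutorial's dataset.

```python
import numpy as np

# Illustrative toy data (an assumption, not the tutorial's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()  # 3.0 and 4.0

# Slope: sum of products of deviations over sum of squared x deviations
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: mean of y minus slope times mean of x
beta_0 = y_mean - beta_1 * x_mean

print(beta_1, beta_0)  # approximately 0.6 and 2.2
```

By hand, the numerator is 6 and the denominator is 10, which gives a slope of 0.6 and an intercept of 4 - 0.6 * 3 = 2.2.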
Now, it’s time to move to the hands-on implementation using Python.
Implementation from Scratch
First off, we’ll import the libraries we need – NumPy and Matplotlib. I’ll also tweak a few rcParams to make the visualizations look better.
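The imports might look something like the sketch below. The specific rcParams values (figure size and hidden top/right spines) are assumptions chosen for illustration; the original notebook may use different settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Assumed styling choices; adjust to taste
rcParams["figure.figsize"] = (14, 7)      # wider default figures
rcParams["axes.spines.top"] = False       # hide the top border of each plot
rcParams["axes.spines.right"] = False     # hide the right border of each plot
```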
Time to code our algorithm in the form of a class. If you understood the math we discussed earlier, this shouldn’t be hard to follow. We’ll use the following methods to build the simple linear regression class:
__init__() – the constructor for our class, which holds Beta 0 and Beta 1. Both are initially set to None.
fit(X, y) – calculates the values of Beta 0 and Beta 1 from the inputs X and y.
predict(X) – applies the line equation to predict the output for a given input.
Let’s see what the code of the algorithm looks like.
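A minimal sketch of such a class is shown below, following the three methods just described. The class name `SimpleLinearRegression` and the attribute names `b0` and `b1` are assumptions made for this example and may differ from the original notebook.

```python
import numpy as np


class SimpleLinearRegression:
    """Simple linear regression fitted with the closed-form formulas."""

    def __init__(self):
        # Beta 0 (intercept) and Beta 1 (slope) are unknown until fit() runs
        self.b0 = None
        self.b1 = None

    def fit(self, X, y):
        # Flatten the inputs so both are 1-D float arrays
        X = np.asarray(X, dtype=float).ravel()
        y = np.asarray(y, dtype=float).ravel()
        x_mean, y_mean = X.mean(), y.mean()
        # Slope: co-deviation of X and y over squared deviation of X
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        self.b1 = numerator / denominator
        # Intercept: the fitted line passes through the point of means
        self.b0 = y_mean - self.b1 * x_mean
        return self

    def predict(self, X):
        if self.b0 is None or self.b1 is None:
            raise ValueError("Call fit() before predict().")
        # Apply the line equation element-wise
        return self.b0 + self.b1 * np.asarray(X, dtype=float)
```

Returning `self` from `fit` is a design choice that allows chaining, for example `SimpleLinearRegression().fit(X_train, y_train).predict(X_test)`.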
Now that our algorithm is ready to roll, we need a dataset to try it on. The dataset I’ve chosen for this tutorial comes from a company and contains the amounts spent on marketing certain items and the subsequent sales of those items.
Note: The dataset can be downloaded here
Let’s import the dataset and get a bird’s-eye view of it using the head() function of Pandas.
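Loading and previewing the data could look like the sketch below. Because the real file is not reproduced here, a tiny inline sample stands in for it: the column names `Radio` and `Newspaper` and every number in the sample are made up for illustration (only `TV` and `Sales` are named in this article).

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file. All values are invented
# placeholders, and the Radio/Newspaper column names are assumptions.
sample_csv = io.StringIO(
    "TV,Radio,Newspaper,Sales\n"
    "200.0,30.0,60.0,20.0\n"
    "50.0,40.0,45.0,10.0\n"
    "20.0,45.0,70.0,9.0\n"
    "150.0,40.0,60.0,18.0\n"
    "180.0,10.0,55.0,13.0\n"
    "10.0,50.0,75.0,7.0\n"
)

# With the real dataset, pass the path of the downloaded file instead
df = pd.read_csv(sample_csv)

# head() shows the first five rows by default
print(df.head())
```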
There are three input features that can be used to predict Sales, which will be our target variable. However, since it’s simple linear regression, we’ll have to choose only one of these three features as our input variable. So, how do we decide which feature is the most correlated to Sales? Well, we make a pairplot. Let’s see how to do that.
The graph makes it evident that the TV column is the most correlated to the Sales column. So, let’s pick it as our input feature and split our dataset into training and testing datasets.
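One way to do the split is sketched below with plain NumPy. The helper name `train_test_split_simple`, the 80/20 ratio, and the seed are assumptions for this example; scikit-learn's `train_test_split` is a common alternative. The placeholder arrays stand in for the TV and Sales columns.

```python
import numpy as np


def train_test_split_simple(X, y, test_size=0.2, seed=42):
    """Shuffle the rows and split them into train and test parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))        # shuffled row positions
    n_test = int(round(len(X) * test_size))  # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Placeholder arrays standing in for the TV (input) and Sales (target) columns
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

X_train, X_test, y_train, y_test = train_test_split_simple(X, y)
print(len(X_train), len(X_test))  # 8 2
```

Fixing the seed keeps the split reproducible, so the coefficients and error values stay the same from run to run.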
Our data is finally ready, so we can test our from-scratch implementation. Let’s make an instance of the class we defined and fit the model on the training data. Lastly, we’ll also print the values of Beta 0 and Beta 1.
Here are the output values that we get for Beta coefficients:
Now, let’s print the predictions as well and try to compare them to the actual labeled test data we have.
Let’s see what the preds variable holds first.
In contrast, here are the real values from the test dataset.
As you can see, the values are quite comparable. But we still don’t have a good estimate of how well our model is performing. For that, let’s calculate the Root Mean Squared Error (RMSE), a common evaluation metric for regression.
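RMSE is the square root of the mean of the squared differences between the actual and predicted values. A minimal sketch is shown below; the function name `rmse` is an assumption for this example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones, and the result is expressed in the same units as the target variable.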
This means that our model’s predictions are typically off by around 2-3 units, which is mostly the result of variance in the dataset. Still, it’s a pretty good RMSE value and doesn’t mean our model is performing badly. Let’s visualize the best fit line by plotting all the predictions.
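The plot could be produced along the lines of the sketch below. The helper name `plot_best_fit`, the colors, and the axis labels are assumptions for this example; it takes the data and the two fitted coefficients as arguments.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_best_fit(X, y, b0, b1):
    """Scatter the data points and draw the fitted line over them."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    fig, ax = plt.subplots()
    ax.scatter(X, y, label="Actual values")
    # A straight line only needs its two end points
    x_line = np.array([X.min(), X.max()])
    ax.plot(x_line, b0 + b1 * x_line, color="red", label="Best fit line")
    ax.set_xlabel("TV")
    ax.set_ylabel("Sales")
    ax.legend()
    return ax
```

Call it with the test data and the fitted coefficients, then call `plt.show()` to display the figure.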
As you can see, this is the line that’s used to predict the output for any new input we feed to the model. Since the parameters are already calculated, the model simply plugs in the x value to calculate the y value.
Let’s move on to scikit-learn’s implementation now to compare our own model with it and see what differences exist.
Scikit-Learn Implementation
To find out how good our model really is, we’ll train a similar model on the same dataset, but this time using sklearn. The results will tell us how well we did. Let’s get to it.
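A minimal sketch of the scikit-learn version is shown below. The toy arrays are placeholders standing in for the TV and Sales training data, and the variable name `sk_model` is an assumption for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data standing in for the TV (input) and Sales (target) columns
X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# scikit-learn expects a 2-D feature matrix, hence the reshape to one column
sk_model = LinearRegression()
sk_model.fit(X_train.reshape(-1, 1), y_train)

# intercept_ corresponds to Beta 0 and coef_[0] to Beta 1
print(sk_model.intercept_, sk_model.coef_[0])
```

Note the reshape: unlike a from-scratch version working on 1-D arrays, `LinearRegression` requires inputs of shape (n_samples, n_features), even with a single feature.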
The output coefficient values are:
Well, if you compare these values to the ones we got from our from-scratch model, you’ll see there’s only a tiny difference between the two – meaning our implementation closely matches scikit-learn’s.
Let’s also calculate the RMSE value using scikit-learn and compare it.
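One way to get the RMSE with scikit-learn is to take the square root of `mean_squared_error`, as sketched below. The arrays are illustrative placeholders rather than the tutorial's actual test values and predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder values standing in for the real test labels and predictions
y_test = np.array([3.0, 5.0, 7.0])
sk_preds = np.array([2.0, 5.0, 9.0])

# mean_squared_error returns the MSE; its square root is the RMSE
sk_rmse = np.sqrt(mean_squared_error(y_test, sk_preds))
print(sk_rmse)
```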
Once again, nearly identical! That’s quite amazing and tells us how accurate our implementation was.
Wrap-Up
Throughout the article, we saw how easy and straightforward it is to implement a simple linear regression model from scratch in Python. If you understand the basic principles behind the model, you can easily implement it by yourself, without needing any external library to do it for you.
However, it’s not always a smart idea to build things from scratch, because even if you can, it takes up quite some time. As you saw, it took only a couple of lines to do the same thing using scikit-learn, and we got almost identical results in the end.
Nevertheless, it’s always good to know the basics of how an algorithm works, since it can help you optimize for certain scenarios and think more critically than the average data scientist.
The next articles in this series of from-scratch implementations using Python can be found here.