In our last tutorial, we looked at decision trees and saw how well they perform. Today, we’re going one step further. As the name suggests, a random forest is nothing but a combination of a large number of decision trees, each built with an element of randomness.
In this article, we will learn how to implement random forests from scratch in Python. While we could implement them using existing libraries, and in practice that is usually the better approach, you should, at least once, implement things from scratch to understand the process behind them.
So, let’s start without any further ado.
What is a Random Forest?
Random forests are non-parametric models that can be used for both classification and regression tasks. If you were able to understand decision trees, I’m pretty sure you won’t have any problem understanding random forests. Hence, I highly suggest you go through my previous article if you’re unaware of how a decision tree functions since I won’t be repeating that in this article.
Random forests follow the analogy that a large number of trees together make up a forest. The term ‘random’ refers to the randomness used to build each tree: every tree in the forest is trained on a random subset of the data.
Here’s a picture that clearly depicts the relationship between decision trees and random forests.
Random forests use the concept of ‘bagging’ where a combination of models is used to enhance the model performance.
To put it in a nutshell:
· N subsets are drawn from the original dataset (with replacement)
· N decision trees are built, one on each subset
· Each tree makes its prediction, and the majority vote is taken as the final prediction of the forest
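The subset-creation step above is just bootstrap sampling: drawing as many rows as the dataset has, with replacement. A minimal sketch with NumPy (the function name is my own):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows from (X, y) with replacement."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return X[idx], y[idx]

rng = np.random.default_rng(42)
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 1])
Xs, ys = bootstrap_sample(X, y, rng)
print(Xs.shape)  # same shape as X, but rows may repeat
```

Because sampling is done with replacement, some rows appear multiple times in a subset while others are left out, which is exactly what gives each tree a different view of the data.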
Let’s summarize this information in the form of a diagram:
Let’s move on to implementing the algorithm from scratch.
Implementing Random Forest from Scratch
We’ll just be coding today, since the math behind this was essentially covered in the previous article on decision trees. Here’s the link again if you haven’t gone through it yet.
We’ll need three basic classes to implement the random forest:
· Node
· DecisionTree
· RandomForest
I’m pretty sure you’ll be familiar with the first two classes from the previous tutorial. And the good news is, they’re exactly the same. So you won’t need to spend any time on them if you’ve followed the previous article.
However, the last one is a new addition – it will be used to implement the ensemble algorithm by collecting random decision trees.
Let’s get going with the first class. The Node class will store all the information about a particular node in the tree: the data going left and right, the splitting feature and threshold, the information gain, and, for leaf nodes, the predicted value.
Here’s how we can code it:
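A minimal sketch of such a class might look like this; the exact attribute names are my own choice, not prescribed anywhere:

```python
class Node:
    """A single node in a decision tree.

    Decision nodes store the split (feature index, threshold, children,
    information gain); leaf nodes store only the predicted value.
    """
    def __init__(self, feature=None, threshold=None,
                 data_left=None, data_right=None, gain=None, value=None):
        self.feature = feature        # index of the feature used to split
        self.threshold = threshold    # splitting threshold for that feature
        self.data_left = data_left    # left child Node
        self.data_right = data_right  # right child Node
        self.gain = gain              # information gain of the split
        self.value = value            # predicted class (leaf nodes only)

# a leaf node only needs a value
leaf = Node(value=1)
print(leaf.value, leaf.feature)
```

Note that `value` is only set for leaf nodes; checking whether `value` is `None` is how we will later tell decision nodes and leaves apart.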
That was pretty simple. Let’s move on to the DecisionTree class next to implement the classifier we’ll be using. Here’s a brief overview of the methods we will use in the class before coding.
__init__() – the constructor. This initializes the values of our hyperparameters min_samples_split and max_depth. The former defines how many samples we require to split a node further, while the latter limits the depth of the decision tree we build. These hyperparameters act as stopping conditions for the recursive tree-building function.
_entropy(s) – computes the impurity of a label vector.
_information_gain(parent, left_child, right_child) – finds the information gain when a node is split into child nodes.
_best_split(X, y) – the most important function; the one that decides the best splitting parameters. It uses the input features X and the target variable y to find the optimal values.
_build(X, y, depth) – the main method. It creates the tree with recursive calls of splitting nodes until the breaking conditions are met, as described in the hyperparameters.
fit(X, y) – calls the _build() method and stores the resulting tree in the instance.
_predict(x) – predicts a single example by traversing the tree from the root down to a leaf.
predict(X) – when provided with a matrix of input features X, this function applies the _predict() method to each entry in this matrix.
Take your time to understand each method. I’ve provided a description of what each method does so you’re not lost in the dark. With that said, let’s move on to the coding part.
Don’t get too intimidated by the code; once you have a sound grasp on the methods listed above, coding them in Python will be a piece of cake for you.
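Here is one way the class could be put together, a sketch rather than the one-and-only implementation (the compact Node class is repeated at the top so the snippet runs on its own):

```python
import numpy as np
from collections import Counter

class Node:
    # same Node class as before, repeated so this sketch is self-contained
    def __init__(self, feature=None, threshold=None,
                 data_left=None, data_right=None, gain=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.data_left, self.data_right = data_left, data_right
        self.gain, self.value = gain, value

class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=5):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    @staticmethod
    def _entropy(s):
        # impurity of a label vector: -sum(p * log2(p))
        counts = np.bincount(np.asarray(s, dtype=int))
        probs = counts[counts > 0] / len(s)
        return -np.sum(probs * np.log2(probs))

    def _information_gain(self, parent, left_child, right_child):
        # entropy reduction achieved by splitting `parent` into the children
        w_left = len(left_child) / len(parent)
        w_right = len(right_child) / len(parent)
        return self._entropy(parent) - (
            w_left * self._entropy(left_child)
            + w_right * self._entropy(right_child))

    def _best_split(self, X, y):
        # exhaustively try every (feature, threshold) pair, keep the best gain
        best = {'gain': -1}
        for f in range(X.shape[1]):
            for threshold in np.unique(X[:, f]):
                left = y[X[:, f] <= threshold]
                right = y[X[:, f] > threshold]
                if len(left) and len(right):
                    gain = self._information_gain(y, left, right)
                    if gain > best['gain']:
                        best = {'feature': f, 'threshold': threshold,
                                'gain': gain}
        return best

    def _build(self, X, y, depth=0):
        # keep splitting until a stopping condition or a useless split
        if X.shape[0] >= self.min_samples_split and depth < self.max_depth:
            best = self._best_split(X, y)
            if best['gain'] > 0:
                mask = X[:, best['feature']] <= best['threshold']
                return Node(feature=best['feature'],
                            threshold=best['threshold'],
                            data_left=self._build(X[mask], y[mask], depth + 1),
                            data_right=self._build(X[~mask], y[~mask], depth + 1),
                            gain=best['gain'])
        # leaf: predict the most common class
        return Node(value=Counter(y).most_common(1)[0][0])

    def fit(self, X, y):
        self.root = self._build(np.asarray(X), np.asarray(y))

    def _predict(self, x, node):
        # walk from the root to a leaf, following the split thresholds
        if node.value is not None:
            return node.value
        if x[node.feature] <= node.threshold:
            return self._predict(x, node.data_left)
        return self._predict(x, node.data_right)

    def predict(self, X):
        return [self._predict(x, self.root) for x in np.asarray(X)]

# quick sanity check on a linearly separable toy problem
X_demo = np.array([[0], [1], [2], [3]])
y_demo = np.array([0, 0, 1, 1])
tree = DecisionTree(min_samples_split=2, max_depth=3)
tree.fit(X_demo, y_demo)
print(tree.predict(X_demo))  # should recover the training labels
```

The exhaustive search in `_best_split` is O(features × thresholds) per node, which is fine for a tutorial but is one reason library implementations are much faster.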
Finally, it’s time to implement our RandomForest class now. This class will be based upon the DecisionTree class, just like the DecisionTree class was based on the Node Class. Here are the methods we’ll be using, along with their description.
__init__() – constructor for the hyperparameters. Holds the values for the number of trees in the forest, the maximum depth it can go, and the minimum sample split. Last but not least, it will also hold the individual trees when the model is trained.
_sample(X, y) – method to draw a bootstrap sample of the data.
fit(X, y) – to train the classifier model.
predict(X) – comes up with the final output using the majority rule.
Here’s how we can code it in Python.
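The sketch below shows the ensemble logic. To keep it short and self-contained, I bag one-level decision stumps here as a stand-in base learner; in the real implementation you would plug in the full DecisionTree class from the previous section instead:

```python
import numpy as np
from collections import Counter

class DecisionStump:
    """Stand-in base learner: a one-level tree. This is my simplification;
    the full DecisionTree class would be used in the article's version."""
    def fit(self, X, y):
        # default: predict the majority class (covers degenerate samples)
        self.feature, self.threshold = 0, X[0, 0]
        self.left = self.right = Counter(y).most_common(1)[0][0]
        best_err = None
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                mask = X[:, f] <= t
                if mask.all() or not mask.any():
                    continue
                left = Counter(y[mask]).most_common(1)[0][0]
                right = Counter(y[~mask]).most_common(1)[0][0]
                err = np.sum(np.where(mask, left, right) != y)
                if best_err is None or err < best_err:
                    best_err = err
                    self.feature, self.threshold = f, t
                    self.left, self.right = left, right

    def predict(self, X):
        return np.where(X[:, self.feature] <= self.threshold,
                        self.left, self.right)

class RandomForest:
    def __init__(self, num_trees=25, min_samples_split=2, max_depth=5):
        # min_samples_split / max_depth would be forwarded to the real
        # DecisionTree base learner; the stump ignores them
        self.num_trees = num_trees
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.decision_trees = []

    def _sample(self, X, y, rng):
        # bootstrap: draw len(X) rows with replacement
        idx = rng.integers(0, X.shape[0], size=X.shape[0])
        return X[idx], y[idx]

    def fit(self, X, y):
        rng = np.random.default_rng(0)
        self.decision_trees = []
        for _ in range(self.num_trees):
            Xs, ys = self._sample(np.asarray(X), np.asarray(y), rng)
            tree = DecisionStump()
            tree.fit(Xs, ys)
            self.decision_trees.append(tree)

    def predict(self, X):
        # each tree votes; the majority class wins
        all_preds = np.array([t.predict(np.asarray(X))
                              for t in self.decision_trees])
        return [Counter(col).most_common(1)[0][0] for col in all_preds.T]

# sanity check: 25 bagged stumps on a separable toy problem
X_demo = np.array([[0], [1], [2], [3]])
y_demo = np.array([0, 0, 1, 1])
model = RandomForest(num_trees=25)
model.fit(X_demo, y_demo)
print(model.predict(X_demo))  # should recover the training labels
```

Usage mirrors scikit-learn's fit/predict convention: construct the forest, call `fit(X_train, y_train)`, then `predict(X_test)`.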
That’s it. We have successfully coded our very own random forest from scratch. Now it’s time to train a model and evaluate it to see how we did.
To keep things practical and as close to real-life as possible, we will be using a Fraud Detection on Bank Payments dataset from Kaggle. Let’s load the dataset and define our training features and target class. Also, since the dataset is too big and will take up a lot of resources, we will limit it to 1000 rows for the sake of this tutorial.
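The loading step could look like the sketch below. The file name and the 'fraud' target column are assumptions that must match your Kaggle download; since I can't bundle the CSV here, the snippet falls back to a tiny synthetic stand-in frame so it still runs:

```python
import numpy as np
import pandas as pd

try:
    # 'bank_payments.csv' is a hypothetical file name; use your download's
    df = pd.read_csv('bank_payments.csv')
except FileNotFoundError:
    # synthetic stand-in so the snippet runs without the Kaggle file
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'amount': rng.uniform(1, 500, size=2000),
        'fraud': rng.integers(0, 2, size=2000),
    })

# the full dataset is large, so limit it to 1000 rows for this tutorial
df = df.head(1000)

# 'fraud' is assumed to be the target column (1 = fraud, 0 = not fraud)
X = df.drop(columns=['fraud']).to_numpy()
y = df['fraud'].to_numpy()
print(X.shape, y.shape)
```

Note that the real dataset also contains categorical columns, which would need to be encoded to numbers before our tree can split on them.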
You might be wondering what the dataset looks like; here you go.
We will now split the data into separate training and testing datasets so that our metrics are computed on data the model hasn’t seen.
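Without scikit-learn, a simple shuffled split can be written by hand; the 80/20 ratio below is my choice, not mandated anywhere:

```python
import numpy as np

def train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices, then slice off the last `test_size` fraction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Shuffling before slicing matters: if the rows were sorted by class, a plain head/tail split would put almost all fraud cases on one side.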
We’re all ready to start the training now! Let’s instantiate an object using the classifier class we wrote and print out some predictions made by the model.
As you know, 1 represents fraud, while 0 tells us that the transaction wasn’t marked as fraud. Let’s use our testing dataset to print out the accuracy of our model now to see how it fared.
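Accuracy is just the fraction of predictions that match the held-out labels, so computing it without a library is a one-liner:

```python
import numpy as np

def accuracy_score(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# toy example: 3 of 4 predictions correct
print(accuracy_score([0, 1, 0, 1], [0, 1, 0, 0]))  # 0.75
```

Keep in mind that on an imbalanced fraud dataset, plain accuracy can look good even for a model that never flags fraud; precision and recall tell a more honest story there.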
Well, we certainly didn’t train a state-of-the-art model, but financial problems are naturally much harder to deal with, and even million-dollar companies struggle with classifying fraud. Still, the result isn’t that bad. For one, the data is likely imbalanced towards non-fraud cases, which is why the model marks most cases as non-fraud even when they were actually fraudulent.
Today we have successfully built a random forest classifier from scratch in Python. I’m sure it was easier than most of you imagined. Random forests remain among the most widely used classifiers in industry thanks to their strong performance.
Again, many libraries will let you implement a random forest in a few lines, and most of the time that’s the better alternative. But building it from scratch yourself makes sure you understand the underlying mechanisms.
That’s it for today! See you in the next article.