Feature importance refers to a score assigned to an input feature (variable) of a machine learning model depending upon its contribution to predicting the target variable. Let’s say you have a dataset that contains 20-30 independent variables and a single target variable. Suppose some of the features contribute to determining the output, while others are not important, i.e., removing them will not change the results much. Obviously, you do not want to include such features as they are not useful, and they are taking up space and increasing the computation time.
Well, you can find the importance of each input variable and perform feature selection accordingly. Moreover, feature importance also provides insight about the data and the trained model as you can see which attributes are significant and how much they are contributing to predicting the output.
In this article, we will go through some of the techniques to calculate feature importance in Python. As you will see, they are pretty simple, and you can implement them with a few lines of code. Specifically, we will cover the following three ways:
- Feature Importance from Coefficients
- Feature Importance from a Tree-Based Model
- Permutation-Based Feature Importance
Note: The notebook for this tutorial can be found on my GitHub here.
But before calculating feature importance, we need to load a dataset and prepare it. So, let’s go ahead and do that.
Dataset Loading, Exploration, and Pre-processing
For this article, we will use the Heart Attack Possibility dataset from Kaggle. You can download it from here.
First, let’s import the necessary modules.
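The original import cell is not reproduced here. A plausible set of imports for the steps in this article, assuming pandas, NumPy, matplotlib, and scikit-learn are installed, might look like this:

```python
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Splitting and scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Models used in the three techniques
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Permutation-based importance
from sklearn.inspection import permutation_importance
```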
Let’s now load the heart.csv file into a pandas DataFrame, i.e.,
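A minimal sketch of the loading step. With the downloaded file you would call pd.read_csv("heart.csv"), assuming it sits in the working directory. So that the snippet runs on its own, the version below reads a few made-up placeholder rows from an in-memory string; they only mimic the column layout of the Kaggle file and are not real patient records.

```python
import io
import pandas as pd

# With the real file (the path is an assumption):
#   df = pd.read_csv("heart.csv")

# Placeholder rows for illustration only, NOT taken from the dataset.
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
60,1,0,140,240,0,1,150,0,1.2,1,0,2,0
45,0,2,120,210,0,0,170,0,0.0,2,0,2,1
52,1,1,130,260,1,1,160,1,2.5,1,1,3,0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the frame
print(df.shape)    # (number of rows, number of columns)
```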
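This dataset contains 13 attributes and a single target variable. The attributes are explained below. Feel free to skip them if you are not interested in the details of the features.
age: It represents the age of the patient in years.
sex: It represents the sex of the patient (1 for male, 0 for female).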
cp: It represents the chest pain type and has four distinct values.
- Typical angina
- Atypical angina
- Non-anginal pain
- Asymptomatic
trestbps: It shows the resting blood pressure of a patient.
chol: It represents the serum cholesterol measured in mg/dl.
fbs: It stands for fasting blood sugar. If it is greater than 120 mg/dl, then the value is 1. Otherwise, it is 0.
restecg: It represents resting electrocardiographic results.
thalach: It shows the maximum heart rate of the patient.
exang: It represents if angina got triggered due to exercise.
oldpeak: It shows the ST depression induced by exercise relative to rest.
slope: It represents the slope of the peak exercise ST segment. If the value is 1, then it is upsloping, 2 indicates flat, while 3 means downsloping.
ca: It represents the count of major vessels. It has values from 0 to 3.
thal: It stands for thalassemia. It contains three distinct values.
- Normal
- Fixed defect
- Reversible defect
target: It represents whether a patient has a chance for heart disease or not. If there is a chance, then the target value is 1. Otherwise, 0.
Let’s see if our dataset has any null values.
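One common way to run this check is to count missing values per column with isnull().sum(). The sketch below uses a tiny placeholder frame so that it is self-contained; with the real data, df would be the DataFrame loaded from heart.csv.

```python
import pandas as pd

# Placeholder frame standing in for the heart data (values are made up).
df = pd.DataFrame({
    "chol": [240, 210, 260],
    "thalach": [150, 170, 160],
    "target": [0, 1, 0],
})

null_counts = df.isnull().sum()  # number of missing values in each column
print(null_counts)
print("Any missing values?", bool(df.isnull().values.any()))
```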
Great! As you can see, there are no missing values.
The next step is to split the dataset into training and testing samples. Moreover, as you can observe, there is a difference in the ranges of attributes. Therefore, we also need to standardize the data such that each feature has a distribution of zero mean and unit standard deviation.
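A sketch of the split-and-standardize step, under a few assumptions: the data is a synthetic stand-in generated with make_classification, the feature names are placeholders, and the 80/20 split ratio and random seeds are arbitrary choices. With the real DataFrame you would use X = df.drop(columns="target") and y = df["target"] instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumption, not the real dataset).
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y = make_classification(n_samples=300, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=feature_names)

# Hold out 20% of the rows for testing, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training data only, then apply it to both sets,
# so that no information from the test set leaks into the scaling.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 per feature
print(X_train_scaled.std(axis=0).round(3))   # approximately 1 per feature
```

Fitting the scaler on the training split alone is a design choice that keeps the test set unseen until evaluation.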
Cool! Our data is ready for calculating feature importance now.
Feature Importance from Coefficients
Linear machine learning algorithms such as linear regression, multiple linear regression, and logistic regression base their predictions on a weighted sum of the input features (logistic regression additionally passes this sum through a sigmoid function to obtain a probability), i.e.,
y = w1x1 + w2x2 + … + wnxn + b
x1, x2, …, xn are the independent variables (features), and y is the dependent one.
w1, w2, …, wn are the weights, and b is the bias.
These weights or coefficients determine how much each input variable affects the outcome. Therefore, we can use them as scores for feature importance, provided the features are on a comparable scale, which is why we standardized the data earlier.
Let’s use Scikit-Learn’s logistic regression algorithm as we have a classification dataset. In the following code, we initialize the model with the default parameters and fit it based on the given training samples.
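The fitting step could look roughly like the following. This is a sketch on synthetic stand-in data (an assumption, since the heart dataset is not bundled here); only the LogisticRegression() call with default parameters comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic regression with default parameters, fitted on the training samples.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

print("Test accuracy:", model.score(X_test_scaled, y_test))
```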
We can get the coefficients of the attributes using the coef_ attribute. We create a DataFrame using the features and the coefficients and sort it in descending order. Let’s see.
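As an illustration of this step, the sketch below pairs each feature name with its coefficient and sorts the result. The data and feature names are synthetic placeholders, and the train/test split is skipped for brevity, so the numbers say nothing about the heart dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names (assumption).
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y = make_classification(n_samples=300, n_features=6,
                               n_informative=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X_arr)
model = LogisticRegression().fit(X_scaled, y)

# For binary classification, coef_ has shape (1, n_features),
# so the first row holds one coefficient per feature.
coef_df = pd.DataFrame({"feature": feature_names,
                        "coefficient": model.coef_[0]})
coef_df = coef_df.sort_values("coefficient", ascending=False)
coef_df = coef_df.reset_index(drop=True)
print(coef_df)
```

Sorting by the signed value puts strongly negative coefficients at the bottom; sorting by coef_df["coefficient"].abs() instead would rank features purely by magnitude.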
To get a better idea about the significant features, let’s visualize them by plotting a bar graph using the matplotlib library.
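A possible version of the bar graph is sketched below. It uses synthetic stand-in data with placeholder feature names, so the bars will not match the heart-data results discussed in the text. The Agg backend line is only there so that the snippet runs without a display; in a notebook you would drop it and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names (assumption).
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y = make_classification(n_samples=300, n_features=6,
                               n_informative=3, random_state=0)
model = LogisticRegression().fit(StandardScaler().fit_transform(X_arr), y)

coef_df = pd.DataFrame({"feature": feature_names,
                        "coefficient": model.coef_[0]})
coef_df = coef_df.sort_values("coefficient", ascending=False)

# One bar per feature; bars below zero are negative coefficients.
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(coef_df["feature"], coef_df["coefficient"])
ax.axhline(0, color="black", linewidth=0.8)
ax.set_xlabel("Feature")
ax.set_ylabel("Coefficient")
ax.set_title("Feature importance from logistic regression coefficients")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```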
The larger the absolute value of a coefficient, whether positive or negative, the greater the effect of the corresponding feature on the outcome. From the graph above, we can see that attributes such as cp and ca are significant, as they have high importance scores.
Feature Importance from a Tree-Based Model
Tree-based machine learning algorithms can also be used to calculate feature importance. One such example is the decision tree algorithm. It performs classification and regression using information gain (or Gini index) and variance reduction, respectively. The dataset gets broken down into smaller and smaller subsets to increase the homogeneity using the best attribute.
The best feature is the one that has the highest information gain or the one that results in the most homogeneous subsets. Therefore, these measures can be used as feature importance scores.
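To make the idea of homogeneity concrete, here is a small worked example of the Gini impurity and the impurity decrease produced by a split. The helper names gini and gini_gain are hypothetical, and the labels are toy values chosen for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]  # evenly mixed node

# Split A separates the two classes perfectly.
gain_a = gini_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
# Split B leaves both children as mixed as the parent.
gain_b = gini_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0])

print(gini(parent))  # 0.5
print(gain_a)        # 0.5
print(gain_b)        # 0.0
```

A feature that produces splits like A would receive a high importance score, while one that only produces splits like B would receive none.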
Let’s test this and train a decision tree model on the same dataset. We can get the feature importances using the feature_importances_ property. In fact, all of Scikit-Learn’s tree-based models expose this property.
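A sketch of the decision tree step, again on synthetic stand-in data with placeholder feature names (both assumptions). The random_state value is arbitrary and only makes the run repeatable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# feature_importances_ holds one non-negative score per feature;
# scikit-learn normalizes the scores so that they sum to 1.
tree_importances = pd.Series(tree.feature_importances_, index=feature_names)
tree_importances = tree_importances.sort_values(ascending=False)
print(tree_importances)
```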
The random forest algorithm is generally more stable and accurate because it uses an ensemble of decision trees. Consequently, its importance scores also tend to be more reliable. Let’s see.
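The same pattern applies to a random forest. In the sketch below, the data is a synthetic stand-in and the number of trees (100) and the seeds are assumptions, not values taken from the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# The forest's scores are averaged over the importances of its trees.
forest_importances = pd.Series(forest.feature_importances_,
                               index=feature_names)
forest_importances = forest_importances.sort_values(ascending=False)
print(forest_importances)
```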
Permutation-Based Feature Importance
Permutation-based importance is another method to find feature importances. It takes a fitted model and validation/testing data. It randomly shuffles the values of a single attribute, which breaks the relationship between that attribute and the target, and then checks the performance of the model. If the performance drops considerably, then that feature is important. This process gets repeated a certain number of times for each attribute, and the mean drop in score is taken as the importance score.
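The procedure described above can be sketched by hand in a few lines. This is an illustrative implementation under stated assumptions (synthetic stand-in data, accuracy as the score, 10 repeats), not the code scikit-learn uses internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted model (assumptions).
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

rng = np.random.default_rng(0)
baseline = model.score(X_test, y_test)  # accuracy on untouched test data
n_repeats = 10

importances = np.zeros(X_test.shape[1])
for j in range(X_test.shape[1]):
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        # Shuffle only column j, leaving all other columns intact.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - model.score(X_perm, y_test))
    # Importance of feature j is the mean drop in accuracy.
    importances[j] = np.mean(drops)

print("Baseline accuracy:", baseline)
print("Mean accuracy drop per feature:", importances.round(3))
```

A score near zero (or slightly negative) means that shuffling the feature made little difference to the model on this data.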
The permutation-based feature importance can be computed using the permutation_importance() function from the sklearn.inspection module. It takes a fitted model, validation/testing data, and the performance metric. Here, we use the random forest classifier (trained in the above section) and accuracy as the evaluation metric. Let’s see.
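A sketch of the call with accuracy as the metric. The data is a synthetic stand-in, and n_repeats=10 and the seeds are assumptions rather than values from the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted random forest (assumptions).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature n_repeats times on the test data and record
# the resulting drop in accuracy.
result = permutation_importance(model, X_test, y_test,
                                scoring="accuracy",
                                n_repeats=10, random_state=42)

perm_df = pd.DataFrame({
    "feature": feature_names,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std,
}).sort_values("importance_mean", ascending=False)
print(perm_df)
```

The returned object also keeps the raw per-repeat scores in result.importances, which can be useful for plotting the spread of each feature's importance.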
Today, we covered three techniques to compute the feature importance. As you can observe, all the approaches were pretty easy and took only a few lines of code. There are other methods available as well. But, if you know these techniques, you should be good to go.
That’s all for this article. Specifically, we covered the following topics.
- Introduction to feature importance
- Dataset loading, exploration, and preprocessing
- Obtaining feature importance from coefficients
  - With logistic regression
- Getting feature importance from tree-based models
  - With a decision tree classifier
  - With a random forest classifier
- Permutation-based feature importance
  - With a random forest classifier