Classification is an important area in machine learning and data mining, and it falls under the concept of supervised machine learning. A supervised machine learning algorithm is an algorithm that learns the relationship between independent and dependent variables using the labeled training data. Then, using this learned mapping, it can predict the output of an unknown sample.
In this tutorial, we will cover one of the basic classification techniques, i.e., logistic regression. Specifically, we will cover the following topics:
- · Introduction to Logistic Regression
- · Mathematical Concepts Beneath Logistic Regression
- · Implementation from Scratch
- · Scikit-Learn Implementation
- · Conclusion
Note: The notebook for this tutorial can be found on my GitHub here.
Introduction to Logistic Regression
Logistic regression is used for binary class classification problems. It learns a relationship between a categorical dependent variable and independent variables. The categorical dependent variable contains only two distinct values or classes. In other words, the dependent variable is binary, where our desired class gets considered as a positive class (1) and the other one as negative (0). For example, in cancer detection, the positive class represents the case where the patient has cancer, and the negative class shows no cancer detection.
Similarly, for email spam detection, positive class or 1 represents the spam email, and 0 (negative class) shows that the email is not spam.
Here are some of the requirements/assumptions of the logistic regression algorithm:
- · First, it requires that the target variable must be binary.
- · Logistic regression assumes that the variables are independent of one another.
- · It also assumes that there is minimal multicollinearity among independent variables, meaning that one independent variable is not highly correlated with one or more other independent variables.
- · Logistic regression also requires a large data size.
So now we know what logistic regression is, let’s go ahead and look at some math behind it.
Mathematical Concepts Beneath Logistic Regression
Logistic regression is similar to linear regression. However, it is used for the classification problem, and therefore, makes some modifications to the linear regression technique. Linear regression uses the following straight-line equation to predict the outcome, i.e.,
x1, x2, …., xn are the independent variables, and y is the dependent one.
w1, w2, …., wn are the weights, and b is the bias. Initially, the weights and bias are assigned zero values.
This equation gives us a continuous value that can be in any range, i.e., from -infinity to +infinity. We pass it through a function known as the sigmoid or the logit function. It maps the predicted value between 0 and 1 according to the following equation.
Let’s see its visualization to get a better idea.
As you can observe, it is an S-shaped function having a range of [0, 1]. A high value of x will result in a value near 1, and a low value of x will map to a value near 0. In other words, we can say that it gives us the probability of the desired class. If the output of the sigmoid function (probability) is greater than the specified threshold, then the output class is positive. Otherwise, it is negative.
The next step is to check if our predicted value is correct or not using some cost/ loss function. If we talk about linear regression, it uses the mean squared error function, i.e.,
For m training examples, the average cost function is given as, i.e.,
However, this simple loss function does not work for logistic regression as it uses a sigmoid function, unlike linear regression that only has a simple line equation. Instead, it uses the Binary-Cross Entropy Loss or the Log Loss, and it is given as follows:
Now, we need to update the weight and the bias parameters based on this loss. We use the gradient descent algorithm to update the parameters iteratively, such that the classification error decreases. For that, we take partial derivatives of the cost function with respect to weights and bias, i.e.,
Finally, we update weights and bias, i.e.,
α is the learning rate. The above process gets repeated for a given number of iterations.
Implementation from Scratch
Let’s now go ahead and implement the logistic regression algorithm from scratch in Python. For this article, we will use the Cardiovascular Disease dataset from Kaggle. You can download it from here.
First, let’s import the necessary modules.
Load and Explore the Dataset
Now, let’s load the dataset into a pandas DataFrame using the pd.read_csv() method. For this CSV file, the delimiter is a semicolon, so we specify that in the method. Moreover, we make the id column at position 0 as our index.
Let’s explore the features of our dataset.
age: It is given in terms of days.
gender: It has two distinct values, i.e., 1 and 2. Value 1 represents women, and 2 stands for men.
The height and weight features are given in cm.
The ap_hi variable represents the systolic blood pressure, and ap_low shows the diastolic blood pressure.
cholesterol and gluc (glucose) contain three distinct values:
- Above normal
- Well above normal
The smoke and alco(alcohol) variables are both binary and represent whether a patient smokes or drinks, respectively. The active binary feature indicates the physical activity of a person.
The cardio variable is our target binary variable that shows if a person has any cardiovascular disease or not.
The dataset has a total of 70000 samples. For this tutorial, we will limit it to 1000.
As explained above, the age feature is in days. So, let’s convert it to years to make the interpretation easier.
Next, we use the pd.describe()method to gain more information about the dataset, for example, mean, max, and standard deviation of every attribute, etc.
Let’s now separate the dataset into the target and the independent variables.
The next step is to split the dataset into train and test samples. We will do that using the train_test_split()method from the sklearn.model_selection class imported above. We set aside 20% of the data for testing.
Then, we standardize the data such that each feature will have a distribution of zero mean and unit standard deviation.
The LogisticRegression Class
Now, we write the code to implement logistic regression from scratch. The LogisticRegression class contains the following methods:
- __init__(learning_rate, number_of_iterations, verbose): The constructor for our class to initialize parameters such as learning rate and the number of iterations. Moreover, weights and bias are None initially.
- initialize_parameters(n): This method initializes weight and bias parameters.
- binary_cross_entropy(y, y_hat): It calculates the log loss given the true and the predicted labels.
- sigmoid(z):It passes the outcome of the line equation through the sigmoid activation function.
- fit(X, y):This method calculates the values of weight and bias parameters.
- predict_proba(X): It calculates the prediction probability using the equation of a line and the sigmoid function.
- predict(X, theta): It finds the class label of given samples using the predict_proba()method and the given threshold (theta). By default, it is set to 0.5.
Let’s now initialize parameters to train the model. We set the learning rate (alpha) to 0.01 and the number of iterations to 1000. Then, we fit the model based on the training samples.
To get a better concept of the training loss, let’s visualize it using the matplotlib library.
As you can observe, the loss got down to 0.529.
Now that we have trained our model, let’s go ahead, and use the predict() method to find out the labels of our test dataset.
Let’s evaluate our model using the accuracy and the confusion matrix metrics.
The accuracy of our model is 76.5%. Moreover, as you can observe, there are 18 false positives and 29 false negatives, which means that 18 patients were incorrectly identified as heart patients, while 29 of them had some cardiovascular disease but got predicted negative.
Let’s now compare our implementation with Scikit-Learn’s implementation to check how good our model is. We will use the LogisticRegressionclass from the sklearn.linear_model. Let’s see.
As you can see, it has an accuracy of 77%, which is only 0.5% more than our model. The confusion matrix also does not have much difference.
That’s it for this tutorial. As you can see, logistic regression is quite easy to implement. However, it is not used much in practice. Instead, it is commonly used as a baseline for other binary classification algorithms. It is also a great algorithm to get started with if you are new to machine learning, and it provides basics for more intermediate to advanced topics.
Ernesto.Net is a Top corporate IT Training, and consulting company in Fort Lauderdale, Florida which works on all major Data Science and Blockchain requirements. We specialize in Data Science, Blockchain, and Big Data.
Our clientele includes Multiple Fortune 500 companies we have been featured on NASDAQ, ABC, NBC, CBS, and Fox.
Visit our official website https://ernesto.net for more information.