If you have a good grasp of Python basics, there’s a lot you can do with the hundreds of libraries available. You might already be familiar with how numbers can be used to train machine learning models to make predictions; it turns out that we can classify textual data in much the same way. However, a few extra steps are needed when dealing with such data.
A good way to see how a model can be trained on textual data is to implement a sentiment analysis model. It covers all the basics of converting text into numbers, using those numbers for training, and then making predictions on new, unseen data.
So, in this article, we will be going through a step-by-step implementation of sentiment analysis in Python using scikit-learn. I will be explaining each step involved along with the relevant code, so you don’t miss out on anything.
Sentiment analysis is a technique that lets you figure out the underlying sentiment behind someone’s statement. For example, suppose I post a review for a clothing store that contains only text and no numerical rating. Sentiment analysis of that review will reveal whether it was positive, negative, or neutral.
This is a very important technique for huge businesses. Not only does it let them capture more customer response, but it also helps them easily analyze any new products. Once an ML-based sentiment analysis model is deployed, capturing the sentiments of customers becomes a simple task.
Implementation in Python
Here’s the link to the Binder in case you want to try out the code for yourself.
Now that we understand what sentiment analysis is all about, and how it works, let’s see how we can make our very own model that can predict sentiments based on simple statements. We will be using the Amazon Reviews dataset that contains reviews for mobile phones.
Let’s import some basic libraries we need, so we can easily read data and manipulate it.
An important thing to note here is that the dataset we’re using is extremely large, and processing all of it would take a lot of computational power. So, for the sake of our tutorial, let’s reduce the size of the dataset so we can train our model easily.
The sample function of pandas returns a random fraction of the rows of the original dataset, which shrinks the data without biasing it toward any part of the file. Let’s run this cell to see what the head of the data looks like.
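As a rough sketch of this step, the snippet below builds a small stand-in DataFrame instead of reading the real file (the column names, the 10% fraction, and the random_state value are assumptions made for illustration) and then takes a random sample of it.

```python
import pandas as pd

# Stand-in for the real reviews file. In practice you would load it with
# pd.read_csv(...) using the path of the file you downloaded.
df = pd.DataFrame({
    "Reviews": [f"review text {i}" for i in range(100)],
    "Rating": [(i % 5) + 1 for i in range(100)],
})

# Keep a random 10% of the rows; random_state makes the sample reproducible.
df = df.sample(frac=0.1, random_state=10)

# Look at the first few rows of the reduced dataset.
print(df.head())
```

Fixing random_state is optional, but it means you get the same subset every time you rerun the notebook.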
As you can see, we have 6 columns that tell us the different characteristics of each customer review. Our major focus will be on the ‘Reviews’ column since that will be used to classify the sentiment.
The next step is to remove any missing values from our dataset since they are of no use to us. We also need to remove the reviews that have a neutral rating, since we will only focus on positive and negative reviews in this tutorial.
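Here is a minimal sketch of that cleaning step on a tiny made-up DataFrame. The column names 'Reviews' and 'Rating' are assumed, and a neutral review is taken to be one with a rating of exactly 3.

```python
import pandas as pd

# Tiny made-up sample: one row has a missing review, one has a neutral rating.
df = pd.DataFrame({
    "Reviews": ["Great phone", None, "It is okay", "Terrible battery", "Love it"],
    "Rating": [5, 4, 3, 1, 5],
})

# Drop every row that contains a missing value.
df = df.dropna()

# Drop neutral reviews, i.e. rows whose rating is exactly 3.
df = df[df["Rating"] != 3]

print(df)
```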
Let’s convert the rating to a binary scale now. It will make it easy for us to treat the ratings since they’ll be either positive or negative, and not spread over a range. For this, we will encode the ratings above 3 to 1, as they are positive, while the ratings below 3 will be encoded to 0 as they are negative. Here’s how this can be done.
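One way to do this encoding is with NumPy's where function. The sketch below uses a few made-up ratings and assumes the neutral ratings have already been removed.

```python
import numpy as np
import pandas as pd

# Made-up ratings with the neutral ones (3) already filtered out.
df = pd.DataFrame({"Rating": [5, 4, 2, 1, 5]})

# 1 for ratings above 3 (positive), 0 otherwise (negative).
df["Positively Rated"] = np.where(df["Rating"] > 3, 1, 0)

print(df)
```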
A new column will be added to the dataset showing us if a certain rating is positive or not. Here’s what the modified dataset now looks like.
As you can see, the last column tells us if a certain rating falls in the positive category or not.
Let’s check the mean of the new column called ‘Positively Rated’ to see if most ratings are positive or negative.
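Because the column only holds 0s and 1s, its mean is simply the share of positive reviews. A small illustration with made-up values:

```python
import pandas as pd

# Made-up labels: three positive reviews and one negative review.
df = pd.DataFrame({"Positively Rated": [1, 1, 0, 1]})

# The mean of a 0/1 column is the fraction of rows equal to 1.
positive_share = df["Positively Rated"].mean()
print(positive_share)
```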
As you can see, the mean is closer to 1 which means that most ratings in the dataset are positive.
That marks the end of our pre-processing, and we’re ready to move forward and prepare our data for training. We’ll need to import the train_test_split function from the sklearn library since we don’t have separate testing data to evaluate our model later. If you don’t know why we’re doing this step, feel free to read about it here.
Now that we have separated our training data, we need to figure out how we are going to train on it. We cannot simply feed the text data to a machine learning model, since such models only work on numerical data.
So, the next step is to use a count vectorizer. A count vectorizer extracts features from text by building a vocabulary of the words it sees and counting how often each one occurs, so that those counts can be used while training a model. We will import the count vectorizer from the feature extraction module of scikit-learn and fit an instance of it on the training data.
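The split can be sketched as follows. The reviews are made up, and the default split of scikit-learn (25% of the rows held out for testing) and the random_state value are assumptions rather than the exact settings used on the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up reviews with their binary labels.
df = pd.DataFrame({
    "Reviews": [
        "love this phone", "terrible battery", "excellent screen",
        "worst phone ever", "great value", "slow and buggy",
        "works perfectly", "stopped working",
    ],
    "Positively Rated": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Hold out part of the data so the model can be evaluated on unseen reviews.
X_train, X_test, y_train, y_test = train_test_split(
    df["Reviews"], df["Positively Rated"], random_state=0
)

print(len(X_train), len(X_test))
```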
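A minimal sketch of fitting the vectorizer, using three made-up training reviews and the default settings of CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training reviews.
X_train = ["great phone love it", "terrible battery", "love the screen"]

# Fitting learns the vocabulary: every distinct token becomes one feature.
vect = CountVectorizer().fit(X_train)

# vocabulary_ maps each token to its column index.
print(vect.vocabulary_)
```

With the default settings the text is lowercased and single-character tokens are dropped, so the vocabulary here ends up with 8 entries.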
We can see the vocabulary generated by the count vectorizer using the get_feature_names function (renamed get_feature_names_out in newer versions of scikit-learn).
This is just a small subset of the total vocabulary built from the training data. Let’s see how many total words we have here.
Next up, we need to transform the documents to a document-term matrix. This will give us a bag-of-words representation of X_train.
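To make the bag-of-words idea concrete, here is a small sketch with two made-up reviews. The vocabulary is sorted alphabetically as it, love, phone, terrible, this, and the transformed matrix holds one count per word for each review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up training reviews.
X_train = ["love this phone love it", "terrible phone"]
vect = CountVectorizer().fit(X_train)

# transform() returns a sparse document-term matrix.
X_train_vectorized = vect.transform(X_train)

print(X_train_vectorized.shape)

# Dense view, only sensible for tiny examples like this one.
print(X_train_vectorized.toarray())
```

The matrix is stored in sparse form because on a real dataset almost all of its entries are zero.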
As seen above, the resulting matrix has each column representing a word from our vocabulary and each row representing a document. Hence, each entry shows the number of times a specific word appears in a document.
It’s time to train our model now using the training data we have prepared. We can choose any classification model here, but since the data is high-dimensional and sparse, logistic regression is a good choice.
Training and Evaluating the Model
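Training can be sketched as below. The six reviews and labels are made up, and max_iter=1000 is an assumed setting chosen only to give the solver plenty of room to converge.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training reviews and labels (1 = positive, 0 = negative).
X_train = [
    "love this phone", "excellent screen love it", "great battery excellent",
    "terrible battery", "worst phone ever", "slow and terrible",
]
y_train = [1, 1, 1, 0, 0, 0]

# Turn the text into a document-term matrix.
vect = CountVectorizer().fit(X_train)
X_train_vectorized = vect.transform(X_train)

# Fit a logistic regression classifier on the word counts.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vectorized, y_train)

# One coefficient is learned per word in the vocabulary.
print(model.coef_.shape)
```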
That’s it, we have successfully trained our model. Now it’s time to see how our model is performing. For that, we will use the AUC score to evaluate our model.
It’s important to note that we cannot simply pass the raw X_test to the model; we need to transform it with the same fitted vectorizer we used for X_train. Also, any words in X_test that were not present in X_train are not part of the vocabulary, so they will simply be ignored when making predictions.
Well, that’s an impressive AUC score, keeping in mind that we did not use the whole dataset. That’s a success for our model, given that it was trained on only a fraction of the data available to us.
To get a more in-depth view of how the model is working, let’s look at some of the smallest and largest coefficients to see how the model connects individual words to positive or negative reviews.
We need to get the feature names from the vectorizer in the form of a NumPy array and sort the model’s coefficients using argsort(), which returns the indexes that would put them in ascending order. After that, we can access the words with the smallest coefficients as simply the first 10 indexes, and the ones with the largest coefficients by slicing the indexes with [:-11:-1]. Here’s how it can be done.
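The evaluation can be sketched as follows on made-up data. Scoring with predicted probabilities is a choice made for this sketch; passing the hard 0/1 output of predict to roc_auc_score also works but gives a coarser estimate.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Made-up training and test reviews (1 = positive, 0 = negative).
X_train = [
    "love this phone", "excellent screen love it", "great battery excellent",
    "terrible battery", "worst phone ever", "slow and terrible",
]
y_train = [1, 1, 1, 0, 0, 0]
X_test = ["love the excellent camera", "terrible slow worst"]
y_test = [1, 0]

vect = CountVectorizer().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(vect.transform(X_train), y_train)

# Reuse the vectorizer fitted on the training data; never refit it on test data.
X_test_vectorized = vect.transform(X_test)

# Words such as "camera" never appeared in training, so they contribute nothing.
unseen = vect.transform(["camera"])

# Probability of the positive class for each test review.
scores = model.predict_proba(X_test_vectorized)[:, 1]
auc = roc_auc_score(y_test, scores)
print("AUC:", auc)
```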
As you can see, the model has associated words like ‘worst’, ‘terrible’, and ‘slow’ with negative reviews while the words like ‘excellent’, ‘perfectly’, ‘love’ are associated with positive reviews. Also, note that the vocabulary contains typos as well such as ‘excelent’.
Classifying New Data Using the Model
Now that our model is up and running, how can we use it to predict whether new user reviews are positive or not? Well, we can simply call the predict method of our trained model and pass in the review we want to classify. Again, it’s important to remember that we need to transform the review with the fitted vectorizer first, since the model only accepts the numerical representation and not raw text.
Let’s see how this can be done.
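Here is a minimal end-to-end sketch on made-up data. The two new reviews are invented for illustration and are not the ones used with the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training reviews and labels (1 = positive, 0 = negative).
X_train = [
    "love this phone", "excellent screen love it", "great battery excellent",
    "terrible battery", "worst phone ever", "slow and terrible",
]
y_train = [1, 1, 1, 0, 0, 0]

vect = CountVectorizer().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(vect.transform(X_train), y_train)

# New, unseen reviews must go through the same fitted vectorizer.
new_reviews = ["love this excellent phone", "terrible slow battery"]
predictions = model.predict(vect.transform(new_reviews))

print(predictions)
```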
As you can see, the model predicted the first input to be positive, while the second one is predicted to be a negative review. Great! Remember we used ‘1’ to encode the positive reviews and ‘0’ for the negative ones.
That’s pretty much it! We implemented sentiment analysis in Python with scikit-learn, using the Amazon mobile reviews dataset, and went through each step of the implementation in detail. You can follow the same steps to code a sentiment analyzer for any dataset you want and use it to classify user ratings or whatever your use-case is!
If you want to access the code used in this tutorial, you can find it on my GitHub. Happy Coding!