In this article, we will build a system that calculates the similarity between documents and make it available as both an API and a web app.
Text similarity measures how close two pieces of text are to each other. It is useful in a number of ways:
- Search engines use it to show websites containing text similar to the query. Sites like Yahoo Answers and Quora use the same functionality to show related answers.
- Chatbots can use it to provide appropriate answers to customers, thus enhancing the customer’s experience.
- It can be used for plagiarism checking to ensure appropriate credit for a given piece of text.
This article is divided into 5 sections:
- Cleaning the Data
- Cosine Similarity
- Building the API
- Building the UI
- Containerize the application
If you want to jump to the code, here’s a link to the code repository: document_similarity_checker
Cleaning the Data
Before we calculate the similarity between various corpora of text, we need to clean the text to remove the information that is not helpful in calculating the similarity.
Many words in a language contribute little to the semantic meaning of a sentence. These are known as stopwords and include words such as I, me, and myself. To remove them, we will use the Python library nltk and download its stopword list.
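A minimal sketch of stop-word filtering follows. To stay self-contained it uses a small hand-written set, `SAMPLE_STOPWORDS`, which is only an illustrative stand-in and not the article's list. With nltk installed, `nltk.download('stopwords')` followed by `nltk.corpus.stopwords.words('english')` supplies the full English list. The function name `remove_stopwords` is also chosen here for illustration.

```python
# Illustrative stand-in for nltk's English stop-word list (an assumption,
# chosen so the sketch runs without downloading anything).
SAMPLE_STOPWORDS = {"i", "me", "my", "myself", "we", "our", "the", "a", "an",
                    "is", "are", "to", "of", "and", "in", "that", "it"}


def remove_stopwords(words, stopwords=SAMPLE_STOPWORDS):
    """Return the words that are not stop words (case-insensitive)."""
    return [w for w in words if w.lower() not in stopwords]


words = "I taught myself to play the guitar".split()
print(remove_stopwords(words))  # ['taught', 'play', 'guitar']
```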
Here’s the output from that:
Punctuation isn’t really helpful in determining the similarity between two documents and thus we will remove that too using the following code:
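A minimal sketch of this step using only the standard library; the names `remove_punctuation_map` and `remove_punctuation` are chosen here for illustration and may differ from the repository's code:

```python
import string

# Map every punctuation character's code point to None so that
# str.translate drops it from the text.
remove_punctuation_map = {ord(char): None for char in string.punctuation}


def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(remove_punctuation_map)


print(remove_punctuation("Hello, world! Isn't this fun?"))
# Hello world Isnt this fun
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes and similar characters would need to be added to the map separately.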
We create a dictionary that maps each punctuation character to None, which effectively removes it. The translate method then applies this map to each character in the string. Here’s the output:
To calculate similarity, we will tokenize our string, which means converting it into a list of its individual words.
Lastly, we apply stemming, which reduces each word to its base form; for example, "running" becomes "run".
Let’s combine everything we have done in this section into a single function:
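One way such a combined function could look is sketched below. It is deliberately dependency-free, so several parts are stand-ins rather than the article's code: `STOPWORDS` is a tiny illustrative set, whitespace splitting replaces nltk's tokenizer, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as nltk's `PorterStemmer`. All names here are hypothetical.

```python
import string

# Small illustrative stop-word set (assumption; nltk's list is much longer).
STOPWORDS = {"i", "me", "myself", "the", "a", "an", "is", "are", "and", "to", "of"}
PUNCT_MAP = {ord(c): None for c in string.punctuation}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    text = text.lower().translate(PUNCT_MAP)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(clean_text("The cats are chasing the mice, and the dogs barked!"))
# ['cat', 'chas', 'mice', 'dog', 'bark']
```

As the output shows, stems do not have to be dictionary words; they only need to map related word forms to the same token.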
Cosine Similarity
Let’s start by understanding how cosine similarity works and then implement it in Python.
Cosine similarity is a metric that measures how similar two data objects are, irrespective of their size. Each data object in a dataset is treated as a vector. The formula for the cosine similarity between two vectors, x and y, is:
cos(x, y) = ( x . y ) / (||x|| * ||y||)
- x . y = dot product of the vectors ‘x’ and ‘y’.
- ||x|| and ||y|| = lengths (magnitudes) of the vectors ‘x’ and ‘y’.
- ||x|| * ||y|| = ordinary product of the two lengths, which is a scalar (not a cross product).
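The formula translates almost directly into plain Python. The sketch below is only an illustration on numeric vectors; the function name and the choice to return 0.0 for an all-zero vector are assumptions made here, not part of the article's code.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


# Same direction, different length: the result is 1.0 up to rounding error.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
# Orthogonal vectors: 0.0
print(cosine_similarity([1, 0], [0, 1]))
```

The first call shows why the metric is independent of size: doubling a vector changes its length but not its direction.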
Let’s calculate this in Python. Although it looks straightforward to implement, text data raises a few considerations:
- Some words will be present in one piece of text and not in the other.
- The two texts will not have the same number of words.
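Both concerns disappear once every text is counted against one shared vocabulary. The sketch below shows the idea with the standard library only; it is a conceptual illustration with a hypothetical function name, not the vectorizer the article uses.

```python
from collections import Counter


def count_vectors(sentences):
    """Turn whitespace-tokenized sentences into count vectors over a shared vocabulary."""
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A word missing from a sentence gets a count of 0,
        # so all vectors end up with the same length.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocab, vecs = count_vectors(["the cat sat", "the cat sat on the mat"])
print(vocab)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vecs)   # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```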
For this purpose, we will use the sklearn library to calculate the vectors from the strings and then calculate cosine similarity on them.
We need to join the tokens again because CountVectorizer accepts sentences as strings. For each sentence, it outputs a vector whose length equals the number of unique words in the corpus, where each entry holds the count of that word in the sentence. Here’s the output from above:
First, we get the cleaned sentences, then the unique words in the corpus, and finally a list of vectors for each sentence. Let’s now calculate cosine similarity:
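A sketch of both steps with scikit-learn follows. The three sentences are an illustrative, already-cleaned corpus made up for this example (not the article's data), and the helper name `cosine_sim_vectors` is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative, already-cleaned sentences (not the article's corpus).
sentences = [
    "cat sat mat",
    "cat sat sofa",
    "dog chased ball park",
]

# One count vector per sentence, all over the same vocabulary.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray()


def cosine_sim_vectors(vec1, vec2):
    """Cosine similarity of two 1-D count vectors."""
    # cosine_similarity expects 2-D arrays of shape (n_samples, n_features).
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]


sim_01 = cosine_sim_vectors(vectors[0], vectors[1])
sim_12 = cosine_sim_vectors(vectors[1], vectors[2])
print(sim_01, sim_12)
```

In this made-up corpus the first two sentences share two of their three words, while the second and third share none, so the first score is about 0.67 and the second is 0.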
The cosine_similarity function expects 2-D arrays, so we reshape each vector accordingly. Looking at the corpus, we would guess that the first and second sentences are more similar than the second and third. Let’s see what the cosine similarity gives us:
It’s indeed confirmed by our program.
Building the API
There will be only one endpoint, which calculates the similarity using the cosine method. Let’s first create separate files for the above modules. Here are the names of the files that we will be using:
- utils.py (Cleaning Functions)
- index.py (for UI)
- Dockerfile (Containerize the application)
- templates/index.html (Contains the UI)
- data/ (contains the text files)
Here’s the directory structure
Let’s first group together the above code we wrote into multiple files:
We will be using Flask, as it makes building simple APIs really fast. I will first show the complete API code and then explain it.
Let’s dissect this file:
- We start by importing the required libraries. These include Flask and the functions from the modules above.
- We define the app variable, which represents our web app.
- We then use the @app.route syntax. This is a decorator: whenever a request hits this route, the function defined below it is called.
- We have only included the GET method, as we are not passing any data from a form.
- We define the route for calculating the similarity.
- Notice the jsonify call on the last line of the endpoint. JSON is the standard data format in web applications, so we convert our output to JSON before sending it.
- Lastly, we run the app on port 80.
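The points above can be sketched as a small Flask app. This is not the article's api.py: the route name `/cosine_similarity`, the query parameters `file1` and `file2`, and the helper `cosine_from_text` (a simple word-count stand-in for the cleaning and vectorizing functions) are all assumptions made for illustration.

```python
import math
from collections import Counter

from flask import Flask, jsonify, request

app = Flask(__name__)


def cosine_from_text(text1, text2):
    """Cosine similarity of two texts using simple word counts (stand-in helper)."""
    c1, c2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical route and parameter names; the article's api.py may differ.
@app.route("/cosine_similarity", methods=["GET"])
def cosine_endpoint():
    path1 = request.args.get("file1")
    path2 = request.args.get("file2")
    if not path1 or not path2:
        return jsonify({"error": "file1 and file2 are required"}), 400
    with open(path1, encoding="utf-8") as f1, open(path2, encoding="utf-8") as f2:
        score = cosine_from_text(f1.read(), f2.read())
    return jsonify({"cosine_similarity": score})


# To serve it locally, one could call app.run(port=80) under an
# `if __name__ == "__main__":` guard, as the last bullet describes.
```

Reading arbitrary paths taken from a query string is acceptable for a local demo like this one, but a public deployment would need to restrict which files can be opened.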
Let’s run this file using the command python api.py:
Currently, there is no UI, so we can’t view the app in the browser. Instead, we will use curl to access the API endpoint. We need to keep the above terminal running and open a new tab for this. We pass the paths of both files as query string parameters. Note that the parameter names must exactly match the ones used in api.py; otherwise the API will not work correctly.
Cosine Similarity API Endpoint
Building the UI
As before, let’s first see the code:
Let’s dissect the code:
- We import the libraries and define the app as we did in the api.py file, but here we also define the upload folder for saving the files.
- Then we define the route as before, but here we allow two methods, GET and POST, as we are uploading the files from a form.
- In the case of GET, we just render the form.
- For the POST method, we read the files using the names defined in the form and save them in the upload folder.
- Finally, we calculate the cosine similarity using the function we defined above and send it to the template.
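A compact sketch of such a view is shown below. Several details are assumptions made to keep it in one file: the template is inlined with `render_template_string` instead of living in templates/index.html, the upload folder is a temporary directory, the form fields are named `file1` and `file2`, and `cosine_from_text` is a simple word-count stand-in for the article's similarity function.

```python
import math
import os
import tempfile
from collections import Counter

from flask import Flask, render_template_string, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
# Hypothetical upload location; the article's index.py defines its own folder.
app.config["UPLOAD_FOLDER"] = tempfile.mkdtemp()

# Inline stand-in for templates/index.html so the sketch is a single file.
PAGE = """
{% if similarity %}
  <p>Cosine similarity: {{ similarity.cosine }}</p>
{% else %}
  <form method="post" enctype="multipart/form-data">
    <input type="file" name="file1">
    <input type="file" name="file2">
    <input type="submit" value="Compare">
  </form>
{% endif %}
"""


def cosine_from_text(text1, text2):
    """Cosine similarity of two texts using simple word counts (stand-in helper)."""
    c1, c2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def save_and_read(upload):
    """Save an uploaded file into the upload folder and return its text."""
    path = os.path.join(app.config["UPLOAD_FOLDER"], secure_filename(upload.filename))
    upload.save(path)
    with open(path, encoding="utf-8") as fh:
        return fh.read()


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "GET":
        # No result yet, so the template renders the upload form.
        return render_template_string(PAGE, similarity=None)
    text1 = save_and_read(request.files["file1"])
    text2 = save_and_read(request.files["file2"])
    similarity = {"cosine": round(cosine_from_text(text1, text2), 3)}
    return render_template_string(PAGE, similarity=similarity)
```

The template branch mirrors the behavior described for index.html: the similarity is shown when it is available, and the form is shown otherwise.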
Let’s now look into the index.html file. Note that index.html must be placed in the templates folder; otherwise Flask won’t be able to find it.
Here, we check whether the similarity dictionary is present. If it is, we show the similarity; otherwise, we show the form.
We will then run the app using the following command:
Below are the screenshots from the app:
Containerize the Application
To make our app more useful and easily accessible to people, we will containerize it using Docker. We will create a Dockerfile in the same folder as index.py. Let’s examine the Dockerfile:
The commands are executed one by one here.
- First, we install Python, create a new working directory, and copy all the contents of the current directory into it.
- Then, we run the commands to upgrade pip and install all the necessary libraries.
- Finally, we run the index.py file, which starts the server of our app.
To make an image, first, go inside the folder where your Dockerfile exists, then run the following command:
This will take some time to execute. After it’s done you can run your app as follows:
The -p flag maps a port on your machine to the port the application listens on inside the container. You can see the container running in Docker Desktop and open the application in the browser.
I have also pushed this Docker image to Docker Hub, so you can download and play around with this application here: document_similarity_checker
In this article, we looked into using cosine similarity for calculating document similarity:
- We started by cleaning the data using various methods such as tokenizing, stemming, removing stop words, etc.
- We then explored cosine similarity theoretically and implemented it in Python using scikit-learn.
- After that, we created an API and a UI for our application using Flask.
- Finally, we built a Docker image so that anyone can use this application.