
In this article, we will build an analyzer that detects the emotion and the gender of a speaker from speech using Python and TensorFlow.
The idea behind this project is to create a neural network model for detecting emotions from the conversations we have in our daily lives. The model can detect up to 8 different emotions in male and female voices. This can be used for multiple purposes, such as:
- Personalization in marketing for recommending products based on emotions.
- Automotive companies can use it to detect the emotional state of drivers and adjust the vehicle’s speed to help avoid collisions.

The article is divided into the following sections:
- Dataset Preparation
- Pre-process Data
- Model Creation
- Model Training
- Prediction
- Building the API
- Building the UI
- Containerize the application
If you want to jump to the code, here’s a link to the code repository: emotion-analyzer
Dataset Preparation
We will be using the speech dataset, The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). To download the file, click here: “Audio_Speech_Actors_01-24.zip”.
The folder contains 24 sub-folders, one with the audio files of each of the 24 actors. The filenames consist of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:
Filename identifiers
- Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
- Vocal channel (01 = speech, 02 = song).
- Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the ‘neutral’ emotion.
- Statement (01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”).
- Repetition (01 = 1st repetition, 02 = 2nd repetition).
- Actor (01 to 24. Odd-numbered actors are male, even-numbered actors are female).
We will start by writing a function that places all the files into the same folder, i.e., combines all the actors’ audio files into a single folder.
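A minimal sketch of such a function, assuming the zip has been extracted into a single root folder (the function name and the use of shutil are my assumptions):

```python
import os
import shutil

def consolidate_files(root_dir):
    """Move every audio file from the per-actor sub-folders into root_dir,
    then delete the emptied sub-folders."""
    for folder in os.listdir(root_dir):
        folder_path = os.path.join(root_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for file_name in os.listdir(folder_path):
            shutil.move(os.path.join(folder_path, file_name), root_dir)
        os.rmdir(folder_path)
```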

The function iterates through each of the sub-folders inside the downloaded folder, moves all the files in it to the parent folder, and deletes the empty sub-folder.
Pre-Process Data
Currently, we just have the audio files in raw format and nothing else. For training purposes, we need to convert them into features and labels so the neural network can train on them.
Let’s start with creating the labels associated with each file:
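A sketch of this labelling step (the labels.csv file name and the gender_emotion label format are assumptions):

```python
import os
import pandas as pd

# Map the third field of the filename identifier to its emotion.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def create_labels(audio_dir):
    labels = []
    for file_name in sorted(os.listdir(audio_dir)):
        parts = os.path.splitext(file_name)[0].split("-")
        emotion = EMOTIONS[parts[2]]
        # Odd actor ids are male, even ids are female.
        gender = "male" if int(parts[6]) % 2 == 1 else "female"
        labels.append(gender + "_" + emotion)
    label_df = pd.DataFrame({"label": labels})
    label_df.to_csv("labels.csv", index=False)  # reused later for prediction
    return label_df
```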

We start by creating a dictionary that maps the identifier to the emotion. We iterate through each of the file names and extract the gender: if the actor id is odd, the gender is male; otherwise, it is female. We concatenate the gender and the emotion to form the label. Finally, we convert the list to a data frame. We also save the labels to a CSV file, as we will need them for prediction later on.
Now, we have to convert the audio files into features. We will use librosa for this purpose, as it can read audio files and extract features from them.
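A sketch of the feature extraction, assuming mean MFCCs are used as features; the load parameters below are chosen so each clip yields roughly 216 frames and are assumptions, not the author’s exact settings:

```python
import os
import librosa
import numpy as np
import pandas as pd

def extract_features(audio_dir):
    rows = []
    for file_name in sorted(os.listdir(audio_dir)):
        # A fixed duration/offset keeps the number of MFCC frames (~216) constant.
        signal, sample_rate = librosa.load(
            os.path.join(audio_dir, file_name),
            res_type="kaiser_fast", duration=2.5, sr=44100, offset=0.5,
        )
        mfccs = np.mean(
            librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13), axis=0
        )
        rows.append(mfccs)
    # Shorter clips produce shorter rows; pandas pads them with NaN,
    # which we replace with 0 to avoid errors during training.
    return pd.DataFrame(rows).fillna(0)
```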

The function loops through each of the files, loading it and extracting features from it using librosa. Finally, we fill the NA values with 0, as they would cause errors when passed to the neural network. The output data frame looks as follows:

We also need to divide the dataset into train and test sets.
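A sketch of this step, assuming scikit-learn’s train_test_split and LabelEncoder together with Keras’s to_categorical (the random_state is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

def prepare_datasets(features, labels):
    # 80% train / 20% test split.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )
    x_train, x_test = np.array(x_train), np.array(x_test)
    # Turn the string labels into one-hot vectors: zero everywhere
    # except at the index of the label.
    encoder = LabelEncoder()
    y_train = to_categorical(encoder.fit_transform(np.ravel(y_train)))
    y_test = to_categorical(encoder.transform(np.ravel(y_test)))
    # Add a trailing channel dimension so Conv1D layers accept the input.
    x_train = np.expand_dims(x_train, axis=2)
    x_test = np.expand_dims(x_test, axis=2)
    return x_train, x_test, y_train, y_test
```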

We start by using the scikit-learn module to divide the dataset into an 80% train and 20% test split. We then convert the data frames to NumPy arrays. The labels are currently strings, so we convert them to categorical (one-hot) vectors, which are zero everywhere except at the index of the label. Finally, we reshape the features so the neural network can accept them.
Model Creation
Let’s now build the model:
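The following is a plausible sketch consistent with the description below: a small Conv1D network with a 216-feature input and a 16-way softmax output. The layer sizes and the RMSprop learning rate are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

def create_model():
    model = Sequential([
        # Input: 216 features with a single channel.
        Conv1D(128, 5, padding="same", activation="relu", input_shape=(216, 1)),
        Dropout(0.2),
        MaxPooling1D(pool_size=8),
        Conv1D(128, 5, padding="same", activation="relu"),
        Dropout(0.2),
        Flatten(),
        # Output: 16 labels (8 emotions x 2 genders).
        Dense(16, activation="softmax"),
    ])
    optimizer = RMSprop(learning_rate=0.0005)
    callbacks = [
        ModelCheckpoint("best_model.h5", save_best_only=True),
        ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
        EarlyStopping(monitor="val_loss", patience=10),
    ]
    model.compile(
        loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"]
    )
    return model, callbacks
```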

Let’s dissect this function:
- We start by initializing the model
- We then add layers to the model, specifying the input shape, which contains 216 features, and the final layer, which contains 16 labels (8 emotions x 2 genders)
- We define our optimizer and its learning rate
- We define some callbacks for saving the best model, reducing the learning rate if the validation loss doesn’t decrease, and stopping training if the validation loss doesn’t decrease for 10 consecutive epochs
- We finally compile the model with the loss, the optimizer, and accuracy as the metric; the callbacks created above are passed later to model.fit during training
Model Training
Let’s train the model for 300 epochs and plot the losses:
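A sketch of the training loop and loss plot, reusing the functions above (the batch size is an assumption):

```python
import matplotlib.pyplot as plt

model, callbacks = create_model()
history = model.fit(
    x_train, y_train,
    batch_size=16, epochs=300,
    validation_data=(x_test, y_test),
    callbacks=callbacks,
)

# Plot the training and validation losses recorded by Keras.
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```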

The output plot will look like this:

Prediction
Let’s write a function that takes in the path of a file and predicts the output of the model:
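A sketch of the prediction function, assuming the same MFCC features as above, the saved best_model.h5, and the labels.csv written earlier. Since LabelEncoder sorts classes alphabetically, sorting the unique labels recovers the index-to-label mapping:

```python
import numpy as np
import pandas as pd
import librosa
from tensorflow.keras.models import load_model

def predict_emotion(file_path):
    # Extract the same features used at training time.
    signal, sample_rate = librosa.load(
        file_path, res_type="kaiser_fast", duration=2.5, sr=44100, offset=0.5
    )
    mfccs = np.mean(
        librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13), axis=0
    )
    # Pad or trim to the 216 features the model expects.
    features = np.zeros(216)
    length = min(len(mfccs), 216)
    features[:length] = mfccs[:length]
    features = features.reshape(1, 216, 1)

    model = load_model("best_model.h5")
    probabilities = model.predict(features)[0]
    # Map the highest-probability index back to its text label.
    label_names = sorted(pd.read_csv("labels.csv")["label"].unique())
    return label_names[int(np.argmax(probabilities))]
```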

We start by loading the audio file and converting it to features, as we did before. We then load the saved model and predict the output label. The model gives a probability for each label, so we extract the index with the highest probability and convert it to the label’s text (using the labels we saved to CSV earlier). Let’s try this function:

Building the API
For the API, there will be a single endpoint to which we pass the file path and which outputs the label. Let’s first create separate files for the modules above. Here are the names of all the files that we will be using:
- clean_and_train.py (For the functions above)
- api.py (For this section)
- index.py (For UI)
- Dockerfile (To containerize the application)
- templates/index.html (HTML file to show the UI)
clean_and_train.py:
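Since the full script isn’t reproduced here, a hypothetical driver tying together the functions sketched above could look like this:

```python
# clean_and_train.py -- hypothetical driver for the steps sketched above.
if __name__ == "__main__":
    data_dir = "../Audio_Speech_Actors_01-24"  # assumed extraction folder
    consolidate_files(data_dir)
    labels = create_labels(data_dir)
    features = extract_features(data_dir)
    x_train, x_test, y_train, y_test = prepare_datasets(features, labels)
    model, callbacks = create_model()
    model.fit(
        x_train, y_train,
        batch_size=16, epochs=300,
        validation_data=(x_test, y_test),
        callbacks=callbacks,
    )
```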

To clean the data and train the model, download the data into the folder one level above and run the following command:
python clean_and_train.py
Here’s the directory structure:
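Based on the file list above, it would look roughly like this (the labels.csv file is produced by clean_and_train.py):

```
emotion-analyzer/
├── clean_and_train.py
├── api.py
├── index.py
├── Dockerfile
├── labels.csv
└── templates/
    └── index.html
```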

We will be using Flask for our API, as it helps in building simple APIs really fast. I will show you the whole API code first and then explain it.
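A sketch of api.py, assuming the prediction function sketched above and a query-string parameter named path (both names are illustrative):

```python
# api.py -- a minimal Flask API sketch.
from flask import Flask, jsonify, request
from clean_and_train import predict_emotion

app = Flask(__name__)

@app.route("/", methods=["GET"])
def home():
    return jsonify({"message": "Emotion analyzer API is running"})

@app.route("/predict", methods=["GET"])
def predict():
    # The audio file path arrives as the "path" query-string parameter.
    file_path = request.args.get("path")
    label = predict_emotion(file_path)
    return jsonify({"label": label})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
```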

Let’s dissect this file:
- We start by importing the required libraries. These include Flask and the predict function from above.
- We define the app variable, which represents our web app.
- We then use @app.route, which is a decorator. All you need to know is that whenever we hit this route, the function below it is called.
- We have only included the GET method, as we are not passing any data from a form.
- We define the route for predicting the emotion.
- Notice the last line in each of the endpoints: jsonify. JSON is the standard data format in web applications, so we convert our output to JSON before sending it.
- Lastly, we run the app on port 80.
Let’s run this file using the command python api.py:

Currently, there is no UI, so we can’t view the app in our browser. Instead, we will use curl to access the API endpoint, as shown below. We need to keep the above terminal running and open a new tab for this. We pass the path of the audio file using the query string. Note that the parameter name must be exactly the same as the one used in api.py; this is required for the API to function correctly.
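For example, assuming the parameter is named path as in the sketch above and the data sits one folder up:

```
curl "http://localhost:80/predict?path=../Audio_Speech_Actors_01-24/03-01-06-01-02-01-12.wav"
```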

Building the UI
As before, let’s first see the code:
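A sketch of index.py, assuming an uploads folder and a form field named file (both names are illustrative):

```python
# index.py -- a sketch of the Flask UI server.
import os
from flask import Flask, render_template, request
from clean_and_train import predict_emotion

UPLOAD_FOLDER = "uploads"  # assumed folder for saving uploaded files

app = Flask(__name__)
app.config["UPLOAD_FOLDER] = UPLOAD_FOLDER".replace("]", "\"]") if False else None
app.config["UPLOAD_FOLDER"] = UPLOAD_FOLDER

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "GET":
        # No upload yet: just render the form.
        return render_template("index.html")
    # POST: read the uploaded file using the name defined in the form.
    uploaded = request.files["file"]
    os.makedirs(UPLOAD_FOLDER, exist_ok=True)
    save_path = os.path.join(UPLOAD_FOLDER, uploaded.filename)
    uploaded.save(save_path)
    # Predict the emotion and pass the label to the template.
    label = predict_emotion(save_path)
    return render_template("index.html", label=label)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
```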

Let’s dissect the code:
- We import and define the app as we did in the api.py file, but here we also define the upload folder for saving the uploaded files.
- Then we define the route as before, but here we define two methods, GET and POST, as we are uploading files from a form.
- In the case of GET, we just render the form.
- For the POST method, we read the files using the names defined in the form and save them in the folder.
- Finally, we predict the emotion using the predict function we defined above and send it to the template.
Let’s now look at the index.html file. Note that index.html must be placed in the templates folder; otherwise, Flask won’t be able to find it.
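A minimal sketch of templates/index.html, assuming the form field name file used in the server sketch above:

```html
<!DOCTYPE html>
<html>
  <body>
    {% if label %}
      <!-- A prediction exists: show the label. -->
      <p>Predicted label: {{ label }}</p>
    {% else %}
      <!-- No prediction yet: show the upload form. -->
      <form method="POST" enctype="multipart/form-data">
        <input type="file" name="file" />
        <input type="submit" value="Analyze" />
      </form>
    {% endif %}
  </body>
</html>
```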

Here, we check whether a predicted label exists and show either the label or the form accordingly.
We will then run the app using the following command:
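```
python index.py
```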

Below are the screenshots from the app:


Containerize the Application
To make our app more useful and easily accessible, we will containerize it using Docker. We will create a Dockerfile in the same folder as index.py. Let’s examine the Dockerfile:
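A sketch consistent with the steps described below; the base image, the library list, and the libsndfile dependency for librosa are assumptions:

```dockerfile
# Hypothetical Dockerfile for the emotion analyzer.
FROM python:3.9-slim

# librosa's audio backend needs libsndfile at runtime.
RUN apt-get update && apt-get install -y libsndfile1

# Create a working directory and copy the project into it.
WORKDIR /app
COPY . /app

# Upgrade pip and install the required libraries.
RUN pip install --upgrade pip && \
    pip install flask tensorflow librosa pandas scikit-learn matplotlib

# Run the UI server.
CMD ["python", "index.py"]
```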

The commands in the Dockerfile are executed one by one:
- First, we install Python (via the base image), create a new working directory, and copy all the contents of the current directory into it.
- Then, we run the commands to upgrade pip and install all the necessary libraries.
- Finally, we run the index.py file which runs the server of our app.
To build an image, first go inside the folder where your Dockerfile exists, then run the following command:
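With the hypothetical image tag emotion-analyzer:

```
docker build -t emotion-analyzer .
```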

This will take some time to execute. After it’s done you can run your app as follows:
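For example, mapping host port 80 to the container’s port 80:

```
docker run -p 80:80 emotion-analyzer
```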

The -p flag maps a host port to the port the application runs on inside the container. You can view the running container in Docker Desktop and also view the application in the browser.
I have also pushed this Docker image to Docker Hub, so you can download and play around with the application here: emotion-analyzer
Conclusion
In this article, we looked into using TensorFlow and neural networks to analyze emotion in audio data:
- We started by preparing the features and labels needed for training the model.
- We then implemented neural networks using TensorFlow and trained the model on our dataset.
- After that, we created an API and a UI for our application using Flask.
- Finally, we built a Docker image so that anyone can use this application.