Cyclical data comes up often in machine learning; however, I rarely see it handled properly. Many people ignore its cyclical nature and treat it like any other categorical variable. Frankly, before I dug into cyclical data, I did the same. That approach isn’t a disaster, but it can certainly be improved.
Let’s take date/time, for example – how do you use them in training machine learning models? Well, you encode them just as you encode any other categorical variables; personally, One-Hot encoding always works as my go-to approach. But what if I told you that you’re missing out on a lot by using the date as categorical data, and it could prove much more useful if you were to use it the right way? Something worth trying, right?
So, today, we’ll be looking at the proper way we should be dealing with cyclical data and see how it benefits us.
Note: Here is the notebook if you want to jump straight to the code.
Why Handle Cyclical Data Separately?
Treating cyclical data in a special way when it could simply be used as a categorical variable raises a question: why deal with it separately? What exactly do we get in return? Let’s see.
Imagine you have monthly data in your dataset, and you’re encoding it the usual way to train your model. Suppose you use One-Hot encoding; you’ll now create 12 new variables, only one of which has a non-zero entry in any given row.
While this procedure is absolutely normal and how most data scientists do things, there’s a better way for it. As you know, when 12 months are complete, the cycle starts again. And the first month comes right after the 12th – but does your model know this? No.
However, if we handle cyclical data the right way, that changes. The model will know how the ends of the cycle are connected and that 1 comes right after 12, so the relationship between 11 and 12 is no different from the one between 12 and 1, which is the ideal behavior. Moreover, we won’t need 12 additional variables, which reduces the dimensionality significantly.
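To make the point concrete, here is a minimal sketch using only the standard library. The helper names (`one_hot`, `one_hot_distance`, `cyclic_gap`) are my own, made up for illustration. It shows that One-Hot vectors put every pair of months at the same distance, while the calendar itself says December and January are neighbours.

```python
import math

MONTHS = 12

def one_hot(month):
    """Return a 12-element one-hot vector for a month numbered 1..12."""
    return [1.0 if i == month - 1 else 0.0 for i in range(MONTHS)]

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two months."""
    return math.dist(one_hot(a), one_hot(b))

def cyclic_gap(a, b):
    """Number of months between a and b when walking around the calendar."""
    diff = abs(a - b) % MONTHS
    return min(diff, MONTHS - diff)

# Compare neighbouring months, the December/January wrap, and opposite months.
for a, b in [(11, 12), (12, 1), (1, 7)]:
    print(a, b, round(one_hot_distance(a, b), 3), cyclic_gap(a, b))
```

The One-Hot distance comes out identical for all three pairs, so the encoding carries no notion of which months are close; the cyclic gap is what we would like the model to see.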
Now, let’s move on and see these concepts with a real-life example.
Dataset Loading and Imports
To demonstrate the concepts we just went through, we need a dataset with some cyclical features. For this purpose, I’ve chosen a random energy consumption dataset since it has date data and is practical as well. You can get the dataset from the notebook link I’ve shared above. So, let’s start by importing all the required packages and the dataset.
Here’s what the first few rows of the dataset look like:
There’s a lot of extra information right there. Let’s only keep what we need – consumption and hour of the day, since it’s enough to study cyclical behavior.
Let’s look at the data head now:
That’s it. Now let’s pick a subset of the data and visualize it to see why ordinary encoding isn’t the best strategy.
Here are the results:
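As you can see, the plot shows a regular pattern that repeats exactly every 24 hours. The hour counter climbs steadily through the day and then drops straight back to its starting value, so a model fed the raw number sees the last hour of one day and the first hour of the next as far apart, even though they are adjacent. That isn’t ideal for training a machine learning model.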
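The jump at the end of each day is easy to check numerically. This is a small sketch on made-up hour values, assuming the hours are numbered 0 to 23; the real dataset may number them differently.

```python
# Two consecutive days of hourly readings, hours numbered 0..23 (an assumption).
hours = list(range(24)) * 2

# Difference between each hour value and the one before it.
steps = [later - earlier for earlier, later in zip(hours, hours[1:])]

# Every step is +1 except the single wrap at midnight.
print(sorted(set(steps)))
```

Every step between neighbouring hours is the same size except the one at midnight, where the raw value falls by 23 even though only one hour has passed.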
So now, how do we treat this cyclical data the right way? Let’s see.
Encoding Cyclical Data
Encoding the data with any encoder such as One-Hot is bad for two major reasons here:
Loss of the information the data provides, since the encoding doesn’t consider the cycles in the data.
An addition of 23 new features into the dataset.
Here’s what we’ll do:
We will use sine and cosine transformations to transform the data instead of the mainstream encoders. Here are the formulas:
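A common way to write the sine and cosine transformation is shown here. I am assuming that $x$ is the hour of the day and $T = 24$ is the length of the cycle; both symbols are my own notation.

```latex
x_{\sin} = \sin\!\left(\frac{2\pi x}{T}\right), \qquad
x_{\cos} = \cos\!\left(\frac{2\pi x}{T}\right)
```

Dividing by $T$ and multiplying by $2\pi$ maps one full cycle onto one full turn of the circle, so $x$ and $x + T$ receive identical values.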
Let’s translate the formulas into Python:
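A minimal sketch of the transformation with NumPy and pandas follows. The DataFrame is a synthetic stand-in, not the energy dataset, and the column names `hour`, `consumption`, `hour_sin` and `hour_cos` are assumptions of mine.

```python
import numpy as np
import pandas as pd

HOURS_IN_DAY = 24

# Synthetic stand-in for the article's data: two days of hourly rows.
# The column names "hour" and "consumption" are assumptions.
df = pd.DataFrame({
    "hour": np.tile(np.arange(HOURS_IN_DAY), 2),
    "consumption": np.zeros(2 * HOURS_IN_DAY),  # placeholder values only
})

# Map each hour to an angle on the circle, then take its sine and cosine.
angle = 2 * np.pi * df["hour"] / HOURS_IN_DAY
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

print(df.head())
```

For a feature with a different period, such as months, the same two lines apply with the cycle length changed to 12.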
Now, let’s take a look at the data once again and see how it looks now:
As you can see, the values make much more sense now; the last hour of one day and the first hour of the next now sit close to each other, meaning that the model can tell they’re connected. Also, we’re doing it with just two columns instead of the 23 extra ones that One-Hot encoding would add.
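Here is a quick check of that claim, again a sketch that assumes hours numbered 0 to 23 and uses helper names of my own (`encode`, `encoded_distance`). It compares the raw gap between two hours with the distance between their (sine, cosine) pairs.

```python
import math

HOURS_IN_DAY = 24

def encode(hour):
    """Return the (sine, cosine) pair for an hour numbered 0..23."""
    angle = 2 * math.pi * hour / HOURS_IN_DAY
    return (math.sin(angle), math.cos(angle))

def encoded_distance(a, b):
    """Euclidean distance between two hours after the encoding."""
    return math.dist(encode(a), encode(b))

# Two ordinary neighbours, the midnight wrap, and two opposite hours.
for a, b in [(22, 23), (23, 0), (0, 12)]:
    print(f"{a:>2} -> {b:>2}: raw gap {abs(a - b):>2}, "
          f"encoded distance {encoded_distance(a, b):.3f}")
```

After encoding, the step across midnight is exactly as small as any other one-hour step, while hours on opposite sides of the day end up farthest apart.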
However, you might be wondering, why do we need both the sine and cosine functions if one can do the job? Let’s take a look at the graphs of both functions before I answer that:
If you look closely, you may spot a problem in the graphs: the values repeat within a single cycle. Focus on the sine graph between 24 and 48 on the x-axis. A horizontal line drawn across it crosses the curve at 2 points within the same day, so two different hours share the same sine value, and the same is true of cosine. Obviously, this is not the behavior we want, and it is why we use both functions together: the pair of values is unique for every hour.
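The ambiguity can be seen with two specific hours. In this sketch (hours numbered 0 to 23 by assumption, values rounded to six decimals to hide floating-point noise), hours 2 and 10 share a sine value but differ in cosine.

```python
import math

HOURS_IN_DAY = 24

def sin_cos(hour):
    """Return the rounded (sine, cosine) pair for an hour numbered 0..23."""
    angle = 2 * math.pi * hour / HOURS_IN_DAY
    return round(math.sin(angle), 6), round(math.cos(angle), 6)

sin_2, cos_2 = sin_cos(2)
sin_10, cos_10 = sin_cos(10)
print("sine:  ", sin_2, sin_10)   # identical for both hours
print("cosine:", cos_2, cos_10)   # opposite signs, so the pair differs

# With both values together, every hour of the day gets its own pair.
pairs = {sin_cos(h) for h in range(HOURS_IN_DAY)}
print(len(pairs))
```

Sine alone cannot tell hour 2 from hour 10, but no two hours share the same (sine, cosine) pair.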
Let’s see what happens if we draw a scatter plot using both the cosine and sine functions, which will further back my point.
As expected, we get a perfect circle. It makes sense to represent cyclical data in the form of a circle, no?
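The circle can also be confirmed without a plot. This short sketch (hours numbered 0 to 23 by assumption) measures how far each encoded hour lies from the origin.

```python
import math

HOURS_IN_DAY = 24

# Distance from the origin for every encoded hour.
radii = []
for hour in range(HOURS_IN_DAY):
    angle = 2 * math.pi * hour / HOURS_IN_DAY
    radii.append(math.hypot(math.sin(angle), math.cos(angle)))

# Both values are 1 up to floating-point error: all points lie on the unit circle.
print(min(radii), max(radii))
```

Every hour sits at distance 1 from the origin, which is exactly what the scatter plot shows.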
That pretty much wraps it up for us.
We saw that we use cyclical data such as date/time all the time, and that the usual way of handling it isn’t right. Through some visualizations, we saw that ordinary encoding alone does not suffice: the model misses out on important information that the cyclical data carries.
Encoding alone does not capture how the end and start of a cycle are connected, which we were able to achieve by using sine and cosine functions. Moreover, we avoided the extra dimensionality that One-Hot encoding would have added to our dataset.
That’s it for today! I hope you enjoyed reading the article.