In this article I tackle the following problem: how to define and distinguish anomalies (spikes, peaks, and outliers in data) in real-life production situations. Typically, data drift leaves us without a reference level. If we do not know what is normal, how do we decide what is abnormal?
During EDA (exploratory data analysis) I often group the data points along a certain axis in order to examine their frequency distribution. In the Python Pandas library, the relevant method is DataFrame.value_counts() (or Series.value_counts() for a single column). In my Sopra Steria projects, I often work with data representing incidents in an IT infrastructure. Below I have grouped my incidents by the time of day when they originated. The reason for doing so is to isolate the anomalies, that is, the times of day with an abnormally high concentration of events. Here is an example:
Above, the x-axis indicates the time of day (between 0:00 and 23:59), while the y-axis shows the number of events per minute. The frequency chart reveals a handful of time points during the day that receive an abnormal concentration of events. This observation can help to quickly identify problems specific to projects, clients, applications or network segments. Those anomalies are clearly visible.
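The per-minute grouping described above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the chart: the function name events_per_minute and the sample timestamps are hypothetical.

```python
import pandas as pd

def events_per_minute(timestamps: pd.Series) -> pd.Series:
    """Count events for every minute of the day (0..1439)."""
    ts = pd.to_datetime(pd.Series(timestamps))
    # Collapse the date: keep only the minute of day at which the event originated
    minute_of_day = ts.dt.hour * 60 + ts.dt.minute
    counts = minute_of_day.value_counts()
    # value_counts() omits minutes without events, so add them back as zeros
    return counts.reindex(range(24 * 60), fill_value=0)

# Hypothetical incident timestamps, for illustration only
incidents = pd.Series([
    "2023-01-01 03:30:10",
    "2023-01-02 03:30:45",
    "2023-01-02 14:05:00",
])
freq = events_per_minute(incidents)
print(freq[freq > 0])
```

Reindexing to all 1440 minutes matters for the later steps: quantiles and moving averages would be distorted if empty minutes were silently dropped.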
Why the classical approach rarely works
Unfortunately, I picked an easy example above. Most real-life situations are trickier, and the anomalies are less visible. Consider Example B:
Here the textbook approach will not work. In the classical understanding, anomalies are outliers: a certain percentage of the highest peaks in the frequency distribution. There are a few classic definitions of what makes an outlier. For a normal distribution, some people use a threshold of Q3 + 1.5 * IQR (Q3 is the third quartile, IQR is the interquartile range). For non-normal distributions, the 0.975 quantile is sometimes used to cut off the highest 2.5% of peaks. So let’s try it on our Example B. We will trim off the peaks using the IQR rule. The result is below. The peaks are in yellow. Below the main chart, two focal areas have been magnified to better examine the detail.
The reason for my concern is the yellow area between 3:30 and 4:00 a.m., visible in focus area 1. Even though these peaks are indeed the highest in the absolute sense, I am not convinced that they constitute an anomaly in the business sense. It rather looks as if the entire project experiences a higher workload between 3:30 and 10:00 a.m. Some spikes observed during that period are still quite average relative to the higher workload around them.
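For reference, the two textbook thresholds mentioned above can be computed along these lines. This is only a sketch: the helper names and the toy series are hypothetical, and the toy data stands in for the per-minute event counts.

```python
import pandas as pd

def iqr_threshold(freq: pd.Series, k: float = 1.5) -> float:
    """Upper fence of the IQR rule: Q3 + k * IQR."""
    q1, q3 = freq.quantile(0.25), freq.quantile(0.75)
    return q3 + k * (q3 - q1)

def quantile_threshold(freq: pd.Series, q: float = 0.975) -> float:
    """Value below which a fraction q of the data points falls."""
    return freq.quantile(q)

# Toy per-minute counts, for illustration only
freq = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
peaks = freq[freq > iqr_threshold(freq)]
print(iqr_threshold(freq), quantile_threshold(freq))
print(peaks)
```

Both thresholds are global: a single number is applied to the whole day, which is exactly why a period of generally higher workload ends up flagged in its entirety.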
Then is there a better way to define an outlier in our context?
Your enemy: the data drift
We need a reference level that defines what should be considered ‘usual’ in a given data set. For instance, if we suspect that the higher volume of incidents between 3:30 and 4:00 might be just normal in the given project, why not use the previous months as the reference level? Sadly, a quick look at the same data, split into 5 consecutive monthly slices, shows how bad an idea this would be:
Firstly, we have some spikes in every month. To define a reference model of a ‘usual’ workload, we would need a time period without anomalies, but there isn’t one. Secondly, there is the data drift. The characteristics of the daily workload change with time, and every month is slightly different. This data drift is best illustrated by the time period between 3:30 and 4:00 a.m. in each of the 5 months. In the end, we conclude that there is no such thing as the ‘usual’ processing level during that time frame.
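A month-by-month comparison of this kind can be sketched as below. The function name monthly_minute_profiles and the sample timestamps are hypothetical; the idea is simply one minute-of-day profile per calendar month, side by side.

```python
import pandas as pd

def monthly_minute_profiles(timestamps: pd.Series) -> pd.DataFrame:
    """Rows: minute of day (0..1439). Columns: month as 'YYYY-MM'. Values: event counts."""
    ts = pd.to_datetime(pd.Series(timestamps))
    minute = (ts.dt.hour * 60 + ts.dt.minute).rename("minute")
    month = ts.dt.strftime("%Y-%m").rename("month")
    profiles = pd.crosstab(minute, month)
    # Include minutes of the day in which no event occurred in any month
    return profiles.reindex(range(24 * 60), fill_value=0)

# Hypothetical incident timestamps, for illustration only
incidents = pd.Series([
    "2023-01-05 03:30:00",
    "2023-01-09 03:30:20",
    "2023-02-03 03:45:00",
])
profiles = monthly_minute_profiles(incidents)
# Compare the 3:30 to 4:00 a.m. window (minutes 210..239) across months
print(profiles.loc[210:239].sum())
```

If the per-month totals for the same window differ substantially, no single month can serve as a stable reference for the others.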
The paradigm shift
This brings us to a question: how do we define an anomaly? And what really makes an outlier?
When you think about it, there really isn’t a universal mathematical formula applicable to every case. In general, an outlier is an ‘odd’ data point that is unusual against a given reference level, or against a reference population of data points. But if such a stable reference context is absent, then the concept of an outlier (or an anomaly) becomes an empty phrase. Poorly defined concepts bring confusion and lead to false conclusions.
With this observation, we are ready for a paradigm shift: hunt for features instead of anomalies. We should stop looking for anomalies in the data, if we cannot define them. Instead, we should be looking for particular features of the histogram. Note: here by feature I mean some specific histogram shape, not a feature in the machine learning sense.
So here is a new definition of the feature we are interested in: very narrow spikes, much higher than their nearest surroundings. This makes our job easier.
The reference level
The concept of the ‘nearest surroundings’ can be approximated with a moving average. Below are three moving averages with different window sizes. Which one should we use?
The answer: the moving average window size must be proportional to the size of the feature that we want to isolate. Our interest is in the narrowest spikes, which are 1 minute wide, so a window size of 3 to 12 minutes is appropriate. Programmer note: Pandas’ DataFrame.rolling() helps to calculate the moving average, but it is not ideal. Additional processing is needed to correctly calculate the moving average right before and after the edges of the period of interest, which in our case is midnight.
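One possible way to handle the midnight edge is to treat the day as circular: pad the series with a few minutes from the opposite end before calling rolling(), then cut the padding off again. The sketch below assumes this wrap-around is appropriate for the data; the function name circular_moving_average and the default window of 7 minutes (an arbitrary value within the 3 to 12 minute range) are my own choices for illustration.

```python
import numpy as np
import pandas as pd

def circular_moving_average(freq: pd.Series, window: int = 7) -> pd.Series:
    """Centered moving average that wraps around midnight."""
    if window < 3 or window % 2 == 0 or window > len(freq):
        raise ValueError("window must be odd, at least 3, and no longer than the series")
    pad = window // 2
    values = freq.to_numpy(dtype=float)
    # Prepend the last minutes of the day and append the first ones
    padded = np.concatenate([values[-pad:], values, values[:pad]])
    smoothed = pd.Series(padded).rolling(window, center=True).mean()
    # Drop the padding so the result lines up with the original index
    return pd.Series(smoothed.to_numpy()[pad:pad + len(values)], index=freq.index)

# Toy day with a single spike exactly at midnight (minute 0)
freq = pd.Series(np.zeros(24 * 60))
freq.iloc[0] = 7.0
reference = circular_moving_average(freq, window=7)
print(reference.iloc[[1436, 1437, 1439, 0, 3, 4]])
```

With a plain centered rolling() call, the first and last few minutes would either be NaN or be averaged over fewer points; with the padding, the spike at midnight raises the reference level equally on both sides of the day boundary.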
Isolation of spikes
We will still have a number of data points sticking above the moving average reference level (with an average as the reference, typically a large share of the data, often close to half). The deviation from the reference level is illustrated below (upper: the deviation along the x-axis, lower: its frequency distribution).
So we once more need to decide what makes a spike ‘interesting’. Now, however, it is easier. Having created a reference level, we could reuse one of the statistical definitions of an outlier and simply isolate a certain upper share of the deviations (such as those above the 0.975 quantile) as outliers. Yet it is logically more appropriate to link the threshold to the reference level itself. The deviation from twice the reference level (reference level * 2) looks promising: only a few black spikes are left sticking above the 0 level:
Now let us isolate the spikes again using this method. Here is the result, and it looks much better now. The isolated peaks are in yellow.
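The isolation step can be sketched as follows. I read the rule above as "keep the points whose value exceeds twice the local reference level"; both that reading and the names isolate_spikes and factor are assumptions made for this illustration. For brevity, the toy reference below uses a plain centered rolling mean without any special handling of the midnight edge.

```python
import pandas as pd

def isolate_spikes(freq: pd.Series, reference: pd.Series, factor: float = 2.0) -> pd.Series:
    """Return the data points that stick out above factor * reference."""
    deviation = freq - factor * reference
    return freq[deviation > 0]

# Toy data: a quiet period with one narrow spike, followed by a busier plateau
freq = pd.Series([2, 3, 2, 30, 3, 2, 3, 9, 10, 9, 10, 9], dtype=float)
reference = freq.rolling(5, center=True, min_periods=1).mean()
spikes = isolate_spikes(freq, reference)
print(spikes)
```

In this toy series the plateau of 9 to 10 events per minute is not flagged, because its own surroundings are equally busy, while the narrow spike of 30 is. A global IQR or quantile threshold would behave differently on the same data.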
Admittedly, technically we have not isolated outliers. Instead we have isolated certain shapes that seemed interesting. That approach has practical advantages. The results are more precise, and closer to what we intuitively mean when saying “this looks wrong” or “this does not belong here”.
Interestingly, the highest point of the original frequency distribution, sticking out at 3:30, has been missed. Whether that’s good or bad depends on the definition of what we are looking for. The method described here has interesting extensions for broader use. I may be able to cover them in a sequel article.
My simple method certainly does not provide a definitive answer on how to detect outliers and anomalies in situations where data drift effectively obscures the reference level. I am sure there are other, more sophisticated ways. Feel free to share comments.
Credits for this work go to Sopra Steria, which provided the data for this research. This work is part of our constantly growing expertise in the AIOps domain (that’s Data Science applied to IT Service Management). For any Service Desk, Infrastructure Management, or Application Management project, consider Sopra Steria as your service provider. The Lagoon Data Lake, which we operate internally, provides an opportunity for a vast spectrum of analytics. The resulting insight is aimed at optimizing your project for performance and cost.