I am looking at the distribution of a certain data set (left). It has two peaks (this is called ‘bimodal’), so I suspect that these are two superimposed populations. How do I split the data to rediscover the original two populations (right)? More technically, how do I split a bimodal distribution? This is a follow-up to my previous article, The Truth Behind a Histogram Dent, and continues the theme of my Data Science projects for Sopra Steria.
The purpose
The reason for this exercise is to discover the true meaning of the data. In real life, most data sets can be illustrated with a histogram that has one peak (‘unimodal’): income distribution or age distribution, for instance. Two very distinct peaks (‘bimodality’) may mean that we are looking at two separate phenomena camouflaged in one graph. Statistics describing such a data set may therefore be deceptive and lead to false reasoning. To reach clarity in interpretation, it is better to split the population and analyze each part separately. Example: imagine we look at a population of 100 farm animals and find, to our surprise, that the average number of legs per animal is 3. Noting that the set consists of 50 sheep and 50 chickens, it makes sense to split the population in two and look at each half separately, to reach more practical conclusions.
An important prerequisite is that the goal is to understand the true meaning of the data. It is a subtle but important point: our goal is not to achieve the split shown on the right above. That imaginary split is only hypothetical; the reality may be different. The real goal is to find the root cause of the bimodality. Only then will we discover its true shape.
Notably, I already solved this problem once in last year’s article, using data clustering. I was not satisfied: it worked technically but did not reveal the meaning behind the bimodality. Here is an improved attempt.
The process
In my example, the data represents technical alarms in a sizeable IT infrastructure (thousands of servers and applications, and millions of events). Along the x-axis is the response time in minutes (how quickly those alarms are processed and fixed). We often refer to these handling times as deltas (the difference between the start and end of processing of an event). In such a situation, we would typically see a right-skewed distribution of the handling time deltas with one mode, like the one illustrated below, where the single mode indicates that most events are processed within 17 minutes:
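As a side note, here is a minimal sketch of how such deltas might be computed and plotted with pandas and matplotlib. The file name and the start_time/end_time columns are assumptions for illustration, not the actual schema of the project.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical schema: one row per event, with processing start/end timestamps.
events = pd.read_csv("events.csv", parse_dates=["start_time", "end_time"])

# Handling time delta in minutes.
events["delta_min"] = (events["end_time"] - events["start_time"]).dt.total_seconds() / 60

# We would normally expect a right-skewed, unimodal histogram here.
events["delta_min"].plot.hist(bins=100)
plt.xlabel("handling time delta (minutes)")
plt.show()
```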
However, in one particular project, we stumbled upon this:
We have two modes instead of one. Since the shape looked unusual, it made sense to investigate. The conclusion will influence business-critical parameters: what is the median response time? How well are we meeting the SLA (service-level agreement) that requires events to be handled in X minutes? And what percentage of alarms are we closing within 10 minutes? In the previous article, I provided some hints on how this can be done. Here I want to extend them and propose a methodology.
The methodical approach
The data has a number of features. In our case, 200 columns: category, priority, time of day, source application, source server, error code, and so on. I suspect that some of these features hide a clue.
The process I propose is similar to peeling away the layers of an onion: I look at only one feature at a time. If this does not bring results, I look at the next one. Enlightenment comes after three steps or so. Let’s see.
Step 1
In the previous article (The Truth Behind a Histogram Dent) I described the first step of this process. I ranked the features with statistical tests (Pearson, Spearman, and Chi2) and established that the most meaningful feature was calendar time. Indeed, we discovered a point in time (the ‘epoch’) responsible for the population shift: before that point, the incident handling time was longer, and after it, shorter. This allowed us to split the data into before (orange) and after (blue) the epoch, as shown below. For details, read that article.
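For illustration, here is a minimal sketch of this kind of feature ranking, assuming the hypothetical events DataFrame from the earlier sketch plus a few numeric and categorical feature columns whose names are invented here. The 10-minute threshold for the "slow vs fast" target is also just an illustrative choice.

```python
import pandas as pd
from scipy import stats

# Binary target for the categorical tests: "slow" vs "fast" events.
slow = (events["delta_min"] > 10).astype(int)

# Numeric features: rank by Pearson and Spearman correlation with the handling delta.
for col in ["event_timestamp_epoch", "priority"]:              # invented column names
    r_p, _ = stats.pearsonr(events[col], events["delta_min"])
    r_s, _ = stats.spearmanr(events[col], events["delta_min"])
    print(f"{col}: pearson={r_p:.3f}, spearman={r_s:.3f}")

# Categorical features: rank by a chi-square test of independence against the target.
for col in ["category", "source_application", "error_code"]:   # invented column names
    table = pd.crosstab(events[col], slow)
    chi2_stat, p_value, _, _ = stats.chi2_contingency(table)
    print(f"{col}: chi2={chi2_stat:.1f}, p={p_value:.2g}")
```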
This was already quite successful. But did we understand the meaning of the bimodality? We found that the process of incident handling before the epoch was delayed by 10 minutes. But why? We later found that this was caused by an artificial delay introduced by the supervision system. The practical, business insight of this is significant: since this delay was not part of the handling process, it should not influence the SLA. In other words, when calculating the business statistics, we should only look at the blue population.
However, the result was still not fully satisfactory. The way the populations split suggests that we have only discovered part of the truth. In particular, why does the blue population (after the epoch) still have a tiny second peak? It seems that the second mode, located at around 11 minutes of processing time, somehow survived the epoch, albeit in a less conspicuous form.
Step 2
I separated the blue population and proceeded with a follow-up analysis for this set alone:
On this smaller sample, I introduced an artificial column, separating events into subsets A (below 10 minutes), B (10-20 minutes), and C (over 20 minutes). I then ran the feature ranking again, separately for numeric and categorical features, to see which features might correspond with the split.
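Here is a minimal sketch of how such an artificial column can be created with pandas, reusing the hypothetical column names from the earlier sketches; the after_epoch flag is an assumption standing in for the Step 1 split.

```python
import pandas as pd

# Keep only the events after the epoch (the blue population from Step 1).
blue = events[events["after_epoch"]].copy()     # "after_epoch" is a hypothetical flag

# Label each event A / B / C by its handling time delta in minutes.
blue["subset"] = pd.cut(
    blue["delta_min"],
    bins=[0, 10, 20, float("inf")],
    labels=["A", "B", "C"],
    right=False,
)
print(blue["subset"].value_counts())
```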
The ranking indicated that the next important feature could be the application name (every event is associated with the application that experienced the problem). Could it be that the second peak is actually composed of events associated with particular application names? The analysis here is problematic because we have thousands of apps. I tried to visualize the handling time delta per application using seaborn.catplot(). I saw that some apps indeed have a higher delta, but with so many apps it was difficult to see any pattern. Below, the y-axis shows handling time deltas, and the x-axis shows one boxplot per application:
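A minimal version of that visualization could look like the sketch below, again under the hypothetical blue DataFrame and column names introduced earlier.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One boxplot of the handling time delta per application.
sns.catplot(
    data=blue,
    x="source_application",
    y="delta_min",
    kind="box",
    height=5,
    aspect=4,
)
plt.xticks(rotation=90)   # with thousands of apps, the labels become unreadable anyway
plt.show()
```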
Then I had the following idea: the apps occupying the second bimodal peak must have sample means significantly different from the population mean. If so, they could be identified and separated by a simple t-test.
So I needed to transform the right-skewed distribution towards normal (sklearn.preprocessing.PowerTransformer), treat the events belonging to each application as a separate sample, run a one-sample t-test (scipy.stats.ttest_1samp()) against each of them, and identify and separate the samples (applications) whose means differ significantly from the population mean (the lowest p-values).
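Here is a minimal sketch of that procedure under the same hypothetical schema; the 0.05 threshold and the minimum sample size are arbitrary illustrative choices, not the values used in the project.

```python
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Transform the right-skewed deltas towards a normal distribution (Yeo-Johnson by default).
blue["delta_t"] = PowerTransformer().fit_transform(blue[["delta_min"]]).ravel()

population_mean = blue["delta_t"].mean()

# One-sample t-test per application: does this app's mean delta differ from the population mean?
suspects = []
for app, sample in blue.groupby("source_application")["delta_t"]:
    if len(sample) < 10:                   # skip apps with too few events to test
        continue
    _, p_value = stats.ttest_1samp(sample, population_mean)
    if p_value < 0.05:                     # low p-value: the mean differs significantly
        suspects.append(app)

print(len(suspects), "suspect applications")
```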
The operation was partially successful. First, the transformed distribution passed only half of the normality tests (Anderson–Darling, Shapiro–Wilk, Kolmogorov–Smirnov, D’Agostino–Pearson), but I could not tell whether that was because of the original bimodality or for other reasons. Still, I could indeed identify the set of bogus applications, shown in orange below. Finally, by adjusting the p-value threshold I could broaden or narrow the number of apps, and hence the associated tickets. The result was interesting, but not fully satisfactory, because it contained records from both modes (shown below). I think this is because the t-test identified not only the hump (the second peak) we were looking for, but also many other unusual features of the distribution we were not looking for.
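For reference, the four normality tests mentioned above can be run roughly like this with scipy.stats, here applied to the hypothetical transformed column from the previous sketch.

```python
from scipy import stats

x = blue["delta_t"].to_numpy()

# Four common normality tests; a small p-value (or, for Anderson-Darling, a statistic
# above the critical values) suggests the data is not normal.
print("Shapiro-Wilk:      ", stats.shapiro(x[:5000]))    # p-value unreliable for N > 5000
print("Kolmogorov-Smirnov:", stats.kstest(x, "norm", args=(x.mean(), x.std())))
print("D'Agostino-Pearson:", stats.normaltest(x))
print("Anderson-Darling:  ", stats.anderson(x, dist="norm"))
```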
So I ditched this process and tried one more method, which eventually worked best: a manually constructed, individually designed statistic. I simply selected the apps that had the majority of their tickets within the suspect area, that is, between 10 and 20 minutes. Here are the results:
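A minimal sketch of that selection, under the same hypothetical schema and with “majority” read as more than half of an app’s tickets:

```python
# Fraction of each application's tickets that fall into the suspect 10-20 minute range.
in_hump = blue["delta_min"].between(10, 20)
hump_share = in_hump.groupby(blue["source_application"]).mean()

# Keep the apps where the majority of their tickets sit in the hump.
suspect_apps = hump_share[hump_share > 0.5].index
suspect_tickets = blue[blue["source_application"].isin(suspect_apps)]

print(len(suspect_apps), "applications,", len(suspect_tickets), "tickets")
```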
This simple method worked better than the advanced statistics described earlier. Using it, we identified 160 applications (out of 3,000) that produced alarms which, for some reason, were delayed by 10 minutes. As can be seen, the 2,000 alarms belonging to this set (marked orange) cover the hump quite well. So it is likely that we identified the feature responsible for the hump. In other words, if those apps were not there, we would not have bimodality.
We are closer to the answer, but we are still not there. We know (or rather, suspect) which apps cause the problem, but we still don’t know why. Why do those 160 apps show slower response times?
Step 3
We peel the onion a third time. Maybe the application name only holds a clue to the true secret, which is hidden in another feature?
I took the suspect set of 2,000 records and tried to match it against the features once again. Is there a feature that correlates with whether a record belongs to this set or not? This time I got an intriguing clue. The feature that correlated most was the automated ticket description field. This is an automatically assigned textual label containing the technical context of the event’s creation. For instance, if the problem occurred during an application backup, the backup system leaves its stamp here. Quick tests revealed a striking similarity in this field: the records all shared one of only two labels (out of over one hundred possibilities)! They were either created in association with a certain backup system or with a certain event routing system.
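One quick way to see such a clue is to compare the label frequencies inside the suspect set with those in the rest of the population. A hedged sketch under the same hypothetical schema, where ticket_description is an assumed column name:

```python
import pandas as pd

is_suspect = blue["source_application"].isin(suspect_apps)

# Share of each automated description label inside vs. outside the suspect set.
comparison = pd.DataFrame({
    "suspect": blue.loc[is_suspect, "ticket_description"].value_counts(normalize=True),
    "rest": blue.loc[~is_suspect, "ticket_description"].value_counts(normalize=True),
}).fillna(0)

# Labels heavily over-represented in the suspect set stand out immediately.
print(comparison.sort_values("suspect", ascending=False).head(10))
```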
The conclusions are quite interesting. We spent a long time chasing the reason for the bimodality among the application names. The reason was not there. But the statistics were not lying: the apps were indirectly associated with the reason and eventually led us to it. By sorting the tickets by their app names, we were able to narrow the set of suspect tickets and pinpoint the reason for the second hump.
The business meaning
How does all this translate to business? Actually, the practical conclusions are quite profound, because, in the end, we understood in detail what was happening: certain technical systems hold alarms, events, and tickets for 10 minutes before passing them to the next processing stage. The reasons for this delay are technical: the alarms are often false, so it makes sense to initially do nothing and wait a certain period until they automatically die off and “turn green”, in the operators’ jargon.
Before the epoch, most of these systems had a 10-minute delay. After the epoch, almost all of them were switched to a 0-minute delay. All… except two: a certain backup system and a certain ticket routing system kept their 10-minute delay. Importantly, we understood that the artificial delay did not disappear with the epoch: it was still there, in some systems.
This means that the initial SLA calculation was wrong. Below is the original bimodal chart, now split in three. Light blue shows the events before the epoch, heavily affected by the delay. Red shows very recent events, still subject to the artificial delay. Navy blue shows all the remaining events: only this part shows the actual dynamics of event processing.
And here are the numbers. For instance, let’s answer this question again: how many alarms do we handle within 10 minutes? The original answer was 42.53%. After discarding the light blue set, the ratio grew to 61.02%. After we also discarded the red set, the ratio became 64.84%, which is the final answer. The difference is quite significant, and the original calculation was essentially wrong. This is why it sometimes makes sense to split bimodal populations.
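For completeness, this kind of recalculation is essentially one line per subset. A hedged sketch, where pre_epoch and delayed_system are hypothetical boolean flags on the full events DataFrame:

```python
def share_within_10_min(frame):
    """Percentage of alarms handled within 10 minutes."""
    return round(100 * (frame["delta_min"] <= 10).mean(), 2)

print(share_within_10_min(events))                                        # all events
print(share_within_10_min(events[~events["pre_epoch"]]))                  # without the light blue set
print(share_within_10_min(
    events[~events["pre_epoch"] & ~events["delayed_system"]]))            # without the red set as well
```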
Summary: ‘peel the onion’ method
The method deployed here looks at successive features, always one at a time, to see how they affect the situation. The deeper we go, the closer we get to understanding the true nature of the observed phenomenon.
We first looked at calendar time. We saw a moment in time that indeed split the population in two, but we did not understand why. However, we narrowed the set for further analysis.
We then looked at application names. We found that a certain number of apps mapped to the hump, but we still did not understand what they have in common. Nevertheless, we narrowed the set again.
We then looked at technical labels. We found that those apps had matching labels. It is the meaning of those labels that held the answer to why there were two modes.
The final observations are:
- Note that the final split did not fully reflect the originally envisioned result. In fact, we found not two but three separate sets.
- The problem with this method is that it is manual and time-consuming, as opposed to the easy, automatic methods I proposed last year in How to isolate a spike. But those methods provide only a technical solution: they split the populations correctly but do not help us understand the reason for the bimodality.
- Interestingly, if we had looked at the labels from the very beginning, before narrowing the sets, we would not have found anything. Why is that? I leave it as homework for the reader.
Update 2021-11-20
This article, which I published recently, covers a similar case, with a particular focus on one of the methods: the Chi-square test for independence.