I have recently been working a lot with IT Infrastructure Management data. At Sopra Steria, we manage sizeable ecosystems for our corporate clients, comprising thousands of applications and infrastructure elements. We handle events, incidents, alarms, and support tickets. We process thousands of such application tickets per day (which maps to millions of technical events in the underlying infrastructure). One interesting area of Data Analytics work is finding ways to accurately tell apart various groups of events, to answer questions like:

  • which applications are most troublesome? Maybe they tend to throw unexpected bursts of incidents, or alarms that take longer to fix?
  • what are the characteristics of working with incidents of various priorities?
  • are some projects more predictable than others?
  • what is the hourly, daily, monthly workload distribution across clients?

These things are critical for PMs to plan their workload ahead, secure resources, and build effective teams. So an important part of my work is to empower project owners to understand and quantify differences between populations of events (stemming from various projects, applications, service teams, or customers). My approach is simple and empirical, based on observation and frequent dialogue with operational teams. The techniques below have worked well for me, even though some may not be fully correct or compliant with the literature, so double check before replicating anything (and drop me a note so I improve too). So here is the thought process, roughly.


Step 1: launching the basic thought process

I should start with a simple observation. AI and Machine Learning? Nope. I use these techniques rarely. Here is why. Before trying any ML/AI, I want to understand the data very well, so that I can devise optimal feature engineering for a possible classification. This process of data understanding often takes days and involves Exploratory Data Analysis, statistical tests, and a lot of work with histograms. During this phase, I often discover solutions and answers to the problem statement, which are perhaps not final, but are good enough. Then I stop. I don't get to ML/AI, because the work is done before it even got started. Let me now proceed to the methods I often use to compare populations of data.

Here is an example. Below are a few populations of incidents. To get a feel for the data, the first thing I would do (after cleaning and normalizing the data) is to plot their histograms along a numeric dimension that has practical importance for the project. More often than not, this dimension is related to time (execution time, resolution time, closure time, minute of the day, or various deltas, such as the delta between start and close time, total lifetime, and so on). Here are a few such histograms, showing time deltas of several populations of events.
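As a minimal sketch of that first step (assuming a pandas DataFrame df with hypothetical opened_at and closed_at timestamp columns, and a client column identifying the population), it could look like this:

import pandas as pd
import matplotlib.pyplot as plt

# time delta per incident, in hours (column names are illustrative)
df["delta_h"] = (df["closed_at"] - df["opened_at"]).dt.total_seconds() / 3600

# one histogram per population, on a shared axis
for client, group in df.groupby("client"):
    plt.hist(group["delta_h"].dropna(), bins=50, alpha=0.5, label=client)
plt.xlabel("resolution time [hours]")
plt.ylabel("count")
plt.legend()
plt.show()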

If you end up with histograms like the ones above, the task is easy: we can immediately tell the difference: the first category of events grows in volume with time, while the second diminishes. But I picked an easy example. Most commonly, I end up with the annoying situations demonstrated below.

These populations are right-skewed, and that is typical. They resemble lognormal distributions, but they are not (the data typically does not pass statistical tests for log-normality). Is there a meaningful difference between them? For instance, which of those groups of events are “easy”, in the sense that their time delta is usually shorter? We can’t tell. The histograms are too similar. What now?

Step 2: the core insight with basic statistics, KDE and CDF

Start with the low hanging fruit: look for the medians and compare them. Not the means. Do we have outliers? Of course we do. Then the mean is obstructing the truth – don’t even look at it. The median, robust against outliers, is our friend. The median (or any other percentile) is the first signal that a difference may exist. A word of caution: when comparing the medians, don’t do it directly – better use the Cohen’s d statistic.
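For reference, here is a minimal sketch of how Cohen’s d can be computed for two samples (note that the classic formula is based on means and the pooled standard deviation; a and b are hypothetical arrays of time deltas):

import numpy as np

def cohens_d(a, b):
    # classic Cohen's d: difference of means scaled by the pooled standard deviation
    a, b = np.asarray(a), np.asarray(b)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)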

But that alone is a poor indicator. It is just one number, which tells nothing about the shape of the population. Two populations with a similar median may be quite different! One way out is to smooth the diagram with a KDE and observe its shape. KDE (Kernel Density Estimation) is, in simple terms, a function that prettifies a histogram by smoothing out the details. You can draw it either in Seaborn or in matplotlib. Here it is:
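A minimal sketch of such a plot in Seaborn, assuming deltas is an array of time deltas:

import seaborn as sns
import matplotlib.pyplot as plt

# histogram with a KDE curve drawn on top of it
sns.histplot(deltas, bins=50, kde=True, stat="density")
plt.xlabel("time delta [hours]")
plt.show()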

The KDE is the smooth line over the bar diagrams. You still don’t see much, but the trick is to remove the histogram bars and superimpose the KDEs.

Nice! We start seeing the difference between the histograms. But you need to be very careful with KDEs. They are not designed for comparing the subtleties of a chart. As an example, here are two superimposed KDEs at work:

From this chart we could deduce that the median of the yellow population should lie more towards the right. In other words, if the x axis represents response times of incidents, the blue alarms have shorter response times, while the yellow alarms have longer ones. Let’s superimpose this over the actual bars:

The histogram bars confirm what we thought. But look what happens towards the edges. And here’s a much worse, “tricky” example.

The KDE suggests that the yellow histogram has a lower peak. After superimposing the bars, we see that this is far from the truth:

To fix this, you could to some extent experiment with the KDE kernels and bandwidths, to end up with a better approximation:
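In recent Seaborn versions the practical knob is the bandwidth rather than the kernel itself (the backend is a Gaussian KDE); a minimal sketch, assuming a is one of the arrays of time deltas:

import seaborn as sns
import matplotlib.pyplot as plt

# a narrower bandwidth follows the histogram bars more closely, at the cost of noise
for bw in (1.0, 0.5, 0.25):
    sns.kdeplot(a, bw_adjust=bw, label=f"bw_adjust={bw}")
plt.legend()
plt.show()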

But I don’t do it, as there are more caveats in the interpretation of KDEs. To properly draw a KDE in Seaborn you need to slice the x axis data to match the section shown on the chart. Then you cannot infer any statistics from that subset any more! KDEs are often misleading, especially around the edges, or in situations where the population is small. I use them for illustration, but not for conclusions.

The next technique, which I trust more, is the CDF (cumulative distribution function), in other words the cumulative histogram. CDFs are less appealing than KDEs, in fact quite boring to look at, and they hide many details, but they tell the truth. A CDF maps a value to a percentile. You will instantly understand what a CDF is from the diagram below, showing a CDF (orange line) superimposed on a histogram of the same data (blue bars):
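A minimal sketch of how such an empirical CDF can be overlaid on the histogram (again assuming deltas is an array of time deltas):

import numpy as np
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax1.hist(deltas, bins=50, color="tab:blue", alpha=0.6)   # histogram (counts)
ax2 = ax1.twinx()
x = np.sort(deltas)
y = np.arange(1, len(x) + 1) / len(x)                    # maps each value to its percentile
ax2.plot(x, y, color="tab:orange")
ax2.set_ylabel("cumulative fraction")
plt.show()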

Let’s draw the “tricky” example above again, this time as a CDF:

As we can see, the factual cumulative difference between the two is in the area between 10 and 30. It is now clear.

I use superimposed CDFs a lot for comparing populations. In the example below, comparing populations of many thousands of events coming from the same source in various years, it seems that alarms in 2019 lived shorter than in 2017, which could be linked to the improving effectiveness of service procedures:

Sometimes such subtle differences between CDFs can be exposed better with logarithmic scales:
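A minimal sketch of two superimposed empirical CDFs on a logarithmic x axis (deltas_2017 and deltas_2019 are hypothetical arrays of alarm lifetimes):

import seaborn as sns
import matplotlib.pyplot as plt

sns.ecdfplot(deltas_2017, label="2017")
sns.ecdfplot(deltas_2019, label="2019")
plt.xscale("log")   # spreads out the subtle differences near the origin
plt.legend()
plt.show()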

If I am interested in knowing how far apart these CDFs are positioned from each other, I use the K-S test (read below). CDFs also have some caveats. It is easy to misinterpret them when incorrectly slicing a range of the source data.

Step 3: quantifying insight into numbers

I typically spend some time with histograms, KDEs and CDFs, visualizing the differences between various data slices. Sometimes this is not enough, and other techniques come in handy: Seaborn’s violin plot (below), 2D density plots, scatter plots, each of them exposing various characteristics of the populations to compare:
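A minimal sketch of such a comparison with Seaborn’s violin plot, assuming the DataFrame df with the hypothetical client and delta_h columns from earlier:

import seaborn as sns
import matplotlib.pyplot as plt

# one violin per population; the shape shows the whole distribution, not just quartiles
sns.violinplot(data=df, x="client", y="delta_h")
plt.ylabel("resolution time [hours]")
plt.show()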

They all help with contextual data understanding, if you use them carefully. The next step is to verify the conclusions and automate the solution. This is the end of visualization. Now we need numbers. We need to find the metric that describes the differences between the populations in the most useful manner. There are numerous techniques here. Unfortunately I cannot use parametric tests, as most of the data is not normal, and, sadly, not lognormal either (even though it resembles a lognormal chart). Some useful metrics that remain are:

  • the median; for comparison use Cohen’s d
  • the mean, after removing the outliers (>3 sigma). It is not a very clean approach (outliers are determined from the mean, but removing them changes the mean itself!), but it often works, and the results tend to be more visible than with the median
  • the K-S test (Kolmogorov-Smirnov). A nice test, similar in nature to the CDF visualization: it quantifies the space between the CDF charts, i.e. how far apart they are. However, it is not supposed to work with discrete data. And most time data is discrete, because systems round time to the second or to the minute.
  • the k-sample Anderson-Darling test. Its output is cryptic and more difficult to read, but it works with discrete data. It gives more weight to the tails than K-S does (see the sketch after this list).
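A minimal sketch of these computations with NumPy and SciPy, assuming a and b are two arrays of time deltas and reusing the cohens_d helper sketched earlier:

import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

def trimmed_mean(x, n_sigma=3):
    # drop values farther than n_sigma standard deviations from the raw mean
    x = np.asarray(x)
    mask = np.abs(x - x.mean()) <= n_sigma * x.std()
    return x[mask].mean()

print("Cohen's d:       ", cohens_d(a, b))
print("trimmed means:   ", trimmed_mean(a), trimmed_mean(b))
print("K-S:             ", ks_2samp(a, b))
print("Anderson-Darling:", anderson_ksamp([a, b]))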

These tests provide numbers, either in the form of a p-value or some other statistic, allowing us to quantify the differences between populations that were visualized earlier. Then the solution can be automated. As an example, below I compare lifetime deltas of incidents from two big clients, using Python’s scipy.stats.ks_2samp() and anderson_ksamp(). The output reads:

1. abs of Cohen d: 0.2 small, 0.5 medium, 0.8 large:	0.31774098347710694
2. K-S: Ks_2sampResult(statistic=0.17490118577075098, pvalue=0.5807797311570602)
3. Anderson Darling: big is insignificant
Anderson_ksampResult(statistic=0.2136100046288539, critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), significance_level=0.25)

The statistical interpretation: Client A’s incidents are being fixed somewhat faster than those of Client B (Cohen’s d shows a small-to-medium difference of about one third of the pooled standard deviation); other than that, there are no anomalies distinguishing the shapes of the distributions (neither the K-S nor the A-D test shows anything significant). Business interpretation: if events are similar, yet one group is being solved faster, this suggests superior organization of one of those support projects. For instance, a more mature, larger operational team, or better tools integration.

If this is not enough, the next steps in the research would be to deploy clustering, t-SNE, PCA, and finally classification, to visualize and better understand the differences among populations. As stated earlier, my (limited) empirical experience shows that these are often not needed. But being a simple country boy, I let the big brains tackle these difficult matters.

Over the coming weeks I plan to continue this exciting journey through Sopra Steria data analytics.

