Analysts don’t like “why” questions. They are tough to answer. For instance, in a help desk analysis, it is easy to show which tickets are resolved faster, but difficult to say why. In my practice at Sopra Steria, I have found one tool especially useful for practical “why” questions: the Chi-Square test for independence.

Answering the “why”: the traditional way

Even for a statistician, it is not immediately clear how a Chi-Square test (which I will cover later) could help to answer “why” questions. To demonstrate how I use it, I first need to explain the alternative. Here is an example.

I want to know why some help desk tickets take a long time to resolve, while others take a short time. This is depicted in the histogram below. I am interested in the peak (a spike, a hump) at the far left, representing a group of tasks that are solved in under 1 minute, which is surprisingly quick. I want to know why.

To discover the reason for that peak, I should divide the data into two sets: records within the peak (blue below), and all the remaining records (orange below). Then I will test some ideas that come to mind. Perhaps some agents (service desk employees) solve problems quicker, because they are more experienced or more effective? Let’s visualize the distribution of tasks by agent name, within the peak and outside the peak:

The outcome is interesting. When it comes to the number of tasks closed in under 1 minute, the agent Anton is clearly the leader. This way we quickly identified a hint that may lead to a potential reason for the peak in the data.
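The split and the per-agent comparison described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the column names (`agent`, `resolution_minutes`), the sample values, and the 1-minute threshold constant are hypothetical and do not come from the real help desk data.

```python
import pandas as pd

# Hypothetical ticket data; column names and values are illustrative assumptions.
tickets = pd.DataFrame({
    "agent": ["Anton", "Anton", "Anton", "Berta", "Carl", "Berta", "Carl", "Anton"],
    "resolution_minutes": [0.5, 0.4, 0.8, 12.0, 45.0, 0.7, 30.0, 20.0],
})

# Assumed definition of the peak: tickets solved in under 1 minute.
PEAK_THRESHOLD_MINUTES = 1.0

# Flag each ticket as inside or outside the peak.
tickets["in_peak"] = tickets["resolution_minutes"] < PEAK_THRESHOLD_MINUTES

# Count tickets per agent, outside the peak (False) and inside it (True).
counts_by_agent = pd.crosstab(tickets["agent"], tickets["in_peak"])
counts_by_agent.columns = ["outside_peak", "in_peak"]

# The agent with the most tickets inside the peak is the first hint to follow up.
top_peak_agent = counts_by_agent["in_peak"].idxmax()

print(counts_by_agent)
print("Agent with most tickets in the peak:", top_peak_agent)
```

Plotting `counts_by_agent` as a bar chart would give a comparison of the kind shown in the figure, one chart per candidate feature.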

Why the traditional way does not work

Unfortunately, in real life such reasoning is not that easy. The “agent name” is just one example of a data feature. Real data can have hundreds of features, and each of those features can have hundreds or thousands of possible values. Guessing which feature is responsible for the peak is tedious and impractical. One would need to browse through hundreds of charts looking like this:

And here is where the Chi-Square test for independence (also known as chi2, described in the previous article) can shine. Here is a rendering of a Jupyter notebook demonstrating the power of chi2 on a different example.

The reason I use Jupyter notebooks is that technical readers can read the source code to see exactly what I did. Non-technical readers can simply skip the source code sections and still understand the concept.

[Jupyter notebook: chi2_peak]

Summary

The Chi-Square test does not directly answer the “why” question, but it provides hints that speed up the reasoning.

In our case, we found the following hint: the spike consists mainly of tickets whose “code” field has the value “duplicate”. What does this really mean?

The business interpretation may be as follows. Certain tasks are marked as duplicates very early in their lifetime. Then, after a certain retention period of inactivity, an automatic process transfers them to the closed state. So in fact, we are not seeing a spike in activity after 6 days, but a disproportionate number of fake tasks that are automatically closed after 6 days. Quite an important insight for the business owner of the project.

With the help of the chi2 test, this conclusion could be reached quickly. We could efficiently establish the feature that is associated with the spike: the task code. Remember, it was one of 300 possible choices. An example of another possible choice (feature), mentioned earlier in the article, would be the agent name. This time, however, if we tried to see which agent is responsible for the spike, we would fail, because the spike was simply not related to that feature. We would then need to repeat the same for 100+ other features, wasting hours of work. The chi2 test made this quicker.
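The idea of letting the test screen many features at once can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper `rank_features_by_chi2`, the column names, and the sample data are hypothetical. It relies on `pandas.crosstab` to build a contingency table per feature and on `scipy.stats.chi2_contingency` to test each one against the "inside the spike" flag.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def rank_features_by_chi2(df, target, features):
    """Return (feature, p_value) pairs, smallest p-value first.

    A small p-value suggests the feature is not independent of the target.
    """
    results = []
    for feature in features:
        table = pd.crosstab(df[feature], df[target])
        # The test needs at least two categories on each side.
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue
        _, p_value, _, _ = chi2_contingency(table.to_numpy())
        results.append((feature, p_value))
    return sorted(results, key=lambda item: item[1])


# Hypothetical data: 40 tickets inside the spike, 60 outside.
tickets = pd.DataFrame({
    "in_spike": [True] * 40 + [False] * 60,
    # "code" is strongly tied to the spike in this made-up sample.
    "code": ["duplicate"] * 36 + ["solved"] * 4 + ["duplicate"] * 5 + ["solved"] * 55,
    # "agent" is split evenly inside and outside the spike.
    "agent": ["Anton", "Berta"] * 20 + ["Anton", "Berta"] * 30,
})

ranking = rank_features_by_chi2(tickets, "in_spike", ["agent", "code"])
for feature, p_value in ranking:
    print(f"{feature}: p-value = {p_value:.3g}")
```

In this made-up sample, "code" comes out on top and "agent" shows no association. With hundreds of features, some small p-values will appear by chance alone, so the ranking is best treated as a list of hints to inspect rather than as proof.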

Notes and credits

Final note: the reasoning above is simplified. The Chi-Square test is not universal. It only works on categorical features, and only under certain conditions. Also, it is part of a larger family of statistical tests for independence, not to be discussed here. To read more: [don’t trust the data], [unmasking data in camouflage], [isolating the spikes]. The subject is quite broad. The goal of this material is to demonstrate how an advanced statistical tool, such as the Chi-Square test for independence, can be used to efficiently address a practical problem.
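One commonly cited condition is that the expected count in every cell of the contingency table should not be too small; a frequent rule of thumb is at least 5. The sketch below shows one way to check this, assuming that rule of thumb. The helper `expected_counts_ok` and the two sample tables are hypothetical; the expected counts come from `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency


def expected_counts_ok(table, min_expected=5.0):
    """Check the rule of thumb that every expected cell count is at least min_expected."""
    # chi2_contingency returns the expected frequencies as its fourth result.
    _, _, _, expected = chi2_contingency(np.asarray(table))
    return bool((expected >= min_expected).all())


# Hypothetical contingency tables (rows: feature values, columns: outside/inside the spike).
large_table = [[36, 5], [4, 55]]   # plenty of observations in every cell
small_table = [[3, 1], [1, 2]]     # too few observations for the approximation

print(expected_counts_ok(large_table))
print(expected_counts_ok(small_table))
```

When the check fails, merging rare categories or using an exact test may be more appropriate than the Chi-Square approximation.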

Credits: creating this material was possible thanks to my Data Science work for Sopra Steria. We always look for talent. If interested, [click here to contact our recruitment].

Answering Why (with Chi-Square)