My Machine Learning classifier’s prediction accuracy improves as the volume of training data grows. But at the same time, its precision falls. Why so? And how to fix it? Read on.
reducing the problem to classification
At Sopra Steria, we work with incidents thrown by applications. For analytics purposes, we designed and built a Data Lake where such data is collected and crunched. In previous posts, I discussed some of these analyses. But we can do far more, including an attempt to peek into the future. We want to know in advance things like: will system X throw a red alert within 24 hours from now? With such knowledge, the project team can take preventive measures that improve several KPIs: better SLA compliance, better system uptime, a better user experience, savings from fewer nightly calls to L3 experts, and better sleep for the team members.
As it turns out, problems are often preceded by various signals. By monitoring the infrastructure and collecting the right metrics, we can catch those signals. We then deploy Machine Learning: a classifier would read and interpret those signals and produce a binary output, C0 (“there will be no incident”) or C1 (“there will be an incident”). In jargon, we say that we reduced the problem to classification.
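To make this concrete, here is a minimal, hypothetical sketch of how such daily labels could be derived from raw alert timestamps. The column names, dates, and daily granularity are invented for illustration, not our actual schema:

```python
import pandas as pd

# Hypothetical input: one row per red alert thrown by system X (invented timestamps).
alerts = pd.DataFrame({"alert_time": pd.to_datetime(
    ["2019-03-02 14:20", "2019-03-05 01:10", "2019-03-05 22:45"])})
alert_days = set(alerts["alert_time"].dt.normalize())

# One row per day; the label says whether the *next* day saw at least one alert:
# C1 ("there will be an incident") or C0 ("there will be no incident").
days = pd.date_range("2019-03-01", "2019-03-07", freq="D")
labels = pd.DataFrame({
    "day": days,
    "label": ["C1" if day + pd.Timedelta(days=1) in alert_days else "C0"
              for day in days],
})
print(labels)
```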
picking the best classifier
There are many ways to build classifiers. In Python’s scikit-learn library, some of the popular engines are decision trees, Random Forest, SVC, logistic regression, and many more. In addition, we can use other packages such as XGBoost, or neural networks from Keras/TensorFlow. They all need training data, and we have a lot of that: on the data lake, we store many years of data, containing millions of call center incidents and billions of underlying infrastructure events. In general, the classifier’s confidence will grow when it is trained on a large volume of historical data.
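For illustration, here is a minimal sketch of how a handful of the scikit-learn classifiers mentioned above can be trained and compared; the synthetic dataset stands in for our real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix built from infrastructure metrics.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.6, 0.4],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # mean accuracy on the hold-out set
```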
Because during the training we will evaluate and compare many (potentially thousands of) classifiers, we need a metric to tell us which one is best. There are several choices, best explained with the so-called Confusion Matrix. Example: in the last 100 days, we had 90 days with no incident (we call such a situation C0, or negative) and 10 days with incidents (we call this C1, or positive). Before each of those days, we asked the classifier what would happen tomorrow. So the classifier ran 100 times, with this outcome:
- 80 True Negatives (TN): factual C0, correctly predicted
- 10 False Positives (FP): factual C0, incorrectly predicted as C1
- 5 False Negatives (FN): factual C1, incorrectly predicted as C0
- 5 True Positives (TP): factual C1, correctly predicted
The customary way to present this is the Confusion Matrix, like this:
|            | Prediction: C0 | Prediction: C1 |
|------------|----------------|----------------|
| Actual: C0 | TN = 80        | FP = 10        |
| Actual: C1 | FN = 5         | TP = 5         |
Now, how to tell if the classifier is good? One popular metric is Accuracy: the percentage of correct guesses (TN + TP) out of all predictions. In this case, we guessed correctly 85 times out of 100, so Accuracy = 85%.
But more often than not, Accuracy is not the right metric, because the choice of metric depends on what we need the classifier for. In this particular case, what I want to do is pass just the C1 predictions to the project owner; the C0 predictions are of no use to them. So rather than Accuracy, we are only interested in how well the classifier predicts C1. This is called Precision. In our case, Precision is 33%: we said “yes” 15 times (FP + TP), but we were correct only 5 times (TP). So our classifier has good Accuracy (85%) but poor Precision (33%). (Advanced readers will notice that I simplify, skipping other popular metrics such as Recall, F1 score, and AUC/AUROC. For a discussion of the deficiencies of all of those against proper scoring rules for imbalanced data, go here.)
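For reference, the same numbers can be reproduced with scikit-learn’s metric functions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Reproduce the 100-day example above: 80 TN, 10 FP, 5 FN, 5 TP.
y_true = np.array([0] * 90 + [1] * 10)                 # 90 days C0, 10 days C1
y_pred = np.array([0] * 80 + [1] * 10 + [0] * 5 + [1] * 5)

print(confusion_matrix(y_true, y_pred))     # [[80 10]
                                            #  [ 5  5]]
print(accuracy_score(y_true, y_pred))       # 0.85  -> (TN + TP) / total
print(precision_score(y_true, y_pred))      # 0.333 -> TP / (TP + FP)
```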
picking the optimal train period
All right then. We are interested in the classifier with maximum Precision, and we can train it with historical data. But how much data should we use? In general, the more data the better, but that statement is not universally true. In some cases, the classifier may overfit; in others, old historical data may differ from the current situation and confuse the classifier. So it makes sense to test the classifier on various period lengths to find the optimal training data size. In other words: is it better to train a classifier on 1 month of data, or on 1 year of data? We did not know, so we ran a test.
Here is the result. The x-axis shows the amount of training data, which at its maximum corresponds to 140,000 incident records from the 4-year period. The y-axis shows the metrics we are interested in.
The spike towards the left of the diagram can be ignored: with a very small training set, random fluctuations of the metrics are expected. The right side is what we are interested in. As can be seen, while we increase the size of the training set from 100 to 140,000 cases, Accuracy (black) slowly grows, from the original 53% up to 60%. That is a good sign. 60% is not a lot, but we could tune the classifier later, so it is okay as a first approximation. What is surprising is that, simultaneously, Precision (green) systematically falls, from 53% down to 42%. This is difficult to explain. I could imagine situations where a badly designed training process causes metrics to decrease or become unstable, but why would one metric systematically go up while the other systematically goes down?
why accuracy grows but precision falls
To understand the situation, we can go back to the Confusion Matrix. Accuracy depends on all four cells, while Precision depends only on the positive predictions (TP and FP). If Accuracy goes up, the total number of correct predictions (TN + TP) is growing. If Precision simultaneously goes down, then FP is growing relative to TP. When the cells of the matrix drift apart like this, one likely culprit is that the mix of positive and negative cases in the data itself is changing.
Then maybe it is not a classifier problem, but an input data problem. Is it possible that, while we fed more and more data to the classifier, the class distribution inside the data slowly shifted?
Let’s dig. Below is the entire data set used in this test, covering 4 years, 2016 – 2020. Green shows C0 events, blue shows C1 events, and red is the proportion between the two. The trend is not very obvious until we finally draw the aggregated proportion (black). Bingo. Originally, Class C0 made up 55% of the cases, but towards the end of the 4-year period, its aggregated share had dropped to 40%.
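For reference, here is a minimal sketch of how such a running class proportion can be computed; the file name and column names are placeholders, not our actual schema:

```python
import pandas as pd

# Placeholder input: one record per incident with a timestamp and a class label.
df = pd.read_parquet("incidents_2016_2020.parquet")   # hypothetical file and columns
df = df.sort_values("timestamp")

# Running share of C0 among all records seen so far:
# the aggregated proportion curve (black) in the plot above.
df["c0_share_so_far"] = (df["label"] == "C0").expanding().mean()
print(df[["timestamp", "c0_share_so_far"]].tail())
```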
So we found the reason. On closer inspection, we had made not one, but two systematic errors at once:
- we did not check the class distribution over the entire input set. We could have used stratified sampling, at the very least to make sure that the class distribution in each tested period remained the same;
- we intended to test classifiers against various lengths of training data, but we used a naive method: when testing the 1,000-case training period, we used only the first thousand records (the 2016 data only), while when testing the 100,000-case period, we used data from all four years. That is different data, and many things could have changed in it in the meantime (not just the class distribution). So the difference in performance between the two classifiers could be influenced by various factors: not just the amount of data, but also things hidden in the data. In the end, we could never tell whether an increase in performance was caused by more data or by other reasons.
the solution
The problem outlined above implies the solution. As a temporary remedy, we resampled the data, making sure the classes remain balanced 50/50 in each of the tested periods. We used undersampling (as opposed to oversampling techniques such as SMOTE), because we have enough data to afford losing some. Here are the new statistics; the aggregated proportion (black) confirms we are good this time:
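One possible way to do such per-period 50/50 undersampling, again assuming a hypothetical DataFrame with timestamp and label columns rather than our actual pipeline:

```python
import pandas as pd

df = pd.read_parquet("incidents_2016_2020.parquet")   # hypothetical file and columns

def undersample_50_50(frame: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Randomly drop rows of the majority class so both classes are equally frequent."""
    n = frame[label_col].value_counts().min()
    return (frame.groupby(label_col, group_keys=False)
                 .sample(n=n, random_state=0))

# Apply it separately to every tested period, e.g. per calendar year:
balanced = (df.groupby(df["timestamp"].dt.year, group_keys=False)
              .apply(undersample_50_50))
print(balanced["label"].value_counts(normalize=True))   # should come out 50/50
```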
We also made sure that whenever we evaluate a certain training-period length, all of the data is used and the results are averaged out. And here are the new classifier metrics, in relation to the training set size:
It now shows the expected growing trend, and it grows much faster than before. It also shows much higher absolute numbers than initially: both Precision and Accuracy were in the range of 0.5 – 0.6 before, and now both are about 0.75. This should not be interpreted as an improvement; it is rather an effect of the different balancing of the underlying classes. Most importantly, we reached the goal, which was to establish the optimal training set size. As can be seen, it is about 8,000 records, after which we reach optimal Accuracy and Precision, and adding more data does not influence the results in any meaningful way.
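A simplified sketch of the corrected evaluation described above, assuming X and y are arrays sorted chronologically; the classifier choice and window scheme are illustrative, not our production setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

def evaluate_train_size(X, y, train_size, test_size=1000):
    """For one candidate training-set size, train on several windows spread over
    the whole period (not only the oldest records) and average the metrics."""
    accs, precs = [], []
    n_windows = (len(X) - test_size) // train_size   # assumes at least one full window
    for w in range(n_windows):
        start, end = w * train_size, (w + 1) * train_size
        clf = RandomForestClassifier(random_state=0)
        clf.fit(X[start:end], y[start:end])
        y_pred = clf.predict(X[end:end + test_size])  # evaluate on the records that follow
        accs.append(accuracy_score(y[end:end + test_size], y_pred))
        precs.append(precision_score(y[end:end + test_size], y_pred, zero_division=0))
    return np.mean(accs), np.mean(precs)

# Sweep over candidate training sizes, e.g.:
# for size in (1_000, 2_000, 4_000, 8_000, 16_000):
#     print(size, evaluate_train_size(X, y, size))
```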
There are many ways to improve this process further. Firstly, given that the set is imbalanced, there are better ways to measure success than Accuracy and Precision (credits to Frank Harrell). What is also problematic is that by artificially changing the class proportion to 50/50, we show the classifier the wrong proportions of the two classes during training. A classifier trained this way will then have lower accuracy on future, real test data. This is discussed here (credits to Baptiste Rocca). Finally, there are millions of ways to tune the classifier with feature selection, feature augmentation, and hyperparameter tuning. Maybe some of these will be covered in subsequent posts. Credits also go to my Sopra Steria colleague Alfredo Jesus Urrutia Giraldo, gifted with natural superpowers to see the stuff behind the data.