At Sopra Steria we manage the IT infrastructure and applications of large clients. We process millions of service tickets and infrastructure events. This massive stream of data comes from monitoring tools such as Zabbix, Nagios and Solarwinds, and from higher-level frameworks such as ServiceNow, MicroFocus and Jira. What can we do with this data? How can we become more effective and more precise, reduce cost, improve SLA compliance and improve the KPIs?
I joined the company in 2019. My mission was to enrich these operations with advanced data analytics.
At the very start of my research I stumbled upon the term AIOps (Artificial Intelligence for IT Operations), which covers Big Data analytics techniques used to enhance IT operations. At first I was excited. However, the more I read about AIOps, the more dissatisfied and frustrated I became. The term was vague, poorly defined and did not lead anywhere practical. Examples include the Wikipedia definition of AIOps, the Gartner definition of AIOps, the BMC article Beginner Guide to AIOps, AIOps by ServiceNow and AIOps by Microfocus. After reading Gartner Market Guide to AIOps Platforms (2018) I felt I still wasn’t getting anywhere. Gartner, with their macro vision, could produce only very generic statements. The paper summarized the work of 20 vendors. At the other end of the spectrum was the market itself: immature, experimental and lacking case studies.
I stopped reading about AIOps. Instead, I started working with Service Desk project managers to understand the real business needs. Now, with this two-year exercise still in progress, we have coined our own definition of AIOps. It is very practical: it consists of five service areas where advanced analytics brings real benefits and helps the service desk teams.
Below is the list of the five service areas currently on our radar. I will get back to the definition of AIOps at the very end.
1. Project profiling
The most basic need when starting any service desk project is the assessment of needs, also known as project due diligence. What size of team do you need? What skills? What is the expected volume of work? Is this work predictable and repeatable, or is the influx of tickets stochastic (chaotic)? What is the breakdown of ticket priorities, and how do they map to the SLAs? How often do major incidents happen? How easy is it to plan and predict the workload for particular hours of the day, days of the week, and weeks of the year? In short, we want to know whether this is an easy project or a difficult one.
To get a feel for what I am talking about, here is a volumetric snapshot of the workload of two projects over a 12-month period.
Even with very simple techniques such as visual EDA (exploratory data analysis), we can already suspect that Project B is more problematic, due to its unpredictable workload. Advanced descriptive analytics can provide far more insight. Some techniques are:
- basic statistics: distribution analysis, statistical tests
- analysis of variance (ANOVA)
- spike analysis
- anomaly detection
- trend analysis, variability analysis
- drill down, data slicing and multidimensional EDA
These methods allow us to rank projects by difficulty, assess the risk, cost and resources needed, understand the required skill set, and gain insight into what the main problems are and how often they occur.
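As a minimal sketch of such profiling, the snippet below compares two projects using the coefficient of variation of their weekly ticket counts and flags spike weeks with a median-based rule. The weekly counts, the function names and the threshold of 3.0 are illustrative assumptions, not figures from our projects.

```python
from statistics import mean, median, pstdev


def coefficient_of_variation(counts):
    """Standard deviation divided by the mean; higher means less predictable."""
    m = mean(counts)
    return pstdev(counts) / m if m else 0.0


def spike_indices(counts, threshold=3.0):
    """Indices of weeks far above the median, measured in scaled MAD units."""
    med = median(counts)
    mad = median([abs(c - med) for c in counts])
    scaled = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scaled == 0:
        return []
    return [i for i, c in enumerate(counts) if (c - med) / scaled > threshold]


# Hypothetical weekly ticket counts for two projects (invented for illustration).
project_a = [100, 104, 98, 101, 97, 103, 99, 102]
project_b = [40, 210, 35, 90, 300, 20, 55, 45]

cv_a = coefficient_of_variation(project_a)
cv_b = coefficient_of_variation(project_b)
spikes_a = spike_indices(project_a)
spikes_b = spike_indices(project_b)
```

Both projects have a similar average volume here, yet the second one has a much higher coefficient of variation and two spike weeks, which is the kind of signal that suggests a harder project.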
2. Cost Reduction
In the daily operational life of an ITSM manager, constant cost reduction is not a whim, but a contractual obligation. Surprisingly, it typically works well for the team, too (rather than causing staff reductions). The work that is eliminated is typically the most boring work. Staff are given more interesting tasks, and the amount of work from a satisfied client is likely to grow. Can analytics help here? In my experience, there are two leading directions.
The first direction is the elimination of unnecessary work. At the level of alarms (low-level technical events), we can establish which events tend to repeat in various patterns. A simplistic example below shows spikes of aggregated workload repeating regularly at a certain hour of the day. Such events can either be fixed once and for all to prevent repetition, or simply be considered noise, turned off and ignored. Of course, the repetition patterns can be more complex, and that’s where advanced analytics shines.
The second direction is automation. A lot of work can be automated with robotic tools such as UiPath. The trick is to find which tickets (out of many thousands) to approach. Here we can help to statistically identify the lowest-hanging fruit: tickets that are very similar to each other and are solved the same way (so automation is potentially quick and easy), and that tend to repeat frequently (so automation will be profitable in terms of ROI). Below is one example of visualizing groups of tickets in a certain two-dimensional space. Ticket groups marked green (both easy and profitable) are the candidates for automation.
Another, separate area of automation is a particular case of ticket classification. In many cases, tickets can be automatically allocated to the right team (or even automatically resolved) with the help of a properly trained Machine Learning model and Prescriptive Analytics (indicating what should be done). Simple tickets (such as a forgotten user password) can not only be automatically classified, but also processed and solved. Thus, their entire life cycle can be automated. This is perhaps the most mature area of AIOps, and the first one that has been put into production.
The techniques used in the Cost Reduction area vary, and range from trend analysis with an exponentially weighted moving average to the Fourier Transform. Significant work on data cleansing and feature elimination is needed too. The results can be substantial. A workload reduction of 20% can often be achieved with very basic methods, while advanced analytics can deliver much more.
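In its simplest form, the search for repeating alarms can be sketched as follows: for every (alarm, hour of day) pair, count on how many distinct days it occurred, and keep the pairs that show up on most days. This is an illustration only. The event tuples, the alarm names and the 0.8 cut-off are assumptions, and real repetition patterns usually need richer methods.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurring_alarms(events, min_day_fraction=0.8):
    """Return {(alarm_key, hour): fraction_of_days} for alarms that recur
    at the same hour on at least `min_day_fraction` of the observed days.

    `events` is an iterable of (timestamp, alarm_key) tuples.
    """
    days_seen = defaultdict(set)
    all_days = set()
    for timestamp, alarm_key in events:
        all_days.add(timestamp.date())
        days_seen[(alarm_key, timestamp.hour)].add(timestamp.date())
    if not all_days:
        return {}
    total_days = len(all_days)
    return {
        key: len(days) / total_days
        for key, days in days_seen.items()
        if len(days) / total_days >= min_day_fraction
    }


# Hypothetical data: a nightly alarm at 02:05 for ten days, plus one unrelated alarm.
start = datetime(2021, 3, 1)
events = [
    (start + timedelta(days=d, hours=2, minutes=5), "backup_disk_full")
    for d in range(10)
]
events.append((start + timedelta(days=3, hours=14), "cpu_high"))

candidates = recurring_alarms(events)
```

Each pair returned is a candidate for either a permanent fix or suppression as noise; that decision still belongs to the operational team.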
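To make the idea of "similar and frequent" tickets concrete, here is a deliberately naive sketch that groups ticket descriptions by word overlap (Jaccard similarity) and ranks the groups by size. It is not the method behind the visualization mentioned above; the ticket texts, the greedy grouping and the 0.6 threshold are all assumptions chosen for illustration.

```python
def tokens(text):
    """Lower-cased set of words in a ticket description."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_tickets(descriptions, threshold=0.6):
    """Greedily assign each ticket to the first group whose representative
    (the group's first ticket) is similar enough; return groups of ticket
    indices, largest group first."""
    groups = []  # list of (representative_tokens, member_indices)
    for index, text in enumerate(descriptions):
        ticket_tokens = tokens(text)
        for representative, members in groups:
            if jaccard(representative, ticket_tokens) >= threshold:
                members.append(index)
                break
        else:
            groups.append((ticket_tokens, [index]))
    return sorted((members for _, members in groups), key=len, reverse=True)


# Hypothetical ticket descriptions.
descriptions = [
    "password reset for user alice",
    "password reset for user bob",
    "disk full on server db01",
    "password reset for user carol",
    "printer offline floor 3",
]

ranked_groups = group_tickets(descriptions)
```

The largest groups are the first places to look for automation candidates, provided that the tickets in a group are also resolved the same way, which this sketch does not check.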
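As an illustration of ticket routing, the sketch below trains a tiny multinomial Naive Bayes classifier on ticket text. It is a toy written with the standard library only, not the model we run in production; the class name, team names and training tickets are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesRouter:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # team -> word frequencies
        self.team_counts = Counter()             # team -> number of tickets
        self.vocabulary = set()

    def fit(self, tickets):
        """`tickets` is an iterable of (description, team) pairs."""
        for text, team in tickets:
            words = text.lower().split()
            self.team_counts[team] += 1
            self.word_counts[team].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        """Return the team with the highest log-probability for this text."""
        words = text.lower().split()
        total_tickets = sum(self.team_counts.values())
        vocabulary_size = len(self.vocabulary)
        best_team, best_score = None, -math.inf
        for team, ticket_count in self.team_counts.items():
            score = math.log(ticket_count / total_tickets)
            denominator = sum(self.word_counts[team].values()) + vocabulary_size
            for word in words:
                score += math.log((self.word_counts[team][word] + 1) / denominator)
            if score > best_score:
                best_team, best_score = team, score
        return best_team


# Hypothetical training data: (ticket description, resolving team).
training = [
    ("password reset for user", "service_desk"),
    ("user forgot password", "service_desk"),
    ("disk full on database server", "infrastructure"),
    ("server unreachable after reboot", "infrastructure"),
    ("vpn connection drops", "network"),
    ("network switch port down", "network"),
]

router = NaiveBayesRouter().fit(training)
team_for_password = router.predict("forgot my password")
team_for_disk = router.predict("database server disk full")
```

In practice, a model like this would need far more training data, proper text preprocessing and a confidence threshold below which tickets fall back to manual triage.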
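Two of the techniques named above can be sketched in a few lines: an exponentially weighted moving average for smoothing a workload series, and a naive discrete Fourier transform for finding its dominant repetition period. The hourly series below is synthetic and the smoothing factor is an arbitrary assumption; for real data, a library FFT would be the sensible choice.

```python
import cmath
import math


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; `alpha` weights the newest value."""
    smoothed = []
    for value in values:
        previous = smoothed[-1] if smoothed else value
        smoothed.append(alpha * value + (1 - alpha) * previous)
    return smoothed


def dominant_period(values):
    """Period (in samples) of the strongest frequency component, found with a
    naive O(n^2) discrete Fourier transform of the mean-centered series."""
    n = len(values)
    average = sum(values) / n
    centered = [v - average for v in values]
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coefficient) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


# Synthetic hourly ticket volume over one week with a smooth daily cycle.
hourly = [20 + 10 * math.sin(2 * math.pi * h / 24) for h in range(168)]

period_hours = dominant_period(hourly)
smoothed = ewma(hourly)
```

A dominant period of 24 samples on hourly data points to a daily cycle, which is often the first hint that a scheduled job is generating repeated work.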
3. Team Planning Support
Team allocation is a manager’s nightmare. The workload variability does not match team availability. On the one hand, we have the stream of tickets, subject to certain SLAs by priority, trends and seasonality. On the other, we have the operational team and the human factor: night shifts, weekends, holidays and sick leave. The two forces are in constant struggle. Yet the more insight we have into the workload, the easier team planning becomes. At the most basic level, weekly workload heatmaps are helpful:
More advanced tools include:
- correlation analysis (what correlates with massive incidents? when does the most difficult workload tend to arrive?)
- trend analysis (is there a tendency for spikes of activity on Tuesday night?)
- team dynamics analysis, including attrition and learning curve
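The data behind the weekly workload heatmap mentioned earlier in this section is simple to compute: count tickets per weekday and hour. A minimal sketch, assuming all we have is a list of ticket creation timestamps (the timestamps below are invented):

```python
from datetime import datetime


def weekly_heatmap(timestamps):
    """Return a 7 x 24 grid of ticket counts: rows are weekdays
    (0 = Monday), columns are hours of the day."""
    grid = [[0] * 24 for _ in range(7)]
    for timestamp in timestamps:
        grid[timestamp.weekday()][timestamp.hour] += 1
    return grid


# Hypothetical ticket creation times (1 and 8 March 2021 were Mondays).
created = [
    datetime(2021, 3, 1, 9, 15),
    datetime(2021, 3, 1, 9, 40),
    datetime(2021, 3, 8, 9, 5),
    datetime(2021, 3, 2, 23, 0),
]

grid = weekly_heatmap(created)
```

The resulting grid can be passed to any plotting tool; comparing it with the shift schedule quickly shows where staffing and workload diverge.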
4. Predictive analytics
Knowing the future is helpful. You can attempt to prevent unpleasant events (such as massive incidents), or you can prepare better, knowing what is anticipated. Predictive analytics has several useful directions in IT operations.
Trend analysis (volumetric forecasting) can be used to predict volume: increased workload, anomalies, and bursts of activity. In the interesting example below, we were able to detect signals of an upcoming performance incident 18 weeks in advance (the timeline goes from left to right, with one dot indicating one week).
Detailed predictive analytics goes one step further and predicts the exact parameters of the expected events, including real-time alerts. Sometimes even very general details are useful. For instance, we might not know exactly what will happen tonight, but we might be able to predict the technical area, the severity and the skills we will need.
Predictive analytics methods range from mathematical linear trend estimation, through statistical forecasting, to machine learning. Corrections for seasonality and weekly trends, and other typical time series data transformations, are needed. The main problems are: finding causality in the data, careful filtering, data representation (transformation of features), data drift, and, in the case of machine learning, the problem of influencing your own results once the model has been deployed.
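The most basic form of volumetric forecasting is a least-squares line fitted to weekly measurements and extrapolated to a threshold. The sketch below shows that idea only; it is not the method that produced the 18-week detection described above, and the weekly values and the threshold are invented.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values observed at x = 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def weeks_until_threshold(values, threshold):
    """Weeks after the last observation at which the fitted trend reaches
    `threshold`; None if the trend is flat or falling."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical weekly metric (for example, average response time) drifting upward.
weekly_metric = [10, 12, 14, 16, 18]

slope, intercept = linear_trend(weekly_metric)
weeks_left = weeks_until_threshold(weekly_metric, threshold=30)
```

With this perfectly linear toy series the fitted slope is 2 per week and the threshold of 30 is reached 6 weeks after the last observation. Real series are noisy, so a forecast like this should come with an uncertainty estimate.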
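One of the typical corrections mentioned above, removing the weekly pattern before looking for a trend, can be sketched with simple seasonal indices. The example assumes a daily ticket count series that starts on a Monday and covers whole weeks; the numbers are made up.

```python
from statistics import mean


def weekday_indices(daily_counts):
    """Seasonal index per weekday: the weekday's mean divided by the overall mean.
    Assumes daily_counts[0] is a Monday and at least one full week of data."""
    overall = mean(daily_counts)
    return [mean(daily_counts[day::7]) / overall for day in range(7)]


def deseasonalize(daily_counts):
    """Divide every value by its weekday index to remove the weekly pattern."""
    indices = weekday_indices(daily_counts)
    return [
        count / indices[i % 7] if indices[i % 7] else 0.0
        for i, count in enumerate(daily_counts)
    ]


# Hypothetical two weeks: busy weekdays, quiet weekends, no underlying trend.
daily = [100, 100, 100, 100, 100, 20, 20] * 2

adjusted = deseasonalize(daily)
```

After the adjustment, every day in this toy series has the same value, so anything that remains in real data after this step is trend, anomaly or noise rather than the weekly rhythm.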
5. Custom analyses
Last but not least, custom analyses, specific to the project’s needs, are often the most important piece of the puzzle. You need to listen carefully to the project team to understand where the needs are. As one example, a project asked for an analysis of an SLA: why do some tickets take longer to resolve, in effect breaching the contractual service-level agreement? The finding, visualized below: in a pattern recurring almost daily, the SLAs were consistently breached at 7:00 am and 7:00 pm. The spikes correlated with the daily shift change times. The problem was then quickly solved with a shift reorganization.
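An analysis of this kind can start with something as plain as the SLA breach rate per hour of day. The sketch below assumes each ticket is reduced to the hour at which it was opened and a flag saying whether its SLA was breached; both the representation and the sample tickets are assumptions for illustration.

```python
from collections import Counter


def breach_rate_by_hour(tickets):
    """Return {hour: share of tickets opened in that hour whose SLA was breached}.

    `tickets` is an iterable of (opened_hour, was_breached) pairs.
    """
    total = Counter()
    breached = Counter()
    for hour, was_breached in tickets:
        total[hour] += 1
        if was_breached:
            breached[hour] += 1
    return {hour: breached[hour] / total[hour] for hour in sorted(total)}


# Hypothetical tickets: (hour opened, SLA breached?).
tickets = [
    (7, True), (7, True), (7, False),
    (10, False), (10, False),
    (19, True), (19, False),
]

rates = breach_rate_by_hour(tickets)
```

Hours with a clearly elevated breach rate can then be compared with operational facts such as shift schedules, which is how a pattern like the one above becomes visible.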
Custom analyses by nature vary in methods and cover all areas of statistics. The main difficulty lies not in technology, but rather in starting the conversation. The operational team is not likely to articulate their needs up front, because they do not know what is possible. Several conversations may be needed.
Summary
AIOps, redefined into the five areas above, becomes a practical toolbox for making the Service Desk team’s life easier.
At the same time, it is an area of fascinating research, where many discoveries can still be made. In my current work, I feel we have barely scratched the surface of what is possible here. With any new project, our capabilities constantly deepen. Most importantly, it is stimulating to deliver results which contain not just numbers, but practical and immediate help for the project teams.
Data analytics cannot exist in limbo. It should be part of an overall data-driven culture: data governance, data management, data protection and curation, and finally the Data Engineering infrastructure, with long-term storage, real-time data pipelines, automatic data cleansing and so on; in short, a data lake. I wrote about our data lake before on this blog.
Finally, looking back at the definitions of AIOps listed at the beginning of this article, I can risk saying that most of them miss the point. First, they are too vague to be practical. Second, the particular focus on Machine Learning is, in my view, misleading. In the five areas of work described above, Machine Learning is indeed useful, but only in two narrow cases.
Then, what is AIOps? To me, AIOps is about bringing a data-driven culture into operations. Once data is accepted as the central asset of the service organization (which implies a certain data management approach), advanced analytics can be used to improve operations: boost effectiveness, reduce cost, automate boring work, increase confidence in planning and bring overall insight. In essence: with AIOps, Advanced Analytics and Data Engineering are deployed so the operational team can sleep better, as in the five examples above.
Credits
This article received thorough comments from my Sopra Steria colleagues: Jerome Perdriaud, Alfredo Jesus Urrutia Giraldo, Andreas Vermeulen, and Francois Marie Lesaffre. The insight, generated over the past two years, came from numerous people whose names are impossible to list, so I will only name the key influencers among the management: Mohammed Sijelmassi, Joachim Von Ekensteen, Fabien Colin and Marzena Rybicka-Szudera. In essence, I have put together thoughts which in fact have been generated by collective knowledge and effort. Thank you.