Here is a humorous recent example of what basic data exploration can achieve, without even touching any advanced ML techniques.
In this recent project I was asked to study the response-time log of an online service running on Amazon EC2 + S3. The service slows down under increased demand, and user response time suffers (anything over 100 ms is bad). Various performance-scaling scenarios were proposed, but first we wanted to understand how the performance degraded over time. Could incidents of extremely bad performance be predicted or forecast?
anomaly detection and forecasting of network traffic
For a bit of theory: the task may or may not be feasible. We are talking about network-traffic anomaly detection, which boils down to time-series analysis. The data might exhibit the characteristics of a random walk (we do not know at this point), in which case we are doomed. However, basic domain-specific knowledge and intuition suggest there is a chance to succeed. Performance peaks may be preceded by various signals of trouble ahead, visible in the server's internal metrics (users, processes, CPU, memory, I/O). Note that this would turn our model into a multivariate, multi-step time-series forecast, which is probably worth it. Peaks also tend to repeat in weekly or monthly cycles, and to show seasonality patterns and trends that can be attributed to market and industry-specific events.
Time-series analysis and forecasting is a specific branch of machine learning. While prediction can be modelled with regression methods, identification of anomalous events looks more like a classification task. The popular scikit-learn Python package contains a collection of classifiers and regressors that can be useful (linear regression, k-means, random forests, gradient boosting, support vector machines and more), while statsmodels covers classical methods such as exponential smoothing. If this doesn't work, the Keras package (an API over TensorFlow) provides the building blocks for deep-learning models, including time-delay neural networks (TDNN), long short-term memory networks (LSTM), neural network autoregression (NNAR) and other architectures proven to work for time-series prediction. Having tried various models, we would later compare their results with an evaluation metric such as RMSE (root mean squared error).
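The RMSE comparison mentioned above is a one-liner with scikit-learn. A sketch with toy placeholder numbers (the prediction arrays stand in for the outputs of two hypothetical models, not real results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy held-out response times (ms) and two hypothetical model forecasts
y_true = np.array([100.0, 120.0, 90.0, 300.0, 110.0])
pred_a = np.array([105.0, 115.0, 95.0, 280.0, 112.0])  # e.g. gradient boosting
pred_b = np.array([110.0, 130.0, 80.0, 200.0, 100.0])  # e.g. linear regression

# RMSE = sqrt(mean squared error); the lower, the better
rmse_a = mean_squared_error(y_true, pred_a) ** 0.5
rmse_b = mean_squared_error(y_true, pred_b) ** 0.5

print(f"model A: {rmse_a:.1f} ms, model B: {rmse_b:.1f} ms")
```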
exploring the time-series data
That’s it for the theory; now let’s see how much of it we shall need. Here is the service response time we received for analysis – north of 50 thousand data points over a few weeks. Source data is in blue, the moving average in red. The second chart shows a magnified fragment.
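The moving average in the chart is a pandas rolling mean. A minimal sketch on fabricated data (the `response_ms` series is a stand-in for the real 50k-point log):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Fabricated stand-in for the response-time samples, in milliseconds
ts = pd.Series(rng.normal(500, 200, 1000).clip(min=0), name="response_ms")

# Moving average over a 50-sample window, as plotted in red above;
# min_periods=1 avoids NaNs at the start of the series
smooth = ts.rolling(window=50, min_periods=1).mean()

print(f"raw std: {ts.std():.0f} ms, smoothed std: {smooth.std():.0f} ms")
```

The window size trades responsiveness for noise suppression; 50 samples is an arbitrary choice here, not the one used for the actual charts.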
There is nothing specific in the data at first sight, except the extreme peak in the middle and a few smaller ones. The magnified view shows no obvious patterns or cyclic trends either. Before proceeding to anomaly detection, let’s spend some more time on data exploration. First, I am interested in the distribution of the response times (could we separate the anomalies into their own data population?). Pandas’ DataFrame.describe() method reveals that the average response time is already very high; more on this later.
count    53180.000000
mean       504.889206
std        205.255240
min          0.000000
25%        369.000000
50%        491.000000
75%        617.000000
max       8068.000000
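The same summary, plus a direct measure of the problem, takes two lines of pandas. A sketch on fabricated data whose size, mean and spread mimic the table above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Fabricated response times mimicking the summary statistics above
rt = pd.Series(rng.normal(505, 205, 53180).clip(min=0), name="response_ms")

print(rt.describe())
# Share of probes exceeding the 100 ms budget mentioned at the start
print(f"over budget: {(rt > 100).mean():.1%}")
```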
I don’t like these numbers. At this point I’d like to visualize the distribution of the response times. Here it is.
Oops. I was expecting a Gaussian bell curve, or some other well-known distribution. Instead, I can clearly see a multimodal shape with modes at 200, 400, 600 and 800 ms. Why should this be? And why are the modes so evenly spaced, in round decimal numbers? What’s more intriguing, however, is the visible zebra pattern. Here is a magnified section.
It looks as if the service response times cluster in groups spaced 40 ms apart. In other words, the response time is often 205 ms, or 245 ms, or 285 ms, but rarely, say, 257 ms. This does not look natural. It could be attributed to the internal workings of the software (for instance, waiting loops released at fixed intervals) but not to stochastic network delays. Now let’s look at the distribution of the delta between subsequent probes, which can be quickly constructed using diff() or shift() on a Pandas Series object.
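A sketch of that delta computation with Series.diff(). The input series is fabricated to mimic the observed 40 ms quantisation; on the real log, the histogram of these deltas is the comb-shaped chart below:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Fabricated response times locked to a 205 + k*40 ms grid, mimicking
# the zebra pattern observed in the real data
rt = pd.Series(205 + 40 * rng.integers(0, 10, 2000), name="response_ms")

# Delta between each probe and the previous one (first value is NaN)
delta = rt.diff().dropna()

# On this grid every delta is itself a multiple of 40 ms
print(delta.value_counts().head())
```

`rt - rt.shift(1)` is equivalent to `rt.diff()`; either works.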
This strange shape shows that the relation between one observation and the next is weak. If the response was 205 ms, the next one could still be 205 ms with a 15% chance, but equally well 245, 285, 325 or more. In other words, network issues such as momentary traffic congestion have little or no effect on the response time.
We concluded the data made no sense for our purpose. It might have been strongly affected by some human-generated process, probably intrinsic to the server internals, which effectively masked the network phenomena and anomalies we were supposed to study. We shared this observation with the data owner.
Shortly afterwards the data owner came back with feedback. They confirmed the data had indeed been dirty, in a way. An endless loop, the result of a programmer’s mistake, was over-utilizing the Amazon EC2 compute resources and competing for them with our server process. The zebra response-time pattern we observed might have resulted from the way EC2 handled time-sharing between the two concurrent processes, perhaps granting our server process sparse time slots at strict intervals.
Here is the new data, submitted for analysis a few days after the software problem had been fixed.
What can be immediately observed is that the average server response time is now a nice 28 ms, with peaks around 300 ms. Below is the response-time distribution, which this time, as expected, resembles the Gaussian bell (to be precise, it looks like a lognormal distribution, which is closely related to the Gaussian; it may also be bimodal, which could signal two superimposed lognormal distributions. This could be researched further, but events below 30 ms are outside our scope of interest, as they cause no trouble). The zebra pattern has disappeared. The last chart below shows the delta distribution, which essentially meets the original expectation: each probe’s response time is similar to that of the previous probe.
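The lognormal reading can be sanity-checked cheaply: the log of a lognormal sample should look Gaussian, so the raw data should be right-skewed while its log is not. A sketch on fabricated data centred near the observed 28 ms (the sigma value is an arbitrary assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Fabricated lognormal response times around ~28 ms (sigma is assumed)
rt = rng.lognormal(mean=np.log(28), sigma=0.4, size=5000)

# Raw sample: visibly right-skewed; log-transformed: roughly symmetric
print(f"skew(rt)={stats.skew(rt):.2f}, skew(log rt)={stats.skew(np.log(rt)):.2f}")
```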
Now the data is clean and ready for proper anomaly detection, only that… it is not needed any more. We have eliminated the performance problem altogether. The service is now performing well. Our early feedback helped cut the response time from 500 ms to 30 ms. The ML project was over before it had properly started.
Data exploration pays.