I often feel the gap between mainstream Data Science rhetoric and true business needs is widening. When I hear of Hyperautomation, Edge AI, AutoML, or GANs, I challenge myself to take a step back and understand our needs better.
Democratization of statistics: Chi2 for non-experts
I am a big fan of advanced methods deployed to solve practical problems by ordinary users. Here is our recent achievement. My colleague, an experienced service desk manager, observed that the volume of work in his team has grown. He would
An approach to categorize multi-lingual phrases
I have 130,000 help desk tickets with multi-lingual descriptions. I need to divide this set into categories, such as “password reset”, “license expired”, or “storage failure”. Why? Users could then allocate a category to a new ticket they create. Then
The implications of Scikit-learn bug #21455
As described last week, the Scikit-learn chi-square feature selection is not usable until bug #21455 is addressed. The problem concerns sklearn.feature_selection.chi2 and the derived methods, including SelectKBest, when used for categorical features other than binary. The nature of the
Your model may be inaccurate
With Machine Learning in Python, you may do feature selection with SelectKBest. As I just confirmed, this method sometimes returns faulty results. This potentially impacts the accuracy of numerous ML models worldwide. Below are the details and a way out. The
Answering Why (with Chi-Square)
Analysts don’t like the “why” questions. They are tough to answer. For instance, in a help desk analysis, it is easy to show which tickets are resolved faster. But it is difficult to say why. In my practice in Sopra
What makes Data Quality so difficult
Garbage in, garbage out. Analysis of untrusted or poorly understood data will yield incorrect results. Hence the textbook approach is to clean the data first, and only then proceed with data analytics. For instance, in the data lakes, the data
Don’t trust Data Science. Ask the people
Can intuition beat the popular data science tools? Is SelectKBest, the popular feature selection method, wrong? Here is a story of a recent project. The story ends with a puzzle which I cannot solve. Help is welcome. Both data and
Mistaken by a factor of 100,000
Lognormal data is very tricky. Wrong visualization methods can lead to radical misinterpretation of the result. In this article I show an example of such a mistake based on a real project, and I demonstrate how to avoid the caveats
Practical AIOps: 5 use cases
At Sopra Steria, we manage the IT infrastructure and applications of big clients. We process millions of service tickets and infrastructure events. This massive stream of data comes from monitoring tools such as Zabbix, Nagios, Solarwinds, and higher level frameworks: