Following up on my previous post, we have spent some time building an internal Data Lake at Sopra Steria. The infrastructure is now functional and admitting its first users.
Much has been said about building successful Data Science teams. Multidisciplinary DS projects are hard because they require a broad skill set, and good teams are scarce. Once you have built a seasoned team, it makes sense to work on increasing their comfort and productivity by reducing the overhead of unnecessary work. And this is why we needed the Lake.
Skill should be focused on the most difficult tasks: statistics, exploration, reasoning, classification, prediction, or detection of anomalies. However, the typical work breakdown of a DS project includes environment preparation (20% of time), data acquisition (40%) and data cleansing (30%). It often happens that no more than 10% of project time can be spent on data science proper.
We have seen this in Data Science work pertaining to IT Service Management (ITSM), tasked with the analysis of millions of tickets, events, incidents, alarms, problems and changes coming from the vast customer IT infrastructures monitored by our tools. These difficulties sparked our internal Data Lake project: an efficient environment where ITSM data is semi-automatically acquired, secured, formatted and provisioned right to the doorstep of the analyst team. We envisioned an environment that does 90% of the work for us:
- no-brainer data acquisition: fresh data automatically available
- no-brainer data cleansing: real-time ETL-type processes do that
- immediate environment prep: preinstalled tools and libraries are there
- faster experimenting: parallel execution environment brings results in a fraction of time
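To give a flavor of the "no-brainer data cleansing" point, here is a minimal sketch of what one normalization step might look like. The field names and rules below are illustrative assumptions, not our actual ticket schema:

```python
from datetime import datetime, timezone

# Hypothetical cleansing step: normalize one raw ITSM ticket record.
# Field names and mapping rules are illustrative, not the real schema.
def clean_ticket(raw: dict) -> dict:
    return {
        "ticket_id": raw["id"].strip().upper(),
        # Priorities arrive in mixed formats; map them to a small enum.
        "priority": {"1": "critical", "2": "high", "3": "normal"}.get(
            str(raw.get("priority", "3")).strip(), "normal"
        ),
        # Timestamps arrive as ISO strings; store them as UTC datetimes.
        "opened_at": datetime.fromisoformat(raw["opened_at"]).astimezone(timezone.utc),
        # Free-text fields get their whitespace collapsed.
        "summary": " ".join(raw.get("summary", "").split()),
    }

record = clean_ticket({
    "id": "  inc00123 ",
    "priority": " 1 ",
    "opened_at": "2019-05-01T08:30:00+02:00",
    "summary": "  disk   full   on  host  ",
})
print(record["ticket_id"], record["priority"])  # INC00123 critical
```

In the lake, steps like this run continuously in the ETL pipeline, so the analyst never sees the raw mess.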
And here it is. The environment we have built (diagrammed below) includes Spark and Hadoop family technologies as the horsepower, Kafka + Airflow + Spark as the ETL, and a wide variety of storage options: HDFS, NFS, Hive, flat files, and relational and key-value databases. Of course, the lake is populated by the fish! These are our worker hosts. At the end of the workflow cycle, data scientists receive clean data and the tools they love, including various notebooks (Zeppelin, Jupyter) with Python libraries, and ways to productize their work (CI/CD, Docker, PowerBI). With this, we aim to increase data scientists' productivity by an order of magnitude, to 70-100% of the project time.
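The "results in a fraction of time" claim comes from fanning work out over many workers. On the lake this is Spark's job; as a stand-in, the same fan-out pattern can be sketched with the Python standard library (the partition data and the toy metric below are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a per-partition analytics task. On the lake, work like
# this runs on Spark executors across the worker hosts; here we use a
# thread pool from the standard library purely to show the pattern.
def score_partition(tickets):
    # Toy metric: fraction of records in the partition that are incidents.
    return sum(1 for t in tickets if t.startswith("INC")) / len(tickets)

# Illustrative partitions of ticket IDs (made-up data).
partitions = [
    ["INC1", "CHG2", "INC3", "INC4"],
    ["CHG5", "CHG6", "INC7", "CHG8"],
]

# map() fans the partitions out to the pool, one task per partition,
# and yields the results in input order.
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(score_partition, partitions))

print(scores)  # [0.75, 0.25]
```

With Spark, the same map-over-partitions idea scales past one machine, which is what turns a day of experimenting into minutes.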
Below is a gallery of some analytics executed on the lake, including root cause analysis, correlation, parameter degradation over time, process optimization, prediction of volume and difficulty, design for automation, and much more. Time permitting, I hope to cover many of these in more detail in subsequent articles. This is an ongoing project at Sopra Steria in Katowice, Poland. Sopra Steria, headquartered in France, leads the digital transformation of major European corporate clients. With over 40 thousand employees dedicated to serving behemoth client infrastructures, we have quite some data to crunch – enough for substantial analysis. And yes, we're frequently looking for talent, too. Stay tuned!
4 thoughts on “The Data Lake at Sopra Steria”
It’s indeed a great initiative to have a common, ready-to-consume repository of data. I personally like the “https://ondata.blog/articles/why-analysts-need-data-lakes/” post.
A couple of things that could be the cherry on the cake:
1. Data anonymization, as we have centres (both data and engineers) outside Europe as well.
2. An autoML tool that can do preliminary pre-processing of data as soon as it enters the lake.
Swati, your comment is appreciated. Good luck with the evaluation work.
Congratulations on the completion of the Data Lake. Looking forward to using it for the use cases that we had discussed.
Shalini, looking forward too!