Data lakes vary from one another, and standards are only emerging. The Lagoon Data Lake we have built at Sopra Steria (introduced in the previous post) is an internal IaaS solution, composed mostly of open source components (Spark, HDFS and much more) integrated on an internal cloud, and it speeds up our Python-based analytics. But it is a complex beast: it gives you plenty of possibilities, which also means fresh users easily get lost. Where to start? This short article provides a simple usage scenario, implemented on our Lagoon Data Lake for educational purposes. Our entry-level internal users should read it to get started with the Lake. For readers who cannot access our lake, the piece may still help explain how data lakes work.
The scenario
In my part of Sopra Steria, we monitor large-scale customer IT infrastructures. Our job is to keep them operational, which means we need to respond quickly to incidents. Here we will focus on the incident data. In the data lake, among many other things, we collect the incident data from our IT monitoring. The database contains millions of such incidents from the past five years. This data is quite rich in features (millions of rows and 200 columns) and is sufficient for months and years of data science work! You can slice and dice, group, cluster, look for optimizations, classify and predict. All the cool stuff. But that we can do later. At this point we want you, as the Data Scientist, to get started. So the goal is to get hold of the data, perform some basic exploration, and draw three basic histograms: how long do incidents typically live? At what time of day do most incidents arrive? And finally, how many incidents do we process weekly, and is one week very similar to another? Let's get started!
Without a data lake
First, let's understand how the same task would be executed without the data lake. This is easy for me to explain, because it was exactly my work routine before we built Lagoon. To analyse the data, I needed to:
- Log in to a remote Oracle database server where the incidents were stored
- Using SQL Developer or sqlplus, execute a SELECT query and download the data to my laptop as a CSV file
- Open a jupyter notebook, run pd.read_csv() and start playing with the data (roughly as in the sketch below).
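For completeness, the notebook side of that old routine was as trivial as it sounds. A minimal sketch, with an illustrative file name:

```python
import pandas as pd

# Load the CSV exported from the Oracle incident table
# (the file name is illustrative).
incidents = pd.read_csv("incidents_export.csv")

print(incidents.shape)
incidents.head()
```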
Was this simple? No. It was an extremely difficult, ineffective and irritating process. Unfortunately, we work with real-time data and I needed to analyse fresh data every morning. The problems were:
- the source table is large: 5-20 GB or more, and the transfer was unstable and often broke
- the content of the incident table changes all the time, and not only through INSERTs but also UPDATEs, so I had to download the entire table every morning, otherwise my CSV got out of sync
- it was then too much data to load into jupyter, so prior to the analysis I had to reduce it manually, depending on the type of analysis needed; I used simple Linux tools such as awk, sed or small Python scripts, which weren't difficult but added another time-consuming step to the work
- then I had to perform data cleansing – daily. Things like changing date formats, removing null values, dropping bogus rows and recalculating columns – even when automated – added yet another step to the semi-manual workload (a sketch of such a cleansing pass follows below)
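To give an idea of that daily chore, here is a rough sketch of such a cleansing pass in pandas. The column names are illustrative, not the real schema:

```python
import pandas as pd

df = pd.read_csv("incidents_export.csv")

# Normalise the date formats (column names are illustrative)
df["opened_at"] = pd.to_datetime(df["opened_at"], errors="coerce")
df["closed_at"] = pd.to_datetime(df["closed_at"], errors="coerce")

# Drop bogus rows and fill obvious nulls
df = df.dropna(subset=["incident_id", "opened_at"])
df["priority"] = df["priority"].fillna("UNKNOWN")

# Recalculate an auxiliary column
df["lifetime_seconds"] = (df["closed_at"] - df["opened_at"]).dt.total_seconds()

# Keep only the columns this particular analysis needs
df = df[["incident_id", "opened_at", "closed_at", "priority", "lifetime_seconds"]]
```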
In the end, to spend one hour on data analysis, I often had to waste an entire day acquiring and preparing the data again.
With the data lake
These and other issues stimulated our team to invest in building the Lake. Therefore one key concept we immediately built the infrastructure upon was the workflow. If you read the previous section, you already know what we wanted: an automated sequence of processes that do everything that could possibly be automated. So a typical Lagoon workflow acquires the data, reduces it to just the necessary columns, cleans it removing garbage data, and creates auxiliary columns with some pre-calculated values. The result is a clean data file, ready for analysis.
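Just to give a flavour of what such an orchestrated sequence can look like, here is a minimal sketch in Apache Airflow. This is not the actual Lagoon code; the task names, the 5-minute schedule and the empty callables are made up for illustration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def acquire():                 # pull fresh data from the source
    pass

def reduce_columns():          # keep only the columns we need
    pass

def clean():                   # remove garbage rows, fix nulls and formats
    pass

def add_auxiliary_columns():   # pre-calculate helper values
    pass


with DAG(
    dag_id="incident_workflow_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),  # hypothetical schedule
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="acquire", python_callable=acquire)
    t2 = PythonOperator(task_id="reduce_columns", python_callable=reduce_columns)
    t3 = PythonOperator(task_id="clean", python_callable=clean)
    t4 = PythonOperator(task_id="add_auxiliary_columns", python_callable=add_auxiliary_columns)

    t1 >> t2 >> t3 >> t4
```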
The subsequent steps correspond to the data lake zones. The raw zone stores the source data (those ugly 5-20 GB files). The staged and trusted zones hold subsequent stages of processing, where garbage is gradually removed, nulls are filled with correct values, and the data becomes more sensible. Finally, the gold zone holds the final, pretty dataset, ready to be loaded into a jupyter notebook.
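To make the zones a bit more concrete, here is a hedged PySpark sketch of data moving from the raw zone to the gold zone. The paths and column names are illustrative, not the real Lagoon layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zones_sketch").getOrCreate()

# raw zone: the ugly source dump, stored as-is (illustrative path)
raw = spark.read.csv("/raw/incidents/", header=True)

# staged zone: only the columns we care about
staged = raw.select("incident_id", "opened_at", "closed_at", "priority")

# trusted zone: garbage removed
trusted = staged.dropna(subset=["incident_id", "opened_at"])

# gold zone: the final, analysis-ready dataset
trusted.write.mode("overwrite").parquet("/gold/incidents/")
```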
(To achieve this, we use a complex set of tools which, at this point, you don't even need to know about. Just for your information, we use tools such as Apache Spark, Hadoop, YARN, HDFS, NFS, Airflow and Kafka for ETL, plus Oracle, PostgreSQL and Hive for storage, all enclosed in a cloud environment built from a federation of virtual machines, with VPN, security, and role-based access control at both shell and browser level. Don't worry, there is no need to understand it all now. It just works.)
The architecture looks like this. The fish represent the worker machines, while the colored processes represent the workflows.
There is only one thing to add. As you noticed above, we have more than one workflow, because we have several projects to cater for. Each project has different data, a different client and different users. You, as an entry-level Lagoon user whose main purpose is to train your skills, will receive access to just two workflows. The first is called the Reference Workflow. It processes specially prepared reference data suitable for training purposes, and you have read-only access to it. The second is called the Sandbox Workflow. It is similar, but here you have read-write access to all data stages, which means you are allowed to play and break things (which will then remain broken for you and the other users).
Now we will examine the Reference Workflow data in the final, gold stage of processing. Keep in mind that this is not any particular customer project (to those you have no access!), but specially crafted training data that statistically resembles real project data.
Let’s do it!
Lengthy introduction, eh? The difficult part is behind us. We have just explained all the things you don't have to do, or even properly understand. It boils down to one fact: the final data set, ready for your Data Science craft, is waiting for you in the gold zone of the Reference Workflow, automatically updated every 5 minutes. How about that? If you are a Data Scientist, most probably the only thing you want is to get hold of that final data and start playing.
After acquiring your Lagoon access credentials from our sysadmin, open your jupyterlab link in the browser and point it to the reference notebook we prepared for orientation purposes, located at /gold/reference/reference_sample.ipynb. You should see this.
Read the notebook carefully and execute each cell. Section A leads you through verification of the workflow steps. This basically helps you understand how much the workflow has already done for you, so that you don't need to do it yourself. For instance, here we can verify that the auxiliary columns have already been computed. Green cells provide guidance, white cells contain the code you should run:
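The actual cells live in the notebook; a rough stand-alone equivalent in plain pandas could look like this, where the gold file path and the auxiliary column names are assumptions made for the sketch:

```python
import pandas as pd

# Load the gold-zone data set (illustrative path)
gold = pd.read_parquet("/gold/reference/incidents.parquet")

# Verify that the workflow has already added the auxiliary columns
expected_aux = ["lifetime_seconds", "arrival_hour", "week_number"]
missing = [c for c in expected_aux if c not in gold.columns]
print("missing auxiliary columns:", missing or "none")
```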
Section B leads you through basic exploration of the gold data set. Here, for instance, we can check the overall breakdown of all available incidents per year or per priority code:
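In plain pandas, that kind of breakdown boils down to a couple of one-liners; the path and column names below are again illustrative:

```python
import pandas as pd

gold = pd.read_parquet("/gold/reference/incidents.parquet")

# Incidents per year
per_year = pd.to_datetime(gold["opened_at"]).dt.year.value_counts().sort_index()
print(per_year)

# Incidents per priority code
print(gold["priority"].value_counts())
```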
Finally, we get to Section C, where we can print some simple charts. Let's draw the diagrams that were set as the goal of this exercise. First, how long does an incident typically live?
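A minimal sketch of that first chart, assuming the gold data set carries a pre-computed lifetime column (the path and column name are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

gold = pd.read_parquet("/gold/reference/incidents.parquet")

gold["lifetime_seconds"].plot.hist(bins=100)
plt.xlabel("incident lifetime [s]")
plt.ylabel("number of incidents")
plt.show()
```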
It looks like most incidents are closed somewhere north of 250 seconds. The spikes are correct: they result from the processes and procedures the various categories of incidents need to go through before being resolved and then closed.
Secondly, at what time of day do most incidents arrive?
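A similar sketch for the second chart, counting incidents per hour of arrival (the timestamp column name is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

gold = pd.read_parquet("/gold/reference/incidents.parquet")

hours = pd.to_datetime(gold["opened_at"]).dt.hour
hours.value_counts().sort_index().plot.bar()
plt.xlabel("hour of day")
plt.ylabel("number of incidents")
plt.show()
```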
As we can see, the biggest spike is at about 1:00 am, which however does not correspond to the biggest aggregated volume; most incidents arrive during the day, roughly between 7:00 and 16:00.
And finally, how many incidents per week?
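And a sketch for the third chart, counting incidents per calendar week of 2018 (again, the column name is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

gold = pd.read_parquet("/gold/reference/incidents.parquet")

opened = pd.to_datetime(gold["opened_at"])
weeks_2018 = opened[opened.dt.year == 2018].dt.isocalendar().week
weeks_2018.value_counts().sort_index().plot.bar()
plt.xlabel("week of 2018")
plt.ylabel("number of incidents")
plt.show()
```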
In 2018 the weekly volume is highly variable, but in most cases we saw 500 – 1000 incidents per week.
Summary
Having examined the Reference Notebook, you should now understand:
- what the Lagoon Data Lake is and how it makes an analyst's work easier
- what zones and workflows are
- which things you don't have to do
What next? Go to the internal Lagoon documentation and try to:
- execute more advanced Data Science on the Reference Workflow set
- import your own data set and build your own workflow using Apache Airflow
- when building advanced, time-consuming models, experiment with parallel execution using Spark and HDFS, executed directly from jupyter, zeppelin or console
If you're not a Sopra Steria employee, you don't have access to any of this, but I hope the explanation was clear enough to grasp what we have built and why. In your own organization, you can build a simple but useful data lake along similar lines, implementing the workflow-based approach across zones. Such a flexible infrastructure can be used for all kinds of advanced analytics, not just for infrastructure tickets. For instance, we are now looking at the analysis of raw log files. Technically, the data engineering part is no different from ITSM tickets: we acquire the data, clean it, and proceed to exploration and, finally, reasoning.
To learn more, write to me; I am happy to help, discuss, or hear any comments.