With substantial analytical needs at Sopra Steria Apps, we are looking to expand our Data Science environment. My thoughts go towards a Data Lake architecture, approached from a concrete angle: we have practical requirements and know quite precisely what we want. I’ve also done some research in the community. It struck me that the ‘official’ rhetoric on what data lakes are is often detached from the ‘practical’ needs of the analytical community, adding a layer of unnecessary complexity. In this article I will attempt to separate the practical Data Lake requirements from the superfluous ones. (The diagrams are just a colorful sample of the volumetric analysis we’re running on vast amounts of data, which is the reason we’re after the Lake.)
1. The popular answers
What’s the point of building a Data Lake and moving data there? A data lake is a large storage repository that holds a vast amount of raw data in its native format until it is needed. In practical terms, when people say ‘Data Lake’ they often mean large quantities of raw data organized in non-hierarchical, flat storage over Hadoop HDFS or Amazon S3 (competitors include Microsoft Azure and Google Cloud). My earlier article analyses why people build the lakes in the first place. Frequently quoted reasons are: to keep the data in one place, to keep it in its native format, and to gain access to an existing tooling ecosystem (such as the Apache Hadoop / Spark family).
The promise of data lakes (as opposed to traditional warehouses) involves scalability, lower cost, removal of silos and the ability to keep the raw data intact thanks to schema-on-read, where structure is applied when the data is read rather than when it is stored. However, moving relational data to the lake raises serious and obvious concerns: loss of structure means loss of meaning. Data that was formerly read/write is now append-only. Simple things like access control, data consistency, constraints, indexes, views and aggregates might need to be reinvented.
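To make schema-on-read concrete, here is a minimal Python sketch using only the standard library. The sample records, the field names and the `read_with_schema` helper are all assumptions made for illustration and do not come from any particular lake product. The raw lines are stored untouched, and types are applied only when a reader asks for them.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"id": "1", "amount": "19.90", "country": "FR"}',
    '{"id": "2", "amount": "n/a", "country": "NO"}',
    '{"id": "3", "amount": "5", "extra": "ignored by this schema"}',
]

# A schema here is just a mapping from field name to a casting function.
SALES_SCHEMA = {"id": int, "amount": float, "country": str}


def read_with_schema(lines, schema):
    """Yield dicts typed by `schema`.

    Fields that are missing or fail to cast become None; fields that are
    not named in the schema are ignored. The raw lines are never modified.
    """
    for line in lines:
        record = json.loads(line)
        typed = {}
        for field, cast in schema.items():
            try:
                typed[field] = cast(record[field])
            except (KeyError, ValueError, TypeError):
                typed[field] = None
        yield typed


rows = list(read_with_schema(RAW_LINES, SALES_SCHEMA))
for row in rows:
    print(row)
```

A different analyst could read the same raw lines with a different schema, which is the flexibility the lake promises. The price is visible too: nothing stops a bad value such as `"n/a"` from being stored, so every reader has to decide how to handle it.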
Then what’s wrong with a traditional, dimensional data warehouse? Suppose we already have one: why would we additionally want to move tables to Hadoop? My earlier article touches on this, as does this StackOverflow thread, ‘What’s the point of a table in data lake’. Melissa Coates in this article (2018) made the same question a bit more precise, and James Serra answered it in this article, providing a few more compelling cases in which moving SQL tables to the lake actually makes sense:
- for an ETL offload. In the traditional ETL workflow, you may use Spark/Hadoop in place of the staging database. If processing on Hadoop is more efficient, why not?
- as backup. An RDBMS implies throwing away most of the data and keeping just the well-governed, cleansed subset. If you want to keep a backup of the raw data, use a data lake with the data in its native format.
- to take advantage of the growing family of data-lake tools, optimized to work with each other. For instance, loading data to Azure SQL Data Warehouse with PolyBase (which allows SQL Server to read Hadoop data) is faster than with SSIS (Microsoft’s traditional ETL tool).
IBM Analytics provides a more strategic view on this. The IBM Industry Model (2016) quotes a few more reasons for shifting data to the lake, some of which include:
- provide the necessary data lineage back to the source systems (for governance and regulatory compliance)
- enable users with self-service data access
- reduce IT maintenance effort by infrastructure consolidation
- create a set of canonical data structures to enforce consistency and standardization across the organization, enabling better reuse of data across the different lines of business (LOBs)
I like those answers, but I also note that they do not correspond to the needs of the analytics community around me and are mostly superfluous.
2. The practical answer
The typical pre-Data Lake scenario seems to be as follows. Here’s a group of data analysts working on laptops. They download data from a data warehouse and work on predictive analysis, clustering, anomaly detection, and data profiling. What typically makes their work tedious? Repetitive, administrative and uncreative tasks that take time. Typically those tasks are related to data access, ingestion, cleansing and transformation. Hence, we are looking for a very practical data-sharing environment to make their work more effective.
Here’s my proposed list of the main drivers to push the analytics community towards data lakes. The must-have Data Lake features are:
- Data sharing environment, enabling simple pipelines so that one person’s result can be ingested as input by another person’s processes
- Common repository with data lineage. Many repetitive, time-consuming data preparation tasks can be done just once, so that all analysts have access to the result. As an example, besides the source data set, we want a copy cleaned of null and bogus values, another copy with precomputed auxiliary columns, and so on. In practice, the popular data sets pertaining to the master data management entities will have hundreds of live copies.
- Real-time automation, such as ETL daemons / Kafka processes, delivering an up-to-date copy of the data from the sources. The alternative is setting up individual access to the Oracle data warehouse repository for each analyst, forcing them to make friends with tools, access rights and procedures, and to log in and run SQL SELECT queries whenever they need data. Time-consuming, cumbersome, frustrating.
- Access to a common set of libraries and tools. As an example, anyone working on even relatively simple unsupervised learning tasks will note that several key algorithms are conspicuously absent from the standard Python / pandas / sklearn stack and need to be installed separately. Much time can be saved when these needs are catered for centrally in the Data Lake, freeing analysts from tool installation, versioning, and conflict resolution.
- Strong hardware resources. Actually, a gain in efficiency over a laptop-based environment can be achieved with quite basic capital expenditure: Gigabit Ethernet, a bunch of dedicated CPUs / GPUs available overnight and a few terabytes of storage are sufficient for most tasks that laptops can’t handle.
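The ‘common repository with data lineage’ item can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the file names, the `publish_clean_copy` helper and the `.lineage.json` sidecar convention are hypothetical, and a temporary directory stands in for the shared file system. The idea is that one analyst publishes a cleaned copy once, together with a small record of where it came from, so that others can reuse it instead of repeating the cleansing.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, pinning the exact input version."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def publish_clean_copy(source, target, required_fields):
    """Write a copy of `source` without rows that have empty required fields,
    plus a JSON sidecar file recording how the copy was derived."""
    with open(source, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)
    # Keep only rows where every required field is present and non-blank.
    kept = [r for r in rows if all((r[f] or "").strip() for f in required_fields)]

    with open(target, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)

    lineage = {
        "derived_from": str(source),
        "source_sha256": file_digest(source),
        "step": "drop rows with empty required fields",
        "required_fields": list(required_fields),
        "rows_in": len(rows),
        "rows_out": len(kept),
    }
    Path(str(target) + ".lineage.json").write_text(json.dumps(lineage, indent=2))
    return lineage


# Usage: a throwaway directory stands in for the shared repository.
with tempfile.TemporaryDirectory() as lake:
    raw = Path(lake) / "customers_raw.csv"
    raw.write_text("id,name,city\n1,Ann,Oslo\n2,,Paris\n3,Bob,\n4,Eve,Rome\n")
    info = publish_clean_copy(raw, Path(lake) / "customers_clean.csv", ["name", "city"])
    clean_text = (Path(lake) / "customers_clean.csv").read_text()
```

The next analyst’s process can take `customers_clean.csv` as its input and write its own sidecar, which gives a simple chain of derived copies without any dedicated lineage tooling.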
In contrast, the requirements quoted in the previous section are superfluous and outside the real needs of most analytical teams I spoke to. As an example, none of my contacts necessarily needs the data in its native format.
Separating our must-haves from the superfluous features made the research clearer.
Some purists may say we are not building a data lake but a mere data puddle. I’m fine with that. Maybe a Data Puddle is all we need? We’re driven by practical needs rather than fancy names. This realization has a practical impact on our current approach. We’re currently looking beyond the usual suspects (Spark / AWS / Azure). Some of the interesting base ideas include PaaS solutions such as Dataiku, but also barebones integrations of a bunch of Ubuntu virtual machines attached to a common filesystem, or a logical data lake spanning a few of those platforms.