Data lakes are repositories where data is ingested and stored in its original form, without much (or any) preprocessing. This is in contrast to traditional data warehouses, where substantial effort goes into ETL processing, data cleansing and aggregation, to ensure that any data that comes in is clean and structured.
When people say ‘data lake’ they often mean large quantities of raw data organized in non-hierarchical, flat storage over Hadoop HDFS or Amazon S3 (competing options include Microsoft Azure Blob Storage and Google Cloud Storage). Data sets might be discoverable by tags rather than by filesystem hierarchy. Sometimes data organized in a database such as MongoDB is also referred to as a data lake; that usage is less common, but it happens.
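As a minimal sketch of tag-based organization (the bucket name, key and tags below are hypothetical), an S3 object can be labelled at ingestion time and later located by attributes rather than by its key ‘path’. Note that S3 does not let you list objects by tag directly, so discovery typically goes through an inventory report or an external catalog:

```python
# Tag an object at ingestion so it can be found by attributes rather than
# by a directory-like key hierarchy. Bucket, key and tags are hypothetical.
import boto3

s3 = boto3.client("s3")

with open("sensor-dump.json.gz", "rb") as f:
    s3.put_object(Bucket="my-lake", Key="ingest/2024/sensor-dump.json.gz", Body=f)

s3.put_object_tagging(
    Bucket="my-lake",
    Key="ingest/2024/sensor-dump.json.gz",
    Tagging={"TagSet": [
        {"Key": "source", "Value": "iot-sensors"},
        {"Key": "sensitivity", "Value": "low"},
    ]},
)
```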
Data lakes are cool, but most literature focuses on ‘how’ to build them rather than ‘why’. Before building a data lake, make sure you really need one. Here are some considerations to keep in mind during the data lake design exercise.
rationale 1: data in one location
Data lakes allow you to keep all your domain data in one place. But this fact alone is not a good-enough reason to build one. Keeping data in one place is not necessarily an advantage! Often, just the opposite. For example:
- aggregating data of various sensitivity, confidentiality and access levels only increases the trouble of making it securely available to the right personnel. Access control is a must. One way to deal with it is to provide such control at the level of a proxy REST API in front of the lake, perhaps using Apache Knox or Apigee (a toy sketch of this proxy idea follows this list). That said, why make things complex? Before being pushed into the data lake, your data was naturally protected by the silos of its domain systems; why not leave it as is?
- if you already have data in three or four locations, maybe it is better to keep it there and do nothing, instead of launching a data lake project and paying for its time, storage and effort
- keeping data distributed has advantages. You may want to keep low-latency data in one location, optimized for speed, and other data in another location, with cheaper storage and networking. Or you may group or separate data geographically, for local access. Or because some data cannot travel internationally for legal reasons. Or because you need remote active/active replicas. There are many reasons to keep various data sets separate
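To make the proxy idea concrete, here is a toy sketch, not Knox or Apigee themselves: a minimal REST endpoint that checks a caller’s entitlements before streaming an object out of the lake. The endpoint, header and entitlement table are invented for illustration:

```python
# Toy illustration of access control at a proxy REST API in front of the lake.
# Apache Knox or Apigee would enforce this via policies; names are hypothetical.
import boto3
from flask import Flask, abort, request

app = Flask(__name__)
s3 = boto3.client("s3")

# In reality this mapping would live in an IAM/LDAP-backed policy store.
ENTITLEMENTS = {"alice": {"public/", "finance/"}, "bob": {"public/"}}

@app.route("/lake/<path:key>")
def fetch(key):
    user = request.headers.get("X-User", "")
    prefixes = ENTITLEMENTS.get(user, set())
    if not any(key.startswith(p) for p in prefixes):
        abort(403)  # caller is not entitled to this part of the lake
    obj = s3.get_object(Bucket="my-lake", Key=key)
    return obj["Body"].read()
```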
Having data in one place MAY be an advantage in some situations, for instance if you need:
- uniform access control management and security policies
- uniform data retention and data curation policies (a lifecycle sketch follows this list)
- all data accessible to the same crawling services, for search, filtering and transformation
- less administration overhead
- optimized storage cost
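Uniform retention and storage-cost tiering, for instance, can be expressed once as a bucket lifecycle configuration. A minimal boto3 sketch, where the bucket name, prefixes and periods are hypothetical:

```python
# One lifecycle policy covering retention and storage-cost tiering for the
# whole lake. Bucket name, prefixes and periods are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-lake",
    LifecycleConfiguration={"Rules": [
        {   # rarely-read raw data: move to cheaper storage after 30 days
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        },
        {   # enforce a uniform retention period on ingested logs
            "ID": "expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Expiration": {"Days": 365},
        },
    ]},
)
```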
rationale 2: native format
In data lakes, data is stored in its native format. But why?
It is actually quite clever. The reason data lakes win over warehouses is similar to why Agile programming has won over the waterfall model. Humans are inherently bad at predicting the future. When building software in a traditional plan-driven model such as waterfall, chances are high that much of the effort will be wasted due to misinterpreted requirements, or new needs that arise while the project is in motion. Similarly, a lot of effort is required to build and maintain a traditional data warehouse in a dimensional model, including the ETL for data cleansing and preprocessing. Again, we are likely to do it in vain, because the environment evolves. By the time the data warehouse is complete, there will be new requirements that make the original design outdated.
Noting that storage is often cheaper than time and effort, an alternative approach is to invest zero effort in preprocessing: ingest the data as-is and store it in bulk, in the hope it will be needed tomorrow. The investment in schema design, formatting, cleansing or filtering is expended later, when the data is actually needed. In fact, much of the data might never be used, which is fine: some of the original ideas will turn out to be outdated, and other use cases will appear that require a different approach. In other words, decide as late as possible (a Lean software development principle).
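A minimal sketch of this schema-on-read approach with PySpark (the bucket, prefix and field names are hypothetical): ingestion copied raw JSON verbatim, and the schema is declared only at query time, so it can differ per use case:

```python
# Schema-on-read: ingest raw JSON as-is, decide on structure only when querying.
# Bucket, prefix and field names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Ingestion invested nothing in preprocessing: files were copied verbatim.
# The schema is declared here, at read time, and can change per use case.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", TimestampType()),
])

events = spark.read.schema(schema).json("s3://my-lake/raw/events/")
events.filter(events.event == "purchase").groupBy("user_id").count().show()
```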
Unfortunately, that argumentation isn’t universally valid. Sometimes the requirements are clear from the beginning, in the form of user scenarios that need immediate implementation. In such a case, pre- and postprocessing in the data lake makes sense. Such data lakes evolve into three or more tiers of data (raw data, clean data and analytics data), quite similar to the setup underlying traditional data warehouse ETL (transactional/relational database, staging database and warehouse/dimensional database). This article has more.
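A minimal sketch of promoting data from a raw tier to a clean tier, again with hypothetical paths and column names:

```python
# Promote data from the raw tier to a cleaned, columnar tier.
# Zone prefixes and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("raw-to-clean").getOrCreate()

raw = spark.read.json("s3://my-lake/raw/events/")   # tier 1: as ingested

clean = (raw
         .dropDuplicates(["event_id"])              # basic cleansing
         .filter(col("user_id").isNotNull())
         .withColumnRenamed("ts", "event_time"))

# tier 2: cleansed and stored as Parquet for downstream analytics
clean.write.mode("overwrite").parquet("s3://my-lake/clean/events/")
```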
Maybe this is your case. Then again, you need to ask whether you really need a data lake at all.
rationale 3: tooling ecosystem
A good reason to keep data in a lake implemented over Hadoop HDFS or Amazon S3 may be related to what you eventually want to do with the data. Both platforms offer a plethora of tools that are readily integrated and available to act as feeders (that push data into your lake) or local processes (that process, search, filter and analyse the data).
For an HDFS/Hadoop data lake, this would include MapReduce, Spark, Hive, Pig, Cloudera Impala, Apache Ignite, Storm, Flink, HAWQ, Drill, Flume, Sqoop, Kafka, ZooKeeper, Oozie, Mahout, H2O, and solutions from Confluent, Cloudera and Hortonworks.
For S3, this would be the Amazon services, including EMR (with Apache Spark, HBase, Presto, and Flink), Amazon Redshift, Kinesis, and Elasticsearch, among others.
If you envision tasks (most typically, analytics-related) that would benefit from some of those platforms, then this might be a good-enough reason to start building a data lake.
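For illustration, a sketch of that interoperability: two unrelated tools consuming the same lake files. Paths are hypothetical; pandas needs the s3fs package for s3:// URLs, and outside EMR, Spark typically addresses S3 via the s3a:// scheme with the hadoop-aws package:

```python
# The same Parquet data in the lake, consumed by two independent tools.
# Paths are hypothetical. pandas needs s3fs installed for s3:// URLs;
# outside EMR, Spark typically needs s3a:// plus the hadoop-aws package.
import pandas as pd
from pyspark.sql import SparkSession

# Tool 1: Spark, for heavy distributed processing
spark = SparkSession.builder.appName("lake-consumer").getOrCreate()
events = spark.read.parquet("s3://my-lake/clean/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT event, COUNT(*) AS n FROM events GROUP BY event").show()

# Tool 2: pandas, for lightweight local exploration of the same files
df = pd.read_parquet("s3://my-lake/clean/events/")
print(df.describe())
```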
rationale 4: after 10 years?
If you are excited to build a data lake but aren’t sure what it would serve, don’t. You are likely to end up with large quantities of data that need maintenance. Data has its lifecycle, too. At some point, perhaps after a few years, the organization might decide to decommission it. No big deal if the data lake contains outdated web logs or IoT sensor data. But if more vital, sensitive, personal or financial data is also kept there, the data retention analysis might become a time-consuming and expensive project. The effort might not be worth it, especially if most of the data has never been used. A nuclear power plant construction project might serve as an analogy: an important part of the ROI calculation is the cost of the lengthy shutdown period.
summing up
Data lakes are cool, but most literature focuses on ‘how’ to build them rather than ‘why’. I have seen projects where lakes were built for the sake of it. That is a waste of resources.
Having data in one location, in its native format, and available to the Hadoop family of tools might be useful… sometimes. Sometimes not. Ask yourself why you need the data lake.