Data Engineering + Data Science: building the full stack

This article is part of the Big Data in 30 hours course material, meant as a reference for the students.

In our class we have looked at a number of Data Engineering and Data Science technologies. You may be wondering how they play together. It is now time to build an end-to-end workflow resembling a production environment.

An example data streaming architecture

In Data Science, we primarily focus on Machine Learning scenarios such as supervised, unsupervised, and reinforcement learning. In classification and regression, much of the effort is spent on training the model and tuning its parameters. In a typical production environment, however, the engineering effort lies elsewhere. You would typically deploy a model that has already been trained. You want it to work smoothly, process data and make decisions within the required response time, and be reliable. This is where we need Data Engineering.

Let’s take a simple example. Our task is to analyse the Web server logs and alert the sysadmin to anomalies, such as a DoS attack or suspicious login attempts. Our approach is to use historic data to train a model that distinguishes between normal and suspicious activity. We would then deploy the trained model in production.
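As a minimal sketch of the training step, the snippet below counts requests per client IP in Apache-style logs and "learns" an anomaly threshold from historic traffic. In a real project you would likely use a proper model (e.g. from scikit-learn); the mean-plus-k-standard-deviations rule here is just an illustrative stand-in, and the log format assumed is the common Apache one.

```python
import re
from collections import Counter
from statistics import mean, stdev

# Matches the client IP at the start of an Apache-style access log line.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def requests_per_ip(log_lines):
    """Count requests per client IP in a batch of access-log lines."""
    counts = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def fit_threshold(historic_counts, k=3.0):
    """'Train' on historic per-IP counts: flag anything beyond
    mean + k standard deviations of normal activity."""
    values = list(historic_counts.values())
    return mean(values) + k * stdev(values)

def anomalies(current_counts, threshold):
    """Return the IPs whose request volume exceeds the learned threshold."""
    return [ip for ip, n in current_counts.items() if n > threshold]
```

Training amounts to calling `fit_threshold` on counts from the historic logs; in production, `anomalies` is applied to each fresh batch of traffic.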

The model needs access to real-time data, so we need a data transport method. One of several options is to build an event-driven data streaming architecture with Kafka. A data stream representing the successive www log events would flow from the producer (the Web server) to the consumer (the analytical model). The model would thus receive the ingested data in real time and make decisions fast.
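A sketch of both ends of that pipeline might look as follows. It assumes the kafka-python package and a broker at localhost:9092; the topic name `www-events` and the model's `predict` interface are made up for illustration.

```python
import json

def serialize(event):
    """Encode a www log event as JSON bytes for the Kafka topic."""
    return json.dumps(event).encode("utf-8")

def produce(logfile, topic="www-events", servers="localhost:9092"):
    """Publish each log line as an event (assumes kafka-python and a running broker)."""
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=servers)
    with open(logfile) as f:
        for line in f:
            producer.send(topic, serialize({"raw": line.rstrip()}))
    producer.flush()

def consume(model, topic="www-events", servers="localhost:9092"):
    """Feed incoming events to the trained model and alert on anomalies."""
    from kafka import KafkaConsumer
    for msg in KafkaConsumer(topic, bootstrap_servers=servers):
        event = json.loads(msg.value)
        if model.predict(event):  # hypothetical interface of the trained model
            print("ALERT:", event["raw"])
```

The producer would run next to the Web server, the consumer next to the model; Kafka decouples the two, so either side can be restarted or scaled independently.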

An advantage of this approach is horizontal scalability. You can easily add more producers, supplying more Kafka streams from various data sources, and more consumers, analysing or storing the data. You can also build intermediary services that process the data (ETL, cleansing, filtering) and push it on to another stream. Other services can link Kafka with the powerful storage models we learned during this class, such as NoSQL databases (like MongoDB or Elastic), a data lake (like HDFS), or relational databases and warehouses (like Oracle). The possibilities are endless and the architecture is promising, if your business problem can be represented as a stream of events.
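An intermediary service of this kind can be sketched as a consumer that cleanses each event and re-publishes it to a second topic. The field names and the `www-events` / `www-clean` topic names are illustrative assumptions, as is the kafka-python dependency.

```python
import json

def cleanse(event):
    """Drop monitoring noise and keep only the fields downstream consumers need."""
    path = event.get("path", "")
    if path.startswith("/health"):   # filter out health-check chatter
        return None
    return {"ip": event.get("ip", "").strip(), "path": path}

def relay(src="www-events", dst="www-clean", servers="localhost:9092"):
    """Read from one Kafka topic, cleanse, and push to the next (assumes kafka-python)."""
    from kafka import KafkaConsumer, KafkaProducer
    producer = KafkaProducer(bootstrap_servers=servers)
    for msg in KafkaConsumer(src, bootstrap_servers=servers):
        cleaned = cleanse(json.loads(msg.value))
        if cleaned is not None:
            producer.send(dst, json.dumps(cleaned).encode("utf-8"))
```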

We drew the diagram with Kafka used as transport, but event-driven streaming architecture is just one possible choice. In essence, we need a method for the components to communicate and exchange data. Another choice is Service Oriented Architecture (often with RESTful microservices). It is also possible to exchange the data at the storage level. In that case the architecture would be centered around some efficient storage model: a Data Lake over HDFS, NoSQL with MongoDB, Oracle, or BigTable, to give a few examples. Often the choice will be dictated by the commercial provider we already use: Amazon AWS, Azure, Google, Cloudera, Oracle.
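In the RESTful-microservice variant, the model would sit behind an HTTP endpoint instead of a Kafka consumer. Below is a minimal sketch using only the Python standard library; the scoring rule and the 300-requests-per-minute cutoff stand in for a real trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(event):
    """Stand-in for the trained model: flag suspiciously busy clients.
    The 300-requests-per-minute cutoff is an illustrative assumption."""
    return {"ip": event["ip"], "anomaly": event.get("requests_per_min", 0) > 300}

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON event, score it, and reply with the verdict.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.dumps(score(json.loads(body))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("", 8000), ScoreHandler).serve_forever()
```

Any component that can issue an HTTP POST can now query the model, which is the main appeal of this style: the transport is universal, at the cost of the buffering and replay that Kafka gives for free.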

Development and deployment

When developing the components of this architecture, you will typically maintain your source code in a version control system such as Git, which we learned in this class.

When it comes to the deployment method, one frequent choice is Docker. As an example, the figure below shows the possible contents of a Docker container generating the Kafka stream of www events we described earlier. If you intend to deploy a test environment (for testing and validation), you will probably need a traffic simulator generating streams of events. It then makes sense to deploy the simulator, the www server and the Kafka producer all in one container, so you can launch the test with a single command. This defines your containerized deployment for the test environment (shown in the diagram below). You would have another containerized deployment for the production environment, which of course would not contain a simulator.
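A Dockerfile for such a test-environment container might look roughly like this. All file names (`traffic_simulator.py`, `produce_events.py`, `start_test_env.sh`) are hypothetical placeholders for your own scripts; only the kafka-python dependency comes from the pipeline described above.

```dockerfile
# Hypothetical test-environment image: traffic simulator + www server + Kafka producer
FROM python:3-slim
RUN pip install kafka-python
# Placeholder scripts: the simulator, the producer, and a startup script
# that launches the www server alongside them.
COPY traffic_simulator.py produce_events.py start_test_env.sh ./
CMD ["sh", "start_test_env.sh"]
```

The production image would be built from a near-identical Dockerfile that simply omits the simulator, so the two deployments stay easy to compare.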

Summary

This roughly describes the type of architectures we will be building at the end of the Big Data in 30 hours class. This exercise also brings together what we learned during the class. To sum up:

  1. We use Data Science to build and train the models: supervised, unsupervised, and reinforcement learning; classification, regression, clustering, collaborative filtering, NLP, and Deep Learning.
  2. We use Software Engineering to manage the lifecycle of our software: we control versioning using Git, run unit tests, and isolate and deploy using Docker.
  3. We use Data Engineering to integrate models with production architectures, such as streams, microservices, a data lake, a data warehouse, or NoSQL. Processing power would be managed by Amazon AWS, Hadoop or Spark.

I hope you enjoyed the class. Feel free to contact me.
