[Note: The content of this page, created in 2017, is a bit dated now. It is kept here for historical reasons.]
The goal of this technical, hands-on class is to introduce practical Data Engineering and Data Science to technical personnel in industry and academia, as well as students, over 15 lectures of 2 hours each, for a total of 30 hours. All subjects are introduced by example, and students are expected to try them out immediately using command-line or GUI tools.
Prerequisites: participants need to be technical and reasonably fluent in general programming and operating systems, with basic exposure to the Linux shell, databases, and SQL. Working knowledge of Python is needed for Lectures 9-15.
The course was originally developed as a class at Cracow Technical University (Politechnika Krakowska), Faculty of Physics, Mathematics and Computer Science. The audience was final-year students of the graduate Master’s degree programme in Informatics, Data Analytics specialty.
The course is available commercially. Due to the size of the material, I also offer the most popular content of the class condensed to 2-8 hours, in two versions: (1) Data Engineering Pipelines and (2) Data Engineering for Data Scientists.
Some resources:
The student resources page with materials, links, and collateral. Note that only selected, fragmentary material is available online, so it is not suitable as a self-contained resource for self-learners.
The LinkedIn discussion group (open to all): Big Data in 30 hours
To learn more or request a quotation, contact me. The syllabus follows.
Lecture number | Title | Description |
Lecture 1 | Linux power tools: sed, awk, grep. Software Engineering environment: Git, Docker | For simple transformations of flat files, Linux shell power tools are all you need. Mastering these is helpful before moving on to advanced analytics. And to produce quality software, you will need version control (Git) and a deployment/isolation framework (Docker). |
Lecture 2 | Relational DBMS and SQL: Sqlite. DataOps. | Understand how organizing data in third normal form helps processing. Understand the power of SQL, but also the limits and performance caveats of SQL queries in a traditional RDBMS (see the SQLite sketch after the table). Also: an introduction to DataOps. |
Lecture 3 | Data Warehousing and ETL. Oracle | When is the transactional system not enough? Why and how are data warehouses built? We will build our own ETL process. |
Lecture 4 | OLAP and BI. Tableau | Visualization matters. Understand the role of traditional data analytics and management reporting. |
Lecture 5 | Non-relational. Mongo, BigTable, CosmosDB | We will experiment with data sets that are more effectively searched and transformed in NoSQL environments (see the document-store sketch after the table). |
Lecture 6 | Distributed filesystems. Hadoop, MapReduce, Hive | Let’s store data across HDFS and run parallel MapReduce operations against it (word-count sketch after the table). Understand the Hadoop family of tools. |
Lecture 7 | Cloud. Apache Spark, Amazon AWS | Run your data operations on dynamically provisioned resources. When is it cost-effective? Also, an intro to Spark (sketch after the table). |
Lecture 8 | Streaming. Kafka | Handling and analyzing real-time data streams instead of static data (producer/consumer sketch after the table). |
Lecture 9 | Data Scientist routine. Numpy, pandas, matplotlib. | A hands-on introduction to the first steps of the data scientist routine: data cleansing, data exploration, and data transformation (pandas sketch after the table). |
Lectures 10-12 | Regression, Prediction, Collaborative filtering. SciPy, Scikit-learn | Three lectures devoted to three distinct analytics techniques for generating insight from data (regression sketch after the table). |
Lectures 13-15 | Anomaly detection, Classification, Deep learning. Keras, TensorFlow. | We will prepare tensors of data and tune the parameters of layered networks to optimize learning and add value to analytics (Keras sketch after the table). |
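Below are a few illustrative Python sketches for selected lectures; they are minimal examples in the spirit of the class, not the actual course material. First, Lecture 2: a SQLite sketch using Python's built-in sqlite3 module, with a made-up customers/orders schema, showing how third-normal-form tables are combined with a JOIN.

```python
import sqlite3

# In-memory database; no server setup needed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two tables in (roughly) third normal form: customers and their orders.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customers(id), amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 9.99), (2, 1, 30.00), (3, 2, 5.50)])

# A JOIN aggregates order amounts per customer without duplicating customer data.
for name, total in cur.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
):
    print(name, total)
```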
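Lecture 5: a document-store sketch, assuming a MongoDB instance on localhost and the pymongo driver; the collection name and documents are invented for illustration.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance on localhost:27017 and `pip install pymongo`.
client = MongoClient("mongodb://localhost:27017")
db = client["classdemo"]  # database name is arbitrary for this sketch

# Documents need no fixed schema; each one can carry different fields.
db.movies.insert_many([
    {"title": "Alien", "year": 1979, "tags": ["sci-fi", "horror"]},
    {"title": "Heat", "year": 1995, "cast": ["Pacino", "De Niro"]},
])

# Query by array membership, something awkward to express relationally.
for doc in db.movies.find({"tags": "sci-fi"}, {"_id": 0, "title": 1, "year": 1}):
    print(doc)
```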
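Lecture 6: a word-count sketch of the MapReduce model, run locally so the map, shuffle, and reduce phases are easy to trace. With Hadoop Streaming, the map and reduce steps would be two separate scripts reading stdin and writing tab-separated key/value lines.

```python
# Minimal word count in the MapReduce style, executed locally.
from itertools import groupby
from operator import itemgetter

lines = ["big data in 30 hours", "data engineering and data science"]

# Map: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: Hadoop sorts and groups records by key between the two phases.
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each word.
for word, pairs in groupby(mapped, key=itemgetter(0)):
    print(word, sum(count for _, count in pairs))
```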
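Lecture 7: a Spark sketch, assuming pyspark is installed and running in local mode; the tiny in-memory DataFrame stands in for data that would normally be read from HDFS or cloud storage.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes `pip install pyspark`; runs in local mode, no cluster required.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# A small DataFrame standing in for a real, distributed dataset.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("carol", 29)],
    ["name", "age"],
)

# The same declarative operations scale from a laptop to a cluster.
df.filter(F.col("age") > 30).show()
df.agg(F.avg("age").alias("avg_age")).show()

spark.stop()
```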
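Lecture 8: a streaming sketch, assuming a Kafka broker on localhost:9092 and the kafka-python package; the topic name and messages are invented.

```python
from kafka import KafkaProducer, KafkaConsumer

# Assumes a Kafka broker on localhost:9092 and `pip install kafka-python`.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    # Each event is just bytes; in practice you would serialize JSON or Avro.
    producer.send("sensor-readings", f"reading {i}".encode("utf-8"))
producer.flush()

# A consumer reads the stream independently, at its own pace.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop polling after 5 s of silence
)
for message in consumer:
    print(message.value.decode("utf-8"))
```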
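Lecture 9: the cleanse-explore-transform routine on a small synthetic DataFrame (the city/temperature data is invented).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data standing in for a messy real-world file.
df = pd.DataFrame({
    "city": ["Krakow", "Warsaw", "Krakow", "Gdansk", None],
    "temp_c": [21.5, 19.0, np.nan, 17.2, 20.1],
})

# Cleansing: drop rows with missing values.
clean = df.dropna().copy()

# Exploration: summary statistics and group means.
print(clean.describe())
print(clean.groupby("city")["temp_c"].mean())

# Transformation and a simple plot.
clean["temp_f"] = clean["temp_c"] * 9 / 5 + 32
clean.plot.bar(x="city", y="temp_f")
plt.show()
```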
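Lectures 10-12: a regression sketch with scikit-learn on synthetic data, so the learned coefficient can be checked against the known ground truth.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("learned coefficient:", model.coef_[0])   # should be close to 3.0
print("R^2 on held-out data:", model.score(X_test, y_test))
```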
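Lectures 13-15: a minimal Keras sketch, fitting a small layered network to synthetic tensors; the layer sizes, activations, and epoch count are the kind of parameters the lectures tune.

```python
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data shaped as a 2-D tensor (samples, features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")   # a simple, learnable rule

# A small layered network.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print("training accuracy:", acc)
```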