Big Data in 30 hours

The goal of this technical, hands-on class is to introduce practical Data Engineering and Data Science to technical personnel (corporate, academic or students), over 15 lectures (2 hours each). All subjects are introduced by examples that students are expected to play with immediately, using either command-line or GUI tools.

Prerequisites: the participants need to be technical, reasonably fluent in general programming and operating systems, with basic exposure to the Linux shell, databases, and SQL. Working knowledge of Python will be needed for lectures 9-15.

At the moment (2018), I am teaching the class at Cracow University of Technology (Politechnika Krakowska), Faculty of Physics, Mathematics and Computer Science. The audience is final-year students of the graduate Master's degree in Informatics, Data Analytics specialty. Courtesy of the Faculty, the lectures are open to all. If you live in Kraków, Poland, please come: every Wednesday at 9:00 am, Room F020, ul. Podchorążych 1, Kraków (Politechnika Krakowska, Wydz. Fiz Mat i Inf), until 23rd January 2019. Some resources:

The student resources page with materials, links and collateral.

The LinkedIn discussion group (open to all): Big Data in 30 hours

To learn more, contact me. The syllabus follows.

Lecture 1. Linux power tools: sed, awk, grep. Software engineering environment: Git, Docker.
For simple transformations of flat files, the Linux shell power tools are all you need, and mastering them is helpful before moving on to advanced analytics. To produce quality software, you will also need version control (Git) and a deployment/isolation framework (Docker).

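To give the flavor: here is the same kind of filter-and-rewrite job, sketched in Python for comparison with the grep and sed one-liners we will write in class (the log lines are made up):

    import re

    lines = [
        "2018-11-07 ERROR disk full",
        "2018-11-07 INFO backup done",
        "2018-11-08 ERROR timeout",
    ]

    for line in lines:
        if "ERROR" in line:                          # grep: keep matching lines
            print(re.sub(r"^(\S+)", r"[\1]", line))  # sed: bracket the date field
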
Lecture 2. Relational DBMS and SQL: SQLite. DataOps.
Understand how organizing data in third normal form helps processing. Understand the power of SQL, but also the limits and performance caveats of SQL queries in a traditional RDBMS. Also: an introduction to DataOps.

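A minimal sketch of the idea, using Python's built-in sqlite3 module (the table and its data are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")   # throwaway in-memory database
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                     [("alice", 120.0), ("bob", 75.5), ("alice", 19.99)])

    # let the database do the aggregation, not the application code
    for customer, total in conn.execute(
            "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
        print(customer, total)
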
Lecture 3. Data Warehousing and ETL: Oracle.
When is a transactional system not enough? Why and how are data warehouses built? We will build our own ETL process.

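The shape of an ETL step, sketched in Python with SQLite standing in for the warehouse; the source file sales.csv and its columns are hypothetical:

    import csv
    import sqlite3

    warehouse = sqlite3.connect("warehouse.db")
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (day TEXT, product TEXT, revenue REAL)")

    with open("sales.csv", newline="") as src:        # extract
        for row in csv.DictReader(src):
            revenue = float(row["revenue"])
            if revenue <= 0:                          # transform: drop bad records
                continue
            warehouse.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                              (row["day"], row["product"].lower(), revenue))
    warehouse.commit()                                # load complete
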
Lecture 4. OLAP and BI: Tableau.
Visualization matters. Understand the role of traditional data analytics and management reporting.

Lecture 5. Non-relational databases: MongoDB, BigTable, CosmosDB.
We will experiment with data sets that are more effectively searched and transformed in NoSQL environments.

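For instance, with MongoDB through the pymongo driver (this assumes a MongoDB server on localhost; the database, collection and fields are made up):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    events = client.demo.events          # no schema to declare up front

    events.insert_one({"user": "alice", "action": "login", "tags": ["web", "mobile"]})

    # querying by a value inside an array needs no schema change
    for doc in events.find({"tags": "mobile"}):
        print(doc["user"], doc["action"])
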
Lecture 6. Distributed filesystems: Hadoop, MapReduce, Hive.
Let's store data in HDFS and run parallel MapReduce operations against it. Understand the Hadoop family of tools.

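The MapReduce idea itself fits into a few lines of plain, single-machine Python; Hadoop's contribution is running the same three phases across a cluster:

    from collections import defaultdict

    docs = ["big data in 30 hours", "data engineering and data science"]

    # map: each document emits (word, 1) pairs
    pairs = [(word, 1) for doc in docs for word in doc.split()]

    # shuffle: group values by key (Hadoop does this across the cluster)
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)

    # reduce: aggregate each group
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)
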
Lecture 7. Cloud: Apache Spark, Amazon AWS.
Run your data operations on dynamically created resources. When is it cost-effective? Also, an intro to Spark.

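A first taste of Spark, using the pyspark package and a local master; on AWS the same code would point at a cluster instead of local[*]:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

    # distribute a million numbers, square them, sum them in parallel
    rdd = spark.sparkContext.parallelize(range(1000000))
    print(rdd.map(lambda x: x * x).sum())

    spark.stop()
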
Lecture 8. Streaming: Kafka.
Handling and analyzing real-time data streams instead of static data sets.

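A minimal producer sketch with the kafka-python package; it assumes a broker on localhost:9092, and the topic name is made up:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # fire-and-forget a JSON-encoded event onto the "clicks" topic
    producer.send("clicks", b'{"user": "alice", "page": "/home"}')
    producer.flush()     # block until delivery is confirmed
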
Lecture 9. Data Scientist routine: NumPy, pandas, matplotlib.
Hands-on intro to the entry steps of a data scientist's routine: data cleansing, data exploration, and data transformation.

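Typical first steps with pandas; the data frame is constructed inline so the sketch is self-contained:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 40],
                       "city": ["Krakow", "krakow", None, "Krakow"]})

    df = df.drop_duplicates()
    df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
    df["city"] = df["city"].str.title()                # normalize text
    print(df.describe(include="all"))                  # quick exploration
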
Lectures 10-12. Regression, prediction, collaborative filtering: SciPy, scikit-learn.
Three lectures devoted to three distinct analytics techniques for generating insight from data.

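For example, fitting a regression with scikit-learn on synthetic data; the same fit/predict pattern carries over to the other estimators we will use:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    np.random.seed(0)
    X = np.random.uniform(0, 10, size=(100, 1))            # one feature
    y = 3.0 * X[:, 0] + 2.0 + np.random.normal(0, 0.5, 100)  # noisy linear target

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)                   # close to 3.0 and 2.0
    print(model.predict([[5.0]]))                          # predict for a new point
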
Lectures 13-15. Anomaly detection, classification, deep learning: Keras, TensorFlow.
We will prepare tensors of data and tweak the layered network parameters to optimize the learning and add value to analytics.

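A minimal Keras workflow on random data, just to show the define-compile-fit cycle; real lectures will use real datasets:

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(200, 20)              # 200 samples, 20 features
    y = (X.sum(axis=1) > 10).astype(int)     # toy binary labels

    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)
    print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]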