We are in the middle of this semester’s Big Data in 30 Hours class. We have just finished lecture 7 of 15. So far we have covered Relational Databases, Data Warehousing, BI (Tableau), NoSQL (MongoDB and Elasticsearch), Hadoop, HDFS and Apache Spark. As we are about to move on to Data Science in Python (with NumPy, scikit-learn, Keras and TensorFlow), I received valuable feedback from the students: rather than learning a little bit of everything, they would prefer to focus on the infrastructure. In short, more Data Engineering, less Data Science.
I was impressed by this mature feedback and will now be working on modifying the class content. I am currently deciding which infrastructure technologies and tools are the most important for a Data Scientist to be aware of. In other words: what is the best way to spend 30 teaching hours introducing a Data Scientist to the Data Engineering and Software Engineering infrastructure? My tentative list is below, and I would be glad to hear any comments. Am I missing something? (Note that high-level technologies, such as Python/R programming and libraries for classification, regression, machine translation etc., are out of scope here and treated as a separate subject.)
| index | class | technologies |
|-------|-------|--------------|
| 1 | OS | Linux console orientation, bash, power tools (sed, awk, grep, etc.). Formats: CSV, plain text, JSON, XML (see the first sketch below) |
| 2 | relational stack | RDBMS (Oracle, SQLite), Data Warehousing, ETL |
| 3 | data viz | Business Intelligence, Tableau, Elastic Kibana |
| 4 | parallel processing | HDFS, MapReduce, Apache Spark, Hive |
| 5 | NoSQL | key-value/document (MongoDB, Elasticsearch), graph databases (TigerGraph), columnar (HBase) |
| 6 | streaming | Kafka |
| 7 | cloud | AWS |
| 8 | workflow | Airflow? |
| 9 | deployment | Docker, Kubernetes |
| 10 | software engineering | DevOps (DataOps?), version control (git), Python unittest (see the second sketch below) |
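To give a flavor of the hands-on style I have in mind for class 1 (formats), here is a minimal Python sketch that converts a CSV file into JSON using only the standard library. The file name `people.csv` and its columns are hypothetical, not actual course material.

```python
import csv
import json

# Hypothetical input: people.csv with the columns name, age, city
with open("people.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Cast the numeric column, then emit the records as a JSON array
for row in rows:
    row["age"] = int(row["age"])

print(json.dumps(rows, indent=2))
```

The point of such an exercise would be to contrast the Python route with a one-liner built from the command-line power tools listed in the same row.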
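And a similarly minimal sketch for class 10 (Python unittest); the `normalize_header` helper is a made-up example, chosen only to show the shape of a test case.

```python
import unittest

def normalize_header(name: str) -> str:
    """Hypothetical helper: turn a raw CSV column name into snake_case."""
    return name.strip().lower().replace(" ", "_")

class NormalizeHeaderTest(unittest.TestCase):
    def test_strips_and_lowercases(self):
        self.assertEqual(normalize_header("  First Name "), "first_name")

    def test_is_idempotent(self):
        self.assertEqual(normalize_header("first_name"), "first_name")

if __name__ == "__main__":
    unittest.main()
```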