We are in the middle of this semester’s Big Data in 30 Hours class. We have just finished lecture 7 of 15. So far we have covered Relational Databases, Data Warehousing, BI (Tableau), NoSQL (MongoDB and Elasticsearch), Hadoop, HDFS and Apache Spark. Just as we are about to move on to Data Science in Python (with NumPy, scikit-learn, Keras and TensorFlow), I received valuable feedback from the students. Rather than learning a little bit of everything, they would prefer to focus more on the infrastructure: more Data Engineering, less Data Science.

I was impressed by this mature feedback and will now be working on modifying the class content. I am currently deciding which infrastructure technologies and tooling are the most important for a Data Scientist to be aware of. In other words, what is the best way to spend 30 teaching hours introducing a Data Scientist to the Data Engineering and Software Engineering infrastructure? My tentative list is below, and I would be glad to hear any comments. Am I missing something? (Note that high-level technologies, such as Python/R programming and libraries for classification, regression, machine translation etc., are left out of the list and considered a separate subject.)

1. OS: Linux console orientation, bash, power tools (sed, awk, grep etc.). Formats: csv, plaintext, JSON, XML (see the first sketch after the list)
2. Relational stack: RDBMS (Oracle, SQLite), Data Warehousing, ETL
3. Data viz: Business Intelligence, Tableau, Elastic Kibana
4. Parallel processing: HDFS, MapReduce, Apache Spark, Hive (see the Spark sketch after the list)
5. NoSQL: key-value/document (MongoDB, Elastic), graph databases (TigerGraph), columnar (HBase)
6. Streaming: Kafka
7. Cloud: AWS
8. Workflow: Airflow?
9. Deployment: Docker, Kubernetes
10. Software engineering: DevOps (DataOps?), version control (git), Python unittest (see the unittest sketch after the list)
Top 10 Data Infrastructure technologies for a Data Scientist
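To give a flavour of the level I am aiming at, here are three minimal sketches, all in Python to keep the examples in one language. The first relates to item 1, the file formats part (in class, the shell tools like sed, awk and grep would of course come first): reading a small csv file and re-emitting it as JSON. The file name and columns are made up for illustration.

import csv
import json

# Hypothetical input: a small csv file with a header row,
# e.g. name,age,city
with open("people.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Re-emit the same records as JSON, another of the formats in item 1.
print(json.dumps(rows, indent=2))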
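For item 4 (parallel processing), a classic starter exercise is a word count in Apache Spark, which we already met in lecture 7. A minimal PySpark sketch, assuming a local text file called data.txt:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read the file as lines, then apply the classic map/reduce steps.
lines = spark.read.text("data.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print a handful of (word, count) pairs.
for word, count in counts.take(10):
    print(word, count)

spark.stop()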
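For item 10 (software engineering), the kind of exercise I have in mind is a first unit test with Python’s built-in unittest module. The function under test here is a made-up helper, just to show the shape of a test case.

import unittest

def normalize(values):
    """Scale a list of numbers so they sum to 1 (hypothetical helper)."""
    total = sum(values)
    return [v / total for v in values]

class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3])), 1.0)

    def test_preserves_length(self):
        self.assertEqual(len(normalize([4, 5])), 2)

if __name__ == "__main__":
    unittest.main()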
