We are in the middle of this semester’s Big Data in 30 Hours class. We have just finished lecture 7 of 15. So far we have covered Relational Databases, Data Warehousing, BI (Tableau), NoSQL (MongoDB and Elasticsearch), Hadoop, HDFS and Apache Spark. As we are about to move on to Data Science in Python (with NumPy, scikit-learn, Keras and TensorFlow), I received valuable feedback from the students: rather than learning a little bit of everything, they would prefer to focus on the infrastructure. In short, more Data Engineering, less Data Science.
I was impressed by this mature feedback and will now be working on modifying the class content. I am currently deciding which infrastructure technologies and tools are the most important for a Data Scientist to be aware of. In other words: what is the best way to spend 30 teaching hours introducing a Data Scientist to the Data Engineering and Software Engineering infrastructure? My tentative list is below, and I would be glad to hear any comments. Am I missing something? (Note that high-level technologies, such as Python/R programming and libraries for classification, regression, machine translation etc., are out of scope here and treated as a separate subject.)
| index | class | technologies |
|-------|-------|--------------|
| 1 | OS | Linux console orientation, bash, power tools (sed, awk, grep, etc.). Formats: CSV, plain text, JSON, XML (see the first sketch below) |
| 2 | relational stack | RDBMS (Oracle, SQLite), Data Warehousing, ETL |
| 3 | data viz | Business Intelligence, Tableau, Elastic Kibana |
| 4 | parallel processing | HDFS, MapReduce, Apache Spark, Hive |
| 5 | NoSQL | key-value/document (MongoDB, Elasticsearch), graph databases (TigerGraph), columnar (HBase) |
| 6 | streaming | Kafka |
| 7 | cloud | AWS |
| 8 | workflow | Airflow? |
| 9 | deployment | Docker, Kubernetes |
| 10 | software engineering | DevOps (DataOps?), version control (git), Python unittest (see the second sketch below) |
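To give a flavor of the hands-on style I have in mind for class 1 (formats), here is a minimal Python sketch that converts a CSV file into JSON using only the standard library. The file name `people.csv` and its columns are hypothetical, not actual course material.

```python
import csv
import json

# Hypothetical input: people.csv with the columns name, age, city
with open("people.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Cast the numeric column, then emit the records as a JSON array
for row in rows:
    row["age"] = int(row["age"])

print(json.dumps(rows, indent=2))
```

The point of such an exercise would be to contrast the Python route with a one-liner built from the command-line power tools listed in the same row.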
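And a similarly minimal sketch for class 10 (Python unittest); the `normalize_header` helper is a made-up example, chosen only to show the shape of a test case.

```python
import unittest

def normalize_header(name: str) -> str:
    """Hypothetical helper: turn a raw CSV column name into snake_case."""
    return name.strip().lower().replace(" ", "_")

class NormalizeHeaderTest(unittest.TestCase):
    def test_strips_and_lowercases(self):
        self.assertEqual(normalize_header("  First Name "), "first_name")

    def test_is_idempotent(self):
        self.assertEqual(normalize_header("first_name"), "first_name")

if __name__ == "__main__":
    unittest.main()
```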