Here is a humorous recent example of what one can achieve with basic data exploration, without even going into advanced ML techniques. In this project, I was asked to study the response time log of an online service running…
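As a flavour of what "basic data exploration" on such a log can look like, here is a minimal Pandas sketch. The file name and column names (timestamp, response_ms) are illustrative assumptions, not taken from the project:

```python
# Minimal sketch of basic exploration of a response-time log.
# File name and column names are assumptions for illustration.
import pandas as pd

log = pd.read_csv("response_times.csv", parse_dates=["timestamp"])

# Summary statistics and the slowest requests often tell most of the story
print(log["response_ms"].describe())
print(log.nlargest(10, "response_ms"))

# Resample per hour to spot daily patterns or sudden degradations
hourly = log.set_index("timestamp")["response_ms"].resample("1H").mean()
print(hourly.head(24))
```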
4 reasons for building Data Lakes… or not
Data Lakes are repositories where data is ingested and stored in its original form, without much (or any) preprocessing. This is in contrast to traditional data warehouses, where much of the effort goes into ETL processing, data cleansing and aggregation, to…
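A rough sketch of the contrast, with hypothetical paths, column names and a local SQLite file standing in for the warehouse:

```python
# Hypothetical contrast: lake-style raw ingestion vs warehouse-style ETL.
# Paths, column names and the SQLite target are illustrative assumptions.
import sqlite3
import pandas as pd

raw = pd.read_json("events.json", lines=True)

# Data lake: land the data as-is; schema and cleansing are deferred to read time
raw.to_parquet("lake/events/2019-01-01.parquet")

# Data warehouse: cleanse and aggregate up front, store only the curated result
cleaned = raw.dropna(subset=["user_id"])
daily = cleaned.groupby("user_id", as_index=False)["amount"].sum()
conn = sqlite3.connect("warehouse.db")
daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```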
Lecture notes: Introduction to Apache Spark
In Lecture 7 of the Big Data in 30 hours lecture series, we introduce Apache Spark. The purpose of this memo is to serve the students as a reference for some of the concepts learned. About Spark: Spark, managed by…
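For quick reference, a minimal PySpark sketch of the DataFrame API; the input file and its columns are assumptions for illustration, not lecture material:

```python
# Minimal PySpark example: read a CSV and run a distributed aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-demo").getOrCreate()

# Read a CSV into a distributed DataFrame, inferring the schema
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple aggregation, evaluated lazily and executed in parallel
(df.groupBy("country")
   .agg(F.sum("amount").alias("total"))
   .orderBy(F.desc("total"))
   .show(10))

spark.stop()
```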
Product Owner vs Product Manager vs Architect
This short memo is to clarify the proper usage of these roles in the context of software development projects: Product Owner, Product Manager, and (Software/Product) Architect. Product Owner: the term Product Owner is mainly used in the Scrum context.
Collected thoughts on implementing Kafka data pipelines
Below are my notes and thoughts collected during recent work with Kafka, building data streaming pipelines between data warehouses and data lakes. Maybe someone will benefit. The rationale: some points on picking (or not picking) Kafka as…
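To make the pipeline idea concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name and record fields are assumptions, not details from the project:

```python
# Minimal Kafka pipeline step with kafka-python: produce change records,
# then consume them for landing in lake storage. All names are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push change records from the warehouse into a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("warehouse.orders", {"order_id": 42, "amount": 99.5})
producer.flush()

# Consumer side: read the stream and hand records to the data lake writer
consumer = KafkaConsumer(
    "warehouse.orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)  # in a real pipeline: append to lake storage
    break
```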
Introducing my Big Data orientation workshops
Data Science allows us to create models that analyze data faster and more accurately than humans. If you’re a Python programmer, you’re likely to use libraries such as TensorFlow, Keras, Scikit-learn, or Pandas to create those models. To turn those…
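As a reminder of what such a model looks like in code, a minimal Scikit-learn sketch on a bundled dataset (the dataset and classifier choice are arbitrary, for illustration only):

```python
# Minimal scikit-learn example: train a classifier and report test accuracy.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```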
Git version control: part 2
With the help of this article, you made your first steps with Git, the version control software. You learned to commit your software so that it became version-controlled. You need just two more skills: to work with remote repositories, and to check…
Data Engineering + Data Science: building the full stack
This article is part of the Big Data in 30 hours course material, meant as a reference for the students. In our class we have looked at a number of Data Engineering and Data Science technologies. You may be wondering how they…
Git version control: concise introduction
This article is part of the Big Data in 30 Hours lecture series and is intended to serve as reference material for students. However, I hope others can also benefit. Why do we need version control in Data Science? Working with data…
Should justice use AI?
Should the widely understood justice system (courts, police, the penitentiary system, and related government agencies) be banned by law from collecting Big Data and using Artificial Intelligence? Back in 2013, I was part of a data analytics project for State…