4 reasons for building Data Lakes… or not

Data Lakes are repositories where data is ingested and stored in its original form, without much (or any) preprocessing. This is in contrast to traditional data warehouses, where much effort goes into ETL processing, data cleansing and aggregation, to ensure that any data that comes in is clean and structured. When people say ‘Data Lake’ they often mean large quantities of raw data organized …
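To illustrate the contrast, here is a toy sketch (not from the article; the paths and field names are invented): lake-style ingestion lands the payload as it arrived, while warehouse-style loading parses, cleans and types it first.

```python
# Toy contrast between "lake-style" and "warehouse-style" ingestion.
# Paths and field names are invented for illustration.
import json, datetime, pathlib

raw_event = '{"user": "42", "amount": "19.99 ", "ts": "2019-03-01T12:00:00"}'

# Data lake: store the record exactly as it arrived, partitioned by date.
lake_path = pathlib.Path("lake/events/2019-03-01")
lake_path.mkdir(parents=True, exist_ok=True)
(lake_path / "event.json").write_text(raw_event)

# Data warehouse: parse, clean and type the fields before loading.
record = json.loads(raw_event)
clean_row = {
    "user_id": int(record["user"]),
    "amount": float(record["amount"].strip()),
    "ts": datetime.datetime.fromisoformat(record["ts"]),
}
# clean_row would now be inserted into a structured fact table.
```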

Lecture notes: Introduction to Apache Spark

In Lecture 7 of the Big Data in 30 hours lecture series, we introduce Apache Spark. The purpose of this memo is to serve the students as a reference for some of the concepts covered. About Spark Spark, managed by the Apache Software Foundation, is an open-source framework that facilitates distributed computations. It has lately become quite popular. With Spark, one can set up a set …
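For a taste of what this looks like in practice, here is a minimal PySpark sketch (assuming a local installation of the pyspark package; the file name and column name are made up for illustration):

```python
# Minimal PySpark sketch: count rows per category in a CSV file.
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session; "local[*]" uses all available cores.
spark = SparkSession.builder.master("local[*]").appName("intro-demo").getOrCreate()

# Read a CSV file into a distributed DataFrame (file name is illustrative).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple distributed aggregation: number of rows per category.
df.groupBy("category").count().show()

spark.stop()
```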

Product Owner vs Product Manager vs Architect

This short memo is meant to clarify the proper usage of these roles in the context of software development projects: Product Owner, Product Manager / Manager, and (Software/Product) Architect. Product Owner The term Product Owner is mainly used in the Scrum context. In Scrum methodology, it is clearly defined. In fact, there are only two roles that Scrum defines explicitly, and those are Scrum Master and Product …

Collected thoughts on implementing Kafka data pipelines

Below are notes and thoughts collected during my recent work with Kafka, building data streaming pipelines between data warehouses and data lakes. Maybe someone will benefit. The rationale Some points on picking (or not picking) Kafka as the solution: Kafka originated at LinkedIn, which remains a major user. Kafka is now an open-source Apache project. Kafka is good as glue between components. Connecting …
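For flavour, here is a minimal producer sketch in Python (using the kafka-python package; the broker address, topic name and message fields are placeholders, not taken from the article):

```python
# Minimal Kafka producer sketch using the kafka-python package.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # placeholder broker address
    # Serialize Python dicts to JSON bytes before sending.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send one record to a hypothetical topic; Kafka appends it to the topic log.
producer.send("warehouse-events", {"table": "orders", "op": "insert", "id": 42})
producer.flush()  # block until the record is actually delivered
```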

Data Engineering for Data Scientists – my workshop at TopHPC conference

Data Science allows us to create models that analyze data faster and more accurately than humans. If you’re a Python programmer, you’re likely to use libraries such as TensorFlow, Keras, Scikit-learn, or Pandas to create those models. To turn those models into production systems, we need a bit of Data Engineering knowledge. We need to define the underlying data architecture, perhaps considering as components: …

Git version control: part 2

With the help of this article, you took your first steps with Git, the version control software. You learned to commit your code so that it became version-controlled. You need just two more skills: working with remote repositories, and checking out a particular version of your files (not necessarily the newest one). Learn those two things, and you’re good to go. This text, by …

Data Engineering + Data Science: building the full stack

This article is part of the Big Data in 30 hours course material, meant as a reference for the students. In our class we have looked at a number of Data Engineering and Data Science technologies. You may be wondering how they play together. It is now time to build an end-to-end workflow, resembling real production environments. An example data streaming architecture In Data Science, we primarily …

Git version control: concise introduction

This article is part of the Big Data in 30 Hours lecture series and is intended to serve as reference material for students. However, I hope others can also benefit. Why do we need version control in Data Science? Working with data is similar to working with software. A Data Scientist developing source code and data for models needs basic tooling similar to what regular software developers …

Should justice use AI?

Should the broadly understood justice system (courts, police, the penitentiary system, and related government agencies) be banned by law from collecting Big Data and using Artificial Intelligence? Back in 2013, I was part of a data analytics project for the State Police. On April Fools’ Day we received a hilarious hoax: an obviously fake internal announcement that police analytics could now predict crimes before they actually …

Lecture notes: an intro to Apache Spark programming

In Lecture 7 of our Big Data in 30 hours class, we discussed Apache Spark and did some hands-on programming. The purpose of this memo is to summarize the terms and ideas presented. Apache Spark is currently one of the most popular platforms for parallel execution of computing jobs in a distributed environment. The idea is not new. Starting in the late 1980s, the HPC (high …
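As a small illustration of that parallel execution model, here is a sketch using Spark’s low-level RDD API (assuming a local pyspark setup; this is not the exact code from the lecture):

```python
# Sketch of Spark's RDD API: map and reduce run in parallel across partitions
# (here on local cores; on a cluster, across machines).
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Distribute a Python range into an RDD split across 4 partitions.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Each partition computes its squares and a partial sum independently;
# Spark then combines the partial results into the final total.
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

sc.stop()
```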