Getting started: Apache Spark, PySpark and Jupyter in a Docker container

Apache Spark is a popular distributed computation environment. It is written in Scala, but you can also interface with it from Python. For those who want to learn Spark with Python (including students of these Big Data classes), here is an intro to the simplest possible setup. To experiment with Spark and Python (PySpark and Jupyter), you need to install both. Here is how to get such an environment … Continue reading Getting started: Apache Spark, PySpark and Jupyter in a Docker container
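Once the container is running, a minimal PySpark sanity check can confirm the environment works. This is only a sketch; the app name and sample data are illustrative, not from the original post:

```python
# Minimal PySpark sanity check to run inside the Jupyter/PySpark container.
# The app name and the tiny sample DataFrame are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Build a small DataFrame and run a trivial aggregation to confirm Spark works.
df = spark.createDataFrame([("a", 1), ("b", 2), ("b", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```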

Building End-to-End Production Machine Learning pipelines

Much of the public discourse in Data Science focuses on model optimization (selection of regressors/classifiers, hyperparameter tuning, model training and improvement of the prediction accuracy). Less material is available on using and deploying these trained Machine Learning models in production. I was asked to summarize my experience in this domain in a series of workshops, one of which I will deliver next week at the TopHPC … Continue reading Building End-to-End Production Machine Learning pipelines
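As a minimal illustration of the deployment side (a sketch, not taken from the post; the dataset, model choice and file name are assumptions), a trained scikit-learn model can be persisted by the training code and reloaded by a separate serving process:

```python
# Sketch: persist a trained model so a separate serving process can load it.
# The dataset, model choice and file name are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")          # training side
served_model = joblib.load("model.joblib")  # serving side
print(served_model.predict(X[:5]))
```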

Audit logging for Amazon Redshift

I recently spent some time configuring and analysing the logs for an Amazon Redshift warehouse. I am summarizing the experience here so others can achieve the same faster. Amazon Redshift is an analytical data warehouse platform on the AWS cloud, with a rapidly growing user base. It is optimized to work with the S3 storage service. Redshift is focused on performance (columnar architecture, Massively Parallel Processing (MPP) using … Continue reading Audit logging for Amazon Redshift
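For context, audit logging on a cluster can also be switched on programmatically. A minimal sketch using boto3 follows; the cluster identifier, bucket name and key prefix are placeholders:

```python
# Sketch: enable Redshift audit logging to an S3 bucket via boto3.
# Cluster identifier, bucket name, key prefix and region are placeholder values.
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

response = redshift.enable_logging(
    ClusterIdentifier="my-cluster",
    BucketName="my-redshift-audit-logs",
    S3KeyPrefix="audit/",
)
print(response["LoggingEnabled"])
```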

Taking advantage of Machine Learning (ML), without even starting

Here is a humorous recent example of what one can achieve with basic data exploration, without even going into any advanced ML techniques. In this project I was asked to study the response-time log of an online service running on Amazon EC2 + S3. The service slows down under increased demand, and so the user response time suffers (anything over 100 ms … Continue reading Taking advantage of Machine Learning (ML), without even starting
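A basic exploration step of this kind can be very short. The sketch below is illustrative only; the file name, column names and the 100 ms threshold applied here are assumptions:

```python
# Sketch: basic exploration of a response-time log with pandas.
# File name and column names are illustrative assumptions.
import pandas as pd

log = pd.read_csv("response_times.csv", parse_dates=["timestamp"])

# Share of requests slower than the 100 ms target, per hour of day.
log["slow"] = log["response_ms"] > 100
print(log.groupby(log["timestamp"].dt.hour)["slow"].mean())
```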

4 reasons for building Data Lakes… or not

Data Lakes are repositories where data is ingested and stored in its original form, without much (or any) preprocessing. This is in contrast to traditional data warehouses, where much effort goes into ETL processing, data cleansing and aggregation, to ensure that any data that comes in is clean and structured. When people say ‘Data Lake’ they often mean large quantities of raw data organized … Continue reading 4 reasons for building Data Lakes… or not

Lecture notes: Introduction to Apache Spark

In Lecture 7 of the Big Data in 30 hours lecture series, we introduce Apache Spark. The purpose of this memo is to serve the students as a reference for some of the concepts covered. About Spark: Spark, managed by the Apache Software Foundation, is an open-source framework that facilitates distributed computations. It has lately become quite popular. With Spark, one can set up a set … Continue reading Lecture notes: Introduction to Apache Spark
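To give a flavour of the distributed computation model, here is a minimal word-count sketch in PySpark. It is not taken from the lecture notes themselves; the input lines are illustrative and would normally come from a file or cluster storage:

```python
# Sketch: a distributed word count, the classic first Spark example.
# The in-memory input lines are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data in 30 hours", "spark makes big data simple"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

spark.stop()
```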

Product Owner vs Product Manager vs Architect

This short memo is to clarify the proper usage of these roles in the context of software development projects: Product Owner, Product Manager / Manager, and (Software/Product) Architect. Product Owner: The term Product Owner is mainly used in the Scrum context. In Scrum methodology, it is clearly defined. In fact, there are only two roles that Scrum defines explicitly, and those are Scrum Master and Product … Continue reading Product Owner vs Product Manager vs Architect

Collected thoughts on implementing Kafka data pipelines

Below are my notes and thoughts collected during recent work with Kafka on building data streaming pipelines between data warehouses and data lakes. Maybe someone will benefit. The rationale: some points on picking (or not picking) Kafka as the solution. Kafka originated at LinkedIn, which remains a major user. Kafka is now an open-source Apache project. Kafka is good as glue between components. … Continue reading Collected thoughts on implementing Kafka data pipelines
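To give a concrete feel for that "glue" role, here is a minimal producer sketch using the kafka-python client. The broker address, topic name and the record itself are placeholders, not details from the original pipelines:

```python
# Sketch: publish warehouse change records to a Kafka topic with kafka-python.
# Broker address, topic name and the record contents are placeholder values.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

producer.send("warehouse-changes", {"table": "orders", "op": "insert", "id": 42})
producer.flush()
```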

Introducing my Big Data orientation workshops

Data Science allows us to create models that analyze data faster and more accurately than humans. If you’re a Python programmer, you’re likely to use libraries such as TensorFlow, Keras, Scikit-learn, or Pandas to create those models. To turn those models into production systems, we need a bit of Data Engineering knowledge. We need to define the underlying data architecture, perhaps considering as components: Apache … Continue reading Introducing my Big Data orientation workshops

Git version control: part 2

With the help of this article, you took your first steps with Git, the version control software. You learned to commit your software so that it became version-controlled. You need just two more skills: working with remote repositories, and checking out a particular version of your files (not necessarily the newest one). Learn those two things, and you’re good to go. This text, by … Continue reading Git version control: part 2