By topic – OnData.blog

Below is the list of articles on this blog, grouped by topic.

1. General

What makes Data Quality so difficult
Practical AIOps: 5 use cases: the comprehensive design of advanced analytics for IT infrastructure support (Service Desk)
Data Lake article series: Why Analysts need Data Lakes?, Data Lake: simple usage example and The Data Lake at Sopra Steria: why and when one should build a data lake
Are people fair? How I used statistics to determine if users were honest (expected distribution, Monte Carlo simulation, Central Limit Theorem, oversampling)
Don’t trust Data Science. Ask the people: This article is somewhat more technical. It shows why you should talk to people before trusting the tools. It is the story of how a conversation with a domain expert led to the discovery of a bug in a popular Data Science library, scikit-learn.

2. Understanding the data (data literacy)

The phantom I followed: how the human brain misinterprets data (chi-square, Cramer’s V, Pearson, Spearman, scatter plot, apophenia)
Data Puzzle: how even a relatively simple visualization is misinterpreted by the human brain (positive correlation, right-skewness, log-normal)
The Truth Behind a Histogram Dent: a detective story on how a tiny feature in the histogram led to a big discovery (bimodality, Spearman, Pearson, and Chi2 correlations)
Mistaken by factor of 100,000: why log-normal data must be visualized with a logarithmic scale
How to tell anomalies in data drift: what constitutes a data anomaly and how to visualize it (EDA, outliers, IQR, histogram spikes)
A picture worth 1,000 words: elimination of random noise from the time-series histogram (Exponentially-Weighted Moving Average (EMA), simple moving average (SMA), Boxcar Filter)

3. Statistics

3 Steps to Unmask Data in Camouflage: working with bimodal (or trimodal) distributions, and how to logically separate data into two distributions (Pearson, Spearman, Chi2, t-test, normality tests (Anderson–Darling, Shapiro–Wilk, Kolmogorov–Smirnov, D’Agostino-Pearson))
Kolmogorov-Smirnov test: a practical intro: a powerful statistical method used to compare the distributions of data
Linear Regression: Killer App with 19-century maths
Answering Why (with Chi-Square): how another statistical tool, the Chi-square test, can be deployed to understand the nature of data anomalies
Democratization of statistics: Chi2 for non-experts

4. Advanced Analytics and Machine Learning

When Accuracy Grows But Precision Falls: what happens when you badly prepare training data, and how we fixed the paradoxical behavior of a Machine Learning classifier (confusion matrix, stratified sampling, undersampling, oversampling, SMOTE)
Taking advantage of Machine Learning (ML), without even starting: how I solved a Machine Learning challenge just by looking at the distribution. This explains why exploring the data is necessary before deploying anything complex (time-series data analysis, multivariate time series, multi-step forecast, Amazon EC2 + S3)
An approach to categorize multi-lingual phrases: how I addressed semantic similarity for a mix of languages: French, Norwegian, English and Polish (word2vec, gensim, corpus, embeddings, similarity metrics, t-SNE, K-Means)

5. Technology

The implications of Scikit-learn bug #21455: discovery of a processing fault in scikit-learn’s SelectKBest, impacting data scientists worldwide
Nine Circles of Hell: time in Python: comparing Python native timestamp, Unix timestamp, datetime.datetime, numpy.datetime64, pandas.Timestamp
Synchronizing an SQL Database to a Data Lake (Change Data Capture at ingest): Attunity Replicate, Oracle GoldenGate, Striim, ETL, Kafka, HDFS, Debezium, Hudi, Confluent, Avro, Parquet, ORC
Lecture notes: an intro to Apache Spark programming
Collected thoughts on implementing Kafka data pipelines

6. Popular science

During the Covid-19 epidemic I published a series of articles promoting an understanding of the data from a scientific perspective. (Note: the topic has since become highly politicized and polarized. In some countries, the adjective “scientific” in relation to Covid-19 has been linked to certain political views. This is not what I mean by scientific. In my work, I did not support either camp, neither the supporters of restrictions nor the Covid deniers. Instead, I simply promoted data literacy and impartial data analysis.)

No, the Virus did not survive 17 days: an analysis of how unfortunate wording in a leaked research report was misrepresented by the general press, leading to worldwide panic
Coronavirus mortality: less than we think: analysis of data in the early days of the epidemic
How the herd instinct hijacked the herd immunity: critical analysis of popular articles by self-proclaimed Covid influencers

7. Earlier work

Before this blog started, I contributed to various writing activities.

Book Author, 2005: Grid Computing: The Savvy Manager’s Guide, ISBN-10: 0127425039, published by Elsevier Science / Morgan Kaufmann, San Francisco, United States, available at Amazon

Contributor (2001-2003) to teamwork that led to publications in Scientific American, The New York Times and RFP standards of data transfer protocols:

“The Grid: Computing without Bounds” in Scientific American Magazine Vol. 288 No. 4 (April 2003)
W. Allcock, J. Bester, J. Bresnahan, S. Meder, P. Plaszczak, S. Tuecke. (2003). “GridFTP: Protocol Extensions to FTP for the Grid”. Global Grid Forum GFD-RP