Below is the list of articles on this blog, grouped by topic.
1. General
- What makes Data Quality so difficult
- Practical AIOps: 5 use cases: the comprehensive design of advanced analytics for IT infrastructure support (Service Desk)
- Data Lake article series: Why Analysts need Data Lakes?, Data Lake: simple usage example and The Data Lake at Sopra Steria: why and when one should build a data lake
- Are people fair? How I used statistics to determine if users were honest (expected distribution, Monte Carlo simulation, Central Limit Theorem, oversampling)
- Don’t trust Data Science. Ask the people: This article is somewhat more technical. It shows why you should talk to people before trusting the tools. It is the story of how a conversation with a domain expert led to the discovery of a bug in a popular Data Science library, scikit-learn.
2. Understanding the data (data literacy)
- The phantom I followed: how the human brain misinterprets data (chi-square, Cramér’s V, Pearson, Spearman, scatter plot, apophenia)
- Data Puzzle: how even a relatively simple visualization is misinterpreted by the human brain (positive correlation, right-skewness, log-normal)
- The Truth Behind a Histogram Dent: a detective story on how a tiny feature in the histogram led to a big discovery (bimodality, Spearman, Pearson, and Chi2 correlations)
- Mistaken by factor of 100,000: why lognormal data must be visualized with a logarithmic scale
- How to tell anomalies in data drift: what constitutes a data anomaly and how to visualize it (EDA, outliers, IQR, histogram spikes)
- A picture worth 1,000 words: elimination of random noise from the time-series histogram (Exponentially-Weighted Moving Average (EMA), simple moving average (SMA), Boxcar Filter)
3. Statistics
- 3 Steps to Unmask Data in Camouflage: working with bimodal (or trimodal) distributions. How to logically separate data into two distributions (Pearson, Spearman, Chi2, t-test, normality tests: Anderson–Darling, Shapiro–Wilk, Kolmogorov–Smirnov, D’Agostino-Pearson)
- Kolmogorov-Smirnov test: a practical intro: a powerful statistical method used to compare the distributions of data
- Linear Regression: Killer App with 19-century maths
- Answering Why (with Chi-Square): how another statistical tool, the Chi-square test, can be deployed to understand the nature of data anomalies
- Democratization of statistics: Chi2 for non-experts
4. Advanced Analytics and Machine Learning
- When Accuracy Grows But Precision Falls: what happens when you prepare training data badly – and how we fixed the paradoxical behavior of a Machine Learning classifier (confusion matrix, stratified sampling, undersampling, oversampling, SMOTE)
- Taking advantage of Machine Learning (ML), without even starting: how I solved a Machine Learning challenge just by looking at the distribution. This explains why exploring the data is necessary before deploying anything complex (time-series data analysis, multivariate time series, multi-step forecast, Amazon EC2 + S3)
- An approach to categorize multi-lingual phrases: how I addressed semantic similarity for a mix of languages: French, Norwegian, English and Polish (word2vec, gensim, corpus, embeddings, similarity metrics, t-SNE, K-Means)
5. Technology
- The implications of Scikit-learn bug #21455: discovery of a processing fault in scikit-learn’s SelectKBest, impacting data scientists worldwide
- Nine Circles of Hell: time in Python: comparing Python native timestamp, Unix timestamp, datetime.datetime, numpy.datetime64, pandas.Timestamp
- Synchronizing an SQL Database to a Data Lake (Change Data Capture at ingest): Attunity Replicate, Oracle GoldenGate, Striim, ETL, Kafka, HDFS, Debezium, Hudi, Confluent, Avro, Parquet, ORC
- Lecture notes: an intro to Apache Spark programming
- Collected thoughts on implementing Kafka data pipelines
6. Popular science
During the Covid-19 epidemic I published a series of articles promoting an understanding of the data from a scientific perspective. (Note: the topic has been highly politicized and polarized since. In some countries, the adjective “scientific” in relation to Covid-19 has been linked to certain political views. This is not what I mean by scientific. In my work, I did not support either camp, neither the supporters of restrictions nor the covid-deniers. Instead, I simply promoted data literacy and impartial data analysis.)
- No, the Virus did not survive 17 days: an analysis of how unfortunate wording in a leaked research report was misrepresented by the general press, leading to worldwide panic
- Coronavirus mortality: less than we think: analysis of data in the early days of the epidemic
- How the herd instinct hijacked the herd immunity: critical analysis of popular articles by self-proclaimed Covid influencers
7. Earlier work
Before this blog started, I contributed to various writing activities.
Book Author, 2005: Grid Computing: The Savvy Manager’s Guide, ISBN-10: 0127425039, published by Elsevier Science / Morgan Kaufmann, San Francisco, United States, available at Amazon
Contributor (2001-2003) to teamwork that led to publications in Scientific American, New York Times and RFP standards of data transfer protocols:
- “The Grid: Computing without Bounds” in Scientific American Magazine Vol. 288 No. 4 (April 2003)
- W. Allcock, J. Bester, J. Bresnahan, S. Meder, P. Plaszczak, S. Tuecke. (2003). “GridFTP: Protocol Extensions to FTP for the Grid”. Global Grid Forum, GFD-RP