1. General

2. Understanding the data (data literacy)

  • The phantom I followed: how the human brain misinterprets data (chi-square, Cramér’s V, Pearson, Spearman, scatter plot, apophenia)
  • Data Puzzle: how even a relatively simple visualization is misinterpreted by the human brain (positive correlation, right-skewness, log-normal)
  • The Truth Behind a Histogram Dent: a detective story on how a tiny feature in a histogram led to a big discovery (bimodality; Spearman, Pearson, and chi-square correlations)
  • Mistaken by factor of 100,000: why log-normal data must be visualized on a logarithmic scale
  • How to tell anomalies in data drift: what constitutes a data anomaly and how to visualize it (EDA, outliers, IQR, histogram spikes)
  • A picture worth 1,000 words: eliminating random noise from a time-series histogram (exponentially weighted moving average (EMA), simple moving average (SMA), boxcar filter)

3. Statistics

4. Advanced Analytics and Machine Learning

  • When Accuracy Grows But Precision Falls: what happens when training data is badly prepared, and how we fixed a Machine Learning classifier’s paradoxical behavior (confusion matrix, stratified sampling, undersampling, oversampling, SMOTE)
  • Taking advantage of Machine Learning (ML), without even starting: how I solved a Machine Learning challenge just by looking at the distribution, and why exploring the data is necessary before deploying anything complex (time-series data analysis, multivariate time series, multi-step forecast, Amazon EC2 + S3)
  • An approach to categorize multi-lingual phrases: how I addressed semantic similarity for a mix of languages: French, Norwegian, English and Polish (word2vec, gensim, corpus, embeddings, similarity metrics, t-SNE, K-Means)

5. Technology

6. Popular science

During the Covid-19 epidemic I published a series of articles promoting an understanding of the data from a scientific perspective. (Note: the topic has since become highly politicized and polarized. In some countries, the adjective “scientific” in relation to Covid-19 has been linked to certain political views. This is not what I mean by scientific. In my work, I did not support either camp, neither the supporters of restrictions nor the Covid deniers. Instead, I simply promoted data literacy and impartial data analysis.)

7. Earlier work

Before this blog started, I contributed to various writing activities.

Book Author, 2005: Grid Computing: The Savvy Manager’s Guide, ISBN-10: 0127425039, published by Elsevier Science / Morgan Kaufmann, San Francisco, United States, available at Amazon

Contributor (2001-2003) to teamwork that led to publications in Scientific American, New York Times and RFP standards of data transfer protocols: