Lognormal data is very tricky. The wrong visualization method can lead to a radical misinterpretation of the results. In this article I show an example of such a mistake, taken from a real project, and I demonstrate how to avoid the pitfalls by visualizing the data properly on a logarithmic scale.

What follows is a rendering of a Jupyter notebook. Readers not interested in the source code may simply skip the Python sections and focus on the diagrams alone. The sources of the functions used are available on GitHub.

[Embedded Jupyter notebook: loghist-article]

The utility module named loghist presented above is available on GitHub. Feel free to use it and improve it, and drop me a note if you do. The module is meant to be illustrative rather than production-grade, so it leaves plenty of room for improvement in quality and robustness.

Addendum 2021-10-19

After publishing the article, I had an interesting internal exchange in Sopra Steria Data Science circles.

My colleague Pierre-Henry Lambert pointed out that the logic above is somewhat weak, for two reasons. First, the average (mean) is a rather poor measure of a distribution. In most cases the median should be used instead (for robustness against outliers, etc.). The second, broader argument is that it is wrong to assume that a population can be described by a single measure, whatever that measure is. Thus, the question “what is the average of this population” may be the wrong question to ask. (Thanks for this, Pierre-Henry!)
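To make the first point concrete, here is a minimal sketch using only the Python standard library. The sample is synthetic, and its parameters (mu = 0, sigma = 2) are assumptions chosen for illustration; it is not the project data discussed in the article. For a lognormal distribution the median is exp(mu) while the mean is exp(mu + sigma²/2), so the mean is pulled far to the right by the long tail.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed, so the run is repeatable

# Hypothetical lognormal sample: log(x) is normal with mu=0 and sigma=2.
sample = [rng.lognormvariate(0.0, 2.0) for _ in range(100_000)]

mean = statistics.fmean(sample)
median = statistics.median(sample)

# Fraction of the sample that lies below the mean.
share_below_mean = sum(x < mean for x in sample) / len(sample)

print(f"mean   = {mean:.2f}")
print(f"median = {median:.2f}")
print(f"share of values below the mean = {share_below_mean:.0%}")
```

In a sample like this, the large majority of values lie below the mean, which is one way to see why the mean is a poor description of a "typical" value here.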

I agreed with both points. My aim in this article was not to discuss the fact that the average is not where it seems to be, but merely to use that fact as an example of something broader: visualizing lognormal data on a linear axis leads to all sorts of wrong conclusions. Misjudging the average is just one example. Another is that the population appears unimodal, while in fact it is trimodal.
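The effect of the axis choice on the apparent number of modes can be sketched with a synthetic example. Everything below is an assumption made for illustration: the mixture of three lognormal groups (centred near 1, 100 and 10,000), the number of bins, and the helper names `histogram` and `show`. It does not use the loghist module or the project data. The sketch bins the same values twice, once with equal-width (linear) bins and once with equal-ratio (logarithmic) bins, and prints both histograms as text.

```python
import math
import random
from bisect import bisect_right

rng = random.Random(7)  # fixed seed, so the run is repeatable

# Hypothetical mixture: three lognormal groups centred near 1, 100 and 10,000.
centers = [1.0, 100.0, 10_000.0]
data = [rng.lognormvariate(math.log(c), 0.5) for c in centers for _ in range(5_000)]

def histogram(values, edges):
    """Count how many values fall between each pair of consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        i = max(0, min(i, len(counts) - 1))  # keep the extremes in the outer bins
        counts[i] += 1
    return counts

n_bins = 12
lo, hi = min(data), max(data)
# Equal-width bins: consecutive edges differ by a constant amount.
linear_edges = [lo + (hi - lo) * k / n_bins for k in range(n_bins + 1)]
# Equal-ratio bins: consecutive edges differ by a constant factor.
log_edges = [lo * (hi / lo) ** (k / n_bins) for k in range(n_bins + 1)]

linear_counts = histogram(data, linear_edges)
log_counts = histogram(data, log_edges)

def show(title, edges, counts):
    """Print a crude text histogram: left bin edge, count, and a bar."""
    print(title)
    for left, count in zip(edges, counts):
        bar = "#" * (50 * count // max(counts))
        print(f"  from {left:>10.2f}  {count:>6}  {bar}")

show("Equal-width (linear) bins:", linear_edges, linear_counts)
show("Equal-ratio (logarithmic) bins:", log_edges, log_counts)
```

With linear bins, most of the values should pile up in the first bin, so the two smaller groups are indistinguishable. With logarithmic bins, the same data should fall into three clusters separated by nearly empty bins.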

Those considerations brought me to one more interesting “proof” of why the logarithmic axis is correct. Instead of asking where the average is, let’s ask a related question: where is the median? And then let’s find the median on the logarithmic scale. Here is the result:

Interestingly, the median now beautifully appears exactly where an inexperienced reader would place it.

This operation demonstrates even better that, in our particular case, the logarithmic visualization is “correct”, in the sense that it provides a representation very close to the intuitive understanding of an inexperienced reader.
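One reason the median behaves so well on the logarithmic axis is that the logarithm preserves the order of the values, so the middle value stays the middle value. The mean has no such property: averaging the logarithms and transforming back gives the geometric mean, not the arithmetic one. Here is a minimal sketch with a synthetic sample; the parameters (mu = 3, sigma = 1.5) and the sample size are assumptions chosen for illustration, not the project data.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed, so the run is repeatable

# Hypothetical lognormal sample; the odd size makes the median a single data point.
sample = [rng.lognormvariate(3.0, 1.5) for _ in range(10_001)]
logs = [math.log(x) for x in sample]

# Median computed directly, and median computed on the log scale, then mapped back.
median_linear = statistics.median(sample)
median_via_log = math.exp(statistics.median(logs))

# Mean computed directly, and mean of the logs mapped back (the geometric mean).
mean_linear = statistics.fmean(sample)
mean_via_log = math.exp(statistics.fmean(logs))

print(f"median: linear = {median_linear:.3f}, via log scale = {median_via_log:.3f}")
print(f"mean:   linear = {mean_linear:.3f}, via log scale = {mean_via_log:.3f}")
```

The two medians agree up to floating-point rounding, while the two means differ substantially. On the logarithmic scale, the centre of a lognormal sample is therefore its median, not its arithmetic mean.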

