Note 1: If you are looking for some COVID-19 conspiracy theories, go elsewhere. Below is only some boring statistics.
Note 2: This is a shallow data analysis done on an incomplete data set. I invite you to provide more data – I will provide proper citation if you do so.
Note 3: This article is my private after-hours effort. It has no relation to the institutions I am associated with: my customers, employers, or partners.
Note 4: Published on 29th March 2020. All numbers and diagrams relate to this date.
Summary of this article
1. For Italy, the recalculated case fatality rate appears to be 2x lower than the mainstream narration.
2. The collapse of the health system may increase case fatality rate to 4-5%, which is tragic, but not 10% or 45% as portrayed in the media.
3. For young and healthy people, the risk of dying from the coronavirus is very small
4. For seniors, the risk is significant, but probably lower than people think.
I am a data analyst. Not a medical professional. In this article, I am not making any claims about the medical nature of the COVID-19 disease. I am not trying to diminish its weight or danger. Just the opposite – I think we face a grave situation. My goal is simple. I have understood that certain important figures are commonly misinterpreted or miscalculated. So I want to present a correct view of those statistics. My contribution can be helpful to decision makers, because you cannot make good decisions based on wrong numbers. However, none of the text below should be interpreted as personal advice on any individual behavior when confronting the disease. In particular, an interpretation of this article as encouragement to neglect the danger would be a misunderstanding. A difference of a few percentage points in certain risk factors may be important for large-scale strategic decisions, but not for individual ones. Let’s begin.
We have been told that the mortality of COVID-19 is around 3% (sources vary, but I’ll come to that later). Then what about Italy, where the death rate has stabilized at 45.48%? In this article I will explain why both numbers are wrong, and why the real mortality seems to be lower than most of us think.
What is mortality rate
There are three metrics which are commonly confused. The mortality rate is the fraction of a population that dies from a disease in a certain time interval. The case fatality rate (CFR), also known as lethality, is the proportion of deaths among people diagnosed. The death rate presented in the diagram above is the percentage of deaths among the closed cases (this means people who were sick in the past, and then either recovered or died). Note: each of those terms is used with a different meaning in other sources and they are often confused, especially in the popular press. Just remember that for the purpose of this article, these are the definitions we will use.
A simple example: Let’s suppose that 100 people got infected but only 10 got properly diagnosed. Among them, 1 died, 1 recovered, while 8 are still sick. Thus we have: mortality rate 1% (one out of a hundred), case fatality rate 10% (one out of ten), and death rate 50% (one out of two). If you read this carefully, you will be immediately struck by some observations:
- mortality rate is impossible to measure because of the practical impossibility to diagnose all the people in a population
- case fatality rate is easy to measure, but easy to misinterpret, because it depends on our diagnostic capability. In this example we were able to diagnose 10, but if we were able to diagnose 20, then CFR would drop by half (we would have 1 dead out of 20)
- death rate will change over time. Today only one person has recovered, but maybe tomorrow two more will recover and the death rate will drop to 25% (1 out of 4 dead). If instead one person dies tomorrow, the death rate will jump to 67% (2 out of 3).
So this defines our problem. We would love to know the mortality rate, but we cannot measure it. We can only estimate it using measurable, but highly uncertain statistics: CFR and death rate. The good news is that the difficult part of this article is over. We will now look at some death rate diagrams, and then estimate CFR and conclude on mortality rate. Just remember these terms:
Example: 100 people got infected but only 10 got properly diagnosed. Among them, 1 died, 1 recovered, while 8 are still sick.
- mortality rate: 1% (one out of a hundred),
- case fatality rate (CFR): 10% (one out of ten)
- death rate: 50% (one out of two)
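A minimal Python sketch of these three definitions, using the numbers from the example. The function and key names are illustrative only; they are not taken from any data source or library.

```python
def epidemic_metrics(infected, diagnosed, died, recovered):
    """Return the three metrics, as defined in this article, as fractions."""
    closed = died + recovered  # closed cases: people who either recovered or died
    return {
        # Not measurable in practice: the true number of infected is unknown.
        "mortality_rate": died / infected,
        # Depends on how many of the infected we manage to diagnose.
        "case_fatality_rate": died / diagnosed,
        # Changes every time an open case is closed.
        "death_rate": died / closed,
    }


today = epidemic_metrics(infected=100, diagnosed=10, died=1, recovered=1)
more_testing = epidemic_metrics(infected=100, diagnosed=20, died=1, recovered=1)
two_more_recover = epidemic_metrics(infected=100, diagnosed=10, died=1, recovered=3)
one_more_dies = epidemic_metrics(infected=100, diagnosed=10, died=2, recovered=1)

for label, metrics in [
    ("today", today),
    ("twice as many diagnosed", more_testing),
    ("two more recover", two_more_recover),
    ("one more dies", one_more_dies),
]:
    print(label, {name: f"{value:.1%}" for name, value in metrics.items()})
```

The same single death gives a CFR of 10% or 5% depending only on how many cases were diagnosed, and a death rate anywhere between 25% and about 67% depending only on which open cases happened to close.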
Why does the death rate change over time?
I will now examine some death rate diagrams from worldometers.info, to show something intriguing. Let’s start with China (updated source here). The reason to start with China is that the peak of the epidemic is well behind us:
This is good, because we can observe the entire cycle of the epidemic in the data. Here is the very interesting China death rate diagram which I want to discuss.
Today, with almost 80,000 closed cases and over 3,000 deaths, the death rate for China is about 4%. But why did it change so much over time? Why did it start at almost 50%? Was the virus more lethal in the early stages of the disease?
It was not. Here is what happens. In the early phases of an epidemic, social awareness is low. Not all people report the disease, and tests are not widely available either. The only trusted data come from the hospitals, which count the dead. But people who come to hospitals are not average. Typically, they are already very sick.
What is also important: it is easy to capture the moment of death, but it is difficult to capture the moment of recovery. A dead patient is recorded in the statistics immediately. But a recovered patient is not present in the statistics until after several medical checks and a few weeks of observation.
For these two reasons, we can say, in statistical terms, that the early data is biased towards the most severe cases. When an epidemic starts, we collect data about the most severe cases, but we miss data about mild cases of the disease. When the epidemic is over, our data is more complete.
One more important point: when the epidemic is over, there are no more open cases. All cases are closed: recovered or dead. At this point, the death rate approximates the case fatality rate. So, for China, we already know the estimated case fatality rate: 4.1%.
China death rate summary:
- the initial death rate of 43% was biased (not true)
- the final death rate of 4.1% tells a better picture
- this approximates the case fatality rate
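The reporting-lag effect described above can be illustrated with a toy simulation. Every number here is a hypothetical assumption chosen for illustration (100 new cases per day for 30 days, a true CFR of 4%, deaths recorded 7 days after diagnosis, recoveries recorded 21 days after diagnosis); nothing is fitted to the Chinese data.

```python
def closed_case_death_rate(daily_cases, cfr, death_lag, recovery_lag, day):
    """Death rate among closed cases on a given day (None if nothing has closed yet).

    A case diagnosed on day d is recorded as dead on day d + death_lag,
    or as recovered on day d + recovery_lag.
    """
    died = cfr * sum(daily_cases[: max(0, day - death_lag + 1)])
    recovered = (1 - cfr) * sum(daily_cases[: max(0, day - recovery_lag + 1)])
    closed = died + recovered
    return died / closed if closed else None


# Hypothetical epidemic: 100 new diagnosed cases per day for 30 days.
cases = [100] * 30

for day in (3, 10, 25, 60):
    rate = closed_case_death_rate(cases, cfr=0.04, death_lag=7, recovery_lag=21, day=day)
    print(day, "no closed cases yet" if rate is None else f"{rate:.1%}")
```

Because deaths are recorded sooner than recoveries, the death rate starts at 100%, falls as recoveries begin to be recorded, and settles at the true CFR of 4% only once every case is closed.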
But how about South Korea?
Now have a look at the South Korea death rate (source diagram here). Why does it look so different?
The left part of the diagram does not fit our earlier narration. In the earliest stage of the epidemic (before February 29), why does the Korea death rate start at zero, instead of 40%+ like in China? The reason is simple. Here is the total number of deaths in South Korea on that date:
As we see, between Feb 15 and Feb 29 there were only 17 deaths. That’s one dead person per day. It is just too little data to make ANY conclusions. Some of those deaths could have been completely random. This means that the left part of the previous diagram has absolutely no significance. Just ignore it. Towards the right (just like in China) the chart approximates the case fatality rate. It is 2.98% now, but it is getting lower every day and the final number will be lower.
South Korea death rate summary:
- In the earliest phase of the epidemic, the death rate is irrelevant. Too little data.
- The final Case Fatality Rate is going to be below 2.98%.
But how about Italy?
Now let’s finally look at the death rate of Italy (updated source data is here):
Now… what?! By today, Italy has definitely collected enough data (9,000 deaths). And this is not an early phase of the epidemic either. Then why isn’t the curve declining, like in China and South Korea, and why has the death rate stabilized at 45%? Are our earlier theories wrong? Has the virus mutated? Or is the virus more deadly for Europeans? Or perhaps the health system has collapsed, the hospitals cannot help anyone and that’s why 45% of people die, instead of 3%? That would be terrible. But no. Not true. None of these hypotheses are true.
My intuition, as a data practitioner, is simple: this is nothing other than a data quality problem (a more precise statistical term: insufficient data harmonization among countries). I will explain this in a minute.
However, even with no statistical background, you can use your common sense to feel the same. For instance, we all know that many public figures including Boris Johnson, Prince Charles and Tom Hanks have tested positive and very few of them died. Then is it possible that 45% of Italians die? Of course not. Then what is going on in Italy?
The short answer: for some reason, the data coming from Italy is incomplete. Unlike China or South Korea, Italy is still not recording recoveries of people with mild symptoms. Maybe (just a guess) for various reasons (societal, cultural, practical) many people with mild symptoms just stay at home and get over it, never even informing the authorities, and never being tested. So, just like in the early days in China, we only know of severe cases, and it seems that a large percentage of people die. But in fact, even though a large number of people die, it is still a small percentage of all the infected. This means there are many, many more sick people in Italy.
But can we ever prove this hypothesis without testing every citizen? Yes, we can, by looking at the statistics of the active cases in Italy. Is this group representative for the entire population, in terms of age distribution? It turns out it is not. It turns out that most Italian coronavirus cases (not only dead, but also active and recovered) come from the highest risk group (seniors over 60). How can we explain this? Maybe in Italy the virus attacks old people only? Of course not. Italy is simply not recording cases of young people. This has been explained in great detail in this excellent article by Andreas Backhaus. The diagram below (from the same article) shows that the age distribution in the Italian society (blue) is different from the age distribution among the coronavirus cases (orange). Young people are underrepresented in that orange group.
Italy death rate summary:
Good news: the mortality in Italy is much lower than it seems. Yes, 9,000 people died. But this was not 45% of all the infected. Probably just 3%.
Bad news: the spread of the disease in Italy is much wider than we think. The official data has 92,000 registered cases, but this number does not reflect the reality. It could be that 300,000 people have already got sick and recovered!
Case Fatality Rate in Italy 10%? Not quite.
The case fatality rate (CFR) in China is 4%, in South Korea it is 3%, but what about Italy? Now we understand the problem with CFR. Let’s quickly calculate the Italian CFR today, using real-time data: 10,000 dead / 92,000 cases = 10.87%. This is the number you will see in the media, including businessinsider or lifescience and more. But it is not true. It does not tell us the truth about the whole society. It only tells us about the hospitalised patients, who are a very specifically selected subset of the Italian society. They came to the hospital already sick.
In statistics, in such situations we say that the metric is inappropriate. We can calculate the CFR, but it is not helpful, not being representative for the society. Then what metric is more appropriate?
One idea: knowing the age groups of the closed cases, we can calculate CFR for each age group separately. What will we learn then? As we see below, per age group, the Italian CFR is not much different from the rest of the world. It is the same virus, with the same lethality. This is how it looks across countries (authors: Hannah Ritchie and Max Roser, here is the entire very comprehensive source article from ourworldindata):
You could now take that age-distributed CFR and recalculate it over the true Italy age structure available here. Depending on your estimation method, you will end up with a number… in the range of 4-5%. Not 10%.
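The recalculation is a weighted average: the CFR of each age group, weighted by that group's share in the whole population instead of its share among the registered cases. Below is a minimal sketch of the method. All inputs are placeholders: the per-group rates are round illustrative numbers, and both sets of shares are hypothetical, not Italy's real age structure or case mix. The printed values therefore show the mechanism only and are not an estimate for Italy.

```python
def weighted_cfr(cfr_by_group, share_by_group):
    """Average the per-group CFR using the given group shares (must sum to 1)."""
    if abs(sum(share_by_group.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(cfr_by_group[group] * share for group, share in share_by_group.items())


# Illustrative per-age-group CFR (round numbers, assumed for this sketch).
cfr_by_group = {"juniors": 0.001, "middle-aged": 0.01, "seniors": 0.10}

# Hypothetical shares: registered cases skewed towards seniors,
# whole population much more balanced.
share_among_cases = {"juniors": 0.05, "middle-aged": 0.25, "seniors": 0.70}
share_in_population = {"juniors": 0.40, "middle-aged": 0.30, "seniors": 0.30}

crude = weighted_cfr(cfr_by_group, share_among_cases)
standardised = weighted_cfr(cfr_by_group, share_in_population)

print(f"CFR weighted by registered cases: {crude:.2%}")
print(f"CFR weighted by population:       {standardised:.2%}")
```

With identical per-group lethality, the headline CFR roughly halves once the over-representation of seniors among the registered cases is removed.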
This CFR of 4-5% is still higher than in S. Korea (3%) and in China (4%). Why? I would attribute this to two factors: (1) South Korea does more tests per citizen, thus more sick people are diagnosed and registered, (2) Italian hospitals are full, not everyone gets treatment, and so more die. An important conclusion here: a collapse of the health service, including issues such as lack of ventilators, chaos or poor organization can contribute to CFR growth from 3% to 4-5%. Is it a big increase? Quite big. But it is not 10%.
- death rate is 45% but does not reflect the truth
- CFR is 10% but does not reflect the truth
- CFR, recalculated by age group, is 4-5%
- inefficient health system increased CFR from 3% to 4-5%, but not more.
There is a lot of good material providing more detail, for instance this article in Spanish provides the mortality rate by age group in Spain. This article (Timothy W Russel et al) gives the same for the Diamond Princess cruise ship. And more. All those materials, which cost many hours of work, provided by independent people in many parts of the world, give us the same and consistent data:
Case Fatality Rate per age, worldwide (simplified)
- juniors: ~ 0.1%
- middle-aged: ~ 1%
- seniors: ~ 10%
This is a purposely simplified picture. Of course, many factors play a role. If you have cancer, you are at a higher risk. If you live in a country with excellent hospitals, you are at a lower risk. And so on.
Now that we know the CFR of COVID-19 we can discuss the mortality rate. Remember the beginning of this article? Mortality cannot be measured, it can only be approximated, because not all cases are recorded. However, we know this: mortality is always lower than CFR, a statement which stems from simple mathematics. Now, in order to estimate mortality, we need to estimate how many cases remained unnoticed and never reported. This is beyond my capability, but I can provide some hints.
There is growing evidence that the number of unreported cases worldwide may be several times higher than the number of registered cases. For instance, this material claims that 85% of the infections in China may have remained undocumented. This seems to make sense because, as we showed above, in Italy it is probably 75%. I also liked this nice study explaining how the CFR in Korea is lower than in China for a similar reason. Finally, this paper, explained here for the broader public, provides an estimate that the mortality in Wuhan was about 1.4% (down from 4%), for the same reason. Over the past days, there has been a growth of literature following similar logic, suggesting that the overall mortality rate of COVID-19 may be in the range of 1% or less. I think for Italy it is similar. However, I will not focus on that number, for practical reasons: rather than the overall mortality, it is more important to establish the mortality per age group.
The true number of infections is certainly higher than registered, perhaps by a factor of 4 (this number, which can be wrong, is an educated guess rather than a result of research). What does this mean? If 4 times more people are sick and don’t die, then the mortality is 4 times lower than the CFR. Now, the point is that even though “4” is a guess, mortality is always lower than CFR by some factor. If we match this knowledge with the fact that the CFR for the young and healthy is already very low, we reach this practical conclusion:
Mortality in young people:
For those under 40 and healthy, chance of dying from the coronavirus is in the range of one out of a thousand (0.1%).
To compare: chance of being hit and seriously injured by a car in the coming 12 months is 0.7% (seven times higher).
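The relationship between CFR and mortality used above is a simple division. In the sketch below, the helper names are illustrative, the factor of 4 is the educated guess from the text, and the 4% CFR is used only as an example input.

```python
def underreporting_factor(unreported_share):
    """Total infections divided by registered cases, given the share left unreported."""
    if not 0 <= unreported_share < 1:
        raise ValueError("unreported_share must be in [0, 1)")
    return 1 / (1 - unreported_share)


def estimated_mortality(cfr, factor):
    """Mortality among all infected, assuming the unreported cases did not die."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return cfr / factor


factor_italy = underreporting_factor(0.75)  # 75% unreported -> factor 4
factor_china = underreporting_factor(0.85)  # 85% unreported -> factor ~6.7

print(f"factor for 75% unreported: {factor_italy:.1f}")
print(f"factor for 85% unreported: {factor_china:.1f}")
print(f"CFR 4% with factor 4 -> mortality {estimated_mortality(0.04, factor_italy):.1%}")
```

Whatever the true factor is, it cannot be below 1, which is why mortality can never exceed the CFR.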
What about old people?
There is no question that the virus rapidly kills people in higher risk groups: seniors, people with diabetes, cancer and more. Hospitals are full and the situation for those people is tragic. The biggest tragedy for these groups is that once fallen sick, their chances for recovery drop rapidly given the lack of medical equipment (ventilators) and facilities. However, there are some clues in the data that much of the tragedy may be caused not by the disease, but either by inefficient procedures, or by public hysteria associated with it. Let’s see:
Here is Italy’s death rate in the previous years (here meaning the crude annual death rate per 1,000 inhabitants, not the closed-case death rate defined earlier). In 2019, it was 10.7. With a population of 60.48 million, that’s about 647,000 deaths per year, or 1772 deaths per day. This is a normal rate in Italy, without the coronavirus.
In contrast, in the year 2020, as provided by italiaora.org, there have been 163,239 deaths so far. Is it more than last year? Yes. Today is March 29th. If people had been dying at the same daily rate as last year, we would have slightly fewer: 155,936 deaths today. This means the coronavirus has caused 7303 extra deaths.
But the worldometer data (based on the reports by the Italian government) says 10023 deaths were caused by the coronavirus in Italy. What does this mean? Is it possible that this number is inflated? A death in an ICU (intensive care unit) is registered according to the disease the individual was in the ICU for. All other diseases that individual had are called comorbidities. Is it possible that the missing 2700 persons are individuals who were infected by the virus when their state was already terminal due to a comorbidity (another disease)?
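The arithmetic behind these figures can be reproduced in a few lines. All inputs are the numbers quoted in the text. Two details are assumptions inferred from the quoted results rather than stated in the text: the daily rate is rounded down to 1772, and 88 days are counted as elapsed between 1 January and 29 March 2020.

```python
from datetime import date

CRUDE_DEATH_RATE_PER_1000 = 10.7   # Italy, 2019, as quoted above
POPULATION = 60_480_000            # as quoted above
deaths_so_far_2020 = 163_239       # italiaora.org counter, as quoted above
reported_covid_deaths = 10_023     # worldometer figure, as quoted above

# Baseline: deaths per year and per day at the 2019 rate.
deaths_per_year = round(CRUDE_DEATH_RATE_PER_1000 * POPULATION / 1000)
deaths_per_day = deaths_per_year // 365  # rounded down (assumption, matches 1772)

# Days elapsed in 2020 up to 29 March (assumption: 1 January counts as day 0).
days_elapsed = (date(2020, 3, 29) - date(2020, 1, 1)).days

expected_deaths = deaths_per_day * days_elapsed
excess_deaths = deaths_so_far_2020 - expected_deaths
unexplained = reported_covid_deaths - excess_deaths

print(f"baseline deaths per day: {deaths_per_day}")
print(f"expected deaths by 29 March: {expected_deaths}")
print(f"excess deaths: {excess_deaths}")
print(f"reported COVID-19 deaths above the excess: {unexplained} "
      f"({unexplained / reported_covid_deaths:.0%} of reported)")
```

The gap of roughly 2,700 deaths is about a quarter of the reported COVID-19 deaths. The result is sensitive to the baseline: a change of a few deaths per day in the assumed normal rate shifts the excess by several hundred.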
In addition, there are many reasons to believe that the public atmosphere indirectly causes at least some of these deaths by overwhelming the medical system with a large volume of coronavirus treatment requests, some of which are mild and do not require medical action.
Note 1: I have not done good statistical work on this. The daily death rate may have seasonal and regional fluctuation, and I don’t have access to this data. Also, comparing year to year is tricky. Years may differ for many reasons (medical and pharmaceutical advances, the demographics of a country with an ageing population). In addition, it has been pointed out by one reader that the state of national emergency might have also influenced the number of deaths, for instance by reducing the number of car accidents and accidental deaths during surgery (there are no surgeries), or by increasing the number of suicides. Definitely more work needs to be done to verify the above reasoning, which should be interpreted as a hypothesis rather than a fact. If anyone among the readers has access to similar research, done not necessarily in Italy but also in other countries, please contact me and I will be pleased to amend the article with a proper citation of your work.
Note 2: I was not able to contact italiaora.org to establish how trustworthy their death counter is.
Hypothesis to be verified:
Mortality in seniors:
Hypothesis: It is possible that 1/4 of the patients who died with the coronavirus in fact died from other causes.
Summary: should we be afraid?
I am far from following any conspiracy theory, or claiming COVID-19 is an invented disease. The coronavirus is real, mortal, and dangerous. It spreads quickly, it kills people, and causes tragedy.
However, it seems (based solely on Italian data) that the societal perception of this disease does not match the real danger. The numbers seem to support this:
1. For Italy, the recalculated case fatality rate is 2x lower than the mainstream narrative
2. Health system collapse can increase CFR to 4-5%, which is tragic, but not 10% or 45% as portrayed in the media
3. For young people: the risk of dying from the coronavirus is very small
4. For seniors: the risk is significant, but the absolute number of deaths may be lower than people think
What’s more, the widespread panic probably contributes to the number of unnecessary deaths, promoting chaos and irrational behaviour.
An important thing that must be said here was best summarized by one reader: Those numbers don’t mean anything once you know someone who is taken by the virus that would be alive otherwise. So far I know three. This cannot be stated better. Even if the mortality is down to one per million, what difference does it make if that one person was your mother? This article focuses on the overall statistics, but digging these numbers we must remain respectful of the thousands of personal tragedies behind them. I am grateful for this comment.
The reason for writing this article
This is where my analysis ends. I wrote it due to my concern that some of the recent national-scale decisions may be based on wrong numbers.
We have shut down economies, brought entire industry segments to the verge of insolvency, and confined people to their homes for weeks. The impact of this will be unprecedented and we may take years to recover (personal bankruptcies, unemployment, increased suicide rate, increase in diseases caused by unhealthy lifestyles, increased domestic abuse in pathological families). Domestic violence has tripled since the start of the lockdown, and 90% of that violence is related to the COVID-19 epidemic.
The scale and possible impact of those decisions is scary. Trying to avoid COVID-19 deaths, aren’t we likely to cause many more non-COVID deaths and misery?
This is a question that I cannot answer. However, decision makers, their advisors and influencers must have access to the correct data to base their decisions upon. Hence my analysis – hopefully useful to some of them.
Post Scriptum & Foot notes
Here is the Facebook page associated with this blog. Follow it to get notified about future posts. Also, your comments are welcome there.
During the discussion after publication, I incorporated several useful comments and discarded some others. Below is the log of these edits.
1. 30th Mar 2020: added this after a clever comment from a reviewer, Krzysztof Bartuś, a medical professional: the state of national emergency might have also influenced the number of deaths, for instance by reducing the number of car accidents, reducing the number of accidental deaths during surgery (there are no surgeries), or increasing the number of suicides.
2. 31st Mar 2020: excellent comment from Rik van Riel: When people who need a ventilator cannot get one, the mortality rate goes up by about an order of magnitude. The top factor that determines the mortality rate seems to be whether or not all the coronavirus patients who get sick at the same time fit in the hospital, and the hospital has enough equipment (and people) to treat them. Your analysis appears to ignore the factor “can everybody who needs treatment get treatment?” That factor may be worth adding, since that is the one factor that the public at large and politicians can actually do something about at this point.
Correct. I think this influences both CFR and mortality. Not by order of magnitude, but significantly, and it is visible in data. I added this. Edit 1: But [Italy CFR] is still higher than S. Korea (3%) and China (4%). […] Collapse of health service […] can contribute to CFR growth from 3% to 5%, but not to 45%. Edit 2: The biggest tragedy for this age group is that once fallen sick, their chances for recovery drop rapidly given lack of medical equipment (ventilators) and facilities.
3. 30th Mar 2020: after repeated questions on the overall mortality, I added this comment: Over the past days, there has been growing literature following similar logic, suggesting the overall mortality rate of COVID-19 may be in the range of 1% or less. I will not focus on that number, for practical reasons: rather than overall mortality, it is more important to establish the mortality per age group.
4. 30th Mar 2020: added this paragraph – grateful for the commentary: An important thing that must be said here has been best summarized by one reader: Those numbers don’t mean anything once you know someone who is taken by the virus that would be alive otherwise. So far I know three. This cannot be stated better. Even if mortality is down to one per million, what difference does it make if that one person has been your mother? This article focuses on overall statistics, but digging these numbers we must remain respectful that behind those numbers are thousands of personal tragedies. I am grateful for that comment.
5. 30th Mar 2020: incorporated a link to the Guardian’s domestic violence article, as suggested by a reader
6. 31st Mar 2020: changed the sentence: case fatality rate is dubious to: case fatality rate is easy to measure, but easy to misinterpret. Plus several minor changes. Thank you Dariusz Walat, the phrasing was unfortunate.
7. 2nd Apr 2020: added this clarification following a discussion with a reader: (the medical and pharmaceutical advances, demographics of the country as an ageing population)
8. 2nd Apr 2020: added the first paragraph and the last section “Why I wrote this article”, after signals from readers of possible misinterpretation
9. 3rd Apr 2020: removed the question mark from the original title, after readers pointed to Betteridge’s law, implying that titles ending with a question mark aim to attract an audience for mediocre content. That was not the intention; the question indicated true uncertainty. However, after the article has been out for a week and received a significant volume of comments, there seems to be growing confidence that the line of thought may actually be correct. Also removed the indication that the text was targeting the general public. After all the edits, it probably isn’t any more.
1. A recurring remark is that data from authoritarian China should be discarded as untrustworthy. More generally: no government-supplied data can be trusted, as it is often deliberately modified. I decided to disregard that comment. I agree that this could be the case, but not just for China. The same concerns Italy and any other country. Data reported by governments could be either imprecise or deliberately tampered with. In the end, the key parameters I discussed, such as CFR by age group, seem to match pretty well between China, S. Korea and Italy, which provides an argument to believe that the data, taken as a whole, is generally correct.
2. I received an intriguing remark from the medical community: there seems to be evidence of incentives and internal pressure on doctors to actually lower the statistics by attributing actual corona deaths to other causes, which may be beneficial to their institution for various procedural and practical reasons. I take this as a possibility, but for two reasons I decided not to explore this thread. First, my article is based on analysis of existing data, while this is a speculative possibility. As an analyst I don’t have the tools to prove or disprove such a thesis by looking at the data. Secondly, even if true, this would not impact the logic I am following. The important part of my reasoning is the idea that CFR should be recalculated based on the distribution of deaths over age groups. The number of deaths, even if significantly higher, is unlikely to change that distribution.
3. What if many people die at home, and those deaths are not included in the statistics, or are attributed to other causes? It is difficult to comment rationally on “what if” questions. Maybe. Maybe not. I see no evidence in the data that this could be happening. It is easy to attribute a death to another cause, but it is difficult to hide the fact of death itself. If many more people were dying at home from SARS-CoV-2, we would see an overall increase in total deaths in Italy in a given time period which, as I explained, does not seem to be happening.
4. I received this comment: you need to look at the expected remaining years of life at a given age before stating that people would have died anyways. I agree with the statement, but I do not think it applies to this particular calculation method: comparing the total number of deaths year-to-year.
5. What if mortality differs between various mutations of SARS-CoV-2? The graphic below, otherwise amazing, comes from netstrain.org and shows the virus mutation history. I discard this argument as speculative. It is a possibility. Following the Occam’s razor principle, when seeing a data discrepancy I lean towards simple explanations (data quality) rather than convoluted ones (virus mutations).
1. This Lancet article estimates mortality in Wuhan as much higher, up to 20%. The authors use non-standard estimation metrics, calculating current mortality based on infection data projected into the past. My knowledge is not good enough to validate that method.
3. Corriere della Sera in this article suggests that in Bergamo, 123 deaths attributed to other causes could have been caused by the coronavirus, thus the real death toll could be even 4x higher. I cannot comment on it, as this appears to be a speculative, isolated case, not providing enough data to extrapolate conclusions to the entire population.
4. A recent Financial Times article claimed half of the UK population is infected already. It has already been criticized elsewhere, and I agree with that criticism. Here is the lifescience.com commentary.
If you have a suggestion or see a flaw, send me your comments as well and I will properly acknowledge your name. The following people helped me with their comments and guidance, or simply by delivering supportive material. This does not mean they agree with my conclusions, or should be associated with this article in any way. Actually, the most useful comments were the highly critical ones. And so I am thankful to Members of Data Science PL group: Izabela Jaworska, Marta Gojtowska, Marek Kochanowicz, Piotr Turek, Łukasz Jochemczyk, Antoni Iskrzyński, Artur Koperkiewicz, Tomasz Pycia, Marcin Adamski, Przemek Maciołek and other professionals and individuals: Michał Żołnowski, Magda Wojtuszek, Krzysztof Bartuś, Wojciech Oleszko, Jackie Corey, Rik van Riel, Bernhard Schott, Dariusz Walat, Miha Ahronovitz, and the translator Małgorzata Bronowska-Huszar who contributed most effort by proofreading the entire text and then translating it into Polish.