ETHICS

# What Everyone Should Know about Statistical Correlation

A common analytical error hinders biomedical research and misleads the public.

In 2012, the *New England Journal of Medicine* published a paper claiming that chocolate consumption could enhance cognitive function. The basis for this conclusion was that the number of Nobel Prize laureates in each country was strongly correlated with the per capita consumption of chocolate in that country. When I read this paper I was surprised that it made it through peer review, because it was clear to me that the authors had committed two common mistakes I see in the biomedical literature when researchers perform a correlation analysis.

Correlation describes the strength of the linear relationship between two observed phenomena (to keep matters simple, I focus here on the most commonly used measure of linear association, Pearson's correlation). For example, an increase in the value of one variable, such as chocolate consumption, may be accompanied by an increase in the value of the other, such as Nobel laureates. Or the correlation can be negative: An increase in the value of one variable may be accompanied by a decrease in the value of the other. Because it is possible to correlate two variables whose values cannot be expressed in the same units (for example, per capita income and cholera incidence), their relationship is measured by calculating a unitless number, the *correlation coefficient*. The correlation coefficient ranges in value from –1 to +1. The closer its magnitude is to 1, the stronger the relationship.
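To make the calculation concrete, here is a minimal sketch in Python using NumPy. The numbers are purely illustrative (they are not the study's data); the point is that dividing the covariance by the product of the standard deviations cancels the units, yielding a value between –1 and +1.

```python
import numpy as np

# Illustrative values only, not the actual study data: per capita
# chocolate consumption (kg/yr) and Nobel laureates per 10 million
# people for five hypothetical countries.
chocolate = np.array([2.0, 4.5, 6.0, 8.5, 11.0])
laureates = np.array([1.5, 5.0, 9.0, 18.0, 25.0])

# Pearson's r via the library routine...
r = np.corrcoef(chocolate, laureates)[0, 1]

# ...and by hand: covariance scaled by the product of standard
# deviations. The units (kg/yr, laureates per capita) cancel,
# leaving a unitless coefficient.
r_manual = (np.cov(chocolate, laureates)[0, 1]
            / (chocolate.std(ddof=1) * laureates.std(ddof=1)))

print(f"r = {r:.2f}")
```

Either route gives the same unitless number, which is why variables measured on entirely different scales can be compared this way.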

The stark simplicity of a correlation coefficient hides the considerable complexity of interpreting its meaning. One error in the *New England Journal of Medicine* paper is that the authors committed an ecological fallacy, in which a conclusion about individuals is drawn from group-level data. In this case, the authors calculated the correlation coefficient at the aggregate level (the country) but then erroneously used that value to reach a conclusion about the individual level (eating chocolate enhances cognitive function). Accurate data at the individual level were completely unknown: No one had collected data on how much chocolate the Nobel laureates consumed, or even whether they consumed any at all. I was not the only one to notice this error; many other scientists wrote about this erroneous analysis. Chemist Ashutosh Jogalekar wrote a thorough critique on his *Scientific American* blog, *The Curious Wavefunction*, and Beatrice A. Golomb of the University of California, San Diego, even tested the hypothesis with a team of coauthors, finding no such link.
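The ecological fallacy can be demonstrated with a small simulation. In this hypothetical setup (the "wealth" mechanism and all numbers are my assumptions, not the paper's data), a country-level factor drives both national averages, so the country-level correlation is nearly perfect even though, within any country, an individual's chocolate intake tells you nothing about that individual's score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario: 20 "countries", each with a wealth level
# that raises BOTH average chocolate consumption and average
# cognitive test scores.
wealth = rng.normal(size=20)
n_people = 200

within_country_r = []
choc_means, score_means = [], []
for w in wealth:
    # Within a country, an individual's chocolate intake and test
    # score vary independently around the national averages.
    choc = 5 + 2 * w + rng.normal(size=n_people)
    score = 50 + 10 * w + rng.normal(size=n_people)
    within_country_r.append(np.corrcoef(choc, score)[0, 1])
    choc_means.append(choc.mean())
    score_means.append(score.mean())

# Aggregate (country) level: a very strong correlation.
country_r = np.corrcoef(choc_means, score_means)[0, 1]
# Individual level: essentially none.
mean_within_r = np.mean(within_country_r)

print(f"country-level r: {country_r:.2f}")
print(f"mean within-country r: {mean_within_r:.2f}")
```

The aggregate correlation is real, but it describes countries, not people; concluding that eating chocolate helps any individual think better is exactly the leap the fallacy forbids.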

Despite the scientific community's criticism of this paper, many news agencies reported on its results. The paper was never retracted, and to date it has been cited 23 times. Even when erroneous papers are retracted, news reports about them remain on the Internet and can continue to spread misinformation. If faulty conclusions reflecting statistical misconceptions can appear even in the *New England Journal of Medicine*, I wondered, how often are they appearing in the biomedical literature generally?

The example of chocolate consumption and Nobel Prize winners brings me to another, even more common misinterpretation of correlation analysis: the idea that correlation implies causation. Calculating a correlation coefficient does not explain the nature of a quantitative relationship; it only assesses its strength. Two factors may show a relationship not because either influences the other but because both are influenced by the same hidden factor; in this case, perhaps a country's affluence affects both access to chocolate and the availability of higher education. Correlation can certainly point to the possible existence of causality, but it is not sufficient to prove it.
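The hidden-factor mechanism can also be simulated directly. In this sketch (the variable names and the partial-correlation check are my illustration, not anything from the paper), an unobserved "affluence" factor drives two variables that never influence each other; the raw correlation is substantial, but once the hidden factor's linear effect is removed from each variable, the residual correlation collapses to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hidden common cause (think of a country's affluence): it drives
# both variables, which have no causal effect on each other.
affluence = rng.normal(size=n)
chocolate = affluence + rng.normal(size=n)
education = affluence + rng.normal(size=n)

# Raw correlation: clearly positive despite zero causal link.
r_raw = np.corrcoef(chocolate, education)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y via least squares."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left of each variable
# after accounting for the hidden factor.
r_partial = np.corrcoef(residuals(chocolate, affluence),
                        residuals(education, affluence))[0, 1]

print(f"raw r: {r_raw:.2f}, partial r: {r_partial:.2f}")
```

Of course, in real studies the confounder is usually unmeasured, which is precisely why a correlation coefficient alone cannot establish causation.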

The eminent statistician George E. P. Box wrote in his book *Empirical Model-Building and Response Surfaces*: "Essentially, all [statistical] models are wrong, but some are useful." Every statistical model describes a real-world phenomenon using mathematical concepts; as such, it is necessarily a simplification of reality. If statistical analyses are carefully designed, in accordance with current good-practice guidelines and a thorough understanding of the limitations of the methods used, they can be very useful. But if models are not designed in accordance with these two principles, they can be not only inaccurate and completely useless but also potentially dangerous, misleading medical practitioners and the public.

I often use and design mathematical models to gain insight into public health problems, especially in health technology assessment. For this purpose I use data from already-published studies. Uncritical use of published data in designing these models could lead to inaccurate, completely useless, or worse, unsafe conclusions about public health.
