It’s way too easy to get a false impression of a study’s results. The combination of jargon and specialized measurement techniques can get baffling or misleading – especially if writers are trying to convince you to be excited or scared. And that’s a lot of the time!
So it definitely pays to take a bit of time with the details. Some background, though, before we get to my top 6 tips for deciphering outcomes.
An outcome is what is measured for results in a study. For example, “mortality” could be an outcome. In clinical trials and systematic reviews, there will almost always be several outcomes. To try to stop people from picking and choosing which outcome is important after they already know the results, outcomes should be designated, ahead of doing the research, as either “primary” or “secondary” – and then all of them should be reported. The size of a study should be large enough to be able to reliably assess the primary outcome(s). It often won’t be big enough to get a good handle on the secondary outcome(s). Arguing a treatment “works” based on secondary outcomes when primary outcomes came up empty is resting on very swampy ground!
Unfortunately, though, it’s common for authors to move the goal posts after the results are in. When the results for the primary outcomes are disappointing, it’s tempting to decide a secondary one is more important after all. Ben Goldacre calls this “outcome switching”:
If researchers switch from these pre-specified outcomes, without explaining that they have done so, then they break the assumptions of their statistical tests. That carries a significant risk of exaggerating findings, or simply getting them wrong, and this in turn helps to explain why so many trial results eventually turn out to be incorrect.
Goldacre and his colleagues show how common this is, by looking at trials registered at their start, and then published in some major journals. Primary outcomes were switched for roughly 1 in 5 clinical trials, and there were outcomes not reported at all more than half the time. Sigh!
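Here is a rough sketch of why choosing the "winning" outcome after the fact inflates false alarms. It assumes, as my own simplification and not a figure from Goldacre's work, that a trial's outcomes are independent of each other and that each is tested at the conventional 0.05 threshold. The function name is made up for illustration.

```python
def chance_of_false_positive(num_outcomes, alpha=0.05):
    """Chance that at least one of several independent outcomes looks
    'statistically significant' when the treatment truly does nothing."""
    return 1 - (1 - alpha) ** num_outcomes


for k in (1, 5, 10):
    chance = chance_of_false_positive(k)
    print(f"{k:>2} outcomes to choose from: {chance:.0%} chance of a fluke 'win'")
```

With a single pre-specified primary outcome, the risk of a fluke stays at about 5%. With ten outcomes to pick from afterwards, it climbs to roughly 40%. Real outcomes are usually correlated with each other, so the true figure in any given trial would differ.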
And now on to my tips for deciphering what is reported!
1. The quality of data varies from outcome to outcome within a study.
Thinking “this is a good study” is a trap! Some weaknesses in a study downgrade the reliability of pretty much everything in it, but the reverse isn’t the case: being a good study isn’t a guarantee of quality for every outcome. Think of it this way: a study is a collection of elements of uneven quality.
One outcome could be totally objective, easy to measure, and cover 100% of the people in the study, while another could be hard to measure, have lots of missing data, or be plagued by the risk of a specific bias. One example is recall bias: people’s memories are more unreliable for details from years ago, or for things they didn’t realize were significant at the time.
Another is outcome assessment bias. If possible, outcome assessment should be “blind” (also called “masked”): the people doing the “measuring” shouldn’t know what group they’re looking at, when they’re classifying people’s x-rays or whatever. Why?
Say the study is a clinical trial of a drug, and I’m really worried about its safety. And I know I’m looking at the result for a patient who took the drug. Even without realizing it, I might be extra diligent – and therefore spot more problems for those people than for the ones who didn’t take the drug. I could make different line calls for subjective assessments. And that’s how we end up with a bias – a systematic difference between one group and the other that isn’t a result of taking that drug.
(More from me on this, with examples, at Statistically Funny.)
2. Read the fine print: what it’s called might not be a good reflection of what the outcome really means.
Don’t put too much weight on the name of an outcome – check what it really means. There could be fine print that would make a difference to you – for example, “mortality” is measured, but only in the short-term, and that’s not mentioned in the abstract. Or the name the outcome is given might not be what it sounds like at all. People use the same names for outcomes they measure very differently. Even something that sounds cut and dried can be different. “Perinatal mortality” – death around childbirth – starts and ends at different times before and after birth, from country to country. “Stroke” might mean any kind, or some kinds. And so on.
The “How are you?” cartoon above gets to another problem. To measure in a way that can detect differences between groups, researchers often have to use methods that bear no relationship to how we think of a problem, or usually describe it.
Pain is a classic example. We use a lot of vivid words to try to explain our pain. But in a typical health study, that will be standardized. If that’s done with what’s called a “dichotomous” outcome – a straight up “pain: yes or no” type question – the result can be easy to understand.
But pain could be measured on a scale (a “continuous” outcome): how bad is that pain, from nothing to the worst you can imagine? By the time average results between groups of people’s scores get compared, it can be hard to translate that back into something that makes sense. That’s what the woman in the cartoon is doing: comparing herself to people on a scale.
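As a sketch of what comparing average scores can look like, the snippet below takes two small groups of pain scores and reports the raw difference in points along with a standardized mean difference (the difference divided by the pooled standard deviation), which is one common way of reporting continuous outcomes. The scores are invented for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical pain scores on a 0-10 scale (invented for illustration).
treatment = [3, 4, 2, 5, 4, 3]
control = [5, 4, 6, 5, 3, 6]


def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


mean_difference = mean(treatment) - mean(control)
smd = standardized_mean_difference(treatment, control)
print(f"Difference in average score: {mean_difference:.2f} points")
print(f"Standardized mean difference: {smd:.2f} standard deviations")
```

Neither number tells you how many people got meaningful relief, which is why results like these can be so hard to relate to your own experience.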
Some researchers put in the hard work of converting study measures back into something that makes sense in human terms. It would help the rest of us if more of them did that!
(More about standardized mean differences and standard deviations starts from this explainer at Statistically Funny.)
3. Remember biomarkers and other surrogates are substitutes, not health outcomes.
Getting data on many health outcomes takes so long, or needs such large numbers, that it makes sense to find a valid substitute that’s easier to reach in a study. Measuring viral load – the amount of virus in the blood – in tests of HIV treatments, for example.
Surrogates are still substitutes, though, and sometimes they fail to deliver. A change in a surrogate doesn’t always translate into the same change in the real outcome. For example, there have been drugs that lowered cholesterol or that worked on a cancer biomarker, but also turned out to increase death rates in the long run. (More about surrogate outcomes and their problems here at Statistically Funny.)
That means there’s more uncertainty when results come only from surrogate outcomes, and it’s why you need to be cautious. So how can you tell if it’s a surrogate?
Look for the words biomarker, surrogate, or intermediate endpoint. Progression-free survival is a surrogate for survival. A rule of thumb: if it needs a laboratory test to get the result, it’s probably a surrogate.
4. Look very closely when several outcomes are combined and treated as a single outcome.
Like surrogate outcomes, composite outcomes – also called composite endpoints – can enable results to be reached more quickly. An example is “major adverse cardiac events”. More than one outcome is combined statistically, and the result is treated as a single outcome. It’s not simple: the technique needs to account for people who have more than one of the outcomes, for example. And again, don’t jump to conclusions based on the name: they may not be what they sound like, and the same name can cover different combinations.
This new outcome can be powerful and reliable, if the combination makes sense, the outcomes are all about the same level of seriousness, and the change in each outcome goes in the same direction. If one gets worse while the others get better, that’s a worry.
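To make that concrete, here is a minimal sketch of one simple way a composite could be tallied: each person counts once if they had any of the component events. The component list and the participant data are invented for illustration, and real trials often use more involved methods, such as time to first event.

```python
# Hypothetical definition of a composite outcome (names are assumptions).
COMPONENTS = ("heart attack", "stroke", "cardiac death")

# Each set holds the events one participant had (invented data).
treatment_group = [{"heart attack", "stroke"}, {"stroke"}, set(), set(), set(), set()]
control_group = [{"heart attack"}, {"heart attack"}, {"cardiac death"}, set(), set(), set()]


def composite_events(group):
    """Number of participants with at least one component event.
    Someone with two events is still counted only once."""
    return sum(1 for events in group if events & set(COMPONENTS))


def component_events(group):
    """Number of participants with each component, reported separately."""
    return {name: sum(1 for events in group if name in events) for name in COMPONENTS}


for label, group in (("treatment", treatment_group), ("control", control_group)):
    print(label, "composite:", composite_events(group), "components:", component_events(group))
```

In this made-up data the composite favors the treatment group (2 people versus 3), but strokes go the other way (2 versus 0). That is the kind of mixed picture that only shows up when the components are reported separately.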
There are nightmare scenarios, though, where researchers go fishing for a combination that hits the jackpot, without reporting those that don’t. So another thing to look out for: was the composite outcome pre-specified in a public protocol done before they launched into doing the study?
(More on composite outcomes here at Statistically Funny.)
5. The size and importance of an effect are a separate issue from “statistical significance”.
Concentrating on the effect size is critical. It’s the difference in an outcome between groups being compared.
However, a report can focus so much on having found a statistically significant effect – a low p value – that it’s easy to overlook whether or not the effect size was actually important.
Why don’t these always go together? Because a p value is heavily affected by the size of the study, not just the size of the effect. So you can have a “statistically significant” effect that isn’t clinically important at all.
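A quick sketch of that point, using a simple two-sided z-test for a difference in average scores between two equal-sized groups. The numbers are invented: a 1-point difference on a scale where scores have a standard deviation of 10. Treating the standard deviation as known is a simplifying assumption, and the function name is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p_value(mean_difference, sd, n_per_group):
    """Two-sided p value from a z-test comparing two equal-sized groups,
    assuming the standard deviation is known and the same in both."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return erfc(abs(z) / sqrt(2))


# Same effect size in both trials; only the number of participants changes.
small_trial = two_sided_p_value(mean_difference=1, sd=10, n_per_group=50)
large_trial = two_sided_p_value(mean_difference=1, sd=10, n_per_group=5000)

print(f"50 people per group:   p = {small_trial:.3g}")
print(f"5000 people per group: p = {large_trial:.3g}")
```

The small trial gives a p value of about 0.62, nowhere near “significant”, while the large one is far below 0.001. The effect is the same 1 point in both, and whether 1 point matters to anyone is a question the p value can’t answer.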
And a note about the fortune cookie cartoon: obviously, you don’t want an adverse effect to be big. Which brings us to the last, but definitely not least, tip.
6. Keep possible good and adverse outcomes in mind, and in perspective.
That sounds too obvious to need to say, doesn’t it? But it’s surprisingly easy to forget to look for adverse effects when a study is trumpeting benefits. And you can’t count on researchers, or others reporting on a study, to draw your attention to this either. Underplaying or not even reporting adverse effects at all is a common way to spin the results of a study. (Academic or journalist spin is when findings are made to look stronger or more positive than is justified by the study.)
The reverse is true as well. Sometimes, people put so much effort into drawing attention to adverse events that it’s easy to lose sight of the balance between benefit and harm.
More on what to look out for with adverse events and adverse effects here.
But let’s end on a more positive note. One of the most important developments in clinical trials has been the ramping up of organized efforts to standardize and improve which outcomes get measured and how.
A pioneer here was OMERACT, which has been working to improve outcome measures in rheumatology internationally for over 20 years, including patient representatives since 2002. Patients, for example, got fatigue added as a critical outcome for studying arthritis.
OMERACT is a major process, with regular international conferences. It used to be a unicorn. But the COMET Initiative database of publications on group efforts to agree on core outcomes for effectiveness trials was getting close to 1,000 entries as I finish writing this post. Major journals have even combined to tackle core outcomes for all women’s and newborn health (CROWN).
Understanding health studies – one at a time, and the bunches on the same topic – is so much easier when you can rely on the minimum essential outcomes all being clear, consistent, and revealed. More power to everyone who’s working to make that happen!
* The thoughts Hilda Bastian expresses here at Absolutely Maybe are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.