You would think a time when we needed to be able to “follow the science” would have been systematic evidence’s shining hour…
This is one of my older cartoons. Unfortunately, the problem it’s depicting hasn’t gone out of style. But now it strikes me as kind of out-of-date. Why? Because it doesn’t show a bunch of scientists across the street taking sides and shouting at each other.
That’s typically followed now by concern about “cancel culture” – a pointless debate which often just distracts attention from critical discussions we should be having. I think that happened with a recent pair of duelling systematic reviews about Covid’s infection fatality rate (IFR).
But before we get to the questions I think we need to address, we have to unpack the scientific issue in dispute. The IFR is a critical number in pandemic response models, for example. Even very small differences have a huge impact: half a percentage point is a million people dead out of every 200 million who get infected – including those who never even knew they’d gotten it.
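To make that arithmetic concrete, here is a minimal sketch. The function name `expected_deaths` is just for illustration, and the only inputs are the figures from the paragraph above: 200 million infections and half a percentage point of IFR.

```python
def expected_deaths(infections: int, ifr_percent: float) -> float:
    """Deaths implied by an infection fatality rate given in percent."""
    return infections * ifr_percent / 100


# Figures from the text: 200 million infections, IFR of 0.5%.
infections = 200_000_000
deaths = expected_deaths(infections, 0.5)
print(f"{deaths:,.0f} deaths")  # 1,000,000 deaths
```

The same function shows why small disagreements matter: every extra 0.1 of a percentage point adds another 200,000 deaths per 200 million infections.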
Here’s the background of this particular gunfight at the Twitter corral:
- March 2020: John Ioannidis nailed his colors to the Covid-is-no-worse-than-the-flu mast, suggesting that the US might suffer only 10,000 deaths. His reputation took a beating over it.
- April 2020: A preprint of a study suggested so many people had asymptomatic infections in Santa Clara, California, that the infection fatality rate (IFR) for Covid-19 was very low: 0.12% to 0.2%. It was highly controversial for many reasons. It supported the Covid-is-no-worse-than-the-flu position – fast becoming the core of Covid denialism. Ioannidis was a co-author. His reputation took another beating.
- May 6, 2020: Gideon Meyerowitz-Katz and Lea Merone’s preprint of a systematic review and meta-analysis went online, with an estimated global IFR of 0.75% (with a range of uncertainty from 0.49 to 1.01%) – several times higher than the Santa Clara estimate. The preprint was updated 3 times between then and July. In July, the CDC updated their Covid models using this review’s estimate. The first version currently has an Altmetric score of 1,946. Altmetric scores measure how much attention an article is getting, and that’s a very high one: for a journal article, it would get you close to halfway up Altmetric’s top 100 for 2020.
- May 19, 2020: Ioannidis’ preprint of a sole-authored systematic review of studies inferring Covid’s IFR went online, including the Santa Clara study he co-authored. His conclusion again pegged Covid’s IFR very low – at roughly the same end of the spectrum as his Santa Clara study: 0.02% to 0.40%. The first version currently has an Altmetric score of 4,324. (Which would be nudging close to the top 20 if it were a journal article.)
- May 20, 2020: Meyerowitz-Katz criticized Ioannidis’ review on Twitter, with over 2,000 likes and over 1,000 retweets.
- October 12, 2020: Meyerowitz-Katz criticized another Ioannidis publication on Twitter, this time a Covid commentary – with reference back to the disputed IFR estimate.
- October 14, 2020: The Ioannidis review was published in the Bulletin of the World Health Organization. According to Google Scholar as I’m writing, it has been cited 190 times.
- October 15, 2020: Meyerowitz-Katz took aim at Ioannidis’ review on Twitter again.
- December 2020: The Meyerowitz-Katz systematic review was published in the International Journal of Infectious Diseases. According to Google Scholar, it has been cited 194 times.
- March 2021: Ioannidis published a sole-authored systematic review of the systematic reviews of Covid-19 IFR, including his own and Meyerowitz-Katz’s. He pretty much judges his own review to be reliable, and Meyerowitz-Katz’s particularly unreliable. In an appendix – which he has since withdrawn, thank heavens! – he made claims about Meyerowitz-Katz’s qualifications, his Twitter account, his Twitter bio, the photo on his Twitter account (including his T-shirt)… you get the picture. Heavily personal. This led to a full-blown Twitter storm. (My response to that extraordinary salvo was to write my Cartoon Guide to Criticism: Scientist Edition.)
I think this counts as a full-on feud between these 2 scientists, and it seems to be expanding beyond IFR. But I don’t want to discuss their behavior here. I want to discuss the science side of all this, and what issues this episode raises for the quality of science.
Major cheer and jeer squads formed around both reviews, often praising one review and heaping disdain on the other. It wasn’t just about a difference in interpretation of data: these were fundamental issues about what counts as reliable science in systematic reviewing – and that’s a highly specialized area. So what should we make of these respective claims? Is one, the other, or both of these systematic reviews excellent – or as diabolically bad as detractors say? And what are the implications of scientists’ conflicting claims if the answer is actually cut and dried?
It would take far too long to dig into all the detail about these 2 reviews, and every claim made about them. But there’s no need to. The picture gets very clear, very quickly. (Note: I criticized both these reviews heavily when they were in preprint, but never followed up to see what was in the published versions, and how much of the pre-publication critique the authors attended to.)
I’ve spent a few decades analyzing multiple systematic reviews on the same question, and I’ve studied reviews with conflicting conclusions, too. Over the years, I narrowed it down to a list of 5 questions that save time by quickly knocking out most of the worst, least reliable systematic reviews. One isn’t relevant to this debate – it’s about whether the review is up-to-date. But let’s go through the other 4 questions for these 2 reviews. From here on, I call Ioannidis (October 2020) the “I” review, and the one by Meyerowitz-Katz and Merone (December 2020) the M&M review.
1. Are there clear, pre-specified, eligibility criteria for studies being chosen or rejected for the review?
This is key to being systematic – and the point of being systematic is to make a review’s results more reliable by minimizing the biases that lead to them. You want to know for sure that the goalposts aren’t moving around so people can include studies they want, and kick out those that are “inconvenient” for whatever reason. Ideally there is a pre-published protocol, so we can see if the goalposts shifted.
Now of course, if you already know some studies you want to keep out or allow in, you can set up criteria that operationalize your bias. So what we’re looking for here are justifiable criteria and methods that clearly aim to minimize bias.
And there are a lot of potential studies that could be included. In Ioannidis’ review of 6 reviews conducted within about a 3-month period, the largest number of studies included in a single review was 338, 2 had more than 80, and the other 3 each had fewer than 30. Clearly the scope for biased selection in a review on this question is pretty enormous.
The “I” review is very explicit about the criteria applied, but there is no protocol for this review. There were 3 versions of the preprint that preceded it, though. And it did not start off with the same explicit criteria as in the final version.
The scope is narrower for this review – only seroprevalence studies. That would tend towards lower estimates of IFR because of presumably larger denominators of people with asymptomatic infections. And there is a limitation in study size, which is a subject for debate.
It’s impossible, though, to get past the high risk of bias of a sole-authored systematic review conducted by a co-author of a primary study that caused him reputational damage. In his review of reviews, Ioannidis writes that he is also a co-investigator for a second of the included studies, for which he’s not a named co-author. In the “I” review, he declares being a co-author of one of the included studies. So for me, the “I” review passes this question, although not with flying colors. But it doesn’t get over the hurdle of the intent of this aspect of a systematic review: to give you confidence that the selection of studies was reasonably unbiased.
The M&M review gets a straight-up “no” to this first question. Again there was no protocol, and again there were several versions in preprint previously. The final has only 2 explicit eligibility criteria – and it’s explicitly stated if they were met, they were included. But it’s evident even within the paper that this is not so, as they list some studies excluded despite meeting the inclusion criteria – including one because the authors “explicitly warned against using its data to obtain an IFR”. That would be a really weird exclusion criterion – but it’s also mystifying: I can’t find any statement remotely like that in the publication cited.
Far more problematic, though, is the very clear evolution of the criteria as new studies emerged that the authors wanted to include: they changed the criteria to allow that. (By dropping the criterion of being published in English, for example.) Changing criteria along the way isn’t necessarily a bad thing, of course, but it does have implications: how you re-do your previous literature screening to accommodate the change, for example. Transparency is absolutely critical though, and the final paper is not transparent about the evolution of eligibility so that readers can assess the potential for bias in the iterative process.
And there’s a further problem for this review on the question of selection criteria to assemble as unbiased a study pool as possible. They did not ensure that population estimates were not multiple-counted. So the same groups of people can go into the study pool several times via different studies and thus get counted towards the totals multiple times (like the people on board the Diamond Princess cruise ship, for example).
2. Did they make a strong effort to find all the studies which could have been eligible?
Well, they made a lot of effort, but I don’t think either clears the “strong” bar. Neither has a librarian or information specialist involved, and so it’s not surprising that the quality of their search strategies is so low. (In January this year, a systematic review community standard for reporting on search strategies was published, called PRISMA-S.)
You can call the search terms “broad” as a technical matter, but I think they’re best described as vague. All searching is in English. Neither tells you how the records and de-duplication were managed. That may sound trivial, but it’s not. I want to know if this was done professionally or not, because professional records management minimizes the chances for records to fall through the cracks and optimizes quality control.
The descriptions of what they actually did are imprecise too. What does that mean? Well, for example, I tried to do what the “I” review says for one of the preprint servers (SSRN), and I couldn’t figure out exactly what had been done so I could be sure I had done the same thing – and nothing I tried came even remotely close to the results reported. ¯\_(ツ)_/¯
The entire searching and selection process for the “I” review was done by a single person, so there is no attempt to minimize error or bias in these processes. On the other hand, the narrow scope – seroprevalence studies – makes it more likely that the studies were findable, and fairly generally likely to be in the places searched.
The M&M search strategy is worse, although there is at least a second author in some of the later selection processes. The reporting of the searching only picks up after there had already been screening by a single author: there are already only 269 studies in consideration at that point – no reporting of the thousands of records that were discarded. Given that government reports internationally are a major category of included studies, searching only in English is a far bigger problem for this review than the other.
Then there is the underlying problem of the changing inclusion criteria as this review had successive updates. There’s no explanation of how this was handled. Did they go back and start again from scratch each time? Had they kept records so complete – even for Google, Google Scholar, and Twitter searches – that they could go back and re-screen with the new eligibility criteria? Neither of those seems likely. Which leaves going back and doing a patch search for the particular new criteria: it’s not clear if they did that. They report the search as if it were a once-off. And they report the review as though the eligibility criteria were a constant, not iterative, expanding and contracting along the way. (This seems less of an issue for the “I” review, which had a broadly similar type of potentially eligible study at the beginning and end.)
3. Can you see a list of the studies that were excluded from the review?
This matters a lot for these reviews, given the problems in the 2 steps we’ve just looked at. If you really want to see if there was author bias in exclusions, you need to be able to see this – at least for the ones that were screened in full text.
The “I” review did not provide reasons for exclusions, or a list of the excluded studies at any stage. The M&M review did not provide reasons for exclusions. The flow diagram says 15 studies were excluded at full-text assessment stage, and there is a description in the text of reasons for excluding 14 studies: presumably that’s all but one of those. So the M&M review comes out ahead on this point, but not with flying colors.
4. Have they given you some indication of how good they think the studies they included are?
Yes, both reviews did this. So along with being about as up-to-date as you could expect at the time of this dispute, that’s a point in their favor. That’s not enough to save them, although it’s clear that one has far more problems than the other.
Although I’m not going to dig into all the other issues, claims, and counter-claims about these 2 reviews, there are 2 additional issues that I think are important to touch on about the M&M review. One is a major methodological issue, and the other is a really simple reporting quality issue.
The first is the issue of the meta-analyses – the statistical combination of data from multiple studies. I’ve written an explainer about understanding the data in meta-analyses here, if you want more context.
One of the key rules to keep in mind is just because you can throw a bunch of numbers into a statistical pot, it doesn’t mean you should. Here’s the image I use to bring that message home when I teach this basic principle of meta-analysis:
The “I” review doesn’t combine the various IFR estimates, arguing the IFR varies too much for that to make sense. The M&M review does, though. (Using a random effects model, for those who want to know that detail.)
There’s a test for whether there might be too much difference in a set of studies to pool the data: it’s called a test for heterogeneity. It’s not a perfect science, but if there’s a lot of heterogeneity, it really calls into question whether the data even belong together. Even 75% on that heterogeneity test is classed as “considerable”. The rate in the meta-analyses in the M&M was 99%. This is what the authors say about that:
The main finding of this research is that there is very high heterogeneity among estimates of IFR for COVID-19 and therefore, it is difficult to draw a single conclusion regarding the number. Aggregating the results together provides a point estimate of 0.68% (0.53%–0.82%), but there remains considerable uncertainty about whether this is a reasonable figure or simply a best guess.
That’s stressed in the abstract, too. But there isn’t a discussion of the validity of meta-analysis for this data at all – or of using a random effects meta-analysis. Neither of the authors is a statistician, and it raises the question of whether there was statistical peer review here, when it was obviously needed. (Note: I’m not a statistician either.)
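For readers who want to see where a heterogeneity percentage like that comes from, here is a rough sketch. It assumes the figure is the commonly used I² statistic, calculated from Cochran's Q with inverse-variance weights. The IFR estimates and standard errors below are invented purely for illustration: they are not data from either review, and this is not the authors' own calculation.

```python
def i_squared(estimates, std_errors):
    """I² heterogeneity statistic (in percent), derived from Cochran's Q."""
    # Inverse-variance weights: more precise studies count for more.
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return 0.0
    # Share of the variation that goes beyond what chance alone would produce.
    return max(0.0, (q - df) / q) * 100


# Hypothetical IFR estimates (%) with hypothetical standard errors.
made_up_ifrs = [0.1, 0.5, 1.0, 1.5]
made_up_ses = [0.05, 0.05, 0.05, 0.05]
print(f"I² = {i_squared(made_up_ifrs, made_up_ses):.1f}%")
```

With estimates that far apart and that precise, nearly all of the variation is attributed to real differences between studies rather than chance, which is the situation in which pooling becomes hard to defend.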
The simple reporting quality issue? The abstract says there are 24 estimates; the flow diagram says there are 26 included studies; Table 1 lists 27 studies; the text after Table 1 says 40 papers were reviewed in full text (the flow diagram says it was 42) and 25 studies were included “in the qualitative analysis”; and the meta-analysis includes 26 studies (with the text above it describing it as including all 24 included studies). And I raise that because it’s such an obvious quality red flag: it’s easy to see something has gone terribly wrong in quality control for this review.
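This kind of mismatch is simple enough to check mechanically. The sketch below just tabulates the counts quoted above from the M&M paper; the dictionary labels are my own shorthand, not terms from the paper.

```python
# Included-study counts as reported in different parts of the M&M paper.
reported_included = {
    "abstract": 24,
    "flow diagram": 26,
    "Table 1": 27,
    "text after Table 1": 25,
    "meta-analysis": 26,
    "text above meta-analysis": 24,
}

# Full-text papers reviewed, as reported in two places.
reported_full_text = {"text after Table 1": 40, "flow diagram": 42}

distinct_included = sorted(set(reported_included.values()))
distinct_full_text = sorted(set(reported_full_text.values()))

# A consistent paper would yield a single value in each list.
print("Included-study counts reported:", distinct_included)
print("Full-text counts reported:", distinct_full_text)
```

A paper with consistent reporting would produce exactly one number for each tally.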
There really isn’t a lot of reason for confidence in the quality of peer review or editorial care for this review. (And it suggests, by the way, that peer reviewers can’t be relied on to look at the criticisms posted on preprints of the manuscript they are reviewing.)
The bottom line, though: whether they’re right on the particulars or not, each side of this dispute is right in claiming the other review has deep flaws – though one has more than the other. But what does it mean that there are so many scientists claiming one or the other is solid and has come to a reliable answer?
And what about the hundreds of citations? I don’t know what proportion of the papers are citing one of these as “the” estimate for Covid-19’s IFR, and how often the estimate is used in a way that has serious implications – the CDC was one clear example of that, though. Systematic reviews are regarded as a gold standard, so people are going to reach for one to cite. The trouble is, as I argued in my last post, they’re not a “neutral good”. They can be wildly misleading, so they can give unjustified heft to highly biased claims.
To me, the success of these 2 reviews, and the content of much of the dispute around them, lead to 2 depressing conclusions. Awareness of what makes a systematic review reliable is still shockingly low, including among scientists who see themselves as experts in judging the quality of scientific claims. And as a consequence, even after more than 50 years of development of rigorous methods for systematic reviewing, scientists are still too often building their work on a foundation of perilously shaky knowledge of the science that’s gone before.
Disclosures: I studied the prevalence of some types of post-publication events in clinical trials and systematic reviews as part of my PhD. I was the lead scientist/editor for PubMed Commons, a commenting system in PubMed that ran from 2013 to early 2018 (archived here).
I have never met Gideon Meyerowitz-Katz (and haven’t cited or criticized his work prior to this episode). I have known John Ioannidis for decades (and often cited and praised his publications). I have also (less often) criticized work he’s done, before and during the pandemic (see for example, here, here, and here; and I wrote a post about EBM heroes and disillusion around one of those episodes.)