5 Things We Learned About Peer Review in 2024

Back in 2019 I wrote a couple of posts summarizing what we had learned from research about peer review at journals. Since then, I’ve done an annual research roundup to keep up with the field. This is the sixth of those posts. (The posts in this series are tagged here.)
We’ve been averaging about one randomized trial a year. I found 3 trials from 2024, all related to the model of peer reviewing submissions eligible to be published in full in conference proceedings. In this slow drip of evidence, we’re mostly getting results on a spray of questions, many of them new, rather than a gradual growth of strong answers to core questions.
For me, the most important things I learned about peer review in 2024 didn’t come from trials, but from the first couple of observational studies of journal peer review in my list below. There was food for thought, too, in a discussion about forms of peer review appropriate to indigenous health research, which challenges the epistemological underpinnings of the current academic model. Clinton Schulz and colleagues wrote about yarning in peer review, incorporating two-way processes of communicating and reaching consensus.
“Yarning,” write Schulz et al, “has already been recognised as a culturally appropriate process for engaging with Indigenous groups and individuals in conducting research, facilitating in-depth discussions and allowing for the collection of rich data.” Their journal is committed to developing a process of structured dialogue as a respectful way of working with First Nations contributors. This, they hope, will be “a form of peer review, which is more inclusive and culturally attuned but also deeply collaborative.” I look forward to hearing more from them about this experience.
Meanwhile, let’s get stuck into my 5 topics from 2024’s research, with a summary for each, linking to more detail about the study.
- The quality of peer review at some journals could be improved by structured peer review.
- Peer reviewers are usually uncertain about their recommendations on questions like whether or not to accept a manuscript.
- Anonymizing discussion among peer reviewers might have some influence, though acceptance rates might not be affected.
- People might be biased towards assuming longer peer reviews are better, even when extra length isn’t contributing to improved quality.
- A “rebuttal stage” now widely offered to authors after the first round of peer review in some fields is controversial. Peer reviewers’ first impressions could theoretically be hard to budge, but we don’t know if this is a big problem.
1. The quality of peer review at some journals could be improved by structured peer review.
[Observational pilot study]
This paper had me scratching my head, trying to remember how structured (or not) peer reviewing is at the main journals at which I’ve peer reviewed. In biomedicine at least, I don’t think it’s unusual, but I realized I couldn’t say for sure – and I couldn’t find a study that described peer review formats for science journals.
In my roundup post for 2022, I concluded after a couple of new trials that “Prompting peer reviewers to look for particular quality issues is starting to look like a dead end – at least for substantial improvements in reports of biomedical research.” Some of the problems researchers were trying to solve with those interventions were quite specialized, and often somewhat labor-intensive for the peer reviewers. How many questions, and of what type, are the absolute minimum, though?
The new study is from Mario Malički and Bahar Mehmani. Malički is a meta-scientist and co-editor-in-chief of the journal Research Integrity and Peer Review. Mehmani is from the Innovation and Publishing Development department at Elsevier. Their study reports results from a randomly selected subset of 23 of the 220 Elsevier journals that piloted a 9-question structured peer review format. The journals’ editors could adapt the questions.
All the questions are yes/no, and all could require further detail, depending on the answer. A couple of examples: “If applicable, is the application/theory/method/study reported in sufficient detail to allow for its replicability and/or reproducibility?”; and “Have the authors clearly stated the limitations of their study/methods?” There had been a previous small pilot of 5 questions. An open field for a “comments to authors” type of peer review report followed the questions.
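To make the format concrete, here’s a minimal sketch of what one of these structured reviews could look like if captured as data. The field names, the conditional-detail convention, and the example answer are my own illustration, not Elsevier’s actual implementation.

```python
# Illustrative only: a structured peer review captured as data. The field names
# and the conditional-detail convention are assumptions, not Elsevier's system.
from dataclasses import dataclass, field


@dataclass
class StructuredAnswer:
    question: str
    answer: str           # "yes" / "no" / "not applicable"
    detail: str = ""      # free-text explanation, expected for some answers


@dataclass
class StructuredReview:
    manuscript_id: str
    answers: list = field(default_factory=list)
    comments_to_authors: str = ""   # the familiar open-ended report still follows


review = StructuredReview(
    manuscript_id="MS-0001",
    answers=[
        StructuredAnswer(
            question="Have the authors clearly stated the limitations of their study/methods?",
            answer="no",
            detail="Loss to follow-up and the single-site design are not discussed.",
        ),
    ],
    comments_to_authors="A useful study overall, but the discussion section needs work.",
)
```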
The authors analyzed the peer review reports independently, assessed inter-rater agreement among peer reviewers of manuscripts, and compared the rate of agreement to the pre-pilot agreement rates at the journals. Almost all peer reviewers answered all the questions, and agreement between peer reviewers was higher than usual – though as it’s not a trial, we can’t know if that was an effect of the intervention.
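The paper’s analysis code isn’t reproduced here, but if you’re wondering what “agreement between peer reviewers” means in practice, here’s a minimal sketch with invented answers: per-question agreement between two reviewers of the same manuscripts, as raw percent agreement and as Cohen’s kappa (a chance-corrected measure). Whether the authors used kappa or another statistic, the idea is the same.

```python
# Illustrative sketch with made-up answers - not the study's data or code.
from sklearn.metrics import cohen_kappa_score

# Hypothetical yes/no answers to one structured question ("limitations clearly
# stated?") across ten manuscripts that each had two reviewers.
reviewer_a = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
reviewer_b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"]

# Raw percent agreement: how often the two reviewers gave the same answer.
percent_agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)

# Cohen's kappa adjusts that figure for the agreement expected by chance alone.
kappa = cohen_kappa_score(reviewer_a, reviewer_b)

print(f"Percent agreement: {percent_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```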
This study establishes the feasibility of adding structured questions to peer review, and raises a lot of questions. Here’s hoping some of those questions are answered in future research.
Mario Malički and Bahar Mehmani (2024). Structured peer review: pilot results from 23 Elsevier journals. [Discussion by Malički here; Elsevier’s description of their policy on this here.]
2. Peer reviewers are usually uncertain about their recommendations on questions like whether or not to accept a manuscript.
[Cross-sectional study]
I didn’t expect to be surprised by this paper, but the very high level of uncertainty these authors found sure explains a lot. As the authors conclude, “This uncertainty is part of the variability in peer reviewers’ recommendations.”
So how widespread was it? Only 23% of reviewers reported no uncertainty about any of their recommendations on issues such as whether minor or major revisions were required. They were responding to a post-peer review survey by Adrian Barnett and colleagues. Three journals participated: BMJ Open, Epidemiology, and F1000Research. The authors don’t know how many peer reviewers were invited to participate, so there’s no response rate. They had 389 responses to analyze, mostly from BMJ Open.
Peer reviewers were mostly at least fairly experienced researchers – only 12% had less than 5 years of experience. The median time they spent on their reviews was 3 hours. Respondents were asked to score their own level of certainty about each recommendation – from 100% certain down.
The authors wrote: “Within-reviewer uncertainty has likely impacted reviewers’ recommendations and editorial decisions and hence impacted researchers’ careers. It is part of the ‘luck of the draw’ in peer review.” They point out that proposals like randomizing editorial decisions and eLife’s decision not to have a binary accept/reject decision at all “capture some of the nuance and ambiguity in peer review.”
Adrian Barnett and colleagues (2024). Examining uncertainty in journal peer reviewers’ recommendations: a cross-sectional study.
3. Anonymizing discussion among peer reviewers might have some influence, though acceptance rates might not be affected.
[Randomized trial]
In some fields, there is a text forum for discussion among peer reviewers. The Conference on Uncertainty in Artificial Intelligence (UAI) is one of them – and it’s one of those conferences that peer reviews (and publishes) full articles. Peer reviewers can change their ratings of submissions after the discussion. One reviewer acts as a quasi-editor for each submission’s reviewer group, with another 3 to 4 reviewers assigned. Reviewers are expected to read each other’s initial reviews before discussion. In this trial, peer reviewers weren’t given the names of the submissions’ authors.
All reviewers were randomized to anonymous or non-anonymous discussion. There were 581, though 65 did not submit their reviews in time, leaving 263 in the anonymized group, and 253 in the non-anonymized group. Emergency reviewers were recruited to make up the numbers, ending up with 289 in each group. There was no data on what proportion of the non-anonymized guessed (or presumed) the identity of other peer reviewers.
The submission acceptance rate was similar in the two groups. There was a bit more discussion in the anonymized group, with a mean of 0.53 posts versus 0.46 in the named group, but there was no difference between the groups in the likelihood of not posting at all. Just over half the reviewers were senior researchers; they were more likely to participate in discussion than their junior peers, and anonymization did not affect this. Posts were given politeness scores, and these didn’t differ between the groups.
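To show what a comparison like this involves, here’s a toy sketch with simulated post counts for two arms of roughly the trial’s size – not the trial’s data or its actual statistical methods.

```python
# Toy sketch with simulated data - not the trial's data or analysis code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical posts-per-reviewer counts for two arms of roughly the trial's size.
anonymized = rng.poisson(lam=0.53, size=289)
named = rng.poisson(lam=0.46, size=289)

# Compare mean posts per reviewer and the share of reviewers who never posted.
print("Mean posts per reviewer:", anonymized.mean().round(2), named.mean().round(2))
print("Never posted:", (anonymized == 0).mean().round(2), (named == 0).mean().round(2))

# One reasonable way to test the difference in means (the trial's own
# statistical approach may well differ from this).
t, p = stats.ttest_ind(anonymized, named, equal_var=False)
print(f"Welch t-test: t = {t:.2f}, p = {p:.3f}")
```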
Charvi Rastogi and colleagues (2024). A randomized controlled trial on anonymizing reviewers to each other in peer review discussions.
4. People might be biased towards assuming longer peer reviews are better, even when extra length isn’t contributing to improved quality.
[Randomized trial]
This research was conducted by Alexander Goldberg and colleagues, with a major machine learning conference. They studied perceptions of quality of peer review reports in a variety of ways. The part of their report that interested me was their study of what they called “uselessly elongated review bias.” The authors were dubious about studies suggesting length of peer review was an indicator of quality. To test this, they prepared “elongated versions of reviews by adding substantial amounts of non-informative content,” more than doubling their length. They did this for one randomly-selected peer review for each of 10 submissions. Then they randomly allocated 458 reviewers to either the original or elongated version.
The result: “the uselessly elongated reviews receive higher scores than the original shorter reviews… Overall, the mean score for the long condition group was 4.29 compared to 3.73 for the short condition.”
I think there are a couple of take-aways here. One is that explanatory text may, in itself, be valued by readers of peer review reports. The other is that we should be careful about studies that rely on length of peer review as a measure of quality.
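A toy simulation makes that second point clearer: if readers award even a small score bonus for length regardless of content (which is what the trial’s result suggests), padding alone will generate a positive length–score correlation. All the numbers below are invented for illustration.

```python
# Toy simulation, with invented numbers: if raters give a small score bonus to
# longer reviews regardless of content, useless padding alone creates a
# positive length-score correlation - so length is a poor proxy for quality.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

informativeness = rng.normal(0, 1, n)                         # true usefulness of each review
length = 400 + 80 * informativeness + rng.normal(0, 50, n)    # word count, loosely tied to content

padded = rng.random(n) < 0.5        # half the reviews get non-informative padding
length = length + padded * 450      # padding roughly doubles those reviews

# Assumed rating model: driven by informativeness, plus a small bonus per word.
score = 3.7 + 0.5 * informativeness + 0.001 * length + rng.normal(0, 0.3, n)

print("Correlation of score with length:", float(np.corrcoef(length, score)[0, 1]))
print("Mean score, padded vs not:", float(score[padded].mean()), float(score[~padded].mean()))
```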
Alexander Goldberg and colleagues (2024). Peer review of peer reviews: A randomized controlled trial and other experiments.
5. A “rebuttal stage” now widely offered to authors after the first round of peer review in some fields is controversial. Peer reviewers’ first impressions could theoretically be hard to budge, but we don’t know if this is a big problem.
[Randomized trial in a hypothetical setting]
This is another study relevant to the type of peer review at conferences that publish accepted submissions. A part of these processes can be a step where authors have a chance to respond to the peer reviewers’ reports. This is called a “rebuttal stage,” and the authors of this trial report that though it has been widely adopted, not everyone is convinced it’s a useful addition to peer review. Apparently, social media is awash with authors saying rebuttal was a waste of time and didn’t change reviewers’ minds.
Ryan Liu and colleagues ran the trial to see if anchoring was a factor. They define this as a “bias where people who make an estimate by starting from an initial value and then adjusting it to yield their answer typically make insufficiently small adjustments,” so first impressions hold too much sway.
Liu et al tested for anchoring bias by creating a fake paper. A key part of the results was in an animated GIF. The experimental version had a frozen frame, so not all of the results could be seen – and the “hidden” portion considerably weakened the study’s results. The control version of the paper had the full results. They theorized that revealing the hidden portion later would mimic the rebuttal stage of a peer review process, because it adds information. It’s not a process I’m familiar with, so I don’t know if this is a good imitation.
Participants were randomized to rate either the experimental version with results partially hidden, or the control version with the full results. They were 104 PhD students, and the topic of the trial wasn’t revealed to them. The group who reviewed the experimental version were then told there had been a technical glitch with the GIF, shown a version with the full animation, and given the opportunity to revise their rating of the paper. Liu et al hypothesized that if the experimental paper attracted lower scores and anchoring was a factor, then there would be a difference between the groups in the ratings of the full paper.
The authors found that the group getting the experimental version did initially rate it lower than the group who saw the full results, so the conditions for anchoring were in place. If the bias did come into play, the experimental group’s ratings would still be lower than the control group’s once they had seen the full version. However, there wasn’t a difference in the final ratings between the groups. So Liu et al concluded that they had not demonstrated an anchoring effect.
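For anyone who finds that two-step logic hard to follow in prose, here’s a sketch of it with invented ratings: first a manipulation check (the weakened version should score lower), then the anchoring test (do the experimental group’s revised ratings stay below the control group’s?). The group sizes, means, and choice of test are all assumptions for illustration, not the trial’s.

```python
# Sketch of the two-step logic with invented ratings - not the trial's data,
# group sizes, or statistical methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

control_final = rng.normal(5.0, 1.0, 52)          # rated the full paper from the start
experimental_initial = rng.normal(4.2, 1.0, 52)   # rated the weakened, partial version
# Revised ratings after the "glitch" is fixed and the full results are shown.
experimental_final = experimental_initial + rng.normal(0.8, 0.5, 52)

# Step 1, manipulation check: the weakened version should score lower at first.
t1, p1 = stats.ttest_ind(experimental_initial, control_final, equal_var=False)
print(f"Initial gap: t = {t1:.2f}, p = {p1:.3f}")

# Step 2, anchoring test: if first impressions stick, the experimental group's
# final ratings stay below the control group's even though both have now seen
# the same full paper. "No difference" here is the no-anchoring-detected result.
t2, p2 = stats.ttest_ind(experimental_final, control_final, equal_var=False)
print(f"Final gap: t = {t2:.2f}, p = {p2:.3f}")
```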
Ryan Liu and colleagues (2024). Testing for reviewer anchoring in peer review: A randomized trial. [PMC version]
~~~~
You can keep up with my work via my free newsletter, Living With Evidence.

This is the 7th post in a series on peer review research, which started with a couple of catch-ups on peer review research milestones from 1945 to 2018:
All posts tagged “Peer Review”
Disclosures: I’ve had a variety of editorial roles at multiple journals over the years, including having been a member of the ethics committee of the BMJ, on the editorial board of PLOS Medicine for a time, and on PLOS ONE’s human ethics advisory group. I wrote a chapter of the second edition of the BMJ’s book, Peer Review in Health Sciences. I have done research on post-publication peer review, subsequent to a previous role as Editor-in-Chief of PubMed Commons (a discontinued post-publication commenting system for PubMed). Up to early 2025, I had been advising on some controversial issues for The Cochrane Library, a systematic review journal which I helped establish, and for which I was an editor for several years. This year, I peer reviewed several abstracts for the upcoming 2025 Peer Review Congress.
The cartoons are my own (CC BY-NC-ND license). (More cartoons at Statistically Funny.)