Good start on a hard question — how or whether to use #AI tools in #PeerReview.
https://www.researchsquare.com/article/rs-2587766/v1
"For the moment, we recommend that if #LLMs are used to write scholarly reviews, reviewers should disclose their use and accept full responsibility for their reports’ accuracy, tone, reasoning and originality."
PS: "For the moment" these tools can help reviewers string words together, not judge quality. We have good reasons to seek evaluative comments from human experts.
Update. I acknowledge that there's no bright line between using these tools to polish one's language and using them to shape one's judgments of quality. I also acknowledge that these tools are steadily getting better at "knowing the field". That's why this is a hard problem.
One way to ensure that reviewers take #responsibility for their judgments is #attribution.
Update. I'm pulling a few other comments into this thread, in preparation for extending it later.
1. I have mixed feelings on #attribution in peer review. I see the benefits, but I also see the benefits of #anonymity.
https://twitter.com/petersuber/status/1412455826397204487
2. It's easier for #AI to write good summaries than good reviews.
https://fediscience.org/@petersuber/109954904433171308
Update. I'm pulling in two of my Twitter threads on using #AI or #PredictionMarkets to estimate quality-surrogates (not quality itself). I should have kept them together in one thread, but it's too late now.
Update. I'm sure this has occurred to the #AI / #LLM tool builders. Determining whether an assertion is #true is a hard problem & we don't expect an adequate solution any time soon. But determining whether a #citation points to a real publication, and whether it's #relevant to the passage citing it, is comparatively easy. (Just comparatively.)
Some tools already cite sources. But when will tools promise that their citations are always real and relevant — and deliver on that promise?
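To make the "comparatively easy" claim concrete, here is a minimal sketch (my own illustration, not any existing tool's pipeline) of checking that a cited DOI resolves to a real record via the public Crossref API, plus a crude word-overlap heuristic as a stand-in for relevance. The passage is invented; the example DOI is the one behind the Nature piece cited later in this thread.

```python
import requests

def doi_exists(doi):
    """Return Crossref metadata if the DOI resolves to a real record, else None."""
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return r.json()["message"] if r.status_code == 200 else None

def relevance_score(passage, title):
    """Crude relevance proxy: fraction of substantive title words found in the citing passage."""
    passage_words = set(passage.lower().split())
    title_words = [w for w in title.lower().split() if len(w) > 3]
    return sum(w in passage_words for w in title_words) / max(len(title_words), 1)

doi_from_manuscript = "10.1038/d41586-023-03144-w"  # DOI of the Nature piece cited below in this thread
citing_passage = "Reviewers worry that manuscripts uploaded to chatbots may leak into training data."  # invented

record = doi_exists(doi_from_manuscript)
if record is None:
    print("Citation appears to be fabricated: DOI does not resolve.")
else:
    title = record["title"][0]
    print("Real publication:", title)
    print("Relevance to passage:", round(relevance_score(citing_passage, title), 2))
```

A real checker would use semantic similarity rather than word overlap, but even this crude version catches citations that point nowhere.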
Update. I've been playing with #Elicit, one of the new #AI #search engines. Apart from answering your questions in full sentences, it cites peer-reviewed sources. When you click on one, Elicit helps you evaluate it. Quoting from a real example:
"Can I trust this paper?
• No mention found of study type
• No mention found of funding source
• No mention found of participant count
• No mention found of multiple comparisons
• No mention found of intent to treat
• No mention found of preregistration"
Update. Found in the wild: A peer reviewer used #AI to write comments on a paper. The AI recommended that the author review certain readings, but 99% of the recommended works were fake.
https://www.linkedin.com/feed/update/urn:li:share:7046083155149103105/
Update. The US #NIH and Australian Research Council (#ARC) have banned the use of #AI tools for the #PeerReview of grant proposals. The #NSF is studying the question.
https://www.science.org/content/article/science-funding-agencies-say-no-using-ai-peer-review
(#paywalled)
Apart from quality, one concern is #confidentiality. If grant proposals become part of a tool's training data, there's no telling (in the NIH's words) “where data are being sent, saved, viewed, or used in the future.”
Update. If you *want* to use #AI for #PeerReview:
"Several publishers…have barred researchers from uploading manuscripts…[to] #AI platforms to produce #PeeReview reports, over fears that the work might be fed back into an #LLM’s training data set [&] breach contractual terms to keep work confidential…[But with] privately hosted [and #OpenSource] LLMs…one can be confident that data are not fed back to the firms that host LLMs in the cloud."
https://www.nature.com/articles/d41586-023-03144-w
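For a sense of what "privately hosted" can look like in practice, here is a minimal sketch under my own assumptions (the model name is just an example of an openly licensed model, and local hardware big enough to hold it is assumed): the weights are downloaded once and the confidential text is processed entirely on the reviewer's machine, so nothing is sent to a cloud provider.

```python
from transformers import pipeline

# Runs entirely on local hardware once the weights are downloaded;
# no manuscript text is sent to a cloud-hosted LLM provider.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

manuscript_excerpt = "..."  # placeholder: confidential text stays on this machine
prompt = ("You are assisting a peer reviewer. Point out unclear methods "
          "in the following excerpt:\n\n" + manuscript_excerpt)

print(generator(prompt, max_new_tokens=300)[0]["generated_text"])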
Update. "Avg scores from multiple ChatGPT-4 rounds seems more effective than individual scores…If my weakest articles are removed… correlation with avg scores…falls below statistical significance, suggesting that [it] struggles to make fine-grained evaluations…Overall, ChatGPT [should not] be trusted for…formal or informal research quality evaluation…This is the first pub'd attempt at post-publication expert review accuracy testing for ChatGPT."
https://arxiv.org/abs/2402.05519
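A toy illustration of the setup described above, with invented numbers and my own assumptions about the data layout: score each article in several independent ChatGPT rounds, average the rounds, and correlate the averages with expert ratings.

```python
import numpy as np
from scipy.stats import spearmanr

# Rows = articles, columns = independent ChatGPT scoring rounds (all numbers invented).
chatgpt_rounds = np.array([
    [4, 5, 4, 5],
    [5, 4, 5, 5],
    [7, 6, 7, 7],
    [8, 7, 8, 8],
    [7, 7, 6, 7],
    [7, 8, 7, 7],
])
expert_scores = np.array([2, 3, 8, 7, 9, 8])   # one human rating per article (invented)

avg_scores = chatgpt_rounds.mean(axis=1)        # averaging rounds smooths out noise
rho, p = spearmanr(avg_scores, expert_scores)
print(f"All articles: rho = {rho:.2f} (p = {p:.3f})")

# The paper's caveat: re-run the comparison with the weakest articles removed
# to see whether the tool can still discriminate among the stronger ones.
strong = expert_scores >= 7
rho2, p2 = spearmanr(avg_scores[strong], expert_scores[strong])
print(f"Strong articles only: rho = {rho2:.2f} (p = {p2:.3f})")
```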
Update. 𝘓𝘢𝘯𝘤𝘦𝘵 𝘐𝘯𝘧𝘦𝘤𝘵𝘪𝘰𝘶𝘴 𝘋𝘪𝘴𝘦𝘢𝘴𝘦𝘴 on why it does not permit #AI in #PeerReview:
https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(24)00160-9/fulltext
1. In an experimental peer review report, #ChatGPT "made up statistical feedback & non-existent references."
2. "Peer review is confidential, and privacy and proprietary rights cannot be guaranteed if reviewers upload parts of an article or their report to an #LLM."
Update. "Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these [#CS] conferences could have been substantially modified by #LLMs, i.e. beyond spell-checking or minor writing updates."
https://arxiv.org/abs/2403.07183
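That estimate comes from a corpus-level approach rather than from flagging individual reviews. A toy version, under heavy simplifications of my own (a four-word vocabulary and invented reference distributions): treat the pooled word counts as a mixture of a human-written and an LLM-written distribution, and grid-search the mixture weight by maximum likelihood.

```python
import numpy as np

def estimate_llm_fraction(counts, p_human, p_llm, grid=np.linspace(0, 1, 101)):
    """Grid-search maximum likelihood for alpha in
    q = (1 - alpha) * p_human + alpha * p_llm."""
    counts = np.asarray(counts, dtype=float)
    best_alpha, best_ll = 0.0, -np.inf
    for alpha in grid:
        q = (1 - alpha) * p_human + alpha * p_llm
        ll = np.sum(counts * np.log(q + 1e-12))   # multinomial log-likelihood (up to a constant)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Toy vocabulary of 4 marker words; distributions and counts are invented.
p_human = np.array([0.40, 0.30, 0.20, 0.10])
p_llm   = np.array([0.10, 0.20, 0.30, 0.40])   # LLMs are said to overuse certain words
observed = np.array([330, 270, 230, 170])      # word counts pooled over all submitted reviews

print(f"Estimated LLM-modified fraction: {estimate_llm_fraction(observed, p_human, p_llm):.2f}")
```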
Update. "We demonstrate how increased availability and access to #AI technologies through recent emergence of chatbots may be misused to write or conceal plagiarized peer-reviews."
https://link.springer.com/article/10.1007/s11192-024-04960-1
Update. "Researchers should not be using tools like #ChatGPT to automatically peer review papers, warned organizers of top #AI conferences and academic publishers…Some researchers, however, might argue that AI should automate peer reviews since it performs quite well and can make academics more productive."
https://www.semafor.com/article/05/08/2024/researchers-warned-against-using-ai-to-peer-review-academic-papers
Update. The @CenterforOpenScience (#COS) and partners are starting a new project in which researchers voluntarily submit papers to both human and #AI reviewers, and then give feedback on the reviews. The project is now calling for volunteers.
https://www.cos.io/smart-prototyping
Update. These researchers built an #AI system to predict #REF #assessment scores from a range of data points, including #citation rates. For individual works, the system was not very accurate. But for total institutional scores, it was 99.8% accurate. "Despite this, we are not recommending this solution because in our judgement, its benefits are marginally outweighed by the perverse incentive it would generate for institutions to overvalue journal impact factors."
https://blogs.lse.ac.uk/impactofsocialsciences/2023/01/16/can-artificial-intelligence-assess-the-quality-of-academic-journal-articles-in-the-next-ref/
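For a sense of how per-article noise can wash out at the institutional level, here is a stripped-down sketch on invented data (my reconstruction of the general idea, not the authors' model or feature set): fit a regressor from citation counts to REF-style scores, then compare correlations at the article and institution level.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_articles, n_institutions = 4000, 20

# Invented data: each institution has a latent strength that drives both
# citation counts and peer-assessed quality (REF-style scores on a 1-4 scale).
strength = rng.normal(0, 1, n_institutions)
inst = rng.integers(0, n_institutions, n_articles)
citations = rng.poisson(np.exp(2.0 + 0.4 * strength[inst]))
quality = np.clip(2.5 + 0.5 * strength[inst] + rng.normal(0, 0.8, n_articles), 1, 4)

X = citations.reshape(-1, 1).astype(float)
X_tr, X_te, y_tr, y_te, inst_tr, inst_te = train_test_split(
    X, quality, inst, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Per-article predictions are noisy, but errors largely cancel in institutional averages.
inst_true = np.array([y_te[inst_te == i].mean() for i in range(n_institutions)])
inst_pred = np.array([pred[inst_te == i].mean() for i in range(n_institutions)])
print("Article-level correlation:    ", round(np.corrcoef(y_te, pred)[0, 1], 3))
print("Institution-level correlation:", round(np.corrcoef(inst_true, inst_pred)[0, 1], 3))
```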
Update. This editorial sketches a fantasy of #AI-assisted #PeerReview, then argues that it's "not far-fetched".
https://www.nature.com/articles/s41551-024-01228-0
PS: I call it far-fetched. You?
Far-fetched, to say the least.
Reviewing a paper without reading it!?! I guess I can also submit a paper without writing it, just letting the AI agents do the job. And then there's no reason to review it. And why would I read a paper when I can get an AI-generated summary? At this point, why have journals publish papers that nobody writes, reviews, or reads?
And there we go, AI will have achieved one useful thing, getting rid of the broken publication system.
But wait, silly me.
The AI agent would only be available to the ultra-privileged; the other 99.99% of us would still need to do the actual science work.
@RonBeavis @lgatto @petersuber Fine-tuning is not as resource-intensive as training, so journals could generate their own specific models without involving external companies. I'm not saying that would be a good idea, just that it's technically very feasible (also, there are plenty of open-source models around).
@RonBeavis @nicolaromano @petersuber we'll pay for it with our APCs.
@nicolaromano @lgatto @petersuber But Nature is a private-sector publishing company. If they think they can get a substantial licensing fee for providing a corpus of reviews to third parties, they probably will do just that.