Six #ASAPbio fellows asked four #LLMs to describe the strengths and weaknesses of #preprints. Here are the results.
https://asapbio.org/interim-findings-from-an-investigation-into-llm-responses-about-preprints-a-2025-asapbio-fellows-project/
The same fellows asked the same LLMs to ingest six preprints and their #PeerReviewed counterparts, and compare them for quality and rigor. A good question, but they haven't yet analyzed the data; presumably they'll report soon.
PS: I'm interested in a related question. When LLMs answer research questions, do they treat on-topic preprints and on-topic postprints (peer-reviewed articles) as equivalent in weight or credibility? If not, how exactly do they take any differences into account?
How do we *want* #AI tools to treat #preprints?
(For a little background, see my post from late March.)
https://fediscience.org/@petersuber/116324036619224424
Here's an unusual new case study. Scientists created a fake disease ("bixonimania"), uploaded two fake research papers about it to preprint servers, and monitored AI tools to see whether they fell for it. Several of the major tools did fall for it, at first, even if they later expressed doubts.
https://www.nature.com/articles/d41586-026-01100-y
There were clues in the preprints that the research was fake. For example, the acknowledgments thanked a Starfleet Academy prof for her help and the Sideshow Bob Foundation for funding.
It's hard to avoid thinking that without those clues, humans might have fallen for those preprints too, at least at first. If we tested 100 human readers with different research backgrounds and purposes, the fall-for-it-at-first rate might be 20, 40, or 60 out of 100 rather than zero.
OK, AI tools don't get certain human jokes. That's shooting fish in a barrel. We still need to think about how AI tools ought to regard joke-free preprints.
More...
🧵
@petersuber
Isn't the big problem with the BigAI systems that they produce text whose form implies there has been thorough vetting, but they simply don't have algorithms to produce reliability scores in the first place? The most common symptom of this is the "hallucination" event, where the system extrapolates from its data without giving any warning that it is replying beyond the known edges of its knowledge.
Meanwhile, humans can indeed be fooled, and might occasionally bluff even in their area of expertise, but very rarely will an expert be both fooled by a paper and willing to bluff about its contents at the same time.