Just tried q.e.d. by @odedrechavi.bsky.social et al. on a few papers, including some by myself & others, where I knew a claim within was flawed due to a misunderstanding of the signal. 1) It was impressive. I see what the hype is about. 2) It hallucinated. www.qedscience.com Overly long #SciPub🧵 1/n

First, very impressed by q.e.d. just being able to synthesize scientific information so well from a file upload. It broke down the structure and highlighted genuinely key claims just as I'd expect a decent reviewer to, but in an infinitesimal fraction of the time. Wow. 2/n

Second, q.e.d. generally* highlighted gaps in claims well. Admittedly, in the examples I tried, it mostly just said "do even more work." The suggestions weren't wrong, but they weren't original either... It just highlighted what an expert likely already knew was a gap. 3/n

That said, q.e.d. sometimes gave a fundamentally flawed critique of why* something was a gap. For instance, one gap it identified was in a mutant study that lacked a genetic rescue experiment. Indeed, a gap. But it critiqued* the study for using multiple alleles/approaches and genetic backgrounds. 4.1/n

Use of multiple backgrounds & alleles strengthens* evidence, yet q.e.d. focused on how each independent tool could be flawed, calling this a 'major' gap. Four independent validations, but for q.e.d. only a fifth would truly* ensure confidence. The gap was real, but the assessment balance was whack... 4.2/n

In another study, I was a bit underwhelmed. q.e.d. mostly assessed the work by hunting for any and all avenues the study didn't take, systematically listing every possible experiment without reflecting much on how important each might be. Still a great tool, don't get me wrong. 5.1/n

But qed failed to find the actual critical flaw in that study*. It flagged the right figure! But not the flaw... The study argued that removing a pathogen factor abolishes immune detection. In reality, abolishing that factor made the pathogen avirulent, so there was simply less triggering of host immunity. 5.2/n

qed felt the gap was that the authors hadn't checked *even more* genes, to ensure the observation held for other genes controlled by the same immune pathways. Sure. But as above, that fails to identify the logical flaw. qed really likes to suggest "dig deeper into that rabbit hole." 5.3/n

Also, please don't get me wrong: human reviewers make the same mistakes! qed is hardly doing a bad job here. Its overall performance is excellent. It's doing a great job at being your average peer reviewer. The problem is: by definition, the "average" peer reviewer isn't a "great" one. 5.4/n

Separately, in another study, qed came up with all sorts of methodological suggestions for how to statistically test an assertion: that immune genes were lost in species with "sterile" ecological niches. But it never challenged the premise itself: are wild ecologies ever truly sterile? 5.5/n

Finally, qed hallucinated on me. It helps that this was my own paper, so my response was "I'm... pretty sure I never said that." I'll leave the explanation to the image Alt Text for those who are curious. The main thing is: we never said this. qed massively over-extrapolated from what we actually did. 6/n

But I want to stress most of all: qed is very impressive. Reviewers are overworked (direct.mit.edu/qss/article/...). qed is easily up to the task of providing article assessment, and does as good a job as, or better than, most people would at review, but in a minuscule fraction of the time. Wow. 7/n

Not only is qed already impressive, it's likely to get even better. Most of my gripes above could just as easily have happened with human reviewers. I myself have 'hallucinated' once in a peer review, critiquing the authors for something they never said. To err is as human as it is AI. 8/n

I can absolutely see qed being used in place of some stage of peer review in a future publishing system. But critically, its role would be to assist the expert editor*. Just as I looked over these claims, an editor could read the paper and then double-check potential gaps with qed. 9/n

In a future publishing system, qed + editor could plausibly replace part of the "reviewers + editor" model. Editors could still call in expert reviewers when they feel it's needed. But replacing "2 reviewers + 1 editor" with "1 reviewer + qed + 1 editor" would probably give similar results. 10/n

That said: qed is a tool. The user should not cede critical thinking to qed. It's worrying that qed reviews could inject hallucinations at scale... What's more, qed hallucinated misinformation off a supp figure. A lazy reviewer ignoring the supp would've been better... A cautionary tale. 11/n

@hansonmark.bsky.social Haven't tried qed, but that's my general feeling when asking an LLM to comment on any piece of work: an OK, average critique, but you need to work to make it good and really useful.

There is, however, a moral issue here. I'm happy to upload my own unpublished manuscript and accept any risk associated with that (e.g. data leakage); however, the authors of a paper I'm reviewing might not want that...

Also, while I do appreciate that "Our AI providers are contractually barred from training their own foundation models on your data," I can't see anywhere who these providers are. In general, AI companies don't have a good track record with regard to privacy matters. Also, how would anyone find out whether they instead use the data for training?
