Fantastic mechinterp paper showing that you can identify specific neurons that cause hallucinated answers in LLMs, and that these neurons are associated with the language model trying too hard to follow instructions. arxiv.org/pdf/2512.01797
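(To make the idea concrete, and not as a description of the paper's actual procedure: the simplest version of "find neurons associated with hallucination" is a contrastive activation comparison. The sketch below assumes you already have per-response mean activations for one MLP layer plus binary hallucination labels; the shapes and random data are purely illustrative.)

```python
# Minimal sketch (not the paper's method): rank neurons by how differently
# they activate on hallucinated vs. faithful responses.
import numpy as np

def rank_neurons_by_contrast(activations: np.ndarray, is_hallucinated: np.ndarray):
    """Return neuron indices sorted by the standardized activation gap between classes.

    activations: (n_responses, n_neurons) mean activation per response
    is_hallucinated: (n_responses,) binary labels, 1 = hallucinated
    """
    halluc = activations[is_hallucinated == 1]
    faithful = activations[is_hallucinated == 0]
    # Per-neuron effect size: difference of means scaled by a pooled std.
    gap = halluc.mean(axis=0) - faithful.mean(axis=0)
    pooled_std = np.sqrt((halluc.var(axis=0) + faithful.var(axis=0)) / 2) + 1e-8
    scores = np.abs(gap) / pooled_std
    return np.argsort(-scores), scores

# Toy usage with random data standing in for real activations:
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 4096))
labels = rng.integers(0, 2, size=200)
top_neurons, scores = rank_neurons_by_contrast(acts, labels)
print(top_neurons[:10])
```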


@jdp.extropian.net I personally am very sceptical that such a data-driven approach ("generating a balanced dataset of faithful (green check) and hallucinatory (red cross) responses using the TriviaQA benchmark") can even capture what model hallucinations really are. I'd think one first needs a taxonomy of hallucination types (an older one for text-to-image can be found in the limitations section here: arxiv.org/abs/2206.10789). But even calling weights "neurons" is not to my liking, and a red flag.
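(For context on the quoted setup: a "balanced dataset of faithful and hallucinatory responses" typically amounts to sampling model answers to TriviaQA questions, calling an answer faithful if it matches a gold alias and hallucinatory otherwise, then balancing the two classes. The sketch below shows that pipeline shape under those assumptions; it is not the paper's exact procedure, and the items are made up.)

```python
# Hedged sketch of the kind of labeling the quoted approach implies.
import random

def label_response(response: str, gold_aliases: list[str]) -> str:
    """Mark a response faithful if it contains any gold answer alias."""
    text = response.lower()
    return "faithful" if any(a.lower() in text for a in gold_aliases) else "hallucinatory"

def balance(examples: list[dict]) -> list[dict]:
    """Downsample the majority class so both labels are equally represented."""
    by_label = {"faithful": [], "hallucinatory": []}
    for ex in examples:
        by_label[ex["label"]].append(ex)
    n = min(len(v) for v in by_label.values())
    random.seed(0)
    return random.sample(by_label["faithful"], n) + random.sample(by_label["hallucinatory"], n)

# Toy usage with made-up (question, model answer, gold aliases) triples:
items = [
    {"question": "Capital of Australia?", "response": "Canberra.", "gold": ["Canberra"]},
    {"question": "Capital of Australia?", "response": "Sydney, of course.", "gold": ["Canberra"]},
]
for it in items:
    it["label"] = label_response(it["response"], it["gold"])
balanced = balance(items)
print([(it["response"], it["label"]) for it in balanced])
```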
