@jdp.extropian.net I'm personally very sceptical that such a data-driven approach ("generating a balanced dataset of faithful (green check) and hallucinatory (red cross) responses using the TriviaQA benchmark") can even capture what model hallucinations really are. I'd think one first needs a taxonomy of hallucination types (an oldish one for text-to-image is in the limitations section here: https://arxiv.org/abs/2206.10789). And calling weights "neurons" is not to my liking either; to me that's a red flag.