Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.
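To make the setup concrete: GSM-Symbolic turns a grade-school word problem into a template and generates many instantiations that differ only in their numerical values, so the required reasoning stays identical across variants. Here is a minimal sketch of that idea (hypothetical code for illustration, not the paper's actual generator; the template, names, and number ranges are made up):

```python
import random

# Hypothetical illustration of the template-and-instantiate idea:
# keep the reasoning fixed, vary only the numbers, and then check
# whether a model's accuracy stays stable across the variants.

TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "She gives {c} apples to a friend. How many apples does she have left?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Sample one concrete question and compute its ground-truth answer."""
    rng = random.Random(seed)
    a, b = rng.randint(5, 50), rng.randint(5, 50)
    c = rng.randint(1, a + b)  # keep the answer non-negative
    question = TEMPLATE.format(name="Sara", a=a, b=b, c=c)
    return question, a + b - c

if __name__ == "__main__":
    for seed in range(3):
        question, answer = instantiate(seed)
        print(question, "->", answer)
```

Since every instantiation needs the same reasoning steps, a model that had truly learned the procedure should score the same on all of them; the paper's point is that accuracy instead drops when only the numbers change.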

Thanks to the folks at Apple for experimentally proving what I’ve been repeating for a decade or so: #AI without formal inference is a statistical parrot and a glorified pattern matcher.

Throwing more neurons and layers at the problem may just make it harder to spot this obvious fallacy. There’s no magical “emergence” of reasoning from statistics.

A technology that needs to be shown thousands of pictures of cats, appropriately normalized and labeled, in order to recognize a cat, while a child needs to see only a couple of cats to robustly assimilate the pattern, is not a technology that has actually "learned" anything - let alone a technology that manifests any sort of intelligence.

That’s not to say that we should throw neural networks away, but living beings don’t learn only by statistical exposure to examples.

https://arxiv.org/pdf/2410.05229


@fabio

If you are looking for a job, my company is hiring. We already have a few LLM experts. We are a commercial, open-source, AI/ML-oriented company. Hit me up if you're interested, or if you know anyone, feel free to send them my way.

Excellent work on your findings/paper too.
