New Google model. Even better than everything. Completely failed a simple Haskell task by hallucinating a ton of crap
@dpwiz Pure anecdote, but I've never seen an "AI" system that got anywhere near a working solution to this exercise: "Write a Haskell function that, given a string in HTML5 format starting with a script element, finds the location of the corresponding </script> end tag."
@barubary Ugh... Indeed. Cutting corners left and right
@dpwiz Sounds great, but https://github.com/openlifescience-ai/Open-Medical-Reasoning-Tasks is higher on my list.
I'm done looking at the bullshit evals that give high scores to toddler-level systems or woo the audience with cookie-cutter skills.
Who wants to join forces and make a new #Haskell / #Rust / #Prolog (?) benchmark so we can cyberbully the new models and "AI coders" as they come out?