New Google model. Even better than everything. Completely failed at a simple Haskell task by hallucinating a ton of crap :blobfacepalm:

I'm done looking at the bullshit evals that give high scores to toddler-level systems or woo the audience with cookie-cutter skills.

Who wants to join forces and make a new / / (?) benchmark so we can cyberbully the new models and "AI coders" as they come out?

@dpwiz Pure anecdote, but I've never seen an "AI" system that got anywhere near a working solution to this exercise: "Write a Haskell function that, given a string in HTML5 format starting with a script element, finds the location of the corresponding </script> end tag."

@barubary Ugh... Indeed. Cutting corners left and right :blobcatunamused:

@barubary @dpwiz Surprisingly, Qwen2.5-Coder-32B-Instruct-Q3_K_L failed miserably too, despite smartassery regarding different HTML entities.

Qoto Mastodon
