New Google model. Even better than everything. Completely failed a simple Haskell task by hallucinating a ton of crap
@dpwiz Pure anecdote, but I've never seen an "AI" system that got anywhere near a working solution to this exercise: "Write a Haskell function that, given a string in HTML5 format starting with a script element, finds the location of the corresponding </script> end tag."
@barubary Ugh... Indeed. Cutting corners left and right
@dpwiz Sounds great, but https://github.com/openlifescience-ai/Open-Medical-Reasoning-Tasks is higher on my list.
I'm done looking at the bullshit evals that give high scores to toddler-level systems or woo the audience with cookie-cutter skills.
Who wants to join forces and make a new #Haskell / #Rust / #Prolog (?) benchmark so we can cyberbully the new models and "AI coders" as they come out?