I'm done looking at bullshit evals that give high scores to toddler-level systems or wow the audience with cookie-cutter skills.
Who wants to join forces and make a new #Haskell / #Rust / #Prolog (?) benchmark so we can cyberbully the new models and "AI coders" as they come out?