New Google model. Even better than everything. Completely failed at a simple Haskell task by hallucinating a ton of crap :blobfacepalm:

I'm done looking at the bullshit evals that give high scores to toddler-level systems or woo the audience with cookie-cutter skills.

Who wants to join forces and make a new / / (?) benchmark so we can cyberbully the new models and "AI coders" as they come out?
