I'm done looking at bullshit evals that give high scores to toddler-level systems or wow the audience with cookie-cutter skills.
Who wants to join forces and make a new #Haskell / #Rust / #Prolog (?) benchmark so we can cyberbully the new models and "AI coders" as they come out?