Paul Ganssle: "@glyph@mastodon.social @nedbat@hachyderm.io @bria…" - Qoto Mastodon

Dec 27, 2025, 18:50

Brian Okken @brianokken@fosstodon.org

I’ve seen reference to “eval” w.r.t. AI. What does it mean?

Dec 27, 2025, 19:02

Ned Batchelder @nedbat@hachyderm.io

@brianokken Basically, a test for whether the AI is good at a task or not.

Dec 27, 2025, 19:10

Ned Batchelder @nedbat@hachyderm.io

@brianokken Except it's not deterministic in the sense of unit tests. It's a measure of how well it does.

Dec 27, 2025, 19:16

Glyph @glyph@mastodon.social

@nedbat @brianokken is the word “benchmark” interchangeable here, or is there some subtle difference which prompted the coining of the new term?

Paul Ganssle @pganssle@qoto.org

@glyph @nedbat @brianokken I am not sure but they feel slightly different to me. I think of a benchmark as a fixed task that you do to track progress. Evals could be stuff like having humans rate 1000 examples of output, or using a specially trained evaluator model to rate a bunch of outputs. Probably a wide enough definition of either term allows one to encompass the other, but they probably have different centers of mass.

Dec 27, 2025, 21:22 · Tusky
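To make the distinction in the thread concrete, here is a minimal sketch of an eval as described above: a set of outputs is rated and the result is an aggregate score, not a pass/fail assertion. Everything here is an assumption for illustration: `run_eval`, `toy_model`, and `toy_grader` are hypothetical names, the "model" is a stub, and the grader stands in for human raters or a trained evaluator model.

```python
from statistics import mean
from typing import Callable, Sequence


def run_eval(
    model: Callable[[str], str],
    grader: Callable[[str, str], float],
    prompts: Sequence[str],
) -> float:
    """Return the mean grader score (0.0 to 1.0) over all prompts."""
    scores = [grader(prompt, model(prompt)) for prompt in prompts]
    return mean(scores)


# Hypothetical stand-in for a real model: it just upper-cases the prompt.
def toy_model(prompt: str) -> str:
    return prompt.upper()


# Hypothetical stand-in for a human rater or evaluator model:
# scores 1.0 if the output is upper-case text, otherwise 0.0.
def toy_grader(prompt: str, output: str) -> float:
    return 1.0 if output.isupper() else 0.0


if __name__ == "__main__":
    prompts = ["hello", "world", "123", "ok"]
    score = run_eval(toy_model, toy_grader, prompts)
    # A unit test would assert on each case; an eval reports how well it did overall.
    print(f"score: {score:.2f}")
```

A benchmark in the sense used above would fix `prompts` and `grader` over time so that scores stay comparable between runs; an eval in the looser sense could swap in any rating process.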
