I’ve seen reference to “eval” w.r.t. AI. What does it mean?

@brianokken Basically, a test for whether the AI is good at a task or not.

@brianokken Except it's not deterministic in the way a unit test is. An eval is a measure of how well the model does, not a pass/fail check.

@nedbat @brianokken is the word “benchmark” interchangeable here, or is there some subtle difference which prompted the coining of the new term?

@glyph @nedbat @brianokken I am not sure but they feel slightly different to me. I think of a benchmark as a fixed task that you do to track progress. Evals could be stuff like having humans rate 1000 examples of output, or using a specially trained evaluator model to rate a bunch of outputs. Probably a wide enough definition of either term allows one to encompass the other, but they probably have different centers of mass.
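
To make the "rate a bunch of outputs" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `run_eval` and `toy_rater` are hypothetical names, and the keyword check stands in for what would really be a human rater or an evaluator model.

```python
from statistics import mean
from typing import Callable


def run_eval(outputs: list[str], rate: Callable[[str], float]) -> float:
    """Return the mean rating (0.0 to 1.0) across all outputs."""
    return mean(rate(output) for output in outputs)


def toy_rater(output: str) -> float:
    # Hypothetical stand-in for a human rater or an evaluator model:
    # gives full credit if the answer mentions Paris, none otherwise.
    return 1.0 if "paris" in output.lower() else 0.0


# Made-up model outputs for the prompt "What is the capital of France?"
outputs = [
    "The capital of France is Paris.",
    "paris",
    "I think it's Lyon.",
    "Paris, France",
]

score = run_eval(outputs, toy_rater)
print(f"eval score: {score:.2f}")
```

The result is a score rather than a pass/fail verdict, which is the contrast with unit tests: you would typically watch whether the number goes up or down between model versions instead of asserting an exact value.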
