@brianokken Basically, a test for whether the AI is good at a task or not.
@brianokken Except it's not deterministic the way a unit test is. Instead of pass/fail, it's a measure of how well the AI does.
@glyph @nedbat @brianokken I'm not sure, but they feel slightly different to me. I think of a benchmark as a fixed task that you run repeatedly to track progress. Evals could be stuff like having humans rate 1000 examples of output, or using a specially trained evaluator model to rate a bunch of outputs. A wide enough definition of either term could encompass the other, but they probably have different centers of mass.
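
To make the contrast with unit tests concrete, here's a minimal sketch of an eval as a score instead of a pass/fail assertion. Everything in it is made up for illustration: `toy_model` is a hypothetical stand-in for an AI system (it adds two numbers and is wrong about 20% of the time), and `rate` is a trivial exact-match rater standing in for a human or an evaluator model.

```python
import random
from statistics import mean


def toy_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an AI system; not a real model API."""
    a, b = map(int, prompt.split("+"))
    # Simulate non-deterministic output: about 20% of answers are off by one.
    error = 1 if rng.random() < 0.2 else 0
    return str(a + b + error)


def rate(prompt: str, output: str) -> float:
    """Hypothetical rater: 1.0 for a correct sum, 0.0 otherwise."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if output == str(a + b) else 0.0


def run_eval(prompts: list[str], seed: int = 0) -> float:
    """Rate every output and return the average rating."""
    rng = random.Random(seed)
    return mean(rate(p, toy_model(p, rng)) for p in prompts)


# A fixed prompt set, in the spirit of "rate 1000 examples of output".
prompts = [f"{i}+{i + 1}" for i in range(1000)]
score = run_eval(prompts)
print(f"eval score: {score:.1%}")
```

A unit test would assert that each individual answer is right and fail on the first wrong one. Here the result is a fraction that you compare across model versions. Keeping `prompts` and `seed` fixed between runs is what would push this toward the "benchmark" end of things.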