A Theory of Dynamic Benchmarks

Dynamic benchmarks interweave model fitting and data collection in an attempt
to mitigate the limitations of static benchmarks. In contrast to an extensive
theoretical and empirical study of the static setting, the dynamic counterpart
lags behind due to limited empirical studies and no apparent theoretical
foundation to date. Responding to this deficit, we initiate a theoretical study
of dynamic benchmarking. We examine two realizations, one capturing current
practice and the other modeling more complex settings. In the first model,
where data collection and model fitting alternate sequentially, we prove that
model performance improves initially but can stall after only three rounds.
Label noise arising from, for instance, annotator disagreement leads to even
stronger negative results. Our second model generalizes the first to the case
where data collection and model fitting have a hierarchical dependency
structure. We show that this design guarantees strictly more progress than the
first, albeit at a significant increase in complexity. We support our
theoretical analysis by simulating dynamic benchmarks on two popular datasets.
These results illuminate the benefits and practical limitations of dynamic
benchmarking, providing both a theoretical foundation and a causal explanation
for observed bottlenecks in empirical work.
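The first model described above, in which data collection and model fitting alternate sequentially, can be illustrated with a small simulation. The sketch below is a hedged Python rendering under assumptions of our own: a synthetic binary task, an adversarial collection rule that gathers examples the current model misclassifies, and scikit-learn's LogisticRegression as the model class. Names such as collect_round are illustrative, not the paper's construction; the point is only to watch round-by-round test accuracy and see where it plateaus.

```python
# Minimal sketch of the sequential setting (first model), under assumed details:
# each round collects new labeled examples the current model gets wrong, then
# refits on all data gathered so far. Not the paper's exact construction.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary task standing in for a real benchmark dataset.
X = rng.normal(size=(5000, 20))
y = (X @ rng.normal(size=20) + 0.5 * rng.normal(size=5000) > 0).astype(int)
pool_X, test_X, pool_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

def collect_round(model, n=200):
    """Sample points the current model misclassifies; fall back to random draws if none remain."""
    wrong = np.flatnonzero(model.predict(pool_X) != pool_y)
    source = wrong if len(wrong) > 0 else np.arange(len(pool_X))
    picked = rng.choice(source, size=min(n, len(source)), replace=False)
    return pool_X[picked], pool_y[picked]

# Round 0: seed data collected without reference to any model.
seed = rng.choice(len(pool_X), size=200, replace=False)
train_X, train_y = pool_X[seed], pool_y[seed]
model = LogisticRegression(max_iter=1000).fit(train_X, train_y)
print("round 0 test accuracy:", round(model.score(test_X, test_y), 3))

# Later rounds alternate data collection and model fitting; a flattening
# accuracy curve mirrors the stalling behaviour the abstract refers to.
for t in range(1, 6):
    new_X, new_y = collect_round(model)
    train_X = np.vstack([train_X, new_X])
    train_y = np.concatenate([train_y, new_y])
    model = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    print(f"round {t} test accuracy:", round(model.score(test_X, test_y), 3))
```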