21  On Benchmarks

Eva Vivalt, University of Toronto

Abstract: As AI lowers the cost of constructing measures from unstructured data, demand for benchmarks in quantitative social science will grow. This chapter argues that benchmarks are not neutral: their absence, presence, and uneven distribution each shape researcher incentives, inviting overstated generality, optimization to the benchmark, selective validation reporting, and migration to under-benchmarked domains. It closes with norms for benchmark governance, including scope conditions, maintenance, and disclosure of which benchmarks were considered and used.

AI usage statement: AI was used to hone the writing of this contribution.

21.1 Introduction

Recent advances in AI will increase demand for benchmarks in quantitative social science. This is partly because AI lowers the cost of constructing measures from unstructured data, increasing the need for common ways to evaluate them. But benchmarks should not be treated as neutral and inert. Once benchmarks become important, they change researcher incentives. They affect which projects are pursued, which claims are made, and what evidence is reported. In other words, once benchmarks affect publication and credibility, they become objects of strategic behavior.

The relevant question is therefore not only how to build good benchmarks, but how researchers respond to the existence, absence, and uneven distribution of benchmarks.

The next section discusses the problems that arise when benchmarks are absent, when they exist, and when they exist only unevenly across domains.

21.2 Effects of Benchmark Availability

21.2.1 Incentives When Benchmarks Are Absent

Where no benchmark exists, researchers retain substantial discretion over validation. They can choose local validation samples, define constructs narrowly or broadly, omit failed attempts, and avoid comparisons to alternative methods.

This is especially problematic in social science because construct validity is often domain-specific. A measure that works in one corpus, time period, or location may not work elsewhere. In physics or some natural-science settings, the relevant context can often be controlled or standardized more easily. In social science, context is effectively part of the object being studied.

The absence of benchmarks therefore creates incentives to overstate the generality of a measure. Researchers can validate a task in a narrow domain and then make a broader empirical claim than is justified; without external benchmarking, the gap between validation and claim may go undetected.

21.2.2 Incentives When Benchmarks Exist

Benchmarks improve some incentives while creating or worsening others.

They improve incentives by making validation more comparable and by increasing the expected cost of weak measurement. A paper that performs poorly on a relevant benchmark is harder to defend.

But benchmarks also create incentives to optimize for the benchmark rather than the underlying measurement problem. Researchers may tune prompts, models, thresholds, and post-processing rules to the benchmark. They may also choose projects where benchmark performance is likely to look good, rather than projects where measurement matters most.

At times, a benchmark may become “saturated” and stop meaningfully distinguishing between alternative methods. Even then, researchers whose protocols score slightly higher simply by chance may overstate their performance.
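A minimal simulation can illustrate this selection-on-noise effect. The setup below is purely hypothetical: twenty equally accurate methods are scored on a finite 500-item benchmark, and the best observed score is reported. Because the winner is selected partly on noise, the top score systematically exceeds the shared true accuracy.

```python
import random

random.seed(0)

TRUE_ACCURACY = 0.90   # every method is equally good by construction
N_ITEMS = 500          # benchmark size (finite, so scores are noisy)
N_METHODS = 20         # number of competing protocols

def observed_score(true_acc: float, n_items: int) -> float:
    """Score one method: each benchmark item is answered
    correctly with probability true_acc, independently."""
    correct = sum(random.random() < true_acc for _ in range(n_items))
    return correct / n_items

scores = [observed_score(TRUE_ACCURACY, N_ITEMS) for _ in range(N_METHODS)]
best = max(scores)

# Reporting only the top-scoring protocol overstates performance:
# the maximum of many noisy scores is biased upward relative to
# the (identical) true accuracy of every method.
print(f"true accuracy:       {TRUE_ACCURACY:.3f}")
print(f"mean observed score: {sum(scores) / len(scores):.3f}")
print(f"best observed score: {best:.3f}")
```

The gap between the best observed score and the true accuracy is pure selection on chance; it grows with the number of competing methods and shrinks with benchmark size.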

21.2.3 Incentives When Benchmarks Are Uneven

Perhaps the most interesting case is partial benchmarking: some areas will have strong benchmarks, while others will not.

This creates several predictable responses:

  • Benchmark avoidance: relevant benchmarks exist but are not used;
  • Selective validation reporting: multiple benchmarks or validation samples are tried, but only favorable results are reported;
  • Strategic task definition: the task is described narrowly enough to avoid inconvenient benchmarks, while the interpretation remains broad;
  • Migration to under-benchmarked domains: researchers move toward tasks where validation standards are weaker.

The last of these highlights an interesting tension. Research rewards exploring novel questions, and researchers typically benefit from being the “first” to publish on a topic. Follow-up work is very important, since individual studies may be incorrect or fail to generalize, but under standard models the first paper on a topic arguably carries the highest value of information. Yet benchmarks applicable to truly novel research are less likely to exist. The growing importance of benchmarks could therefore strengthen the incentive for researchers to differentiate their work from the existing literature: a double-edged sword.

21.3 Benchmark Governance

Given both the necessity of benchmarks and the challenges that can arise with them, it is worth considering whether any norms could improve them.

First, a benchmark should have scope conditions. It should specify the construct, corpus, population, time period, and other relevant variables so that others can assess whether the benchmark is useful for a given purpose. Assessing the relevance of potential benchmarks may itself prove arduous, and tools will need to be developed to help with this.

Benchmarks also need maintenance. They should be versioned, refreshed, and in some cases retired. Public test sets may need hidden or delayed components. Saturated benchmarks may still serve as minimum standards, but they should not be treated as strong evidence of frontier performance.

Researchers working on topics where benchmarks may be relevant should report which benchmarks were considered, which were used, and why others were not. They should also report all validation exercises that materially affected the final method. Researchers may argue that a given benchmark is not relevant, but assessing such claims will take effort; AI tools could again be valuable in evaluating them.

The profession has developed norms around preregistration, replication packages, and data sharing because researcher discretion affects published results. Benchmarks require similar attention.

21.4 Conclusion

Benchmarks are often proposed as a tool to improve research credibility and transparency. While benchmarks can clearly help, they could themselves shape the production of research.

A useful benchmarking regime should therefore do more than score models. It should make benchmark avoidance visible, reduce selective validation reporting, clarify the scope of measurement claims, and reward the production and maintenance of evaluation infrastructure. All else equal, we should anticipate migration to under-benchmarked areas and other strategic responses.