19  Optimal Editorial Screening Under Cheap Testing

Gabriel Sekeres, Department of Economics, Cornell University

Abstract: As empirical testing becomes cheaper, journals face not only a multiple-testing problem but an equilibrium problem. Favorable robustness checks may be selected from a much larger, partly unobserved set of reasonable analyses, and the set of analyses itself may have been chosen because it was favorable. This note isolates one piece of that problem. A researcher is assigned a fixed hypothesis and a fixed universe of reasonable tests, can inspect many tests, and selectively submits favorable results; an editor has a Bayesian publication threshold and may commit to random audits of the ex ante test universe. Conditional on this commitment environment, fixed author-selected standards become uninformative as search grows, while random audits restore likelihood-ratio evidence and are evaluated by a posterior threshold. The limitation is central rather than incidental: the model does not characterize the full equilibrium in which researchers choose hypotheses, define robustness surfaces, and anticipate editorial rules. The contribution of the note is mainly diagnostic: it separates selected evidence from audit evidence, and it motivates a more general \(K\)-level equilibrium statement in which each additional strategic layer changes the object the editor must screen.

AI usage statement: AI was used for brainstorming and editing.

19.1 Introduction

Empirical science is increasingly cheap to search. In observational social science, a single research question can be paired with many defensible choices of sample, outcome, controls, estimator, transformation, clustering rule, and inference procedure. Large language models and agentic coding workflows reduce the cost of enumerating and executing these choices, and they also reduce the cost of anticipating how a referee or editor might respond to them. The consequence is not simply that researchers can run more tests. It is that a published robustness table increasingly reflects a strategic selection process: the evidence is informative only after one understands the menu from which it was selected and the incentives that shaped that menu.

We argued in Fishman and Sekeres (2026) that this changes the publication game. When the marginal cost of testing falls, a fixed menu of robustness checks can no longer carry the same evidentiary content. The formal result there is best understood as a commitment-equilibrium result. If the journal can commit in advance to a disclosure or screening rule, and if the relevant hypothesis and test space are sufficiently disciplined, then the equilibrium response is to make evidentiary requirements scale with search capacity. That is an important benchmark because journals and registries can sometimes create commitment devices. But it is not yet a full equilibrium theory of cheap science. It does not solve the game in which authors choose the hypothesis, shape the specification surface, and select claims partly in response to the announced screening technology.

This note takes that limitation as its starting point. It asks what an optimal editor should do in a deliberately stripped-down version of the problem, while keeping clear which strategic layer has been fixed. A researcher is assigned a hypothesis with prior probability \(\pi\) of being true. There is a finite set \(\mathcal{A}\) of reasonable tests. Each test has a known false-positive probability and power. The researcher can draw tests at cost \(c\) and submit a subset. The editor has a Bayesian publication threshold: publish if the posterior probability that the hypothesis is true exceeds \(\tau\). The narrow question is how the optimal editorial policy changes when \(c\) falls, conditional on the hypothesis and the reasonable-test universe being well defined before the relevant results are observed.

The first answer is negative. If the editor requires a fixed number of favorable tests, screening collapses as search capacity grows. A false hypothesis eventually generates the required number of favorable tests with high probability. The submitted evidence may look exactly like a conventional robustness table, but the selection-corrected posterior converges back to the prior because the unreported denominator has grown. The remedy is the one suggested by the commitment benchmark in Fishman and Sekeres (2026): disclosure requirements must scale with search capacity.

The second answer is constructive but more limited. If the editor can request tests at random from the ex ante reasonable set, the evidentiary content changes. A test drawn by the author after search is selected evidence. A test drawn by the editor from a fixed universe is audit evidence. In the benchmark model, the optimal audit rule is a likelihood-ratio threshold on the number of favorable audited tests. One random audit supplies one honest likelihood-ratio increment. A fixed finite audit supplies bounded information. A growing audit supplies exponential separation between true and false hypotheses. Requesting all tests is the limiting case: the editor sees the full specification surface.

These results should not be read as a complete equilibrium characterization. They are better read as a sequence of finite-depth strategic arguments. At the first level, the editor corrects for author selection within a fixed hypothesis and a fixed test universe. At the second level, the editor commits to an audit distribution and evaluates the resulting unselected evidence. At the third level, the author may move upstream and select the hypothesis or the specification surface itself. Each level changes the conditioning event in the editor’s posterior. The point of the simple model is to make those changes explicit and to show where the commitment benchmark stops. These results also rely heavily on the assumption that the space of reasonable tests is finite, which need not hold in practice. In particular, if the space of reasonable tests grows as the cost of testing falls, then screening can fail even without author selection of a specification surface.

This paper is connected to several literatures. First, it belongs to work on publication bias and the positive predictive value of published research, from Ioannidis (2005) through modern treatments of selective publication in economics (Andrews and Kasy 2019; Kasy 2021; Frankel and Kasy 2022). Empirical evidence on selective publication and \(p\)-hacking in the social sciences includes Franco, Malhotra, and Simonovits (2014), Brodeur et al. (2016), Brodeur, Cook, and Heyes (2020), and Brodeur, Cook, and Neisser (2024). Second, it formalizes the concern that researcher degrees of freedom, specification search, and the garden of forking paths can generate false discoveries (Leamer 1983; Simmons, Nelson, and Simonsohn 2011; Gelman and Loken 2013, 2014). Statistical responses to related search problems include false-discovery-rate control and data-snooping corrections (McCloskey and Michaillat 2024), as well as pre-analysis plans (Olken 2015; Kasy and Spiess 2025). Third, it relates to multiverse and specification-curve methods, which seek to make the space of reasonable analyses visible rather than focusing on one preferred path (Steegen et al. 2016; Simonsohn, Simmons, and Nelson 2020; Breznau et al. 2022). Fourth, it is close to strategic evidence and disclosure models in which an evaluator must interpret selected data, building on disclosure and persuasion foundations (Grossman 1981; Milgrom 1981; Kamenica and Gentzkow 2011) and more recent models of endogenous information acquisition and selective evidence (Di Tillio, Ottaviani, and Sørensen 2021; Herresthal 2022; Henry 2009; Henry and Ottaviani 2019). The posterior-threshold formulation also follows the decision-theoretic view of statistical testing (Wald 1950; Tetenov 2016). Finally, it speaks to recent attempts to formalize robustness checks. Lu and White (2014) propose robustness tests in applied economics, and Prallon (2026) introduces a robustness radius that measures how far robustness-check estimands are from a main specification while accounting for uncertainty and covariance across regressions.

The contribution here is therefore intentionally narrow. I do not model estimands directly, nor do I solve the fully endogenous game in which the hypothesis, the estimand, and the set of reasonable tests are all chosen strategically. I treat each reasonable analysis as producing a binary favorable-or-unfavorable signal and ask what an editor can infer when those signals may have been selected. The binary model is small by design. Its purpose is to separate three objects that are often conflated in editorial practice: selected robustness checks, random editorial audits, and the ex ante construction of the reasonable-test universe.

19.2 A screening model

The model begins after the hardest institutional object has been fixed. There is already a hypothesis, already a set of analyses that count as reasonable for that hypothesis, and already a prior over whether the hypothesis is true. This is precisely the commitment environment in which an editor can ask a well-defined screening question. It is not yet the full game in which the author chooses the hypothesis or helps define the surface. The advantage of starting here is that the evidentiary effect of search can be isolated without also modeling the origin of the claim.

There is a binary state \(H\in\{0,1\}\). The state \(H=1\) means that the assigned hypothesis is true, significant in the claimed direction, or otherwise publication-worthy. The prior is \[ \mathbb{P}(H=1)=\pi\in(0,1) \] The editor chooses whether to publish. Publishing a true hypothesis gives payoff \(1\), publishing a false hypothesis gives payoff \(-\ell\), and rejecting gives payoff \(0\). Therefore the Bayes rule is to publish if and only if the posterior probability of \(H=1\) is at least \[ \tau=\frac{\ell}{1+\ell} \] Equivalently, the editor publishes when posterior log odds exceed \(\operatorname{logit}(\tau)\).
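For instance, with an illustrative loss of \(\ell=4\) for publishing a false hypothesis (a value assumed purely for illustration), the threshold is \(\tau=4/5\), so the editor publishes only when posterior log odds exceed \(\operatorname{logit}(0.8)=\log 4\).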

There is a finite set of reasonable tests \[ \mathcal{A}_M=\{1,\ldots,M\} \] In the symmetric benchmark, each test \(a\in\mathcal{A}_M\) produces a binary result \(X_a\in\{0,1\}\), where \(X_a=1\) means favorable to the hypothesis. Conditional on the state, tests are independent1 and \[ \mathbb{P}(X_a=1\mid H=0)=\alpha\qquad \mathbb{P}(X_a=1\mid H=1)=\beta \] where \(0<\alpha<\beta<1\). Thus \(\alpha\) is the false-positive probability of a single reasonable test and \(\beta\) is its power. Dependence, heterogeneous tests, and continuous evidence are discussed below.

The researcher wants publication and can inspect tests at cost \(c\) per test. Let \(n(c,M)\leq M\) denote the number of tests the researcher can inspect. One can derive \(n(c,M)\) from an optimal stopping problem, but the comparative statics here only require that \(n(c,M)\) weakly increases as \(c\) falls and tends to \(M\) when the finite universe becomes cheap to exhaust. When \(M\) itself grows with the research technology, cheap science means \(n(c,M)\to\infty\).

The researcher observes the outcomes of inspected tests and may submit any subset. The editor understands this selection problem. A submitted favorable test is therefore not a random draw from the test universe; it is evidence that has survived search. The next section asks what this does to the posterior when the journal continues to use a fixed author-selected evidence standard.

19.3 Selected evidence under falling test costs

The first strategic layer is the familiar robustness-table problem. The editor sees favorable checks, but those checks were chosen by the researcher after some amount of search. A fixed standard can be persuasive only if the event “the author found enough favorable checks” remains more likely under a true hypothesis than under a false one. Cheap testing erodes exactly that likelihood-ratio comparison.

Suppose first that the journal publishes if the researcher submits at least \(k\) favorable tests. If the editor naively treated \(k\) favorable tests as \(k\) random signals, posterior log odds would be \[ \operatorname{logit}(\pi)+k\log\frac{\beta}{\alpha} \] The corresponding naive hurdle is \[ k_0=\left\lceil \frac{\operatorname{logit}(\tau)-\operatorname{logit}(\pi)}{\log(\beta/\alpha)} \right\rceil \] This is the standard interpretation of robustness checks: enough independent favorable analyses should move beliefs above the publication threshold.
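The naive hurdle is straightforward to compute. A minimal sketch follows; the parameter values (\(\pi=0.10\), \(\tau=0.90\), \(\alpha=0.05\), \(\beta=0.80\)) are assumptions chosen for illustration, not values taken from the model.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def naive_hurdle(pi, tau, alpha, beta):
    """Smallest k with logit(pi) + k*log(beta/alpha) >= logit(tau)."""
    return math.ceil((logit(tau) - logit(pi)) / math.log(beta / alpha))

# Illustrative (assumed) parameters: prior 0.10, threshold 0.90,
# single-test false-positive rate 0.05, power 0.80.
print(naive_hurdle(pi=0.10, tau=0.90, alpha=0.05, beta=0.80))  # -> 2 favorable tests suffice naively
```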

But the relevant event is not “\(k\) random favorable tests.” It is “the researcher found at least \(k\) favorable tests after inspecting \(n\) tests.” Under state \(t\in\{0,1\}\), define \(p_0=\alpha\), \(p_1=\beta\), and \[ A_t(n,k)=\mathbb{P}\{\operatorname{Bin}(n,p_t)\geq k\} \label{eq:At} \] The selection-corrected posterior after a qualifying submission is \[ \Pi(n,k)= \frac{\pi A_1(n,k)}{\pi A_1(n,k)+(1-\pi)A_0(n,k)} \label{eq:posterior_selected} \]
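A minimal numerical sketch of this selection-corrected posterior, under the same assumed illustrative parameters, previews the collapse stated in Proposition 19.1 below.

```python
from scipy.stats import binom

def qualify_prob(n, k, p):
    """A_t(n, k) = P{Bin(n, p) >= k}."""
    return binom.sf(k - 1, n, p)

def selected_posterior(pi, n, k, alpha, beta):
    """Pi(n, k): posterior after the author reports >= k favorable tests out of n inspected."""
    a1, a0 = qualify_prob(n, k, beta), qualify_prob(n, k, alpha)
    return pi * a1 / (pi * a1 + (1 - pi) * a0)

# Illustrative (assumed) parameters: pi = 0.10, alpha = 0.05, beta = 0.80, k = 3.
for n in [3, 10, 30, 100, 300]:
    print(n, round(selected_posterior(0.10, n, 3, 0.05, 0.80), 3))
# The posterior drifts from near 1 back toward the prior 0.10 as n grows.
```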

Proposition 19.1 (Collapse of fixed selected-evidence standards) Fix \(k<\infty\). As \(n\to\infty\), \[ A_0(n,k)\to 1 \qquad A_1(n,k)\to 1 \qquad \Pi(n,k)\to \pi \] Thus a fixed requirement of \(k\) author-selected favorable tests has asymptotically zero selection-corrected likelihood-ratio content.

If instead \(k_n/n\to \rho\) for some \(\rho\in(\alpha,\beta)\), then \[ A_1(n,k_n)\to 1 \qquad A_0(n,k_n)=\exp\{-n\operatorname{KL}(\rho\|\alpha)+o(n)\} \] where \[ \operatorname{KL}(x\|p)=x\log\frac{x}{p}+(1-x)\log\frac{1-x}{1-p} \] Consequently \(\Pi(n,k_n)\to 1\).

Proof. The fixed-\(k\) claim follows because, for any fixed \(k\), the probability of at least \(k\) successes tends to one under either state as \(n\to\infty\). The linear-threshold claim combines the law of large numbers (under \(p=\beta\)) with the Chernoff large-deviation approximation for the upper tail of a binomial under \(p=\alpha\).

The proposition gives the basic cost-shift result. If the journal holds \(k\) fixed while \(c\) falls and \(n(c,M)\) grows, the false-hypothesis qualification probability rises. The selected tests look the same to a reader, but their meaning changes because the unreported denominator changes. The event that must be evaluated is not the published list of checks; it is the fact that such a list could be found. To keep the same screening force using author-selected evidence alone, the required number of favorable checks must grow with search capacity.

For a finite universe \(M\), the collapse has a ceiling. If \(c\to 0\) while \(M\) is fixed, then the limiting false qualification probability is \[ A_0(M,k)=\mathbb{P}\{\operatorname{Bin}(M, \alpha)\geq k\}. \] This may be small when \(M\) is small and \(k\) is demanding. The point is not that any fixed robustness standard is always useless. It is that the value of the standard depends on an object the reader usually does not observe: the size and structure of the search set. Cheap science becomes dangerous when the reasonable-test universe is large, when the author can cheaply exhaust it, or when the researcher can move upstream and choose among many hypotheses before presenting one. That last possibility is not solved by increasing \(k\) within a fixed hypothesis; it is a higher-level selection problem.

19.4 Optimal editorial audits

The second strategic layer asks whether the editor can restore unselected evidence by changing who chooses the tests. Now suppose the editor can request \(r\) tests from \(\mathcal{A}_M\) after seeing the submission. Think of the editor as committing to an audit distribution and revealing the realized audit only after the author has submitted. The audit may be implemented by asking the author to run specified code, by having a referee run it, or by requiring a machine-readable specification surface. The key difference from the prior section is that the tested specifications are not chosen by the author after seeing the signs of all tests.

This commitment language matters. An audit is useful because the author cannot condition the audited set on realized favorable outcomes. If the audit distribution itself can be renegotiated after the surface is observed, or if the surface is defined after the author knows which tests will pass, then the audit becomes selected evidence at a higher level. The benchmark below therefore asks what audit evidence is worth conditional on an ex ante surface and an editor who can commit to randomization.

For a clean benchmark, assume the author-selected submission has become uninformative in the cheap-search limit of Proposition 1. The editor therefore bases publication on the audit. Let \(Y_r\) be the number of favorable tests in a random audit of size \(r\). Conditional on \(H=t\), \[ Y_r\sim \operatorname{Bin}(r,p_t) \qquad p_0=\alpha \quad p_1=\beta \] up to finite-population corrections when sampling without replacement from a fixed realized universe.

Given \(y\) favorable audited tests out of \(r\), posterior log odds are \[ L_r(y)=\operatorname{logit}(\pi) +y\log\frac{\beta}{\alpha} +(r-y)\log\frac{1-\beta}{1-\alpha} \tag{19.1}\]

The Bayes-optimal editorial rule is therefore \[ \delta_r(y)=\mathbf{1}\{L_r(y)\geq \operatorname{logit}(\tau)\} \tag{19.2}\]

Since \(\beta>\alpha\), this is equivalent to a threshold rule \(y\geq h_r\), where \[ h_r= \left\lceil \frac{\operatorname{logit}(\tau)-\operatorname{logit}(\pi) -r\log\{(1-\beta)/(1-\alpha)\}} {\log(\beta/\alpha)-\log\{(1-\beta)/(1-\alpha)\}} \right\rceil \label{eq:hr} \]
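The threshold \(h_r\) is easy to compute. A minimal sketch follows, with \(\pi=0.30\), \(\tau=0.80\), \(\alpha=0.05\), and \(\beta=0.80\) assumed purely for illustration.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def audit_log_odds(y, r, pi, alpha, beta):
    """L_r(y): posterior log odds after y favorable results in an audit of size r (Equation 19.1)."""
    return (logit(pi)
            + y * math.log(beta / alpha)
            + (r - y) * math.log((1 - beta) / (1 - alpha)))

def audit_threshold(r, pi, tau, alpha, beta):
    """h_r: smallest y such that L_r(y) >= logit(tau)."""
    num = logit(tau) - logit(pi) - r * math.log((1 - beta) / (1 - alpha))
    den = math.log(beta / alpha) - math.log((1 - beta) / (1 - alpha))
    return math.ceil(num / den)

# Illustrative (assumed) parameters: pi = 0.30, tau = 0.80, alpha = 0.05, beta = 0.80.
for r in [1, 2, 5, 10]:
    print(r, audit_threshold(r, 0.30, 0.80, 0.05, 0.80))  # h_r grows roughly linearly in r
```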

Proposition 19.2 (Optimal finite audit) In the symmetric benchmark, among audit mechanisms that observe \(r\) randomly requested tests and then choose publish or reject, the optimal rule is the likelihood-ratio threshold Equation 19.2. If \(r=1\) and the threshold lies between the posterior after a failure and the posterior after a success, the optimal rule publishes if and only if the requested test is favorable. The posterior after one favorable random audit is \[ \mathbb{P}(H=1\mid Y_1=1)= \frac{\pi\beta}{\pi\beta+(1-\pi)\alpha} \] After \(r\) all-favorable audits, it is \[ \mathbb{P}(H=1\mid Y_r=r)= \frac{\pi\beta^r}{\pi\beta^r+(1-\pi)\alpha^r} \]

Proof. For any observed audit result, the editor’s expected payoff from publishing is \(q-\ell(1-q)\), where \(q\) is the posterior probability of \(H=1\). This is nonnegative if and only if \(q\geq \tau\). Bayes’ rule gives Equation 19.1. The formulas for one and \(r\) all-favorable audits follow by setting \(y=r=1\) and \(y=r\), respectively.

The proposition separates selected checks from random audits. A fixed number of selected favorable tests becomes uninformative as search capacity grows because both true and false hypotheses can eventually produce them. A fixed number of random favorable audits remains informative because the tested specifications were not chosen after observing their outcomes. But fixed audits are not a magic substitute for a full equilibrium solution. For fixed \(r\), the largest likelihood ratio the audit can supply is \((\beta/\alpha)^r\), which is finite. Thus fixed audits restore fixed evidence; they do not make false publication probabilities vanish.

If the journal wants asymptotic screening, the audit size must grow.

Proposition 19.3 (Growing audits) Let \(h_r/r\to \lambda\) for some \(\lambda\in(\alpha,\beta)\). Then \[ \mathbb{P}_0(Y_r\geq h_r)=\exp\{-r\operatorname{KL}(\lambda\|\alpha)+o(r)\} \] while \[ \mathbb{P}_1(Y_r<h_r)=\exp\{-r\operatorname{KL}(\lambda\|\beta)+o(r)\} \] Hence a random audit of size \(r\) separates true and false hypotheses at an exponential rate in \(r\).

The audit size required to make the false audit-pass probability at most \(\varepsilon\) is approximately \[ r\gtrsim \frac{\log(1/\varepsilon)}{\operatorname{KL}(\lambda\|\alpha)} \] This rate depends on the diagnostic quality of a randomly requested test, not directly on the number of tests the author was able to search. That is the main institutional advantage of audits: they convert the problem from asking how many favorable specifications the author can find to asking how many unselected specifications the editor can inspect. The conversion is powerful, but conditional: it works only after the specification surface has been defined in a way that the author cannot manipulate after seeing its realizations.
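This audit-size rule of thumb is easy to tabulate. In the sketch below, the acceptance fraction \(\lambda=0.4\) and \(\alpha=0.05\) are assumptions chosen for illustration.

```python
import math

def kl_bernoulli(x, p):
    """KL(x || p) between Bernoulli(x) and Bernoulli(p)."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def audit_size(eps, lam, alpha):
    """Approximate audit size making the false audit-pass probability at most eps."""
    return math.ceil(math.log(1 / eps) / kl_bernoulli(lam, alpha))

# Illustrative (assumed) parameters: lambda = 0.4, alpha = 0.05.
for eps in [1e-2, 1e-3, 1e-4]:
    print(eps, audit_size(eps, 0.4, 0.05))  # required audit size grows like log(1/eps)
```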

19.4.1 Sampling from a finite surface

When \(\mathcal{A}_M\) is finite and the editor samples without replacement, exact inference should condition on the selection event. If the researcher qualifies by finding at least \(k\) favorable tests in the full universe and the editor audits \(r\) tests, then under state \(t\), \[ \mathbb{P}_t(Y_r=y\mid S_M\geq k) =\sum_{s=k}^M \frac{\binom{s}{y}\binom{M-s}{r-y}}{\binom{M}{r}} \frac{\binom{M}{s}p_t^s(1-p_t)^{M-s}}{A_t(M,k)} \tag{19.3}\] where \(S_M=\sum_{a=1}^M X_a\). Equation 19.3 is the finite-population version of the audit likelihood. In the cheap-search, large-\(M\), fixed-\(k\) limit, the conditioning event has probability approaching one under both states, and the binomial audit approximation above is recovered.
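Equation 19.3 is straightforward to evaluate numerically: mix hypergeometric audit draws over the binomial distribution of the realized universe, conditioned on qualification. The parameters in the sketch below (\(M=40\), \(k=3\), \(r=5\), \(\alpha=0.05\)) are assumptions for illustration.

```python
from scipy.stats import binom, hypergeom

def audit_pmf_given_qualification(y, r, M, k, p):
    """P_t(Y_r = y | S_M >= k) from Equation 19.3, with p = p_t."""
    tail = binom.sf(k - 1, M, p)                 # A_t(M, k)
    total = 0.0
    for s in range(k, M + 1):
        draw = hypergeom.pmf(y, M, s, r)         # audit of size r from a universe with s favorable tests
        total += draw * binom.pmf(s, M, p) / tail
    return total

# Illustrative (assumed) parameters: M = 40, k = 3, r = 5, state H = 0 so p = alpha = 0.05.
pmf = [audit_pmf_given_qualification(y, 5, 40, 3, 0.05) for y in range(6)]
print([round(q, 3) for q in pmf], round(sum(pmf), 6))  # probabilities sum to 1
```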

At the other extreme, if \(r=M\), the editor observes the entire reasonable-test universe. The sufficient statistic is \(S_M\). The optimal full-surface rule publishes when \[ \operatorname{logit}(\pi) +S_M\log\frac{\beta}{\alpha} +(M-S_M)\log\frac{1-\beta}{1-\alpha} \geq \operatorname{logit}(\tau) \] Equivalently, publish when \(S_M/M\) exceeds a cutoff. If \(M\to\infty\) and the tests are independent, any cutoff in \((\alpha,\beta)\) gives perfect separation. If \(M\) is fixed, even full-surface auditing leaves a positive false-acceptance probability. For example, if the rule requires every reasonable test to pass, the false-acceptance probability is \(\alpha^M\), which is positive rather than zero.

19.5 Heterogeneous tests

The symmetric benchmark makes the audit rule transparent, but the optimality logic is more general. Suppose each test \(a\in\mathcal{A}\) has outcome distribution \(f^t_a(x)\) under state \(t\). Conditional on \(H\), tests are independent, but they need not have the same power or false-positive rate. If the editor observes an audit set \(R\subseteq\mathcal{A}\) and outcomes \(x_R\), posterior log odds are \[ \operatorname{logit}(\pi)+\sum_{a\in R}\log\frac{f^1_a(x_a)}{f^0_a(x_a)} \] The optimal acceptance rule is still a posterior threshold. Thus the decision part of the mechanism is solved by Bayes’ rule.

The remaining problem is choosing which tests to audit. For an audit set \(R\), define the posterior after observing \(X_R\) by \(\pi_R(X_R)\). The ex ante value of auditing \(R\) is \[ V(R)=\mathbb{E}\left[\left((1+\ell)\pi_R(X_R)-\ell\right)_+\right] \] where the expectation is under the prior predictive distribution. An optimal audit set of size at most \(r\) solves \[ R^*\in \arg\max_{R\subseteq\mathcal{A},\ |R|\leq r} V(R). \] With identical tests, all audit sets of the same size are equivalent and uniform randomization is without loss. With heterogeneous tests, the editor should prefer tests with high diagnostic value, but the exact choice is a finite Bayesian experimental-design problem.
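For small heterogeneous universes, this experimental-design problem can be solved by brute force. The sketch below enumerates the binary outcomes of each candidate audit set to compute \(V(R)\); the three tests and their \((\alpha_a,\beta_a)\) values are assumptions chosen for illustration.

```python
from itertools import combinations, product

def audit_value(R, alphas, betas, pi, ell):
    """V(R): ex ante value of auditing the tests in R, by enumerating binary outcomes."""
    value = 0.0
    for x in product([0, 1], repeat=len(R)):
        lik0 = lik1 = 1.0
        for a, xa in zip(R, x):
            lik0 *= alphas[a] if xa else 1 - alphas[a]
            lik1 *= betas[a] if xa else 1 - betas[a]
        pred = pi * lik1 + (1 - pi) * lik0      # prior predictive probability of outcome x
        post = pi * lik1 / pred                 # posterior P(H = 1 | x)
        value += pred * max((1 + ell) * post - ell, 0.0)
    return value

def best_audit_set(alphas, betas, pi, ell, r):
    """Brute-force search over audit sets of size at most r (feasible only for small universes)."""
    tests = range(len(alphas))
    sets = [R for size in range(r + 1) for R in combinations(tests, size)]
    return max(sets, key=lambda R: audit_value(R, alphas, betas, pi, ell))

# Illustrative (assumed) universe: three tests with different diagnosticity.
alphas, betas = [0.05, 0.20, 0.10], [0.80, 0.60, 0.90]
print(best_audit_set(alphas, betas, pi=0.30, ell=4.0, r=2))
```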

This generalization clarifies what is solved and what is not. Conditional on a known \(\mathcal{A}\) and known likelihoods, the editor’s decision problem remains standard Bayesian screening. The editor should aggregate audit outcomes by their likelihood ratios and publish when posterior odds exceed the loss-adjusted threshold. But the mechanism is broader than the decision rule. A deterministic audit menu can become part of the author’s optimization problem. A known audit distribution with broad support makes the author face the possibility that any reasonable specification may be inspected. The mechanism-design object is therefore a triple: a definition of \(\mathcal{A}\), an audit distribution over subsets of \(\mathcal{A}\), and a posterior-threshold publication rule. The first component is the one the benchmark takes as exogenous and the full equilibrium must endogenize.

19.6 Defining reasonable tests

The model so far assumes that \(\mathcal{A}\) is known. That assumption is doing a great deal of work. A random audit is only unselected evidence if it is random over a set that was not itself chosen after seeing the results. If the author can define \(\mathcal{A}\) after seeing the results, the audit is merely selected evidence at a higher level. The commitment result therefore depends not only on the editor’s ability to commit to an audit distribution, but also on the existence of an admissible surface to which that distribution can be applied.

This suggests three design requirements. First, the set of reasonable tests must be defined before the relevant outcomes are known, or by a procedure that is itself auditable. Second, the set must be broad enough to include the important forking paths: alternative samples, outcomes, controls, transformations, and inferential choices that a skeptical but reasonable reader would regard as admissible. Third, the set must be narrow enough that its members speak to the same claim. If robustness checks change the estimand too much, a failed check may not refute the main claim and a passed check may not support it. This is where statistical work on robustness radii and estimand distances complements the screening model.

A finite \(\mathcal{A}\) solves only a within-hypothesis problem. Suppose the researcher can examine \(J\) candidate hypotheses and submit the one whose reasonable-test universe looks best. For hypothesis \(j\), let \[ S_{jM}=\sum_{a=1}^M X_{ja} \] be the number of favorable tests in its universe. If all \(J\) hypotheses are false and the editor publishes a hypothesis when \(S_{jM}\geq h\), then the probability that at least one false hypothesis passes the full-universe rule is \[ 1-\left(1-\mathbb{P}\{\operatorname{Bin}(M,\alpha)\geq h\}\right)^J \label{eq:outer_search} \] For fixed \(M\), this probability tends to one as \(J\to\infty\). Thus even auditing every reasonable test for a presented hypothesis cannot solve unrestricted hypothesis search. A perfect within-hypothesis multiverse can still be embedded in an outer garden of forking hypotheses.
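The outer-search probability in the display above is easy to tabulate. In the sketch below, \(M=20\), \(h=5\), and \(\alpha=0.05\) are assumed for illustration.

```python
from scipy.stats import binom

def outer_search_pass_prob(J, M, h, alpha):
    """Probability that at least one of J false hypotheses passes the full-universe rule S_M >= h."""
    single = binom.sf(h - 1, M, alpha)
    return 1 - (1 - single) ** J

# Illustrative (assumed) parameters: M = 20, h = 5, alpha = 0.05.
for J in [1, 10, 100, 1000]:
    print(J, round(outer_search_pass_prob(J, 20, 5, 0.05), 4))  # approaches 1 as J grows
```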

This observation changes the interpretation of robustness policy. More robustness checks are useful only after the claim and its admissible perturbations have been fixed. The deeper institutional problem is to define the object being audited: the hypothesis, the estimand, the dataset, the outcome family, and the specification surface. Pre-analysis plans, registered reports, specification curves, and machine-readable multiverse definitions are all partial answers because they move part of the selection process before the realization of favorable evidence. Editorial audits are complementary: they give journals a way to use an ex ante surface once it exists. The unresolved question is whether a general equilibrium can be built in which authors, editors, and referees have incentives to generate and respect such surfaces rather than to define them strategically after the fact.

19.7 From commitment to a full equilibrium

A useful way to read the preceding sections is as a finite-depth strategic argument. At \(K=0\), the editor treats submitted robustness checks as if they were random signals. At \(K=1\), the editor recognizes that the author searched within a fixed hypothesis and a fixed test universe, so the relevant event is that the author found enough favorable tests. At \(K=2\), the editor responds by committing to an audit distribution, and the author anticipates that some unselected tests may be inspected. At \(K=3\), the author moves upstream and chooses the hypothesis, estimand, dataset, or specification surface in light of the screening rule. Higher levels repeat the same lesson: once one strategic margin is disciplined, the next margin becomes the object of selection.

This is the right way to locate the limitation of Fishman and Sekeres (2026) and of the present note. The earlier paper proves a commitment-equilibrium result: given a disciplined environment, the editor can commit to a rule and the researcher responds optimally to that rule. That is a meaningful institutional benchmark, because journals, registries, and registered reports are precisely attempts to create commitment. But a commitment equilibrium is not yet a full equilibrium of cheap science. It treats the object being screened as sufficiently well specified before play begins. The harder game lets the researcher choose or influence the object itself.

In that larger game, the relevant posterior is not merely \[ \mathbb{P}(H=1\mid \text{submitted evidence}) \] but something closer to \[ \mathbb{P}(H_j=1\mid \text{researcher chose }j,\, \text{researcher proposed }\mathcal{A}_j,\, \text{submitted evidence},\, \text{audit results}) \] The posterior requires a model of how hypotheses, estimands, and reasonable-test sets are generated and selected.

In the full game, there is a researcher who wants publication and an editor who wants to publish true claims and reject false ones. The researcher may choose a hypothesis \(j\), propose or exploit a test universe \(\mathcal{A}_j\), decide how many tests to inspect, choose what evidence to submit, and respond to any audit request. The editor may commit to admissibility rules, audit distributions, and posterior publication thresholds. A strategy profile should specify behavior at each of these stages, together with beliefs after every possible submission and proposed test universe.

The difficulty is that, once written this way, standard equilibrium existence arguments no longer apply automatically. The relevant action spaces are not generally compact, because hypotheses, estimands, datasets, and specification surfaces need not come from a fixed finite list. The editor’s payoff may be discontinuous, because a small change in the proposed surface can signal a large change in the selection process that produced it. Beliefs are hard to discipline off path, because the editor must form beliefs not only about whether a submitted hypothesis is true, but about why that hypothesis and that set of reasonable tests appeared in the submission.

The definition of “reasonable” is therefore partly endogenous to the mechanism. If the editor announces that she will sample uniformly from a declared set of tests, the author has an incentive to shape that set. If the editor announces that she will audit high-diagnostic tests, the author has an incentive to choose claims for which those tests are favorable. If the editor announces that all tests must pass, the author has an incentive to choose hypotheses whose finite reasonable-test universe has already been pre-screened. Thus the admissibility rule is not a passive constraint; it is part of the strategic environment.

The finite model gives clean comparative statics conditional on an exogenous hypothesis and an exogenous test universe. It shows that selected-evidence standards collapse, that random audits restore honest likelihood-ratio evidence, and that full-surface auditing solves the within-hypothesis problem only up to the limits of a finite \(\mathcal{A}\). But these are not yet equilibrium characterizations of the larger game. They are \(K=1\), \(K=2\), and \(K=3\) arguments about how each additional strategic layer changes the object in the editor’s posterior. The contribution of these arguments is to identify the ingredients that a general equilibrium statement would have to contain: an admissibility technology for hypotheses and surfaces, an audit commitment, a disclosure rule, and beliefs over the selection process that generated the submitted claim.

One way to obtain a well-defined equilibrium is to impose enough institutional structure. For example, suppose there is a finite registry of hypotheses \(\mathcal H\), each hypothesis has an exogenously specified finite test universe \(\mathcal{A}_j\), search is bounded, audit rules are committed to in advance, and the editor has a prior over the truth of each registered hypothesis. Then the game is finite. Equilibrium exists, and the analysis above describes useful components of equilibrium behavior. But this existence result is bought by assuming away the most interesting part of the problem: who defines \(\mathcal H\), who defines \(\mathcal{A}_j\), and how those definitions respond to the screening rule?

A more general equilibrium result would make the \(K\)-level structure explicit rather than hide it inside a fixed action space. For bounded \(K\), one could ask what institutional restrictions are sufficient for equilibrium existence when the editor anticipates \(K\) layers of researcher response. For unbounded strategic depth, the relevant object is closer to a fixed point: a screening institution under which the hypothesis space, the admissible surface, the audit distribution, and the publication rule are jointly stable against the forms of search they induce. The note does not prove such a theorem. Its role is to motivate why such a theorem is needed, and why commitment alone is an incomplete but still valuable benchmark.

One could plausibly appeal to computational constraints and time costs to restrict the game, but recent jumps in the ability of LLMs, especially in simulating the review process itself,2 make those restrictions less satisfying. As cheap testing expands, researchers studying the impact of AI on social science, and on research more broadly, must often rely on such relaxations and imposed constraints. The remaining gap is a general equilibrium theory of editorial screening in which the objects of inference are themselves strategically produced.

When science is cheap, author-selected favorable tests lose evidentiary content unless disclosure requirements scale with the amount of search. This is the selected-evidence problem. Editorial audits offer a different route. In the benchmark model, an editor who can request random tests from a fixed reasonable-test universe should use a likelihood-ratio threshold on the audit results. One audit gives one honest likelihood-ratio increment. A fixed number of audits gives bounded information. A growing audit gives exponential screening. Auditing all tests gives the full finite-universe benchmark.

The main caveat is also the main motivation. These are commitment results for a world in which the hypothesis and the reasonable-test universe have already been made meaningful. Optimal screening is not only a problem of choosing how many robustness checks to require. It is a problem of designing the set over which robustness is assessed. If \(\mathcal{A}\) is too narrow, audits miss important forks. If it is too broad, audited tests may no longer address the same claim. If it is chosen ex post, randomization no longer purges selection. And if researchers can search over hypotheses before presenting one, even full robustness within the chosen hypothesis can fail. Cheap science therefore pushes journals toward a new editorial object: an ex ante, auditable specification surface, paired with a publication rule that treats author-selected evidence and editor-random evidence differently.

The unresolved equilibrium problem clarifies why this agenda is difficult. In a fully strategic model, the editor is not merely choosing an optimal publication threshold after observing evidence; she is trying to design a game in which the objects of inference are themselves well defined. The role of registries, pre-analysis plans, registered reports, and specification surfaces is therefore not only to reduce researcher discretion, but to make the screening game exist in a useful form. They bound the hypothesis space, discipline the construction of \(\mathcal{A}\), and give the editor interpretable beliefs about what has and has not been searched.

This is the sense in which the simple model motivates a broader \(K\)-level equilibrium statement. At each level of strategic reasoning, a different object becomes selected: first robustness checks, then audit responses, then surfaces, then hypotheses. A useful theory of cheap science must say which of these margins an institution can commit to fixing, and which remain endogenous. Cheap testing makes evidence abundant, but abundance is not the same as informativeness. The central design problem is to build institutions under which cheap evidence can again be mapped into credible posterior beliefs.


  1. The independence assumption is intentionally strong. Tests of the same hypothesis are often correlated, and reasonable specifications with similar power and controls may be highly correlated. We dealt with this in Fishman and Sekeres (2026) by assuming tests were drawn from Gaussian AR(1) processes. Dealing with the correlated case in general is necessary, and the independence assumption is a limitation of the current analysis. Intuitively, the results should extend if one counts only the effectively independent tests that the author can search and the editor can observe. Doing so would require knowing the precise correlation between related tests (and how that correlation decays as tests grow increasingly unrelated), which in practice is nearly as strong an assumption as independence.↩︎

  2. See recent tools such as Refine, Coarse, and Reviewer 2.↩︎