6 The Shifting Production Function: AI, Reproducibility, and the Future of Quantitative Social Science
Abel Brodeur, Department of Economics and Institute for Replication, University of Ottawa, abrodeur@uottawa.ca
Bruno Barbarioli, Institute for Replication, University of Ottawa, bbarbari@uottawa.ca
Abstract: Artificial intelligence, particularly large language models (LLMs), is reshaping the production function of quantitative social science along three axes simultaneously: as an accelerator of measurement, coding, and analysis; as a source of new threats to validity, reproducibility, and scientific integrity; and as potential infrastructure for continuous verification of cumulative knowledge. This position paper synthesizes recent evidence on each axis, drawing on the “Replication Engine” proposal for automated reproduction at scale and on a growing body of peer-reviewed work evaluating LLMs in social-science workflows. We argue that the field faces a narrow window in which norms, reporting standards, and shared infrastructure can be established before AI-assisted research practices become entrenched without adequate safeguards. We propose a concrete five-year research agenda organized around six workstreams, a disclosure checklist for AI-assisted quantitative research, and evaluation metrics for computational reproducibility. The goal is not to slow adoption but to channel it toward a more self-correcting science.
6.1 Introduction: Why This Moment Is Qualitatively Different
Quantitative social science has always co-evolved with its tools. The spread of personal computing enabled large-scale survey analysis; the internet enabled online experiments; machine learning enabled text-as-data. Each transition expanded the frontier of feasible research while introducing new methodological pitfalls, and in each case the community’s norms lagged behind practice.
The current wave of AI, centered on large language models and agentic systems, is different in degree and arguably in kind. LLMs do not merely automate a single step; they can, in principle, participate in every stage of the research pipeline: literature review, hypothesis generation, instrument design, data collection (as simulated participants), coding, measurement, analysis, and even manuscript drafting (Grossmann et al. 2023; Xu et al. 2024; Thapa et al. 2025). This breadth of application means that AI is not simply another tool in the researcher’s kit; it is reshaping the production function itself.
At the same time, the social sciences continue to grapple with a reproducibility crisis. The landmark Reproducibility Project in psychology found that fewer than half of 100 published effects replicated (Open Science Collaboration 2015). Similar replication rates have been found in economics (Camerer et al. 2016). Traditional replication efforts, while invaluable, do not scale: the Many Labs project replicated fewer than 30 studies over several years. Manual verification cannot keep pace with the roughly three million papers published annually across all fields.
In this piece, we envision AI-powered infrastructure that automatically reproduces computational findings at the moment of publication, using agents that parse papers, reconstruct environments, execute analyses, and flag irreproducible results (Brodeur and Barbarioli 2025). This vision, ambitious but increasingly technically feasible, crystallizes both the promise and the peril of AI for cumulative social science. We frame the argument around how AI changes the production function of quantitative social science and offer prescriptive proposals for the next three to five years.
6.2 Implications of a Replication Engine Vision
The Replication Engine concept, as articulated by Brodeur and Barbarioli (2025), proposes a phased rollout: a pilot (Years 1–3) building cloud-based AI agents that parse manuscripts, reconstruct computational environments, execute analyses, and assign verification badges; followed by scaling through network effects (Years 4–10). The system envisions three specialized agents: the first checks that submitted code runs and reproduces outputs, the second audits for coding errors and data irregularities, and the third tests robustness.
For the quantitative social sciences, this architecture addresses a concrete bottleneck. Journals such as the American Economic Review and the American Journal of Political Science now require data and code deposits, but verification remains largely manual and resource-constrained. Automated reproduction could shift verification from a post-publication afterthought to a pre-publication standard.
However, translating this vision to the social sciences requires confronting domain-specific challenges. First, much social-science analysis involves idiosyncratic data-cleaning decisions that are difficult to reconstruct from code alone. Second, many studies depend on restricted-use data (e.g., Census microdata, linked administrative records) that cannot be freely shared. Third, the boundary between “reproduction” (re-running code on the same data) and “replication” (testing the same hypothesis on new data) matters enormously, and automated systems must be transparent about which they are performing. Finally, the “robustness agent” described in the Replication Engine proposal raises delicate questions about specification search: Who decides what constitutes a “reasonable” robustness check, and how are results communicated without implying that all Green-badged findings are true?
Despite these caveats, the core logic is compelling: if the marginal cost of computational reproduction falls close to zero, the equilibrium level of verification will rise dramatically. The question is how to build this infrastructure in ways that are calibrated to social-science epistemology.
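To fix ideas, the following minimal Python sketch shows one way the three agents’ verdicts could feed a badge decision. The function names, report structure, and decision rule are our illustrative assumptions, not the Replication Engine’s actual design.

```python
from dataclasses import dataclass
from enum import Enum

class Badge(Enum):
    GREEN = "green"   # reproduced, audited, and robust
    AMBER = "amber"   # reproduced, but audit or robustness concerns remain
    RED = "red"       # reported outputs could not be reproduced

@dataclass
class AgentReport:
    passed: bool
    notes: str

def assign_badge(reproduction: AgentReport,
                 audit: AgentReport,
                 robustness: AgentReport) -> Badge:
    """Combine the three agents' verdicts into a single verification badge.
    The decision rule is a placeholder; real criteria would need to be
    community-agreed and calibrated to social-science epistemology."""
    if not reproduction.passed:
        return Badge.RED
    if audit.passed and robustness.passed:
        return Badge.GREEN
    return Badge.AMBER
```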
6.3 Opportunities: Where LLMs Genuinely Help
We organize genuine opportunities into four clusters, noting that each comes with caveats discussed in Section 6.4.
Measurement and classification at scale. LLMs have demonstrated strong performance as zero-shot or few-shot classifiers for social-science constructs, often approaching or exceeding the reliability of human coders for tasks such as sentiment analysis, stance detection, and hate-speech classification (Thapa et al. 2025). Laurer et al. (2025) show that instruction-tuned models can increase measurement validity and reduce cross-group bias when researchers provide careful natural-language specifications of the construct to be measured. This “validity-by-instruction” paradigm lowers the barrier to large-scale text analysis while introducing new researcher degrees of freedom in prompt design.
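To make “validity-by-instruction” concrete, the sketch below classifies a text against a researcher-written construct definition using the OpenAI Python client. The model string, label set, and construct definition are placeholders; any instruction-tuned model with a comparable API would serve.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The construct definition is the "instrument": a placeholder example here.
CONSTRUCT_DEFINITION = (
    "You are coding open-ended survey responses for stance toward carbon "
    "taxation. Reply with exactly one label: pro, anti, or neutral."
)

def classify(text: str, model: str = "gpt-4-0613") -> str:
    """Zero-shot classification against an explicit construct definition."""
    resp = client.chat.completions.create(
        model=model,       # pin an exact, dated version string
        temperature=0,     # reduces, but does not eliminate, nondeterminism
        messages=[
            {"role": "system", "content": CONSTRUCT_DEFINITION},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()
```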
Forecasting and replication triage. LLMs can forecast empirical effect sizes with surprising accuracy. Lippert et al. (2024) found that GPT-4 matched a cohort of 119 human experts in predicting effect sizes from a complex behavioural-science study (\(r = 0.89\) for GPT-4 versus \(r = 0.87\) for expert aggregates). In a collaborative condition, access to a GPT-4 chatbot significantly improved the accuracy of non-expert forecasters. This suggests a role for LLMs in triage: identifying which claims are most likely to replicate and therefore where human replication resources should be directed.
Simulation and synthetic-data pilot studies. Researchers are increasingly using LLMs as “silicon samples” to pilot experiments and generate hypotheses before fielding expensive human studies (Grossmann et al. 2023; Xu et al. 2024). While the validity of such simulations is contested (see Section 6.4), they offer a low-cost way to stress-test experimental designs and explore parameter spaces that would be infeasible with human participants alone.
Computational reproduction and code assistance. AI agents can parse statistical code, identify dependencies, reconstruct execution environments, and flag discrepancies between reported and reproduced results, which is the core of the Replication Engine vision (Brodeur and Barbarioli 2025). More prosaically, LLM-assisted coding reduces the time required to clean data, write analysis scripts, and produce visualizations. The resulting productivity gains may lower the fixed costs of reproduction studies, which have historically been under-rewarded by the academic incentive structure.
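A stripped-down version of the reproduction step illustrates the idea: re-run a deposited analysis script and compare a checksum of its output against the value reported in the replication package. The file-name conventions and checksum choice here are our assumptions.

```python
import hashlib
import subprocess
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_reproduction(script: Path, output: Path, reported_sha256: str) -> bool:
    """Re-run a deposited analysis script and verify its output checksum.

    A real pipeline would first reconstruct the computational environment
    (container, pinned packages); here we assume it already exists.
    """
    subprocess.run(["python", str(script)], check=True)  # or Rscript, Stata, ...
    return sha256(output) == reported_sha256
```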
6.4 Failure Modes and Validity Threats
The same properties that make LLMs powerful tools also generate new classes of risk.
Measurement validity and hidden bias. Laurer et al. (2025) demonstrate that LLMs can learn group-specific language patterns rather than the construct researchers intend to measure, introducing systematic measurement bias. When models are trained on data that overrepresent certain populations or linguistic registers, “accuracy” in aggregate may mask poor performance on under-represented subgroups. The ease of LLM-based measurement can encourage researchers to skip traditional construct-validation steps, amplifying the problem.
Reproducibility of LLM-based workflows. Proprietary model versioning, temperature settings, system prompts, and nondeterministic decoding create a reproducibility challenge specific to AI-assisted research. Abdurahman et al. (2025) provide a primer emphasizing that researchers must log exact model identifiers, API parameters, and prompt text; use API access rather than web interfaces; and be transparent about batching and context effects. Yet current practice rarely meets these standards. Model providers may deprecate or silently update endpoints, making exact reproduction impossible even when researchers document their workflow carefully.
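In practice, much of this logging can be automated. The sketch below wraps an OpenAI-client call so that the exact model identifier, decoding parameters, prompt text, and raw response are appended to an audit file; the log schema and file name are our illustrative assumptions, not a standard.

```python
import json
import time
from openai import OpenAI

client = OpenAI()
LOG_PATH = "llm_audit_log.jsonl"

def logged_completion(model: str, messages: list, **params) -> str:
    """Call the API and append an audit record with everything needed
    to attempt an exact reproduction later."""
    resp = client.chat.completions.create(model=model, messages=messages, **params)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "requested_model": model,
        "served_model": resp.model,   # version string the API actually used
        "params": params,             # temperature, top_p, seed, max tokens, ...
        "messages": messages,
        "response": resp.choices[0].message.content,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["response"]
```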
Participant contamination and data integrity. Cox, Shirani, and Rouse (2024) warn that research participants themselves may use LLMs to generate responses, particularly in online surveys and text-based qualitative research on unfamiliar topics. As generative AI becomes ubiquitous, the boundary between “human” and “AI-assisted” data blurs. Social-science research that relies on text responses is especially vulnerable: AI-generated text may exhibit distinctive distributional properties that contaminate findings in ways that are difficult to detect post hoc.
Simulation fidelity and the “silicon sample” illusion. Although LLMs can produce text that superficially resembles human survey responses, they do not draw from genuine population distributions. They lack stable preference structures and update mechanisms; their outputs reflect training-data patterns and sampling algorithms rather than lived experience (Xu et al. 2024; Abdurahman et al. 2025). Treating LLM outputs as representative of human subpopulations without rigorous validation risks producing theoretically misleading results that appear empirically grounded.
Prompt-sensitivity and hidden researcher degrees of freedom. Small changes in prompt wording, ordering, and system instructions can substantially alter LLM outputs (Abdurahman et al. 2025). This creates a new form of the garden-of-forking-paths problem: researchers may iterate over prompt designs until they obtain desired classification rates or simulation outcomes, without reporting the full set of specifications explored. Unlike traditional analytic flexibility, prompt-based flexibility is difficult to constrain through pre-registration because the space of possible prompts is effectively infinite.
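One way to surface rather than hide this flexibility is to pre-commit to a small grid of prompt variants and decoding settings and report results for every cell. A minimal sketch, assuming a `classify(text, prompt=..., temperature=...)` helper (a variant of the one sketched in Section 6.3):

```python
from itertools import product

# Pre-committed prompt variants and decoding settings (illustrative).
PROMPT_VARIANTS = {
    "v1": "Label the stance of this text as pro, anti, or neutral.",
    "v2": "Is the author for, against, or neutral on the policy? "
          "Answer with pro, anti, or neutral.",
}
TEMPERATURES = [0.0, 0.7]

def sensitivity_sweep(texts, classify):
    """Run every (prompt, temperature) cell and return per-cell label shares,
    so divergence across cells is reported rather than silently explored."""
    results = {}
    for (pid, prompt), temp in product(PROMPT_VARIANTS.items(), TEMPERATURES):
        labels = [classify(t, prompt=prompt, temperature=temp) for t in texts]
        results[(pid, temp)] = {
            lab: labels.count(lab) / len(labels) for lab in set(labels)
        }
    return results
```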
Version drift and proprietary opacity. Major LLM providers routinely update models without detailed changelogs. A study that produces valid measurements with gpt-4-0613 may yield different results with gpt-4-turbo-2024-04-09. Because commercial APIs are black boxes, researchers cannot inspect or control for architectural or training-data changes. This makes longitudinal or comparative studies particularly vulnerable.
6.5 A Concrete Agenda for the Next 3–5 Years
We propose six prioritized workstreams, each tied to responsible actors.
- Build open, domain-specific reproducibility benchmarks. Create curated test suites of social-science papers with deposited code and data, verified reproductions, and known failure cases. These benchmarks would allow automated-reproduction systems (including the Replication Engine) to be evaluated on social-science-specific challenges such as restricted data, platform-dependent packages, and stochastic simulation. Actors: Metascience funders (e.g., Arnold Ventures, NSF), data archives (ICPSR, Dataverse), professional associations.
- Develop reporting standards for AI-assisted research. We propose a disclosure checklist (Table 6.1) covering model identity, prompt logging, parameter documentation, sensitivity analyses, and data-provenance statements. Journals should require checklist completion for any manuscript that uses LLMs in data collection, measurement, analysis, or writing. Actors: Journal editors, professional associations (APSA, ASA, AEA), the Center for Open Society.
- Invest in open-weight, versioned models for research. The reproducibility challenges posed by proprietary models can be partially mitigated by investing in open-weight LLMs that are version-locked and archived. Funding agencies should support the development and hosting of research-grade models with frozen weights and documented training data, analogous to how statistical software versions are archived. Actors: NSF, DARPA, national labs, university computing consortia.
- Adapt pre-registration for AI-assisted workflows. Existing pre-registration templates do not accommodate prompt-based analysis. Pre-analysis plans should specify the exact prompt text, model version, temperature, and post-processing pipeline to be used. For exploratory LLM use, a “prompt diary” (analogous to a lab notebook; see the sketch after this list) should be deposited alongside the pre-registration. Actors: OSF, AsPredicted, journal editors.
- Build detection tools for AI-contaminated data. Develop and validate classifiers that can flag AI-generated text in survey responses, interview transcripts, and open-ended data. Integrate these tools into major survey platforms (Qualtrics, Prolific, MTurk). Actors: Survey-methods researchers, platform providers, IRBs.
- Pilot automated reproduction for social-science journals. Partner with two to three journals to implement a lightweight automated-reproduction pipeline—a social-science-specific pilot of the Replication Engine concept. Evaluate the pipeline’s accuracy, false-positive rate, and cost per paper. Use findings to refine badge criteria (Green/Amber/Red) for the social-science context. Actors: Journal editors, metascience labs, cloud-computing sponsors.
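To make the “prompt diary” in the pre-registration workstream concrete, the following sketch appends one exploratory prompt trial per line to a JSONL file; the schema and file name are our illustrative assumptions.

```python
import json
import time

DIARY_PATH = "prompt_diary.jsonl"

def log_prompt_trial(prompt: str, model: str, temperature: float,
                     rationale: str, outcome_note: str) -> None:
    """Append one exploratory prompt trial to the diary, analogous to a
    dated lab-notebook entry deposited alongside the pre-registration."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
        "rationale": rationale,        # why this variant was tried
        "outcome_note": outcome_note,  # what was observed, kept, or discarded
    }
    with open(DIARY_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```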
6.6 Recommendations for the Field
For journals: Require the checklist (Table 6.1) as a condition of submission for papers that employ LLMs in any research stage. Adopt computational-reproducibility badges and link them to automated-verification pipelines as these mature.
For funders: Allocate dedicated funding streams for (a) open, versioned research models, (b) reproducibility benchmarks, and (c) replication studies that specifically evaluate the robustness of AI-assisted findings relative to traditional methods.
For labs and PIs: Adopt containerized computational environments (Docker, Singularity) and version-pin all LLM dependencies. Treat prompt design as a research instrument subject to the same validation and reporting standards as survey instruments or experimental protocols.
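Version-pinning can extend beyond LLM dependencies to the whole software stack. A minimal sketch, using only the Python standard library, that snapshots the environment for deposit alongside the code (the output format is our assumption):

```python
import json
import platform
import sys
from importlib.metadata import distributions

def snapshot_environment(path: str = "environment_snapshot.json") -> None:
    """Record interpreter, OS, and installed package versions so the
    analysis environment can be reconstructed (e.g., inside a container)."""
    snapshot = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": sorted(
            f"{d.metadata['Name']}=={d.version}" for d in distributions()
        ),
    }
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
```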
For graduate training: Integrate instruction on AI-assisted methods, their limitations, and responsible-use practices into quantitative-methods sequences. Require students to complete at least one computational-reproduction exercise as part of their training, following best practices for transparency and documentation (Abdurahman et al. 2025).
Table 6.1: Disclosure checklist for AI-assisted quantitative research.

| Dimension | Required Disclosure | Rationale / Standard |
|---|---|---|
| Model identity | Provider, model name, exact version string (e.g., gpt-4-0613), access date | Enables reproduction; guards against version drift |
| Prompt specification | Full system prompt, user prompt(s), and any chain-of-thought instructions; deposited in a public repository | Prompt is the “instrument”; must be inspectable |
| Parameter settings | Temperature, top-\(p\), max tokens, seed (if supported), number of completions | Nondeterministic settings affect output distribution |
| Data pipeline | Whether data were batched; batch size; order randomization; post-processing steps | Batching and order create context effects (Abdurahman et al. 2025) |
| Sensitivity analysis | Results under \(\geq 2\) alternative prompts, \(\geq 2\) temperature settings, or \(\geq 1\) alternative model | Demonstrates robustness to prompt/model choice |
| Validation | Human–LLM agreement metrics (e.g., Cohen’s \(\kappa\), \(F_1\)) on a held-out sample; disaggregated by relevant subgroups | Establishes measurement validity (Laurer et al. 2025) |
| Data provenance | Statement on whether respondent data may contain AI-generated content; detection method used (if any) | Guards against participant contamination (Cox, Shirani, and Rouse 2024) |
| Reproducibility deposit | Code, prompts, environment specification (Dockerfile or equivalent), and raw LLM outputs deposited in a permanent repository | Enables computational reproduction |
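The validation row of Table 6.1 maps directly onto standard agreement metrics. A minimal sketch using scikit-learn, computing Cohen’s \(\kappa\) and macro-\(F_1\) overall and per subgroup (the function and report schema are our own):

```python
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score, f1_score

def validate_by_subgroup(human_labels, llm_labels, subgroups):
    """Compute human-LLM agreement overall and per subgroup, so that
    aggregate accuracy cannot mask poor performance on small groups."""
    report = {
        "overall": {
            "kappa": cohen_kappa_score(human_labels, llm_labels),
            "f1_macro": f1_score(human_labels, llm_labels, average="macro"),
        }
    }
    by_group = defaultdict(lambda: ([], []))
    for h, m, g in zip(human_labels, llm_labels, subgroups):
        by_group[g][0].append(h)
        by_group[g][1].append(m)
    for g, (h, m) in by_group.items():
        report[g] = {
            "kappa": cohen_kappa_score(h, m),
            "f1_macro": f1_score(h, m, average="macro"),
        }
    return report
```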
6.7 Limitations and Open Questions
Several important questions remain beyond the scope of this paper. First, the political economy of automated verification deserves scrutiny: who controls the verification infrastructure, and how do we prevent it from entrenching particular methodological orthodoxies? The Green/Amber/Red badge system, if implemented carelessly, could penalize innovative methods that do not fit standard templates. Second, the costs and benefits of open-weight versus proprietary models for research involve complex trade-offs between capability, accessibility, and auditability that we have only sketched here. Third, the ethical implications of “silicon samples” extend beyond validity to questions of consent, representation, and the potential displacement of human research participants, issues that intersect with ongoing debates about the governance of AI-generated content. Fourth, automated reproduction addresses computational reproducibility but not the deeper question of replicability—whether the same finding holds in a new sample or context. Building automated systems for conceptual replication remains a grand challenge for metascience.
6.8 Conclusion
AI is not simply adding a new tool to the quantitative social scientist’s workbench; it is restructuring the workbench itself. The same technology that enables rapid, scalable measurement and analysis also introduces new threats to validity, transparency, and reproducibility. The Replication Engine vision offers a compelling endpoint (a world in which every computational claim comes with machine-audited verification) but reaching that endpoint requires disciplined investment in standards, benchmarks, and infrastructure tailored to social-science epistemology.
The proposals in this paper are deliberately concrete and near-term. They are designed to be implementable within existing institutional structures while the field develops the deeper theoretical and empirical understanding needed for longer-term governance. The window for establishing norms is narrow: once AI-assisted practices become the unreflective default, retrofitting transparency and reproducibility will be far more difficult.