6 The Shifting Production Function: AI, Reproducibility, and the Future of Quantitative Social Science
Abel Brodeur, Department of Economics and Institute for Replication, University of Ottawa, abrodeur@uottawa.ca
Bruno Barbarioli, Institute for Replication, University of Ottawa, bbarbari@uottawa.ca
Abstract: Artificial intelligence, particularly large language models (LLMs), is reshaping the production function of quantitative social science along three axes simultaneously: as an accelerator of measurement, coding, and analysis; as a source of new threats to validity, reproducibility, and scientific integrity; and as potential infrastructure for continuous verification of cumulative knowledge. This position paper synthesizes recent evidence on each axis, drawing on the “Replication Engine” proposal for automated reproduction at scale and on a growing body of peer-reviewed work evaluating LLMs in social-science workflows. We argue that the field faces a narrow window in which norms, reporting standards, and shared infrastructure can be established before AI-assisted research practices become entrenched without adequate safeguards. We propose a concrete five-year research agenda organized around six workstreams, a disclosure checklist for AI-assisted quantitative research, and evaluation metrics for computational reproducibility. The goal is not to slow adoption but to channel it toward a more self-correcting science.
6.1 Introduction: Why This Moment Is Qualitatively Different
Quantitative social science has always co-evolved with its tools. The spread of personal computing enabled large-scale survey analysis; the internet enabled online experiments; machine learning enabled text-as-data. Each transition expanded the frontier of feasible research while introducing new methodological pitfalls, and in each case the community’s norms lagged behind practice.
The current wave of AI, centered on large language models and agentic systems, is different in degree and arguably in kind. LLMs do not merely automate a single step; they can, in principle, participate in every stage of the research pipeline: literature review, hypothesis generation, instrument design, data collection (as simulated participants), coding, measurement, analysis, and even manuscript drafting (Grossmann et al. 2023; Xu et al. 2024; Thapa et al. 2025). This breadth of application means that AI is not simply another tool in the researcher’s kit; it is reshaping the production function itself.
At the same time, the social sciences continue to grapple with a reproducibility crisis. The landmark Reproducibility Project in psychology found that fewer than half of 100 published effects replicated (Open Science Collaboration 2015). Similar replication rates have been found in economics (Camerer et al. 2016). Traditional replication efforts, while invaluable, do not scale: the Many Labs project replicated fewer than 30 studies over several years. Manual verification cannot keep pace with the roughly three million papers published annually across all fields.
In this piece, we envision AI-powered infrastructure that automatically reproduces computational findings at the moment of publication, using agents that parse papers, reconstruct environments, execute analyses, and flag irreproducible results (Brodeur and Barbarioli 2025). This vision, ambitious but increasingly technically feasible, crystallizes both the promise and the peril of AI for cumulative social science. We frame the argument around how AI changes the production function of quantitative social science and offer prescriptive proposals for the next three to five years.
6.2 Implications of a Replication Engine Vision
The Replication Engine concept, as articulated by Brodeur and Barbarioli (2025), proposes a phased rollout: a pilot (Years 1–3) building cloud-based AI agents that parse manuscripts, reconstruct computational environments, execute analyses, and assign verification badges; followed by scaling through network effects (Years 4–10). The system envisions three specialized agents: the first checks that submitted code runs and reproduces outputs, the second audits for coding errors and data irregularities, and the third tests robustness.
For the quantitative social sciences, this architecture addresses a concrete bottleneck. Journals such as the American Economic Review and the American Journal of Political Science now require data and code deposits, but verification remains largely manual and resource-constrained. Automated reproduction could shift verification from a post-publication afterthought to a pre-publication standard.
However, translating this vision to the social sciences requires confronting domain-specific challenges. First, much social-science analysis involves idiosyncratic data-cleaning decisions that are difficult to reconstruct from code alone. Second, many studies depend on restricted-use data (e.g., Census microdata, linked administrative records) that cannot be freely shared. Third, the boundary between “reproduction” (re-running code on the same data) and “replication” (testing the same hypothesis on new data) matters enormously, and automated systems must be transparent about which they are performing. Finally, the “robustness agent” described in the Replication Engine proposal raises delicate questions about specification search: Who decides what constitutes a “reasonable” robustness check, and how are results communicated without implying that all Green-badged findings are true?
Despite these caveats, the core logic is compelling: if the marginal cost of computational reproduction falls close to zero, the equilibrium level of verification will rise dramatically. The question is how to build this infrastructure in ways that are calibrated to social-science epistemology.
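To fix ideas, the following minimal Python sketch shows one way the three agents’ verdicts could feed a badge decision. The function names, report structure, and decision rule are our illustrative assumptions, not the Replication Engine’s actual design.

```python
from dataclasses import dataclass
from enum import Enum

class Badge(Enum):
    GREEN = "green"   # reproduced, audited, and robust
    AMBER = "amber"   # reproduced, but audit or robustness concerns remain
    RED = "red"       # reported outputs could not be reproduced

@dataclass
class AgentReport:
    passed: bool
    notes: str

def assign_badge(reproduction: AgentReport,
                 audit: AgentReport,
                 robustness: AgentReport) -> Badge:
    """Combine the three agents' verdicts into a single verification badge.
    The decision rule is a placeholder; real criteria would need to be
    community-agreed and calibrated to social-science epistemology."""
    if not reproduction.passed:
        return Badge.RED
    if audit.passed and robustness.passed:
        return Badge.GREEN
    return Badge.AMBER
```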
6.3 Opportunities: Where LLMs Genuinely Help
We organize genuine opportunities into four clusters, noting that each comes with caveats discussed in Section 6.4.
Measurement and classification at scale. LLMs have demonstrated strong performance as zero-shot or few-shot classifiers for social-science constructs, often approaching or exceeding the reliability of human coders for tasks such as sentiment analysis, stance detection, and hate-speech classification (Thapa et al. 2025). Laurer et al. (2025) show that instruction-tuned models can increase measurement validity and reduce cross-group bias when researchers provide careful natural-language specifications of the construct to be measured. This “validity-by-instruction” paradigm lowers the barrier to large-scale text analysis while introducing new researcher degrees of freedom in prompt design.
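To make “validity-by-instruction” concrete, the sketch below classifies a text against a researcher-written construct definition using the OpenAI Python client. The model string, label set, and construct definition are placeholders; any instruction-tuned model with a comparable API would serve.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The construct definition is the "instrument": a placeholder example here.
CONSTRUCT_DEFINITION = (
    "You are coding open-ended survey responses for stance toward carbon "
    "taxation. Reply with exactly one label: pro, anti, or neutral."
)

def classify(text: str, model: str = "gpt-4-0613") -> str:
    """Zero-shot classification against an explicit construct definition."""
    resp = client.chat.completions.create(
        model=model,       # pin an exact, dated version string
        temperature=0,     # reduces, but does not eliminate, nondeterminism
        messages=[
            {"role": "system", "content": CONSTRUCT_DEFINITION},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()
```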
Forecasting and replication triage. LLMs can forecast empirical effect sizes with surprising accuracy. Lippert et al. (2024) found that GPT-4 matched a cohort of 119 human experts in predicting effect sizes from a complex behavioural-science study (\(r = 0.89\) for GPT-4 versus \(r = 0.87\) for expert aggregates). In a collaborative condition, access to a GPT-4 chatbot significantly improved the accuracy of non-expert forecasters. This suggests a role for LLMs in triage: identifying which claims are most likely to replicate and therefore where human replication resources should be directed.
Simulation and synthetic-data pilot studies. Researchers are increasingly using LLMs as “silicon samples” to pilot experiments and generate hypotheses before fielding expensive human studies (Grossmann et al. 2023; Xu et al. 2024). While the validity of such simulations is contested (see Section 6.4), they offer a low-cost way to stress-test experimental designs and explore parameter spaces that would be infeasible with human participants alone.
Computational reproduction and code assistance. AI agents can parse statistical code, identify dependencies, reconstruct execution environments, and flag discrepancies between reported and reproduced results, which is the core of the Replication Engine vision (Brodeur and Barbarioli 2025). More prosaically, LLM-assisted coding reduces the time required to clean data, write analysis scripts, and produce visualizations. The resulting productivity gains may lower the fixed costs of reproduction studies, which have historically been under-rewarded by the academic incentive structure.
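A stripped-down version of the reproduction step illustrates the idea: re-run a deposited analysis script and compare a checksum of its output against the value reported in the replication package. The file-name conventions and checksum choice here are our assumptions.

```python
import hashlib
import subprocess
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_reproduction(script: Path, output: Path, reported_sha256: str) -> bool:
    """Re-run a deposited analysis script and verify its output checksum.

    A real pipeline would first reconstruct the computational environment
    (container, pinned packages); here we assume it already exists.
    """
    subprocess.run(["python", str(script)], check=True)  # or Rscript, Stata, ...
    return sha256(output) == reported_sha256
```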
6.4 Failure Modes and Validity Threats
The same properties that make LLMs powerful tools also generate new classes of risk.
Measurement validity and hidden bias. Laurer et al. (2025) demonstrate that LLMs can learn group-specific language patterns rather than the construct researchers intend to measure, introducing systematic measurement bias. When models are trained on data that overrepresent certain populations or linguistic registers, “accuracy” in aggregate may mask poor performance on under-represented subgroups. The ease of LLM-based measurement can encourage researchers to skip traditional construct-validation steps, amplifying the problem.
Reproducibility of LLM-based workflows. Proprietary model versioning, temperature settings, system prompts, and nondeterministic decoding create a reproducibility challenge specific to AI-assisted research. Abdurahman et al. (2025) provide a primer emphasizing that researchers must log exact model identifiers, API parameters, and prompt text; use API access rather than web interfaces; and be transparent about batching and context effects. Yet current practice rarely meets these standards. Model providers may deprecate or silently update endpoints, making exact reproduction impossible even when researchers document their workflow carefully.
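In practice, much of this logging can be automated. The sketch below wraps an OpenAI-client call so that the exact model identifier, decoding parameters, prompt text, and raw response are appended to an audit file; the log schema and file name are our illustrative assumptions, not a standard.

```python
import json
import time
from openai import OpenAI

client = OpenAI()
LOG_PATH = "llm_audit_log.jsonl"

def logged_completion(model: str, messages: list, **params) -> str:
    """Call the API and append an audit record with everything needed
    to attempt an exact reproduction later."""
    resp = client.chat.completions.create(model=model, messages=messages, **params)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "requested_model": model,
        "served_model": resp.model,   # version string the API actually used
        "params": params,             # temperature, top_p, seed, max tokens, ...
        "messages": messages,
        "response": resp.choices[0].message.content,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["response"]
```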
Participant contamination and data integrity. Cox, Shirani, and Rouse (2024) warn that research participants themselves may use LLMs to generate responses, particularly in online surveys and text-based qualitative research on unfamiliar topics. As generative AI becomes ubiquitous, the boundary between “human” and “AI-assisted” data blurs. Social-science research that relies on text responses is especially vulnerable: AI-generated text may exhibit distinctive distributional properties that contaminate findings in ways that are difficult to detect post hoc.
Simulation fidelity and the “silicon sample” illusion. Although LLMs can produce text that superficially resembles human survey responses, they do not draw from genuine population distributions. They lack stable preference structures and update mechanisms; their outputs reflect training-data patterns and sampling algorithms rather than lived experience (Xu et al. 2024; Abdurahman et al. 2025). Treating LLM outputs as representative of human subpopulations without rigorous validation risks producing theoretically misleading results that appear empirically grounded.
Prompt-sensitivity and hidden researcher degrees of freedom. Small changes in prompt wording, ordering, and system instructions can substantially alter LLM outputs (Abdurahman et al. 2025). This creates a new form of the garden-of-forking-paths problem: researchers may iterate over prompt designs until they obtain desired classification rates or simulation outcomes, without reporting the full set of specifications explored. Unlike traditional analytic flexibility, prompt-based flexibility is difficult to constrain through pre-registration because the space of possible prompts is effectively infinite.
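One way to surface rather than hide this flexibility is to pre-commit to a small grid of prompt variants and decoding settings and report results for every cell. A minimal sketch, assuming a `classify(text, prompt=..., temperature=...)` helper (a variant of the one sketched in Section 6.3):

```python
from itertools import product

# Pre-committed prompt variants and decoding settings (illustrative).
PROMPT_VARIANTS = {
    "v1": "Label the stance of this text as pro, anti, or neutral.",
    "v2": "Is the author for, against, or neutral on the policy? "
          "Answer with pro, anti, or neutral.",
}
TEMPERATURES = [0.0, 0.7]

def sensitivity_sweep(texts, classify):
    """Run every (prompt, temperature) cell and return per-cell label shares,
    so divergence across cells is reported rather than silently explored."""
    results = {}
    for (pid, prompt), temp in product(PROMPT_VARIANTS.items(), TEMPERATURES):
        labels = [classify(t, prompt=prompt, temperature=temp) for t in texts]
        results[(pid, temp)] = {
            lab: labels.count(lab) / len(labels) for lab in set(labels)
        }
    return results
```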
Version drift and proprietary opacity. Major LLM providers routinely update models without detailed changelogs. A study that produces valid measurements with gpt-4-0613 may yield different results with gpt-4-turbo-2024-04-09. Because commercial APIs are black boxes, researchers cannot inspect or control for architectural or training-data changes. This makes longitudinal or comparative studies particularly vulnerable.
6.5 A Concrete Agenda for the Next 3–5 Years
We propose six prioritized workstreams, each tied to responsible actors.
- Build open, domain-specific reproducibility benchmarks. Create curated test suites of social-science papers with deposited code and data, verified reproductions, and known failure cases. These benchmarks would allow automated-reproduction systems (including the Replication Engine) to be evaluated on social-science-specific challenges such as restricted data, platform-dependent packages, and stochastic simulation. Actors: Metascience funders (e.g., Arnold Ventures, NSF), data archives (ICPSR, Dataverse), professional associations.
- Develop reporting standards for AI-assisted research. We propose a disclosure checklist (Table 6.1) covering model identity, prompt logging, parameter documentation, sensitivity analyses, and data-provenance statements. Journals should require checklist completion for any manuscript that uses LLMs in data collection, measurement, analysis, or writing. Actors: Journal editors, professional associations (APSA, ASA, AEA), the Center for Open Society.
- Invest in open-weight, versioned models for research. The reproducibility challenges posed by proprietary models can be partially mitigated by investing in open-weight LLMs that are version-locked and archived. Funding agencies should support the development and hosting of research-grade models with frozen weights and documented training data, analogous to how statistical software versions are archived. Actors: NSF, DARPA, national labs, university computing consortia.
- Adapt pre-registration for AI-assisted workflows. Existing pre-registration templates do not accommodate prompt-based analysis. Pre-analysis plans should specify the exact prompt text, model version, temperature, and post-processing pipeline to be used. For exploratory LLM use, a “prompt diary” (analogous to a lab notebook; see the sketch after this list) should be deposited alongside the pre-registration. Actors: OSF, AsPredicted, journal editors.
- Build detection tools for AI-contaminated data. Develop and validate classifiers that can flag AI-generated text in survey responses, interview transcripts, and open-ended data. Integrate these tools into major survey platforms (Qualtrics, Prolific, MTurk). Actors: Survey-methods researchers, platform providers, IRBs.
- Pilot automated reproduction for social-science journals. Partner with two to three journals to implement a lightweight automated-reproduction pipeline—a social-science-specific pilot of the Replication Engine concept. Evaluate the pipeline’s accuracy, false-positive rate, and cost per paper. Use findings to refine badge criteria (Green/Amber/Red) for the social-science context. Actors: Journal editors, metascience labs, cloud-computing sponsors.
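To make the “prompt diary” in the pre-registration workstream concrete, the following sketch appends one exploratory prompt trial per line to a JSONL file; the schema and file name are our illustrative assumptions.

```python
import json
import time

DIARY_PATH = "prompt_diary.jsonl"

def log_prompt_trial(prompt: str, model: str, temperature: float,
                     rationale: str, outcome_note: str) -> None:
    """Append one exploratory prompt trial to the diary, analogous to a
    dated lab-notebook entry deposited alongside the pre-registration."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "prompt": prompt,
        "model": model,
        "temperature": temperature,
        "rationale": rationale,        # why this variant was tried
        "outcome_note": outcome_note,  # what was observed, kept, or discarded
    }
    with open(DIARY_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```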
6.6 Recommendations for the Field
For journals: Require the checklist (Table 6.1) as a condition of submission for papers that employ LLMs in any research stage. Adopt computational-reproducibility badges and link them to automated-verification pipelines as these mature.
For funders: Allocate dedicated funding streams for (a) open, versioned research models, (b) reproducibility benchmarks, and (c) replication studies that specifically evaluate the robustness of AI-assisted findings relative to traditional methods.
For labs and PIs: Adopt containerized computational environments (Docker, Singularity) and version-pin all LLM dependencies. Treat prompt design as a research instrument subject to the same validation and reporting standards as survey instruments or experimental protocols.
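Version-pinning can extend beyond LLM dependencies to the whole software stack. A minimal sketch, using only the Python standard library, that snapshots the environment for deposit alongside the code (the output format is our assumption):

```python
import json
import platform
import sys
from importlib.metadata import distributions

def snapshot_environment(path: str = "environment_snapshot.json") -> None:
    """Record interpreter, OS, and installed package versions so the
    analysis environment can be reconstructed (e.g., inside a container)."""
    snapshot = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": sorted(
            f"{d.metadata['Name']}=={d.version}" for d in distributions()
        ),
    }
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
```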
For graduate training: Integrate instruction on AI-assisted methods, their limitations, and responsible-use practices into quantitative-methods sequences. Require students to complete at least one computational-reproduction exercise as part of their training, following best practices for transparency and documentation (Abdurahman et al. 2025).
Table 6.1: Disclosure checklist for AI-assisted quantitative research.

| Dimension | Required Disclosure | Rationale / Standard |
|---|---|---|
| Model identity | Provider, model name, exact version string (e.g., gpt-4-0613), access date | Enables reproduction; guards against version drift |
| Prompt specification | Full system prompt, user prompt(s), and any chain-of-thought instructions; deposited in a public repository | Prompt is the “instrument”; must be inspectable |
| Parameter settings | Temperature, top-\(p\), max tokens, seed (if supported), number of completions | Nondeterministic settings affect output distribution |
| Data pipeline | Whether data were batched; batch size; order randomization; post-processing steps | Batching and order create context effects (Abdurahman et al. 2025) |
| Sensitivity analysis | Results under \(\geq 2\) alternative prompts, \(\geq 2\) temperature settings, or \(\geq 1\) alternative model | Demonstrates robustness to prompt/model choice |
| Validation | Human–LLM agreement metrics (e.g., Cohen’s \(\kappa\), \(F_1\)) on a held-out sample; disaggregated by relevant subgroups | Establishes measurement validity (Laurer et al. 2025) |
| Data provenance | Statement on whether respondent data may contain AI-generated content; detection method used (if any) | Guards against participant contamination (Cox, Shirani, and Rouse 2024) |
| Reproducibility deposit | Code, prompts, environment specification (Dockerfile or equivalent), and raw LLM outputs deposited in a permanent repository | Enables computational reproduction |
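The validation row of Table 6.1 maps directly onto standard agreement metrics. A minimal sketch using scikit-learn, computing Cohen’s \(\kappa\) and macro-\(F_1\) overall and per subgroup (the function and report schema are our own):

```python
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score, f1_score

def validate_by_subgroup(human_labels, llm_labels, subgroups):
    """Compute human-LLM agreement overall and per subgroup, so that
    aggregate accuracy cannot mask poor performance on small groups."""
    report = {
        "overall": {
            "kappa": cohen_kappa_score(human_labels, llm_labels),
            "f1_macro": f1_score(human_labels, llm_labels, average="macro"),
        }
    }
    by_group = defaultdict(lambda: ([], []))
    for h, m, g in zip(human_labels, llm_labels, subgroups):
        by_group[g][0].append(h)
        by_group[g][1].append(m)
    for g, (h, m) in by_group.items():
        report[g] = {
            "kappa": cohen_kappa_score(h, m),
            "f1_macro": f1_score(h, m, average="macro"),
        }
    return report
```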
6.7 Limitations and Open Questions
Several important questions remain beyond the scope of this paper. First, the political economy of automated verification deserves scrutiny: who controls the verification infrastructure, and how do we prevent it from entrenching particular methodological orthodoxies? The Green/Amber/Red badge system, if implemented carelessly, could penalize innovative methods that do not fit standard templates. Second, the costs and benefits of open-weight versus proprietary models for research involve complex trade-offs between capability, accessibility, and auditability that we have only sketched here. Third, the ethical implications of “silicon samples” extend beyond validity to questions of consent, representation, and the potential displacement of human research participants, issues that intersect with ongoing debates about the governance of AI-generated content. Fourth, automated reproduction addresses computational reproducibility but not the deeper question of replicability—whether the same finding holds in a new sample or context. Building automated systems for conceptual replication remains a grand challenge for metascience.
6.8 Conclusion
AI is not simply adding a new tool to the quantitative social scientist’s workbench; it is restructuring the workbench itself. The same technology that enables rapid, scalable measurement and analysis also introduces new threats to validity, transparency, and reproducibility. The Replication Engine vision offers a compelling endpoint (a world in which every computational claim comes with machine-audited verification) but reaching that endpoint requires disciplined investment in standards, benchmarks, and infrastructure tailored to social-science epistemology.
The proposals in this paper are deliberately concrete and near-term. They are designed to be implementable within existing institutional structures while the field develops the deeper theoretical and empirical understanding needed for longer-term governance. The window for establishing norms is narrow: once AI-assisted practices become the unreflective default, retrofitting transparency and reproducibility will be far more difficult.