9  Failing Faster, Learning Better: Agentic AI and Empirical Social Science Research

Charles Crabtree, School of Social Sciences, Monash University and K-Club Professor, University College, Korea University
Valentina Gonzalez-Rostani, Department of Political Science and International Relations, University of Southern California
Jae Yeon Kim, Department of Public Policy, University of North Carolina at Chapel Hill

(All authors contributed equally. The names are listed alphabetically by last name.)

Abstract: How can social scientists generate more ideas, test more mechanisms, and fail more productively? We argue that agentic AI offers a qualitative shift in researchers’ capacity for iterative search. Drawing on Bendor (2010)’s framework of innovation as a search problem that balances creativity (broad exploration) and criticism (rigorous evaluation), we organize the paper around two recurring search failures: Type I error, in which promising ideas are screened out before they are adequately explored, and Type II error, in which researchers overinvest in weak research designs. We discuss two applications. In organizational-level measurement research, agentic AI lowers the feasibility barrier to exploring new projects and automates iterative validation, catching flawed measures before they enter the downstream workflow. In experimental and audit-based research, the prevalent error is Type I error from underpowered, under-piloted designs. Agentic AI enables power diagnostics and stress-testing at scale. We are explicit about what these tools cannot do: silicon samples cannot validate treatment effects in real humans; LLMs default to agreement; AI-generated stimuli carry hidden priors and risk bundling effects; and using AI to defend against AI contamination raises new problems. Agentic AI is best used to scaffold upstream iteration in design, measurement, and validation. It does not replace human judgment.

AI usage statement: TBD

“Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability; but there are degrees of testability: some theories are more testable, more exposed to refutation, than others; they take, as it were, greater risks.”
Conjectures and Refutations, Popper (1963, 36)

“And policy analysts, in deciding whether to use a finding from behavioral decision theory, are in the FDA’s position: they can be either too patient, and fail to apply an idea that is ready, or too impatient, and apply something prematurely.”
Bounded Rationality and Politics, Bendor (2010, 181)

9.2 Innovation as Iterative Search: The Bendor Framework

Scientific research is not about confirmation, but about increasing falsifiability. In this framework, the more testable a theory is, the better (Popper 1963). Jonathan Bendor (1950–2025), the late professor of political economy at Stanford’s Graduate School of Business, extends Popper’s framework of scientific discovery to organizational problem solving (Bendor 1995, 2010, 2015). Building on Simon (1947)’s work on bounded rationality and Lindblom (1959)’s work on disjointed incrementalism, he argues that innovation arises from balancing two risks: dismissing promising ideas before adequate exploration (Type I error) and overinvesting in weak ideas (Type II error) (Johnston 2015). Because these risks trade off, effective search requires both creativity and criticism. Progress depends on the iterative interplay between exploration and exploitation.

Agentic AI compresses creativity and criticism into a single streamlined process by reducing the cost of local iteration. On the exploration side, agents lower the marginal cost of trying new measurement strategies and testing alternative designs, helping researchers avoid the Type I error of dismissing promising ideas before they are adequately explored. On the evaluation side, they automate checks against benchmarks, across specifications, and under adversarial prompting, which helps guard against the Type II error of overinvesting in polished but weak ideas.
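
To make the loop concrete, the sketch below shows one way a propose-and-critique cycle of this kind could be organized in Python. The helper functions `propose_measures` and `critique_measure` are hypothetical stand-ins for calls to whatever agent framework a researcher uses; they are stubbed here so that only the structure of the iteration is illustrated, not any particular tool.

```python
# Toy sketch of the propose-and-critique loop described above. Both helper
# functions are hypothetical stand-ins for agent calls and are stubbed here
# so only the structure of the iteration is visible.

def propose_measures(concept, n=5):
    """Exploration: ask an agent for n candidate operationalizations (stubbed)."""
    return [f"{concept}: candidate operationalization {i}" for i in range(n)]

def critique_measure(candidate):
    """Criticism: ask an adversarial agent to score a candidate (stubbed)."""
    # In practice this would benchmark against gold labels, alternative
    # specifications, and adversarial prompts.
    return {"candidate": candidate, "score": 0.5, "failures": []}

def iterative_search(concept, rounds=3, keep=2):
    pool = propose_measures(concept)                    # broad exploration (guards against Type I error)
    for _ in range(rounds):
        scored = sorted((critique_measure(c) for c in pool),
                        key=lambda r: r["score"], reverse=True)
        pool = [r["candidate"] for r in scored[:keep]]  # prune weak ideas (guards against Type II error)
        pool += propose_measures(concept, n=3)          # keep exploring around the survivors
    return pool

print(iterative_search("immigration rhetoric"))
```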

This epistemological approach, or innovation workflow, creates structured redundancy. Redundancy is often seen as waste, but in high-risk systems it can improve reliability (Landau 1969). Multiple identity checks at an airport illustrate the logic: overlapping checks reduce the chance that a failure goes undetected.

Table 9.1 illustrates how Type I and Type II errors map onto the two applications, organizational measurement and experimental and audit-based research, and serves as a roadmap for the sections that follow.

Table 9.1: Type I and Type II Errors Across Applications
Application | Type I Error (missed opportunities) | Type II Error (overinvestment in weak ideas)
Organizational measurement | Feasibility constraints screen out data-intensive projects. | Early commitment to flawed measures that propagate downstream.
Experiments and audits | Underpowered or weak designs fail to detect real effects. | Contaminated or weakly validated evidence appears credible.
Role of agentic AI | Expands search by lowering feasibility and iteration costs. | Improves evaluation through iterative validation and stress testing.

9.3 Application 1: Organizational-Level Data and Measurement

9.3.1 The Problem

Research on organizations routinely confronts severe data constraints. Relevant evidence often sits in semi-structured sources outside standard datasets, including speeches and dialogues by local actors (e.g., politicians, citizens, and interest groups), oral histories, administrative records, and organizational websites (Grimmer and Stewart 2013; King, Lam, and Roberts 2017; Parthasarathy, Rao, and Palaniswamy 2019; Barari and Simko 2023; J. Y. Kim, Vries, and Han 2025). These sources are often essential for understanding what organizations offer their members, how they build networks and coalitions, how they develop and deliberate over agendas, and how they shape political and policy outcomes. The challenge is not only collecting such data but also ensuring measurement validity: whether the observations researchers assemble actually capture the concept of interest (Adcock and Collier 2001; Grimmer, Roberts, and Stewart 2022).

Through Bendor (2010)’s lens, these constraints generate two kinds of search failure. The first is feasibility-filtered omission, a Type I problem: research programs are not pursued because the required data are too costly to assemble. The second is premature commitment, a Type II problem. When data annotation is done by hand, researchers may commit too early to a single operationalization, and if that measure proves flawed, reannotation is prohibitively expensive. For instance, a team studying a political party’s immigration rhetoric might discover after six months that their coding scheme conflates economic and cultural frames. By that point, the flawed measure has already propagated into downstream work, such as data analysis and visualization. The deeper problem is not the cost of labor but a conceptual one: treating measurement as fixed rather than iterative.
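
One way to keep measurement iterative rather than fixed is to re-validate machine annotations against a small hand-coded gold standard whenever the coding scheme or the model changes. The sketch below illustrates this under assumptions: `llm_annotate` is a hypothetical stand-in for an agentic annotation call, and the 0.7 agreement threshold is illustrative rather than a recommended standard.

```python
# Sketch of iterative measurement validation: re-check machine annotations
# against a small hand-coded gold standard whenever the coding scheme or the
# model changes. llm_annotate is a hypothetical stand-in for an agent call.
from sklearn.metrics import cohen_kappa_score

def llm_annotate(texts, scheme):
    """Hypothetical agentic annotation call (stubbed)."""
    return ["economic frame" for _ in texts]

def validate_scheme(gold_texts, gold_labels, scheme, threshold=0.7):
    predicted = llm_annotate(gold_texts, scheme)
    kappa = cohen_kappa_score(gold_labels, predicted)
    if kappa < threshold:
        # Low agreement flags a flawed operationalization (e.g., a scheme that
        # conflates economic and cultural frames) before it propagates downstream.
        return {"kappa": kappa, "action": "revise scheme and re-annotate"}
    return {"kappa": kappa, "action": "proceed to downstream analysis"}
```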

9.4 Application 2: Experimental and Audit-Based Research

9.4.1 The Problem

Experiments and audit studies provide causal evidence on what works and thus inform decision-making. Specifically, experiments identify the effects of interventions on beliefs, attitudes, and actions, while audit studies identify whether otherwise comparable actors receive different responses when they vary on a socially salient group-level characteristic (Block et al. 2021; Gaddis et al. 2022; Butler and Crabtree 2017). However, these designs are resource-intensive and often difficult to implement well. Kane (2025) identifies seven pathways to erroneous null findings: respondent inattentiveness, manipulation failure, pre-treatment, small samples, poor outcome measurement, ceiling or floor effects, and countervailing effects. In our framework, this set of issues can be viewed as a Type I problem: potentially real effects are dismissed because designs are underpowered, under-piloted, or otherwise too weak to detect them. Arel-Bundock et al. (2026) show that quantitative political science research is “greatly underpowered,” routinely fielding studies that cannot detect the effects they hypothesize. The cost is not just more null results; estimates may also be unstable in sign and inflated in magnitude.
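
The simulation below illustrates the underpowering problem with assumed numbers (a true effect of 0.15 standard deviations, 2,000 simulated studies, a two-sided test): with a few hundred respondents per arm, most studies fail to detect the effect, while much larger samples reach conventional power.

```python
# Illustrative power simulation with assumed numbers: a true effect of 0.15
# standard deviations, 2,000 simulated studies, and a two-sided t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(n_per_arm, effect_sd=0.15, sims=2000, alpha=0.05):
    rejections = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(effect_sd, 1.0, n_per_arm)
        _, p = stats.ttest_ind(treated, control)
        rejections += p < alpha
    return rejections / sims

print(simulated_power(200))    # roughly a one-in-three chance of detecting the effect
print(simulated_power(1000))   # larger samples reach conventional power (~0.9)
```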

Recent developments have intensified the problem. Online panels, once central to the globalization of public opinion research (Heath, Fisher, and Smith 2005; Thomas 2024), may now be contaminated by AI agents. The scale of the problem is already substantial. About one-third of crowdworkers report using LLMs to answer open-ended survey questions (Zhang, Xu, and Alvero 2025). Recent evidence also suggests that AI-assisted responding in live surveys likely ranges from 4% to 45% (Westwood and Frederick 2026; Panizza, Kyrychenko, and Roozenbeek 2026), and that AI agents are able to fully complete surveys and pass standard response-quality checks (Westwood and Frederick 2026; Gonzalez-Rostani and Raviv 2026). In experiments powered to detect small effects, contamination can bias estimates; it creates a Type II risk by producing polished but nonhuman evidence. At the same time, the institutional basis for iteration has become more fragile. Federal support for social science research is under pressure, including proposals to eliminate the National Science Foundation’s Social, Behavioral, and Economic Sciences Directorate, the major federal funder of academic social science in the United States (Kozlov, Garisto, and Chen 2026; American Political Science Association 2026).

9.5 How Agentic AI Addresses the Problem

AI agents can improve instrument design through pretesting. Synthetic responses1 can help diagnose confusing items, detect order effects, and estimate variance for power calculations, reducing Type I errors before fielding. Agents can also generate ecologically valid materials at scale. They can produce large sets of treatment materials, such as cover letters, emails, and profiles, calibrated to specific contexts, expanding the range of feasible audit studies. Finally, agents can serve as treatments themselves, delivering personalized, persuasive messages and conducting multi-turn conversations that adapt to participants’ responses (Argyle et al. 2025; Costello, Pennycook, and Rand 2024; Crabtree et al. 2026). These tools expand the feasible design space by lowering the cost of interactive and personalized treatments.
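
As one illustration of the pretesting use, a pilot (synthetic or human) can supply a variance estimate for a standard power calculation before fielding. The sketch below uses made-up pilot values and an assumed smallest effect of interest; per the limits discussed next, a synthetic pilot can inform planning but cannot substitute for human data.

```python
# Sketch of a pre-fielding power calculation that uses pilot responses
# (synthetic or human) to estimate outcome variance. The pilot values and the
# smallest effect of interest below are illustrative assumptions.
import numpy as np
from statsmodels.stats.power import TTestIndPower

pilot_outcomes = np.array([3.1, 4.0, 2.8, 3.6, 4.4, 3.3, 3.9, 2.7])
pilot_sd = pilot_outcomes.std(ddof=1)

smallest_effect = 0.25                      # smallest effect of interest, raw outcome units
standardized_d = smallest_effect / pilot_sd

n_per_arm = TTestIndPower().solve_power(effect_size=standardized_d,
                                        power=0.8, alpha=0.05)
print(round(n_per_arm))                     # required respondents per arm before fielding
```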

Agents also have clear limits. Synthetic outputs cannot validate behavioral effects: LLM simulations reproduce stereotypical response patterns learned from training data rather than how real populations behave, so they cannot establish causal effects. Nor do they solve the Type II problem of determining whether a treatment isolates the mechanism researchers care about.

Detecting agent-generated responses faces similar limits. Adversarial agents can stress-test survey instruments, redesign quality checks, and benchmark suspicious responses (Gonzalez-Rostani and Raviv 2026), but the threshold for what constitutes a suspicious response is a moving target. The core issue is not only technical but also substantive and institutional: ensuring data authenticity and reporting quality. Researchers must assess the credibility of responses and be transparent about detection procedures. AI-assisted contamination detection is therefore a short-term fix, not a long-term solution.
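
A toy version of such benchmarking might compare open-ended answers to a set of agent-generated baseline answers and flag high lexical overlap for human review. The overlap measure and the 0.6 threshold in the sketch below are illustrative assumptions, and, as noted, any fixed threshold is a moving target.

```python
# Toy benchmark of open-ended answers against agent-generated baselines using
# lexical (Jaccard) overlap. The measure and the 0.6 threshold are illustrative
# assumptions; flagged responses go to human review, not automatic exclusion.

def token_set(text):
    return set(text.lower().split())

def max_overlap(response, llm_baselines):
    """Highest Jaccard similarity between a response and any baseline answer."""
    r = token_set(response)
    scores = [len(r & token_set(b)) / len(r | token_set(b))
              for b in llm_baselines if r | token_set(b)]
    return max(scores, default=0.0)

def flag_suspicious(responses, llm_baselines, threshold=0.6):
    return [resp for resp in responses
            if max_overlap(resp, llm_baselines) >= threshold]
```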

9.6 Conclusion: What These Tools Can and Cannot Do

Agentic AI offers genuine gains by compressing the exploratory and evaluative loops through which empirical social science advances. Across both applications, the clearest benefit is reducing Bendor (2010)’s two search errors: agents reduce Type I error by expanding what can be tested and reduce Type II error through criticism, benchmarking, and adversarial stress-testing. The gains are clearest where feasibility constraints have historically filtered out research programs, as in organizational measurement, and where underpowering or high resource demands are the prevalent failure modes, as in experimental design.

Several limits bear emphasis. First, LLMs default to agreement (acquiescence bias). The criticism function works only if agents are instructed to find failures; without that adversarial structure, cheaper exploration can simply generate more Type II errors by letting polished but weak outputs survive longer than they should. Second, silicon samples are not behavioral pilots. They cannot establish treatment effects in real populations, and using them as if they could systematically favors designs that match the AI’s priors. Third, AI-generated stimuli carry hidden priors. These can undermine construct validity and require experimental decomposition to identify which bundled features drive effects. Fourth, faster pipelines do not by themselves guarantee better measures: without adversarial validation, source preservation, model versioning, and human review, researchers may simply produce opaque or weakly grounded measures more efficiently. More generally, models change, so pipelines must be versioned because results may not replicate across configurations (Bisbee et al. 2024).
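
A minimal versioning practice is to store a small metadata record (model name and version, a hash of the prompt, decoding settings, and a timestamp) alongside every batch of outputs. The field names in the sketch below are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of pipeline versioning: store a metadata record alongside
# every batch of model outputs so results can be traced to a configuration.
# The field names are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone

def run_metadata(model_name, model_version, prompt, temperature):
    return {
        "model": model_name,
        "model_version": model_version,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "temperature": temperature,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }

print(json.dumps(run_metadata("example-model", "2026-01-15",
                              "Classify the frame of this statement ...", 0.0),
                 indent=2))
```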

The key danger is over-reliance on AI at the expense of human decision-making. Social science could become a loop in which AI generates treatments, tests them on synthetic respondents, and analyzes results. The gains described here do not require this path, and we do not consider it ideal. Agentic AI is most useful when it augments human judgment by handling mechanical and iterative tasks. Researchers can and should make more judgment calls, not fewer. Design decisions, theoretical interpretations, and validation against real-world behavior remain human responsibilities.


  1. For a review of synthetic samples, or silicon sampling, see Argyle et al. (2023), Horton, Filippas, and Manning (2026), and Sun et al. (2024).