9 Failing Faster, Learning Better: Agentic AI and Empirical Social Science Research
Charles Crabtree, School of Social Sciences, Monash University and K-Club Professor, University College, Korea University
Valentina Gonzalez-Rostani, Department of Political Science and International Relations, University of Southern California
Jae Yeon Kim, Department of Public Policy, University of North Carolina at Chapel Hill (jaekim@unc.edu)
(All authors contributed equally. The names are listed alphabetically by last name.)
Abstract: How can social scientists generate more ideas, test more mechanisms, and fail more productively? We argue that agentic AI offers a qualitative shift in researchers’ capacity for iterative search. Drawing on Bendor’s (2010) framework of innovation as a search problem that balances creativity (broad exploration) and criticism (rigorous evaluation), we organize the paper around two recurring search failures: Type I error, in which promising ideas are screened out before they are adequately explored, and Type II error, in which researchers overinvest in weak ideas. We discuss two applications. In organizational-level measurement research, agentic AI lowers the feasibility barrier to exploring new projects and automates iterative validation, catching flawed measures before they enter the downstream workflow. In experimental and audit-based research, the prevalent error is Type I error from underpowered, under-piloted designs. Agentic AI enables power diagnosis and stress-testing at scale. We are explicit about what these tools cannot do: silicon samples cannot validate treatment effects in real humans; LLMs default to agreement; AI-generated stimuli carry hidden priors and risk bundling effects; and using AI to defend against AI contamination raises new problems. Agentic AI is best used to scaffold upstream iteration in design, measurement, and validation. It does not replace human judgment.
AI usage statement: TBD
“Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability; but there are degrees of testability: some theories are more testable, more exposed to refutation, than others; they take, as it were, greater risks.”
Conjectures and Refutations, Popper (1963, 36)
“And policy analysts, in deciding whether to use a finding from behavioral decision theory, are in the FDA’s position: they can be either too patient, and fail to apply an idea that is ready, or too impatient, and apply something prematurely.”
Bounded Rationality and Politics, Bendor (2010, 181)
9.1 Introduction: Agentic AI as a Tool for Iterative Search
What is the value of artificial intelligence (AI) for empirical social science? The most common answers emphasize automation, such as annotating text, or generation, such as drafting survey items. These are important contributions, but they miss AI’s most consequential value: intensifying the iterative search process through which scientific knowledge is produced. Empirical social science advances through repeated cycles of drafting, testing, diagnosing, and revising rather than one-shot discovery. In this chapter, we argue that AI agents can expand the scale, speed, and depth of that process.
In this chapter, we use the term “AI agents” to refer to systems that autonomously manage multi-step workflows, invoke external tools, maintain state across steps, and revise their behavior in light of intermediate results. They rely not only on large language models (LLMs) but also on a broader architecture in which prompts set roles and limits, memory preserves context, control modules coordinate actions and tool use, and evaluators assess outputs (Acharya, Kuppan, and Divya 2025; Zhou et al. 2024). Compared with one-shot LLM use, in which a researcher poses a query and receives a single response, agentic systems can sustain sequences of planning, tool use, evaluation, and revision. These systems interpret high-level goals, decompose them into steps, invoke external tools such as web browsers, code interpreters, and application programming interfaces (APIs), and update their approach as new information emerges (Hosseini and Seilani 2025; Bail 2024). The key property is iterative feedback.

This distinction matters because many of the most consequential stages of research occur upstream of final inference. Measurement schemes must be drafted and tested; experimental stimuli must be piloted and revised; designs must be stress tested before fielding. Each of these loops has traditionally been constrained by the time, resources, and cognitive bandwidth of research teams. Agentic AI compresses the mechanical work between human judgment calls at a fraction of the cost, allowing researchers to explore more alternatives, identify weaknesses earlier, and refine designs more systematically. The argument we make is about cheaper and more extensive upstream iteration in design, measurement, and validation, not about looser standards for final inference. We develop this claim through two applications and organize the gains around Bendor’s (2010) framework of iterative search.
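To make the iterative feedback loop concrete, the sketch below is a deliberately minimal, toy rendering of the plan–act–evaluate–revise cycle described above. The `plan`, `run_tool`, and `evaluate` functions are hypothetical placeholders standing in for an LLM planner, external tools, and an evaluator module; this is not a real agent framework.

```python
# Toy sketch of an agentic feedback loop: plan, act with a tool, evaluate,
# revise. Every component here is a placeholder, not a real implementation.

def plan(goal, memory):
    """Decompose a high-level goal into the next concrete step."""
    return {"step": len(memory), "goal": goal}

def run_tool(step):
    """Stand-in for an external tool call (browser, code interpreter, API)."""
    return f"result of step {step['step']} toward: {step['goal']}"

def evaluate(result, memory):
    """Stand-in evaluator: declare 'good enough' after three results."""
    return len(memory) >= 3

def agent(goal, max_steps=10):
    memory = []  # state preserved across steps
    for _ in range(max_steps):
        step = plan(goal, memory)     # decompose the goal
        result = run_tool(step)       # invoke an external tool
        memory.append(result)         # maintain state across steps
        if evaluate(result, memory):  # revise or stop based on feedback
            break
    return memory

trace = agent("draft and validate an annotation scheme")
print(len(trace))  # number of plan-act-evaluate iterations
```

The point of the sketch is the control flow, not the placeholders: unlike one-shot LLM use, the loop lets intermediate results change what happens next.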
9.2 Innovation as Iterative Search: The Bendor Framework
Scientific research is not about confirmation, but about increasing falsifiability. In this framework, the more testable a theory is, the better (Popper 1963). Jonathan Bendor (1950–2025), the late professor of political economy at Stanford’s Graduate School of Business, extends Popper’s framework of scientific discovery to organizational problem solving (Bendor 1995, 2010, 2015). Building on Simon’s (1947) work on bounded rationality and Lindblom’s (1959) work on disjointed incrementalism, he argues that innovation arises from balancing two risks: dismissing promising ideas before adequate exploration (Type I error) and overinvesting in weak ideas (Type II error) (Johnston 2015). Because these risks trade off, effective search requires both creativity and criticism. Progress depends on the iterative interplay between exploration and exploitation.
Agentic AI compresses creativity and criticism into a streamlined process. It reduces the cost of local iteration. On the exploration side, agents lower the marginal cost of trying new measurement strategies and testing alternative designs, helping researchers avoid the Type I error of dismissing promising ideas before they are adequately explored. On the evaluation side, they automate checks against benchmarks, across specifications, and under adversarial prompting, which helps guard against the Type II error of overinvesting in polished but weak ideas.
This epistemological approach, or innovation workflow, creates structured redundancy. Redundancy is often seen as waste, but it can improve reliability in high-risk systems (Landau 1969). Multiple identity checks at an airport illustrate this logic. Overlapping checks reduce the chance that a failure goes undetected.
Table 9.1 illustrates how Type I and Type II errors map onto the two applications, organizational measurement and experimental and audit-based research, and serves as a roadmap for the sections that follow.
| Application | Type I Error (missed opportunities) | Type II Error (overinvestment in weak ideas) |
|---|---|---|
| Organizational measurement | Feasibility constraints screen out data-intensive projects. | Early commitment to flawed measures that propagate downstream. |
| Experiments and audits | Underpowered or weak designs fail to detect real effects. | Contaminated or weakly validated evidence appears credible. |
| Role of agentic AI | Expands search by lowering feasibility and iteration costs. | Improves evaluation through iterative validation and stress testing. |
9.3 Application 1: Organizational-Level Data and Measurement
9.3.1 The Problem
Research on organizations routinely confronts severe data constraints. Relevant evidence often sits in semi-structured sources outside standard datasets, including speeches and dialogues by local actors (e.g., politicians, citizens, and interest groups), oral histories, administrative records, and organizational websites (Grimmer and Stewart 2013; King, Lam, and Roberts 2017; Parthasarathy, Rao, and Palaniswamy 2019; Barari and Simko 2023; J. Y. Kim, Vries, and Han 2025). These sources are often essential for understanding what organizations offer their members, how they build networks and coalitions, how they develop and deliberate over agendas, and how they shape political and policy outcomes. The challenge is not only in collecting such data, but also in ensuring measurement validity: whether the observations researchers assemble actually operationalize the concept of interest in a valid way (Adcock and Collier 2001; Grimmer, Roberts, and Stewart 2022).
Through Bendor’s (2010) lens, these constraints generate two kinds of search failure. The first is feasibility-filtered omission, a Type I problem in Bendor’s sense. This failure occurs when research programs are not pursued because the required data are too costly to assemble. The second is premature commitment, which can be thought of as a Type II problem. When data annotation is done by hand, researchers might commit too early to a single operationalization. If that measure proves flawed, reannotation is prohibitively expensive. For instance, a team studying a political party’s immigration rhetoric might discover after six months that their scheme conflates economic and cultural frames. By that point, the flawed measure has already propagated into downstream work, such as data analysis and visualization. The deeper problem is not labor cost, but conceptual: treating measurement as fixed rather than iterative.
9.3.2 How Agentic AI Expands and Sharpens the Search
Four agentic capabilities matter most for enabling researchers to explore more directions quickly, making their research programs more testable and thus more falsifiable.
First, agents automate data collection at scale. Unlike conventional, rule-based web scrapers, they can navigate diverse websites, interpret semi-structured documents, and adapt to layout variations by combining search, browsing, and extraction in a single workflow (e.g., Nakano et al. 2022; Ahluwalia and Wani 2024). This capability reduces Type I error by making data-intensive, theoretically relevant measurement strategies feasible: agents handle the manual labor of discovery and hand off extracted material for downstream annotation with minimal human intervention.
Second, these systems can work with non-text and semi-structured materials, including PDFs, scanned documents, and images, which are often difficult to process with OCR and conventional tools (Xu et al. 2020; G. Kim et al. 2022). After searching, an AI agent can extract information from these sources within a single workflow. This capability expands the range of feasible data-intensive social science projects.
Third, agents enable multilingual classification and annotation with accuracy comparable to or exceeding human coders (Egami et al. 2026; Ziems et al. 2024). This capability makes comparative research across languages more tractable. It also allows researchers to take their original documents and quickly translate them for other audiences and formats, such as an interactive dashboard of public comments on a regulatory issue.
Fourth, agents support iterative refinement. They can propose multiple annotation schemes, evaluate them against hand-annotated benchmarks, revise them, and automatically reapply them. This iterative process reduces premature commitment and improves concept–measure alignment.
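As a toy illustration of the evaluation step in that loop, the sketch below scores one candidate scheme’s machine-applied labels against a small hand-coded benchmark using percent agreement and Cohen’s kappa. The labels (economic vs. cultural frames, echoing the example above) and the data are invented for illustration; a real pipeline would repeat this comparison over multiple candidate schemes and revise the weakest.

```python
# Illustrative check of a candidate annotation scheme against a hand-coded
# benchmark: percent agreement plus Cohen's kappa (chance-corrected).
from collections import Counter

def cohens_kappa(human, model):
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    ph, pm = Counter(human), Counter(model)
    expected = sum(ph[c] * pm[c] for c in set(human) | set(model)) / n**2
    return (observed - expected) / (1 - expected)

# Invented toy data: "econ" vs. "cult" frame labels on eight documents.
human = ["econ", "cult", "econ", "cult", "econ", "econ", "cult", "econ"]
model = ["econ", "cult", "econ", "econ", "econ", "econ", "cult", "cult"]

agreement = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohens_kappa(human, model)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa matters here because raw agreement can look high on imbalanced label sets even when the scheme adds little beyond chance, which is exactly the kind of polished-but-weak measure the criticism step should catch.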
More exploration should be paired with stricter tests and safeguards. LLMs tend to agree when asked whether an annotation scheme is valid, so validation pipelines should be structured adversarially, with some agents building the pipeline while others identify failures, propose alternatives, and demonstrate conditions under which findings collapse. As models approach or exceed human coding performance on some tasks, researchers should not rely only on agreement with a single hand-coded benchmark; they should also assess construct validity, check whether errors cluster in substantively important cases, and, where feasible, compare results across models. Even when exact reproducibility is difficult with closed systems, workflows can still be made auditable through source logs, prompt records, model versions, and human review. AI-orchestrated criticism of this kind can prevent exploratory gains from becoming Type II errors.
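One way to implement the auditability described above is an append-only log that records, for each model call, the prompt, the model version, the source document, and a hash of the output, so a run can be reviewed even when the underlying model is closed. The sketch below is a minimal illustration; the field names and the model version string are hypothetical, not a standard format.

```python
# Minimal audit-trail sketch for an AI-assisted measurement pipeline.
# Field names are illustrative assumptions, not an established schema.
import datetime
import hashlib
import json

def log_step(log, prompt, model_version, source, output):
    """Append one auditable record per model call."""
    log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,   # pin versions: results can drift
        "prompt": prompt,                 # exact prompt used
        "source": source,                 # provenance of the input document
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
    })

audit_log = []
log_step(audit_log,
         prompt="Classify the frame of this passage: econ vs. cult",
         model_version="model-v1 (hypothetical)",
         source="party_press_release_017.txt",
         output="econ")
print(json.dumps(audit_log, indent=2))
```

Hashing the output rather than (or in addition to) storing it keeps the log compact while still letting a reviewer verify that archived outputs were not altered after the fact.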
9.4 Application 2: Experimental and Audit-Based Research
9.4.1 The Problem
Experiments and audit studies provide causal evidence on what works and thus inform decision-making. Specifically, experiments identify the effects of interventions on beliefs, attitudes, and actions, while audit studies identify whether otherwise comparable actors receive different responses when they vary on a socially salient group-level characteristic (Block et al. 2021; Gaddis et al. 2022; Butler and Crabtree 2017). However, these designs are resource-intensive and often difficult to implement well. Kane (2025) identifies seven pathways to erroneous null findings: respondent inattentiveness, manipulation failure, pre-treatment, small samples, poor outcome measurement, ceiling and/or floor effects, and countervailing effects. In our framework, this set of issues can be viewed as a Type I problem: potentially real effects are dismissed because designs are underpowered, under-piloted, or otherwise too weak to detect them. Arel-Bundock et al. (2026) show that quantitative political science research is “greatly underpowered,” routinely fielding studies that cannot detect the effects they hypothesize. The cost is not just more null results; estimates may also be unstable in sign and inflated in magnitude.
Recent developments have intensified the problem. Online panels, once central to the globalization of public opinion research (Heath, Fisher, and Smith 2005; Thomas 2024), may now be contaminated by AI agents. The scale of the problem is already substantial. About one-third of crowdworkers report using LLMs to answer open-ended survey questions (Zhang, Xu, and Alvero 2025). Recent evidence also suggests that AI-assisted responding in live surveys likely ranges from 4% to 45% (Westwood and Frederick 2026; Panizza, Kyrychenko, and Roozenbeek 2026), and that AI agents are able to fully complete surveys and pass standard response-quality checks (Westwood and Frederick 2026; Gonzalez-Rostani and Raviv 2026). In experiments powered to detect small effects, contamination can bias estimates; it creates a Type II risk by producing polished but nonhuman evidence. At the same time, the institutional basis for iteration has become more fragile. Federal support for social science research is under pressure, including proposals to eliminate the National Science Foundation’s Social, Behavioral, and Economic Sciences Directorate, the major federal funder of academic social science in the United States (Kozlov, Garisto, and Chen 2026; American Political Science Association 2026).
9.5 How Agentic AI Addresses the Problem
AI agents can improve instrument design through pretesting. Synthetic responses can help diagnose confusing items, detect order effects, and estimate variance for power calculations, reducing Type I errors before fielding. Agents can also generate ecologically valid materials at scale. They can produce large sets of treatment materials, such as cover letters, emails, and profiles, calibrated to specific contexts, expanding the range of feasible audit studies. Finally, agents can serve as treatments themselves, delivering personalized, persuasive messages and conducting multi-turn conversations that adapt to participants’ responses (Argyle et al. 2025; Costello, Pennycook, and Rand 2024; Crabtree et al. 2026). These tools expand the feasible design space by lowering the cost of interactive and personalized treatments.
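Simulation-based power diagnosis of the kind described here can be sketched in a few lines: take an outcome standard deviation estimated from pilot or synthetic pretest responses, posit an effect size, simulate many two-arm experiments, and count how often a simple difference-in-means test rejects. The effect sizes and sample sizes below are illustrative assumptions, not recommendations.

```python
# Hedged sketch of simulation-based power diagnosis for a two-arm design.
# The SD would come from pilot or synthetic pretest data; numbers here are
# illustrative only.
import math
import random
import statistics

def simulated_power(effect, sd, n_per_arm, sims=2000, seed=0):
    """Share of simulated experiments in which |z| > 1.96 (alpha = .05)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        control = [rng.gauss(0, sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        diff = statistics.mean(treated) - statistics.mean(control)
        se = math.sqrt(statistics.variance(control) / n_per_arm
                       + statistics.variance(treated) / n_per_arm)
        if abs(diff / se) > 1.96:  # normal approximation to the t-test
            rejections += 1
    return rejections / sims

# A small hypothesized effect (0.2 SD) with 100 respondents per arm is
# badly underpowered, which is exactly the Type I failure mode above.
print(simulated_power(effect=0.2, sd=1.0, n_per_arm=100))
```

Running the same function over a grid of effect sizes and sample sizes before fielding makes the underpowering diagnosis explicit instead of discovering it after the fact.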
Agents also have clear limits. Synthetic outputs created by agents cannot validate behavioral effects. LLM simulations reproduce stereotypical responses from training data, so they do not establish causal effects. They also do not solve the Type II problem of determining whether treatments isolate the mechanism researchers care about.
Detecting agent-generated responses faces similar limits. Adversarial agents can stress-test survey instruments, redesign quality checks, and benchmark suspicious responses (Gonzalez-Rostani and Raviv 2026), but the threshold for what constitutes a suspicious response is a moving target. The core issue is not only technical but also substantive and institutional: data authenticity and reporting quality. Researchers must assess the credibility of responses and be transparent about detection procedures. AI-assisted contamination detection is therefore a short-term fix, not a long-term solution.
9.6 Conclusion: What These Tools Can and Cannot Do
Agentic AI offers genuine gains by compressing the exploratory and evaluative loops through which empirical social science advances. Across both applications, the clearest benefit is reducing Bendor’s (2010) two search errors: agents reduce Type I error by expanding what can be tested and reduce Type II error through criticism, benchmarking, and adversarial stress-testing. The gains are clearest where feasibility constraints have historically filtered out research programs, as in organizational measurement, and where underpowering or high resource demands are the prevalent failure modes, as in experimental design.
Several limits bear emphasis. First, LLMs default to agreement (acquiescence bias). The criticism function only works if agents are instructed to find failures. Without that adversarial structure, cheaper exploration can simply generate more Type II errors by letting polished but weak outputs survive longer than they should. Second, silicon samples are not behavioral pilots. They cannot establish treatment effects in real populations, and using them as if they could systematically favors designs that match the AI’s priors. Third, AI-generated stimuli carry hidden priors. These can undermine construct validity and require experimental decomposition. Fourth, faster pipelines do not, per se, guarantee better measures: without adversarial validation, source preservation, model versioning, and human review, researchers may simply produce opaque or weakly grounded measures more efficiently. More generally, models change, so pipelines must be versioned because results may not replicate across configurations (Bisbee et al. 2024).
The key danger is over-reliance on AI at the expense of human decision-making. Social science could become a loop in which AI generates treatments, tests them on synthetic respondents, and analyzes results. The gains described here do not require this path, and we do not consider it ideal. Agentic AI is most useful when it augments human judgment by handling mechanical and iterative tasks. Researchers can and should make more judgment calls, not fewer. Design decisions, theoretical interpretations, and validation against real-world behavior remain human responsibilities.