13 Artificial Intelligence and the Measurement Imperative
Rehan Mirza, Annenberg School for Communication, University of Pennsylvania.
Sandra González-Bailón, University of Pennsylvania, sandra.gonzalez.bailon@asc.upenn.edu.
Abstract: The social sciences emerged from the sweeping transformations triggered by the Industrial Revolution. The quantification of social life became a tool to understand societal change and propose evidence-based reforms. Here we argue that in the AI era measurement is even more crucial to the social science mission, not just to document transformations and guide social reform, but also to define human benchmarks and boundary markers for domains that should not be reduced to quantities. The choice of which measures should guide the research workflow must remain fundamentally human, or we risk losing control over the priorities that shape the pursuit of knowledge for the benefit of all.
AI usage statement: TBD
13.1 Introduction
It has become commonplace to compare the rise of AI with the Industrial Revolution. The latter took a hundred years to unfold fully, while the former is taking shape on the scale of a decade. But there is one key aspect that both eras have in common: the depth of the social transformations triggered by technological breakthroughs. Spinning machines and steam engines sent ripples through every layer of society, affecting how we work, live, communicate, and organize; so do deep learning and the many models it helps create: there is no domain of human activity that will remain untouched by AI technologies. The two eras share another similarity: the role the social sciences play in making sense of these transformations. Beatrice Webb, a sociologist studying the consequences of the Industrial Revolution (and one of the founders of the London School of Economics in 1895), made a point that still resonates: you cannot improve what you cannot measure.
Today, the measurement imperative is equally central to the social science mission. It is a requirement if we are to assess the social impact of AI in the hope of spurring change. Webb and her contemporaries campaigned for evidence-based reform to improve the living conditions of the working class, whose lives had been radically transformed by the automation of labor (Webb 1926). Social research must once again establish the key facts about how AI will reshape labor markets, institutions, and human behavior if we are to keep pace with the disruption and regulate its deployment based on data. But if numerical measurement is needed for credible research (to demonstrate impact, quality, truth, or progress), in this AI era we also need measurement for another, more fundamental, reason: to define human benchmarks.
Measurement plays a dual role: it enables credible research, which we can conduct using AI, and it helps us define the human capabilities against which AI is evaluated (both in research and in other domains). The definition of benchmarks helps us decide what to automate and what to preserve for human mastery. These benchmarks are implicit in our answers regarding what counts as intelligence, what stands as good judgment, or what it means to be connected to one another when machines increasingly do the thinking and the judging. We cannot determine whether machines operate as intended without performance measures. As of today, the most consequential measures of performance are in the hands of the companies developing the AI models. These companies define their success using metrics they select for themselves.
13.2 The Double Edge of Measurement
In science, it is often said that you do not understand what you cannot measure. In the social sciences, we recognize that many human concepts (e.g., creativity, consciousness, social cohesion, moral judgment) resist quantification. And yet we aspire to find better proxies, if only to delimit the space of what we consider uniquely human territory. We also know that measurement often changes behavior: when a measure is used as an indicator for decision-making, it soon ceases to be a good measure (Campbell 1979). The reason is that incentives realign around the metric itself, rather than the original goal. Social media optimizing for engagement is a prominent example of how distorting this phenomenon can be. Engagement was originally conceived as a proxy for the quality of the content, but it also served commercial interests, so it ended up capturing time spent on platforms rather than quality.
There are other examples closer to home: our measures of academic productivity will soon lose their meaning, as papers that once took months (and much human mastery) to produce can now be generated in a matter of days with the help of automated pipelines. We forget that productivity was never the point. The point was the learning, the craft, and the development of shared structures of understanding. Systems that optimize for productivity risk the same fate as social media platforms that optimize for engagement: the metric improves while the underlying quality collapses. This is evidenced by the spike in hallucinated references reported in recent scientific research (Topaz et al. 2026; Zhao et al. 2026).
The social sciences have always paid special attention to the institutional arrangements that sustain knowledge creation. AI is reconfiguring those arrangements in ways that force us to rethink what we measure and for what purpose. The question of whose knowledge is treated as authoritative will be key, especially given the opacity that surrounds the development of AI models. One purpose of measurement is to bring transparency to decision-making by making choices more visible, comparable, and contestable. Another purpose is to control, shape, and constrain choices. The emerging social sciences took it upon themselves to fulfill the first purpose at a time of social disruption and lack of accountability. Today the social sciences must reclaim measurement from private interests and their drive to control.
13.3 AI and the Research Workflow
It is already common to insert AI tools into the different stages of the research workflow to augment scientists’ capabilities. Our argument is that deciding what to measure remains the most important stage of the research process. It has always been key, but in the AI era it is decisive: losing control over what we measure means losing control over our knowledge priorities. The good news is that we can now unleash our research imagination with less anxious attention to resource constraints. Feasibility always imposes a ceiling on what can be achieved, but now the proverbial genie is out of the bottle, ready to grant our wishes. Thanks to AI, sources of information that were never seen as data can now be treated as such. Two examples: video footage of urban spaces and relational data buried in text.
The first example is illustrated by a recent paper that makes use of computer vision and deep learning to analyze video footage of public spaces in three US cities during the late 1970s and the late 2000s (Salazar-Miranda et al. 2025). One of the high-level findings of this research is that city streets are acting as corridors for movement rather than as spaces for social interaction, which has implications for how we design public spaces and how we can foster social connections and community-building. The second example comes from another recent paper that uses LLMs to extract social network data from the acknowledgement sections of hundreds of thousands of published articles (Danús et al. 2026). These informal ties create structures of support and a type of social capital that influences the process of knowledge generation and dissemination. Science, after all, is a collective endeavor, and mapping how scientists self-organize can help foster more effective forms of collaboration.
AI tools augment our ability to measure intangible social facts like the use of public spaces or the social-support structures we depend on, facts we could not easily quantify before these tools were available. Early social science struggled because of this central obstacle, “the sheer difficulty of obtaining numerical information on social topics” (Lazarsfeld 1961). Even as quantitative methods grew more sophisticated, it remained true that not everything that could be counted actually counted for much (Cameron 1963). Because so few things of interest were measurable at scale, quantitative social science overused the same metrics, exploiting known data sources and consequently engaging in too little exploration. Now, AI tools allow us to push into uncharted territory, expanding both what counts and what can be counted. These new capabilities will be essential as social research studies the very transformations AI is producing.
13.4 Benchmarks in AI Research
When AI is used not to uncover structural facts but to infer aspects of human skill or behavior, benchmarking becomes necessary. Benchmarks combine data and metrics to define tasks and capabilities: they offer a shared reference point to compare different models. In the case of AI, the comparison of interest is algorithmic performance vis-à-vis human behavior. Benchmarks allow us to track progress and reveal errors. But to judge whether AI performs at human levels, we first need to establish where those levels actually lie. The question here is not just where to find the right data source and how best to extract meaningful measures from new sources. The key element is how we define the tasks and capabilities of interest. A research priority for the social sciences in the AI era is to develop a framework that clearly specifies which dimensions are valid to measure in which contexts.
In developing this framework, a familiar reminder applies: the map is never the territory. The metrics we use to define tasks and capabilities are always abstractions, partial representations of the far richer human experience. Abstraction is a fundamental part of the research process, but it works only if social scientists retain the authority to decide which representations are appropriate for a given domain. This also includes deciding which aspects of human life should resist simplification, such as the inherent messiness of democratic practice. Again, not everything that counts can be counted.
There are at least two types of benchmarks social science will need to help curate and oversee as AI developments progress: behavioral benchmarks and knowledge benchmarks.
One example of a behavioral benchmark applies game theory to investigate the human traits learned by LLMs (Mei et al. 2024). Over the years, tens of thousands of human subjects in dozens of countries have participated in laboratory experiments examining strategic interactions formalized through game theory. Machines can now play those games too. Relative to machines, humans display greater heterogeneity and are more difficult to predict. This type of reference point is crucial when discussing the deployment of agentic AI across social contexts.
One example of a knowledge benchmark is the recent compilation of expert-level academic questions that help assess AI capabilities and limitations (Phan et al. 2026). This type of benchmark relies on a collaborative effort to compile exam questions that require both text and image comprehension across a range of subjects, from math, physics, and biology to the humanities and the social sciences. Unlike other existing benchmarks used by AI companies, which are already saturated (and therefore of little use for measuring performance, since scores already cluster at the top), the expert-led benchmark reveals that LLMs consistently have very low accuracy. In other words, when experts define the metrics of success, the distance between frontier LLM capabilities and human expertise is undeniably large.
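The contrast between a saturated benchmark and an expert-led one can be made concrete with a minimal sketch in Python. It treats a benchmark as data (an answer key) plus a metric (accuracy) and reports how much room is left above the best model. Everything here is an illustrative assumption: the answer key, the model names, and the model outputs are invented toy values, not results from any of the studies cited above, and the helper names `accuracy` and `summarize` are hypothetical.

```python
def accuracy(predictions, answer_key):
    """Metric: share of benchmark items a model answers correctly."""
    if len(predictions) != len(answer_key):
        raise ValueError("one prediction is required per benchmark item")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


def summarize(benchmark_outputs, answer_key):
    """Score every model, then report headroom (distance from the best
    score to a perfect 1.0) and spread (best score minus worst score)."""
    scores = {
        name: accuracy(preds, answer_key)
        for name, preds in benchmark_outputs.items()
    }
    best, worst = max(scores.values()), min(scores.values())
    return {"scores": scores, "headroom": 1.0 - best, "spread": best - worst}


# Toy answer key with ten multiple-choice items (invented for illustration;
# the same key is reused for both benchmarks only to keep the sketch short).
ANSWER_KEY = list("abcdabcdab")

# Hypothetical outputs on a saturated benchmark: every model is near the top.
saturated_outputs = {
    "model_x": list("abcdabcdab"),
    "model_y": list("abcdabcdaa"),
    "model_z": list("abcdabcdab"),
}

# Hypothetical outputs on an expert-written benchmark: accuracy is low.
expert_outputs = {
    "model_x": list("acbaddacbb"),
    "model_y": list("babcbadcbb"),
    "model_z": list("abcdbadcba"),
}

saturated = summarize(saturated_outputs, ANSWER_KEY)
expert = summarize(expert_outputs, ANSWER_KEY)

for label, result in (("saturated", saturated), ("expert-led", expert)):
    print(label, result["scores"],
          "headroom=%.2f" % result["headroom"],
          "spread=%.2f" % result["spread"])
```

In this toy setup the saturated benchmark leaves no headroom and barely separates the models, while the expert-written items leave most of the scale unused. The design choice that matters is not the metric, which is identical in both cases, but who writes the items.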
These types of benchmarks rely on focused measures of technical knowledge and reasoning, with a stronger representation of math and STEM subjects. Existing benchmarks are STEM-dominated partly because that is the disciplinary composition of the teams building the AI systems. Social scientists need a greater role in defining these reference points if we are to meaningfully assess AI agents: their capabilities and failures not only in research workflows but in social life more broadly.