Figure 8.1: The core code (make_figure()) depends on the Plotter interface; the ggplot2 and matplotlib adapters implement it.

flowchart TB
    plotter["**«interface»**<br/>**Plotter**<br/>+ plot()"]
    app["**make_figure()**<br/>uses Plotter"]
    ggplot["**ggplot2 adapter**<br/>+ plot()"]
    mpl["**matplotlib adapter**<br/>+ plot()"]
    app --> plotter
    ggplot --> plotter
    mpl --> plotter
    classDef box fill:#ffffff,stroke:#111111,stroke-width:1px,color:#111111;
    classDef iface fill:#eeeeee,stroke:#111111,stroke-width:2px,color:#111111;
    class app,ggplot,mpl box;
    class plotter iface;
8 Research Transparency and Collaboration in the Era of Generative AI Requires Open, Clean Code
Chris Cochrane, University of Toronto, christopher.cochrane@utoronto.ca
Michael Cowan, University of Toronto
Abstract: TBD
AI usage statement: TBD
8.1 Introduction
A political science student in 2008 might have spent three months writing the code to download the transcripts of parliamentary debates, decode their cryptic XML structure, extract relevant data, and convert them into a tabular data structure for analysis. The same task today, with generative AI in the loop, takes an hour (Lee et al. 2025). Generative AI loosens constraints on the capacity of social scientists to model and test their understandings of social phenomena. Technical knowledge is more accessible than ever (Zhang 2025). Tasks which once took months of effort to code and debug can now be completed in an afternoon. As a result of these transformations, a greater share of what is possible to compute in theory is now possible to compute in practice.
Beyond a certain point, “quantity has a quality of its own.” As the size and complexity of projects increase, there is a risk that generative AI will do to codebases what cars did to cities—i.e., create sprawl that can only be navigated by the very tool that enabled the sprawl in the first place. Like cars, cell phones, email, and now generative AI, innovations that are initially liberating can generate utter dependency as environments reorganize around them (Sparrow, Liu, and Wegner 2011). This poses several challenges, not least to research transparency, which requires open access of the methods and code used in the research (Alexander 2023). If we are to maintain research transparency and enable collaboration in projects at the scale and complexity of what generative AI enables, quantitative social scientists will need to adopt not just open code principles, but also programming practices suitable for large projects. Open code is necessary but insufficient for research transparency. Increasingly, research transparency requires code that is transparent and clean.
This note introduces a set of programming principles for large, complex codebases. Aside from their intrinsic benefits, we believe these principles are now integral to research transparency in the growing number of projects that, thanks to generative AI, are scaling massively, as they should. We conclude with a discussion of the costs of implementing these principles and of how the approach can be extended to integrate other design principles, such as ports-and-adapters architecture (Cockburn 2005) and domain-driven design (Evans 2004; Vernon 2013).
8.2 Programming Principles
A common scripting style builds code around specific data. The first lines of code may load a specific dataset. The next lines clean up its column names. The lines after that create a multi-item measure from the cleaned-up columns, and so on. Other data is loaded, cleaned, processed, and merged with the first dataset using the appropriate column names, and the process continues through the analysis and visualization stages, where the code calls functions from specific modules like dplyr and ggplot2 in R, or pandas and matplotlib in Python. Functions are reused and repurposed throughout the codebase, often by appending parameters to accommodate each new need. A clean_data() function written for one source of data grows to include optional parameters required for other data sources, such as survey_year, then country, province, state, and parliament_number, and so on. To make matters worse, the names of functions and parameters may be idiosyncratic, like ClnDta, syr, nat, and parl, which become unmanageable as the codebase grows, even with detailed annotation, which further clutters the codebase. This approach is efficient for small projects, but it becomes unwieldy at scale because the lines of code become coupled to the data, the sequence, the modules, and each other. This is why it is sometimes called “spaghetti code.” It is impossible to pull on one strand without pulling on all the other ones too (Martin 2009, 184).
Tightly coupled code is hard to reuse across projects because it is, by definition, coupled to the specific codebase for which it was written. It is also hard to modify and maintain at scale. With coupled code, each fix becomes more difficult to implement, both because the same change may need to be made in multiple places and because each change ripples through other parts of the code that depended on the original formulation. The same dynamic complicates collaboration. If one person changes a line of code that another person’s work depends on, or if a single fix requires changes in multiple places, the risk of conflict and error rises with the number of people, or AI agents, on the project (Brooks 1995). Eventually, the code becomes unmanageable as collaborators and external reviewers struggle to understand how the parts of the code fit together and what each block is for. Reading such a codebase is like being thrust into the cockpit of a high-tech plane where every button does at least three things, but the pilot can only ever know what one of those things is. The natural reaction is “do not touch.”
In response to the problem of coupled, unmaintainable code in software development, Martin (2002) articulated a set of programming principles that have become standard in software engineering and that generative AI now makes relevant to quantitative social science. In contrast to building code “upwards” from specific data and then layering measurement, modeling and visualization on top, Martin’s approach builds “downwards” from the abstract ideas the project tries to capture to both the models that make use of those ideas, and, separately, to the concrete details for implementing them. The intuition is that the core of a project is about the abstract ideas and the uses to which they are put, rather than about the specific data, measures, and tools that happened to be used in a particular implementation. Indeed, what are robustness checks and validation in quantitative social science if not the tests that ensure the stability of the abstract ideas in the face of alternative implementations? By making the codebase depend on the abstract ideas, it makes it possible for the data, measures, and tools to change while the core of the project remains stable.
8.2.1 Modularity and Interfaces
A large codebase should be built from modules, which are discrete units of functionality that partition the project into independently understandable parts (Liskov 1987, 20). Modularity comes naturally to programming languages like Python and R, where users routinely import modules as libraries and packages. The library’s inner workings are hidden behind its interface, which is the set of functions it makes available for programmers to call. A researcher loading the ggplot2 library, for instance, writes library(ggplot2) and then calls functions from its interface, such as ggplot(data=df, aes(x=x_column_name, y=y_column_name)), where ggplot and aes are functions in ggplot2’s interface; data, x, and y are parameters of those functions; and df, x_column_name, and y_column_name are the arguments the researcher passes to those parameters. The library’s source code, which may be hundreds or thousands of lines, is not the researcher’s concern. The researcher only needs to know how to call the functions, not how they work internally. Building a project in modules means writing each module around a specific task and assembling the project by importing the modules and calling their interfaces.
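As a minimal sketch of this pattern (the module and function names here are hypothetical, not taken from any particular project), a cleaning module might expose one public function while hiding its internals behind that interface:

```python
# cleaning.py (hypothetical module): callers see only the public
# function below, never the helper it delegates to.

def _normalize(name: str) -> str:
    # Internal detail, hidden behind the module's interface.
    return name.strip().lower().replace(" ", "_")

def clean_columns(columns: list[str]) -> list[str]:
    """Public interface: normalize raw column names."""
    return [_normalize(c) for c in columns]
```

A caller would write `from cleaning import clean_columns` and never touch `_normalize`; if the normalization rules change, this module is the only place that changes.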
8.2.2 The SOLID Principles
Each module should have a single responsibility (Martin 2002). This does not mean that each module does only one thing. Rather, it means that “a module should have one, and only one, reason to change” (Martin 2009, 138). A module for cleaning a specific dataset will do many things, but if a researcher needs to adjust how a specific dataset is cleaned, the module for cleaning that data is the only module that changes. Likewise, if an update to a library changes how data is acquired from an online source, the module for acquiring data is the only module that changes. In this way, each task in the project lives in its own module. A reader who wants to know how a particular step was implemented can navigate to the module for that step, without having to follow the rest of the code. Likewise, a researcher who wants to implement that step in their own project can import the module and call its interface, without also importing the context of the original project.
Codebases should be designed so their behavior can be extended by adding new code rather than modifying existing code that already works. That is, a codebase should be “open for extension but closed for modification” (Martin 2009, 149). The primary purpose of this principle is to reduce the need to make changes throughout a codebase whenever adding additional functionality. It also reduces the likelihood that a change in one of those places will alter the functionality of other parts of the code on which other functions, or users, depend.
A common violation of the “open-closed” principle is heavy reliance on if-else statements, as in Listing 8.1.
import math

def area(shape):
    if shape.kind == "circle":
        return math.pi * shape.radius ** 2
    elif shape.kind == "rectangle":
        return shape.width * shape.height

def perimeter(shape):
    if shape.kind == "circle":
        return 2 * math.pi * shape.radius
    elif shape.kind == "rectangle":
        return 2 * (shape.width + shape.height)

def describe(shape):
    if shape.kind == "circle":
        return f"Circle with radius {shape.radius}"
    elif shape.kind == "rectangle":
        return f"Rectangle {shape.width} x {shape.height}"

Each function has a single responsibility, which is to calculate a specific property of a shape. If we want to add a new shape, however, consider the changes that need to be made throughout the codebase, as shown in Listing 8.2:
import math

def area(shape):
    if shape.kind == "circle":
        return math.pi * shape.radius ** 2
    elif shape.kind == "rectangle":
        return shape.width * shape.height
    elif shape.kind == "triangle":
        return 0.5 * shape.base * shape.height

def perimeter(shape):
    if shape.kind == "circle":
        return 2 * math.pi * shape.radius
    elif shape.kind == "rectangle":
        return 2 * (shape.width + shape.height)
    elif shape.kind == "triangle":
        return shape.side1 + shape.side2 + shape.side3

def describe(shape):
    if shape.kind == "circle":
        return f"Circle with radius {shape.radius}"
    elif shape.kind == "rectangle":
        return f"Rectangle {shape.width} x {shape.height}"
    elif shape.kind == "triangle":
        return f"Triangle with sides {shape.side1}, {shape.side2}, {shape.side3}"

As Listing 8.2 shows, each of the functions needs to be modified to accommodate the new shape. If there are many functions that depend on the shape, the number of changes that need to be made to accommodate each new shape multiplies accordingly. If these functions are spread throughout the codebase, the time it takes to find and then change them increases still further, and these costs compound as the size of the codebase grows. Beyond the time costs, there are also risks: every modification to the existing functions could introduce errors that disrupt the already working code for circles and rectangles that other users, or publications, depended on.
In political science, an equivalent violation of open-closed might look like Listing 8.3:
def allocate_seats(result):
    if result.system == "FPTP":
        # First-past-the-post: winner takes the single seat
        winner = max(result.votes, key=result.votes.get)
        return {winner: 1}
    elif result.system == "PR":
        # Proportional representation; allocate_pr_seats is defined elsewhere
        return allocate_pr_seats(result.votes, result.seats)

def winning_party(result):
    if result.system == "FPTP":
        return max(result.votes, key=result.votes.get)
    elif result.system == "PR":
        seats = allocate_pr_seats(result.votes, result.seats)
        return max(seats, key=seats.get)

def describe(result):
    if result.system == "FPTP":
        return (
            f"{result.constituency}: FPTP, "
            f"{sum(result.votes.values())} votes cast"
        )
    elif result.system == "PR":
        return (
            f"{result.constituency}: PR ({result.seats} seats), "
            f"{sum(result.votes.values())} votes cast"
        )

Adding a mixed-member proportional system or a ranked-choice voting system to Listing 8.3 would require modifying every function whose behavior varied by electoral system. These functions may be numerous and spread throughout the codebase, which makes the code challenging to maintain and review. As with the example of shapes, each change would risk breaking existing code that other research on those electoral systems depended on.
The solution is a codebase that can be extended without modifying existing code. In the case of shapes, this would resemble Listing 8.4:
import math
from collections.abc import Iterable
from typing import Protocol

class Shape(Protocol):
    def area(self) -> float: ...
    def perimeter(self) -> float: ...
    def describe(self) -> str: ...

class Circle(Shape):
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return math.pi * self.radius ** 2

    def perimeter(self) -> float:
        return 2 * math.pi * self.radius

    def describe(self) -> str:
        return f"Circle with radius {self.radius}"

class Rectangle(Shape):
    def __init__(self, width: float, height: float):
        self.width = width
        self.height = height

    def area(self) -> float:
        return self.width * self.height

    def perimeter(self) -> float:
        return 2 * (self.width + self.height)

    def describe(self) -> str:
        return f"Rectangle {self.width} x {self.height}"

def total_area(shapes: Iterable[Shape]) -> float:
    return sum(s.area() for s in shapes)

def total_perimeter(shapes: Iterable[Shape]) -> float:
    return sum(s.perimeter() for s in shapes)

In Listing 8.4, the Shape class defines the interface for all shapes, and the classes for each shape implement that interface. The functions total_area and total_perimeter depend on the Shape interface, rather than on any specific shape. To add a new shape, such as a triangle, we only need to create a new Triangle class that implements the Shape interface. Thus, we have changed the codebase by extending it rather than modifying any of its functions or modules. Likewise, if we wanted to add a new function for all shapes — such as french_name — we could add it to the Shape interface and to each shape class that implements it. Critically, the pre-existing code for total_area and total_perimeter, which other researchers or publications depended on, would not need to change. The architecture protects the consumers of an existing interface from changes, even when the interface itself grows. That is, the codebase is open for extension but closed for modification.
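To make the extension step concrete, here is a sketch of adding a triangle (condensed to the area method only; the Triangle field names follow Listing 8.2). The new subtype is added alongside the existing code, and total_area accepts it immediately because it depends only on the interface:

```python
from collections.abc import Iterable
from typing import Protocol

class Shape(Protocol):
    def area(self) -> float: ...

class Rectangle:
    # Existing subtype: untouched by the extension.
    def __init__(self, width: float, height: float):
        self.width = width
        self.height = height

    def area(self) -> float:
        return self.width * self.height

class Triangle:
    # New subtype: added to the codebase, not edited into existing functions.
    def __init__(self, base: float, height: float):
        self.base = base
        self.height = height

    def area(self) -> float:
        return 0.5 * self.base * self.height

def total_area(shapes: Iterable[Shape]) -> float:
    # Unchanged: depends on the Shape interface, not on any specific shape.
    return sum(s.area() for s in shapes)
```

A call like `total_area([Rectangle(2, 3), Triangle(4, 5)])` works without any change to total_area, which is the open-closed property in action.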
Returning to the example from political science, the series of if-else statements could be refactored to Listing 8.5:
from collections.abc import Iterable
from typing import Protocol

class ElectoralSystem(Protocol):
    def allocate_seats(self, votes: dict[str, int]) -> dict[str, int]: ...
    def winning_party(self, votes: dict[str, int]) -> str: ...
    def describe(self, constituency: str, votes: dict[str, int]) -> str: ...

class FPTP(ElectoralSystem):
    def allocate_seats(self, votes):
        return {max(votes, key=votes.get): 1}

    def winning_party(self, votes):
        return max(votes, key=votes.get)

    def describe(self, constituency, votes):
        return f"{constituency}: FPTP, {sum(votes.values())} votes cast"

class ProportionalRepresentation(ElectoralSystem):
    def __init__(self, seats: int):
        self.seats = seats

    def allocate_seats(self, votes):
        # allocate_pr_seats is defined elsewhere in the project
        return allocate_pr_seats(votes, self.seats)

    def winning_party(self, votes):
        seats = self.allocate_seats(votes)
        return max(seats, key=seats.get)

    def describe(self, constituency, votes):
        return (
            f"{constituency}: PR ({self.seats} seats), "
            f"{sum(votes.values())} votes cast"
        )

def total_seats(
    results: Iterable[tuple[ElectoralSystem, dict[str, int]]],
) -> dict[str, int]:
    totals: dict[str, int] = {}
    for system, votes in results:
        for party, n in system.allocate_seats(votes).items():
            totals[party] = totals.get(party, 0) + n
    return totals

In Listing 8.5, adding a new electoral system means adding a new class that implements the ElectoralSystem interface. If we wanted to add a new method for all electoral systems — such as effective_number_of_parties() — we could add it to the ElectoralSystem interface and to each electoral-system class that implements it. Crucially, the pre-existing code for total_seats, and any other functions that depend on ElectoralSystem, which other researchers or publications may depend on in their established pipelines, would not need to change. Here again, the codebase is open for extension but closed for modification.
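As an illustration of extending Listing 8.5 (this class and its deliberately simplified allocation rule are hypothetical, not a faithful model of any real block-vote system), a new system is one new class; total_seats and every other consumer of ElectoralSystem stay untouched:

```python
class BlockVote:
    # Hypothetical new subtype with a simplified rule: the plurality
    # winner takes every seat in the district. It satisfies the
    # ElectoralSystem interface structurally, so existing functions
    # such as total_seats accept it without modification.
    def __init__(self, seats: int):
        self.seats = seats

    def allocate_seats(self, votes: dict[str, int]) -> dict[str, int]:
        return {max(votes, key=votes.get): self.seats}

    def winning_party(self, votes: dict[str, int]) -> str:
        return max(votes, key=votes.get)

    def describe(self, constituency: str, votes: dict[str, int]) -> str:
        return (
            f"{constituency}: block vote ({self.seats} seats), "
            f"{sum(votes.values())} votes cast"
        )
```

The extension is purely additive: no existing function gains an elif branch, and no established pipeline built on FPTP or ProportionalRepresentation is at risk.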
The principle that enables this architecture is the Liskov Substitution Principle, formulated by Barbara Liskov (Liskov 1987). In her formal terms:
“If for each object \(o_1\) of type \(S\) there is an object \(o_2\) of type \(T\) such that for all programs \(P\) defined in terms of \(T\), the behavior of \(P\) is unchanged when \(o_1\) is substituted for \(o_2\), then \(S\) is a subtype of \(T\)” (Liskov 1987, 25).
In less formal terms, the shapes (Circle, Rectangle) are subtypes of the supertype Shape, and the electoral systems (FPTP, ProportionalRepresentation) are subtypes of the supertype ElectoralSystem. Liskov’s substitution property guarantees that any class implementing the interface can stand in for any other, without affecting the behavior of code that depends on the interface. When we add a new shape or electoral system, the functions that depend on the interface (total_area, total_seats) cannot tell whether they are operating on an original implementation or a new one. Any class that satisfies the contract is acceptable. The key point, however, is that each subtype must satisfy the contract of the supertype—that is, any operation that can be performed on the supertype can also be performed on the subtype. With Liskov substitution, an architecture can have a small number of supertypes and a far larger number of subtypes. It can also have functions defined in terms of the supertypes, which then work on any of the subtypes. This is what allows new subtypes to be added to the codebase without modifying any of the existing functions.
The fact that each subtype must satisfy the contract of the interface is what makes Liskov substitution possible. Yet this requirement also imposes a constraint on interface design. If an interface includes methods that are relevant only to some implementations, then every implementation will be forced to define methods it does not support. This constraint applies whether the relationship is established through inheritance or through Protocols, as in modern Python. With inheritance, each subclass must implement all methods of the superclass to avoid violating the contract and triggering a runtime NotImplementedError. With Protocols, a class that fails to implement all of the Protocol’s methods is flagged by the type checker as not satisfying the Protocol, and any consumer that depends on the Protocol will refuse to accept it. The consequence is that if the Shape interface included a method for calculating volume, then Circle and Rectangle would have to implement it even though they are two-dimensional shapes. Likewise, if the ElectoralSystem interface included a method for district_magnitude, then FPTP would have to implement it even though magnitude is fixed at one for FPTP by definition. Beyond cluttering the codebase, this confuses anyone reading, modifying, or implementing the interfaces, including the author returning to the project a few weeks later. From the perspective of the consumer, “…it is harmful to depend on modules that contain more than you need” (Martin 2018, ch. 10). Each interface should expose only the methods its consumers actually use, with separate interfaces defined when consumers have different needs. The Interface Segregation Principle is violated when an interface bundles together more functionality than its consumers or implementers need.
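To sketch what segregation looks like in the chapter's shape example (the Solid protocol and the Sphere class are hypothetical additions, not part of the earlier listings), volume lives in its own narrow interface, so two-dimensional shapes never have to implement it:

```python
import math
from typing import Protocol

class Shape(Protocol):
    def area(self) -> float: ...

class Solid(Protocol):
    # Separate, narrow interface: only three-dimensional
    # shapes implement volume.
    def volume(self) -> float: ...

class Circle:
    # Satisfies Shape only; it is never forced to define volume.
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return math.pi * self.radius ** 2

class Sphere:
    # Satisfies both Shape (surface area) and Solid (volume).
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return 4 * math.pi * self.radius ** 2

    def volume(self) -> float:
        return (4 / 3) * math.pi * self.radius ** 3

def total_volume(solids: list[Solid]) -> float:
    # Consumers depend only on the interface they actually use.
    return sum(s.volume() for s in solids)
```

Each consumer names only the protocol it needs: a plotting routine can take Shape, a 3D routine can take Solid, and neither drags in methods it never calls.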
The final and, arguably, defining principle of the approach focuses on the stability of the codebase over time. Imagine a research project with extensive visualization needs, relying on an excellent library such as ggplot2. Each time a visualization is needed, the codebase calls ggplot() directly. As the codebase grows, calls to ggplot() accumulate throughout the project — in analysis modules and in places that pass ggplot2 output through to other functions for further formatting. If ggplot2 releases a version update, every place that calls ggplot() may need to be edited. The cost is worse if a new library emerges and the team wants to migrate, or even if two collaborators wish to experiment with alternative libraries. Not only does every call site change, but every downstream function that depended on ggplot2’s output type must also be modified. The same dynamic applies to data sources and formats, specific measures, data libraries, and other tools. If the codebase depends on concrete implementations, then changes to those implementations ripple through the codebase and disrupt other parts of the project.
Concrete implementations are highly likely to change over the course of a project, whether because of updates to external modules or because of new data, measures, and tools. The ideas and needs stay the same, but the concrete details of how they are implemented need to be adjusted. If the codebase is designed so that the abstract ideas are at the core and the concrete details are derivative, then changes to the concrete details will not affect the core of the project. For example, if the codebase defines a Plotter interface and uses a subtype of Plotter to implement the interface using ggplot2, as illustrated in Figure 8.1, then the rest of the codebase depends on the Plotter interface, not on ggplot2 itself. If ggplot2 updates its interface, the only changes that need to be made are to the subtype that implements Plotter for ggplot2. If a new library emerges, a new subtype can be written that implements Plotter using the new library. In both cases, the core of the project, which depends only on the Plotter interface, remains stable and unaffected by changes to the concrete implementation.
This principle is dependency inversion (Martin 2002, 130–31). Instead of the project’s analysis depending on the specific tools that happen to be available, the tools depend on the analysis’s abstract requirements. With this architecture, changes to the concrete details do not ripple through the codebase and disrupt other parts of the project. Instead, they are contained within their own modules, which can be modified or swapped altogether without affecting any other modules that depend on them.
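A minimal sketch of the Plotter arrangement in Figure 8.1 (the names and signatures here are illustrative assumptions; a real adapter would wrap matplotlib or another plotting backend rather than return a string): the core code calls only the interface, and a concrete adapter supplies the rendering.

```python
from typing import Protocol

class Plotter(Protocol):
    # The abstract requirement the analysis depends on.
    def plot(self, x: list[float], y: list[float], title: str) -> str: ...

class TextPlotter:
    # Stand-in adapter for illustration; a real adapter would
    # delegate to matplotlib (or to ggplot2 via an R bridge).
    def plot(self, x: list[float], y: list[float], title: str) -> str:
        return f"{title}: {len(x)} points plotted"

def make_figure(plotter: Plotter, x: list[float], y: list[float]) -> str:
    # Core code: depends on the Plotter interface, never on a library.
    return plotter.plot(x, y, "Figure")
```

Swapping plotting libraries then means writing one new adapter class; make_figure, and everything built on it, never changes.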
8.3 Conclusion
This note introduced a set of programming principles that software engineers use to manage large, complex codebases. These principles are now relevant to quantitative social science because generative AI has made it possible to build codebases at a scale and complexity that require them. There are many sets of rules for embedding these principles in code, such as ports-and-adapters architecture (Cockburn 2005). There are also other design patterns, such as domain-driven design (Evans 2004), that become necessary as projects grow larger. These patterns require rich qualitative expertise to implement effectively (Cochrane et al. 2026). In any case, single responsibility, open-closed design, Liskov substitution, interface segregation, and dependency inversion are no longer specialized concerns for large teams of software developers. They are, increasingly, among the design principles that — alongside open data and open code — now define what research transparency looks like at scale.