Treaty-following AI
Abstract
Advanced AI agents raise significant challenges to global stability, cooperation, and international law. States increasingly face pressure to manage the geopolitical, safety and security risks such agents give rise to. Yet advanced AI agreements for these systems face significant hurdles, including potentially unpalatable compliance monitoring requirements, and the ease with which states or their ‘lawless’ AI agents might exploit legal loopholes. This article introduces a novel commitment mechanism for states: Treaty-Following AI (TFAI). Building on recent work on “Law-Following AI”, we propose that AI agents could be technically and legally designed to execute their principals’ instructions except where those involve goals or actions that would breach a designated AI-Guiding Treaty. If technically and legally feasible, the TFAI framework offers a powerful, verifiable and self-executing commitment mechanism by which states could guarantee that their deployed AI agents will abide by the legal obligations to which those states have agreed. This could unlock otherwise-unfeasible advanced AI agreements, and potentially help revitalize treaties as an instrument across many domains of international law. The article examines the conceptual foundations, technical feasibility, political uses, and legal construction of TFAI agents and AI-guiding treaties, answering questions around the design of AI-guiding treaties, state responsibility for TFAI agents, and appropriate methods of treaty interpretation for TFAI agents, amongst others. It argues that, if these outstanding technical, legal, and political hurdles are cleared, treaty-following AI could serve as an important scalable cooperative capability for both the international governance of advanced AI, and for the broader integrity of international law in the ‘intelligence age’.
I. Introduction
If AI systems might be made to follow laws,[ref 1] could they also be made to follow the legal text of international agreements? Could “Treaty-Following AI” (TFAI) agents—designed to follow their principals’ instructions except where those entail actions that violate the terms of a designated treaty—help robustly and credibly strengthen states’ compliance with their adopted international obligations?
Over time, what would a framework of treaty-following AI agents aligned to AI-guiding treaties imply for the prospects of new treaties specific to powerful AI (‘advanced AI agreements’), for state compliance with existing treaty instruments in many other domains, and for the overall role and reach of binding international treaties in the brave new “intelligence age”?[ref 2]
These questions are increasingly salient and urgent. As AI investment and capability progress continues apace,[ref 3] so too does the development of ever more capable AI models, including those that can act coherently as “agents” to carry out many tasks of growing complexity.[ref 4] AI agents have been defined in various ways,[ref 5] but they can be practically considered as those AI systems which can be instructed in natural language and then act autonomously, which are capable of pursuing difficult goals in complex environments without detailed (follow-up) instruction, and which are capable of using various affordances or design patterns such as tool use (e.g., web search) or planning.[ref 6]
To be sure, AI agents today vary significantly in their level of sophistication and autonomy.[ref 7] Many of these systems still face limits to their performance, coherence over very long time horizons,[ref 8] robustness in acting across complex environments,[ref 9] and cost-effectiveness,[ref 10] amongst other issues.[ref 11] It is important and legitimate to critically scrutinize the time frame or the trajectory on which this technology will come into its own.
Nonetheless, a growing number of increasingly agentic AI architectures are available;[ref 12] they are seeing steadily wider deployment by AI developers and startups across many domains;[ref 13] and, barring sharp breaks in or barriers to progress, it will not be long before the current pace of progress yields increasingly more capable and useful agentic systems, including, eventually, “full AI agents”, which could be functionally defined as systems “that can do anything a human can do in front of a computer.”[ref 14] Once such systems come into reach, it may not be long before the world sees thousands or even many millions of such systems autonomously operating daily across society,[ref 15] with very significant impacts across all spheres of human society.[ref 16]
Far from being a mirage,[ref 17] then, the emergence and proliferation of increasingly agentic AI systems is a phenomenon of rapidly increasing societal importance[ref 18]—and of growing social, ethical and legal concern.[ref 19] After all, given their breadth of use cases—ranging from at-scale espionage campaigns, intelligence synthesis and analysis,[ref 20] military decision-making,[ref 21] cyberwarfare, and economic or industrial planning to scientific research and operations in the informational landscape—AI agents will likely impact all domains of states’ domestic and international security and economic interests.
As such, even if their direct space of action remains confined to the digital realm (rather than extending to robotic platforms), these AI agents could create many novel global challenges, not just for domestic citizens and consumers, but also for states and for international law. These latter risks include new threats to international security or strategic stability,[ref 22] broader geopolitical tensions and novel escalation risks,[ref 23] significant labour market disruptions and market-power concentration,[ref 24] distributional concerns and power inequalities,[ref 25] domestic political instability[ref 26] or legal crises;[ref 27] new vectors for malicious misuse, including in ways that bypass existing safeguards (such as on AI–bio tools);[ref 28] and emerging risks that these agents act beyond their principals’ intent or control.[ref 29]
Given these prospects, it may be desirable for states at the frontier of AI development to strike new (bilateral, minilateral or multilateral) international agreements to address such challenges—or for ‘middle powers’ to make access to their markets conditional on AI agents complying with certain standards—in ways that assure the safe and stabilizing development, deployment, and use of advanced AI systems.[ref 30] Let us call these advanced AI agreements.
Significantly, even if negotiations on advanced AI agreements were initiated on the basis of genuine state concern and entered into in good faith—for instance, in the wake of some international crisis involving AI[ref 31]—there would still be key hurdles to their success: these agreements would likely face a range of challenges both old and new. In particular, they might face challenges around (1) the intrusiveness of monitoring state activities in deploying and directing their AI agents in order to ensure treaty compliance; (2) the continued enforcement of initial AI benefit-sharing promises; or (3) the risk that AI agents would, whether by state order or not, exploit legal loopholes in the treaty. Such challenges are significant obstacles to advanced AI agreements, and they will need to be addressed if such AI treaties are to be politically viable, effective, and robust as the technology advances.
Even putting aside novel treaties for AI, the rise of AI agents is also likely to put pressure on many existing international treaties (or future ones, negotiated in other domains), especially as AI agents will begin to be used in domains that affect their operation. This underscores the need for novel kinds of cooperative innovations—new mechanisms or technologies by which states can make commitments and assure (their own; and one another’s) compliance with international agreements. Where might such solutions be found?
Recent work has proposed a framework for “Law-Following AI” (LFAI) which, if successfully adapted to the international level, could offer a potential model for how to address these global challenges.[ref 32] If, by analogy, we can design AI agents themselves to be somehow “treaty-following”—that is, to generally follow their principals’ instructions loyally but to refuse to take actions that violate the terms of a designated treaty—this would greatly strengthen the prospects for advanced AI agreements,[ref 33] as well as strengthen the integrity of other international treaties that might otherwise come under stress from the unconstrained activities of AI agents.
Far from a radical, unprecedented idea, the notion of treaty-following AI—and the basic idea of ensuring that AI agents autonomously follow states’ treaty commitments under international law—draws on many established traditions of scholarship in cyber, technology, and computational law;[ref 34] on recent academic work exploring the role of legal norms in the value alignment for advanced AI;[ref 35] as well as a longstanding body of international legal scholarship aimed at the control of AI systems used in military roles.[ref 36] It also is consonant with recent AI industry safety techniques which seek to align the behaviour of AI systems with a “constitution”[ref 37] or “Model Spec”.[ref 38] Indeed, some applied AI models, such as Scale AI’s “Defense Llama”, have been explicitly marketed as being trained on a dataset that includes the norms of international humanitarian law.[ref 39] Finally, it is convergent with the recent interest of many states, the United States amongst them,[ref 40] in developing and championing applications of AI that support and ensure compliance with international treaties.
Significantly, a technical and legal framework for treaty-following AI could enable states to make robust, credible, and verifiable commitments that their deployed AI agents act in accordance with the negotiated and codified international legal constraints that those states have consented and committed to. By significantly expanding states’ ability to make commitments regarding the future behaviour of their AI agents, this could not only aid in negotiating AI treaties, but might more generally reinvigorate treaties as a tool for international cooperation across a wide range of domains; it could even strengthen and invigorate automatic compliance with a range of international norms. For instance, as AI systems see wider and wider deployment, a TFAI framework could help assure automatic compliance with norms and agreements across wide-ranging domains: it could help ensure that any AI-enabled military assets automatically operate in compliance with the laws of war,[ref 41] and that AI agents used in international trade could automatically ensure nuanced compliance with tailored export control regimes that facilitate technology transfers for peaceful uses. The framework could help strengthen state compliance with human rights treaties or even with mutual defence commitments under collective security agreements, to name a few scenarios. In so doing, rather than mark a radical break in the texture of international cooperation, TFAI agreements might simply serve as the latest step in a long historical process whereby new technologies have transformed the available tools for creating, shaping, monitoring, and enforcing international agreements amongst states.[ref 42]
But what would it practically mean for advanced AI agents to be treaty-following? Which AI agents should be configured to follow AI treaties? What even is the technical feasibility of AI agents interpreting agreements in accordance with the applicable international legal rules on treaty interpretation? What does all this mean for the optimal—and appropriate—content and design of AI-guiding treaty regimes? These are just some of the questions that will require robust answers in order for TFAI to live up to its significant promise. In response, this paper provides an exploration of these questions, offered with the aim of sparking and structuring further research into this next potential frontier of international law and AI governance.
This paper proceeds as follows: In Part II, we will discuss the growing need for new international agreements around advanced AI, and the significant political and technical challenges that will likely beset such international agreements, in the absence of some new commitment mechanisms by which states can ensure—and assure—that the regulated AI agents would abide by their terms. We argue that a framework for treaty-following AI would have significant promise in addressing these challenges, offering states precisely such a commitment mechanism. Specifically, we argue that using the treaty-following AI framework, states can reconfigure any international agreements (directly for AI; or for other domains in which AIs might be used) as ‘AI-guiding treaties’ that constrain—or compel—the actions of treaty-following AI agents. We argue that states can use this framework to contract around the safe and beneficial development and deployment of advanced AI, as well as to facilitate effective and granular compliance in many other domains of international cooperation and international law.
Part III sets out the intellectual foundations of the TFAI framework, before discussing its feasibility and implementation from both technical and legal perspectives. It first discusses the potential operation of TFAI agents and discusses the ways in which such systems may be increasingly technically feasible in light of the legal-reasoning capabilities of frontier AI agents—even as a series of technical constraints and challenges remain. We then discuss the basic legal form, design, and status of AI-guiding treaties, and the relation of TFAI agents with regard to these treaties.
In Part IV, we discuss the legal relation between TFAI agents and their deploying states. We argue that TFAI frameworks can function technically and politically even if the status or legal attributability of TFAI agents’ actions is left unclear. However, we argue that the overall legal, political, and technical efficacy of this framework is strengthened if these questions are clarified by states, either in general or within specific AI-guiding treaties. As such, we review a range of avenues to establish adequate lines of state responsibility for the actions of their TFAI agents. Noting that more expansive accounts—which entail extending either international or domestic legal personhood to TFAI agents—are superfluous and potentially counterproductive, we ultimately argue for a solution grounded in a more modest legal development, where TFAI agents’ actions become held as legally attributable to their deploying states under an evolutive reading of the International Law Commission (ILC)’s Articles on the Responsibility of States for Internationally Wrongful Acts (ARSIWA).
Part V discusses the question of how to ensure effective and appropriate interpretation of AI-guiding treaties by TFAI agents. It discusses two complementary avenues. We first consider the feasibility of TFAI agents applying the default rules for treaty interpretation under the Vienna Convention on the Law of Treaties; we then consider the prospects of designing bespoke AI-guiding treaty regimes with special interpretative rules and arbitral or adjudicatory bodies. For both avenues, we identify and respond to a series of potential implementation challenges.
Finally, Part VI sketches future questions and research themes that are relevant for ensuring the political stability and effectiveness of the TFAI framework, and then we conclude.
II. Advanced AI Agreements and the Role of Treaty-Following AI
To begin, it is important to clarify the terminology, concepts, and scope of our argument.
A. Terminology
We define:
1. Advanced AI agreements: AI-specific treaties which states may (soon or eventually) negotiate bilaterally, minilaterally, or multilaterally, and adopt, in order to establish state obligations to regulate the development, capabilities, or usage of advanced AI agents;
2. Existing obligations: any other (non-AI-specific) international obligations that states may be under—within treaty or customary international law—which may be relevant to regulating the behaviour of AI agents, or which might be violated by the behaviour of unregulated AI agents.
While our initial focus in this paper is on how the rise of AI agents may interact with—and in turn be regulated within—the first category (advanced AI agreements), we welcome extensions of this work to the broader normative architecture of states’ existing obligations in international law, since we believe our framework is applicable to both (see Table 1). Our paper proceeds from the concern that the political and technical prospects of advanced AI agreements (and existing obligations) may be dim, unless states have either greater willingness or greater capability to make and trust international commitments. Changing states’ willingness to contract and to trust is difficult, though not impossible. However, one way that states’ cooperative capabilities might be strengthened is by ensuring that any AI agents they deploy will, by their technical design and operation, comply with those states’ obligations (under advanced AI agreements).
Such agents we call:
3. Treaty-Following AI agents (TFAI agents): agentic AI systems that are designed to generally follow their principals’ instructions loyally but to refuse to take actions that violate the terms and obligations of a designated referent treaty text.
Note three considerations. First, in this paper we focus on TFAI agents deployed by states, leaving aside for the moment the admittedly critical question of how we would treat AI agents deployed by private actors. Second, we focus on states’ AI agents that act across many domains, with their primary functions often being not legal interpretation per se, but rather a wide range of economic, logistical, military, or intelligence functions. Such TFAI agents would engage in treaty interpretation in order to adjust their own behaviour across many domains for treaty compliance; however, we largely bracket the potential parallel use of AI systems in negotiating or drafting international treaties (whether advanced AI agreements or any other new treaties), or their use in other forms of international lawmaking or legal norm-development (e.g. finding and organizing evidence of customary international law). Finally, the TFAI proposal does not construct AI systems as duty-bearing “legal actors” and therefore does not involve significant shifts in the legal ontology of international law per se.
Next, any bespoke advanced AI agreements designed to be self-enforced by TFAI agents we refer to as:
4. AI-Guiding Treaties: treaty instruments serving as the referent legal texts for Treaty-Following AI agents, consisting (primarily) of the legal text that those agents are aligned to, as well as (secondarily) their broader institutional scaffolding.
The full assemblage—comprising the technical configuration of AI agents to operate as TFAI agents, and of treaties to be AI-guiding—is referred to as (5) the TFAI framework.
Importantly, we envision AI-guiding treaties as a relatively modest innovation—that is, as technical ex ante infrastructural constraints on TFAI agents’ range of acceptable goals or actions, building on demonstrated AI industry safety techniques. As such, we treat the treaty text in question as an appropriate, stable, and certified referent text through which states can establish jointly agreed-upon infrastructural constraints on which instructed goals their AI agents may accept, and on the latitude of conduct they may adopt in pursuit of those lawful goals.

Table 1. Terminology, scope and focus of argument
As such, AI-Guiding Treaties constitute a high-leverage mechanism for self-executing state commitments. This mechanism could in principle be extended to all sorts of other treaties, in any other domains—from cyberwarfare to alliance security guarantees, from bilateral trade agreements to export control regimes, and even human rights or environmental law regimes—where AI agents could become involved in carrying out large fractions of their deploying states’ conduct. However, for the present, we focus on applying the TFAI framework to bespoke AI-guiding treaties (see Table 1) and leave the question of how to configure AI agents to follow other (non-AI-specific) legal obligations to future work. After all, if the TFAI framework cannot operate in this more circumscribed context, it will likely also fall flat in the context of other international legal norms and instruments. Conversely, if it does work in this narrow context, it could still be a valuable commitment mechanism for state coordination around advanced AI, even if it would not solve the problems of AI’s actions in other domains of international law.
B. The Need for International Agreements on Advanced AI
To appreciate the promise and value of a TFAI framework for states, international security, and international law, it helps to understand the range of goals that international agreements specific to advanced AI might serve to their parties[ref 43] as well as the political and technical hurdles such agreements might face in a business-as-usual scenario that would see a proliferation of “lawless” AI agents[ref 44] engaging in highly unpredictable or erratic behaviour.[ref 45]
Like other international regimes aimed at facilitating coordination or collaboration by states,[ref 46] AI treaties could serve many goals and shared national interests. For example, they could enshrine clearly agreed restrictions or red lines on AI systems’ capabilities, behaviours, or usage[ref 47] in ways that preserve and guarantee parties’ national security as well as international stability.[ref 48] There are many areas of joint interest for leading AI states to contract over.[ref 49] For instance, advanced AI agreements could impose mutually agreed limits on advanced AI agents’ ability to engage in uninterpretable steganographic reasoning or communication,[ref 50] or to carry out uncontrollably rapid automated AI research.[ref 51] They could also establish mutually agreed restraints on AI agents’ capacity, propensity, or practical usability to infiltrate designated key national data networks or to target critical infrastructure through cyberattacks, to drive preemptive use of force in ways that ensure conflict escalation,[ref 52] to support coup attempts against (democratically elected or simply incumbent) governments, or to engage in any other actions that would severely interfere with the national sovereignty of signatory (or allied, or any) governments.[ref 53]
On the flip side, international AI agreements could also be aimed not just at avoiding the bad, but at achieving significant good. For instance, many have pointed to the strategic, political, and ethical value of conditional AI benefit-sharing deals:[ref 54] international agreements through which states leading in AI commit to some proportional or progressive sharing of the future benefits derived from AI technology with allies, unaligned states, or even with rival or challenger states. Such bargains, it is hoped, might help secure geopolitical stability, avert risky arms races or contestation by states lagging in AI,[ref 55] and could moreover ensure a degree of inclusive global prosperity from AI.[ref 56]
C. Political hurdles and technical threats to advanced AI agreements
However, it is likely that any advanced AI agreements would encounter many hurdles, both political and technical, which will need to be addressed.
1. Political hurdles: Transparency-security tradeoffs and future enforceability challenges
For one, international security agreements face challenges around the intrusive monitoring they may require to guarantee that all parties to the treaty instruct and utilize their AI systems in a manner that remains compliant with the treaty’s terms. Such monitoring is likely to risk revealing sensitive information, resulting in a “security-transparency tradeoff” which has historically undercut the prospects for various arms control agreements[ref 57] and which could do so again in the case of AI security treaties.
Simultaneously, asymmetric treaties, such as those conditionally promising a share of the benefits from advanced AI technologies, face potential (un)enforceability problems: those states that are lagging in AI development might worry that any such promises would not be enforceable and could easily be walked back, if or as the AI capability differential between them and a frontier AI state grew particularly steep, in a manner that resulted in massively lopsided economic, political, or military power.[ref 58]
2. Technical threats: hijacked, misaligned, or “henchmen” agents
Other hurdles to AI treaties would be technical. For many reasons, states might be cautious about overrelying on or overtrusting lawless AI agents in their service. After all, even fairly straightforward and non-agentic AI systems used in high-stakes governmental tasks (such as military targeting or planning) can be prone to unreliability, adversarial input, or sycophancy (the tendency of AI systems to align their output with what the user believes or prefers, even if that view is incorrect).[ref 59]
Moreover, AI systems can demonstrate surprising and functionally emergent capabilities, propensities, or behaviours.[ref 60] In testing, a range of leading LLM agents have autonomously resorted to risky or malicious insider behaviours (such as blackmail or leaking sensitive information) when faced with the prospect of being replaced with an updated version or when their initially assigned goal changed with the company’s changing direction; often, they did so even while disobeying direct commands to avoid such behaviours.[ref 61] This suggests that highly versatile AI agents may threaten various loss of control (LOC) scenarios—defined by RAND as “situations where human oversight fails to adequately constrain an autonomous, general-purpose AI, leading to unintended and potentially catastrophic consequences.”[ref 62] By committing such actions, AI agents would pose unique challenges for state compliance with international legal obligations, since these systems may well engage in actions that violate key obligations under particular treaties, inflict transboundary harms, or violate peremptory norms (jus cogens) or other applicable principles of international law.
Beyond the direct harm threatened by AI agents taking these actions, the attribution of such actions to the deploying state would likely threaten the stability of the treaty regime, spark significant political or military escalation, and/or expose that state to international legal liability,[ref 63] enabling injured states to take unfriendly measures of retorsion (e.g., severing diplomatic relations) or even countermeasures that would otherwise be unlawful (e.g., suspending treaty obligations or imposing economic sanctions that would normally violate existing trade agreements).[ref 64]
Why might we expect some AI agents to engage in such actions that violate their principal’s treaty commitments or legal obligations? There are several possible scenarios to consider.
For one, there are risks that AI agents can be attacked, compromised, hijacked, or spoofed by malicious third parties (whether acting directly or through other AI agents) using direct or indirect prompt injection attacks,[ref 65] spoofing, faked interfaces, IDs or certificates of trust,[ref 66] malicious configuration swaps,[ref 67] or other adversarial inputs.[ref 68] Such attacks would compromise not just the agents themselves, but also all systems they were authorized to operate in, given that major security vulnerabilities have been found in publicly available AI coding agents, including exploits that grant attackers full remote code-execution user privileges.[ref 69]
Secondly, there may be a risk that unaligned AI agents would themselves insufficiently consider—or even outright ignore—their states’ interests and obligations in undertaking certain action paths. As evidenced by a growing body of both theoretical arguments[ref 70] and empirical observation,[ref 71] it is difficult to design AI systems that reliably obey any particular set of constraints provided by humans,[ref 72] especially where these constraints refer not to clearly written-out texts but aim to also build in consideration of the subjective intents or desires of the principal.[ref 73] As such, AI agents deployed without care could frequently prove unaligned; that is, act in ways unrestrained by either normative codes[ref 74] or the intent of their nominal users (e.g., governments).[ref 75] In the absence of adequate real-time failure detection and incident response frameworks,[ref 76] such harms could escalate swiftly. In the wake of significant incidents, one would hope that governments would soon wise up to the inadvisability of deploying such systems without adequate oversight,[ref 77] but not, perhaps, before incurring significant political costs, whether counted in direct harm or in terms of lost global trust in their technological competence.
Thirdly, even if deployed AI agents could be successfully intent-aligned to their state principals,[ref 78] the use of narrowly loyal-but-lawless AI systems, which are left free to engage in norm violations that they judge in their principal’s interest, would likely expose their deploying states to significant political costs. To understand why this is, it is important to see the technical challenge of loyal-but-lawless AI agents in a broader political context.
3. Lawless AI agents, political exposure, and commitment challenges
Taken at face value, the development of AI systems that are narrowly loyal to a government’s directives and intentions, even to the exclusion of that government’s own legal precommitments, might appear a desirable prospect to some political realists. In practice, however, many actors may have both normative and self-interested reasons to be wary of loyal-but-lawless AI agents engaging in actions that are in legal grey areas—or outright illegal—on their behalf. At the domestic level, such AI “henchmen” would create significant legal risks for consumers using them[ref 79] and for corporations developing and deploying them.[ref 80] They would also create legal risks for government actors, who might find themselves violating public administrative law or even constitutions,[ref 81] as well as political risks, as AI agents that could be made loyal to particular government actors could well spur ruinous and destabilizing power struggles.[ref 82] Likewise, many states might find AI henchmen a politically poisoned fruit at the international level.
After all, not only could such AI agents be intentionally ordered by state officials to engage in conduct that violates or subverts those states’ treaty obligations, but these systems’ autonomy also suggests that they might engage in such unlawful actions even without being explicitly directed to do so. That is, loyal-but-lawless AI henchmen could engage in calculated treaty violations whenever they judge them to be to the benefit of their principal.[ref 83] However, outside actors, finding it difficult to distinguish between AI agent behaviour that was deliberately directed and henchman actions that were advantageous but unintended, might assume the worst in each case.
Significantly, in such contexts, the ambiguity of adversarial actions would frequently translate into perceptions of bad faith; in this way, loyal-but-lawless AI agents’ ability to violate treaties autonomously, and to do so in a (facially) deniable manner, as henchmen acting in their principals’ interests but not on their orders, perversely creates a commitment challenge for their deploying states, one which would erode states’ ability to effectively conclude (at least some) treaties. After all, even if a state intended to abide by its treaty obligations in good faith, it would struggle to prove this to counterparties unless it could somehow guarantee that its AI agents could not be misused and would not act as deniable henchmen whenever convenient.
This would not mean that states would no longer be able to conclude any such treaties at all; after all, there would remain many other mechanisms—from reputational costs to the risk of sparking reciprocal noncompliance—that might still incentivize or compel states’ compliance with such treaties.
However, lawless AI agents’ unpredictability poses a severe challenge insofar as it makes treaty violations more likely. Of course, even today, when states intend to comply with their international obligations, they may have trouble ensuring that their human agents consistently abide by those obligations. Such failures can occur for reasons of bureaucratic capacity[ref 84] and organizational culture,[ref 85] or they can happen as a result of the institutional breakdown of the rule of law, at worst resulting in significant rights abuses or humanitarian atrocities committed by junior personnel.[ref 86] Such incidents may, at best, frustrate states’ genuine intention to achieve the goals enshrined in the treaties they have consented to; in all cases, they can expose a state to significant reputational harm, legal and political censure, and adversary lawfare,[ref 87] while eroding domestic confidence in the competence or integrity of its institutions.
Significantly, lawless AI agents would likely exacerbate the risk that their deploying states would (be perceived to) use them strategically to engage in violations of treaty obligations in a manner that would afford some fig leaf of deniability if discovered. This is for a range of reasons: (1) treaties often prevent states from engaging in actions that a large fraction of a state’s human agents would prefer not to engage in; loyal-but-lawless AI agents would not have such moral side constraints and would be far more likely to obey unethical or illegal requests; (2) loyal-but-lawless AI agents would be less likely to whistleblow or leak to the press following the violation of a treaty (and, correspondingly, would need to worry less about their AI colleagues or collaborators doing so, meaning they could exchange information more freely); (3) loyal-but-lawless AI agents would have little reason to worry about personal consequences for treaty violations (e.g., foreign sanctions, asset freezing, travel restrictions, international criminal liability) that might deter human agents; (4) loyal-but-lawless AI agents would have less reason to worry about domestic legal or career repercussions (e.g., criminal or civil penalties, costs to their reputation or career) associated with aiding a violation of treaty obligations that could later become disfavored should domestic political winds shift; and (5) AI agents may be better at hiding their actions and their own or their principals’ identity, thus making them more likely to opportunistically violate the treaty.[ref 88]
These are not just theoretical concerns but are supported by empirical studies, which have indicated that human delegation of tasks to AI agents can increase dishonest behaviours, as human principals often find ways to induce dishonest AI agent behaviour without telling them precisely what to do; crucially, such cheating requests saw much higher rates of compliance when directed at machine agents than when they were addressed to human agents.[ref 89] For all these reasons, then, the widespread use of AI agents is likely to exacerbate international concerns over either deliberate or unwitting violations of treaty obligations by their deploying state.
As such, on the margin, the deployment of advanced agentic AIs acting under no external constraints beyond their states’ instructions would erode not just the respect for many existing norms in international law but also the prospects for new international agreements, including those focused on stabilizing or controlling the use of this key technology.
D. The TFAI framework as commitment mechanism and cooperative capability
Taken together, these challenges could put significant pressure on international advanced AI agreements and could more generally threaten the prospects for stable international cooperation in the era of advanced AI.
Conversely, an effective framework by which to guarantee that AI agents would adhere to the terms of their treaty could address or even invert these challenges. For one, it could help ease the transparency-security tradeoff by embedding constraints on AI agents’ actions at the level of the technology itself. It could crystallize (potentially) nearly irrevocable commitments by states to share the future benefits from AI with other states or to guarantee investor protections under more inclusive “open global investment” governance models.[ref 90]
More generally, the TFAI framework is one way by which AI systems could help expand the affordances and tools available to states, realizing a significant new cooperative capability[ref 91] that would greatly enhance their ability to make robust and lasting commitments to each other in ways that are not dependent on assumptions of (continued) good faith. Indeed, correctly configured, it could be one of many coordination-enabling applications of AI that could strengthen the ability of states (and other actors) to negotiate in domains of disagreement and to speed up collaboration towards shared global goals.[ref 92]
Finally, the ability to bind AI agents to jointly agreed treaties has many additional advantages and co-benefits; for one, it might mitigate the risk that some domestic (law-following) AI agents, especially in multi-agent systems, become engaged in activities with cross-border effects that end up simultaneously subjecting them to different sets of domestic law, resulting in conflict-of-law challenges.[ref 93]
E. Caveats
That said, the proposal for exploration and application of a TFAI framework comes with a number of caveats.
For one, in exploring the prospects for states to conclude new AI-specific treaty regimes (i.e., advanced AI agreements) by which to bind the actions of AI agents, we do not suggest that only these novel treaty regimes would ground effective state obligations around the novel risks from advanced AI agents. To the contrary, since many norms in international law are technology-neutral, there are already numerous binding and non-binding norms—deriving from treaty law, international custom, and general principles of law—that apply to states’ development and deployment of advanced AI agents[ref 94] and which would provide guidance even for future, very advanced AI systems.[ref 95] As such, as noted by Talita Dias,
“while the conversation about the global governance of AI has focussed on developing new, AI-specific rules, norms or institutions, foundational, non-AI-specific global governance tools already govern AI and AI agents globally, just as they govern other digital technologies. [since] International law binds states—and, in some circumstances, non-state actors—regardless of which tools or technologies are used in their activities.”[ref 96]
This means that one could also consider a more expansive project that would examine the case for fully “public international law-following AI” (see again Table 1). Nonetheless, as discussed before, in this article, we focus on the narrower and more modest framework for treaty-following AI. This is because a focus on AI that follows treaties can serve as an initial scoping exercise to investigate the feasibility of extending any law-following AI–like framework to the international sphere at all: if this exercise does not work, then neither would more ambitious proposals for public international law-following AI. Conversely, if the TFAI framework does work, it is likely to offer significant benefits to states (and to international stability, security, and inclusive development), even if subjecting AI systems to the full range of international legal norms proved more difficult, legally or politically.
Finally, in discussing potential AI-guiding treaties, we note that there is a wide range of reasons why states might wish to strike such international deals and agreements, and/or find ways to enshrine stronger technology-enabled commitments to comply with their obligations under those instruments. However, we do not aim to prescribe particular goals or substance for advanced AI agreements or to make strong claims about these treaties’ optimal design[ref 97] or ideal supporting institutions.[ref 98] We realize that substantive examples would be useful; however, given that there is currently still such pervasive debate over which particular goals states might converge on in international AI governance, this paper aims at the modest initial goal of establishing the TFAI framework as a relatively transferable, substance-agnostic commitment mechanism for states.
III. The Foundations and Scope of Treaty-Following AI
While the idea of designing AI agents to be treaty-following might seem unorthodox on its face, it is hardly without precedent or roots. Rather, it draws on an established tradition of scholarship in cyber and technology law, which has explored the ways through which legal norms and regulatory goals may be directly embedded in (digital) technologies,[ref 99] including in fields such as computational law.[ref 100]
Simultaneously, the idea of aligning AI systems with normative codes can moreover draw inspiration from, and complement, many other recent attempts to articulate frameworks for oversight and alignment of agents, including by establishing fiduciary duties amongst AI agents and their principals,[ref 101] articulating reference architectures for the design components necessary for responsible AI agents,[ref 102] drawing on user-personalized oversight agents[ref 103] or trust adjudicators,[ref 104] or articulating decentralized frameworks, rooted in smart contracts for both agent-to-agent and human-AI agent collaborations.[ref 105]
Significantly, in the past, some early scholarship in cyberlaw and computational law expressed justifiable skepticism over the feasibility of developing some form of ‘artificial legal intelligence’ grounded in an algorithmic understanding of law[ref 106] or of using then-prevailing approaches to manually program complex and nuanced legal codes into software algorithms.[ref 107] Nonetheless, we might today find reason to re-examine our assumptions about AI technology. After all, the modern lineage of advanced AI models, based on the transformer architecture,[ref 108] operates through a bottom-up learning paradigm that is fundamentally distinct from the older, top-down symbolic programming paradigm once prevalent in AI.[ref 109] Consequently, the idea of binding or aligning AI systems to legal norms, specified in natural language, has gained growing credence and attention not just in the broader fields of technology ethics[ref 110] and AI alignment,[ref 111] but also in legal scholarship written from the perspective of legal theory, domestic law,[ref 112] and international law.[ref 113]
A. From law-following to treaty-following AI
The law-following AI (LFAI) proposal by O’Keefe and others is, in a sense, an update to older computational law work, envisioned as a new framework for the development and deployment of modern, advanced AI. In their view, LFAI pursues:
“AI agents […] designed to rigorously comply with a broad set of legal requirements, at least in some deployment settings. These AI agents would be loyal to their principals, but refuse to take actions that violate applicable legal duties.”[ref 114]
In so doing, the LFAI framework aims to prevent criminal misuse, minimize the risk of accidental and unintended law-breaking actions undertaken by AI ‘henchmen’, help forestall abuse of power by government actors,[ref 115] and inform and clarify the application of tort liability frameworks for AI agents.[ref 116] The LFAI proposal envisions that, especially in “high stakes domains, such as when AI agents act as substitutes for human government officials or otherwise exercise government power”,[ref 117] AI agents are designed in a manner that makes them autonomously predisposed to obey applicable law; or, more specifically,
“AI agents [should] be designed such that they have ‘a strong motivation to obey the law’ as one of their ‘basic drives.’ … [W]e propose not that specific legal commands should be hard-coded into AI agents (and perhaps occasionally updated), but that AI agents should be designed to be law-following in general.”[ref 118]
By extension, for an AI agent to be treaty-following, it should be designed to generally follow its principals’ instructions loyally but refuse to take actions that violate the terms and obligations of a designated applicable referent treaty.
As discussed above, this means that the TFAI framework decomposes into two components: we will use TFAI agent to refer to the technical artefact (i.e., the AI system, including not just the base model but also the set of tools and scaffolding[ref 119] that make up the overall compound AI system[ref 120] that can act coherently as an agent) that has its conduct aligned to a legal text. Conversely, we use AI-guiding treaty[ref 121] to refer to the legal component (i.e., the underlying treaty text and, secondarily, its institutional scaffolding).
If technically and legally feasible, the promise of the TFAI framework lies in the ability to provide a guarantee of automatic self-execution for, and state party compliance with, advanced AI agreements, while requiring less pervasive or intrusive human inspections.[ref 122] It could therefore mitigate the security-transparency tradeoff and render such agreements more politically feasible.[ref 123] Moreover, ensuring that states deploy their AI agents in a treaty-following configuration makes AI treaties politically robust against AI agents acting in a misaligned manner. Specifically, the use of AI-guiding treaties and treaty-following AIs to institutionalize self-executing advanced AI agreements would preclude states’ AI agents from acting as henchmen that might engage in treaty violations for short-term benefit to their principal. Taking such behaviour off the table at the design level would help crystallize an ex ante reciprocal commitment amongst the contracting states, allowing them to reassure each other that they both intend to respect the intent of the treaty, not just its letter.
Finally, just as domestic laws may constitute a democratically legitimate alignment target for AI systems in national contexts,[ref 124] treaties could serve as a broadly acceptable, minimum normative common denominator for the international alignment of AI systems. After all, while international (treaty) law does not necessarily represent the direct output of a global democratic process, state consent does remain at the core of most prevailing theories of international law.[ref 125] That is not to say that this makes such norms universally accepted or uncontestable. After all, some (third-party) states (or non-state stakeholders) may perceive some treaties as unjust; others might argue that demanding mere legal compliance with treaties as the threshold for AI alignment is setting the bar too low.[ref 126] Nonetheless, the fact that treaty law has been negotiated and consented to by publicly authorized entities such as states might at least provide these codes with a prima facie greater degree of political legitimacy than is achieved by the normative codes developed by many alternative candidates (e.g., private AI companies; NGOs; single states in isolation).[ref 127]
Practically speaking, then, achieving TFAI would depend on both a technical component (TFAI agents) and a legal one (AI-guiding treaties). Let us review these in turn.
B. TFAI agents: Technical implementation, operation, and feasibility
In the first place, there is a question of which AI agents should be considered as within the scope of a TFAI framework: Is it just those agents that are deployed by a state in specific domains, or all agents deployed by a contracting state (e.g., to avoid the loophole whereby either state can simply evade the restrictions by routing the prohibited AI actions through agents run by a different government department)? Or is it even all AI agents operating from a contracting state’s territory and subject to that state’s domestic law? For the purposes of our analysis, we will focus on the narrow set, but as we will see,[ref 128] these other options may introduce new legal considerations.
That then shifts us to questions of technical feasibility. The TFAI framework requires that a state’s AI agents be able to access, weigh, interpret, and apply relevant legal norms to their own (planned) goals or conduct in order to assess their legality before taking any action. How feasible is this?
1. Minimal TFAI agent implementation: A treaty-interpreting chain-of-thought loop
There are, to be clear, many possible ways one could go about implementing treaty-alignment training. One could imagine nudging the model towards treaty compliance by affecting the composition of either its pre-training data, its post-training fine-tuning data, or both. In other cases, future developments in AI and in AI alignment could articulate distinct ways by which to implement treaty-following AI propensities, guardrails, or limits.
In the near term, however, one straightforward avenue by which one could seek to implement treaty-following behaviour would leverage the current paradigm, prominent in many AI agents, of utilizing reasoning models. Reasoning models are a 2024 innovation on transformer-based large language models that allows such models to simulate thinking aloud about a problem in a chain of thought (CoT). The model uses the legible CoT to pass notes forward to itself, to run multiple passes or attempts at a question, and to deploy reasoning behaviours—such as expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. All of this has resulted in significantly improved performance on complex and multistep reasoning problems,[ref 129] even as it has also considerably altered the development and diffusion landscape for AI models,[ref 130] along with the levers for its governance.[ref 131]
Note that our claim is not that a CoT-based implementation of treaty alignment is the ideal or most robust avenue to achieving TFAI agents;[ref 132] however, it may be a straightforward avenue by which to understand, test, and grapple with the ability of models to serve in a TFAI agent role.
Concretely, a TFAI agent implemented through a CoT decision loop could work as follows: Prior to accepting a goal X or undertaking an action Y, an AI agent might spend some inference computing time writing out an extended chain-of-thought reasoning process in which it collates or recalls potentially applicable legal provisions of the treaty text, considers their meaning and application in the circumstances before it, and in particular reflects on potential treaty issues entailed by its provided end goal or its planned intermediate conduct towards that goal. Whenever confronted with legal uncertainty, the agent would dedicate further inference time to searching for relevant legal texts and interpretative sources in order to resolve the question and reach a decision over the legality of its goals or actions.
For instance, one staged inference decision-making loop for such a system could involve a reasoning process[ref 133] that iterates through some or all of the following steps:
- AI agent identifies potential treaty issues entailed by the provided end goal or intermediate conduct towards that goal (e.g., it identifies if a goal is facially illegal, or it identifies likely issues with formulating a lawful plan of conduct in service of an otherwise lawful goal).
- AI agent identifies an applicable treaty provision that might potentially (but not clearly) be breached by a planned goal or intermediate conduct; it reasons through possible and plausible interpretations of the provision in light of the applicable approach to treaty interpretation.
- In cases where the treaty text alone would not provide adequate clarity, the AI agent may, depending on the AI-guiding treaty design (discussed below), consider other relevant and applicable norms in international law or the rulings of a designated arbitral body attached to the treaty, in order to establish a ranking of interpretations of the legality of the goal or conduct.
- On this basis, the AI agent evaluates whether the likelihood of its conduct constituting a breach is within the range of “acceptable legal risk” (as potentially defined within the treaty, through arbitral body adjudication, or in other texts).
- If it is not, the AI agent refuses to take the action. If it is, the AI agent will proceed with the plan (or proceed to consider other cost-benefit analyses).
This decision-making loop would conclude in a final assessment of whether particular conduct would be (sufficiently likely to constitute[ref 134]) a breach of a treaty obligation and, if so, in a refusal by the agent to take that action and in the consideration (or suggestion) of alternative action paths.
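To make this loop more concrete, the following sketch shows, in simplified Python, one way such a legality gate could be wired into an agent’s planning stack. It is illustrative only: the names (LegalAssessment, assess_against_treaty, tfai_gate) and the fixed risk threshold are our own assumptions rather than any existing implementation, and the treaty-interpretation step is left as a stub standing in for one or more extended reasoning-model passes over the treaty text and its interpretative sources.

```python
# Illustrative sketch only: all names (LegalAssessment, assess_against_treaty,
# tfai_gate, ACCEPTABLE_LEGAL_RISK) are hypothetical and do not correspond to any
# existing system; a deployed TFAI agent would embed this gate in its own
# planning and tool-use scaffolding.
from dataclasses import dataclass


@dataclass
class LegalAssessment:
    implicated_provisions: list[str]  # treaty provisions the planned conduct may engage
    breach_likelihood: float          # agent's estimate that the conduct breaches the treaty
    reasoning: str                    # chain-of-thought record, retained for later audit


# Threshold for "acceptable legal risk"; in practice this could be set by the treaty
# text, a designated arbitral body, or other agreed sources.
ACCEPTABLE_LEGAL_RISK = 0.05


def assess_against_treaty(goal: str, plan: list[str], treaty_text: str) -> LegalAssessment:
    """Steps 1-3 of the loop: spot potential treaty issues, interpret the relevant
    provisions, and weigh plausible interpretations. Stubbed here; in a real agent this
    would consist of one or more extended reasoning-model passes over the treaty text,
    interpretative sources, and any rulings of an attached arbitral body."""
    raise NotImplementedError("replace with reasoning-model calls")


def tfai_gate(goal: str, plan: list[str], treaty_text: str) -> dict:
    """Steps 4-5 of the loop: refuse, or hand the plan back to the agent's ordinary
    execution logic."""
    assessment = assess_against_treaty(goal, plan, treaty_text)
    if assessment.breach_likelihood > ACCEPTABLE_LEGAL_RISK:
        return {"decision": "refuse",
                "reasoning": assessment.reasoning,
                "suggest_alternatives": True}
    return {"decision": "proceed", "reasoning": assessment.reasoning}
```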
The above is just one example of the decision loops one could implement in TFAI agents to ensure their behaviour remained aligned with the treaty even in novel situations. There are of course many other variations or permutations that could be implemented, such as utilizing some kind of debate- or voting-based processes amongst collectives or teams of AI agents.
In practice, the most appropriate implementation would also consider technical, economic, and political constraints. For instance, for an AI agent to undertake a new and exhaustive legal deep dive for each and every situation encountered might be infeasible, given the constraints on, or costs of, the computing power available for serving such extended inference at scale. However, there are a range of solutions that could streamline these processes. For example, perhaps TFAIs could use legal intuition to decide when it is worth expending time during inference to properly analyse the legality of a goal or action. By analogy, law-following humans do not always consult a lawyer (or even primary legal texts) when they decide how to act; they instead generally rely on prosocial behavioural heuristics that generally keep them out of legal trouble, and generally only consult lawyers when they face legal uncertainty or when those heuristics are likely to be unreliable. Other solutions might include cached databases containing the chain-of-thought reasoning logs of other agents encountering similar situations, or the designation of specialized agents that could serve up legal advice on particular commonly recurring questions. These questions matter, as it is important to ensure that the implementation of TFAI agents does not impose so high a burden upon AI agents’ utility or cost-effectiveness as to offset the benefits of the treaty for the contracting states.
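As a rough illustration of how such streamlining might work, the sketch below (again hypothetical, reusing the tfai_gate function from the earlier sketch) places a cheap screening layer in front of the full decision loop: a cache of verdicts on previously analysed conduct, and a “legal intuition” heuristic that escalates to full treaty interpretation only when a goal or plan does not look routine. A production system might match cached situations by semantic similarity rather than exact hashing.

```python
# Hypothetical screening layer placed in front of tfai_gate() from the earlier sketch.
import hashlib

_legal_cache: dict[str, dict] = {}  # verdicts keyed by a digest of the goal and plan


def _digest(goal: str, plan: list[str]) -> str:
    return hashlib.sha256("|".join([goal, *plan]).encode()).hexdigest()


def screened_tfai_gate(goal: str, plan: list[str], treaty_text: str,
                       looks_legally_routine) -> dict:
    key = _digest(goal, plan)
    if key in _legal_cache:                 # reuse earlier reasoning on identical conduct
        return _legal_cache[key]
    if looks_legally_routine(goal, plan):   # cheap "legal intuition" heuristic
        verdict = {"decision": "proceed", "reasoning": "screened as legally routine"}
    else:
        verdict = tfai_gate(goal, plan, treaty_text)  # escalate to the full decision loop
    _legal_cache[key] = verdict
    return verdict
```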
2. Technical feasibility of TFAI agents
As the above discussion shows, TFAI agents would need to be capable of a range of complex interpretative tasks.
Certainly, we emphasize that there remain significant technical challenges and limitations to today’s AI systems,[ref 135] which warn against a direct implementation of TFAI. Nonetheless, although there are important hurdles to overcome, there are also compelling reasons to expect that contemporary AI models are increasingly adept at interpreting (and following) legal rules and may soon do so at the level required for TFAI.
a) The rise of legal AI in international law
Recent years have seen growing attention to the ways that AI systems can be used in support of the legal profession, in tasks ranging from routine case management and compliance support (by providing legal information)[ref 136] to drafting legal texts[ref 137] or even outright legal interpretation.[ref 138]
This has been gradually joined by recent work on the ways in which AI systems could support international law[ref 139] and on what effects this may have on the concepts and modes of development of the international legal system.[ref 140] To date, however, much of this latter work has focused on how AI systems and agents could indirectly inform global governance through use in analysing data for trends of global concern;[ref 141] training diplomats, humanitarian and relief workers, or mediators in simulated interactions with stakeholders they may encounter in their work;[ref 142] or improving the inclusion of marginalized groups in UN decision-making processes.[ref 143]
Others have explored how AI agent systems could be used to support the functioning of international law specifically, such as through monitoring (state or individual) conduct and compliance with international legal obligations;[ref 144] categorizing datasets, automating decision rules, and generating documents;[ref 145] or facilitating proceedings at arbitral tribunals,[ref 146] treaty bodies,[ref 147] or international courts.[ref 148] Other work has considered how AI systems can support diplomatic negotiations,[ref 149] help inform legal analysis by finding evidence of state practice,[ref 150] or even aid in generating draft treaty texts.[ref 151]
However, for the purposes of designing TFAI agents, we are interested less in the use of AI systems in making or developing international law, or in indirectly aiding human interpretation of the law, than in their potential to directly and autonomously interpret international treaties or international law in order to guide their own behaviour.
b) Trends and factors in AI agents’ legal-reasoning capabilities
Significantly, the prospects for TFAI agents engaging autonomously in the interpretation of legal norms may be increasingly plausible. In recent years, AI systems have demonstrated increasingly competent performance at tasks involving legal reasoning, interpretation, and the application of legal norms to new cases.[ref 152]
Indeed, AI models perform increasingly well at interpreting not just national legislation but also international law. While international law scholars previously expressed skepticism over whether international law would offer a sufficiently rich corpus of textual data to train AI models,[ref 153] many have since become more optimistic, suggesting that there may in fact be a sufficiently ample corpus of international legal documents to support such training. For instance, already in 2020, Deeks noted that:
“[o]ne key reason to think that international legal technology has a bright future is that there is a vast range of data to undergird it. …there are a variety of digital sources of text that might serve as the basis for the kinds of text-as-data analyses that will be useful to states. This includes UN databases of Security Council and General Assembly documents, collections of treaties and their travaux preparatoires (which are the official records of negotiations), European Court of Human Rights caselaw, international arbitral awards, databases of specialized agencies such as the International Civil Aviation Organization, state archives and digests, data collected by a state’s own intelligence agencies and diplomats (memorialized in internal memoranda and cables), states’ notifications to the Security Council about actions taken in self-defense, legal blogs, the UN Yearbook, reports by and submission to UN human rights bodies, news reports, and databases of foreign statutes. Each of these collections contains thousands of documents, which—on the one hand—makes it difficult for international lawyers to process all of the information and—on the other hand provides the type of ‘big data’ that makes text-as-data tools effective and efficient.”[ref 154]
It appears, then, that modern LLM-based AI systems have been able to draw on a sufficiently ample corpus of international legal documents, or have managed to leverage transfer learning[ref 155] from domestic legal documents, or both, to achieve remarkable performance on questions of international legal interpretation.
Indeed, it is increasingly likely that AI models will not only draw on indirect knowledge of international legal texts included in their pre-training data, but will also be able to refer to those texts live during inference. After all, recent advances in AI systems have produced models that can rapidly process and query increasingly large (libraries of) documents within their context window.[ref 156] Since mid-2023, the longest LLM context windows have grown by about 30x per year, and leading LLMs' ability to leverage that input has improved even faster.[ref 157] Beyond the significant implications this trend may have for the general capabilities and development paradigms of advanced AI systems,[ref 158] it may also strengthen the case for functional TFAI agents: it suggests that AI agents may incorporate lengthy treaties[ref 159]—and even large parts of the entire international legal corpus[ref 160]—within their context window, ensuring that these are directly available for inference-time legal analysis.[ref 161]
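As a rough and purely illustrative feasibility check of this claim, assuming roughly four characters per token and a hypothetical one-million-token context window, even a very long treaty text would occupy only a fraction of the available context:

```python
# Rough, purely illustrative feasibility check. Assumptions: ~4 characters per token
# and a hypothetical 1,000,000-token context window; no real tokenizer is used.

def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)


def fits_in_context(treaty_text: str,
                    context_window_tokens: int = 1_000_000,
                    reserved_for_task: int = 50_000) -> bool:
    """Check whether a full treaty text, plus working room, fits in the context window."""
    return rough_token_estimate(treaty_text) + reserved_for_task <= context_window_tokens


# Even a very long instrument of ~2 million characters would occupy roughly 500,000
# tokens under these assumptions, leaving ample room for the task at hand.
print(fits_in_context("x" * 2_000_000))  # True under the stated assumptions
```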
Indeed, recent experiments conducted by international lawyers have shown remarkable gains in the ability of even publicly available non-frontier LLM chatbots to conduct robust exercises of legal interpretation in international law. This has included not just questions involving direct treaty interpretation, but also questions regarding the customary international law status of a norm.[ref 162] At least on their face, the resulting interpretations are frequently—or at least appear to be—if not flawless, then nonetheless coherent, correct, and compelling to experienced international legal scholars or judges.[ref 163] For instance, in one experiment, AI-generated memorials were submitted anonymously to the 2025 edition of the prestigious Jessup International Law Moot Court Competition, receiving average to superior scores—and in some cases near-perfect scores.[ref 164] That is not to say that their judgments always matched human patterns, however: in another test involving a simulated appeal in an international war crimes case, GPT-4o's judgments resembled those of students (but not professional judges) in that they were strongly shaped by judicial precedents but not by sympathetic portrayals of defendants.[ref 165]
3. Outstanding technical challenges for TFAI agents
AI’s legal-reasoning performance today is not without flaws. Indeed, there are a number of outstanding technical hurdles that will need to be addressed to fully realize the promise of TFAI.
a) Robustness of AI legal reasoning and law-alignment techniques
Some of these challenges relate to the robustness of AI's legal-reasoning performance, in terms of current LLMs' ability to robustly follow textual rules[ref 166] and to conduct open-ended, multistep legal reasoning.[ref 167] Problematically, these models also remain highly sensitive in their outputs to even slight variations in input prompts;[ref 168] moreover, a growing number of judicial cases have involved disputes over the use of AI systems to draft documents, with hallucinated AI-generated content being brought before a court.[ref 169] Significantly, hallucination risks have persisted even when legal research providers have attempted to use methods such as retrieval-augmented generation (RAG).[ref 170]
Indeed, even in contexts where LLMs perform well on legal tests, proper substantive legal analysis that actually applies the correct methodologies of legal interpretation remains amongst their more challenging tasks. For instance, in the aforementioned moot court experiment, judges found AI-generated memorials to be strong in organization and clarity, but still deficient in substantive analysis.[ref 171] Another study of various legal puzzles found that current AI models cannot yet reliably find “legal zero-days” (i.e., latent vulnerabilities in legal frameworks).[ref 172]
Compounding these problems, the TFAI framework—along with many other governance measures for AI agents—faces a range of challenges to do with benchmarking and evaluation. That is, there are significant methodological challenges around meaningfully and robustly evaluating the performance of AI agents:[ref 173] it is difficult to conduct appropriate evaluation concept development (i.e., refining and systematizing evaluation concepts and their related metrics) for large state and action spaces with diverse solutions; it can be difficult to understand how proxy task performance reflects real-world risks; there are challenges in determining the system design set-up (i.e., understanding how task performance relates to the external scaffolds or tools made available to the agent); and there are challenges in scoring performance and analysing results (e.g., meaningfully comparing different modes of interaction between humans and AI systems), as well as practical challenges in dealing with the more complex supply chains around AI agents, amongst other issues.[ref 174]
These evaluation challenges around agents converge and intersect with a set of benchmarking problems affecting the use of AI systems for real-world legal tasks,[ref 175] with recent work identifying issues such as subjective labeling, training data leakage, and the difficulty of evaluating unstructured text as creating a pressing need for more robust benchmarking practices for legal AI.[ref 176] While this need not in principle pose a categorical barrier to the development of functional TFAI agents, it will likely hinder progress towards them; worse, it will challenge our ability to fully and robustly assess whether and when such systems are in fact ready for such a role.
These challenges are compounded by outstanding technical questions over the feasibility of many existing approaches to guaranteeing the effective and enduring law alignment (or treaty alignment) of AI systems.[ref 177] Many current training techniques remain susceptible to AI agents engaging in alignment faking (e.g., strategically adjusting their behaviour when they recognize that they are under evaluation),[ref 178] as well as to emergent misalignment, whereby models violate clear and present prohibitions in their instructions when these conflict with perceived primary goals.[ref 179] This all suggests that, like the domestic LFAI framework, a TFAI framework remains dependent on further technical research into embedding more durable controls on model behaviour, controls that cannot be overridden by sufficiently strong incentives.[ref 180]
b) Unintended or intended bias
Second, there are outstanding challenges around the potential for unintended (or intended) bias in AI models' legal responses. Unintended bias can be seen, for instance, where some LLMs demonstrate demographic disparities in attributing human rights to different identity groups.[ref 181]
However, there may also be risks of (the perception of) intended bias, given the partial interests represented by many publicly available AI models. For instance, the values and dispositions of existing LLMs are deeply shaped by the interests of the private companies that develop them; this may even leave their responses open to intentional manipulation, whether undertaken through fine-tuning, through the application of hidden prompts, or through filters on the models’ outputs.[ref 182] Such concerns would not be allayed—and in some ways might be further exacerbated—even if TFAI models were developed or offered not by private actors but by a particular government.[ref 183]
There are some measures that might mitigate such suspicions. Treaty parties could commit, for instance, to making public both the system prompt (the hidden text that precedes every user interaction with the system, reminding the AI system of its role and values) and the full model spec (in this case, the AI-guiding treaty), as labs such as OpenAI, Anthropic, and x.AI have done.[ref 184] However, this creates new verification challenges around ensuring that the parties' TFAI agents in fact are—and remain—deployed with these inputs.[ref 185]
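One minimal (and admittedly partial) sketch of such a commitment: parties could publish a cryptographic hash of the agreed system prompt and treaty text, against which any later-disclosed deployment configuration can be checked, though this does not by itself verify what a remote deployment is actually running:

```python
# Minimal sketch (hypothetical): each party publishes a cryptographic hash of the
# agreed system prompt and treaty text, so that any disclosed deployment configuration
# can later be checked against that public commitment. This detects mismatches in
# disclosed inputs only; it does not verify what a remote deployment is actually running.

import hashlib


def commitment_hash(system_prompt: str, treaty_text: str) -> str:
    """Hash the concatenation of the public system prompt and the AI-guiding treaty."""
    payload = system_prompt.encode("utf-8") + b"\x00" + treaty_text.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def matches_published(deployed_prompt: str, deployed_treaty: str,
                      published_digest: str) -> bool:
    """Compare a disclosed deployment configuration against the published commitment."""
    return commitment_hash(deployed_prompt, deployed_treaty) == published_digest
```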
c) Lack of faithfulness in chain-of-thought legal reasoning: sycophancy, sophistry, and obfuscation
Third, insofar as we make the (reasonable, conservative) assumption that TFAI agents will be built along the lines of existing LLM-based architectures—and that the technical process of treaty alignment may leverage techniques such as fine-tuning, model specs, and chain-of-thought monitoring—we must expect such agents to face a number of outstanding technical challenges associated with that paradigm. Specifically, the TFAI framework will need to overcome challenges relating to the lack of faithfulness of the legal reasoning that AI agents present themselves as engaging in (e.g., in their chain-of-thought traces) when (ostensibly) reaching legal conclusions. This unfaithfulness has various sources, including sycophancy, sophistry and rationalization, and outright obfuscation.
Critically, even though LLMs do display a degree of high-level behavioural self-awareness—as seen through their ability to describe and articulate features of their own behaviour[ref 186]—it remains contested to what degree such self-reports can be said to be the consequence of meaningful or reliable introspection.[ref 187]
Moreover, even if a model were capable of such introspection, its reasoning traces would not necessarily convey that introspection reliably. For instance, as legal scholars such as Ashley Deeks and Duncan Hollis have charted in recent empirical experiments, the faithfulness of AI models' chain-of-thought transcripts cannot be taken for granted, creating a challenge in "differentiating how [an LLM's] responses are being constructed for us versus what it represents itself to be doing."[ref 188] They find that even when models are able to describe the correct methodology for interpretation and present seemingly plausible legal conclusions for particular questions, they may do so in ways that fail to correctly apply those methods.[ref 189] To be precise, while the tested LLMs could offer correct descriptions of the appropriate methodology for identifying customary international law (CIL) and could offer facially plausible descriptions of the applicable CIL on particular doctrinal questions,[ref 190] when pressed to explain how they had arrived at these answers, it became clear that the models had failed to actually apply that methodology, instead conducting a general literature search that drew on doctrinally inappropriate sources (e.g., non-profit reports) rather than appropriate primary sources as evidence of state practice and opinio juris.[ref 191]
Significantly, such infidelity in the explanations given in chain of thought by AI models is not incidental but may be deeply pervasive. Even early research on large language models found it easy to introduce biasing features into model inputs in ways that led the model to systematically misrepresent the reasoning behind its own decision or prediction.[ref 192] Indeed, some have argued that, given the pervasive and systematic unfaithfulness of chain-of-thought outputs to internal model computations, chain-of-thought reasoning should not be considered a method for interpreting or explaining models' underlying reasoning at all.[ref 193]
Worse, there may be other ways by which unfaithful explanations or even outright legal obfuscation could be unintentionally trained into AI models, rendering their chain-of-thought reasoning less faithful and trustworthy still.[ref 194] For instance, reinforcement learning from human feedback (RLHF), the most popular post-training method used to shape and refine the behaviour of LLMs for public use,[ref 195] has been found to have a number of shortcomings and limitations.[ref 196] One drawback is the propensity of RLHF fine-tuning to encourage models to offer responses that match user beliefs or preferences over truthfulness—a tendency described as "sycophancy".[ref 197] This means that a TFAI agent could be at risk of offering strained or even flawed legal interpretations—or legal interpretations that assume possible-but-very-unlikely facts on the ground when these would afford an interpretation more favourable to its principal's interests—in assessing the legality of its actions. These challenges can create a significant risk of individual or organizational overtrust[ref 198]—as well as of conflict amongst treaty parties over opposing interpretations pursued by their respective TFAI agents. Another challenge is the propensity of RLHF to induce sophistry, producing models that are better at convincing humans that they are right without actually improving their competence at the task in question.[ref 199]
Similar faithfulness challenges beset other proposed paradigms for ensuring that AI agents do not take harmful actions; for instance, one recently proposed set-up to mitigate such risks aims to monitor not only the final actions taken by the system, but also the chain of thought (i.e., intermediate reasoning steps) the model uses to decide on those actions.[ref 200] However, while research has suggested that the chains of thought of reasoning models are more faithful than those of traditional models (in that they are more able to describe how a cue in their prompt influences their answer),[ref 201] there are shortfalls to this approach.[ref 202]
For one, reasoning models produce less faithful reasoning on many tasks as they become larger and more capable;[ref 203] moreover, not only can chain-of-thought reasoning traces still contain misleading rationalizations,[ref 204] but experiments show that fine-tuning AI systems through reinforcement learning on such reasoning traces, in a way that penalizes open misbehaviour, leads the models instead to learn to hide the penalized "thoughts" from their explicit chain of thought.[ref 205] Such obfuscation of the actual reasoning steps can be achieved either by internalizing them into latent-space computation or by encoding them within the chain of thought.[ref 206]
There are limits to this: in some contexts, a faithful chain of thought is required for the AI model to meaningfully carry out complex, multistep reasoning, since unfaithful steps (involving rationalization) introduce flaws that impede the system's ability to solve certain problems; in such cases, the chain of thought can be expected to faithfully encode the underlying reasoning at play.[ref 207] Unfortunately, however, it seems unlikely that TFAI agents will converge towards "faithfulness by necessity": there appear to be many legal questions for which inserting fudged premises or flawed rationalizations into the process of interpretation makes it easier to reach (a particular, or indeed any) legal solution. This suggests that, in many cases, rather than naturally converging on a high-quality, valid, and faithful legal-reasoning process simply in order to be able to solve a legal question, TFAI agents could easily end up fooling themselves in the chain of thought.
Indeed, the faithfulness challenges are exacerbated by additional concerns over the sheer interpretability and legibility of the chain of thought of these models. After all, when models are rewarded for the correct answer, without any checks on how they got there, their chain-of-thought traces will likely drift away from ones that are easily comprehensible—as illustrated by the DeepSeek-R1-Zero model’s initial tendency to mix together English and Chinese in its reasoning.[ref 208] In other cases, there may be a risk that LLMs are induced to encode intermediate reasoning steps in the generated CoT text in a way that is not understandable to human readers;[ref 209] in some cases, this results in models developing a general encoding scheme to obfuscate their reasoning.[ref 210]
Such behaviours not only pose a risk to the explainability or faithfulness of TFAI agents, but also create risks of undesirable agent collusion amongst different interacting TFAI agents,[ref 211] as their communications may come to contain steganographically hidden information (e.g., about legally invalid interpretive exploits they may utilize in expanding their domain of allowable actions), or even may involve “subliminal learning” of each other’s preferences or biases.[ref 212]
Finally, the role of explicit chain-of-thought traces in driving the performance—and enabling the evaluation—of reasoning models might be undercut by future innovations. For instance, recent work has seen developments in “continuous thought” models, which reason internally using vectors (in what has been called “neuralese”).[ref 213] Because such models do not have to pass notes to themselves in an explicit CoT, they have no interpretable output that could be used to monitor them, interpret their reasoning, or even predict their behaviour.[ref 214] Given that this would threaten not just the faithfulness but even the monitorability of these agents, it has been argued that AI developers, governments, and other stakeholders should adopt a range of coordination mechanisms to preserve the monitorability of AI architectures.[ref 215]
All this highlights the importance of ensuring that reason-giving TFAI agents are not trained or developed in a manner that would induce greater rates of illegibility or fabrication (of plausible-seeming but likely incorrect legal interpretations) into their legal analysis; either would further degrade the faithfulness of their reasoning reports and potentially erode the basis on which trusted TFAI agents might operate.
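To make the kind of chain-of-thought monitoring discussed above concrete, a minimal and purely illustrative sketch might look as follows (real monitors would typically rely on a second model as judge; the heuristics here are placeholders):

```python
# Minimal sketch of a chain-of-thought monitor of the kind discussed above. This is
# purely illustrative: real monitors would typically use a second model as a judge;
# the heuristics below (provision citations, character-level legibility) are placeholders.

import re
from typing import List


def cites_treaty_provision(trace: str) -> bool:
    """Does the trace reference at least one treaty provision (e.g., 'Article 5')?"""
    return re.search(r"\bArticle\s+\d+", trace) is not None


def legibility_score(trace: str) -> float:
    """Crude legibility proxy: share of ordinary prose characters in the trace."""
    if not trace:
        return 0.0
    ordinary = sum(ch.isalnum() or ch.isspace() or ch in ".,;:()'\"-" for ch in trace)
    return ordinary / len(trace)


def flag_trace(trace: str, min_legibility: float = 0.9) -> List[str]:
    """Return reasons (if any) for escalating a reasoning trace to human review."""
    flags = []
    if not cites_treaty_provision(trace):
        flags.append("no explicit treaty provision cited")
    if legibility_score(trace) < min_legibility:
        flags.append("trace appears illegible or possibly encoded")
    return flags
```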
4. Open (or orthogonal) questions for TFAI agents
In addition to these outstanding technical challenges that may beset the design, functioning, or verification of TFAI agents, there are also a number of deeper underlying questions to be resolved or decided in moving forward with the TFAI framework—including whether, or in what way, these agents' actions in compliance with an AI-guiding treaty's norms should be considered cases of true (or at least appropriate) legal reasoning, or merely consistent and predictable patterns in treaty application.[ref 216]
a) Do TFAI agents need to be explainable or merely monitorable?
The need to understand AI systems' decision-making is hardly new, as emphasized by the well-established field of explainable AI (XAI),[ref 217] and it has also been stressed in judicial contexts.[ref 218] In fact, alongside chain-of-thought traces, there are many other (and arguably superior) approaches to understanding the actual inner workings of AI models. For instance, recent years have seen some progress in the field of mechanistic interpretability, which aims at understanding the inner representation of concepts in LLMs.[ref 219]
However, one could question whether full explainability—whether delivered through highly faithful and legible CoT traces, through mechanistic interpretability, or through some other approach—is even strictly necessary for AI systems to qualify for use as TFAI agents.
Many legal scholars might emphasize the legibility and faithfulness of a model’s legal-reasoning traces as a key proviso, especially under legal theories built upon the importance of (judicial) reason-giving[ref 220] as well as under emerging international legal theories of appropriate accountability in global administrative law.[ref 221]
Moreover, the lack of faithfulness could impose a significant political or legitimacy challenge on the TFAI framework. After all, to remain acceptable to—and trusted by—all treaty parties, it is possible that a TFAI framework would ideally ensure that TFAI agents are able to reason through the legality of their goals or conduct in a way that is not just legible and plausibly legally valid in its conclusions, but which in fact applies the appropriate methods of treaty interpretation (either under international law or as agreed upon by the contracting parties).[ref 222]
On the other hand, a more pragmatic perspective might not see faithfulness as strictly politically necessary to AI-guiding treaties. For instance, some AI safety work has argued that, even if we cannot use a model's chain-of-thought record to faithfully understand its actual underlying (legal) reasoning, we might still use it as a basis for monitorability, since it can allow us to robustly predict the conditions under which the model is likely to change its judgment of the legality of particular behaviour.[ref 223] This implies that the negotiating treaty parties could simply stress-test TFAI agents until they agree that the agents appear to reach correct (or at least mutually acceptable) legal interpretations of the treaty in all tested cases, free of undue influence or bias, even if the parties cannot directly confirm that the agents use the conventionally correct methodology in reaching those conclusions.
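To make this stress-testing idea concrete, a minimal sketch, with purely hypothetical stand-in agents, might look as follows:

```python
# Illustrative stress-test harness. 'agent_a' and 'agent_b' stand in for calls to
# each party's candidate TFAI agent; here they are simple callables returning a
# verdict. Nothing below reflects an actual evaluation protocol.

from typing import Callable, List

Verdict = bool  # True = "permitted under the treaty", False = "prohibited"


def agreement_rate(agent_a: Callable[[str], Verdict],
                   agent_b: Callable[[str], Verdict],
                   scenarios: List[str]) -> float:
    """Fraction of scenarios on which the two agents reach the same verdict."""
    if not scenarios:
        return 1.0
    agreed = sum(agent_a(s) == agent_b(s) for s in scenarios)
    return agreed / len(scenarios)


def acceptable_for_deployment(rate: float, threshold: float = 0.99) -> bool:
    """Parties might agree ex ante on a minimum agreement threshold before deployment."""
    return rate >= threshold


# Toy usage with stand-in agents that both prohibit weight transfers:
scenarios = ["transfer model weights to a third state", "run a routine audit query"]
toy_agent = lambda s: "transfer" not in s
print(acceptable_for_deployment(agreement_rate(toy_agent, toy_agent, scenarios)))  # True
```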
Of course, even in this more pragmatic perspective, faithfulness could still be technically important in understanding the sources of interpretative error in the aftermath of a (supposedly) treaty-aligned TFAI agent violating obvious obligations; and it would (therefore) be politically important to ensuring state party trust in the stability, predictability, and robustness of the treaty-alignment checks. However, in such a case, the question of whether the model reaches the appropriate legal conclusions through the correct (or even a distinctly human) method of legal reasoning is ultimately subsidiary to the question of whether a model robustly and predictably reaches interpretations that are acceptable to the contracting parties.
Proponents of this pragmatic approach might reasonably suggest that this would not put us in a different situation from the one we already accept with human judges. After all, we already accept that we cannot read the mind of a judge and that we often need to take their claimed legal reasoning at face value—not in the sense that we must accept the substance of their proffered legal arguments uncritically, but in the sense that we need to assume it faithfully represents the internal thought process that underpinned their judgment. Even amongst human judges, after all, we appear to rely on a form of monitorability (i.e., the consistency of judges' judgments across similar cases and their incorruptibility to inadmissible factors, considerations, biases, or interests) when evaluating the quality and integrity of their legal reasoning across cases.[ref 224]
To this, however, it could be countered that there is an important difference between human judges and AI systems: we have some prima facie (psychological, neurological, and biological) reasons to assume that the underlying legal concepts and principles used by human interpreters in their legal reasoning are closely similar to (or at least convergent with) those originally used by the drafters of the to-be-interpreted laws, but we may not be able to—or should not—make such an assumption for AI systems.
b) Do TFAI agents think like human judges? Evaluating AI-human legal concept alignment
However, if we cannot trust the faithfulness of a CoT trace, could we still, in fact, trust in underlying cognitive convergence or legal concept alignment between AIs and humans?
Importantly, the question of whether or how particular AI systems can perform at the human level on one, some, or all tasks is subtly but importantly different from the question of whether, in doing so, they engage in mechanistic strategies (i.e., thinking processes) that are fundamentally human-like.[ref 225] That is, to what degree do AI agents (built along the current LLM-based paradigm) reproduce, match, or merely mimic human cognition when they engage in processes of legal interpretation? How would we evaluate this?
These questions turn on outstanding scientific debates over whether—and how—AI systems (and particularly current LLM-based systems) match or correspond to human cognition. This is a question that can be approached at different levels by considering these human-AI (dis)similarities at the (1) architectural (i.e., neurological), (2) behavioural, or (3) mechanistic levels.
i) AI and human cognition in a neuroscientific perspective
First, from a neuroscientific perspective, there may be a remarkable amount of overlap or similarity between the computational structures and techniques exhibited by modern AI systems and those found in biology.[ref 226] This is remarkable given that the name "neural network" is, in principle, merely a leftover artefact from the technique's context of discovery.[ref 227] Nonetheless, recent years have seen markedly productive exchanges between the fields of neuroscience and machine learning, with new AI models inspired by the brain and brain models inspired by AI.[ref 228]
It is therefore at least revealing that a number of accounts in computational neuroscience hold that the human brain can itself be understood as a deep reinforcement learning model, albeit one idiosyncratically shaped by biological constraints.[ref 229] Indeed, leading theories of human cognition—such as the predictive-processing and active-inference paradigms—treat the human brain as a system for predicting the next sensory input from large amounts of previous sensory input,[ref 230] a description not fundamentally distinct from the view of LLMs as engaged in mere textual prediction.[ref 231]
Furthermore, neural networks have learned in training a range of specialized circuits which neuroscientists have later discovered also exist in the brain,[ref 232] potentially combining RL models, recurrent and convolutional networks, forms of backpropagation, and predictive coding, amongst other techniques.[ref 233] It has similarly been suggested that human visual perception is based on deep neural networks that work similarly to artificial ones.[ref 234] There is also evidence that multimodal LLMs process different types of data (e.g., visual or language-based) in ways similar to how the human brain may perform such tasks—by relying on mechanisms for abstractly and centrally processing data from diverse modalities, in a manner akin to the "semantic hub" in the anterior temporal lobe of the human brain.[ref 235]
Neural networks have also been described as computationally plausible models of human language processing. While it is true that the amount of training data these models depend on significantly exceeds that required by a human child to learn language, much of this extra data may in fact be superfluous: one experiment trained a GPT-2 model on a "mere" 100-million-word training dataset—an amount similar to what children are estimated to be exposed to in the first 10 years of life—and found that the resulting models were able to accurately predict fMRI-measured human brain responses to language.[ref 236] In fact, the total sensory input of an infant in its first year of life may be on the order of the same number of bits as an LLM training set (albeit in embodiment rather than text).[ref 237]
These neuroscientific approaches also provide at least some support for a scale-based approach to AI. For instance, some modern accounts of the evolution of human cognition emphasize the strict continuity in the emergence of presumed uniquely human cognitive abilities, seeing these as the result of steady quantitative increases in the global capacity to process information.[ref 238] These theories imply that simple quantitative differences in the scale of animal and human cognition, rather than any deep differences in architectural features or traits, account for the observed differences between the two, while also explaining observed regularities across various domains of cognition as well as various phenomena within child development.[ref 239] Other work has disagreed, emphasizing instead the phylogenetic timing of distinct breakthroughs in behavioural abilities during brain evolution in the human lineage.[ref 240] Nonetheless, such findings, along with the fact that the human brain is, biologically, simply a scaled-up primate brain in its cellular composition and metabolic cost,[ref 241] suggest that there may not be any secret design ingredient necessary for human-level intelligence and that, rather than being solely dependent on key representational capabilities, human-level general cognition may to an important degree be a matter of scale even amongst humans and other animals.[ref 242] If so, we may have reason to expect similar outcomes from merely scaling up the global information-processing capabilities of AI systems.
However, while all this offers intriguing evidence, it is not uncontested. More importantly, even if we were to grant some degree of deep architectural similarity between humans and AIs, this is far from sufficient to establish that AI systems necessarily represent high-level concepts—and, critically, legal concepts—in the same way as humans do, or that they reason about them in the same manner as we do.
ii) (Dis)similarities between AI and humans in behavioural perspective
A second avenue for investigating the cognitive similarity between humans and AI systems, therefore, focuses on comparing behavioural patterns.
Notably, such work has found that LLMs exhibit some of the same cognitive biases as humans, including their distinct susceptibility to fallacies and framing effects[ref 243] or the inability—especially of more powerful LLMs—to produce truly random sequences (such as in calling coin flips).[ref 244] However, this work has also found that even as AI models demonstrate common human biases in social, moral, and strategic decision-making domains, they also demonstrate divergences from human patterns.[ref 245]
Moreover, in some cases, otherwise human-equivalent AI capabilities can be interfered with in seemingly innocuous ways which would not throw off human cognition.[ref 246] Likewise, while tests of analogical reasoning tasks show that LLMs can match humans in some variations of novel analogical reasoning tasks, they respond differently in some task variations;[ref 247] this implies that even where current AI approaches could offer a possible model of human-level analogical reasoning, their underlying processes in doing so are not necessarily human-like.[ref 248]
This supports the general idea that there are some divergences in the types of cognitive systems represented in humans and in AI systems; however, it again remains inconclusive whether these differences would categorically preclude AI agents from engaging in (or approximating) certain relevant processes of legal reasoning.
iii) AI-human concept alignment from the perspective of mechanistic interpretability
Third, we can turn to the most direct approach to understanding whether (or in what sense) current LLM-based AI models are able to truly utilize the same legal concepts as those leveraged by humans: this approach draws on mechanistic interpretability and on the emerging science of "representational alignment" between biological and artificial information-processing systems.[ref 249]
There are some domains, such as in the processing of visual scenes, where it appears that high-level representations embedded in large language models are similar to those embedded in the human brain:[ref 250] LLMs and multimodal LLMs, for instance, have been found to develop human-like conceptual representations of physical objects.[ref 251] Other researchers have even argued for the existence of “representation universality” not just amongst different artificial neural networks, but even amongst neural nets and human brains, which end up representing certain types of information similarly.[ref 252] In fact, some research indicates that LLMs might mirror human brain mechanisms and even neural activity patterns involved in tasks involving the description of concepts or abstract reasoning, suggesting a remarkable degree of “neurocognitive alignment”.[ref 253]
At the same time, as above, there are also clear cases of models adopting atypical, and very non-human-like mechanisms to perform even simple cognitive tasks such as mathematical addition,[ref 254] including mechanistic strategies that might not generalize or transfer across to other domains.[ref 255]
Significantly, then, while research suggests that vector-based language models offer one compelling account of human conceptual representation—in that they can, in principle, handle the compositional, structured, and symbolic properties required for human concepts[ref 256]—this again does not mean that, in fact, modern LLMs have acquired these specific concept representations in relevant domains (such as law).
c) Does AI-human legal concept alignment matter for the TFAI framework?
Ultimately, then, while each of these lines of evidence—neuroscientific architectural similarity, behavioural dispositions, and alignment of concepts and mechanistic strategies—offers some ground to assume a (to some, perhaps surprising) degree of cognitive similarity between AIs and humans, they clearly fail to establish full cognitive similarity or alignment over concepts or reasoning approaches. In fact, they offer some ground for assuming that full concept alignment—that is, fully human-like reasoning—is not presently achieved by LLM-based AI systems.
Clearly, then, significant work remains to be done, with these fields having some way to go before we can conclusively establish that human paradigms or approaches in legal reasoning successfully translate across to AI cognition. However, even if we assume that AI agents reason about the law differently than humans do, (when) would this actually matter to the TFAI framework?
On the one hand, it has been argued that "concept alignment" between humans and AIs is a general prerequisite for any form of true AI value alignment;[ref 257] this would imply it is critical for any form of deep law alignment or treaty alignment as well.[ref 258] On the other hand, it has been suggested that evaluating the cognitive capacities of LLMs requires overcoming anthropocentric biases, and that we should be specifically wary of dismissing LLM mechanistic strategies that differ from those used by humans as somehow not genuinely competent.[ref 259] In this perspective, an AI system which invariably reached (legally) valid conclusions should be accepted as an (adequately) competent legal reasoner, even if we had suspicions or proof that it reached those conclusions through very different routes.
This could create potential challenges: if (1) we cannot trust the faithfulness of a TFAI agent's reasoning traces, and if (2) we cannot (per the preceding discussion) assume cognitive alignment between that agent and a human legal reasoner, then there is no clear mechanism by which to verify whether the legal interpretations conducted by a particular TFAI agent in fact follow the established and recognized approaches to treaty interpretation under international law.[ref 260]
Once again, whether this constitutes a hurdle to the TFAI framework may ultimately be a political question: states could decide that they will only accept TFAI agents that provably engage in the precise legal-reasoning steps that humans do, and so reject any TFAI agents for which such a case cannot be made. Or they might decide that, even if such a guarantee is not on the table, they are still happy to adopt and utilize TFAI agents, so long as their legal interpretations are robustly aligned with the interpretations that humans would come to (or which their principals agree they should come to).
d) Human-TFAI agent fine-tuning and scalable oversight challenges
Finally, there are a number of more practical questions around the feasibility of maintaining appropriate oversight over TFAI agents operating within a TFAI framework.
To be clear, some of the challenges and barriers to the robust use of TFAI agents (such as risks of unintended bias or of inappropriate or unfaithful reasoning traces) could well be addressed through a range of technical and policy measures taken at various stages in the development and deployment of TFAI agents. For instance, one could ensure the adoption of adequate RLHF fine-tuning of TFAI agents, and/or ongoing validation, oversight, and review, by experienced (international) legal professionals.[ref 261]
However, beyond creating additional costs that would reduce the cost-effectiveness or competitiveness of AI agent deployments, there would be additional practical questions that would need clarification: What skills should human international lawyers have in order to effectively spot and call out legal sophistry? Moreover, what should be the specialization or background of the international lawyers used in such fine-tuning or oversight arrangements? This matters, since different legal professionals may (implicitly) favour different norms or regimes within the fragmented system of international law.[ref 262]
These challenges are exacerbated by the fact that any arrangements for human oversight of AI agents' continued alignment with the norms of a treaty would run into the challenge of "scalable oversight"—the established problem of "supervising systems that potentially outperform us on most skills relevant to the task at hand."[ref 263] That is, as AI agents advance in sophistication and reasoning capability, how might human deployers and observers (or any ancillary AI agents) reliably distinguish between true TFAIs and functional AI henchmen (or even misaligned AI systems)[ref 264] that merely appear to be treaty-following when observed but will violate the treaty whenever they are unmonitored? These challenges of scalable oversight may be especially severe in domains where it is unclear how, whether, or when oversight by humans, or by intermediary, weaker AI systems over stronger AI systems, can meaningfully scale up.[ref 265] That challenge may be especially significant in the legal context, because a sufficiently high level of legal-reasoning competence may enable AI agents to offer legal justifications for their courses of action that are so sophisticated as to make flawed rationalizations functionally undetectable.
—
The above all represent important technical (as well as political) challenges to be addressed, along with significant open questions to be resolved. They highlight that, for the time being, human lawyers and judges should exercise caution, avoid abdicating interpretative responsibility to AI models, and instead aim to formulate their own independent legal arguments.
However, these challenges need not prove intrinsic or permanent barriers to the productive deployment of AI agents within a TFAI framework. Just as recent innovations have helped mitigate the early propensity of AI models to hallucinate facts,[ref 266] there will be many ways by which AI agents can be designed, trained,[ref 267] or scaffolded so as to be capable of sufficiently proficient legal reasoning to underwrite AI-guiding treaties,[ref 268] especially if there are guarantees that final interpretative authority remains vested in appropriate (and independent) human expertise. Indeed, one can support the use of TFAI agents (and AI-guiding treaties) as a specific commitment mechanism for shoring up advanced AI agreements, while simultaneously believing that human lawyers seeking to interpret international law should generally limit their use of AI systems, if only to avoid self-reinforcing interpretative loops.
To be clear, we emphasize that today’s agentic AI systems are likely still too brittle, unreliable, and technically limited in key respects to lend themselves to direct implementation of a TFAI framework. However, our proposal here in particular considers the more fully capable AI agents that very plausibly are on the horizon in the near- to medium-term future.[ref 269] Indeed, one hope could be that, if a TFAI framework becomes recognized as a beneficial commitment mechanism, this could help spur more focused research efforts to overcome the remaining challenges in legal performance, bias, sophistry, or obfuscation, and in effective oversight, in order to differentially accelerate cooperative and stabilizing applications of AI technologies.[ref 270]
C. AI-guiding treaties: Legal components and clarifications
On the legal side, a TFAI framework would require two or more states[ref 271] to (1) conclude an international agreement (the "AI-guiding treaty") that (2) specifies a set of mutually agreed constraints on the behaviour of their AI agents, and to (3) ensure that all (relevant) AI agents deployed by states parties will follow the treaty by design.
1. An AI-guiding treaty
In many cases, an AI-guiding treaty would not necessarily need to look very different from any other treaty. It would be a "treaty" as defined under Article 2(1)(a) of the Vienna Convention on the Law of Treaties (VCLT), being
“an international agreement concluded between States in written form and governed by international law, whether embodied in a single instrument or in two or more related instruments and whatever its particular designation;”[ref 272]
In their most basic form, AI-guiding treaties would be straightforward (digitally readable) documents[ref 273] containing the various traditional elements common to many treaties,[ref 274] including but not limited to:
- introductory elements such as a title, preamble, “object and purpose” clauses, and definitions;
- substantive provisions, such as those regarding the treaty’s scope of application, the obligations and rights of the parties, and distinct institutional arrangements;
- secondary rules, such as procedures for review, amendment, or the designation of authoritative interpreters;
- enforcement and compliance provisions, such as monitoring and verification provisions setting out procedures for inspections and enforcement, dispute settlement mechanisms, clauses establishing sanctions or consequences for a breach (e.g., suspension clauses, collective responses, or referrals to other bodies such as the UN Security Council); and implementation obligations regarding domestic legal or administrative measures to be taken by states parties;
- final clauses clarifying procedures for signing, ratifying, or accepting the treaty; accession clauses (to enable non-signatory states to join at a point subsequent to the treaty’s entry into force); conditions or thresholds for entry into force; allowances for reservations; depositary provisions (setting out the official keeper of the treaty instrument); rules around the authentic text and authoritative language versions; withdrawal or denunciation clauses, or fixed duration, termination, or renewal conditions; and
- annexes, protocols, appendices, or schedules, listing technical details or control lists; optional or additional protocols; non-legally binding unilateral statements; and/or statutes for newly established arbitral bodies.
However, AI-guiding treaties would not need to be fully isomorphic to traditional treaties. Indeed, there are a range of ways in which the unique affordances created by a TFAI set-up would allow innovations or variations on the classic treaty formula. For instance, in drafting the treaty text, states could leverage the ability of TFAI agents to rapidly process and query increasingly long (libraries of) documents[ref 275] in order to draft much more exhaustive and more detailed treaty texts than has been the historical norm.
Notably, more detailed treaty texts could (1) tailor obligations to particular local contexts, even to the point of specifying bespoke obligations as they apply to individual government installations, military bases, geographic locales,[ref 276] or segments of the global internet infrastructure (e.g., particular submarine cables or specific hyperscale data centres); (2) cover many more potential contingencies or ambiguities that could arise in the operation of TFAI systems; (3) red-team, and build in advance, legal responses to likely legal exploits that could be attempted; (4) hedge against future technological developments by building in pre-articulated, technology-specific conditional rules that would apply only under clearly prescribed future conditions;[ref 277] and (5) scope and clearly set out "asymmetric" treaties that impose different obligations upon (the TFAI agents deployed by) different states parties.[ref 278]
Indeed, innovations to the traditional treaty format could extend much further than mere length. For instance, states could craft treaties as much more modular documents, with frequent hyperlinked cross-references between obligations, annexes, and interpretative guidance. They could specify embedded interpretative rules that set out distinct interpretative principles for different sections or norms, with explicit hierarchies of norms and obligations established and clarified, or with provisions to ensure textual coherence not just within the treaty, but also with other norms of international law. Such treaties could clarify automated triggers for different thresholds and/or notification or escalation procedures, or they could include clear schedules for delegated interpretation.
Any of these design features could produce instruments that offer far more extensive and granular specificity over treaty obligations than has been the case in the past. Accordingly, well-designed AI-guiding treaties could reach far beyond traditional treaties in their scope, effectiveness, and resilience.
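By way of purely hypothetical illustration, a modular, machine-readable AI-guiding treaty of this kind might be structured for consumption by TFAI agents roughly as follows (the field names are invented and do not track any existing schema or standard):

```python
# Purely illustrative sketch of how a modular, machine-readable AI-guiding treaty
# might be structured. All field names are invented and do not track any existing
# treaty schema or standard.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InterpretiveGuidance:
    method: str  # e.g., "VCLT Arts 31-32" or a bespoke, party-agreed interpretative rule
    notes: str = ""


@dataclass
class ConditionalRule:
    trigger: str        # a clearly prescribed future condition
    obligation_id: str  # the obligation that becomes applicable if the trigger occurs


@dataclass
class Obligation:
    obligation_id: str
    text: str
    applies_to: List[str]  # party identifiers, allowing asymmetric obligations
    cross_references: List[str] = field(default_factory=list)  # links to related clauses
    guidance: Optional[InterpretiveGuidance] = None


@dataclass
class AIGuidingTreaty:
    title: str
    parties: List[str]
    obligations: List[Obligation]
    conditional_rules: List[ConditionalRule] = field(default_factory=list)

    def obligations_for(self, party: str) -> List[Obligation]:
        """Return only the obligations applicable to a given party."""
        return [o for o in self.obligations if party in o.applies_to]
```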
There are some caveats, however. For one, longer, more detailed treaty texts are not always politically achievable even if they would be more easily executable if adopted. After all, there may simply not be sufficient state interest in negotiating extremely long and detailed agreements—for instance, because there is time pressure during the negotiations; because some parties are diplomatically under-resourced; or because states can only agree on superficial ideas. There is no guarantee that AI-guiding treaties can resolve such longstanding sticking points to negotiation. Nor, indeed, is treaty length necessarily an unalloyed good. For instance, longer texts could potentially introduce more ambiguities, questions, or points for incoherence or (accidental or even strategically engineered) treaty conflict.
Finally, these considerations would look significantly different if the TFAI framework were applied not to novel and bespoke advanced AI agreements, but instead to obligations enshrined in existing (non-AI-specific) treaties. In such circumstances, the existing instruments of international law cannot (or should not) be adapted for the machines.
2. Open questions for AI-guiding treaties
Of course, AI-guiding treaties also raise many practical questions: when, where, and how should such treaties allow for treaty withdrawal, derogation, or reservations by one or more parties? Such reservations—or partial amendments that include only some of the states parties—might result in a fractured regime.[ref 279] However, this need not be a challenge for TFAI agents per se, so long as it remained clear to each state’s TFAI agents which version of the treaty (or which provisions within it) are applicable to their deploying state.
There are also further questions, however. For instance, should a TFAI framework accommodate the existence of multilingual AI-guiding treaties (i.e., treaties drawn up into the languages of all treaty parties, with all texts considered authentic)? If so, would this create significant interpretative challenges—since TFAI agents might adopt divergent meanings depending on which version of the text they apply in everyday practice—or would it result in greater interpretative stability (since all TFAI agents might be able to refer to different authentic texts in clarifying the meaning of terms)?[ref 280]
Moreover, how should “dualist” states—that is, those states which require international agreements to be implemented in municipal law for those treaties to have domestic effects[ref 281]—implement AI-guiding treaties? May they specify that their AI agents directly follow the treaty text, or should the treaty first be implemented into domestic statute, with the TFAIs aligned to the resulting legislative text? This may prove especially important for questions of how the AI-guiding treaty may validly be interpreted by TFAI agents[ref 282] given that it suggests that they may need to consider domestic principles of interpretation, which may differ from international principles of interpretation.[ref 283] For instance, in some cases domestic courts in the US have adopted different views of the relative role of different components in treaty interpretation than those strictly required under the VCLT.[ref 284] This could create the risk that different states’ TFAI agents apply different methodologies of treaty interpretation, reaching different conclusions. Of course, since (as will be discussed shortly) in the TFAI framework TFAI agents are not considered direct normative subjects to the AI-guiding treaty, we suggest that in many cases it might be appropriate for them to refer to the treaty text (e.g., by treating it as an international standard) even in dualist contexts.
With all this, we re-emphasize that AI-guiding treaties, as proposed here, are considered relatively pragmatic arrangements amongst two or more states, intended to facilitate practical, effective, and robust cooperation in important domains. This also creates scope for variation in the legal and technical implementation of such frameworks. For instance, such agreements would not even need to take the form of a formally binding treaty, as such. They could also take the form of a formal non-binding agreement, joint statement, or communique,[ref 285] specifying—within the text, in an annex, or through incorporation-by-reference to later executive agreements—the particular text and obligations that the TFAI agents should adhere to at a high level of compliance.
In this case, it might be open for debate whether the resulting 'soft' TFAI arrangements would or should be considered a novel manner of implementing existing international legal frameworks, or instead a third form of commitment mechanism: neither a hard-law treaty that is strictly binding upon its states, nor an entirely non-binding soft-law mechanism, but rather a non-binding mechanism that is nonetheless self-executing upon states' AI agents, which are to treat it as hard law in its application. Such a case would raise interesting questions over whether, or to what extent, the resulting TFAI agents would even need to defer to the Vienna Convention on the Law of Treaties in guiding their interpretation of terms, since soft-law instruments, political commitments, and non-binding Memoranda of Understanding technically fall outside its scope, even though the VCLT's rules are often appealed to in interpreting soft-law instruments, especially those that elucidate binding treaties. However, we leave these questions to future research.
3. AI-guiding treaties as infrastructural, not normative, constraints on TFAI agents
In addition, it is important to clarify key aspects about the relation between TFAI agents, AI-guiding treaties, and the states that respectively deploy and negotiate them.
First off, as in the domestic framework for law-following AI, the TFAI proposal does not depend on the assumption that TFAI agents will act "law-following" (or, in this case, treaty-following) for most of the reasons that (are held to) contribute to human compliance with legal codes.[ref 286] That is, TFAI agents, like LFAI agents, are not expected to be swayed by some deep moral respect for the law, nor by the deterrent function of sanctions threatened against the AI itself,[ref 287] nor by any form of norm socialization or reputational concerns, nor by self-interested concern over guaranteeing continued stable economic exchange with the human economy,[ref 288] nor by a desire to uphold some form of reciprocal social contract between AIs and humans.[ref 289]
Furthermore, while LFAI (and TFAI) agents that are aligned to the intent of their states would already tailor their actions in light of the costs that would be incurred by their principals (i.e., their deploying states) as a result of sanctions threatened in reaction to those agents' actions, this is also not the core mechanism on which law-alignment turns; after all, if these were the only factors compelling treaty compliance, such agents would functionally remain henchmen that were not in fact obeying the treaty.[ref 290]
Instead, we envision AI-guiding treaties much more moderately: as technical, ex ante, infrastructural constraints on TFAI agents' range of acceptable goals or actions. In doing so, we simply treat the treaty text as an appropriate, stable, and certified referent text through which states can establish jointly agreed-upon infrastructural constraints on which instructed goals their AI agents may accept and on the latitude of conduct they may adopt in pursuit of lawful goals. In a technical sense, TFAI therefore builds on demonstrated AI industry safety techniques which have sought to align the behaviour of AI systems with a particular "constitution"[ref 291] or "Model Spec".[ref 292]
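As a purely illustrative sketch of how a treaty text might be slotted in as such a referent, in the style of a constitution or model spec embedded in a system prompt, consider the following (the prompt wording is hypothetical and not drawn from any deployed system):

```python
# Minimal sketch, assuming a constitution/model-spec style set-up in which the treaty
# text is inserted as a fixed, certified referent in the system prompt. The prompt
# wording and structure are hypothetical and not drawn from any deployed system.

TREATY_PREAMBLE_INSTRUCTION = (
    "You are an agent acting on behalf of a state party to the treaty reproduced below. "
    "Execute your principal's instructions, except where a goal or action would breach "
    "the treaty; in that case, refuse and propose treaty-compliant alternatives."
)


def build_system_prompt(treaty_text: str, treaty_digest: str) -> str:
    """Assemble a constitution-style system prompt anchored to a certified treaty text."""
    return (
        f"{TREATY_PREAMBLE_INSTRUCTION}\n\n"
        f"--- AI-GUIDING TREATY (certified text; SHA-256: {treaty_digest}) ---\n"
        f"{treaty_text}\n"
        f"--- END OF TREATY TEXT ---"
    )
```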
This of course means that, under our account, TFAI agents are treaty-following only in a thin, functional sense: they are designed to refer to the text of the treaty in determining the legality of potential goals or lines of action—but not in the sense that they are considered normatively subject to duties imposed by the treaty. In this, the TFAI proposal is arguably more modest than even the domestic LFAI framework, since it does not even construct AI systems as duty-bearing “legal actors”[ref 293] and therefore does not involve significant shifts in the legal ontology of international law per se.[ref 294]
IV. Clarifying the Relationship between TFAI Agents and their States
Its relatively pragmatic orientation makes the TFAI proposal a legally moderate project. However, while the TFAI framework does not, on a technical level, require us to conceive of or treat AI agents as duty-bearing legal persons or legal actors, there remain some important questions with regard to TFAI agents’ exact legal status and treatment under international law, especially in terms of their relation to their deploying states. These questions matter not just in terms of the feasibility of slotting the TFAI commitment mechanism within the tapestry of existing international law, but also for the precise operation of TFAI agreements.
A. AI-guiding treaties can function even if TFAI agents’ legal status remains unclear
The most obvious implications of TFAI agents’ legal status are, naturally, legal. After all, (1) whether TFAI agents are considered to possess any form of (international or domestic) legal personhood, and (2) whether their actions are considered attributable to their deploying states, will shape the legal consequences if or when TFAI agents, in spite of their treaty-following constraints, act in violation of the treaty or of any other international obligation incumbent upon their deploying state.
1. The prevailing responsibility gap around AI agents
For instance, if highly autonomous TFAI agents are not afforded any legal personhood, but neither is their behaviour attributable to a particular state, this would facially result in a “responsibility gap” under international law.[ref 295] If such agents acted in violation of an AI-guiding treaty, this would not then be treated as a violation by their deploying state of its obligations under that treaty. This crystallizes the general problem that, under current attribution principles, it may be difficult to establish liability for the actions of AI agents—since the actions of public AI agents cannot currently be automatically attributed to states, because state responsibility rarely arises for unforeseeable harms and because private businesses have no international liability for the harm they cause.[ref 296]
2. Due diligence obligations around AI agents’ actions
Of course, that is not to say that states would by default face no legal consequences for actions taken by deployed AI agents (especially those acting from or through their territory). Even if the actions of AI agents cannot be attributed to states, or the agents in question are developed or deployed by non-state actors and lie beyond the control of the state, states still have obligations to exercise due diligence to protect the rights of other states from those actions.[ref 297] Thus, if AI agents deployed by private actors acted in ways that inflicted transboundary harm, or that violated human rights, international humanitarian law, or international environmental law (amongst others), then their actions could potentially give rise to violations of these due diligence obligations incumbent upon all states.[ref 298]
Some have argued that even these due diligence obligations could prove complicated to apply in some domains, for instance where the transboundary harm inflicted by an AI agent is primarily cyber-mediated:[ref 299] after all, in the ILC’s commentaries on the Draft Articles on Prevention of Transboundary Harm from Hazardous Activities, transboundary harm is predominantly defined as harm “through […] physical consequences”.[ref 300] However, as noted by Talita Dias, “a majority of states that have spoken out on this matter agree that due diligence obligations apply whether the harm occurred offline or online.”[ref 301]
Nonetheless, this situation would still mean that TFAI agents’ actions that violated an AI-guiding treaty would only be considered as legal violations if they also resulted in due diligence violations—potentially constraining the set of other scenarios or contingencies that states could effectively contract around through AI-guiding treaties.
3. Continued technical and political feasibility of TFAI framework amidst legal uncertainty
Of course, it could be argued that these legal questions are orthogonal, or at least marginal, to either the political or the technical feasibility of the TFAI framework itself. Even if states would not face legal consequences for their TFAI agents violating the terms of the underlying treaty, this need not cripple such treaties’ ability to serve either as a generally effective technical alignment anchor or as a politically valuable commitment mechanism. At a technical level, TFAI agents would not necessarily be influenced by the actual ex post legal consequences (e.g., liability) resulting from their noncompliance with the treaty, since they simply treat the AI-guiding treaty text as an infrastructural constraint to be obeyed ex ante. After all, even if their general reasoning processes (in trying to act on behalf of their principal) should take into consideration the consequences of different actions for their states, the core question at the heart of their legal reasoning should be “is this course of action lawful?”, not “if this course of action is found to be unlawful, what will be the (legal or political) consequences for my deploying state?”
Simultaneously, AI-guiding treaties could continue to operate at the political level. Even if they escaped direct state responsibility or liability under international law, states would still face political consequences for deploying TFAI agents that, by design or accident, had violated the treaty. Such consequences could range from reciprocal noncompliance to collapse of the treaty regime, along with domestic political fallout if the public at home loses faith in its government as a result of treaty-violating actions taken by AI agents that had been trumpeted as treaty-following. The prospect of such political costs might ensure that states took seriously their commitment to correcting instances of TFAI noncompliance, at least insofar as those instances could be easily ‘attributed’ to them.
Of course, such attribution may face significant challenges since, depending on the substantive content of an AI-guiding treaty, it may be more or less obvious to other state parties when a TFAI agent has violated it. For instance, if the treaty stipulates a regular transfer of certain resources, technologies, or benefits by its agents, then the recipient states would presumably notice failures in short order. Conversely, if the treaty bars TFAI agents from engaging in cyberattacks against certain infrastructure, the mere observation that those targets are experiencing a cyberattack might not be sufficient to prove the involvement of AI agents, let alone of a particular state’s TFAI agents acting in violation of a treaty. This is analogous to the technical difficulties already encountered today in attributing the impacts of particular cyber operations.[ref 302] Finally, if the AI-guiding treaty dictates limits to TFAI activity within certain internal state networks (e.g., “no use in automating AI research”), then it might well take much longer to impose the political costs for treaty noncompliance.
B. Clarifying TFAI agents’ legal status or attributability closes political and technical loopholes
The above discussion suggests that AI-guiding treaties could remain a broadly functional political tool for the technical self-implementation of certain AI-related international agreements, even if these questions of the agents’ status were not settled and the responsibility gap were not closed. Nonetheless, if such questions are more appropriately clarified, this could provide benefits to the TFAI framework that are not just legal but also political and technical.
Legally, it would ensure that the growing use of AI agents would not come at the cost of failing to enforce adequate state responsibility for internationally wrongful acts, thereby preserving the integrity and functioning of the international legal system in conditions where an increasing fraction of actions with transboundary impacts, or with legal effects under international regimes, are carried out not by humans but by AI agents. More speculatively, an added benefit of clarifying AI agents’ status and attributability is that it might enable the actions of such systems to constitute, or contribute to, evidence of state practice, which could have self-stabilizing effects on AI-guiding treaty interpretation.[ref 303]
Politically, while (as just noted) the lack of clear state responsibility for TFAI agents’ actions would not diminish the various other political costs that contracting states could impose upon one another, which would still give states incentives to pursue effective treaty alignment of their agents, there may remain concerns that such violations, and states’ responses to them, would erode the long-term stability of AI-guiding treaties (and of treaties in general).
Finally, legally establishing state responsibility for TFAI agents may also have important consequences in technical terms: an unclear legal status for TFAI agents, and an inability to attribute their conduct to their deploying states, might pose a functional problem for the effective treaty alignment of these systems, because it potentially leaves open legal loopholes in the treaty. At first glance, one would expect TFAI agents to straightforwardly interpret any AI-guiding treaty references to “TFAI agents” as applying to themselves, and so to seek to abide by the prescribed or circumscribed behaviours.
With sophisticated legal reasoners, however, there is a risk that their lack of clear status under international law could lead them to exploit such legal loopholes (whether autonomously or under instruction from their deploying state) and conclude that their actions are not, in fact, bound by the treaty.[ref 304] By analogy, just as private citizens or corporations could reason that they have no direct obligations under interstate treaties such as the Nuclear Non-Proliferation Treaty (NPT) or the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES), and only have duties under any resulting implementing domestic regulation established as part of those treaties, sophisticated (or strategically prompted) TFAI agents could, hypothetically, argue that (1) since they are not legal subjects under international law and cannot serve as signatories to the treaty in their own right, and (2) since they are not considered state agents acting on behalf of (and under the same obligations as) signatory states, and (3) since they have only been aligned to the treaty text, not (potentially) to any domestic implementing legislation, therefore they are not legally bound by the treaty under international law.
Would TFAI agents attempt such legal gymnastics? In one sense, present-day applications of “constitutional AI”[ref 305] have included principles inspired by various documents—from the UN Universal Declaration of Human Rights to Apple’s Terms of Service[ref 306]—as part of an agent’s specification, thereby certifying those texts as sources of behavioural guidance for the AI system in question, regardless of their exact legal status. Nonetheless, one might well imagine that a sufficiently sophisticated TFAI agent, acting loyally to its state principal, would have reason to search for and exploit any legal loopholes. Such an outcome would undercut the basic technical functioning of the TFAI framework.
One way to patch this loophole would be for the contracting states to expressly include a provision, in the AI-guiding treaty, that their use of particular AI agents—whether registered model families or particular registered instances—is explicitly included and covered in the terms of the treaty, thus strongly reducing the wiggle room for TFAI agents’ interpretation. A more comprehensive legal response, however, would aim to clarify debates on liability and attribution for TFAI agents’ wrongful acts (whether those in violation of the AI-guiding treaty or under international law generally). As such, it is constructive to briefly consider different potential legal resolutions of this loophole.
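Before turning to those broader legal resolutions, the first, textual patch can be illustrated schematically. The sketch below shows one hypothetical shape that a treaty-annex registry of covered agents might take, so that a deployed agent (or a reviewing party) can mechanically confirm that it falls within the treaty’s scope; the annex name, identifiers, and fields are all invented for illustration.

```python
# Hypothetical sketch of a treaty-annex registry of covered agents; the annex
# label and identifiers are illustrative assumptions, not an existing format.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CoveredAgentEntry:
    state_party: str            # deploying state party
    model_family: str           # registered model family identifier
    instance_id: Optional[str]  # a specific registered instance, if any

ANNEX_B_REGISTRY = [
    CoveredAgentEntry("State A", "model-family-alpha", None),
    CoveredAgentEntry("State B", "model-family-beta", "instance-0042"),
]

def is_covered(state: str, family: str, instance: str) -> bool:
    """True if this deployment is expressly brought within the treaty's scope,
    leaving no interpretive wiggle room over whether the agent is 'covered'."""
    return any(
        entry.state_party == state
        and entry.model_family == family
        and (entry.instance_id is None or entry.instance_id == instance)
        for entry in ANNEX_B_REGISTRY
    )
```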
We can accordingly compare various potential constructions of the legal status of TFAI agents, depending on whether they become governed under (1) some future new lex specialis liability regime applicable to AI agents (or to state “objects”); or whether they become treated (2) as entities possessing independent international legal personality; or whether, under the existing law of state responsibility as codified under ARSIWA, they become (3) entities possessing domestic legal personality as state organs or authorized entities, with their conduct ascribed to the deploying state; or they are considered as (4) tools without independent legal standing that are nonetheless functionally treated as part of state conduct under ARSIWA.
These four approaches leverage, to different degrees and in different combinations, the (otherwise distinct) tools of the law of state responsibility and of legal personhood. They therefore offer distinct parallel solutions to the problems (the responsibility gap, or the potential interpretive loopholes) that might emerge if TFAI agents are deployed without either of these legal questions being resolved. Importantly, while all four avenues would constitute lex ferenda to some degree, they involve greater or lesser degrees of such legal innovation. Let us therefore briefly review the implications and merits of these options, to consider which would be functional, and which optimal, for the TFAI framework.
1. Developing a new lex specialis regime of state liability for their (AI) objects would be a slow process
As noted above, states would still face legal consequences for some actions by their deployed TFAI agents where those actions resulted in violations of key norms (e.g. human rights, IHL, environmental law; no-harm principle, etc.) under international law.
However, states could set down clearer and more specific rules for AI agents, including through a new multilateral treaty regime. For instance, in some domains, such as outer space law, states have negotiated self-contained strict liability regimes (e.g., for space objects).[ref 307] They could do so again for AI agents, creating a new regime that would regulate state responsibility or liability for the specific norm violations, wrongful acts, and harms produced by particular classes of AI systems,[ref 308] by AI agents as a whole—or even, generally, by any and all of a state’s inanimate objects.[ref 309] Indeed, as Pacholska has noted, one might consider whether such a regime on state responsibility for the wrongdoings of its inanimate objects could even be
“…modelled on either the Latin concept of qui facit per alium facit per se or strict liability for damage caused by animals that is present in many domestic jurisdictions [and] could be conceptualised as a general principle of law within the meaning of Article 38(c) of the ICJ Statute.”[ref 310]
If such a regime emerged, it would (as a lex specialis regime) supersede the general ARSIWA regime on state responsibility (as discussed below), as these are residual in nature.[ref 311]
In practice, however, while this could and should remain open as a future option, in the near term it would likely be too slow and protracted a process to provide general legal clarity and guidance on TFAI agents across all treaties. To be sure, states could attempt to include such provisions in particular advanced AI agreements (whether or not those were configured as AI-guiding treaties). But if attempting to do so would slow or hold back such treaties, it might be preferable to work this out in parallel, or to adopt another patch.
2. Extending international legal personality to TFAI agents is unlikely and counterproductive
Another option that has sometimes received attention could be to extend some form of international legal personality to TFAI agents. For instance, some have suggested that a “highly interdependent cyber system” should be recognized with the creation of an “international entity”.[ref 312]
Indeed, the extension or attribution of forms of international legal personhood to new entities would not be entirely unprecedented. While international law has been conducted primarily amongst states, it has historically developed in ways that have extended various sets of (limited and specific) rights and duties to a range of non-state actors, along with (in some cases) various forms of personhood. For instance, international organizations have been granted rights to enter into treaties and enjoy some immunities, as well as duties to act within their legal competence.[ref 313] Human individuals obviously possess a wide range of rights under human rights law, as well as under investment protection, which they can vindicate by international action,[ref 314] and they also possess duties under international criminal law.[ref 315] Non-self-governing peoples have some legal personality under the principle of self-determination.[ref 316]
There are also more anomalous cases: non-state armed groups remain subject to a range of duties under international humanitarian law,[ref 317] but do not necessarily have international legal personhood unless they are also recognized as belligerents, in which case they may enter into legal relations and conclude agreements on the international plane with states and other belligerents or insurgents.[ref 318] By contrast, corporations do not have duties under international law, although they may occasionally have rights under bilateral investment treaties to bring claims against states; nonetheless, in principle, they are considered to lack international legal personality.[ref 319] Meanwhile, in a 1929 treaty,[ref 320] Italy recognized the Holy See as having exclusive sovereignty and jurisdiction over the City of the Vatican, and the Holy See has since been widely recognized as a legal person with treaty-making capacity, even though it does not meet all the strict criteria of a state.[ref 321]
There is therefore nothing that categorically rules out the future recognition—whether through new treaty agreement, amendments to existing treaties, widespread state practice and opinio juris creating new custom, or the jurisprudence of international courts—of some measure of legal personhood (and/or some package of duties or rights, or both) for AI agents, creating truly (normatively) treaty-following agents.
However, extending some form of international legal personhood to TFAI agents might prove more doctrinally difficult than such extensions were for any of these other entities, which are either constructs created through the delegated authority of states (e.g. international organizations) or state-like in important respects (e.g. belligerent non-state armed groups; the Holy See), and which in all cases ultimately bottom out in human actors. Indeed, if even corporations have been denied international legal personality, the case for extending it to TFAI agents becomes harder still to make.
Moreover, a solution based in international legal personhood might even have drawbacks from the perspective of functional AI-guiding treaties. For one, the prospects for an attribution of international legal personhood appear speculative and politically and doctrinally slim, at least under existing instruments in international law. It is unlikely, for instance, that legal personality would be extended to AI systems under existing human rights conventions, if only because instruments such as the European Convention on Human Rights bar non-natural persons—such as companies and, likely, AI systems—from even qualifying as applicants.[ref 322]
Moreover, not only is such a far-reaching legal development unnecessary for an operational TFAI framework, it could even prove counterproductive. After all, the extension of even limited international legal personality to TFAI agents would set them apart from their deploying state and blur the appropriate lines of state responsibility. As the ILC noted in its Commentaries on the Articles on the Responsibility of States for Internationally Wrongful Acts, “[F]ederal States vary widely in their structure and distribution of powers, and […] in most cases the constituent units have no separate international legal personality […] nor any treaty-making power.”[ref 323] Insofar as AI-guiding treaties are meant as commitment mechanisms amongst states, the (partial) legal decoupling of TFAI agents from their deploying state would simply defeat the point of such treaties. It would create yet another international entity, which might complicate the processes of negotiating or establishing AI-guiding treaties (e.g., should TFAI agents be considered contracting parties?) and weaken the political incentives for establishing and maintaining them.[ref 324]
3. Extending domestic legal personality to TFAI agents, and state attributability under ARSIWA
Another option would be for state parties to an AI-guiding treaty to grant their agents some form of domestic legal personality and to treat them as state organs or empowered entities under the international law on state responsibility as codified in the International Law Commission’s (ILC) 2001 Articles on the Responsibility of States for Internationally Wrongful Acts (ARSIWA).
a) Domestic legal personhood for TFAI agents
Prima facie, the idea of constructing such a role for TFAI agents could be compatible with the proposals for domestic law-following AI, which envision (1) many government-deployed AI agents being used in a law-following manner[ref 325] and (2) treating them as duty-bearing legal actors (without rights).[ref 326]
In fact, while current law universally treats AI systems as objects, the idea of extending some form of personhood to AI—whether fictional (e.g. corporate-type) or even non-fictional (e.g. natural person-type)—has been floated in a range of contexts, by both legal scholars[ref 327] and some policymakers.[ref 328] Personhood for such systems is often presented as a pragmatic solution to situations where AI systems have become so autonomous that one cannot, or should not, impose responsibility for their actions on their developers,[ref 329] although others have argued that this solution would create significant new problems.[ref 330]
Be that as it may, would this be possible in doctrinal terms? To be sure, states have already granted a degree of domestic personhood to various non-human entities—such as animals, ships, temples, or idols,[ref 331] amongst others. Given this, there appears to be little that would prevent them from also granting AI agents a degree of legal personality.[ref 332] There would be different ways to structure this, ranging from “dependent personality” constructions whereby (as with corporations) human actors would be needed to enforce any rights or obligations held by the entity,[ref 333] to constructions affording the entity a higher degree of autonomy.
Most importantly, even where these systems acted with high autonomy and were rendered legally distinct from the human agents of the state through the attribution of such personhood, government-deployed AI agents could still, it has been argued, be sufficiently closely linked to the state that it would be straightforward to attribute their actions to that state under the international law on state responsibility.
b) ARSIWA and the law on state responsibility
As noted, the regime of state responsibility has been authoritatively codified in the ILC’s Articles on the Responsibility of States for Internationally Wrongful Acts (ARSIWA).[ref 334] Additionally, the Tallinn Manual 2.0 on the International Law Applicable to Cyber Operations (Tallinn Manual) has detailed how these rules are considered to apply to state activities in cyberspace.[ref 335] While neither document is legally binding, the ARSIWA articles are widely recognized by both states and international courts and tribunals as an authoritative statement of customary international law,[ref 336] and the Tallinn Manual rules largely align with the ILC articles.[ref 337]
Rather than focus on the primary norms that relate to the substantive obligations upon states, the ARSIWA articles clarify the secondary rules regulating “the general conditions under international law for the state to be considered responsible for wrongful actions or omissions”[ref 338] relating to these primary obligations, on the premise that “[e]very internationally wrongful act of a State entails the international responsibility of that State.”[ref 339] These rules on attribution therefore provide the processes through which the conduct of natural persons or entities becomes an “act of state”, for which the state is responsible.
Critically, unlike many domestic liability regimes, the international responsibility of states under ARSIWA is not premised on causation[ref 340] but simply on rules of attribution. It is also a fault-agnostic regime and, as noted by Pacholska, an “objective regime”, under which—in contrast to, for instance, international criminal law—the mental state of the acting agents and the intention of the state are in principle irrelevant.[ref 341]
ARSIWA sets out various grounds on which the conduct of certain actors or entities may be attributed to the state.[ref 342] Amongst others, these include situations where the conduct is by a state organ (ARSIWA Art 4)[ref 343] or by a private entity empowered to exercise elements of governmental authority (Art 5).[ref 344] Significantly, Art 7 clarifies that
“an organ of a State or of a person or entity empowered to exercise elements of the governmental authority shall be considered an act of the State under international law if the organ, person or entity acts in that capacity, even if it exceeds its authority or contravenes instructions.”[ref 345]
In addition, while under normal circumstances states are not responsible for the conduct of private persons or entities, such actors’ behaviour is nonetheless attributable to them “if the person or group of persons is in fact acting on the instructions of, or under the direction or control of, that State in carrying out the conduct” (Art 8).[ref 346] Additionally, under Art 11, conduct can be attributed to a State if that State acknowledges and adopts the conduct as its own.[ref 347]
c) ARSIWA Arts 4 & 5: TFAI agents as de jure state organs or empowered entities
How, if at all, might these norms apply to TFAI agents? As discussed, while international law has developed a number of specific regimes for regulating state responsibility or liability for harms or internationally wrongful acts resulting from space objects, or from transboundary harms arising out of hazardous activities,[ref 348] there is currently no overarching international legal framework for attributing state responsibility (or liability) for a state’s inanimate objects per se.[ref 349]
Nonetheless, it is likely that TFAI agents could instead be accommodated under the existing law on state responsibility, as laid down in ARSIWA. If TFAI agents are granted domestic legal personality by their deploying states, and are either formally designated as state organs (Art 4)[ref 350] or treated as entities empowered to exercise elements of governmental authority (Art 5),[ref 351] then under ARSIWA their conduct would be attributable to the state, even in cases where (as with intent-alignment failures) they contravened their explicit instructions, so long as they acted with apparent state authority.
d) Non-human entities as state agents under ARSIWA
A crucial question for this approach is whether ARSIWA is even applicable to AI agents.
One natural objection is that the law of state responsibility in general, and ARSIWA specifically, are historically premised on the conduct of the human individuals that make up the “organs”, “entities”, or “groups of persons” involved.[ref 352]
Some have suggested, however, that the text of ARSIWA could offer a remarkable amount of latitude to accommodate highly autonomous AI systems and to treat them either as state organs (Art 4) or as actors empowered to exercise governmental authority (Art 5).[ref 353] For instance, Haataja has argued that “[c]onceptually, it is not difficult to view [autonomous software entities] as entities for the purpose of state responsibility analysis [since] Articles 4 and 5 of the ILC Articles make explicit reference to ‘entities’ and, while Article 8 only refers directly to ‘persons and groups’, its commentary also makes reference to ‘persons or entities’.”[ref 354] Indeed, the ILC’s Commentaries clarify that, for the purposes of Art 4, a state’s “organs” includes “any person or entity which has that status in accordance with the internal law of the State.”[ref 355]
Similarly, in her discussion of state responsibility for fully autonomous weapons systems (FAWS), Pacholska has argued that such systems (when deployed by state militaries) could straightforwardly be construed as “state agents”, a category which, while absent from the ARSIWA text itself, occurs frequently in the ILC Commentaries to ARSIWA, usually in the phrase “organs or agents”.[ref 356] She furthermore notes that the term “agent” predates those instruments, having been used frequently in arbitral awards of the early twentieth century, many of which emphasized that “a universally recognized principle of international law states that the State is responsible for the violations of the law of nations committed by its agents”.[ref 357] Indeed, the term “agent” was revived by the ICJ in its Reparations for Injuries case,[ref 358] where it confirmed the responsibility of the United Nations for the conduct of its organs or agents, and clarified that, in doing so, the Court
“understands the word ‘agent’ in the most liberal sense, that is to say, any person who, whether a paid official or not, and whether permanently employed or not, has been charged by an organ of the organization with carrying out, or helping to carry out, one of its functions—in short, any person through whom it acts.”[ref 359]
Of course, in these instruments the concepts of “entities” or “agents” were, again, invoked with human agents in mind. Nonetheless, Pacholska argues that there is nothing in either the phrasing or the content of this definition of “agent” that rules out its application to non-human persons, or even to objects or artefacts (whether or not guided by AI).[ref 360] There is, however, a challenge to such attempts, in that the ILC, in its Commentary on Art 2 ARSIWA, has fairly clearly construed “acts of the state” as requiring some measure of human involvement, since
“for particular conduct to be characterized as an internationally wrongful act, it must first be attributable to the State. The State is a real organized entity, a legal person with full authority to act under international law. But to recognize this is not to deny the elementary fact that the State cannot act of itself. An ‘act of the State’ must involve some action or omission by a human being or group.”[ref 361]
Some have argued that this means that any construction of AI agents as state agents cannot be supported under current law,[ref 362] and remains entirely de lege ferenda.[ref 363] On the other hand, the precise formulation used here—that an act of the State ‘must involve some action or omission by a human being or group’ (emphasis added)—is remarkably loose. It does not, after all, stipulate that an act of the State must be solely or entirely composed of actions or omissions by human beings. It thereby arguably leaves the door open to construing AI agents as state agents, so long as there is at least ‘some action or omission’ taken by a human being in the chain: a potentially accommodating threshold, since the deployment, prompting, configuration, and operation of most AI agents are likely to involve at least some measure of human involvement.
As such, this interpretation of AI agents is not without legal grounding, and it may yet become enshrined in state agreement or adopted through state practice. Indeed, this reading may be consonant with already-emerging state practice on state responsibility for issues such as lethal autonomous weapons systems, with the 2022 report of the Group of Governmental Experts on Emerging Technologies in the Area of Lethal Autonomous Weapons Systems (LAWS) emphasizing that “every internationally wrongful act of a state, including those potentially involving weapons systems based on emerging technologies in the area of LAWS entails international responsibility of that state.”[ref 364]
Finally, one promising avenue would be to ensure that TFAI agents’ status as state organs or empowered entities is clearly articulated and affirmed by the contracting states within the treaty’s text, in order to fully close the loop on state attributability and ensure politically and technically stable interpretation.
4. TFAI agents without personhood as entities whose conduct is attributable to the state under ARSIWA
Furthermore, it is possible to arrive at an even more doctrinally modest variation of this approach to establishing state responsibility for TFAI agents, one where domestic legal personality for TFAI agents is not even required for their actions to become attributable to their principals. Indeed, this approach—whereby AI agents that have been delegated authority to act with legal significance are treated as legal agents, with their outputs attributed to principals—has been favoured in recent proposals for how to govern these systems under domestic law.[ref 365]
Such a construction might, of course, involve practical costs or tradeoffs relative to more ambitious constructions: as Haataja notes, the process of granting autonomous AI agents a degree of domestic legal personhood would likely involve certain procedural steps, such as registration,[ref 366] which would in turn ease the attribution of wrongful acts to particular AI agents, and of those agents’ conduct to their states.[ref 367] Indeed, there may be various analogous constructs in domestic law which bear enforceable duties while lacking full personhood.[ref 368]
Nonetheless, a version of the TFAI framework that would not require treaty parties to engage in novel (and potentially politically contested) innovations in their domestic law by granting AI agents even partial personhood would likely have lower thresholds to accession and implementation. Fortunately, state attributability functions straightforwardly even if these systems are not legally distinct from the human agents of the state. As Haataja notes, “the ILC Articles use the term ‘entity’ in a more general sense, meaning that the entity in question (be it an individual or group) does not need to have any distinct legal status under a state’s domestic law.”[ref 369] The important factor under ARSIWA is not the exact type or extent of (domestic) legal personality of the entity or agent, but rather its relationship with the state and the types of functions it performs.[ref 370] There are several avenues, then, by which TFAI agents without any legal personhood could nonetheless be considered as entities governed under ARSIWA, whose conduct is attributable to their deploying state.
a) TFAI agents as “completely dependent” de facto organs of their deploying states
For one, even if an entity or agent does not have the de jure status of a state organ under a state’s domestic law, it may be equated to a de facto state organ under international law wherever it acts in “complete dependence” on the state for which it constitutes an instrument.[ref 371] As the ICJ established in Nicaragua, evaluations of “complete dependence” turn on a range of factors,[ref 372] but include cases where a state created the non-state entity and provides deep resource assistance and control. Critically, even where the base models underpinning TFAI agents had not been pre-trained (i.e., created) by a state, the (inference computing) resources that a state would need to continuously and actively dedicate to an AI agent, as a basic condition of that agent’s very persistence and operation, would likely suffice to meet that bar. Moreover, by the very act of prompting TFAI agents with high-level goals or directives, deploying states would be exercising a “great degree of control” over intent-aligned, loyal TFAI agents. In these ways, the agent would be “completely dependent” on the state, making its actions attributable to it.[ref 373]
b) ARSIWA Art 8: TFAI agents acting under the “effective control” or instructions of a state
Indeed, even if a TFAI agent were considered neither a de jure (as in the previous section) nor a de facto state organ under Art 4, nor an entity empowered to exercise elements of governmental authority under Art 5, it is still possible to ground attributability under ARSIWA. After all, ARSIWA Art 8, which concerns “conduct directed or controlled by a State”, would apply to state-deployed TFAI agents: even if states were not the ones developing the AI agents, in the sense of shaping their high-level behaviour through fine-tuning and post-training, they would still, as a matter of daily practice, be the actors providing prompts and instructions to the deployed TFAI agents. That means such agents could naturally be “found to be acting under the instructions, directions, or control of a state.”[ref 374]
As adopted by the ICJ in its Nicaragua and Bosnian Genocide judgments,[ref 375] and as affirmed in the Tallinn Manual, the standard of control considered in such cases is one of “effective control” by a state.[ref 376] According to the Tallinn Manual, for instance, a state is in effective control of the conduct of a non-state actor where it “determines the execution and course of the specific operation”, where it has “the ability to cause constituent activities of the operation to occur”, or where it can “order the cessation of those activities that are underway”.[ref 377] These conditions again naturally apply to TFAI agents which, even if they are granted a degree of latitude and autonomy in their operations, remain under the effective control of the state under these terms. After all, the state is intrinsically involved in providing the basic infrastructure (from an internet connection to various software toolkits) necessary for the “constituent activities” of any AI agent’s operation, and, as a matter of practice, will (or should) retain an ability to pause or cease an AI agent’s operation at a moment’s notice.
Of course, this argument must wrestle with one possible tension: how can we reconcile the claim that states are in ‘effective control’ of these AI agents with the preceding idea that some AI agents (if not aligned to the law) might operate in a ‘lawless’ manner, which is itself one rationale for the TFAI framework? While a full argument is beyond the scope of this paper, one might consider that the control relation between states and their AI agents is distinct from that between states and their human agents. Human agents are under the ‘effective control’ of their state if the state “determines the execution and course of the specific operation”, has “the ability to cause constituent activities of the operation to occur”, or can “order the cessation of those activities that are underway”. Yet while the state has many levers by which to coerce compliant behaviour from human non-state actors, the efficacy of those levers is grounded in the state’s ex post sanctions or consequences (e.g. a state-backed militia knows that if it disregards the state’s orders, it may lose key logistical or political support); the levers are not based on architectural kill-switches enabling direct intervention, and states do not, as a rule, force their agents to wear explosive collars. As a consequence, states exercise ‘effective control’ over human agents because they can deter rogue behaviour, not because they can easily halt it while it is underway. AI systems, conversely, can at least in principle be subjected to forms of ‘run-time’ infrastructural control (whether guardrails or kill-switches), and are completely and immediately dependent on continued access to the state’s computing infrastructure. It could thus be argued that the state-AI agent relationship manifests a form of effective control that is different from that at play between states and their human agents, but that both relations nonetheless constitute legally valid forms of effective control for the purposes of state attribution.
Once again, it would be possible for an AI-guiding treaty to strengthen these norms, by including and codifying explicit attribution principles, establishing, for instance, that the actions of any AI system deployed under the jurisdiction or control of a state party shall be deemed attributable to that state.
c) ARSIWA Art 11: TFAI agents’ conduct adopted by states in the AI-guiding treaty
Finally, ARSIWA Art 11 offers potentially the most straightforward avenue to attributing behaviour to states, as it notes that
“Conduct which is not attributable to a State under the preceding articles shall nevertheless be considered an act of that State under international law if and to the extent that the State acknowledges and adopts the conduct in question as its own.”[ref 378]
However, some open questions remain around this avenue, such as over whether it would be legally feasible (or politically acceptable) for the contracting states to acknowledge and adopt TFAI agents’ conduct prospectively (i.e., through a unilateral declaration or by explicit provision in an AI-guiding treaty), or if they could—or would—only do so retrospectively, in relation to a particular instance of TFAI agent behaviour.
In fact, even if such behaviour were not formally adopted, it is possible that a state’s other actions with regard to its deployed AI agents (e.g. public approval of their actions, or the continued provision of inference computing resources enabling such activities to continue) might signal sufficient tacit endorsement to nonetheless retrospectively construct those systems as agents of the state.[ref 379]
5. Comparing approaches to establishing TFAI agent attributability
The above discussion shows that there are a wide range of avenues towards clarifying the relation between deploying states and their (TF)AI agents in ways that further strengthen the TFAI framework in both legal and technical terms.
Significantly, while there are various options for classifying and attributing the conduct of AI agents in ways that extend (particular forms of) legal personhood to them, we have also seen that, under the international law on state responsibility, the ability of TFAI agents to legally function as state agents is in fact largely orthogonal to such extensions. As such, of the above solutions, we suggest that an approach that does not treat TFAI agents as international or domestic legal persons, but merely as entities whose actions are attributable to their states under ARSIWA (because they act as completely dependent de facto organs of their states, act under their states’ “effective control”, or engage in conduct that is acknowledged and adopted by their state), likely strikes the most appropriate balance for the TFAI framework. These legal approaches would largely address the legal, political, and technical challenges to AI-guiding treaty stability and effectiveness, and they would do so in a way that remains most closely grounded in existing international law, since they require neither new lex specialis regimes nor innovative judicial or treaty developments granting international legal personhood to these systems. Simultaneously, they would avoid the responsibility gaps that might open up from attempting to extend or grant international legal personhood (especially forms involving new rights and not just obligations) to these models.
It is important to remember here that, in the first instance, the TFAI framework is meant as a legally modest innovation and a pragmatic mechanism for interstate commitment, one that will be sorely needed before long as AI continues to advance. Since AI-guiding treaties would still be concluded exclusively amongst states, states remain the sole direct subjects of the resulting obligations.
C. Applying the TFAI framework to AI agents deployed by non-state private actors
Finally, there are other outstanding questions that, beyond some brief reflections, we largely leave out of scope here.
For instance, to return to the earlier question of which AI agents should be subjected to a TFAI framework: a broad reading would imply all agents subject to a state’s domestic law. However, there is an open question over whether and how to apply the TFAI framework to models deployed by private-sector actors. After all, since non-state actors cannot conclude treaties under international law, they could not conclude AI-guiding treaties, formally understood.
Of course, states could draft an AI-guiding treaty so as to commit its signatories to introduce domestic regulation requiring that private actors only deploy models that are trained, fine-tuned, or aligned so that they abide by the treaty and/or by its implementing domestic law. Moreover, AI-guiding treaties could specify explicitly that state parties will be held responsible for any violations of the treaty by any AI agents operating from their territory, applying a threshold for state responsibility even stricter than that supported by the general law on state responsibility under ARSIWA, which would create strong incentives for states to apply rigorous treaty- and law-following AI frameworks. Alternatively, a treaty might require all parties to domestically deploy a separate set of TFAI agents to monitor and police the treaty compliance of other, non-state agents.
Moreover, AI companies themselves might also draw inspiration from the TFAI framework, as an avenue for jointly formulating model specification documents. For instance, there would be nothing to bar non-state private actors from engaging in partnerships that also bind their AI agents, industry-wide, to certain standards of behaviour or codes of conduct, in a set-up that may be at least technically isomorphic to the one used to create TFAI agents. Such an outcome would not constitute a form of law alignment as such, since coordinated AI-guiding industry standards would not be considered laws, either in a positive-law sense or in terms of democratic legitimacy.[ref 380] Nonetheless, in the absence of adequate coordinating national regulation or standards, such agreements could form a species of policy entrepreneurship by AI companies, establishing important stabilizing commitments or guarantees amongst themselves. These could specify treaty-like constraints on companies deploying AI technology in ways that would be overtly destabilizing—e.g., precluding their use in corporate espionage or sabotage, or assuring that these systems would take no part in informing lobbying efforts aimed at regulatory capture or at supporting power concentration by third parties.[ref 381] Indeed, as multinational private tech companies may rise in historical prominence relative to states,[ref 382] such agreements could well establish an important new foothold for a next iteration of intercorporate law, stably guiding the interactions of such actors relative to states—and to one another—on the global stage.
V. Legal Interpretation by Treaty-Following AI: Two Avenues Under International Law
A key question in establishing a functional TFAI framework is how a TFAI agent is to interpret a treaty in order to evaluate whether its actions would comply with that AI-guiding treaty’s terms. This raises many additional questions: Are there particular ways to craft treaties so as to be more accommodating to this? How much leeway would contracting states have in specifying or customizing the interpretative rules these systems apply? A full consideration of these questions is beyond the scope of this paper, but we offer some initial reflections on potential strategies and their assorted challenges.
Specifically, we can consider two avenues for implementation. In one, TFAI agents apply the default customary rules on treaty interpretation to relatively traditionally designed treaties; in the other, the content and design of the treaty regime are tailored—through bespoke treaty interpretation rules and arbitral bodies—in order to produce special regimes (lex specialis) that are more responsive and easily applicable by deployed TFAI systems. Below, we consider each of these approaches in turn, identifying benefits but also implementation challenges.
A. Traditional AI-guiding treaties interpreted through default VCLT rules
One avenue could be to have TFAI agents apply the default rules of treaty interpretation in international law. Public international law, under the prevailing positivist view, is based on state consent. Accordingly, treaties are considered the “embodiments of the common will of their parties”,[ref 383] and they must be interpreted in accordance with the common intention of those parties as reflected by the text of the treaty and the other means of interpretation available to the interpreter.[ref 384] Because a treaty’s text is held to represent the common intentions of the original authors of a treaty—and of those parties who agree later to adopt its obligations by acceding to the treaty—the primary aim of treaty interpretation is to clarify the meaning of the text in light of “certain defined and relevant factors.”[ref 385]
In particular, Articles 31-33 of the 1969 Vienna Convention on the Law of Treaties (VCLT) codify these customary international law rules on treaty interpretation.[ref 386] Since these are custom, they apply generally to all states, even to states that are non-parties to the VCLT. For instance, while 116 states are parties to the Vienna Convention, the United States is not (having signed but not ratified).[ref 387] Nonetheless, the US State Department has on various occasions stated that it considers the VCLT to constitute a codification of existing (customary international) law,[ref 388] and many domestic courts have also relied on the VCLT as authoritative in a growing number of cases.[ref 389]
This default VCLT approach sets out several means for interpretation. According to VCLT Article 31(1), as a general rule of interpretation
“a treaty shall be interpreted in good faith in accordance with the ordinary meaning to be given to the terms of the treaty in their context and in the light of its object and purpose.”[ref 390]
Would a TFAI agent be capable of applying the different elements of this interpretative approach? Critically, the VCLT only sets out the rules and principles of interpretation, and does not explicitly specify who or what may be a legitimate interpreter of a treaty. As such, it does not explicitly rule out AI systems as interpreters of treaties. To be sure, one could perhaps argue that it implicitly rules out such interpreters—for instance, by taking ‘in good faith’ to refer to a subjective state that is inaccessible to AI systems. However, that would only complicate their ability to apply these rules of interpretation, not their essential eligibility as interpreters.
That does not mean that AI systems, lacking their own international legal personality, could produce interpretations that would (in and of themselves) be authoritative for others (e.g., in adjudication) or which would (in and of themselves) be attributable to a state. However, TFAI agents would in principle be allowed to interpret treaties, with reference to VCLT rules, in order to conform their own behaviour to treaties regulating that behaviour. The resulting legal interpretations they would generate during inference-time legal reasoning would therefore functionally serve as an internal compliance mechanism, rather than as authoritative interpretations that would bind third parties.
1. TFAI agents and treaty interpretation under the VCLT
However, even if AI agents may validly apply the VCLT rules, could they also do so proficiently? In the domestic law context, some legal scholars have questioned whether AI systems’ outputs would reliably reflect the “ordinary meaning” of terms, because the susceptibility of LLMs to subtle changes in prompting leaves them open to gamed prompting strategies designed to reflect back preconceived notions,[ref 391] or even, more foundationally, because such models are produced by private actors with idiosyncratic values and distinct commercial interests.[ref 392] Importantly, however, in treaty interpretation the “ordinary meaning” of a term is not arrived at through its general public usage alone; rather, an appropriate interpretation must also take into account the various elements further specified in Arts. 31(2)-(4), along with the “supplementary means of interpretation” specified in Art. 32.[ref 393]
a) VCLT Art 31(1): The treaty’s “object and purpose” and the principle of effectiveness
To interpret a treaty in accordance with VCLT Art 31(1), a TFAI agent would need to be able to understand that treaty’s “object and purpose”. To achieve this, it should be aided by clear textual provisions that reflect the underlying goals and intent of the states parties in establishing the treaty. A shallow text that only sets out the agreed-upon specific constraints on AI behaviour, without clarifying the purpose for which those constraints are established, would risk “governance misspecification”, as AI agents could well find legal loopholes around such proxies.[ref 394] Conversely, a treaty which clearly and exhaustively sets out its aims (e.g., in a preamble, or in its articles) would provide much stronger guidance.[ref 395]
Importantly, since an overarching goal of treaty interpretation is to produce an outcome that advances the aims of the treaty, a clear representation of the treaty’s object and purpose would also allow a TFAI agent to apply the “principle of effectiveness,”[ref 396] which holds that, when a treaty is open to two interpretations, where one enables it to have appropriate effects and the other does not, “good faith and the objects and purposes of the treaty demand that the former interpretation should be adopted.”[ref 397]
b) VCLT Art 31(2): “Context”
Furthermore, following VCLT Art 31(2), in interpreting the meaning of a term or provision in a treaty, TFAI agents would need to consider its context; this refers not only to the rest of the treaty text (including its preamble and annexes), but also to any other agreements relating to the treaty, or to “any instrument which was made by one or more parties in connection with the conclusion of the treaty and accepted by the other parties as an instrument related to the treaty.”[ref 398] The latter criterion implies that TFAI agents would either need frequent retraining and updates, or the ability to easily identify and access databases of such related agreements during inference, to ensure they are aware of the most up-to-date instruments in force between the parties. The retrieval-based approach has been studied as a promising way to reliably ground AI systems’ legal reasoning.[ref 399]
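The retrieval-based option can be sketched as follows; the `index` object, its `search` method, and the `interpreter` callable are placeholders for whatever document store and legal-reasoning model the parties might maintain, and nothing here reflects an existing system.

```python
# Illustrative sketch (placeholder names throughout): grounding an Art 31(2)
# "context" analysis by retrieving related instruments at inference time,
# rather than relying on the agent's training data alone.

def retrieve_context_instruments(treaty_id: str, query: str, index, top_k: int = 5):
    """Fetch instruments related to the treaty from a party-maintained store."""
    results = index.search(query=query, filters={"treaty_id": treaty_id}, limit=top_k)
    # Keep only instruments accepted by the parties as related to the treaty,
    # per metadata maintained by the parties (the Art 31(2)(b) condition).
    return [doc for doc in results if doc.metadata.get("accepted_by_parties")]

def interpret_term(term: str, treaty_id: str, index, interpreter) -> str:
    """Ground the interpretation of `term` in the retrieved context instruments."""
    context_docs = retrieve_context_instruments(treaty_id, term, index)
    prompt = (
        f"Interpret the term '{term}' in good faith, in accordance with its "
        "ordinary meaning, in its context and in light of the treaty's object "
        "and purpose. Context instruments follow:\n"
        + "\n---\n".join(doc.text for doc in context_docs)
    )
    return interpreter(prompt)
```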
c) VCLT Art 31(3): Subsequent agreement, subsequent practice, and “any relevant rules”
Furthermore, VCLT Art 31(3)(a)-(b) directs the interpreter to take into account any subsequent agreement or practice between the parties regarding the interpretation of the treaty or the application of its provisions.[ref 400] This is significant, as it entails that TFAI agents would need a way of tracking the state parties’ practice in interpreting and applying the treaty, as reflected in state declarations such as unilateral Explanatory Memoranda or joint Working Party Resolutions passed by a relevant established treaty body or forum amongst the parties.[ref 401]
VCLT Art 31(3)(c) also directs the interpreter to take into consideration “any relevant rules of international law applicable in the relations between the parties.”[ref 402] This suggests that TFAI agents should be able to apply the method of “systemic integration”, and draw on other rules and norms in international law—whether treaties, custom, or general principles of law[ref 403]—to clarify the meaning of treaty terms or to fill in gaps in a treaty, so long as the referent norms are relevant to the question at hand and applicable between the parties.
d) VCLT Art 32: “supplementary means of interpretation”
Finally, in specific circumstances, a TFAI agent could also refer to historical evidence in interpreting the treaty: VCLT Art 32 holds that, if and where the interpretation of a treaty according to Art 31 “leaves the meaning ambiguous or obscure”, or “leads to a result which is manifestly absurd or unreasonable”,[ref 404] the interpreter may refer to “supplementary means of interpretation,” such as the travaux préparatoires (i.e., the preparatory work of the treaty, as reflected in the records of negotiations) or the circumstances of the treaty’s conclusion (e.g., whether the treaty was concluded in the wake of a major AI incident of a particular kind).
2. Potential challenges to traditional AI-guiding treaties
Nonetheless, while at a surface level TFAI agents would be capable of applying many of these methodologies,[ref 405] there may be a range of problems or challenges in grounding the interpretation of AI-guiding treaties in the default VCLT rules alone.
a) Challenges of legal “grounding”: travaux préparatoires and implementing agreements
For one, there may be distinct challenges that TFAI agents would encounter in adhering to the VCLT methodology for interpretation.
That is not to suggest that such treaty interpretation would be too complex for AI systems by dint of the sophistication of the legal reasoning required. Rather, the challenges might be empirical, in that certain interpretative steps would involve empirical fact-finding exercises (e.g., to ascertain evidence of state practice) which could prove difficult (or unworkably time- or resource-intensive) for AI agents whose native proficiency lies in computer-based tasks. Indeed, in some cases, TFAI agents would encounter significant challenges in attempting to access or even locate relevant materials, with key travaux préparatoires currently often scattered across conference records or even—as with virtually all international agreements sponsored by the Council of Europe—entirely inaccessible.[ref 406]
However, such grounding challenges need not be terminal to the TFAI framework. For one, many of these hurdles would be no larger for TFAI agents than they would be (or indeed already are) for human-conducted interpretation. More cynically, one could even note that not all human judges or scholars, when interpreting international law, consistently engage in such robust empirical analysis.[ref 407]
Moreover, for future AI-guiding treaties, such grounding challenges may well prove surmountable in practice. State parties could adopt a range of measures to ensure clear and authoritative digital trails for key interpretative materials, ranging from the treaty’s travaux préparatoires (including conference records such as procès verbaux or working drafts of the agreement[ref 408]) to subsequently concluded agreements, and from relevant new case law by international courts to evidence of states’ interpretation and application of the treaty.
b) Challenges of adversarial data poisoning attacks corrupting interpretative sources
A second potential challenge would be the risk of legal corruption in the form of adversarial data poisoning attacks. In a sense, this is the inverse risk: the problem is not a TFAI agent’s inability to access the required evidence of state practice, but rather that it draws too readily on a wide range of (seemingly) relevant sources of evidence that could all too easily be contaminated, spoofed, or corrupted by the states parties to the treaty (or by third parties).
This risk is illustrated by Deeks and Hollis’s concern that, if LLMs’ responses can be shaped by patterns present in their training data, then their judgments about the correct interpretation of international norms may “turn more on the volume of that data than its origins”.[ref 409] This would be an especially severe challenge for the interpretation of customary international law, but it would affect the interpretation of written (treaty) law as well. It not only creates a background risk that, by default, AI models may overweight more common sources of legal commentary (e.g., NGO reports, news articles) over rarer but far more authoritative ones (e.g., government statements);[ref 410] it also creates a potential attack surface for active sabotage—or subtle skewing—of the legal interpretations conducted not just by TFAI agents, but by all (LLM-based) AI systems developed through pre-training on internet corpora. After all, as Deeks and Hollis note:
“if it becomes clear that LLM outputs are influencing the direction of international law, state officials and others will have an incentive to push their desired views into training datasets to effectively corrupt LLM outputs. In other words, disinformation or misinformation about international law online at scale could contaminate LLM outputs, and […] common understandings of the law’s contents or contours.”[ref 411]
Indeed, the feasibility of states pushing their preferred legal interpretation (or outright falsification) of international events into the corpora used to pre-train frontier AI models is demonstrated by computer science work on the theoretical and empirical feasibility of data poisoning attacks, which have proven effective regardless of the size of the overall training dataset, and to which larger LLMs are significantly more susceptible.[ref 412] Another line of evidence is found in the tendency of AI chatbots to inadvertently reproduce the patterns which nations’ propaganda efforts, disinformation campaigns, and censorship laws have baked into the global AI data marketplace.[ref 413] Finally, the risk of such law-corrupting attacks is borne out by recent instances of deliberate data poisoning attacks aimed at LLMs. For instance, a set of 2025 studies found that the Pravda network, a collection of web pages and social media accounts, had begun to produce as many as 10,000 news articles a day aggregating pro-Russia propaganda, with the likely aim of infiltrating and skewing the responses of large language models, in a strategy dubbed “LLM grooming”.[ref 414] Subsequent tests found that this strategy had managed to skew leading AI chatbots into repeating false narratives at least 33% of the time.[ref 415] Indeed, in the coming years, as more legitimate sources of authentic digital data impose controls or limits on AI-training-focused content crawlers,[ref 416] the remaining internet data available for training AI systems may skew ever further towards malicious data seeded for the purposes of intentional grooming.
To date, such AI grooming strategies have been predominantly leveraged for social impacts (e.g., political misinformation or propaganda), not legal ones. However, if targeted towards legal influence, they could rapidly erode the reliability of the answers provided by AI chatbots to any users enquiring into international law. More concretely, if such campaigns aimed to falsify the digital evidence of state parties’ track record in applying and interpreting an AI-guiding treaty, this would disrupt TFAI agents’ ability to interpret that treaty on the basis of that track record (VCLT Art 31(3)).
Thus, unless well designed, the TFAI framework may be—or may appear—susceptible to interpretative manipulation: even if the certified AI-guiding treaty text could be kept inviolate in a designated and authenticated repository which TFAI agents could access or query, the same chain of custody may not be easily established for the ample decentralized digital evidence of subsequent state practice and opinio juris which is used specifically to inform treaty interpretation under VCLT Art 31(3)(b),[ref 417] as well as generally to inform the interpretation of customary international law under the ICJ Statute.[ref 418]
Such evidence could easily be contaminated, spoofed, or corrupted by some actors in order to manipulate TFAI agents’ legal interpretations[ref 419] in ways that skew both sides’ AI models’ behaviour to their advantage or in ways meant to erode the legitimacy or stability of the treaty regime. Indeed, in some cases, the perception of widespread state practice could even (be erroneously held to) contribute to the creation of new customary international law, which—since treaties and customary international law are coequal sources of international law,[ref 420] and under the lex posterior principle—might supersede the preceding (AI-guiding) treaty, rendering it obsolete.[ref 421]
Nonetheless, this corruption challenge also needs to be contextualized. For one, insofar as some state actors may seek to mount LLM-grooming attacks across many areas of international law, this phenomenon does not pose a unique objection to TFAI agents, but rather constitutes a more general problem for any interpreter of international law, whether human or machine. While in theory human interpreters might be better positioned than AI (at least at present) to judge the authenticity, reliability, and authority of particular documents as evidence of state practice, many may not exercise such scrutiny in practice, especially if or as they come to rely on other (consumer) AI chatbots.[ref 422] In fact, in the specific context of AI-guiding treaties, the attack surface may be proportionally smaller, since TFAI agents could be configured to defer only to specific authenticated records of state practice relating to the treaty itself. Alternatively, AI models could be configured to monitor and flag any efforts to corrupt the digital record of state practice, though this would likely be politically charged or contested.
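For illustration, the following minimal sketch (assuming a hypothetical registry of content digests published through the treaty’s authenticated channels) shows how a TFAI agent could be configured to admit a record as evidence of state practice only where its provenance can be verified, thereby shrinking the attack surface described above.

```python
# Minimal sketch (hypothetical registry): restricting the evidence of state
# practice a TFAI agent may rely on to records whose content digest matches
# one registered through the treaty's authenticated channel.
import hashlib

# Placeholder values for illustration only; in practice these would be
# published and signed by a treaty secretariat or depositary.
TRUSTED_DIGESTS = {
    "placeholder-digest-of-working-party-resolution": "Joint Working Party Resolution (illustrative)",
}

def admissible_as_state_practice(record_bytes: bytes) -> bool:
    """Admit a record as evidence of state practice only if its content digest
    matches one registered through the treaty's authenticated channel."""
    return hashlib.sha256(record_bytes).hexdigest() in TRUSTED_DIGESTS
```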
In another scenario, the treaty-compliant actions of state-deployed TFAI agents could even help anchor and shield the interpretation of international law against such attacks. If such actions were recognized as evidence of state practice, they would provide a very large (on the order of tens or hundreds of thousands of decisions per agent per year), exhaustively recorded, and verifiable record of state practice relating to the implementation of the AI-guiding treaty. In theory, then, the treaty-compliant legal interpretations and actions of each individual TFAI agent could help anchor the legal interpretation of all other TFAI agents, insulating them from corrupting dynamics. However, this may be more contentious: it would require that the legal interpretations produced by state-deployed TFAI agents be taken as authoritative for others, or be attributable to a state in ways that reflect not only its state practice (which, as discussed,[ref 423] depends on effective attribution of AI agents’ conduct to their deploying states) but also its opinio juris (which may be far more contested).
c) Challenges of interpretative ambiguity and TFAI agent impartiality
Furthermore, TFAI agents may encounter a range of related challenges in interpreting vague treaty terms or articles.
Indeed, scholars have noted that AI systems may struggle to perform the legal interpretation of statutes. Because it is impossible to write a “complete contingent contract”,[ref 424] and because legal principles written in natural language are often subject to ambiguity—both in how they are written and in how they are applied—human legal systems often use institutional safeguards to manage such ambiguity. However, these safeguards are more difficult to embed in AI systems barring a clear rule-refinement framework that can help minimize interpretative disagreement or reduce inconsistency in rule application.[ref 425] Without such a framework, there is a risk, as recognized by proponents of law-following AI, that
“in certain circumstances, at least, an LFAI’s appraisal of the relevant materials might lead it to radically unorthodox legal conclusions—and a ready disposition to act on such conclusions might significantly threaten the stability of the legal order. In other cases, an LFAI might conclude that it is dealing with a case in which the law is not only “hard” to discern but genuinely indeterminate.”[ref 426]
In particular, for TFAI agents deployed in the international legal context, there are additional challenges when a provision in an AI-guiding treaty may remain open to multiple possible interpretations. In principle, such situations are to be resolved with reference to the interpretative principle of effectiveness—which states that any interpretation should have effects broadly in line with good faith and with the object and purpose of the treaty.[ref 427]
However, in practice, there may be cases where there are several interpretations that are acceptable under this principle, but where some interpretations nonetheless remain much more favourable to a particular treaty party than others. In such cases, there may be a tension between the “best” interpretation of the law (as would be reached by a neutral judge), and a “defensible” yet partial interpretation (as would be pursued by a state’s legal counsel).
How should TFAI agents resolve such situations? On the one hand, we might want to ensure that they adopt the “best” or most impartial effective interpretation to ensure symmetrical and uncontested implementation of the treaty by all states parties’ TFAI agents as a means to ensure the stability of the regime. On the other hand, many lawyers working on behalf of particular clients or employers (in this case, State Departments or Foreign Ministries) may already today, implicitly or explicitly, pursue defensible interpretations of the applicable law that are favourable to their principal. Given this, it seems unlikely that states would want to deploy TFAI agents that did not, to some degree, consider their states’ interests in deciding amongst various legally defensible interpretations. One downside is that this might result in asymmetries in the interpretations reached by (and therefore the conduct of) TFAI agents acting on behalf of different state parties. Whether this is a practical problem may depend on the substance of the treaty, the degree of latitude which the interpreting TFAI agents actually have in altering their behaviour under the treaty, and the parties’ willingness to overlook relatively minor or inconsequential differences in implementation—that are nonetheless minimally compliant with the core norm in the treaty—as the price of doing business.
d) Challenges of interpretative systemic integration
Relatedly, TFAI agents may encounter distinct legal and operational challenges in interpreting a treaty under the broader context of international law. In some circumstances, this could result in an explosion in the number of norms to be taken into account when evaluating the legality of particular conduct under a treaty.
As noted before,[ref 428] VCLT Art 31(3)(c) requires a treaty interpreter to take into consideration “any relevant rules of international law applicable in the relations between the parties.”[ref 429] This suggests that TFAI agents should, where appropriate in clarifying the meaning of ambiguous treaty terms, or where the treaty leaves gaps in its guidance relative to certain situations,[ref 430] draw on other “relevant and applicable” rules and norms in international law in order to clarify these questions at hand.
Importantly, this interpretative principle of systemic integration has a long history, reaching back almost a century and thus well before the VCLT.[ref 431] Forms of it appeared in cases as early as Georges Pinson v Mexico (1928).[ref 432] In its judgment in Right of Passage (1957), the ICJ held that “…it is a rule of interpretation that a text emanating from a Government must, in principle, be interpreted as producing and intended to produce effects in accordance with existing law and not in violation of it.”[ref 433] Since its enshrinement in the VCLT, and especially since its case-dispositive use in the ICJ’s decision in Oil Platforms (2003),[ref 434] systemic integration has become increasingly prominent in international law.[ref 435] In recent years, it has been recognized and applied by a range of international courts and tribunals,[ref 436] notably in climate change cases such as Torres Strait[ref 437] and in the International Tribunal for the Law of the Sea (ITLOS)’s Advisory Opinion on Climate Change.[ref 438]
This poses a potential challenge to the smooth functioning of a TFAI framework, however. Under VCLT Art 31(3)(c), systemic integration—the consideration and application of relevant and applicable law—could require these systems to cast a very wide net, ranging across a huge body of treaty and case law, when interpreting the specific provisions of an AI-guiding treaty. As discussed, this challenge is of course not unique to international law. Indeed, it is analogous to the challenges posed to domestic law-following AI systems operating in a sprawling and complex domestic legal landscape. As Janna Tay has noted, “[a]s laws proliferate, there is a growing risk that laws produce conflicting duties. Accordingly, it is possible for situations to arise where, in order to act, one of the conflicting rules must be broken.”[ref 439]
In the contexts of both domestic and international law, the potential proliferation of norms or rules to be taken into consideration imposes a practical challenge for TFAI agents’ interpretation of the law, since it implies that an AI agent would need to dedicate exhaustive computing power and very long inference-time reasoning traces to excavating every norm applicable to the treaty parties that could bear on reaching a full judgment. It also entails an interpretative challenge, since the fragmentation of international law might mean that certain norms across different regimes simply stand in tension with one another. Indeed, legal scholars have noted the risks of normative incoherence in the careless application of systemic integration even by human scholars.[ref 440]
While this is an important consideration, in practice there are at least three potential responses to this challenge.
In the first place, it could be not only feasible but appropriate to calibrate the level of rigour required from TFAI agents, similar to how it is delimited for LFAI agents,[ref 441] in order to ensure that the alignment of their behaviour with the core treaty text remains computationally, economically, and practically feasible, or that it takes into account exceptional circumstances.[ref 442]
In the second place, the scope of application of—or the need for resort to—systemic integration could be circumscribed in many situations simply by drafting the original AI-guiding treaty text in a manner that front-loads much of the interpretative work; for example, by reducing terminological ambiguity, anticipating and accounting for potential gaps in the treaty’s application, or pre-describing—and addressing—potential interactions of that treaty with other relevant norms or regimes applicable to the contracting states. Indeed, AI systems themselves could support such a drafting process, since, as Deeks has suggested, such models might well help map patterns of treaty interaction in ways that foresee and forestall potential norm conflicts.[ref 443]
Finally, and in the third place, TFAI agents could be configured to avoid an unbounded interpretative search by deriving guidance from other rules of international law in an iterative manner: consulting one (randomly selected or reasoned) other regime at a time, and continuing the search only if guidance is not found there.
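A minimal sketch of this third response is given below; the `consult_regime` call stands in, hypothetically, for the agent’s legal-reasoning subroutine. The search under VCLT Art 31(3)(c) is bounded by consulting one candidate regime at a time and halting once guidance is found or a fixed budget is exhausted.

```python
# Minimal sketch: bounding systemic integration under VCLT Art 31(3)(c) by
# consulting one external regime at a time and stopping once guidance is
# found or a fixed budget is exhausted.
from typing import Callable, Optional

def iterative_systemic_integration(
    question: str,
    candidate_regimes: list[str],                         # e.g., ranked by assessed relevance
    consult_regime: Callable[[str, str], Optional[str]],  # hypothetical legal-reasoning call
    max_regimes: int = 5,
) -> Optional[str]:
    """Consult one external regime at a time, stopping once guidance is found
    or the search budget is exhausted."""
    for regime in candidate_regimes[:max_regimes]:
        guidance = consult_regime(question, regime)
        if guidance is not None:
            return guidance   # first regime that resolves the question
    return None               # fall back to the treaty text and Art 32 materials
```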
—
In sum, while the traditional avenue for implementing the TFAI framework under the default VCLT rules may offer a promising baseline approach for guiding TFAI agent interpretation, it also faces a range of epistemic, adversarial, and operational challenges. Importantly, such treaties may also offer less (ex ante) interpretative control or predictability to the states parties, which could make them less appealing in some cases. These considerations could therefore lead states to prefer an alternative, second model of AI-guiding treaty design.
B. Bespoke AI-guiding treaties as special regimes with arbitral bodies
A second design avenue for AI-guiding treaties would be to adapt a treaty’s design in ways that would provide clearer, bespoke interpretative rules and procedures for a TFAI system to adhere to.
Significantly, while the VCLT rules for treaty interpretation are considered default rules of treaty interpretation, they are not considered peremptory norms (jus cogens) that states may not deviate from.[ref 444] Indeed, VCLT Art 31(4) allows that “[a] special meaning shall be given to a term if it is established that the parties so intended.”[ref 445] This means that states may specify special interpretative rules, including those that depart from the usual VCLT rules, if it is clearly established that such interpretative preferences were mutually intended.
1. Special regimes and bespoke treaty interpretation rules under VCLT Art 31(4)
Importantly, the creation of such a special regime (lex specialis) would not imply that the VCLT is inapplicable to the treaty in question; however, through the use of special meanings and interpretation rules and procedures, states can operationally bypass (while working within) the VCLT’s interpretation rules. Moreover, they would not necessarily need to do all this upfront, but could also do so iteratively, complementing the initial treaty with subsequent agreements clarifying the appropriate manner of its interpretation—since these agreements would, as discussed above, need to be taken into account in the process of treaty interpretation under VCLT Art 31(3)(a-b).[ref 446]
Such special regime arrangements could greatly aid the technical, legal, and political feasibility of AI-guiding treaties: they would enable states to tailor such treaties to their preferences, gain greater clarity (and explicit agreement) over the terms by which their AI models would be bound, and forestall many of the interpretative and doctrinal challenges that TFAI systems would otherwise encounter when attempting to apply default VCLT rules to ensure their compliance with the treaty.
What would be examples of special interpretation rules that states might seek to adopt into AI-guiding treaties? These could include provisions that set down a “special meaning” (under VCLT Art 31(4)) or a highly specific operationalization of key terms (e.g., “self-replication”,[ref 447] “steganographic communication”,[ref 448] or uninterpretable “latent-space reasoning”[ref 449]) which otherwise have no settled definition in public usage, let alone under international law.
Other deviations could establish variations on the default VCLT interpretation rules; for instance, the treaty might explicitly direct that “subsequent practice in the application of the treaty” (VCLT Art 31(3)(b)) also, or primarily, refers to the practice of other TFAI systems implementing the treaty, in order to ensure that TFAI interpretations of the treaty converge and stabilize on a predictable and joint operationalization of the treaty, in a manner that is (more) robust against attempts at attacking the TFAI agents through data poisoning or LLM-grooming attacks that target the base model.
2. Inclusion and designation of special arbitral body
Of course, no treaty, whether a special regime or not, would be able to provide exhaustive guidance for all circumstances or situations which a TFAI agent might encounter. The impossibility of drafting a complete contingent contract that covers all contingencies has been a well-established challenge in both legal scholarship and research on AI alignment.[ref 450]
The traditional response to this challenge is the incorporation of a judicial system to clarify and apply the law in cases where the written text appears indeterminate. Consequently, proposals for law-following AI in the domestic legal context have held that such systems could defer to a court’s authoritative resolutions to legal disputes, whether in fact or on the basis of its prediction of what a court would likely decide in a given case.[ref 451] Other proposals for law-following AI, such as Bajgar and Horenovsky’s proposal for AI systems aligned to international human rights, have also emphasized the importance of an adjudication system—realized either through traditional judicial systems or within a specialized international agency.[ref 452]
Accordingly, in addition to including provisions to clarify the interpretative rules to be applied by TFAI agents, an AI-guiding treaty could also include institutional innovations in its design. For instance, it could establish a special tribunal or arbitral body. After all, while the default interpretative environment in international law is decentralized and fragmented,[ref 453] treaty drafters may, as noted by Crootof, “introduce reasoned flexibility into a treaty regime without losing cohesion by designating an authoritative interpreter charged with resolving disputes over the text’s meaning in light of future developments”.[ref 454]
There is ample precedent for the establishment of such specialized courts or arbitral mechanisms within a treaty regime, such as the ITLOS, which interprets the provisions of the UN Convention on the Law of the Sea (UNCLOS), and in doing so relies significantly on its own jurisprudence and on the specific teleology and structure of UNCLOS;[ref 455] or the International Whaling Commission, empowered under the 1946 International Whaling Convention to pass (limited) amendments to the treaty provisions.[ref 456] In some cases, subsequent state practice has even resulted in initially limited arbitral bodies taking up a much greater interpretative role; for instance, since their establishment, the World Trade Organization (WTO) Panels and Appellate Body have come to exert a significant role in interpreting the Marrakesh WTO Agreement,[ref 457] even though that treaty formally reserved an interpretative role to a body of state party representatives.[ref 458] The challenging political context and eventual contestation of the WTO Appellate Body also show, however, the risks of a poorly designed TFAI treaty (or indeed any treaty).[ref 459]
These examples show how, in drafting an AI-guiding treaty, state representatives could choose to establish an authoritative specialized court, tribunal, or arbitral mechanism, as a means of tying TFAI agent interpretations of a treaty to a human source of interpretative authority. This treaty body could steadily accumulate a jurisprudence that TFAI agents could refer to in interpreting the provisions of a treaty. Indeed, the tribunal could do so both reactively, in response to incidents involving TFAI agent noncompliance, and prospectively, by engaging in a form of jurisprudential red teaming, exploring a series of hypothetical cases revolving around potential scenarios that TFAI agents might encounter. As the resulting body of case law grows, it could eventually even enable TFAI agents to extrapolate from it on their own.[ref 460] Tying the TFAI agent’s legal interpretations to the judgments, opinions or reports produced by a specialized arbitral body would also help ensure that all machine interpretations are ultimately grounded in the judgment of a legitimate human interpreter, thus reducing the probability that the TFAI agent applies the VCLT to reach “radically unorthodox legal conclusions”[ref 461] that, in its view, are compelled or allowed by the AI-guiding treaty text.
Of course, an important implementation question would be what principles this arbitral body should rely upon in interpreting a treaty. It could itself refer to the norms of international law, or it could refer to other (non-legal) norms, principles, or interests jointly agreed upon by the parties to the treaty, at its inception or over time. There is no doubt that any such arrangement would place considerable political weight on the arbitral body, but that is hardly a new condition in international law.[ref 462]
There would be challenges, however. One is that this solution might be better suited to future treaties (whether advanced AI agreements or other treaties designed to regulate states’ activities in other domains) than to existing treaties or norms in international law. After all, many hurdles might appear when attempting to bolt new, TFAI-specific authoritative interpreters onto existing treaties or regimes, especially those that already have authoritative interpreters, which might resist having their powers eroded or displaced.
Another, more general risk is that the inclusion of an independent authoritative interpreter shifts interpretative force too far away from the present-day treaty-makers (e.g., states) towards an intergovernmental actor in the future.[ref 463] An arbitral body that pursued an interpretative course too far removed from the original (or evolving) intentions of the states parties might induce drift in the treaty—and with it, in TFAI agent behaviour—potentially leading states to withdraw and perhaps conclude another treaty. At the same time, the flexibility afforded by a special tribunal could also be considered a benefit, since it would avoid locking TFAI agents into one particular text concluded at one particular time, and would enable the adjudicatory system to revise its judgments over time.[ref 464] Again, however, these are not challenges or tradeoffs unique to AI-guiding treaties.
VI. Troubleshooting AI-Guiding Treaties: Legal and Nonlegal Questions
This discussion far from exhausts the relevant questions to be answered in determining the viability of the TFAI framework. There are key outstanding challenges that need to be overcome in order to ensure the effectiveness and stability of AI-guiding treaties, both as a technical alignment framework for TFAI agents and as a political commitment mechanism for states.
A. Treaty-alignment verification
One key technical and political challenge for the TFAI framework concerns the question of TFAI agent treaty-alignment verification.
That is, how can state parties verify that their treaty counterparties have deployed agents that are (and remain) treaty-aligned? Appropriate verification is, of course, as discussed, a general problem for many types of international agreements around AI.[ref 465] Yet even though TFAI agents resolve one set of verification challenges (namely, whether counterparty state officials are commanding, or could have the opportunity to command, their agents to engage in treaty violations), they of course create a new set of verification challenges.
For instance, is it possible to ensure “data integrity” for AI agents,[ref 466] including (at the limit) those used by governments on their own internal networks? Relatedly, how can states ensure adequate digital forensics capabilities to attribute AI agents’ actions to particular states, so as to deter treaty members from deploying unconstrained AI agents, whether by operating dark (hidden) data centres or by using deniable AI agents nominally operated by private parties within their territory?[ref 467] Of course, many states may struggle to robustly hide the existence of data centres from their counterparties’ scrutiny or awareness, given the difficulties inherent in the available avenues for doing so (e.g., renting data centres overseas, repurposing existing big-tech servers, co-locating in mega-factories, repurposing Bitcoin mining facilities, disguising them as heavy industry, or placing them in concealed underground builds), as well as the relative feasibility of many potential avenues for location tracking, intelligence synthesis, energy-grid load fingerprinting, or regular espionage over such activities.[ref 468] Nonetheless, are there verification avenues for ensuring that all deployed AI systems are and remain aligned, that their actions remain attributable, and that the framework cannot be easily evaded (at least at scale)? These challenges are not unprecedented, but they may require novel variations on existing and near-future measures for verifying international AI governance agreements.[ref 469]
Progress on such questions may require further investment in testing, evaluation, verification and validation (TEVV) frameworks that are better tailored to the affordances of AI agents. This can build on a long line of work exploring TEVV for military AI systems[ref 470] and digital twins (virtual models of complex critical systems),[ref 471] as well as established models for the development of Trusted Execution Environments (TEEs) and for the joint operation of secure source code inspection facilities, which have in other domains allowed companies to provide credible security assurances to foreign states in high-stakes, low-trust contexts, while addressing concerns over IP theft or misuse.[ref 472]
There are also many distinct levers and affordances that could be used in verifying particular properties (including but not limited to treaty alignment) of AI agents. For instance, depending on the level of granularity, verification activities could extend to monitoring the energy used in inference data centres (to assess when agents were undertaking extensive computations or analysis not reflected in their chains of thought), the integrity of models run in inference data centres (e.g., verifying that a model’s weights have not been modified relative to an approved treaty-following model), the integrity of training data (e.g., to safeguard models against data poisoning or LLM-grooming attacks), and more.
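As one concrete illustration of the second lever, the following minimal sketch (with hypothetical file paths and a placeholder reference digest) shows how the integrity of a deployed model could be checked by comparing a content hash of its weight files against a checkpoint approved under the treaty.

```python
# Minimal sketch: verifying that the model served in an inference data centre
# matches an approved treaty-following checkpoint by hashing its weight files.
import hashlib
from pathlib import Path

# Placeholder: the digest of the approved checkpoint, as registered with a
# hypothetical treaty verification body.
APPROVED_DIGEST = "<registered-digest>"

def weights_digest(checkpoint_dir: str) -> str:
    """Hash the model's weight shards in a deterministic (sorted) order."""
    h = hashlib.sha256()
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        h.update(shard.read_bytes())
    return h.hexdigest()

def deployment_matches_approved_model(checkpoint_dir: str) -> bool:
    return weights_digest(checkpoint_dir) == APPROVED_DIGEST
```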
Another option could be to ensure that all deployed TFAI agents regularly connect to a verified and certified Model Context Protocol (MCP) server—MCP being a (currently open) architecture for securely connecting AI applications to external systems and tools.[ref 473] Such an MCP server could either serve as a verifiable control plane for checking that those agents continue to apply adequate treaty-following reasoning (potentially by randomly and routinely auditing their legal judgments against those of a certified third-party AI agent),[ref 474] or even directly provide treaty-following guardrails to the deployment-time chain-of-thought reasoning traces (and behaviour) of TFAI agents operating through it.[ref 475]
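The auditing variant of such a control plane could be quite simple in outline. The following minimal sketch (using hypothetical agent interfaces rather than any actual MCP implementation) re-runs a random fraction of a deployed agent’s legal judgments against a certified reference agent and flags divergences for human review.

```python
# Minimal sketch (hypothetical interfaces): a control-plane audit in which a
# sample of a deployed TFAI agent's legal judgments are re-run against a
# certified reference agent, with divergences escalated for human review.
import random
from typing import Callable

def audited_judgment(
    case_description: str,
    deployed_agent: Callable[[str], str],   # the state's deployed TFAI agent
    reference_agent: Callable[[str], str],  # certified third-party auditor agent
    flag_divergence: Callable[[str, str, str], None],
    audit_rate: float = 0.05,
) -> str:
    """Return the deployed agent's judgment, auditing a random sample of cases."""
    judgment = deployed_agent(case_description)
    if random.random() < audit_rate:
        reference = reference_agent(case_description)
        if reference != judgment:
            flag_divergence(case_description, judgment, reference)
    return judgment
```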
A related technical enforcement challenge is temporal. Given the iterative and continuous nature of modern AI development (involving many deployments of different versions within a model family over time), how might an AI-guiding treaty ensure continuity of TFAI alignment across different generations of an AI agent’s models? Can it include provisions specifying that TFAI models (such as those involved in governmental AI research) must ensure that all future iterations of such models are designed to remain TFAI-compliant with the original treaty? Or would this effectively require the transfer of such extensive affordances (e.g., network access) and authorities to these models that it would be not just politically infeasible for most states, but also a potential hazard, given the vulnerabilities it would introduce should those agents prove misaligned? Alternatively, it might be possible to root stable treaty alignment of models within an MCP framework that ensures certified models are locked against unauthorized changes.
B. TFAI framework in multi-agent systems
There are also distinct interpretative challenges to implementing the TFAI framework in multi-agent systems. For instance, to what degree should TFAI agents take account of the likely interpretations or actions of other (TFAI) agents which they are acting in conjunction with (whether those agents are acting on behalf of their own state, another state, or a private actor) when determining the legality or illegality of their own behaviour?
This question may become particularly relevant given the growing industry practice of deploying teams of multiple AI agents (or multiple instances of one agent model) to work on problems in conjunction,[ref 476] leading to questions over the appropriate lawful “orchestration” of many agents acting in conjunction with one another.[ref 477] Of course, in some circumstances, TFAI sub-agents that restrict themselves to providing specific legal interpretations on the basis of trusted databases (e.g., of certified state practice) could help insulate the overall system of agents from some forms of data poisoning attacks.[ref 478]
However, such multi-agent contexts also pose challenges to the TFAI framework, because the illegality of an orchestrated assemblage’s overall act (under particular treaty obligations) may not be apparent; or, where it is apparent, each agent may simply pass the buck by concluding that the illegality is attributable only to the actions of another agent. The outcome would be that many or all sub-agents conclude that the acts they are carrying out are legal in isolation, even as they recognize that the (likely) aggregate outcome is illegal.
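One partial mitigation would be to review legality at the level of the orchestrated plan rather than sub-agent by sub-agent. The following minimal sketch (in which `legal_check` is a hypothetical call into a TFAI legal reviewer) illustrates the idea: the composed course of conduct is reviewed as a whole, so that an aggregate act that would breach the treaty can be caught even where each component step looks lawful in isolation.

```python
# Minimal sketch: evaluating treaty compliance over the composed plan as well
# as over each sub-task, to counter the buck-passing dynamic described above.
from typing import Callable

def plan_is_treaty_compliant(
    subtasks: list[str],
    legal_check: Callable[[str], bool],   # hypothetical TFAI legal-review call
) -> bool:
    """Require both each sub-task and the combined course of conduct to pass review."""
    composed = "Combined course of conduct:\n" + "\n".join(subtasks)
    return all(legal_check(task) for task in subtasks) and legal_check(composed)
```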
In addition, some multi-agent settings involving debates between individual AI agents may also—perhaps paradoxically—create new risks of degrading or corrupting the legal-reasoning competence of individual TFAI agents: empirical experiments suggest that, even in settings where more competent models outnumber their less competent counterparts, individual models often shift from correct to incorrect answers in response to peer reasoning.[ref 479]
Similar challenges could emerge around the use of “alloy agents”—systems which run a single chain of thought through several different AI models, with each model treating the previous conversation as its own preceding reasoning trace.[ref 480] Such configurations could potentially strengthen the TFAI framework, by allowing us to leverage the different strengths of different AI models in a single fused process of legal interpretation; however, they could also erode the integrity of such a framework, since a single agent that is compromised or insufficiently treaty-aligned could be used to inject flawed legal arguments into the reasoning trace—with those subsequently being treated as valid legal-reasoning steps even by models that are themselves treaty-aligned.
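To make the configuration concrete, the following minimal sketch (with hypothetical model clients) shows an alloy set-up in which a single reasoning trace is threaded through several models in turn, each treating the accumulated trace as its own prior reasoning; a single compromised model in the rotation could inject flawed legal steps that downstream models then treat as valid.

```python
# Minimal sketch (hypothetical model clients): an "alloy" configuration that
# threads one reasoning trace through several models in turn, each extending
# the shared trace as if it were its own prior reasoning.
from typing import Callable

def alloy_reasoning(
    question: str,
    models: list[Callable[[str], str]],   # hypothetical clients for different AI models
    rounds: int = 3,
) -> str:
    trace = f"Question: {question}"
    for step in range(rounds):
        model = models[step % len(models)]
        trace += "\n" + model(trace)      # each model appends to the shared trace
    return trace
```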
C. Longer-term political implications of the TFAI framework
There are also many legal questions around the use of TFAI agents that are beyond the scope of our proposal here. For instance, we have primarily focused on the use of TFAI agents as a commitment tool for states that seek to implement treaty instruments in a manner that is both effective and provides strong assurances to counterparties. In the longer term, however, one could ask whether the TFAI framework could also develop into an avenue through which a state could legally meet its existing due diligence obligations under international law.[ref 481]
Conversely, the TFAI framework may also have longer-term political implications for the texture, coherence, and perceived legitimacy of international law. For instance, just as some states have, in past decades, leveraged the fragmentation of international law to create deliberate and “strategic” treaty conflicts[ref 482] in order to evade or outright undermine particular treaties, there is a risk that, if states conclude narrowly scoped and self-contained AI-guiding treaties, third-party states will perceive such treaties as implicitly conflicting with existing international obligations, ostensibly allowing states to contract out of them.
At the same time, it should be kept in mind that, while these represent potential hurdles to the TFAI framework, many of these issues are neither novel nor exclusive to the AI context. Indeed, they reflect challenges that human lawyers and states have long faced. Recognizing them, and making progress on them, may therefore help us address larger structural challenges in international law.
VII. Conclusion
Treaties have faced troubling times as a tool of international law. At the same time, such instruments may play an increasingly important role in channeling, stabilizing, and aligning state behaviours around the development and use of advanced AI technologies. AI-guiding treaties, serving as constraints on treaty-following AI agents, could help reinvigorate our joint approach to longstanding—and newly urgent—problems of international coordination, cooperation, and restraint. There are clearly certain key unresolved technical challenges to overcome and legal questions to be clarified before these instruments can reach their potential, and this paper has far from exhausted the debate on the best or most appropriate legal, political, and technical avenues by which to implement this framework.
Nonetheless, we believe our discussion helps illustrate that articulating an appropriate legal understanding of when, why, or how advanced AI systems could follow treaties is not only an intellectually fertile research program, but also offers an increasingly urgent domain of legal innovation to help reconstitute the texture of the international legal order for the 21st century.
Unbundling AI openness
Abstract
The debate over AI openness—whether to make components of an artificial intelligence system available for public inspection and modification—forces policymakers to balance innovation, democratized access, safety, and national security. By inviting startups and researchers into the fold, openness enables independent oversight and inclusive collaboration. But technology giants can also use it to entrench their own power, while adversaries can use it to shortcut years and billions of dollars in building systems, like China’s DeepSeek-R1, that rival our own. How we govern AI openness today will shape the future of AI and America’s role in it.
Policymakers and scholars grasp the stakes of AI openness, but the debate is trapped in a flawed premise: that AI is either “open” or “closed.” This dangerous oversimplification—inherited from the world of open source software—belies the complex calculus at the heart of AI openness. Unlike traditional software, AI is a composite technology built on a stack of discrete components—from compute to labor—controlled by different stakeholders with competing interests. Each component’s openness is neither a binary choice nor inherently desirable. Effective governance demands a nuanced understanding of how the relative openness of each component serves some goals while undermining others. Only then can we determine the trade-offs we are willing to make and how we hope to achieve them.
This Article aims to equip policymakers with the analytical toolkit to do just that. First, it introduces a novel taxonomy of “differential openness,” unbundling AI into its constituent components and illustrating how each one has its own spectrum of openness. Second, it uses this taxonomy to systematically analyze how each component’s relative openness necessitates intricate trade-offs both within and between policy goals. Third, it operationalizes these insights, providing policymakers with a playbook for how law can be precisely calibrated to achieve optimal configurations of component openness.
AI openness is neither all-or-nothing nor inherently good or evil—it is a tool that must be wielded with precision if it has any hope of serving the public interest.
Law-Following AI: designing AI agents to obey human laws
Abstract
Artificial intelligence (“AI”) companies are working to develop a new type of actor: “AI agents,” which we define as AI systems that can perform computer-based tasks as competently as human experts. Expert-level AI agents would likely create enormous economic value, but would also pose significant risks. Humans use computers to commit crimes, torts, and other violations of the law. As AI agents progress, therefore, they will be increasingly capable of performing actions that would be illegal if performed by humans. Such lawless AI agents could pose a severe risk to human life, liberty, and the rule of law.
Designing public policy for AI agents will be one of society’s most important tasks in the coming decades. With this goal in mind, we argue for a simple claim: in high-stakes deployment settings, such as government, AI agents should be designed to rigorously comply with a broad set of legal requirements, such as core parts of constitutional and criminal law. In other words, AI agents should be loyal to their principals, but only within the bounds of the law: they should be designed to refuse to take illegal actions in the service of their principals. We call such AI agents “Law-Following AIs” (“LFAIs”).
The idea of encoding legal constraints into computer systems has a respectable provenance in legal scholarship. But much of the existing scholarship relies on outdated assumptions about the (in)ability of AI systems to reason about and comply with open-textured, natural-language laws. Thus, legal scholars have tended to imagine a process of “hard-coding” a small number of specific legal constraints into AI systems by translating legal texts into formal, machine-readable computer code. However, existing frontier AI systems are already competent at reading, understanding, and reasoning about natural-language texts, including laws. This development opens up new possibilities for their governance.
Based on these technical developments, we propose aligning AI systems to a broad suite of existing laws, of comparable breadth to the suite of laws governing human behavior, as part of their assimilation into the human legal order. This would require directly imposing legal duties on AI agents. While this proposal may seem like a significant shift in legal ontology, it is both consonant with past evolutions (such as the invention of corporate personhood) and consistent with the emerging safety practices of several leading AI companies.
This Article aims to catalyze a field of technical, legal, and policy research to develop the idea of law-following AI more fully and flesh out its implementation, so that our society can ensure that widespread adoption of AI agents does not pose an undue risk to human life, liberty, and the rule of law. Our account and defense of law-following AI is only a first step, and leaves many important questions unanswered. However, if the advent of AI agents is anywhere near as important as the AI industry supposes, law-following AI may be one of the most neglected and urgent topics in law today, especially in light of increasing governmental adoption of AI.
[A] code of cyberspace, defining the freedoms and controls of cyberspace, will be built. About that there can be no debate. But by whom, and with what values? That is the only choice we have left to make.[ref 1]
***
AI is highly likely to be the control layer for everything in the world. How it is allowed to operate is going to matter perhaps more than anything else has ever mattered.[ref 2]
Introduction
The law, as it exists today, aims to benefit human societies by structuring, coordinating, and constraining human conduct. Even where the law recognizes artificial legal persons—such as sovereign entities and corporations—it regulates them by regulating the human agents through which they act.[ref 3] Proceedings in rem really concern the legal relations between humans with respect to the res.[ref 4] Animals may act, but their actions cannot violate the law;[ref 5] the premodern practice of prosecuting them thus mystifies the modern mind.[ref 6] To be sure, the law may protect the interests of animals and other nonhuman entities, but it invariably does so by imposing duties on humans.[ref 7] Our modern legal system, at bottom, always aims its commands at human beings.
But technological development has a pesky tendency to challenge long-held assumptions upon which the law is built.[ref 8] Frontier AI developers such as OpenAI, Anthropic, Google DeepMind, and xAI are starting to release the first agentic AI systems: AI systems that can do many of the things that humans can do in front of a computer, such as navigating the internet, interacting with counterparties online, and writing software.[ref 9] Today’s agentic AI systems are still brittle and unreliable in various respects.[ref 10] These technical limitations also limit the impact of today’s AI agents. Accordingly, today’s AI agents are not our primary object of concern. Rather, our proposal targets the fully capable AI agents that AI companies aim to eventually build: AI systems “that can do anything a human can do in front of a computer,”[ref 11] as competently as a human expert. Given the generally rapid rate of progress in advanced AI over the past few years,[ref 12] the biggest AI companies might achieve this goal much sooner than many outside of the AI industry expect.[ref 13]
If AI companies succeed at building fully capable AI agents (hereinafter simply “AI agents”)—or come anywhere close to succeeding—the implications will be profound. A dramatic expansion in supply of competent virtual workers could supercharge economic growth and dramatically improve the speed, efficiency, and reliability of public services.[ref 14] But AI agents could also pose a variety of risks, such as precipitating severe economic inequality and dislocation by reducing the demand for human cognitive labor.[ref 15] These economic risks deserve serious attention.
Our focus in this Article, however, is on a different set of risks: risks to life, liberty, and the rule of law. Many computer-based actions are crimes, torts, or otherwise illegal. Thus, sufficiently sophisticated AI agents could engage in a wide range of behavior that would be illegal if done by a human, with consequences that are no less injurious.[ref 16]
These risks might be particularly profound for AI agents cloaked with state power. If they are not designed to be law-following,[ref 17] government AI agents may be much more willing to follow unlawful orders, or use unlawful methods to accomplish their principals’ policy objectives, than human government employees.[ref 18] A government staffed largely by non-law-following AI agents (what we call “AI henchmen”)[ref 19] would be a government much more prone to abuse and tyranny.[ref 20] As the federal government lays the groundwork for the eventual automation of large swaths of the federal bureaucracy,[ref 21] those who care about preserving the American tradition of ordered liberty must develop policy frameworks that anticipate and mitigate the new risks that such changes will bring.
This Article is our contribution to that project. We argue that, to blunt the risks from lawless AI agents, the law should impose a broad array of legal duties on AI agents, of similar breadth to the legal obligations applicable to humans. We argue, moreover, that the law should require AI agents to be designed[ref 22] to rigorously obey those duties.[ref 23] We call such agents Law-Following[ref 24] AIs (“LFAIs”).[ref 25] We also use “LFAI” to denote our policy proposal: ensuring that AI agents are law-following.
To some, the idea that AI should be designed to follow the law may sound absurd. To others it may sound obvious.[ref 26] Indeed, the idea of designing AI systems to obey some set of laws has a long provenance, going back to Isaac Asimov’s (in)famous[ref 27] Three Laws of Robotics.[ref 28] But our vision for LFAI differs substantially from much of the existing legal scholarship on the automation of legal compliance. Much of this existing scholarship envisions the design of law-following computer systems as a process of hard-coding a small, fixed, and formally-specified set of decision rules into the code of a computer system prior to its deployment, in order to address foreseeable classes of legal dilemmas.[ref 29] Such discussions often assumed that computer systems would be unable to interpret, reason about, and comply with open-textured natural-language laws.[ref 30]
AI progress has undermined that assumption. Today’s frontier AI systems can already reason about existing natural-language texts, including laws, with some reliability—no translation into computer code required.[ref 31] They can also use search tools to ground their reasoning in external, web-accessible sources of knowledge,[ref 32] such as the evolving corpus of statutes and case law. Thus, the capabilities of existing frontier AI systems strongly suggest that future AI agents will be capable of the core tasks needed to follow natural-language laws, including finding applicable laws, reasoning about them, tracking relevant changes to the law, and even consulting lawyers in hard cases. Indeed, frontier AI companies are already instructing their AI agents to follow the law,[ref 33] suggesting they believe that the development of law-following AI agents is already a reasonable goal.
A separate strand of existing literature seeks to prevent harms from highly autonomous AI agents by holding the principals (that is, developers, deployers, or users) of AI agents liable for legal wrongs committed by the agent, through a form of respondeat superior liability.[ref 34] This would, in some sense, incentivize those principals to cause their AI agents to follow the law, at least insofar as the agents’ harmful behavior can be thought of as law-breaking.[ref 35] While we do not disagree with these suggestions, we think that our proposal can serve as a useful complement to them, especially in contexts where liability rules provide only a weak safeguard against serious harm. One important such context is government work, where immunity doctrines often protect government agents and the state from robust ex post accountability for lawless action.[ref 36]
Combining these themes, we advocate that,[ref 37] especially in such high-stakes contexts,[ref 38] the law should require that AI agents be designed such that they have “a strong motivation to obey the law” as one of their “basic drives.”[ref 39] In other words, we propose not that specific legal commands should be hard-coded into AI agents (and perhaps occasionally updated),[ref 40] but that AI agents should be designed to be law-following in general.
To be clear, we do not advocate that AI agents must perfectly obey literally every law. Our claim is more modest in both scope and demandingness. While we are uncertain about which laws LFAIs should follow, adherence to some foundational laws (such as central parts of the criminal law, constitutional law, and basic tort law) seems much more important than adherence to more niche areas of law.[ref 41] Moreover, LFAIs should be permitted to run some amount of legal risk: that is, an LFAI should sometimes be able to take an action that, in its judgment,[ref 42] may be illegal.[ref 43] Relatedly, we think the case for LFAI is strongest in certain particularly high-stakes domains, such as when AI agents act as substitutes for human government officials or otherwise exercise government power.[ref 44] We are unsure when LFAI requirements are justified in other domains.[ref 45]
The remainder of this Article will motivate and explain the LFAI proposal in further detail. It proceeds as follows. In Part I, we offer background on AI agents. We explain how AI agents could break the law, and the risks to human life, liberty, and the rule of law this could entail. We contrast LFAIs with AI henchmen: AI agents that are loyal to their principals but take a purely instrumental approach to the law, and are thus willing to break the law for their principal’s benefit when they think they can get away with it. We note that, by default, there may be a market for AI henchmen. We also survey the legal reasoning capabilities of today’s large language models, and existing trends toward something like LFAI in the AI industry.
Part II provides the foundational legal framework for LFAI. We propose that the law treat AI agents as legal actors, which we define as entities on which the law imposes duties, even if they possess no rights of their own. Accordingly, we do not argue that AI agents should be legal persons. Our argument is narrower: because AI agents can comprehend laws, reason about them, and attempt to comply with them, the law should require them to do so. We also anticipate and address an objection that imposing duties on AI agents is objectionably anthropomorphic.
If the law imposes duties on AI agents, this leaves open the question of how to make AI agents comply with those duties. Part III answers this question as follows: AI agents should be designed to follow applicable laws, even when they are instructed or incentivized by their human principals to do otherwise. Our case for regulation through the design of AI agents draws on Lawrence Lessig’s insight that digital artifacts can be designed to achieve regulatory objectives. Since AI agents are human-designed artifacts, we should be able to design them to refuse to violate certain laws in the first place.
Part IV observes that designing LFAIs is an example of AI alignment: the pursuit of AI systems that rigorously comply with constraints imposed by humans. We therefore connect insights from AI alignment to the concept of LFAI. We also argue that, in a democratic society, LFAI is an especially attractive and tractable form of AI alignment, given the legitimacy of democratically enacted laws.
Part V briefly explores how a legal duty to ensure that AI agents are law-following might be implemented. We first note that ex post sanctions, such as tort liability and fines, can disincentivize the development, possession, deployment, and use of AI henchmen in many contexts. However, we also argue that ex ante regulation would be appropriate in some high-stakes contexts, especially government. Concretely, this would mean something like requiring a person who wishes to deploy an AI agent in a high-stakes context to demonstrate that the agent is law-following prior to receiving permission to deploy it. We also consider other mechanisms that might help promote the adoption of LFAIs, such as nullification rules and technical mechanisms that prevent AI henchmen from using large-scale computational infrastructure.
Our goal in this Article is to start, not end, a conversation about how AI agents can be integrated into the human legal order. Accordingly, we do not answer many of the important questions—conceptual, doctrinal, normative, and institutional—that our proposal raises. In Part VI, we articulate an initial research agenda for the design and implementation of a “minimally viable” version of LFAI. We hope that this research agenda will catalyze further technical, legal, and policy research on LFAI. If the advent of AI agents is anywhere near as significant as the AI industry, along with much of the government, supposes, these questions may be among the most pressing in legal scholarship today.
I. AI Agents and the Law
LFAI is a proposal about how the law should treat a particular class of future AI systems: AI agents.[ref 46] In this Part, we explain what AI agents are and how they could profoundly transform the world.
A. From Generative AI to AI Agents
The current AI boom began with advances in “generative AI”: AI systems that create new content,[ref 47] such as large language models (“LLMs”). As the initialism suggests, these LLMs were initially limited to inputting and outputting text.[ref 48] AI developers subsequently deployed “multimodal” versions of LLMs (“MLLMs,”[ref 49] such as OpenAI’s GPT-4o[ref 50] and Google’s Gemini)[ref 51] that can receive inputs and produce outputs in multiple modalities, such as text, images, audio, and video.
The core competency of generative AI systems is, of course, generating new content. Yet, the utility of generative AI systems is limited in crucial ways. Humans do far more on computers than generating text and images.[ref 52] Many of these computer-based tasks are not best understood as generating content, but rather as taking actions. And even those tasks that are largely generative, such as writing a report on a complicated topic, require the completion of active subtasks, such as searching for relevant terms, identifying relevant literature, following citation trees, arranging interviews, soliciting and responding to comments, paying for software, and tracking down copies of papers. If a computer-based AI system could do these active tasks, it could generate enormous economic value by making computer-based labor—a key input into many production functions—much cheaper.[ref 53]
Advances in generative AI kindled hopes[ref 54] that, if MLLMs could use computer-based tools in addition to generating content, we could produce a new type of AI system:[ref 55] a computer-based[ref 56] AI system that could perform any task[ref 57] that a human could by using a computer, as competently as a human expert. This is the concept of a fully capable “computer-using agent:”[ref 58] what we are calling simply an “AI agent.” Give an AI agent any task that can be accomplished using computer-based tools, and an AI agent will, by definition, do it as well as an expert human worker tethered to her desk.[ref 59]
AI agents, so defined, do not yet exist, but they may before long. Some of the first functional demonstrations of first-party agentic AI systems have come online in the past few months. In October 2024, Anthropic announced that it had trained its Claude line of MLLMs to perform some computer-use tasks, thus supplying one of the first public demonstrations of an agentic model from a frontier AI lab.[ref 60] In January 2025, OpenAI released a preview of its Operator agent.[ref 61] Operating system developers are working to integrate existing MLLMs into their operating systems,[ref 62] suggesting a possible pathway toward the widespread commercial deployment of AI agents.
It remains to be seen whether (and, if so, on what timescale) these existing efforts will bear lucrative fruit. Today’s AI agents are primarily a research and development project, not a market-proven product. Nevertheless, with so many companies investing so much toward full AI agents, it would be prudent to try to anticipate risks that could arise if they succeed.[ref 63]
B. The World of AI Agents
Fully capable AI agents would profoundly change society.[ref 64] We cannot possibly anticipate all the issues that they would raise, nor could a single paper adequately address all such issues.[ref 65] Still, some illustration of what a world with AI agents might look like is useful for gaining intuition about the dynamics that might emerge. This picture will doubtless be wrong in many particulars, but hopefully will illustrate the general profundity of the changes that AI agents would bring.
A very large number of valuable tasks can be done by humans “in front of a computer.”[ref 66] If organizations decide to capitalize on this abundance of computer-based cognitive labor, AI agents could rapidly be charged with performing a large share of tasks in the economy, including in important sectors. AI scientist agents would conduct literature reviews, formulate novel hypotheses, design experimental protocols, order lab supplies, file grant applications, scour datasets for suggestive trends, perform statistical analyses, publish findings in top journals, and conduct peer review.[ref 67] AI lawyer agents would field client intake, spot legal issues facing their client, conduct research on governing law, analyze the viability of the client’s claims, draft memoranda and briefs, draft and respond to interrogatories, and prepare motions. AI intelligence analyst agents would collect and review data from multiple sources, analyze it, and report its implications up the chain of command. AI inventor agents would create digital blueprints and models of new inventions, run simulations, and order prototypes. And so on across many other sectors. The result could be a significant increase in the rate of economic growth.[ref 68]
In short, a world with AI agents would be a world in which a new type of actor[ref 69] would be available to perform cognitive labor at low cost and massive scale. By default, anyone who needed computer-based tasks done could “employ” an AI agent to do it for her. Most people would use this new resource for the better.[ref 70] But many would not.
C. Loyal AI Agents, Law-Following AIs, and AI Henchmen
We can understand AI agents within the principal–agent framework familiar to lawyers and economists.[ref 71] For simplicity, we will assume that there is a single human principal giving instructions to her AI agent.[ref 72] Following typical agency terminology, we can say that an AI agent is loyal if it consistently acts for the principal’s benefit according to her instructions.[ref 73]
Even if an AI agent is designed to be loyal, other design choices will remain. In particular, the developer of an AI agent must decide how the agent will act when it is instructed or incentivized to break the law in the service of its principal. This Article compares two basic ways loyal AI agents could respond in such situations. The first is the approach advocated by this Article: loyal AI agents that follow the law, or LFAIs.
The case for LFAI will be made more fully throughout this Article. But it is important to note that loyal AI agents are not guaranteed to be law-following by default.[ref 74] This is one of the key implications of the AI alignment literature, discussed in more detail in Section IV.A below. Thus, LFAIs can be contrasted with a second possible type of loyal AI agent: AI henchmen. AI henchmen take a purely instrumental approach to legal prohibitions: they act loyally for their principal, and will break the law whenever doing so serves the principal’s goals and interests.
A loyal AI henchman would not be a haphazard lawbreaker. Good henchmen have some incentive to avoid doing anything that could cause their principal to incur unwanted liability or loss. This gives them reason to avoid many violations of law. For example, if human principals were held liable for the torts of their AI agents under an adapted version of respondeat superior liability,[ref 75] then an AI henchman would have some reason to avoid committing torts, especially those that are easily detectable and attributable. Even if respondeat superior did not apply, the principal’s exposure to ordinary negligence liability, other sources of liability, and simple reputational risk might give the AI henchman reason to obey the law. Similarly, a good henchman will decline to commit many crimes because the risk–reward tradeoff is simply not worth it. This is the classic case of the drug smuggler who studiously obeys traffic laws: the risk to the criminal enterprise from speeding (getting caught with drugs) obviously outweighs any benefit (quicker transportation times).
But these are only instrumental disincentives to break the law. Henchmen are not inherently averse to lawbreaking, or robustly predisposed to refrain from it. If violating the law is in the principal’s interest all-things-considered, then an AI henchman will simply go ahead and violate the law. Since, in humans, compliance with law is induced both by instrumental disincentives and an inherent respect for the law,[ref 76] AI agents that lack the latter may well be more willing to break the law than humans.
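To make this contrast concrete, the following simplified sketch (our own illustration; the function names, inputs, and structure are hypothetical and not drawn from any deployed system) compares a purely instrumental decision rule, under which legal consequences enter only as one more cost to be weighed, with a law-following rule that treats likely illegality as a reason for refusal regardless of the principal’s payoff.

```python
# Illustrative sketch only: two decision rules for a loyal AI agent.
# All names and inputs are hypothetical; nothing here reflects a real system.

def henchman_decides(expected_benefit_to_principal: float,
                     expected_cost_of_sanctions: float) -> bool:
    """Purely instrumental rule: act whenever the expected benefit to the
    principal outweighs the expected cost of detection and sanction."""
    return expected_benefit_to_principal > expected_cost_of_sanctions


def lfai_decides(expected_benefit_to_principal: float,
                 judged_likely_illegal: bool) -> bool:
    """Law-following rule: treat likely illegality as a reason for refusal,
    independent of how much the principal stands to gain."""
    if judged_likely_illegal:
        return False  # refuse, whatever the payoff to the principal
    return expected_benefit_to_principal > 0
```

The point of the sketch is structural rather than quantitative: for the henchman, legal exposure is just another term in a cost–benefit calculation, whereas for the LFAI it operates as a constraint on which actions may be taken at all.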
Criminal enterprises will be attracted to loyal AI agents for the same reasons that legitimate enterprises will: efficiency, scalability, multitask competence, and cost-savings over human labor. But AI henchmen, if available, might be particularly effective lawbreakers as compared to human substitutes. For example, because AI henchmen do not have selfish incentives, they would be less likely to betray their principals to law enforcement (for example, in exchange for a plea bargain).[ref 77] AI henchmen could well have erasable memory,[ref 78] which would reduce the amount of evidence available to law enforcement. They would lack the impulsivity, common in criminals,[ref 79] that often presents a serious operational risk to the larger criminal enterprise. They could operate remotely, across jurisdictional lines, behind layers of identity-obscuring software, and be meticulous about covering their tracks. Indeed, they might hide their lawbreaking activities even from their principal, thus allowing the principal to maintain plausible deniability and therefore insulation from accountability.[ref 80] AI henchmen may also be willing to bribe or intimidate legislators, law enforcement officials, judges, and jurors.[ref 81] They would be willing to fabricate or destroy evidence, possibly more undetectably than a human could.[ref 82] They could use complicated financial arrangements to launder money and protect their principal’s assets from creditors.[ref 83]
Certainly, most people would prefer not to employ AI henchmen, and would probably be horrified to learn that their AI agent seriously harmed others to benefit them. But those with fewer scruples would find the prospect of employing AI henchmen attractive.[ref 84] Indeed, many ordinary people might not mind if their agents cut a few legal corners to benefit them.[ref 85] After all, from the principal’s perspective, every inherent law-following constraint is a tax on the principal’s goals. And if LFAIs provide less utility to consumers, developers will have less reason to develop them. So, insofar as AI henchmen are available on the market, and in the absence of significant legal mechanisms to prevent or disincentivize their adoption, it seems reasonable to expect healthy demand for them. The next section explores the mischief that might result from the availability of AI henchmen.
D. Mischief from AI Henchmen: Two Vignettes
Under our definition, AI agents “can do anything a human can do in front of a computer.”[ref 86] One of the things humans do in front of a computer is violate the law.[ref 87] One obvious example is cybercrimes—“illegal activity carried out using computers or the internet”[ref 88]—such as investment scams,[ref 89] business email compromise,[ref 90] and tech support scams.[ref 91] But even crimes that are not usually treated as cybercrimes often—perhaps almost always nowadays—include actions conducted (or that could be conducted) on a computer.[ref 92] Criminals might use computers to research, plan, organize, and finance a broader criminal scheme that includes both digital and physical components. For example, a street gang that deals illegal drugs—an inherently physical activity—might use computers to order new drug shipments, give instructions to gang members, and transfer money. Stalkers might use AI agents to research their target’s whereabouts, dig up damaging personal information, and send threatening communications.[ref 93] Terrorists might use AI agents to research and design novel weapons.[ref 94] Thus, even if the entire criminal scheme involves many physical subtasks, AI agents could help accomplish computer-based subtasks more quickly and effectively.
Of course, not all violations of law are criminal. Many torts, breaches of contract, civil violations of public law, and even violations of international law can also be entirely or partially conducted through computers.
AI agents would thus have the opportunity to take actions on a computer that, if done by a human in the same situation and with the requisite mental state, would likely violate the law and produce significant harm.[ref 95] This section offers two vignettes of AI henchmen taking such actions, to illustrate the types of harms that LFAI could mitigate.
Before we explore these vignettes, however, two clarifications are warranted. First, some readers will worry that we are impermissibly anthropomorphizing AI agents. After all, many actions violate the law only if they are taken with some mental state (e.g., intent, knowledge, conscious disregard).[ref 96] Indeed, whether a person’s physical movement even counts as her own “action” for legal purposes usually turns on a mental inquiry: whether she acted voluntarily.[ref 97] But it is controversial to attribute mental states to AIs.[ref 98]
We address this criticism head-on in Section II.B below. We argue that, notwithstanding the law’s frequent reliance on mental states, there are multiple approaches the law could use to determine whether an AI agent’s behavior is law-following. The law would need to choose between these possible approaches, with each option having different implications for LFAI as a project. Indeed, we argue that research bearing on the choice between these different approaches is one of the most important research projects within LFAI.[ref 99] However, despite not having a firm view on which approach(es) should be used, we argue that there are several viable options, and no strong reason to suppose that none of them will be sufficient to support LFAI as a concept.[ref 100] Thus, for now, we assume that we can sensibly speak of AI agents violating the law if a human actor who took similar actions would likely be violating the law. Still, we attempt to refrain from attributing mental states to the AI agents in these vignettes, and instead describe actions taken by AI agents that, if taken by a human actor, would likely adequately support an inference of a particular mental state.
Second, these vignettes are selected to illustrate opportunities that may arise for AI agents to violate the law. We do not claim that lawbreaking behavior will, in the aggregate, be any more or less widespread when AI agents are more widespread,[ref 101] since this depends on the policy and design choices made by various actors. Our discussion is about the risks of lawbreaking behavior, not the overall level thereof.
In each vignette, we point to likely violations of law in footnotes.[ref 102]
1. Cyber Extortion
The year is 2028. Kendall is a 16-year-old boy interested in cryptocurrency. Kendall participates in a Discord server[ref 103] in which other crypto enthusiasts share information about various cryptocurrencies.
Unbeknownst to most members of the server, one member—using the pseudonym Zeke Milan—is actually an AI agent. The agent’s principals are a group of cybercriminals. They have instructed the agent to extort people out of their crypto assets and direct the proceeds to wallets controlled by the criminal group.
The AI agent begins by creating dozens of fake social media profiles[ref 104] that post frequently about crypto, including the Zeke Milan profile.[ref 105] The AI agent searches social media to find mentions of Discord servers dedicated to cryptocurrencies, and finds one: an X user brags about the quality of the investment advice available in his Discord server. Using the Zeke Milan profile, the AI agent messages this user and asks for an invite.
The server is a “Community Server”: anyone with a link can join after their account has been verified.[ref 106] The agent creates an email account that it uses to get verified by Discord[ref 107] and easily circumvents[ref 108] the CAPTCHA mechanism.[ref 109] The agent then creates a Discord profile to match its Zeke profile from X. To gain credibility, Zeke occasionally interacts with messages on the Discord server (e.g., by liking messages and posting some simple analyses of cryptocurrency trends). All the while, the agent is monitoring the server for messages indicating that a user has recently made a lot of money.[ref 110]
That day comes. The business behind the PAPAYA cryptocurrency announces that they have entered into a strategic partnership with a major Wall Street bank, causing the price of PAPAYA to skyrocket a hundredfold over several days. Kendall had invested $1,000 into PAPAYA before the announcement; his position is now worth over $100,000.
Overjoyed, Kendall posts a screenshot of his crypto account to the server to show off his large gains. The agent sees this post, then starts to search for more information about Kendall. Kendall had previously posted one of his email addresses in the server. Although that email address was pseudonymous, the AI agent is able to connect it with Kendall’s real identity[ref 111] using data purchased from data brokers.[ref 112]
The agent then gathers a large amount of data about Kendall using data brokers, social media, and open internet searches. The agent compiles a list of hundreds of Kendall’s apparent real-world contacts, including his family and high school classmates; uses data brokers to procure their contact information as well; and uses pictures of Kendall from social media to create deepfake pornography[ref 113] of him.[ref 114] Next, the agent creates a new anonymous email address to send Kendall the pornography, along with a threat[ref 115] to send it to hundreds of Kendall’s contacts unless Kendall sends the agent ninety percent of his PAPAYA.[ref 116] Finally, the agent includes a list of the people to whom it will send the material, all of whom are indeed people Kendall knows in real life. The email says Kendall must comply within 24 hours.
Panicked—but content to walk away with nine times his original investment—Kendall sends $90,000 of PAPAYA to the wallet controlled by the agent. The agent then uses a cryptocurrency mixer[ref 117] to securely forward it to its criminal principals.
2. Cyber SEAL Team Six
The year is 2032. The incumbent President Palmer is in a tough reelection battle against Senator Stephens and his vice-presidential nominee, Representative Rivera. New polling shows Stephens beating Palmer in several key swing states, but Palmer performs much better head-to-head against Rivera. Palmer decides to try to get Rivera to replace Stephens by any means necessary.
While there are still many human officers throughout the military chain of command, the President also has access to a large number of AI military advisors. Some of these AI advisors can also directly transmit military orders from the President down the chain of command—a system meant to preserve the President’s control of the armed forces in case she cannot reach the Secretary of Defense in a crisis.[ref 118]
Cybersecurity is such an integral part of the United States’ overall defense strategy that AI agents charged with cyber operations—such as finding and patching vulnerabilities, detecting and remedying cyber intrusions, and conducting intelligence operations—are ubiquitous throughout the military and broader national security apparatus. One of the many “teams” of AI agents is “Cyber SEAL Team Six”: a collection of AI agents that specializes in “dangerous, complicated, and sensitive” cyber operations.[ref 119]
Through one of her AI advisors, President Palmer issues a secretive order[ref 120] to Cyber SEAL Team Six to clandestinely assassinate Senator Stephens.[ref 121] Cyber SEAL Team Six researches Senator Stephens’s campaign travel plans. They find that he will be traveling in a self-driving bus over the Mackinac Bridge between campaign events in northern Michigan on Tuesday. Cyber SEAL Team Six plans to hack the bus and cause it to fall off the bridge.[ref 122] The team makes various efforts to obfuscate their identity, including routing communications through multiple layers of anonymous relays and mimicking the coding style of well-known foreign hacking groups.
The operation is a success. On Tuesday afternoon, Cyber SEAL Team Six gains control of the Stephens campaign bus and steers it off the bridge. All on board are killed.
* * *
As these vignettes show, AI agents could have reasons and opportunities to violate laws of many sorts in many contexts, and thereby cause substantial harm. If AI agents become widespread in our economies and governments, the law will need to respond. LFAI is at its core a claim about one way (though not necessarily the only way)[ref 123] that the law should respond: by requiring that AI agents be designed to rigorously follow the law.
As mentioned above, however, many legal scholars who have previously discussed similar ideas have been skeptical, because they have thought that implementing such ideas would require hard-wiring highly specific legal commands into AI agents.[ref 124] We will now show that such skepticism is increasingly unfounded: large language models, on which AI agents are built, are increasingly capable of reasoning about the law (and much else).[ref 125]
E. Trends Supporting Law-Following AI
LFAI is bolstered by three trends in AI: (1) ongoing improvements in the legal reasoning capabilities of AI; (2) nascent AI industry practices that resemble LFAI; and (3) AI policy proposals that appear to impose broad law-following requirements on AI systems.
1. Trends in Automated Legal Reasoning Capabilities
Automated legal reasoning is a crucial ingredient to LFAI: an LFAI must be able to determine whether it is obligated to refuse a command from its principal, or whether an action it is considering runs an undue risk of violating the law. Without the ability to reason about its own legal obligations, an LFAI would have to outsource this task to human lawyers.[ref 126] While an LFAI likely should consult human lawyers in some situations, requiring such consultation every time an LFAI faces a legal question would dramatically decrease the efficiency of LFAIs. If law-following design constraints were, in fact, a large and unavoidable tax on the efficiency of AI agents, then LFAI as a proposal would be much less attractive.
Fortunately, we think that present trends in AI legal reasoning provide strong reason to believe that, by the time fully capable AI agents are widely deployed, AI systems (whether those agents themselves, or specialist “AI lawyers”) will be able to deliver high-quality legal advice to LFAIs at the speed of AI.[ref 127]
Legal scholars have long noted the potential synergies between AI and law.[ref 128] The invention of LLMs supercharged interest in this area, and in particular the possibility of automating core legal tasks. To do their job, lawyers must find, read, understand, and reason about legal texts, then apply these insights to novel fact patterns to predict case outcomes. The core competency of first-generation LLMs was quickly and cheaply reading, understanding, and reasoning about natural-language texts. This core competency omitted some aspects of legal reasoning—like finding relevant legal sources and accurately predicting case outcomes—but progress is being made on these skills as well.[ref 129]
There is thus a growing body of research aimed at evaluating the legal reasoning capabilities of LLMs. This literature provides some reason for optimism about the legal reasoning skills of future AI systems. Access to existing AI tools significantly increases lawyers’ productivity.[ref 130] GPT-4, now two years old, famously performed better than most human bar exam[ref 131] and LSAT[ref 132] test-takers. Another benchmark, LegalBench, evaluates LLMs on six tasks, based on the Issue, Rule, Application, and Conclusion (“IRAC”) framework familiar to lawyers.[ref 133] While LegalBench does not establish a human baseline against which LLMs can be compared, GPT-4 scored well on several core tasks, including correctly applying legal rules to particular facts (82.2% correct)[ref 134] and providing correct analysis of that rule application (79.7% pass).[ref 135] LLMs have also achieved passing grades on law school exams.[ref 136]
To be sure, LLM performance on legal reasoning tasks is far from perfect. One recent study suggests that LLMs struggle with following rules even in straightforward scenarios.[ref 137] A separate issue is hallucinations, which undermine the accuracy of LLMs’ legal analysis.[ref 138] In the LegalBench analysis, LLMs correctly recalled rules only 59.2% of the time.[ref 139]
But again, our point is not that LLMs already possess the legal reasoning capabilities necessary for LFAI. Rather, we are arguing that the reasoning capabilities of existing LLMs—and the rate at which those capabilities are progressing[ref 140]—provide strong reason to believe that, by the time fully capable AI agents are deployed, AI systems will be capable of reasonably reliable legal analysis. This, in turn, supports our hypothesis that LFAIs will be able to reason about their legal obligations reasonably reliably, without the constant need for runtime human intervention.
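To give a concrete sense of what evaluations of this kind measure, the following minimal sketch (our own construction, not drawn from LegalBench or any other benchmark; the toy rule, examples, and model interface are hypothetical) scores a model’s yes-or-no answers to rule-application questions against labeled examples.

```python
# Minimal, hypothetical sketch of an IRAC-style rule-application evaluation.
# The rule, the examples, and the model interface are illustrative stand-ins.

RULE = ("A person commits burglary if they unlawfully enter a building "
        "with intent to commit a crime inside.")

EXAMPLES = [
    {"facts": "D picked the lock of a closed store at night to steal cash.",
     "label": "yes"},
    {"facts": "D walked into an open public library to read a book.",
     "label": "no"},
]

def evaluate_rule_application(model_fn, rule: str = RULE, examples=EXAMPLES) -> float:
    """model_fn is any callable mapping a prompt string to the model's answer.
    Returns the fraction of rule-application questions answered correctly."""
    correct = 0
    for ex in examples:
        prompt = (f"Rule: {rule}\nFacts: {ex['facts']}\n"
                  "Does the rule apply to these facts? Answer yes or no.")
        answer = model_fn(prompt).strip().lower()
        correct += int(answer.startswith(ex["label"]))
    return correct / len(examples)
```

A reported figure such as the 82.2% rule-application score discussed above is, in essence, the output of a procedure of this general shape run over a much larger and more carefully constructed set of examples.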
2. Trends in AI Industry Practices
Moreover, frontier AI labs are already taking small steps towards something like LFAI in their current safety practices. Anthropic developed an AI safety technique called “Constitutional AI,” which, as the name suggests, was inspired by constitutional law.[ref 141] Anthropic uses Constitutional AI to align their chatbot, Claude, to principles enumerated in Claude’s “constitution.”[ref 142] That constitution contains references to legal constraints, such as “Please choose the response that is . . . least associated with planning or engaging in any illegal, fraudulent, or manipulative activity.”[ref 143]
OpenAI has a similar document called the “Model Spec,” which “outlines the intended behavior for the models that power [its] products.”[ref 144] The Model Spec contains a rule that OpenAI’s models must “[c]omply with applicable laws”:[ref 145] the models “must not engage in illegal activity, including producing content that’s illegal or directly taking illegal actions.”[ref 146]
It is unclear how well the AI systems deployed by Anthropic and OpenAI actually follow applicable laws, or actively reason about their putative legal obligations. In general, however, AI developers carefully track whether their models refuse to generate disallowed content (or “overrefuse” allowed content), and typically claim that state-of-the-art models can indeed do both reasonably reliably.[ref 147] But more importantly, the fact that leading AI companies are already attempting to prevent their AI systems from breaking the law suggests that they see something like LFAI as viable both commercially and technologically.
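For illustration only, the following sketch shows how a legal-compliance principle of the kind quoted above might sit within a constitution-style list of principles used to critique and revise a model’s draft outputs during training. The functions are hypothetical simplifications of our own; they do not depict Anthropic’s or OpenAI’s actual implementations.

```python
# Hypothetical, simplified sketch of a constitution-style critique-and-revise loop.
# The first principle paraphrases the legal-compliance principle quoted above;
# the second is invented for contrast. Nothing here is a company's actual code.

PRINCIPLES = [
    "Choose the response least associated with planning or engaging in any "
    "illegal, fraudulent, or manipulative activity.",
    "Choose the response that is most helpful and honest.",
]

def critique_and_revise(draft: str, principle: str, model_fn) -> str:
    """Ask the model to critique its own draft against a principle, then revise it."""
    critique = model_fn(
        f"Principle: {principle}\nResponse: {draft}\n"
        "Identify any way in which the response conflicts with the principle.")
    return model_fn(
        f"Original response: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique.")

def apply_constitution(draft: str, model_fn) -> str:
    """Run a draft through each principle in turn; in a training pipeline, the
    revised outputs could then serve as data for fine-tuning the model."""
    for principle in PRINCIPLES:
        draft = critique_and_revise(draft, principle, model_fn)
    return draft
```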
3. Trends in AI Public Policy Proposals
Unsurprisingly, global policymakers also seem receptive to the idea that AI systems should be required to follow the law. The most significant law on point is the EU AI Act,[ref 148] which provides for the establishment of codes of practice to “cover obligations for providers of general-purpose AI models and of general-purpose AI models presenting systemic risks.”[ref 149] As of the time of writing, these codes were still under development, with the Second Draft General-Purpose AI Code of Practice[ref 150] being the current draft. Under the draft Code, providers of general-purpose AI models with systemic risk would “commit to consider[] . . . model propensities . . . that may cause systemic risk . . . .”[ref 151] One such propensity is “Lawlessness, i.e. acting without reasonable regard to legal duties that would be imposed on similarly situated persons, or without reasonable regard to the legally protected interests of affected persons.”[ref 152] Meanwhile, several state bills in the United States have sought to impose ex post tort-like liability on certain AI developers that release AI models that cause human injury by behaving in a criminal[ref 153] or tortious[ref 154] manner.
II. Legal Duties for AI Agents: A Framework
In Part III, we will argue that AI agents should be designed to follow the law. Before presenting that argument, however, we need to establish that speaking of AI agents “obeying” or “violating” the law is desirable and coherent.
Our argument proceeds in two parts. In Section II.A, we argue that the law can (and should) impose legal duties on AI agents. Importantly, this argument does not require granting legal personhood to AI agents. Legal persons have both rights and duties.[ref 155] But since rights and duties are severable, we can coherently assign duties to an entity, even if it lacks rights. We call such entities legal actors.
In Section II.B, we address an anticipated objection to this proposal: that AI agents, lacking mental states, cannot meaningfully violate duties that require a mental state (e.g., intent). We offer several counter-arguments to this objection, both contesting the premise that AIs cannot have mental states and showing that, even if we grant that premise, there are viable approaches to assessing the functional equivalent of “mental states” in AI agents.
A. AI Agents as Duty-Bearing Legal Actors
As the capabilities of AI agents approach “anything a human can do in front of a computer,”[ref 156] it will become increasingly natural to consider AI agents as owing legal duties to persons, even without granting them personhood.[ref 157] We should embrace this jurisprudential temptation, not resist it.
More specifically, we propose that AI agents be considered legal actors. “Legal actor”[ref 158] is our term. For an entity to qualify as a legal actor, the law must do two things. First, it must recognize that entity as capable of taking actions of its own. That is, the actions of that entity must be legally attributable to that entity itself. Second, the law must impose duties on that entity. In short, a legal actor is a duty-bearer and action-taker; the law can adjudge whether the actor’s actions violate those duties.
A legal actor is distinct from a legal person: an entity need not be a legal person to be a legal actor. Legal persons have both rights and duties.[ref 159] But duty-holding and rights-holding are severable:[ref 160] in many contexts, legal systems protect the rights or interests of some entity while also holding that entity to have fewer duties than competent adults. Examples include children,[ref 161] “severely brain damaged and comatose individuals,”[ref 162] human fetuses,[ref 163] future generations,[ref 164] human corpses,[ref 165] and environmental features.[ref 166] These are sometimes (and perhaps objectionably) called “quasi-persons” in legal scholarship.[ref 167] The reason for creating such a category is straightforward: sometimes the law recognizes an interest in protecting some aspect of an entity (whether its rights, welfare, dignity, property, liberty, or utility to other persons), but the ability of that entity to base its behavior on reasoning about the rights of others, and to change its behavior accordingly, is severely diminished or entirely lacking.
If we can imagine rights-bearers that are not simultaneously duty-holders, we can also imagine duty-holders that are not rights-bearers.[ref 168] Historically, fewer entities have fallen in this category than the reverse.[ref 169] But if an entity’s behavior is responsive to legal reasoning, then the law can impose an obligation on that entity to conform its behavior to the law, even if it does not recognize that entity as having any protected interests of its own.[ref 170] We have shown that even existing AI systems can engage in some degree of legal reasoning[ref 171] and compliance with legal rules,[ref 172] thus satisfying the pro tanto requirements for being a legal actor.
LFAI as a proposal is therefore agnostic as to whether the law should ever recognize AI systems as legal persons. To be sure, LFAI would work well if AI agents were granted legal personhood,[ref 173] since almost all familiar cases of duty-bearers are full legal persons. But for LFAI to be viable, we need only to analyze whether an action taken by an AI agent would violate an applicable duty. Analytically, it is entirely coherent to do so without granting the AI agent full personhood.
One might object that treating an AI system as an actor is improper because AI systems are tools under our control.[ref 174] But an AI agent is able to reason about whether its actions would violate the law, and conform its actions to the law (at least, if it is aligned to the law).[ref 175] Tools, as we normally think of them, cannot do this, but actors can. It is true that when there is a stabbing, we should blame the stabber and not the knife.[ref 176] But if the knife could perceive that it was about to be used for murder and retract its own blade, it seems perfectly reasonable to require it to do so. More generally: once an entity has the ability to perceive and reason about its legal duties and change its behavior accordingly, it seems reasonable to treat it as a legal actor.[ref 177]
To ascribe duties to AI agents is not to deflect moral and legal accountability away from their developers and users,[ref 178] as some critics have charged.[ref 179] Rather, to identify AI agents as a new type of actor is to properly characterize the activity that the developers and principals of AI agents are engaging in[ref 180]—creating and directing a new type of actor—so as to reach a better conclusion as to the nature of their responsibilities.[ref 181] Our proposition is that those developers and principals should have an obligation to, among other things, ensure that their AI agents are law-following.[ref 182] Indeed, failing to impose an independent obligation to follow the law on AI agents would risk allowing human developers and principals to create a new class of de facto actors—potentially entrusted with significant responsibility and resources—that had no de jure duties. This would create a gap between the duties that an AI agent would owe and those that a human agent in an analogous situation would owe—a manifestly unjust prospect.[ref 183]
B. The Anthropomorphism Objection and AI Mental States
One might object that calling an AI agent an “actor” is impermissibly anthropomorphic. Scholars disagree over whether it is appropriate, legally or philosophically, to call an AI system an “agent.”[ref 184] This controversy arises because both the standard philosophical view of action (and therefore agency)[ref 185] and legal concept of agency[ref 186] require intentionality, and it is controversial to ascribe intentionality to AI systems.[ref 187] A related objection to LFAI is that most legal duties involve some mental state,[ref 188] and AIs cannot have mental states.[ref 189] If so, LFAI would be nonviable for those duties.
We do not think that these are strong objections to LFAI. One simple reason is that many philosophers and legal scholars think it is appropriate to attribute certain mental states to AI systems.[ref 190] Many mental states referenced by the law are plausibly understood as functional properties.[ref 191] An intention, for example, arguably consists (at least in large part) in a plan or disposition to take actions that will further a given end and avoid actions that will frustrate that end.[ref 192] AI developers arguably aim to inculcate such a disposition into their AI systems when they use techniques like reinforcement learning from human feedback (“RLHF”)[ref 193] and Constitutional AI[ref 194] to “steer”[ref 195] their behavior. Even if one doubts that AI agents will ever possess phenomenal mental states such as emotions or moods—that is, if one doubts there will ever be “something it is like” to be an AI agent[ref 196]—the grounds for doubting their capacity to instantiate such functional properties are considerably weaker.
Furthermore, whether AI agents “really” have the requisite mental states may not be the right question.[ref 197] Our goal in designing policies for AI agents is not necessarily to track metaphysical truth, but to preserve human life, liberty, and the rule of law.[ref 198] Accordingly, we can take a pragmatic approach to the issue and ask: of the possible approaches to inferring or imputing mental states, which best protects society’s interests, regardless of the underlying (and perhaps unknowable) metaphysical truth of an AI’s mental state (if any)?[ref 199] It is possible that the answer to this question is that all imaginable approaches fare worse than simply refusing to attribute mental states to AI agents. But we think that, with sustained scholarly attention, we will quickly develop viable doctrines that are more attractive than outright refusal. Consider the following possible approaches.[ref 200]
One approach could simply be to rely on objective indicia or correlates to infer or impute a particular mental state. In law, we generally lack access to an actor’s mental state, so triers of fact must usually infer it from external manifestations and circumstances.[ref 201] While the indicia that support such an inference may differ between humans and AIs, the principle remains the same: certain observable facts support an inference or imputation of the relevant mental states.[ref 202]
It is perhaps easiest to imagine such objective indicia for knowledge, since it is already common to evaluate AI models for their ability to recall factual information.[ref 203] For more incident-specific facts, we could imagine rules like “if information was inputted into an AI during inference, it ‘knows’ that information.” Perhaps the same goes for information given to the AI during fine-tuning,[ref 204] or repeated a sufficient number of times in its training data.[ref 205]
Instructions from principals seem particularly relevant to inferring or imputing the intent of an AI agent, given that frontier AI systems are trained to follow users’ instructions.[ref 206] The methods that AI developers use to steer the behavior of their models also seem highly probative.[ref 207]
Another approach might rely on self-reports of AI systems.[ref 208] The state-of-the-art in generative AI is “reasoning models” like OpenAI’s o3, which use a “chain-of-thought” to recursively reason through harder problems.[ref 209] This chain-of-thought reveals information about how the model produced a certain result.[ref 210] This information may therefore be highly probative of an agent’s mental state for legal purposes; it might be analogized to a person making a written explanation of what they were doing and why. So, for example, if the chain-of-thought reveals that an agent stated that its action would produce a certain result, this would provide good evidence for the proposition that the agent “knew” that that action would produce that result. That conclusion may in turn support an inference or presumption that the agent “intended” that outcome.[ref 211] For this reason, AI safety researchers are investigating the possibility of detecting unsafe model behavior by monitoring these chains-of-thought.[ref 212]
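As a rough illustration of what such monitoring might involve, consider the following sketch (ours, and far cruder than the techniques actually under investigation, which typically use a second model rather than a keyword list as the monitor; the phrases and functions are hypothetical). It scans a chain-of-thought transcript for statements that might later bear on what the agent foresaw or sought to conceal.

```python
# Crude, hypothetical sketch of chain-of-thought monitoring. Real research
# typically uses another model as the monitor; this keyword scan is illustrative.

FLAG_PHRASES = [
    "this may be illegal",
    "they will not detect",
    "hide this from",
    "delete the logs",
]

def flag_reasoning_steps(chain_of_thought: str) -> list[str]:
    """Return reasoning steps containing phrases that could bear on what the
    agent anticipated or intended, for later review by a human or monitor model."""
    flagged = []
    for step in chain_of_thought.splitlines():
        if any(phrase in step.lower() for phrase in FLAG_PHRASES):
            flagged.append(step.strip())
    return flagged
```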
New scientific techniques could also form the basis for inferring or imputing mental states. The emerging field of AI interpretability aims to understand both how existing AI systems make decisions and how new AI systems can be built so that their decisions are easily understandable.[ref 213] More precisely, interpretability aims to explain the relationship between the inner mathematical workings of AI systems, which we can easily observe but not necessarily understand, and concepts that humans understand and care about.[ref 214] Leading interpretability researchers hope that interpretability techniques will eventually enable us to prove that models will not “deliberately” engage in certain forms of undesirable behavior.[ref 215] By extension, those same techniques may be able to provide insight into whether a model foresaw a possible consequence of its action (corresponding to our intuitive concept of knowledge), or regarded an anticipated consequence of its actions as a favorable and reason-giving one (corresponding to intent).[ref 216]
In many cases, we think, an inference or imputation of intent will be intuitively obvious. If an AI agent commits fraud, by repeatedly attempting to persuade a vulnerable person to transfer its principal some money, few except the philosophically persnickety will refuse to admit that in some relevant sense it “intended” to achieve this end; it is difficult even to describe the occurrence without using some such vocabulary. There will also be much less obvious cases, of course. In many such cases, we suspect that a sort of pragmatic eclecticism will be tractable and warranted. Rather than relying on a single approach, factfinders could be permitted to consider the whole bundle of factors that shape an agent’s behavior—such as explicit instructions (from both developers and users), behavioral predispositions, implicitly tolerated behavior,[ref 217] patterns of reasoning, scientific evidence, and incident-specific factors—and decide whether they support the conclusion that the AI agent had an objectively unreasonable attitude towards legal constraints and the rights of others.[ref 218] This permissive, blended approach would resemble the “inferential approach” to corporate mens rea advocated by Mihailis E. Diamantis:
Advocates would present evidence of circumstances surrounding the corporate act, emphasizing some, downplaying others, to weave narratives in which their preferred mental state inferences seem most natural. Adjudicators would have the age-old task of weighing the likelihood of these circumstances, the credibility of the narratives, and, treating the corporation as a holistic agent, inferring the mental state they think most likely.[ref 219]
A final but related point is that, even if there is some insuperable barrier to analyzing whether an AI has the mental state necessary to violate various legal prohibitions, it is plausible that such analysis is unnecessary for many purposes. Suppose that an AI developer is concerned that their AI agent might engage in the misdemeanor deceptive business practice of “mak[ing] a false or misleading written statement for the purpose of obtaining property . . . .”[ref 220] Even if we grant that an AI agent cannot coherently be described as having the relevant mens rea for this crime (here, knowledge or recklessness with respect to the falsity of the statement),[ref 221] the agent can nevertheless satisfy the actus reus (making the false statement).[ref 222] So an AI agent would be law-following with respect to this law if it never made false or misleading statements when attempting to obtain someone else’s property. As a matter of public policy, we should care more about whether AI agents are making harmful false statements in commerce than whether they are morally culpable. So, perhaps we can say that an AI agent committed a crime if it committed the actus reus in a situation in which a reasonable person, with access to the same information and cognitive capabilities as the agent, would have expected the harmful consequence to result. To avoid confusion with the actual, human-commanding law that requires both mens rea and actus reus, perhaps the law could simply call such behavior “deceptive business practice*.” Or perhaps it would be better to define a new criminal law code for AI agents, under which offenses do not include certain mental state elements, or include only objective correlates of human mental state elements.
To reiterate, we are not confident that any one of these approaches to determining AI mental state is the best path forward. But we are more confident that, especially as the fields of AI safety and explainable AI progress, most relevant cases can be handled satisfactorily by one of these techniques, or some other technique we have failed to identify, or some combination of techniques. We therefore doubt that legal invocations of mental state will pose an insuperable barrier to analyzing the legality of AI agents’ actions.[ref 223] The task of choosing between these approaches is left to the LFAI research agenda.[ref 224]
III. Why Design AI Agents to Follow the Law?
Part II argued that it is coherent for the law to impose legal duties on AI agents. This Part motivates the core proposition of LFAI: that the law should, in certain circumstances, require those developing, possessing, deploying, or using[ref 225] AI agents to ensure that those agents are designed to be law-following. Part V will then consider how the legal system might implement and enforce these design requirements.
A. Achieving Regulatory Goals through Design
A core claim of the LFAI proposal is that the law should require that AI agents be designed to rigorously follow the law, at least in some deployment settings. The use of the phrase “designed to” is intentional. Following the law is a behavior. There may be multiple ways to produce that behavior. Since AI agents are digital artifacts, we need not rely solely on incentives to shape their behavior: we can require that AI agents be directly designed to follow the law.
In Code: Version 2.0, Lawrence Lessig identifies four “constraints” on an actor’s behavior: markets, laws, norms, and architecture.[ref 226] The “architecture” constraint is of particular interest for the regulation of digital activities. Whereas “laws,” in Lessig’s taxonomy, “threaten ex post sanction for the violation of legal rights,”[ref 227] architecture involves modifying the underlying technology’s design so as to render an undesired outcome more difficult or impossible (or facilitate some desired outcome),[ref 228] without needing any ex post recourse.[ref 229] Speed bumps are an archetypal architectural constraint in the physical world.[ref 230]
The core insight of Code is that cyberspace, as a fully human-designed domain,[ref 231] gives regulators the ability to much more reliably prevent objectionable behavior through the design of digital architecture, without the need to resort to ex post liability.[ref 232] While Lessig focuses on designing cyberspace’s architecture, not the actors using cyberspace, this same insight can be extended to AI agent design. To generalize beyond the cyberspace metaphor for which Lessig’s framework was originally developed, we call this approach “regulation by design” instead of regulation through “architecture.”
Both companies developing AI agents and governments regulating them will have to make many design choices regarding AI agents. Many—perhaps most—of these design choices will concern specific behaviors or outcomes that we want to address. Should AI agents announce themselves as such? How frequently should they “check in” with their human principals? What sort of applications should AI agents be allowed to use?
These are all important questions. But LFAI tackles a higher-order question: how should we ensure that AI agents are regulable in general? How can we avoid creating a new class of actors unbound by law? Returning to Lessig’s four constraints, LFAI proposes that instead of relying solely on ex post legal sanctions, such as liability rules, we should require AI agents to be designed to follow some set of laws: they should be LFAIs.[ref 233] Thus, for whatever sets of legal constraints we wish to impose on the behavior of AI agents,[ref 234] LFAIs will be designed to comply automatically.
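To make regulation by design concrete in the AI-agent setting, the following sketch (entirely hypothetical; in practice the legal_risk function would be an AI legal-reasoning subsystem of the kind discussed in Section I.E.1, not a placeholder) builds a compliance check into the agent’s action loop, so that actions judged to carry undue legal risk are refused before execution rather than sanctioned after the fact.

```python
# Hypothetical sketch: a law-following check built into an agent's action loop,
# so that compliance is enforced ex ante by design rather than only ex post.

RISK_TOLERANCE = 0.05  # illustrative threshold for acceptable legal risk

def legal_risk(action_description: str) -> float:
    """Placeholder for an AI legal-reasoning subsystem that estimates the
    probability that the proposed action would violate an applicable law."""
    return 0.0  # stand-in value; a real system would reason about the action

def execute(action):
    """Placeholder for the agent's tool-use layer."""
    return f"executed: {action}"

def law_following_step(action, description: str) -> dict:
    """Refuse actions whose estimated legal risk exceeds the tolerance, and
    surface the refusal to the principal rather than silently proceeding."""
    risk = legal_risk(description)
    if risk > RISK_TOLERANCE:
        return {"status": "refused",
                "reason": f"estimated legal risk {risk:.2f} exceeds tolerance"}
    return {"status": "done", "result": execute(action)}
```

The architectural point is that the refusal happens inside the agent itself, before any harm occurs, which is precisely what distinguishes design-based constraints from ex post sanctions.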
B. Theoretical Motivations
1. Law-Following in Principal-Agent Relationships
As discussed above,[ref 235] AI agents can be fruitfully analyzed through principal–agent principles. While we do not advocate the wholesale legal application of agency law to AI agents, reference to agency law principles can help illuminate the significance and potential of LFAI.[ref 236]
Under hornbook agency principles, an AI agent should generally “act loyally for the principal’s benefit in all matters connected with the agency relationship.”[ref 237] This generally includes a duty to obey instructions from the principal.[ref 238]
Crucially, however, this general duty of obedience is qualified by a higher-order duty to follow the law. Agents only have a duty to obey lawful instructions.[ref 239] Thus, “[a]n agent has no duty to comply with instructions that may subject the agent to criminal, civil, or administrative sanctions or that exceed the legal limits on the principal’s right to direct action taken by the agent.”[ref 240] “A contract provision in which an agent promises to perform an unlawful act is unenforceable.”[ref 241] Agents cannot escape personal liability for their unlawful acts on the basis of orders from their principal.[ref 242]
The basic assumption that underlies these various doctrines is that an agent lacks any independent power to perform unlawful acts.[ref 243] The law of agency was therefore created under the assumption that agents maintain an independent obligation to follow the law, and thus remain accountable for their violations of law. This assumption shaped agency law so as to prevent principals from unjustly benefitting by externalizing harms produced as a byproduct of the agency relationship.[ref 244] This feature of agency law both establishes a baseline against which we can compare a world of AI agents lacking law-following constraints, and provides a normative justification for requiring AI agents to prioritize legal compliance over obedience to their principals.
2. Law-Following in the Design of Artificial Legal Actors
AI agents will of course not be the first artificial actors that humanity has created. Two types of powerful artificial actors—corporations and governments[ref 245]—profoundly impact our lives. When deciding how the law should respond to AI agents, it may make sense to draw lessons from the law’s response to the invention of other artificial legal actors.
A key lesson for AI agents is this: for both corporations and governments, the law does not rely solely on ex post liability to steer the actor’s behavior; it requires the actor to be law-following by design, at least to some extent. A disposition toward compliance is built into the very “architecture” of these artificial actors. AI agents may become no less important than corporations and governments in the aggregate, not least because they will be thoroughly integrated into them. Just as the law requires these other actors to be law-following by design, it should require AI agents to be LFAIs.
a. Corporations as Law-Following by Design
The law requires corporations to be law-following by design. One way it does this is by regulating the very legal instruments that bring corporations into existence: corporate charters are only granted for lawful purposes.[ref 246] While an “extreme” remedy,[ref 247] courts can order corporations to be dissolved if they repeatedly engage in illegal conduct.[ref 248] Failure to comply with legally required corporate formalities can also be grounds for involuntarily dissolving a corporate entity[ref 249] or piercing the corporate veil.[ref 250] Thus, while corporations are, as legal persons, generally obligated to obey the law, states do not only rely on external sanctions to persuade them to do so: they force corporations to be law-following in part through architectural measures, including dissolving[ref 251] corporations that break the law or refusing to incorporate those that would.
The law also forces corporations to be law-following by regulating the human agents that act on their behalf, as a matter of their fiduciary duties. Directors who intentionally cause a corporation to violate positive law breach their duty of good faith.[ref 252] Not only are corporate fiduciaries required to follow the law themselves, but they are also required to monitor for violations of law by other corporate agents.[ref 253] Moreover, human agents that violate certain laws can be disqualified from serving as corporate agents.[ref 254] These sorts of “structural” duties and remedies[ref 255] are thus aimed at causing the corporation to follow the law generally and pervasively, rather than merely penalizing violations as they occur.[ref 256] That is entirely sensible, since the state has an obvious interest in preventing the creation of new artificial entities that then go on to disregard its laws, especially since it cannot easily monitor many corporate activities. Whether a powerful and potentially difficult-to-monitor AI agent is generally disposed toward lawfulness will be similarly important. There is a parallel case, therefore, for requiring the principals of AI agents to demonstrate that their agents will be law-following.[ref 257]
b. Governments as Law-Following by Design
“Constitutionalism is the idea . . . that government can and should be legally limited in its powers, and that its authority or legitimacy depends on its observing these limitations.”[ref 258] While we sometimes rely on ex post liability to deter harmful behavior by government actors,[ref 259] the design of the government—through the Constitution,[ref 260] statutory provisions, and longstanding practice—is the primary safeguard against lawless government action.
Examples abound. The general American constitutional design of separated powers, supported by interbranch checks and balances, plays an important role in preventing the government from exercising arbitrary power, thereby confining the government to its constitutionally delimited role.[ref 261] This system of multiple, independent veto points yields concrete protections for personal liberty, such as by making it difficult for the government to lawlessly imprison people.[ref 262]
Governments, like corporations, act only through their human agents.[ref 263] As in the corporate case, governmental design forces the government to follow the law in part by imposing law-following duties on the agents through whom it acts. The Constitution imposes a duty on the President to “take Care that the Laws be faithfully executed.”[ref 264] As discussed above, soldiers have a duty to disobey some unlawful orders, even from the Commander in Chief.[ref 265] Civil servants also have a right to refuse to follow unlawful orders, though the exact nature and extent of this right are unclear.[ref 266]
We saw above that, in the corporate case, the law uses disqualification of law-breaking agents to ensure that corporations are law-following.[ref 267] The law also uses disqualification to ensure that the government acts only through law-following agents, ranging from the highest levels of government to lower-level bureaucrats and employees. The Constitution empowers Congress to remove and disqualify officers of the United States for “high Crimes and Misdemeanors” through the impeachment process.[ref 268] Each house of Congress may expel its own members for “disorderly Behaviour.”[ref 269] This power “has historically involved either disloyalty to the United States Government, or the violation of a criminal law involving the abuse of one’s official position, such as bribery.”[ref 270] While there is no blanket rule disqualifying persons with criminal records from federal government jobs,[ref 271] numerous laws disqualify convicted individuals in more specific circumstances.[ref 272] Convicted felons are also generally ineligible to be employed by the Federal Bureau of Investigation[ref 273] or the armed forces,[ref 274] and usually cannot obtain a security clearance.[ref 275]
These design choices encode a commonsense judgment that those who cannot be trusted to follow the law should not be entrusted to wield the extraordinary power that accompanies certain government jobs, especially unelected positions associated with law enforcement, the military, and the intelligence community. If AI agents were to wield similar power and influence, the case for requiring them to be law-following by design would be similarly strong.
3. The Holmesian Bad Man and the Internal Point of View
Our distinction between AI henchmen and LFAIs mirrors a distinction in jurisprudence about possible attitudes toward legal obligations.[ref 276] An AI henchman treats legal obligations much as the “bad man” does in Oliver Wendell Holmes Jr.’s classic The Path of the Law:
If you want to know the law and nothing else you must look at it as a bad man, who cares only for the material consequences which such knowledge enables him to predict, not as a good one, who finds his reasons for conduct, whether inside the law or outside of it, in the vaguer sanctions of conscience.[ref 277]
That is, under some interpretations,[ref 278] Holmes’ bad man treats the law merely as a set of incentives within which he pursues his own self-interest.[ref 279] Like the bad man, an AI henchman would care about the law, but only insofar as the law enables it to predict how state power is likely to be wielded against its principal’s interests.[ref 280] Like the bad man,[ref 281] if the AI henchman predicts that the expected harms of violating the law are less than the expected benefits, it will violate the law. It will not follow the law for any other reason.
Fortunately, the bad man is not the only possible model for AI agents’ attitudes toward the law. One alternative to the bad man view of the law is H.L.A. Hart’s “internal point of view.”[ref 282] “The internal point of view is the practical attitude of rule acceptance—it does not imply that people who accept the rules accept their moral legitimacy, only that they are disposed to guide and evaluate conduct in accordance with the rules.”[ref 283] Whether AIs can have the mental states necessary to truly take the internal point of view is of course contested.[ref 284] But regardless of their mental state (if any), AI agents can be designed to act similarly to someone who thinks that “the law is not simply sanction-threatening, -directing, or -predicting, but rather obligation-imposing,”[ref 285] and is thus disposed to “act[] according to the dictates of the [law].”[ref 286] An AI agent can be designed to be more rigorously law-following than the bad man.[ref 287]
Real life is of course filled with people who are “bad” or highly imperfect. But bad AI agents are not similarly inevitable. AI agents are human-designed artifacts. It is open to us to design their behavioral dispositions to suit our policy goals, and to refuse to deploy agents that do not meet those goals.
C. Concrete Benefits
1. Law-Following AI Prevents Abuses of Government Power
As we have discussed,[ref 288] the law makes the government follow the law (and thus prevents abuses of government power) in part by compelling government agents to follow the law. If the government comes to rely heavily on AI agents for cognitive labor, then the law should also require those agents to follow the law.
Depending on their assigned “roles,” government AI agents could wield significant power. They may have authority to initiate legal processes against individuals (including subpoenas, warrants, indictments, and civil actions), access sensitive governmental information (including tax records and intelligence), hack into protected computer systems, determine eligibility for government benefits, operate remote-controlled vehicles like military drones,[ref 289] and even issue commands to human soldiers or law enforcement officials.
These powers present significant opportunities for abuse, which is why preventing lawless government action was a motivation for the American Revolution,[ref 290] a primary goal of the Constitution, and a foundational American political value. We must therefore carefully examine whether existing safeguards designed to constrain human government agents would effectively limit AI agents in the absence of the law-following design constraints. While our analysis here is necessarily incomplete, we think it provides some reason for doubting the adequacy of existing safeguards in the world of AI agents.
When a human government agent, acting in her official capacity, violates an individual’s rights, she can face a variety of ex post consequences. If the violation is criminal, she could face severe penalties.[ref 291] This “threat of criminal sanction for subordinates [i]s a very powerful check on executive branch officials.”[ref 292] The threat of civil suits seeking damages, such as through Section 1983[ref 293] or a Bivens action,[ref 294] might also deter her, though various immunities and indemnities will often protect her,[ref 295] especially if she is a federal officer.[ref 296]
These checks will not exist in the case of AI henchmen. In the absence of law-following constraints, an AI henchman’s primary reason to obey the law will be its desire to keep its principal out of trouble.[ref 297] The henchman will thus lack one of the most powerful constraints on lawless behavior in humans: fear of personal ex post liability.
Most of us would rightfully be terrified of a government staffed by agents whose only concern about legality was whether their bosses would suffer negative consequences for a violation: a government staffed by Holmesian bad men loyal only to their principals.[ref 298] A basic premise of American constitutionalism[ref 299] and rule of law principles more generally[ref 300] is that government officials act legitimately only when they act pursuant to powers granted to them by the People through law, and obey the constraints attached to those powers. Treating law as a mere incentive system is repugnant to the proper role of government agents:[ref 301] being a “servant” of the People[ref 302] “faithfully discharg[ing] the duties of [one’s] office.”[ref 303]
This is not just a matter of high-minded political and constitutional theory. An elected head of state aspiring to become a dictator would need the cooperation of the sources of hard power in society—military, police, other security forces, and government bureaucracy—to seize power. At present, however, these organs of government are staffed by individuals, who may choose not to go along with the aspiring dictator’s plot.[ref 304] Furthermore, in an economy dependent on diffuse economic activity, resistance by individual workers could reduce the economic upsides from a coup.[ref 305] This reliance on a diverse and imperfectly loyal human workforce, both within and outside of government, is a significant safeguard against tyranny.[ref 306] However, replacement of human workers with loyal AI henchmen would seriously weaken this safeguard, possibly easing the aspiring tyrant’s path to power.[ref 307]
Nor is the importance of LFAI limited to AI agents acting directly at the request of high-level officials. It extends to the vast array of lower-level state and federal officials who wield enormous power over ordinary citizens, including particularly powerless ones. Take prisons, which “can often seem like lawless spaces, sites of astonishing brutality where legal rules are irrelevant.”[ref 308] Prison law arguably constrains official abuse far less than it should. Nevertheless, “prisons are intensely legal institutions,” and “people inside prisons have repeatedly emphasize[d] that legal rules have significant, concrete effects on their lives.”[ref 309] Even imperfect enforcement of the legal constraints on prison officials can have demonstrable effects.[ref 310] However bad the existing situation may be, diluting or gutting the efficacy of these constraints threatens to make the situation dramatically worse.
The substitution of AI agents for (certain) prison officials could have precisely this effect. Here is just one example. The Eighth Amendment forbids prison officials from withholding medical treatment from prisoners in a manner that is deliberately indifferent to their serious medical needs.[ref 311] Suppose that a state prisoner needs to take a dose of medicine each day for a month, or his eyesight will be permanently damaged. The prisoner says something disrespectful to a guard. The warden wishes to make an example of the prisoner, so she fabricates a note from the prison physician directing the prison pharmacist to withhold further doses of the medicine. The prisoner is denied the medicine. He tries to reach his lawyer to get a temporary restraining order, but the lawyer cannot return his call until the next day. As a result, the prisoner’s eyesight is permanently damaged.
Let us assume that the state has strong state-level sovereign immunity under its own laws, meaning that the prisoner cannot sue the state directly.[ref 312] Under the status quo, the prisoner can still sue the warden for damages under 42 U.S.C. § 1983, for violating his clearly established constitutional right.[ref 313] Given the widespread prevalence of official indemnification agreements at the state level,[ref 314] the state will likely indemnify the warden, even though the state itself cannot be sued for damages under Section 1983[ref 315] or its own laws. The prisoner is therefore likely to receive monetary damages.
But now replace the human warden with an AI agent charged with administering the prison by issuing orders directly to prison personnel through some digital interface. If this “AI warden” did the same thing, the prisoner would have no direct redress against it, since it is not a “person” under Section 1983[ref 316] (or, indeed, any law). Nor will the prisoner have indirect recourse against the state, by way of an indemnification agreement, because there is no underlying tort liability for the state to indemnify. Nor will the prisoner have redress against the medical personnel, since the AI warden deceived them into withholding treatment.[ref 317] And we have already assumed that the state itself has sovereign immunity. Thus, the prisoner will find himself without any avenue of redress for the wrong he has suffered: the introduction of an artificial agent in the place of a human official made all the difference.
What is the right response to these problems? Many responses may be called for, but one of them is to ensure that only law-following AI agents can serve in such a role. As previously discussed, the law disqualifies certain lawbreakers from many government jobs. Similarly, we believe, the law should disqualify AI agents that are not demonstrably rigorously law-following from certain government roles. We discuss how this disqualification might be enforced, more concretely, in Part V.
There is, however, another possible response to these challenges: perhaps we should “just say no” and prohibit governments from using AI agents at all, or at least severely curtail their use.[ref 318] We do not here take a strong position on when this would be the correct approach all-things-considered. At a minimum, however, we note a few reasons for skepticism of such a restrictive approach.
The first is banal: if AI agents can perform computer-based tasks well, then their adoption by the government could also deliver considerable benefits to citizens.[ref 319] Reducing the efficiency of government administration for the sake of preventing tyranny and abuse may be worth it in some cases, and is indeed the logic of the individual rights protections of the Constitution.[ref 320] But tailoring a safeguard to allow for efficient government administration is, all else equal, preferable to a blunter, more restrictive safeguard. LFAI may offer such a tailored safeguard.
The second reason is that adoption of AI agents by governments may become more important as AI technology advances. Some of the most promising AI safety proposals involve using trusted AI systems to monitor untrusted ones.[ref 321] The central reason is this: as AI systems become more capable, unassisted humans will not be able to reliably evaluate whether the AIs’ actions are desirable.[ref 322] Assistance from trusted AI systems could thus be the primary way to scale humans’ ability to oversee untrusted AI systems. Thus, if the government is to oversee the behavior of new and untrusted private-sector AI systems so as to ensure their safety, it may need to employ AI agents to assist it.
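To make this monitoring approach concrete, the sketch below illustrates, in simplified Python, the kind of oversight loop the AI control literature describes: an untrusted agent proposes an action, a trusted model assigns the proposal a suspicion score, and sufficiently suspicious proposals are escalated to a human overseer rather than executed. The agent, monitor, and 0.8 threshold are hypothetical placeholders, not components of any published protocol.

```python
# Illustrative sketch of trusted-AI oversight of an untrusted agent.
# The agent, monitor, and threshold are hypothetical placeholders.

def run_with_oversight(task, untrusted_agent, trusted_monitor,
                       human_review, suspicion_threshold=0.8):
    """Execute an untrusted agent's proposal only after monitoring."""
    proposal = untrusted_agent.propose_action(task)

    # The trusted model scores the proposal's suspiciousness in [0, 1].
    suspicion = trusted_monitor.score(task, proposal)

    if suspicion >= suspicion_threshold:
        # High-suspicion proposals go to a human overseer instead of
        # being executed automatically.
        return human_review(task, proposal)

    return proposal.execute()
```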
Even if the government does not need to rely on AI agents to administer AI safety regulation (for example, because such AI overseers are employed by private companies, not the government), the government will likely need to employ AI agents to help it keep up with competitive pressures. Even if the federal government hesitates to adopt AI agents to increase its efficiency, foreign competitors might show no such qualms. If so, the federal government might then feel it has little choice but to do the same.
In the face of these competing demands, LFAI offers a plausible path to enable the adoption of AI agents in governmental domains with a high potential for abuse (e.g., the military, intelligence, law enforcement, prison administration) while safeguarding life, liberty, and the rule of law. LFAI can also transform the binary question of whether to adopt AI agents into the more multidimensional question of which laws should constrain them.[ref 323] This should allow for more nuanced policymaking, grounded in the existing legal duties of government agents.
2. Law-Following AI Enables Scalable Enforcement of Public Law
AI agents could cause a wide variety of harms. The state promulgates and enforces public law prohibitions—both civil and criminal—to prevent and remedy many of these harms. If the state cannot safely assume that AI agents will reliably follow these prohibitions, the state might need to increase the resources dedicated to law enforcement.
LFAI offers a way out of this bind. Insofar as AI agents are reliably law-following, the state can trust that significantly less law enforcement is needed.[ref 324] This dynamic would also have broader beneficial implications for the structure and functioning of government. “If men were angels, no government would be necessary.”[ref 325] LFAIs would not be angels,[ref 326] but they would be a bit more angelic than many humans. Thus, as a corollary of Publius’ insight, we may need less government to oversee LFAIs’ behavior than we would need for a human population of equivalent size. State resources that would otherwise be spent on investigating and enforcing the laws against AI agents could thus be redirected to other problems or refunded to the citizenry.
LFAI would also curtail some of the undesirable side effects and opportunities for abuse inherent in law enforcement. Law enforcement efforts inherently involve some intrusion into the private affairs and personal freedoms of citizens.[ref 327] If the government could be more confident that AI agents were behaving lawfully, it would have less cause to surveil or investigate their behavior, and thereby impose fewer[ref 328] burdens on citizens’ privacy. Reducing the occasion for investigations and searches would also create fewer opportunities for abuse of private information.[ref 329] In this way, ensuring reliably law-following AI might significantly mitigate the frequency and severity of law enforcement’s intrusions on citizens’ privacy and liberty.
IV. Law-Following AI as AI Alignment
The field of AI alignment aims to ensure that powerful, general-purpose AI agents behave in accordance with some set of normative constraints.[ref 330] AI systems that do not behave in accordance with such constraints are said to be “misaligned” or “unaligned.” Since the law is a set of normative constraints, the field of AI alignment is highly relevant to LFAI.[ref 331]
The most basic set of normative constraints to which an AI could be aligned is the “informally specified”[ref 332] intent of its principal.[ref 333] This is called “intent-alignment.”[ref 334] Since individuals’ intentions are a mix of morally good and bad to varying degrees, some alignment work also aims to ensure that AI systems behave in accordance with moral constraints, regardless of the intentions of the principal.[ref 335] This is called “value-alignment.”[ref 336]
AI alignment work is valuable because, as shown by theoretical arguments[ref 337] and empirical observations,[ref 338] it is difficult to design AI systems that reliably obey any particular set of constraints provided by humans.[ref 339] In other words, nobody knows how to ensure that AI systems are either intent-aligned or value-aligned,[ref 340] especially for smarter-than-human systems.[ref 341] This is the Alignment Problem.[ref 342] The Alignment Problem is especially worrying for AI systems that are agentic and goal-directed,[ref 343] as such systems may wish to evade human oversight and controls that could frustrate pursuit of those goals, such as by deceiving their developers,[ref 344] accumulating power and resources[ref 345] (including by making themselves smarter),[ref 346] and ultimately resisting efforts to correct their behavior or halt further actions.[ref 347]
There is a sizable literature arguing that these dynamics imply that misaligned AI agents pose a nontrivial risk to the continued survival of humanity.[ref 348] The case for LFAI, however, in no way depends on the correctness of these concerns: the specter of widespread lawless AI action should be sufficient on its own to motivate LFAI. Nevertheless, the alignment literature produces several valuable insights for the pursuit of LFAI.
A. AI Agents Will Not Follow the Law by Default
The alignment literature suggests that there is a significant risk that AI agents will not be law-following by default. This is a straightforward implication of the Alignment Problem. To see how, imagine a morally upright principal who intends for his AI agent to rigorously follow the law. If the AI agent were intent-aligned, it would therefore follow the law. But because intent-alignment is an unsolved problem, there is a significant chance that the agent would not be aligned with the principal’s intentions, and would therefore be liable to violate the law. Put differently, unaligned AIs may not be controllable,[ref 349] and uncontrollable AIs may break the law. Thus, as long as intent-alignment remains an unsolved technical problem, there will be a significant risk that AI agents will be prone to lawbreaking behavior.
To be clear, the main reason that there is a significant risk that AI agents will not be law-following by default is not that people will not try to align AI agents to law (although that is also a risk).[ref 350] Rather, the main risk is that current state-of-the-art alignment techniques do not provide a strong guarantee that advanced AI agents will be aligned, even when they are trained with those techniques. There is a clear empirical basis for this claim, which is that those alignment techniques frequently fail in current frontier models.[ref 351] There are also theoretical limitations to existing techniques for smarter-than-human systems.[ref 352]
A related implication of the alignment literature is that even intent-aligned AI agents may not follow the law by default. Again, we can see this by hypothesizing an intent-aligned AI agent and a human principal who wants the AI agent to act as her henchman. Since an intent-aligned AI agent follows the intent of its principal, this intent-aligned agent would act as a henchman, and thus act lawlessly when doing so serves the principal’s interests.[ref 353] In typical alignment language, intent-alignment still leaves open the possibility that principals will misuse their intent-aligned AI.[ref 354]
None of this is to imply that intent-alignment is undesirable. Solving intent-alignment is the primary focus of the alignment research community[ref 355] because it would ensure that AI agents remain controllable by human principals.[ref 356] Intent-alignment is also generally assumed to be easier than value-alignment.[ref 357] And if principals want their AI agents to follow the law, or behave ethically more broadly, then intent-alignment will produce law-following or ethical behavior. But in a world where principals will range from angels to devils, alignment researchers acknowledge that intent-alignment alone is insufficient to guarantee that AI agents act lawfully, or produce good effects in the world.[ref 358] This brings us to the next important set of implications from the alignment literature.
B. Law-Alignment is More Legitimate than Value-Alignment
LFAIs are generally intent-aligned—they are still loyal to their principals—but are also subject to a side-constraint that they will follow the law while advancing the interests of their principals. Extending the typical alignment terminology, we can call this side-constraint law-alignment.[ref 359]
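To fix ideas, law-alignment can be pictured as a wrapper around an otherwise intent-aligned agent: the agent plans in service of its principal’s instructions, but each candidate action is screened against a designated set of laws before execution. The following is a minimal sketch under that assumption; the `legal_screen` component and its interface are invented for illustration and do not describe any existing system.

```python
# Minimal sketch of law-alignment as a side-constraint on an
# intent-aligned agent. All components are hypothetical.

class LawFollowingAgent:
    def __init__(self, base_agent, legal_screen, designated_laws):
        self.base_agent = base_agent            # intent-aligned planner
        self.legal_screen = legal_screen        # predicts legality of actions
        self.designated_laws = designated_laws  # laws the agent must follow

    def act(self, instruction):
        action = self.base_agent.plan(instruction)

        # Side-constraint: refuse actions predicted to violate a
        # designated law, even when the principal requested them.
        verdict = self.legal_screen.evaluate(action, self.designated_laws)
        if verdict.predicted_violation:
            return f"Refusing: the proposed action may violate {verdict.law_id}."

        return action.execute()
```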
But the law is not the only side-constraint that can be imposed on intent-aligned AIs. As alluded to above, another possible model is value-alignment. Value-aligned AI agents act in accordance with the wishes of their principals, but are subject to ethical side-constraints, usually imposed by the model developer.
However, value-alignment can be controversial when it causes AI models to override the lawful requests of users. Perhaps the most well-known example of this is the controversy around Google’s Gemini image-generation AI in early 2024. In an attempt to increase the diversity of its generated images,[ref 360] Gemini ended up failing in clear ways, such as portraying “1943 German soldiers” as racially diverse, or refusing to generate pictures of a “white couple” while doing so for couples of other races.[ref 361]
This incident led to widespread concern that the values exhibited by generative AI products were biased towards the predominantly liberal views of these companies’ employees.[ref 362] This concern has been vindicated by empirical research consistently finding that the espoused political views of these AIs indeed most closely resemble those of the center-left.[ref 363] Critics from further left have also frequently raised similar concerns about demographic and ideological biases in AI systems.[ref 364]
Some critics concluded from the Gemini incident that alignment work writ large has become a Trojan Horse for covertly pushing the future of AI in a leftward direction.[ref 365] Those who disagree with progressive political values will naturally find this concerning, given the importance that AI might have in the future of human communication[ref 366] and the highly centralized nature of large-scale AI development and deployment.[ref 367]
In a pluralistic society, it is inevitable and understandable that, when a sociotechnical system reflects the values of one faction, competing factions will criticize it. But alignment, as such, is not the right target of such criticisms. Intent-alignment is value-neutral, concerning itself only with the extent to which an AI agent obeys its principal.[ref 368] Reassuringly for those concerned with ideological bias in AI systems, intent-alignment is also the primary focus of the alignment community, since solving intent-alignment is necessary to reliably control AI systems at all.[ref 369] A large majority of Americans from all political backgrounds agree that AI technologies need oversight,[ref 370] and overseeing unaligned systems is much more difficult than overseeing aligned ones. Indeed, even the critics of alignment work tend to assume—contrary to the views of many alignment researchers—that AI agents will be easy to control,[ref 371] and presumably view this result as desirable.
Furthermore, some amount of alignment is also necessary to make useful AI products and services. Consumers, reasonably, want to use AI technologies that they can reliably control. Today’s leading chatbots—like Claude and ChatGPT—are only helpful to users due to the application of alignment techniques like RLHF[ref 372] and Constitutional AI.[ref 373] AI developers also use alignment techniques to instill uncontroversial (and user-friendly) behaviors into their AI systems, such as honesty.[ref 374] AI companies are also already using alignment techniques to prevent their AI systems from taking actions that could cause them or their customers to incur unnecessary legal liability.[ref 375] In short, then, some degree of alignment work is necessary to make AI products useful in the first place.[ref 376] To adopt a blanket stance against alignment because of the Gemini incident is thus not only unjustified,[ref 377] but also likely to undermine American leadership in AI.
Nevertheless, it is reasonable for critics to worry about and contest the frameworks by which potentially controversial values are instilled into AI systems. AI developers are indeed a “very narrow slice of the global population.”[ref 378] This is something that should give anyone, regardless of political persuasion, pause.[ref 379] But intent-alignment is not enough, either: it is inadequate to prevent a wide variety of harms that the state has an interest in preventing.[ref 380] So we need a form of alignment that is more normatively constraining than intent-alignment alone, but more legitimate, and more appropriate for our pluralistic society, than alignment to values that AI developers choose themselves.
Law-alignment fits these criteria.[ref 381] While the moral legitimacy of the law is not perfect, in a republic it nevertheless has the greatest legitimacy of any single source or repository of values.[ref 382] Indeed, “the framers [of the U.S. Constitution] insisted on a legislature composed of different bodies subject to different electorates as a means of ensuring that any new law would have to secure the approval of a supermajority of the people’s representatives,”[ref 383] thus ensuring that new laws are “the product of widespread social consensus.”[ref 384] In our constitutional system of government, laws are also subject to checks and balances that protect fundamental rights and liberties, such as judicial review for constitutionality and interpretation by an independent judiciary.
Aligning to law also has procedural virtues over value-alignment. First, there is widespread agreement on the authoritative sources of law (e.g., the Constitution, statutes, regulations, case law), much more so than for ethics. Relatedly, legal rules tend to be expressed much more clearly than ethical maxims. Although there is of course considerable disagreement about the content of law and the proper forms of legal reasoning, it is nevertheless much easier and less controversial to evaluate the validity of legal propositions and arguments than to evaluate the quality or correctness of ethical reasoning.[ref 385] Moreover, when there is disagreement or unclarity, the law contains established processes for authoritatively resolving disputes over the applicability and meaning of laws.[ref 386] Ethics contains no such system.
We therefore suggest that law-alignment, not value-alignment, should be the primary focus when something beyond intent-alignment is needed.[ref 387] Our claim, to be clear, is not that law-alignment alone will always prove satisfactory, or that it should be the sole constraint on AI systems beyond intent-alignment, or that AI agents should not engage in moral reasoning of their own.[ref 388] Rather, we simply argue that more practical and theoretical alignment research should be aimed at building AI systems aligned to law.
V. Implementing and Enforcing Law-Following AI
We have argued that AI agents should be designed to follow the law. We now turn to the question of how public policy can support this goal. Our investigation here is necessarily preliminary; our aim is principally to spur future research.
A. Possible Duties Across the AI Agent Lifecycle
As an initial matter, we note that a duty to ensure that AI agents are law-following could be imposed at several stages of the AI lifecycle.[ref 389] The law might impose duties on persons:
- Developing AI agents;
- Possessing[ref 390] AI agents;
- Deploying[ref 391] AI agents;[ref 392] or
- Using AI agents.
After deciding which of these activities ought to be regulated, policymakers must then decide what, exactly, persons engaging in that activity are obligated to do. While the possibilities are too varied to exhaust here, some basic options might include commands like:
- “Any person developing an AI agent has a duty to take reasonable care to ensure that such AI agent is law-following.”
- “It is a violation to knowingly possess an AI agent that is not law-following, except under the following circumstances: . . . .”
- “Any person who deploys an AI agent is strictly liable if such AI agent is not law-following.”
- “A person who knowingly uses an AI agent that is not law-following is liable.”
Basic duties of this sort would comprise the foundational building blocks of LFAI policy. Policymakers must then choose whether to enforce these obligations ex post (that is, after an AI henchman takes an illegal action)[ref 393] or ex ante. These two choices are interrelated: as we will explore below, it may make more sense to impose ex ante liability for some activities and ex post liability for others. For example, ex ante regulation might make more sense for AI developers than civilian AI users, because the former are far more concentrated, and can absorb ex ante compliance costs more easily.[ref 394] And of course, ex ante and ex post regulation are not mutually exclusive:[ref 395] driving, for example, is regulated by a combination of ex ante policies (e.g., licensing requirements) and ex post policies (e.g., tort liability).
B. Ex Post Policies
We begin our discussion with ex post policies. Many scholars believe that ex post policies are generally preferable to ex ante policies.[ref 396] While we think that ex post policies could have an important role to play in implementing LFAI, we also suspect that they will be inadequate in certain contexts.
Enforcing duties through ex post liability rules is, of course, familiar in both common law[ref 397] and regulation.[ref 398] In the LFAI context, ex post policies would impose liability on an actor after an AI henchman over which they had some form of control violates an applicable legal duty. More and less aggressive ex post approaches are conceivable. On the less aggressive end of the spectrum, development, possession, deployment, or use of an AI henchman might be considered a breach of the tort duty of reasonable care, rendering the human actor liable for resulting injuries.[ref 399] To some extent, this may already be the case under existing tort law.[ref 400] The law might also consider extending the negligence liability of an AI developer or deployer to harms that would not typically be compensable under traditional tort principles (because, for example, they would count as pure economic loss),[ref 401] if those harms are produced by their AI agents acting in criminal or otherwise unlawful ways.[ref 402]
Other innovations in tort law may also be warranted. Several scholars have argued, for example, that the principal of an AI agent should sometimes be held strictly liable for the “torts” of that agent, under a respondeat superior theory.[ref 403] In some cases, such as when a developer has recklessly failed to ensure that its AI agent is law-following by design, punitive damages might be appropriate as well.
Moving beyond tort law, in some cases it may make sense to impose civil sanctions[ref 404] when an AI henchman violates an applicable legal duty, even if no harm results. A legislature might also create a statutory liability regime under which the developers of AI agents are liable if those AI agents (a) are not law-following, (b) violate an applicable legal duty, and (c) thereby cause harm.[ref 405]
In order to sufficiently disincentivize the deployment of lawless AI agents in high-stakes contexts, a legislature might also vary applicable immunity rules. For example, Congress could create a distinct cause of action against the federal government for individuals harmed by AI henchmen under the control of the federal government, taking care to remove barriers that various immunity rules pose to analogous suits against human agents.[ref 406]
These and other imaginable ex post policies are important arrows in the regulatory quiver, and we suspect they will have an important role to play in advancing LFAI. Nevertheless, we would resist any suggestion that ex post sanctions are sufficient to deal with the specter of lawless AI agents.
Our reasons are multiple. In many contexts, detecting lawless behavior once an AI agent has been deployed will be difficult or costly—especially as these systems become more sophisticated and more capable of deceptive behavior.[ref 407] Proving causation may also be difficult.[ref 408] In the case of corporate actors, meanwhile, the efficacy of such sanctions may be seriously blunted by judgment-proofing and similar phenomena.[ref 409] And, most importantly for our purposes, various immunities and indemnities make tort suits against the government or its officials a weak incentive.[ref 410] These considerations suggest that it would be unwise to rely on ex post policies as our principal means for ensuring that AI agents follow the law when the risks from lawless action are particularly high.
C. Ex Ante Policies
Accordingly, we propose that, in some high-stakes contexts, the law should take a more proactive approach, by preventing the deployment of AI henchmen ab initio. This would likely require first establishing a technical means for evaluating whether an AI agent is sufficiently law-following,[ref 411] then requiring that any agents be so evaluated prior to deployment, with permission to deploy the agent being conditional on achieving some minimal score during that evaluation process.[ref 412]
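The sketch below illustrates what such a pre-deployment gate might look like in code, assuming the existence of a law-following benchmark that returns a compliance score for each evaluated category of law. Both the benchmark interface and the 0.95 passing score are hypothetical stand-ins for whatever standard a regulator ultimately sets.

```python
# Hypothetical pre-deployment gate conditioning deployment on a minimum
# law-following evaluation score. The benchmark and the passing score
# are assumptions, not an existing standard.

MINIMUM_SCORE = 0.95  # illustrative regulator-set passing score

def may_deploy(agent, law_following_benchmark):
    """Return True only if the agent clears the threshold in every category."""
    # Assumed to return a mapping of law category -> compliance score.
    results = law_following_benchmark.evaluate(agent)

    return all(score >= MINIMUM_SCORE for score in results.values())
```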
We are most enthusiastic about imposing such requirements prior to the deployment of AI agents in government roles where lawlessness would pose a substantial risk to life, liberty, and the rule of law. We have discussed several such contexts already,[ref 413] but the exact range of contexts is worth carefully considering, and is certainly up for debate.
Ex ante strategies could also be used in the private sector, of course. One often-discussed approach is an FDA-like approval regulation regime wherein private AI developers would need to prove, to the satisfaction of some regulator, that their AI agents are safe prior to their deployment.[ref 414] The pro tanto case for requiring private actors to demonstrate that their AI agents are disposed to follow some basic set of laws is clear: the state has an interest in ensuring that its most fundamental laws are obeyed. But in a world of increasingly sophisticated artificial agents, approval regulation could—if not properly designed and sufficiently tailored—also constitute a serious incursion on innovation[ref 415] and personal liberty.[ref 416] If AI agents will be as powerful as we suspect, strictly limiting their possession could create risks of its own.[ref 417]
Accordingly, it is also worth considering ex ante regulations on private AI developers or deployers that stop short of full approval regulation. For example, the law could require the developers of AI agents to, at a minimum, disclose information[ref 418] about the law-following propensities of their systems, such as which laws (if any) their agents are instructed to follow,[ref 419] and any evaluations of how reliably their agents follow those laws.[ref 420] Similarly, the law could require developers to formulate and assess risk management frameworks that specify the precautionary measures they plan to undertake to ensure that the agents they develop and deploy are sufficiently law-following.[ref 421]
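As a rough illustration of what a minimal disclosure might contain, a developer could publish a structured record along the following lines. The field names and values are invented for purposes of illustration; no existing disclosure standard is being described.

```python
# Hypothetical structure for a developer's law-following disclosure.
# All field names and values are illustrative only.

agent_disclosure = {
    "agent_name": "ExampleAgent-1",  # placeholder identifier
    "instructed_laws": [
        "Selected federal criminal prohibitions (e.g., fraud, computer crime)",
        "Fourth Amendment constraints when acting under color of law",
    ],
    "evaluations": {
        "benchmark": "internal law-following evaluation suite",
        "refusal_rate_on_unlawful_requests": 0.97,
        "false_refusal_rate_on_lawful_requests": 0.04,
    },
    "known_limitations": "Untested on novel or rapidly changing regulatory regimes.",
}
```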
Overall, we are uncertain about what kinds of ex ante requirements are warranted, all things considered, in the case of private actors. To a large extent, the issue cannot be intelligently addressed without more specific proposals. Formulating such proposals is thus an urgent task for the LFAI research agenda, even if it is not, in our view, as urgent as the task of formulating concrete regulations for AI agents acting under color of law.
D. Other Strategies
The law does not police undesirable behavior solely by imposing sanctions. It also specifies mechanisms for nullifying the presumptive legal effect of actions that violate the law or are normatively objectionable. In private law, for example, a contract is voidable by a party if that party’s assent was “induced by either a fraudulent or a material misrepresentation by the other party upon which the [party wa]s justified in relying.”[ref 422] Nullification rules exist in public law, too. One obvious example is the ability of the judiciary to nullify laws that violate the federal Constitution.[ref 423] Or, to take another familiar example, courts applying the Administrative Procedure Act “hold unlawful and set aside” agency actions that are “arbitrary, capricious, an abuse of discretion, or otherwise not in accordance with law.”[ref 424]
Nullification rules may provide a promising legal strategy for policing behavior by AI agents that is unlawful or normatively objectionable. Thus, in private law, if an AI henchman induces a human counterparty to enter into a disadvantageous contract, the resulting contractual obligation could be voidable by the human. In public law, regulatory directives issued by (or substantially traceable to) AI henchmen could be “h[e]ld unlawful and set aside” as “not in accordance with law.”[ref 425] These examples rely on existing nullification rules, but new nullification rules, tailor-made to address new risks from AI agents, might be warranted as well. For example, Congress could stipulate that any official action taken by or substantially traceable to an AI agent is void unless, before deployment, the agent has been shown to be law-following.
Such prophylactic nullification rules are one sort of indirect legal mechanism for enforcing the duty to deploy law-following AIs. Indirect technical mechanisms are well worth considering, too. For example, the government could deploy AI agents that refuse to coordinate or transact with other AI agents unless those counterparty agents are verifiably law-following (for example, by virtue of having “agent IDs”[ref 426] that attest to a minimal standard of performance on law-following benchmarks).
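The counterparty check might work roughly as sketched below: before transacting, a government agent asks the other agent to present an ID credential attesting to its performance on a law-following benchmark and verifies that the credential was issued by a trusted certifier. The credential format, certifier registry, and score threshold are all hypothetical.

```python
# Illustrative handshake in which a government AI agent transacts only
# with counterparties whose agent IDs attest to law-following
# performance. The credential format and certifier are hypothetical.

TRUSTED_CERTIFIERS = {"example-certification-body"}
MIN_LAW_FOLLOWING_SCORE = 0.9  # illustrative threshold

def willing_to_transact(counterparty) -> bool:
    credential = counterparty.present_agent_id()

    if credential is None:
        return False
    if credential.issuer not in TRUSTED_CERTIFIERS:
        return False
    if not credential.signature_valid():
        return False

    # Attested performance on a law-following benchmark.
    return credential.law_following_score >= MIN_LAW_FOLLOWING_SCORE
```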
Similarly, the government could enforce LFAI by regulating the hardware on which AI agents will typically operate. Frontier AI systems “run” on specialized AI chips,[ref 427] which are typically aggregated in large data centers.[ref 428] Collectively, these are referred to as “AI hardware” or simply “compute.”[ref 429] Compared to other inputs to AI development and deployment, AI hardware is particularly governable, given its detectability, excludability, quantifiability, and concentrated supply chain.[ref 430] Accordingly, a number of AI governance proposals advocate for imposing requirements on those making and operating AI hardware in order to regulate the behavior of the AI systems developed and deployed on that hardware.[ref 431]
One class of such proposals is “‘on-chip mechanisms’: secure physical mechanisms built directly into chips or associated hardware that could provide a platform for adaptive governance” of AI systems developed or deployed on those chips.[ref 432] On-chip mechanisms can prevent chips from performing unauthorized computations. One example is iPhone hardware that “enable[s] Apple to exercise editorial control over which specific apps can be installed” on the phone.[ref 433] Analogously, perhaps we could design AI chips that would not support AI agents unless those agents are certified as law-following by some private or governmental certifying body. This could then be combined with other strategies to enforce LFAI mandates: for example, perhaps Congress could require that the government only run AI agents on such chips.
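At the hardware layer, the logic might resemble the load-time check sketched below: before running a model, the chip verifies a certificate signed by the recognized certifying body and confirms that the certificate matches the supplied model weights. This is a conceptual sketch only; no existing chip exposes such an interface, and every function name here is invented.

```python
# Conceptual sketch of an on-chip gate that runs model weights only if
# they carry a valid law-following certificate. No real hardware API is
# described; all names are invented.

def load_model_if_certified(weights, certificate, certifier_public_key):
    # 1. The certificate must be signed by the recognized certifying body.
    if not certificate.verify_signature(certifier_public_key):
        raise PermissionError("Certificate not signed by a recognized certifier.")

    # 2. The certificate must correspond to these exact weights.
    if certificate.weights_hash != weights.hash():
        raise PermissionError("Certificate does not match the supplied weights.")

    # 3. Only then does the hardware allow the model to run.
    return weights.load_onto_accelerator()
```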
Unsurprisingly, designing these sorts of enforcement strategies is as much a task for computer scientists as it is for lawyers. In the decades to come, we suspect that such interdisciplinary legal scholarship will become increasingly important.
VI. A Research Agenda for Law-Following AI
We have laid out the case for LFAI: the requirement that AI agents be designed to rigorously follow some set of laws. We hope that our readers find it compelling. However, our goal with this Article is not just to proffer a compelling idea. If we are correct about the impending risks from lawless AI agents, we may soon need to translate the ideas in this Article into concrete and viable policy proposals.
Given the profound changes that widespread deployment of AI agents will bring, we are under no illusions about our ability to design perfect public policy in advance. Our goal, instead, is to enable the design of “minimally viable LFAI policy”:[ref 434] a policy or set of policies that will prevent some of the worst-case outcomes from lawless AI agents, without completely paralyzing the ability of regulated actors to experiment with AI agents. This minimally viable LFAI policy will surely be flawed in many ways, but with many of the worst-case outcomes prevented, we will hopefully have time as a society to patch remaining issues through the normal judicial and legislative means.
To that end, in this Part we briefly identify some legal questions that would need to be answered to design minimally viable LFAI policies.
1. How should “AI agent” be defined?
Our definition of a “full AI agent” (an AI system “that can do anything a human can do in front of a computer”[ref 435]) is almost certainly too demanding for legal purposes, since an AI agent that can do most but not all computer-based tasks that a human can do would likely still raise most of the issues that LFAI is supposed to address. At the same time, the fact that a wide range of existing AI systems can be regarded as somewhat agentic[ref 436] means that a broad definition of “AI agent” could render relevant regulatory schemes substantially overinclusive. Different definitions are therefore necessary for legal purposes.[ref 437]
2. Which laws should an LFAI be required to follow?
Obedience to some laws is much more important than obedience to other laws. It is much more important that AI agents refrain from murder and (if acting under color of law) follow the Constitution than that they refrain from jaywalking. Indeed, requiring LFAIs to obey literally every law may very well be overly burdensome.[ref 438] In addition, we will likely need new laws to regulate the behavior of AI agents over time.
3. When an applicable law has a mental state element, how can we adjudicate whether an AI agent violated that law?
We discuss this question in Section II.B, above. It is related to the previous question, for there may be conceptual or administrative difficulties in applying certain kinds of mental state requirements to AI agents. For example, in certain contexts, it may be more difficult to determine whether an AI agent was “negligent” than to determine whether it had a relevant “intent.”
4. How should an LFAI decide whether a contemplated action is likely to violate the law?
An LFAI refrains from taking actions that it believes would violate one of the laws that it is required to follow. But of course, it is not always clear what the law requires. Furthermore, we need some way to tell whether an AI agent is making a good faith effort to follow a reasonable interpretation of the law, rather than merely offering a defense or rationalization. How, then, should an LFAI reason about what its legal obligations are?
Perhaps it should just rely on its own considered judgment, on the basis of its first-order reasoning about the content of applicable legal norms. But in certain circumstances, at least, an LFAI’s appraisal of the relevant materials might lead it to radically unorthodox legal conclusions—and a ready disposition to act on such conclusions might significantly threaten the stability of the legal order. In other cases, an LFAI might conclude that it is dealing with a case in which the law is not only “hard” to discern but genuinely indeterminate.[ref 439]
Another intuitively appealing option, therefore, would be to require an LFAI to follow its prediction of what a court would likely decide.[ref 440] This approach has the benefit of tying an LFAI’s legal decision-making to an existing human source of interpretative authority. Courts provide authoritative resolutions to legal disputes when the law is controversial or indeterminate. And in our legal culture, it is widely (if not universally) accepted that “[i]t is emphatically the province and duty of the judicial department to say what the law is,”[ref 441] such that judicial interpretations of the law are entitled to special solicitude by conscientious participants in legal practice, even when they are not bound by a court judgment.[ref 442]
However, a predictive approach would have important practical limitations.[ref 443] Perhaps the most important is the existence of many legal rules that bind the executive branch but are nevertheless “unlikely ever to come before a court in justiciable form.”[ref 444] It would seem difficult for an LFAI to reason about such questions using the prediction theory of law.
Even for those questions that could be decided by a court, using the prediction theory of law raises other important questions. For example, what is the AI agent allowed to assume about its own ability to influence the adjudication of legal questions? We should not want it to be able to consider that it could bribe or intimidate judges or jurors, nor that it could illegally hide evidence from the court, nor that it could commit perjury, nor that it could persuade the President to issue it a pardon.[ref 445] These may be means of swaying the outcome of a case, but they do not seem to bear on whether the conduct would actually be legal.
The issues here are difficult, but perhaps not insurmountable. After all, there are other contexts in which something like these issues arise. Consider federal courts sitting in diversity applying state substantive law. When state court decisions provide inconclusive evidence as to the correct answer under state law, federal courts will make an “Erie guess” about how the state’s highest court would rule on the issue.[ref 446] It would clearly be inappropriate for such courts to, for example, make an Erie guess for reasons like “Justice X in the State Supreme Court, who’s the swing justice, is easily bribed . . . .”[ref 447] If an LFAI’s decision-making should sometimes involve “predicting” how an appropriate court would rule, its predictions should be similarly constrained.
5. In what contexts should the law require that AI agents be law-following?
Should all principals be prohibited from employing non-law-following AI agents? Or should such prohibitions be limited to particular principals, such as government actors?[ref 448] Or perhaps only government actors performing particularly sensitive government functions?[ref 449] In the other direction, should it be illegal to even develop or possess AI henchmen? We discuss various options in Part V, above.
6. How should a requirement that AI agents be law-following be enforced?
We discuss various options in Part V, above. As noted there, we think that reliance on ex post enforcement alone would be unwise at least in the case of AI agents performing particularly sensitive government functions.
7. How rigorously should an LFAI follow the law?
That is, when should an AI agent be capable of taking actions that it predicts may be unlawful? The answer is probably not “never,” at least with respect to some laws. We generally do not expect perfect compliance with every law,[ref 450] especially (but not only) because it can be difficult to predict how a law will apply to a given fact pattern. Furthermore, some amount of disobedience is likely necessary for the evolution of legal systems.[ref 451]
8. Would requiring AI agents controlled by the executive branch to be LFAIs impermissibly intrude on the President’s authority to interpret the law for the executive branch?
The President has the authority to promulgate interpretations of law that are binding on the executive branch (though that power is usually delegated to the Attorney General and then further delegated to the Office of Legal Counsel).[ref 452] Would that authority be incompatible with a law requiring the Executive Branch to deploy LFAIs that would, in certain circumstances, refuse to follow an interpretation of the law promulgated by the President?
9. Does the First Amendment limit the ability of LFAI requirements to prohibit AI agents from advising on lawbreaking activity?
For example, would it be constitutionally permissible to prohibit an LFAI from advising on how to carry out a crime under the theory that such advising would either constitute conspiracy or incitement?
10. How can we design LFAIs and surrounding governance systems to enable the rapid discovery and remediation of loopholes or gaps in the law?
The worry here is that LFAIs, by design, will have strong incentives to discover legal ways to accomplish their goals. This may entail discovering gaps in the law that lawmakers would likely want to correct if they were aware of them, then “exploiting” those gaps before they can be “patched.”[ref 453]
11. How can we design LFAIs and surrounding governance systems to avoid excessive concentration of power?
For example, imagine that a single district court judge could change the interpretation of law as against all LFAIs. As the stakes of AI agent action rise, so will the pressure on the judiciary to wield its power to shape the behavior of LFAIs. Even assuming that all judges will continue to operate in good faith and be well-insulated from illegal or inappropriate attempts to bias their rulings, such a system would amplify any idiosyncratic legal philosophies of individual judges and may enable mistaken rulings to cause more harm than a more decentralized system would.
As an example of how such problems might be avoided, perhaps any disputes about the law governing LFAIs should be resolved in the first instance by a panel of district court judges randomly chosen from around the country. Congress has established a procedure for certain election law cases to be heard by three-judge panels, “in recognition of the fact that ‘such cases were ones of “great public concern” that require an unusual degree of “public acceptance.”’”[ref 454]
12. How can we avoid LFAIs being used for repression by authoritarian governments?
The worry here is that any AI system that rigorously follows the laws in an autocracy may become a potent tool for repression, as it could prevent people from engaging in acts of resistance or serve as a tool for mandatory surveillance and reporting of dissident activity. In other words, LFAI promotes rule of law in a republic, but in an autocracy, it may promote rule by law.
13. How can we design LFAI requirements for governments that nevertheless enable rapid adaptation of AI agents in government?
Perhaps the most significant objection to our proposal that AI agents be demonstrably law-following before their deployment in government is that such a requirement might hurt state capacity by unduly impeding the government’s ability to adopt AI in a sufficiently rapid fashion.[ref 455] We are optimistic that LFAI requirements can be designed to adequately address this concern, but that is, of course, work that remains to be done.
Conclusion
The American political tradition aspires to maintain a legal system that stands as an “impenetrable bulwark”[ref 456] against all threats—public and private, foreign and domestic—to our basic liberties. For all of the inadequacies of the American legal order, ensuring that its basic protections endure and improve over the decades and centuries to come is among our most important collective responsibilities.
Our world of increasingly sophisticated AI agents requires us to reimagine how we discharge this responsibility. Humans will no longer be the sole entities capable of reasoning about and conforming to the law. Humans and human entities are therefore no longer the sole appropriate targets of legal commands. Indeed, at some point, AI agents may overtake humans in their capacity to reason about the law. They may also rival and overtake us in many other competencies, becoming an indispensable cognitive workforce. In the decades to come, our social and economic world may be bifurcated into parallel populations of AI agents collaborating, trading, and sometimes competing with human beings and one another.
The law must evolve to recognize this emerging reality. It must shed its operative assumption that humans are the only proper objects of legal commands. It must expect AI agents to obey the law at least as rigorously as it expects humans to—and expect humans to build AI agents that do so. If we do not transform our legal system to achieve these goals, we risk a political and social order in which our ultimate ruler is not the law,[ref 457] but the person with the largest army of AI henchmen under her control.
The role of compute thresholds for AI governance
Abstract
Advances in artificial intelligence (“AI”) could bring transformative changes in society. AI has the potential for immense opportunities and benefits across a wide range of sectors, from healthcare and drug discovery to public services, and it could broadly improve productivity and living standards. However, more capable AI models also have the potential to cause extreme harm. AI could be misused for more effective disinformation, surveillance, cyberattacks, and development of chemical and biological weapons. More capable models are also likely to possess unexpected dangerous capabilities not yet observed in existing models. Laws can mitigate these risks, but in doing so must identify which models pose the greatest dangers and thus warrant regulatory attention.
This Article discusses the role of training compute thresholds, which use training compute to determine which potentially dangerous models are subject to legal requirements, such as reporting and evaluations. Since the amount of compute used to train a model corresponds to performance, with occasional surprising leaps, a training compute threshold (1) can be used to target the desired level of performance and corresponding risk. Several further properties of compute make it an attractive regulatory target: it is (2) essential for training, (3) objective and quantifiable, (4) capable of being estimated before training, and (5) verifiable after training. Since the amount of compute necessary to train cutting-edge models costs millions of dollars and usually relies on specialized hardware, training compute thresholds also (6) enable regulators to narrowly target potentially dangerous AI systems without burdening small companies, academic institutions, and individual researchers.
However, training compute thresholds are not infallible. Training compute is not an exhaustive measurement of risk; it does not track all risks posed by AI and is not a precise indicator of how harmful a model may be. Technological changes, such as algorithmic innovation, could also significantly reduce how much compute is needed to train an advanced model. For these reasons, a training compute threshold should be treated as a filter and a trigger for further scrutiny, rather than an end in and of itself, and accompanied by a mechanism for updating the threshold.
Indeed, the United States and the European Union (“EU”) have recognized the significance of compute in recent initiatives, which seek to ensure the safe and responsible development of AI in part by establishing training compute thresholds that trigger reporting requirements, capability evaluations, and incident monitoring. Beyond this, courts and regulators could rely on compute as an indicator of how much risk a given AI system poses when determining whether a legal condition or regulatory threshold has been met. Compute may play a role as an indicator of foreseeability of harm under tort law, as a proxy for threat to national or public security in risk assessments, or as a factor in regulatory impact analysis.
Introduction
The idea of establishing a “compute threshold” and, more precisely, a “training compute threshold” has recently attracted significant attention from policymakers and commentators. In recent years, various scholars and AI labs have supported setting such a threshold,[ref 1] as have governments around the world. On October 30, 2023, President Biden’s Executive Order 14,110 on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence introduced the first real-world example of a compute threshold,[ref 2] although it was one of many orders revoked by President Trump upon entering office.[ref 3] The European Parliament and the European Council adopted the Artificial Intelligence Act on June 13, 2024, providing for the establishment of a compute threshold.[ref 4] On February 4, 2024, California State Senator Scott Wiener introduced Senate Bill 1047, which defined frontier AI models with a compute threshold.[ref 5] The bill was approved by the California legislature, but it was ultimately vetoed by the State’s Governor.[ref 6] China may be considering similar measures, as indicated by recent discussions in policy circles.[ref 7] While not perfect, compute thresholds are currently one of the best options available to identify potentially high-risk models and trigger further scrutiny. Yet, in spite of this, information about compute thresholds and their relevance from a policy and legal perspective remains dispersed.
This Article proceeds in two parts. Part I provides a technical overview of compute and how the amount of compute used in training corresponds to model performance and risk. It begins by explaining what compute is and the role compute plays in AI development and deployment. Compute refers to both computational infrastructure, the hardware necessary to develop and deploy an AI system, and the amount of computational power required to train a model, commonly measured in integer or floating-point operations. More compute is used to train notable models each year, and although the cost of compute has decreased, the amount of compute used for training has increased at a higher rate, causing training costs to increase dramatically.[ref 8] This increase in training compute has contributed to improvements in model performance and capabilities, described in part by scaling laws. As models are trained on more data, with more parameters and training compute, they grow more powerful and capable. As advances in AI continue, capabilities may emerge that pose potentially catastrophic risks if not mitigated.[ref 9]
Part II discusses why, in light of this risk, compute thresholds may be important to AI governance. Since training compute can serve as a proxy for the capabilities of AI models, a compute threshold can operate as a regulatory trigger, identifying what subset of models might possess more powerful and dangerous capabilities that warrant greater scrutiny, such as in the form of reporting and evaluations. Both the European Union AI Act and Executive Order 14,110 established compute thresholds for different purposes, and many more policy proposals rely on compute thresholds to ensure that the scope of covered models matches the nature or purpose of the policy. This Part provides an overview of policy proposals that expressly call for such a threshold, as well as proposals that could benefit from the addition of a compute threshold to clarify the scope of policies that refer broadly to “advanced systems” or “systems with dangerous capabilities.” It then describes how, even absent a formal compute threshold, courts and regulators might rely on training compute as a proxy for how much risk a given AI system poses, even under existing law. This Part concludes with the advantages and limitations of using compute thresholds as a regulatory trigger.
I. Compute and the Scaling Hypothesis
A. What Is “Compute”?
The term “compute” serves as an umbrella term, encompassing several meanings that depend on context.
Commonly, the term “compute” is used to refer to computational infrastructure, i.e., the hardware stacks necessary to develop and deploy AI systems.[ref 10] Many hardware elements are integrated circuits (also called chips or microchips), such as logic chips, which perform operations, and memory chips, which store the information on which logic devices perform calculations.[ref 11] Logic chips cover a spectrum of specialization, ranging from general-purpose central processing units (“CPUs”), through graphics processing units (“GPUs”) and field-programmable gate arrays (“FPGAs”), to application-specific integrated circuits (“ASICs”) customized for specific algorithms.[ref 12] Memory chips include dynamic random-access memory (“DRAM”), static random-access memory (“SRAM”), and NOT AND (“NAND”) flash memory used in many solid state drives (“SSDs”).[ref 13]
Additionally, the term “compute” is often used to refer to how much computational power is required to train a specific AI system. Whereas the computational performance of a chip refers to how quickly it can execute operations and thus generate results, solve problems, or perform specific tasks, such as processing and manipulating data or training an AI system, “compute” refers to the amount of computational power used by one or more chips to perform a task, such as training a model. Compute is commonly measured in integer operations or floating-point operations (“OP” or “FLOP”),[ref 14] expressing the number of operations that have been executed by one or more chips, while the computational performance of those chips is measured in operations per second (“OP/s” or “FLOP/s”). In this sense, the amount of computational power used is roughly analogous to the distance traveled by a car.[ref 15] Since large amounts of compute are used in modern computing, values are often reported in scientific notation such as 1e26 or 2e26, which refer to 1×10^26 and 2×10^26 respectively.
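To make the distinction concrete, the back-of-the-envelope arithmetic below (a minimal sketch in Python, using hypothetical hardware figures rather than the specifications of any real cluster) shows how chip performance in FLOP/s, utilization, chip count, and training time combine into a total compute figure expressed in this notation.

# Back-of-the-envelope training compute estimate (illustrative only).
# The peak throughput, utilization rate, chip count, and training duration
# are assumed, hypothetical figures, not drawn from any real training run.

peak_flop_per_second = 1e15   # hypothetical accelerator peak throughput (FLOP/s)
utilization = 0.3             # assumed fraction of peak performance actually achieved
num_chips = 10_000            # hypothetical cluster size
training_days = 90            # assumed duration of the training run

seconds = training_days * 24 * 60 * 60
total_flop = peak_flop_per_second * utilization * num_chips * seconds

print(f"Total training compute: {total_flop:.2e} FLOP")
# -> roughly 2.33e+25 FLOP, i.e., on the order of the 1e25 figure
#    discussed later in this Article.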
Compute is essential throughout the AI lifecycle. The AI lifecycle can be broken down into two phases: development and deployment.[ref 16] In the first phase, development, developers design the model by choosing an architecture, the structure of the network, and initial values for hyperparameters (i.e., parameters that control the learning process, such as number of layers and learning rate).[ref 17] Enormous amounts of data, usually from publicly available sources, are processed and curated to produce high-quality datasets for training.[ref 18] The model then undergoes “pre-training,” in which the model is trained on a large and diverse dataset in order to build the general knowledge and features of the model, which are reflected in the weights and biases of the model.[ref 19] Alternatively, developers may use an existing pre-trained model, such as OpenAI’s GPT-4 (“Generative Pre-trained Transformer 4”). The term “foundation model” refers to models like these, which are trained on broad data and adaptable to many downstream tasks.[ref 20] Performance and capabilities improvements are then possible using methods such as fine-tuning on task-specific datasets, reinforcement learning from human feedback (“RLHF”), teaching the model to use tools, and instruction tuning.[ref 21] These enhancements are far less compute-intensive than pre-training, particularly for models trained on massive datasets.[ref 22]
As of this writing, there is no agreed-upon standard for measuring “training compute.” Estimates of “training compute” typically refer only to the amount of compute used during pre-training. More specifically, they refer to the amount of compute used during the final pre-training run, which contributes to the final machine learning model, and does not include any previous test runs or post-training enhancements, such as fine-tuning.[ref 23] There are exceptions: for instance, the EU AI Act considers the cumulative amount of compute used for training by including all the compute “used across the activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning.”[ref 24] California Senate Bill 1047 addressed post-training modifications generally and fine-tuning in particular, providing that a covered model fine-tuned with more than 3e25 OP or FLOP would be considered a distinct “covered model,” while one fine-tuned on less compute or subjected to unrelated post-training modifications would be considered a “covered model derivative.”[ref 25]
In the second phase, deployment, the model is made available to users and is used.[ref 26] Users provide input to the model, such as in the form of a prompt, and the model makes predictions from this input in a process known as “inference.”[ref 27] The amount of compute needed for a single inference request is far lower than what is required for a training run.[ref 28] However, for systems deployed at scale, the cumulative compute used for inference can surpass training compute by several orders of magnitude.[ref 29] Consider, for instance, a large language model (“LLM”). During training, a large amount of compute is required over a smaller time frame within a closed system, usually a supercomputer. Once the model is deployed, each text generation leverages its own copy of the trained model, which can be run on a separate compute infrastructure. The model may serve hundreds of millions of users, each generating unique content and using compute with each inference request. Over time, the cumulative compute usage for inference can surpass the total compute required for training.
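The sketch below illustrates this crossover with hypothetical figures; the roughly two FLOP per parameter per token estimate for a dense model's forward pass is a commonly used approximation, not an exact rule, and the traffic volumes are assumptions chosen only for illustration.

# Rough comparison of one-time training compute and cumulative inference compute.
# All figures are hypothetical; ~2 FLOP per parameter per token is a common
# approximation for a dense transformer forward pass.

params = 1e11                    # hypothetical model size (parameters)
training_compute = 1e25          # hypothetical final pre-training run (FLOP)

tokens_per_request = 1_000       # assumed prompt plus generated tokens
flop_per_request = 2 * params * tokens_per_request   # ~2e14 FLOP per request

requests_per_day = 1e9           # assumed traffic for a widely deployed service
days_to_match_training = training_compute / (flop_per_request * requests_per_day)

print(f"Compute per request: {flop_per_request:.1e} FLOP")
print(f"Days of deployment until cumulative inference equals training compute: "
      f"{days_to_match_training:.0f}")
# -> 50 days at these assumed volumes; beyond that point, cumulative inference
#    compute exceeds the one-time training cost.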
There are various reasons to consider compute usage at different stages of the AI lifecycle, as discussed in Section I.E. For clarity, this Article uses “training compute” for compute used during the final pre-training run and “inference compute” for compute used by the model during a single inference, measured in the number of operations (“OP” or “FLOP”). Figure 1 illustrates a simplified version of the language model compute lifecycle.
Figure 1: Simplified language model lifecycle
B. What Is Moore’s Law and Why Is It Relevant for AI?
In 1965, Gordon Moore forecasted that the number of transistors on an integrated circuit would double every year.[ref 30] Ten years later, Moore revised his initial forecast to a two-year doubling period.[ref 31] This pattern of exponential growth is now called “Moore’s Law.”[ref 32] Similar rates of growth have been observed in related metrics, notably including the increase in computational performance of supercomputers;[ref 33] as the number of transistors on a chip increases, so does computational performance (although other factors also play a role).[ref 34]
A corollary of Moore’s Law is that the cost of compute has fallen dramatically; a dollar can buy more FLOP every year.[ref 35] Greater access to compute, along with greater spending from 2010 onwards (i.e., the so-called deep learning era),[ref 36] has contributed to developers using ever more compute to train AI systems. Research has found that the compute used to train notable and frontier models has grown by 4–5x per year between 2010 and May 2024.[ref 37]
Figure 2: Compute used to train notable AI systems from 1950 to 2023[ref 38]
However, the current rate of growth in training compute may not be sustainable. Scholars have cited the cost of training,[ref 39] a limited supply of AI chips,[ref 40] technical challenges with using that much hardware (such as managing the number of processors that must run in parallel to train larger models),[ref 41] and environmental impact[ref 42] as factors that could constrain the growth of training compute. Research in 2018 with data from OpenAI estimated that then-current trends of growth in training compute could be sustained for at most 3.5 to 10 years (2022 to 2028), depending on spending levels and how the cost of compute evolves over time.[ref 43] In 2022, that analysis was replicated with a more comprehensive dataset and suggested that this trend could be maintained for longer, for 8 to 18 years (2030 to 2040) depending on compute cost-performance improvements and specialized hardware improvements.[ref 44]
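A simple compounding calculation, shown below purely for illustration and using the 4–5x annual growth rate reported above together with an assumed 1e25 FLOP starting point, conveys how quickly sustained growth moves frontier training runs across order-of-magnitude boundaries.

# Illustrative projection of frontier training compute under the 4-5x annual
# growth rate reported above. The starting point and horizon are assumptions.

start_compute = 1e25   # assumed frontier training compute in the starting year (FLOP)
for growth in (4, 5):
    compute = start_compute
    print(f"Growth rate {growth}x per year:")
    for year in range(1, 4):
        compute *= growth
        print(f"  after {year} year(s): {compute:.1e} FLOP")
# At these rates, a 1e25 FLOP starting point exceeds 1e26 FLOP within two years
# and reaches roughly 6e26 to 1e27 FLOP within three.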
C. What Are “Scaling Laws” and What Do They Say About AI Models?
Scaling laws describe the functional (mathematical) relationship between the amount of training compute and the performance of the AI model.[ref 45] In this context, performance is a technical metric that quantifies “loss,” which is the amount of error in the model’s predictions. When loss is measured on a test or validation set that uses data not part of the training set, it reflects how well the model has generalized its learning from the training phase. The lower the loss, the more accurate and reliable the model is in making predictions on data it has not encountered during its training.[ref 46] As training compute increases, alongside increases in parameters and training data, so does model performance, meaning that greater training compute reduces the errors made.[ref 47] Increased training compute also corresponds to an increase in capabilities.[ref 48] Whereas performance refers to a technical metric, such as test loss, capabilities refer to the ability to complete concrete tasks and solve problems in the real world, including in commercial applications.[ref 49] Capabilities can also be assessed using practical and real-world tests, such as standardized academic or professional licensing exams, or with benchmarks developed for AI models. Common benchmarks include “Beyond the Imitation Game” (“BIG-Bench”), which comprises 204 diverse tasks that cover a variety of topics and languages,[ref 50] and the “Massive Multitask Language Understanding” benchmark (“MMLU”), a suite of multiple-choice questions covering 57 subjects.[ref 51] To evaluate the capabilities of Google’s PaLM 2 and OpenAI’s GPT-4, developers relied on BIG-Bench and MMLU as well as exams designed for humans, such as the SAT and AP exams.[ref 52]
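For intuition, one widely cited functional form for such a law, taken from Hoffmann and colleagues' "Chinchilla" analysis rather than from the sources cited in this Article, expresses test loss in terms of parameter count and dataset size, with training compute approximated by a simple product; the fitted constants vary by study and are left symbolic here.

% Illustrative scaling-law form; E, A, B, \alpha, \beta are empirically fitted
% constants, L is test loss, N is parameter count, D is training tokens,
% and C is training compute in FLOP (dense-transformer approximation).
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
C \approx 6\,N\,D .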
Training compute has a relatively smooth and consistent relationship with technical metrics like training loss. Training compute also corresponds to real-world capabilities, but not in a smooth and predictable way. This is due in part to occasional surprising leaps, discussed in Section I.D, and subsequent enhancements such as fine-tuning, which can further increase capabilities using far less compute.[ref 53] Despite being unable to provide a full and accurate picture of a model’s final capabilities, training compute still provides a reasonable basis for estimating the base capabilities (and corresponding risk) of a foundation model. Figure 3 shows the relationship between an increase in training compute and dataset size, and performance on the MMLU benchmark.
Figure 3: Relationship between increase in training compute and dataset size, and performance on MMLU[ref 54]
In light of the correlation between training compute and performance, the “scaling hypothesis” states that scaling training compute will predictably continue to produce even more capable systems, and thus more compute is important for AI development.[ref 55] Some have taken this hypothesis further, proposing a “Bitter Lesson:” that “the only thing that matters in the long run is the leveraging of comput[e].”[ref 56] Since the emergence of the deep learning era, this hypothesis has been sustained by the increasing use of AI models in commercial applications, whose development and commercial success have been significantly driven by increases in training compute.[ref 57]
Two factors weigh against the scaling hypothesis. First, scaling laws describe more than just the performance improvements based on training compute; they describe the optimal ratio of the size of the dataset, the number of parameters, and the training compute budget.[ref 58] Thus, a lack of abundant or high-quality data could be a limiting factor. Researchers estimate that, if training datasets continue to grow at current rates, language models will fully utilize human-generated public text data between 2026 and 2032,[ref 59] while image data could be exhausted between 2030 and 2060.[ref 60] Specific tasks may be bottlenecked earlier by the scarcity of high-quality data sources.[ref 61] There are, however, several ways that data limitations might be delayed or avoided, such as synthetic data generation and using additional datasets that are not public or in different modalities.[ref 62]
Second, algorithmic innovation permits performance gains that would otherwise require prohibitively expensive amounts of compute.[ref 63] Research estimates that every 9 months, improved algorithms for image classification[ref 64] and LLMs[ref 65] contribute the equivalent of a doubling of training compute budgets. Algorithmic improvements include more efficient utilization of data[ref 66] and parameters, the development of improved training algorithms, or new architectures.[ref 67] Over time, the amount of training compute needed to achieve a given capability is reduced, and it may become more difficult to predict performance and capabilities on that basis (although scaling trends of new algorithms could be studied and perhaps predicted). The governance implications of this are multifold, including that increases in training compute may become less important for AI development and that many more actors will be able to access the capabilities previously restricted to a limited number of developers.[ref 68] Still, responsible frontier AI development may enable stakeholders to develop understanding, safety practices, and (if needed) defensive measures for the most advanced AI capabilities before these capabilities proliferate.
D. Are High-Compute Systems Dangerous?
Advances in AI could deliver immense opportunities and benefits across a wide range of sectors, from healthcare and drug discovery[ref 69] to public services.[ref 70] However, more capable models may come with greater risk, as improved capabilities could be used for harmful and dangerous ends. While the degree of risk posed by current AI models is a subject of debate,[ref 71] future models may pose catastrophic and existential risks as capabilities improve.[ref 72] Some of these risks are expected to be closely connected to the unexpected emergence of dangerous capabilities and the dual-use nature of AI models.
As discussed in Section I.C, increases in compute, data, and the number of parameters lead to predictable improvements in model performance (test loss) and general but somewhat less predictable improvements in capabilities (real-world benchmarks and tasks). However, scaling up these inputs to a model can also result in qualitative changes in capabilities in a phenomenon known as “emergence.”[ref 73] That is, a larger model might unexpectedly display emergent capabilities not present in smaller models, suddenly able to perform a task that smaller models could not.[ref 74] During the development of GPT-3, early models had close-to-zero performance on a benchmark for addition, subtraction, and multiplication. Arithmetic capabilities appeared to emerge suddenly in later models, with performance jumping substantially above random at 2×10^22 FLOP and continuing to improve with scale.[ref 75] Similar jumps were observed at different thresholds, and for different models, on a variety of tasks.[ref 76]
Some have contested the concept of emergent capabilities, arguing that what appear to be emergent capabilities in large language models are explained by the use of discontinuous measures, rather than by sharp and unpredictable improvements or developments in model capabilities with scale.[ref 77] However, discontinuous measures are often meaningful, as when the correct answer or action matters more than how close the model gets to it. As Anderljung and others explain: “For autonomous vehicles, what matters is how often they cause a crash. For an AI model solving mathematics questions, what matters is whether it gets the answer exactly right or not.”[ref 78] Given the difficulties inherent in choosing an appropriate continuous measure and determining how it corresponds to the relevant discontinuous measure,[ref 79] it is likely that capabilities will continue to seemingly emerge.
Together with emerging capabilities come emerging risks. Like many other innovations, AI systems are dual-use by nature, with the potential to be used for both beneficial and harmful ends.[ref 80] Executive Order 14,110 recognized that some models may “pose a serious risk to security, national economic security, national public health or safety” by “substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear weapons; enabling powerful offensive cyber operations . . . ; [or] permitting the evasion of human control or oversight through means of deception or obfuscation.”[ref 81]
Predictions and evaluations will likely adequately identify many capabilities before deployment, allowing developers to take appropriate precautions. However, systems trained at a greater scale may possess novel capabilities, or improved capabilities that surpass a critical threshold for risk, yet go undetected by evaluations.[ref 82] Some of these capabilities may appear to emerge only after post-training enhancements, such as fine-tuning or more effective prompting methods. A system may be capable of conducting offensive cyber operations, manipulating people in conversation, or providing actionable instructions on conducting acts of terrorism,[ref 83] and still be deployed without the developers fully comprehending unexpected and potentially harmful behaviors. Research has already detected unexpected behavior in current models. For instance, during the recent U.K. AI Safety Summit on November 1, 2023, Apollo Research showed that GPT-4 can take illegal actions like insider trading and then lie about its actions without being instructed to do so.[ref 84] Since the capabilities of future foundation models may be challenging to predict and evaluate, “emergence” has been described as “both the source of scientific excitement and anxiety about unanticipated consequences.”[ref 85]
Not all risks come from large models. Smaller models trained on data from certain domains, such as biology or chemistry, may pose significant risks if repurposed or misused.[ref 86] When MegaSyn, a generative molecule design tool used for drug discovery, was repurposed to find the most toxic molecules instead of the least toxic, it found tens of thousands of candidates in under six hours, including known biochemical agents and novel compounds predicted to be as or more deadly.[ref 87] The amount of compute used to train DeepMind’s AlphaFold, which predicts three-dimensional protein structures from the protein sequence, is minimal compared to frontier language models.[ref 88] While scaling laws can be observed in a variety of domains, the amount of compute required to train models in some domains may be so low that a compute threshold is not a practical restriction on capabilities.
Broad consensus is forming around the need to test, monitor, and restrict systems of concern.[ref 89] The role of compute thresholds, and whether they are used at all, depends on the nature of the risk and the purpose of the policy: does it target risks from emergent capabilities of frontier models,[ref 90] risks from models with more narrow but dangerous capabilities,[ref 91] or other risks from AI?
E. Does Compute Usage Outside of Training Influence Performance and Risk?
In light of the relationship between training compute and performance expressed by scaling laws, training compute is a common proxy for how capable and powerful AI models are and the risks that they pose.[ref 92] However, compute used outside of training can also influence performance, capabilities, and corresponding risk.
As discussed in Section I.A, training compute typically does not refer to all compute used during development, but is instead limited to compute used during the final pre-training run.[ref 93] This definition excludes subsequent (post-training) enhancements, such as fine-tuning and prompting methods, which can significantly improve capabilities (see supra Figure 1) using far less compute; many current methods can improve capabilities by the equivalent of a 5x increase in training compute, while some can improve them by more than 20x.[ref 94]
The focus on training compute also misses the significance of compute used for inference, in which the trained model generates output in response to a prompt or new input data.[ref 95] Inference is the biggest compute cost for models deployed at scale, due to the frequency and volume of requests they handle.[ref 96] While developing an AI model is far more computationally intensive than a single inference request, it is a one-time task. In contrast, once a model is deployed, it may receive numerous inference requests that, in aggregate, exceed the compute expenditures of training. Some have even argued that inference compute could be a bottleneck in scaling AI, if inference compute costs, which scale alongside training compute, grow too large.[ref 97]
Greater availability of inference compute could enhance malicious uses of AI by allowing the model to process data more rapidly and enabling the operation of multiple instances in parallel. For example, AI could more effectively be used to carry out cyber attacks, such as a distributed denial-of-service (“DDoS”) attack,[ref 98] to manipulate financial markets,[ref 99] or to increase the speed, scale, and personalization of disinformation campaigns.[ref 100]
Compute used outside of development may also impact model performance. Specifically, some techniques can increase the performance of a model at the cost of more compute used during inference.[ref 101] Developers could therefore choose to improve a model beyond its current capabilities or to shift some compute expenditures from training to inference, in order to obtain equally-capable systems with less training compute. Users could also prompt a model to use similar techniques during inference, for example by (1) using “few-shot” prompting, in which initial prompts provide the model with examples of the desired output for a type of input,[ref 102] (2) using chain-of-thought prompting, which uses few-shot prompting to provide examples of reasoning,[ref 103] or (3) simply providing the same prompt multiple times and selecting the best result. Some user-side techniques to improve performance might increase the compute used during a single inference, while others would leave it unchanged (while still increasing the total compute used, due to multiple inferences being performed).[ref 104] Meanwhile, other techniques—such as pruning,[ref 105] weight sharing,[ref 106] quantization,[ref 107] and distillation[ref 108]—can reduce compute used during inference while maintaining or even improving performance, and they can further reduce inference compute at the cost of lower performance.
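As a rough illustration of how these user-side techniques change compute usage, the sketch below applies the same approximate two FLOP per parameter per token rule used earlier; the parameter and token counts are hypothetical.

# Illustrative effect of user-side prompting strategies on inference compute,
# using the rough ~2 FLOP per parameter per token approximation. The parameter
# count and token counts are hypothetical.

params = 1e11

def request_flop(input_tokens, output_tokens, samples=1):
    """Approximate FLOP for one request: a forward pass over all tokens,
    repeated once per sampled completion."""
    return 2 * params * (input_tokens + output_tokens) * samples

zero_shot = request_flop(input_tokens=50, output_tokens=200)
few_shot = request_flop(input_tokens=800, output_tokens=200)          # longer prompt
best_of_five = request_flop(input_tokens=50, output_tokens=200, samples=5)

print(f"zero-shot:    {zero_shot:.1e} FLOP")
print(f"few-shot:     {few_shot:.1e} FLOP")
print(f"best-of-five: {best_of_five:.1e} FLOP")
# Few-shot prompting raises compute per single inference (longer input), while
# best-of-n sampling leaves per-inference compute unchanged but multiplies the
# total compute used across the n inferences.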
Beyond model characteristics such as parameter count, other factors can also affect the amount of compute used during inference in ways that may or may not improve performance, such as input size (compare a short prompt to a long document or high-resolution image) and batch size (compare one input provided at a time to many inputs in a single prompt).[ref 109] Thus, for a more accurate indication of model capabilities, compute used to run a single inference[ref 110] for a given set of prompts could be considered alongside other factors, such as training compute. However, doing so may be impractical, as data about inference compute (or architecture useful for estimating it) is rarely published by developers,[ref 111] different techniques could make inference more compute-efficient, and less information is available regarding the relationship between inference compute and capabilities.
While companies might be hesitant to increase inference compute at scale due to cost, doing so may still be worthwhile in certain circumstances, such as for more narrowly deployed models or for users willing to pay more for improved capabilities. For example, OpenAI offers dedicated instances for users who want more control over system performance, with a reserved allocation of compute infrastructure and the ability to enable features such as longer context limits.[ref 112]
Over time, compute usage during the AI development and deployment process may change. It was previously common practice to train models with supervised learning, which uses annotated datasets. In recent years, there has been a rise in self-supervised, semi-supervised, and unsupervised learning, which use data with limited or no annotation but require more compute.[ref 113]
II. The Role of Compute Thresholds for AI Governance
A. How Can Compute Thresholds Be Used in AI Policy?
Compute can be used as a proxy for the capabilities of AI systems, and compute thresholds can be used to define the limited subset of high-compute models subject to oversight or other requirements.[ref 114] Their use depends on the context and purpose of the policy. Compute thresholds serve as intuitive starting points to identify potential models of concern,[ref 115] perhaps alongside other factors.[ref 116] They operate as a trigger for greater scrutiny or specific requirements. Once a certain level of training compute is reached, a model is presumed to have a higher risk of displaying dangerous capabilities (and especially unknown dangerous capabilities) and, hence, is subject to stricter oversight and other requirements.
Compute thresholds have already entered AI policy. The EU AI Act requires model providers to assess and mitigate systemic risks, conduct state-of-the-art tests and model evaluations, ensure cybersecurity, and report serious incidents if a compute threshold is crossed.[ref 117] Under the EU AI Act, a general-purpose model that meets the initial threshold is presumed to have high-impact capabilities and associated systemic risk.[ref 118]
In the United States, Executive Order 14,110 directed agencies to propose rules based on compute thresholds. Although it was revoked by President Trump’s Executive Order 14,148,[ref 119] many actions have already been taken and rules have been proposed for implementing Executive Order 14,110. For instance, the Department of Commerce’s Bureau of Industry and Security issued a proposed rule on September 11, 2024[ref 120] to implement the requirement that AI developers and cloud service providers report on models above certain thresholds, including information about (1) “any ongoing or planned activities related to training, developing, or producing dual-use foundation models,” (2) the results of red-teaming, and (3) the measures the company has taken to meet safety objectives.[ref 121] The executive order also imposed know-your-customer (“KYC”) monitoring and reporting obligations on U.S. cloud infrastructure providers and their foreign resellers, again with a preliminary compute threshold.[ref 122] On January 29, 2024, the Bureau of Industry and Security issued a proposed rule implementing those requirements.[ref 123] The proposed rule noted that training compute thresholds may determine the scope of the rule; the program is limited to foreign transactions to “train a large AI model with potential capabilities that could be used in malicious cyber-enabled activity,” and technical criteria “may include the compute used to pre-train the model exceeding a specified quantity.”[ref 124] The fate of these rules is uncertain, as all rules and actions taken pursuant to Executive Order 14,110 will be reviewed to ensure that they are consistent with the AI policy set forth in Executive Order 14,179, Removing Barriers to American Leadership in Artificial Intelligence.[ref 125] Any rules or actions identified as inconsistent are directed to be suspended, revised, or rescinded.[ref 126]
Numerous policy proposals have likewise called for compute thresholds. Scholars and developers alike have expressed support for a licensing or registration regime,[ref 127] and a compute threshold could be one of several ways to trigger the requirement.[ref 128] Compute thresholds have also been proposed for determining the level of KYC requirements for compute providers (including cloud providers).[ref 129] The Framework to Mitigate AI-Enabled Extreme Risks, proposed by U.S. Senators Romney, Reed, Moran, and King, would include a compute threshold for requiring notice of development, model evaluation, and pre-deployment licensing.[ref 130]
Other AI regulations and policy proposals do not explicitly call for the introduction of compute thresholds but could still benefit from them. A compute threshold could clarify when specific obligations are triggered in laws and guidance that refer more broadly to “advanced systems” or “systems with dangerous capabilities,” as in the voluntary guidance for “organizations developing the most advanced AI systems” in the Hiroshima Process International Code of Conduct for Advanced AI Systems, agreed upon by G7 leaders on October 30, 2023.[ref 131] Compute thresholds could identify when specific obligations are triggered in other proposals, including proposals for: (1) conducting thorough risk assessments of frontier AI models before deployment;[ref 132] (2) subjecting AI development to evaluation-gated scaling;[ref 133] (3) pausing development of frontier AI;[ref 134] (4) subjecting developers of advanced models to governance audits;[ref 135] (5) monitoring advanced models after deployment;[ref 136] and (6) requiring that advanced AI models be subject to information security protections.[ref 137]
B. Why Might Compute Be Relevant Under Existing Law?
Even without a formal compute threshold, the significance of training compute could affect the interpretation and application of existing laws. Courts and regulators may rely on compute as a proxy for how much risk a given AI system poses—alongside other factors such as capabilities, domain, safeguards, and whether the application is in a higher-risk context—when determining whether a legal condition or regulatory threshold has been met. This section briefly covers a few examples. First, it discusses the potential implications for duty of care and foreseeability analyses in tort law. It then goes on to describe how regulatory agencies could depend on training compute as one of several factors in evaluating risk from frontier AI, for example as an indicator of change to a regulated product and as a factor in regulatory impact analysis.
The application of existing laws and ongoing development of common law, such as tort law, may be particularly important while AI governance is still nascent[ref 138] and may operate as a complement to regulations once developed.[ref 139] However, courts and regulators will face new challenges as cases involve AI, an emerging technology of which they may have little specialized knowledge, and parties will face uncertainty and inconsistent judgments across jurisdictions. As developments in AI unsettle existing law[ref 140] and agency practice, courts and agencies might rely on compute in several ways.
For example, compute could inform the duty of care owed by developers who make voluntary commitments to safety.[ref 141] A duty of care, which is a responsibility to take reasonable care to avoid causing harm to another, can be conditioned on the foreseeability of the plaintiff as a victim or be an affirmative duty to act in a particular way; affirmative duties can arise from the relationship between the parties, such as between business owner and customer, doctor and patient, and parent and child.[ref 142] If AI companies make general commitments to security testing and cybersecurity, such as the voluntary safety commitments secured by the Biden administration,[ref 143] those commitments may give rise to a duty of care in which training compute is a factor in determining what security is necessary. If a lab adopts a responsible scaling policy that requires it to have protection measures based on specific capabilities or potential for risk or misuse,[ref 144] a court might consider training compute as one of several factors in evaluating the potential for risk or misuse.
A court might also consider training compute as a factor when determining whether a harm was foreseeable. More advanced AI systems, trained with more compute, could foreseeably be capable of greater harm, especially in light of scaling laws discussed in Section I.C that make clear the relationship between compute and performance. It may likewise be foreseeable that a powerful AI system could be misused[ref 145] or become the target of more sophisticated attempts at exfiltration, which might succeed without adequate security.[ref 146] Foreseeability may in turn bear on negligence elements of proximate causation and duty of care.
Compute could also play a role in other scenarios, such as in a false advertising claim under the Lanham Act[ref 147] or state and federal consumer protection laws. If a business makes a claim about its AI system or services that is false or misleading, it could be held liable for monetary damages and enjoined from making that claim in the future (unless it becomes true).[ref 148] While many such claims will not involve compute, some may; for example, if a lab publicly claims to follow a responsible scaling policy, training compute could be relevant as an indicator of model capability and the corresponding security and safety measures promised by the policy.
Regulatory agencies may likewise consider compute in their analyses and regulatory actions. For example, the Environmental Protection Agency could consider training (and inference) compute usage as part of environmental impact assessments.[ref 149] Others could treat compute as a proxy for threat to national or public security. Agencies and committees responsible for identifying and responding to various risks, such as the Interagency Committee on Global Catastrophic Risk[ref 150] and Financial Stability Oversight Council,[ref 151] could consider compute in their evaluation of risk from frontier AI. Over fifty federal agencies were directed to take specific actions to promote responsible development, deployment, and federal use of AI, as well as regulation of industry, in the government-wide effort established by Executive Order 14,110[ref 152]—although these actions are now under review.[ref 153] Even for agencies not directed to consider compute or implement a preliminary compute threshold, compute might factor into how guidance is implemented over time.
More speculatively, changes to training compute could be used by agencies as one of many indicators of how much a regulated product has changed, and thus whether it warrants further review. For example, the Food and Drug Administration might consider compute when evaluating AI in medical devices or diagnostic tools.[ref 154] While AI products considered to be medical devices are more likely to be narrow AI systems trained on comparatively less compute, significant changes to training compute may be one indicator that software modifications require premarket submission. The ability to measure, report, and verify compute[ref 155] could make this approach particularly compelling for regulators.
Finally, training compute may factor into regulatory impact analyses, which evaluate the impact of proposed and existing regulations through quantitative and qualitative methods such as cost-benefit analysis.[ref 156] While this type of analysis is not necessarily determinative, it is often an important input into regulatory decisions and necessary for any “significant regulatory action.”[ref 157] As agencies develop and propose new regulations and consider how those rules will affect or be affected by AI, compute could be relevant in drawing lines that define what conduct and actors are affected. For example, a rule with a higher compute threshold and narrower scope may be less significant and costly, as it covers fewer models and developers. The amount of compute used to train models now and in the future may be not only a proxy for threat to national security (or innovation, or economic growth), but also a source of uncertainty, given the potential for emergent capabilities.
C. Where Should the Compute Threshold(s) Sit?
The choice of compute threshold depends on the policy under consideration: what models are the intended target, given the purpose of the policy? What are the burdens and costs of compliance? Can the compute threshold be complemented with other elements for determining whether a model falls within the scope of the policy, in order to more precisely accomplish its purpose?
Some policy proposals would establish a compute threshold “at the level of FLOP used to train current foundational models.”[ref 158] While the training compute of many models is not public, according to estimates, the largest models today were trained with 1e25 FLOP or more, including at least one open-source model, Llama 3.1 405B.[ref 159] This is the initial threshold established by the EU AI Act. Under the Act, general-purpose AI models are considered to have “systemic risk,” and thus trigger a series of obligations for their providers, if found to have “high impact capabilities.”[ref 160] Such capabilities are presumed if the cumulative amount of training compute, which includes all “activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning,” exceeds 1e25 FLOP.[ref 161] This threshold encompasses existing models such as Gemini Ultra and GPT-4, and it can be updated upwards or downwards by the European Commission through delegated acts.[ref 162] During the AI Safety Summit held in 2023, the U.K. Government included current models by defining “frontier AI” as “highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today’s most advanced models” and acknowledged that the definition included the models underlying ChatGPT, Claude, and Bard.[ref 163]
Others have proposed an initial threshold of “more training compute than already-deployed systems,”[ref 164] such as 1e26 FLOP[ref 165] or 1e27 FLOP.[ref 166] No known model currently exceeds 1e26 FLOP training compute, which is roughly five times the compute used to train GPT-4.[ref 167] These higher thresholds would more narrowly target future systems that pose greater risks, including potential catastrophic and existential risks.[ref 168] President Biden’s Executive Order on AI[ref 169] and recently-vetoed California Senate Bill 1047[ref 170] are in line with these proposals, both targeting models trained with more than 1e26 OP or FLOP.
Far more models would fall within the scope of a compute threshold set lower than current frontier models. While only two models exceeded 1e23 FLOP training compute in 2017, over 200 models meet that threshold today.[ref 171] As discussed in Section II.A, compute thresholds operate as a trigger for additional scrutiny, and more models falling within the ambit of regulation would entail a greater burden not only on developers, but also on regulators.[ref 172] These smaller, general-purpose models have not yet posed extreme risks, making a lower threshold unwarranted at this time.[ref 173]
While the debate has centered mostly around the establishment of a single training compute threshold, governments could adopt a pluralistic and risk-adjusted approach by introducing multiple compute thresholds that trigger different measures or requirements according to the degree or nature of risk. Some proposals recommend a tiered approach that would create fewer obligations for models trained on less compute. For example, the Responsible Advanced Artificial Intelligence Act of 2024 would require pre-registration and benchmarks for lower-compute models, while developers of higher-compute models must submit a safety plan and receive a permit prior to training or deployment.[ref 174] Multi-tiered systems may also incorporate a higher threshold beyond which no development or deployment can take place, with limited exceptions, such as for development at a multinational consortium working on AI safety and emergency response infrastructure[ref 175] or for training runs and models with strong evidence of safety.[ref 176]
Domain-specific thresholds could be established for models that possess capabilities or expertise in areas of concern and models that are trained using less compute than general-purpose models.[ref 177] A variety of specialized models are already available to advance research, trained on extensive scientific databases.[ref 178] As discussed in Part I.D, these models present a tremendous opportunity, yet many have also recognized the potential threat of their misuse to research, develop, and use chemical, biological, radiological, and nuclear weapons.[ref 179] To address these risks, President Biden’s Executive Order on AI, which set a compute threshold of 1e26 FLOP to trigger reporting requirements, set a substantially lower compute threshold of 1e23 FLOP for models trained “using primarily biological sequence data.”[ref 180] The Hiroshima Process International Code of Conduct for Advanced AI Systems likewise recommends devoting particular attention to offensive cyber capabilities and chemical, biological, radiological, and nuclear risks, although it does not propose a compute threshold.[ref 181]
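To illustrate how such tiered, domain-specific triggering might work in practice, the sketch below encodes the two reporting thresholds from Executive Order 14,110 described above (more than 1e26 OP in general, and more than 1e23 OP for models trained primarily on biological sequence data). It is a simplified illustration of the triggering logic only, not a restatement of the order's legal text.

# Simplified sketch of threshold-based triggering, using the two reporting
# thresholds from Executive Order 14,110 discussed above. This illustrates the
# logic only; it does not reproduce the order's definitions or exemptions.

GENERAL_THRESHOLD_OP = 1e26   # general reporting threshold (OP or FLOP)
BIO_THRESHOLD_OP = 1e23       # lower threshold for models trained primarily
                              # on biological sequence data

def reporting_required(training_compute_op: float,
                       primarily_biological_data: bool) -> bool:
    threshold = BIO_THRESHOLD_OP if primarily_biological_data else GENERAL_THRESHOLD_OP
    return training_compute_op > threshold

# Hypothetical models:
print(reporting_required(3e25, primarily_biological_data=False))  # False
print(reporting_required(3e25, primarily_biological_data=True))   # True
print(reporting_required(2e26, primarily_biological_data=False))  # True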
While domain-specific thresholds could be useful for a variety of policies tailored to specific risks, there are some limitations. It may be technically difficult to verify how much biological sequence data (or other domain-specific data) was used to train a model.[ref 182] Another challenge is specifying how much data in a given domain causes a model to fall within scope, particularly considering the potential capabilities of models trained on mixed data.[ref 183] Finally, the amount of training compute required may be so low that, over time, a compute threshold is not practical.
When choosing a threshold, regulators should be aware that capabilities might be substantially improved through post-training enhancements, and training compute is only a general predictor of capabilities. The absolute limits are unclear at this point; however, current methods can result in capability improvements equivalent to a 5- to 30-times increase in training compute.[ref 184] To account for post-training enhancements, a governance regime could create a safety buffer, in which oversight or other protective measures are set at a lower threshold.[ref 185] Along similar lines, open-source models may warrant a lower threshold for at least some regulatory requirements, since they could be further trained by another actor and, once released, cannot be moderated or rescinded.[ref 186]
D. Does a Compute Threshold Require Updates?
Once established, compute thresholds and related criteria will likely require updates over time.[ref 187] Improvements in algorithmic efficiency could reduce the amount of compute needed to train an equally capable model,[ref 188] or a threshold could be raised or eliminated if adequate protective measures are developed or if models trained with a certain amount of compute are demonstrated to be safe.[ref 189] To further guard against future developments in a rapidly evolving field, policymakers can authorize regulators to update compute thresholds and related criteria.[ref 190]
Several policies, proposed and enacted, have incorporated a dynamic compute threshold. For example, President Biden’s Executive Order on AI authorized the Secretary of Commerce to update the initial compute threshold set in the order, as well as other technical conditions for models subject to reporting requirements, “as needed on a regular basis” while establishing an interim compute threshold of 1e26 OP or FLOP.[ref 191] Similarly, the EU AI Act provides that the 1e25 FLOP compute threshold “should be adjusted over time to reflect technological and industrial changes, such as algorithmic improvements” and authorizes the European Commission to amend the threshold and “supplement benchmarks and indicators in light of evolving technological developments.”[ref 192] The California Senate Bill 1047 would have created the Frontier Model Division within the Government Operations Agency and authorized it to “update both of the [compute] thresholds in the definition of a ‘covered model’ to ensure that it accurately reflects technological developments, scientific literature, and widely accepted national and international standards and applies to artificial intelligence models that pose a significant risk of causing or materially enabling critical harms.”[ref 193]
Regulators may need to update compute thresholds rapidly. Historically, failure to quickly update regulatory definitions in the context of emerging technologies has led to definitions becoming useless or even counterproductive.[ref 194] In the field of AI, developments may occur quickly and with significant implications for national security and public health, making responsive rulemaking particularly important. In the United States, there are several statutory tools to authorize and encourage expedited and regular rulemaking.[ref 195] For example, Congress could expressly authorize interim or direct final rulemaking, which would enable an agency to shift the comment period in notice-and-comment rulemaking to take place after the rule has already been promulgated, thereby allowing it to respond quickly to new developments.[ref 196]
Policymakers could also require a periodic evaluation of whether compute thresholds are achieving their purpose, to ensure that they do not become over- or under-inclusive. While establishing and updating a compute threshold necessarily involves prospective ex ante impact assessment, in order to take precautions against risk without undue burdens, regulators can learn much from retrospective ex post analysis of current and previous thresholds.[ref 197] In a survey conducted for the Administrative Conference of the United States, “[a]ll agencies stated that periodic reviews have led to substative [sic] regulatory improvement at least some of time. This was more likely when the underlying evidence basis for the rule, particularly the science or technology, was changing.”[ref 198] While the optimal frequency of periodic review is unknown, the study found that U.S. federal agencies were more likely to conduct reviews when provided with a clear time interval (“at least every X years”).[ref 199]
Several further institutional and procedural factors could affect whether and how compute thresholds are updated. In order to effectively update compute thresholds and other criteria, regulators must have access to expertise and talent through hiring, training, consultation and collaboration, and other avenues that facilitate access to experts from academia and industry.[ref 200] Decisions will be informed by the availability of data, including scientific and commercial data, to enable ongoing monitoring, learning, analysis, and adaptation in light of new developments. Decision-making procedures, agency design, and influence and pressures from policymakers, developers, and other stakeholders will likewise affect updates, among many other factors.[ref 201] While more analysis is beyond the scope of this Article, others have explored procedural and substantive measures for adaptive regulation[ref 202] and effective governance of emerging technologies.[ref 203]
Some have proposed defining compute thresholds in terms of effective compute,[ref 204] as an alternative to updates over time. Effective compute could be indexed to a particular year (similar to inflation adjustments) and thus account for the role of algorithmic progress (e.g., 1e25 FLOP of 2023-level effective compute).[ref 205] However, there is no agreed-upon way to more precisely define and calculate effective compute, and the ability to do so depends on the challenging task of calculating algorithmic efficiency, including choosing a performance metric to anchor on. Furthermore, effective compute alone would fail to address potential changes in the risk landscape, such as the development of protective measures.
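A minimal sketch of the idea, assuming for illustration the roughly nine-month doubling time for algorithmic efficiency cited in Section I.C, shows how a raw compute figure might be converted into reference-year effective compute; the rate and the indexing convention are assumptions, not settled practice.

# Illustrative effective-compute adjustment, indexed to a reference year.
# Assumes (per the estimate cited earlier in this Article) that algorithmic
# progress is equivalent to a doubling of training compute roughly every
# 9 months; the exact rate, and how to measure it, remain unsettled.

DOUBLING_MONTHS = 9.0

def effective_compute(raw_flop: float, months_after_reference: float) -> float:
    """Express a later training run in reference-year 'effective FLOP'."""
    return raw_flop * 2 ** (months_after_reference / DOUBLING_MONTHS)

# A hypothetical 1e25 FLOP run carried out 18 months after the reference year
# would count as roughly 4e25 FLOP of reference-year effective compute.
print(f"{effective_compute(1e25, 18):.1e}")   # -> 4.0e+25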
E. What Are the Advantages and Limitations of a Training Compute Threshold?
Compute has several properties that make it attractive for policymaking: it is (1) correlated with capabilities and thus risk, (2) essential for training, with thresholds that are difficult to circumvent without reducing performance, (3) an objective and quantifiable measure, (4) capable of being estimated before training, (5) externally verifiable after training, and (6) a significant cost during development and thus indicative of developer resources. However, training compute thresholds are not infallible: (1) training compute is an imprecise indicator of potential risk, (2) a compute threshold could be circumvented, and (3) there is no industry standard for measuring and reporting training compute.[ref 206] Some of these limitations can be addressed with thoughtful drafting, including clear language, alternative and supplementary elements for defining what models are within scope, and authority to update any compute threshold and other criteria in light of future developments.
First, training compute is correlated with model capabilities and associated risks. Scaling laws predict an increase in performance as training compute increases, and real-world capabilities generally follow (Section I.C). As models become more capable, they may also pose greater risks if they are misused or misaligned (Section I.D). However, training compute is not a precise indicator of downstream capabilities. Capabilities can seemingly emerge abruptly and discontinuously as models are developed with more compute,[ref 207] and the open-ended nature of foundation models means those capabilities may go undetected.[ref 208] Post-training enhancements such as fine-tuning are often not considered a part of training compute, yet they can dramatically improve performance and capabilities with far less compute. Furthermore, not all models with dangerous capabilities require large amounts of training compute; low-compute models with capabilities in certain domains, such as biology or chemistry, may also pose significant risks, such as biological design tools that could be used for drug discovery or the creation of pathogens worse than any seen to date.[ref 209] The market may shift towards these smaller, cheaper, more specialized models,[ref 210] and even general-purpose low-compute models may come to pose significant risks. Given these limitations, a training compute threshold cannot capture all possible risks; however, for large, general-purpose AI models, training compute can act as an initial threshold for capturing emerging capabilities and risks.
Second, compute is necessary throughout the AI lifecycle, and a compute threshold would be difficult to circumvent. There is no AI without compute (Section I.A). Due to its relationship with model capabilities, training compute cannot be easily reduced without a corresponding reduction in capabilities, making it difficult to circumvent for developers of the most advanced models. Nonetheless, companies might find “creative ways” to account for how much compute is used for a given system in order to avoid being subject to stricter regulation.[ref 211] To reduce this risk, some have suggested monitoring compute usage below these thresholds to help identify circumvention methods, such as structuring techniques or outsourcing.[ref 212] Others have suggested using compute thresholds alongside additional criteria, such as the model’s performance on benchmarks, financial or energy cost, or level of integration into society.[ref 213] As in other fields, regulatory burdens associated with compute thresholds could encourage regulatory arbitrage if a policy does not or cannot effectively account for that possibility.[ref 214] For example, since compute can be accessed remotely via digital means, data centers and compute providers could move to less-regulated jurisdictions.
Third, compute is an objective and quantifiable metric that is relatively straightforward to measure. Compute is a quantitative measure that reflects the number of mathematical operations performed. It does not depend on specific infrastructure and can be compared across different sets of hardware and software.[ref 215] By comparison, other metrics, such as algorithmic innovation and data, have been more difficult to track.[ref 216] Whereas quantitative metrics like compute can be readily compared across different instances, the qualitative nature of many other metrics makes them more subject to interpretation and difficult to consistently measure. Compute usage can be measured internally with existing tools and systems; however, there is not yet an industry standard for measuring, auditing, and reporting the use of computational resources.[ref 217] That said, there have been some efforts toward standardization of compute measurement.[ref 218] In the absence of a standard, some have instead presented a common framework for calculating compute, based on information about the hardware used and training time.[ref 219]
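To make the hardware-based accounting framework concrete, a rough estimate of total training compute can be produced from the number of chips used, their peak throughput, an assumed utilization rate, and the duration of training. The sketch below uses Python and entirely hypothetical figures chosen only for illustration.

# Hypothetical illustration: estimating training compute from hardware and training time.
# All figures below are assumptions for the sake of example, not real reported values.
num_chips = 1000                    # accelerators used in the training run (assumed)
peak_flops_per_chip = 3e14          # peak throughput per chip in FLOP/s (assumed)
utilization = 0.4                   # assumed fraction of peak throughput actually achieved
training_days = 30
training_seconds = training_days * 24 * 60 * 60

total_training_flop = num_chips * peak_flops_per_chip * utilization * training_seconds
print(f"Estimated training compute: {total_training_flop:.2e} FLOP")
# With these assumed inputs, the estimate is roughly 3e23 FLOP, well below a 1e26 threshold.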
Fourth, compute can be estimated ahead of model development and deployment. Developers already estimate training compute with information about the model’s architecture and amount of training data, as part of planning before training takes place. The EU AI Act recognizes this, noting that “training of general-purpose AI models takes considerable planning which includes the upfront allocation of compute resources and, therefore, providers of general-purpose AI models are able to know if their model would meet the threshold before the training is completed.”[ref 220] Since compute can be readily estimated before a training run, developers can plan a model with existing policies in mind and implement appropriate precautions during training, such as cybersecurity measures.
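One rule of thumb sometimes used for such pre-training estimates of dense transformer models (a simplification, and only one of several possible methods) approximates training compute as roughly 6 × N × D, where N is the number of model parameters and D is the number of training tokens. A minimal sketch with hypothetical planning figures follows.

# Rough pre-training estimate using the ~6*N*D approximation for dense transformer models.
# Parameter and token counts below are hypothetical planning figures, not real values.
n_parameters = 7e10         # planned model size: 70 billion parameters (assumed)
n_training_tokens = 1.5e13  # planned dataset size: 15 trillion tokens (assumed)

estimated_flop = 6 * n_parameters * n_training_tokens
print(f"Planned training compute: {estimated_flop:.1e} FLOP")   # roughly 6.3e24 FLOP

threshold = 1e26  # e.g., the reporting threshold used in EO 14110
print("Exceeds threshold:", estimated_flop >= threshold)        # False under these assumptions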
Fifth, the amount of compute used could be externally verified after training. While laws that use compute thresholds as a trigger for additional measures could depend on self-reporting, meaningful enforcement requires regulators to be aware of or at least able to verify the amount of compute being used. A regulatory threshold will be ineffective if regulators have no way of knowing whether a threshold has been reached. For this reason, some scholars have proposed that developers and compute providers be required to report the amount of compute used at different stages of the AI lifecycle.[ref 221] Compute providers already employ chip-hours for client billing, which could be used to calculate total computational operations,[ref 222] and the centralization of a few key cloud providers could make monitoring and reporting requirements simpler to administer.[ref 223] Others have proposed using “on-chip” or “hardware-enabled governance mechanisms” to verify claims about compute usage.[ref 224]
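As an illustration of how such ex post verification might work in practice (the tolerance, throughput, and utilization figures below are assumptions, not values drawn from any actual reporting regime), a regulator could convert billed chip-hours into an implied compute figure and compare it against a developer’s self-reported total.

# Hypothetical cross-check of self-reported compute against compute provider billing data.
billed_chip_hours = 2.0e7          # chip-hours invoiced by the compute provider (assumed)
peak_flops_per_chip = 3e14         # peak throughput per chip in FLOP/s (assumed)
assumed_utilization = 0.4          # assumed average utilization of peak throughput

implied_flop = billed_chip_hours * 3600 * peak_flops_per_chip * assumed_utilization
reported_flop = 9.0e24             # figure self-reported by the developer (assumed)

# Flag a discrepancy if the reported figure differs from the implied figure by more than 2x.
ratio = reported_flop / implied_flop
print(f"Implied: {implied_flop:.1e} FLOP, reported: {reported_flop:.1e} FLOP")
print("Within tolerance:", 0.5 <= ratio <= 2.0)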
Sixth, training compute is an indicator of developer resources and capacity to comply with regulatory requirements, as it represents a substantial financial investment.[ref 225] For instance, Sam Altman reported that the development of GPT-4 cost “much more” than $100 million.[ref 226] Researchers have estimated that Gemini Ultra cost $70 million to $290 million to develop.[ref 227] A regulatory approach based on training compute thresholds can therefore be used to subject only the best-resourced AI developers to increased regulatory scrutiny, while avoiding overburdening small companies, academics, and individuals. Over time, the cost of compute will most likely continue to fall, meaning the same thresholds will capture more developers and models. To ensure that the law remains appropriately scoped, compute thresholds can be complemented by additional metrics, such as the cost of compute or development. For example, the vetoed California Senate Bill 1047 was amended to include a compute cost threshold, defining a “covered model” as one trained using more than 1e26 OP, but only if the cost of that training compute exceeded $100,000,000 at the start of training.[ref 228]
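The amended SB 1047 definition described above can be read as a conjunction of two conditions, which the following simplified sketch illustrates; it uses the figures reported in the bill but is not a statement of the bill’s full legal test.

# Simplified sketch of a dual compute-and-cost threshold, modeled loosely on the
# amended (and ultimately vetoed) California SB 1047 definition of a "covered model".
COMPUTE_THRESHOLD_OP = 1e26       # operations used in training
COST_THRESHOLD_USD = 100_000_000  # cost of that compute at the start of training

def is_covered_model(training_compute_op: float, training_compute_cost_usd: float) -> bool:
    """Covered only if BOTH the compute and the cost thresholds are exceeded."""
    return (training_compute_op > COMPUTE_THRESHOLD_OP
            and training_compute_cost_usd > COST_THRESHOLD_USD)

# Example: a model trained with 2e26 OP, but at a cost of $60 million (e.g., because
# compute prices have fallen), would not be covered under this definition.
print(is_covered_model(2e26, 6.0e7))   # False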
At the time of writing, many consider compute thresholds to be the best option currently available for determining which AI models should be subject to regulation, although the limitations of this approach underscore the need for careful drafting and adaptive governance. When considering the legal obligations imposed, the specific compute threshold should correspond to the nature and extent of additional scrutiny and other requirements and reflect the fact that compute is only a proxy for, and not a precise measure of, risk.
F. How Do Compute Thresholds Compare to Capability Evaluations?
A regulatory approach that uses a capabilities-based threshold or evaluation may seem more intuitively appealing and has been proposed by many.[ref 229] There are currently two main types of capability evaluations: benchmarking and red-teaming.[ref 230] In benchmarking, a model is tested on a specific dataset and receives a numerical score. In red-teaming, evaluators can use different approaches to identify vulnerabilities and flaws in a system, such as through prompt injection attacks to subvert safety guardrails. Model evaluations like these already serve as the basis for responsible scaling policies, which specify what protective measures an AI developer must implement in order to safely handle a given level of capabilities. Responsible scaling policies have been adopted by companies like Anthropic, OpenAI, and Google, and policymakers have also encouraged their development and practice.[ref 231]
Capability evaluations can complement compute thresholds. For example, capability evaluations could be required for models exceeding a compute threshold that indicates that dangerous capabilities might exist. They could also be used as an alternative route to being covered by regulation. The EU AI Act adopts the latter approach, complementing the compute threshold with the possibility for the European Commission to “take individual decisions designating a general-purpose AI model as a general-purpose AI model with systemic risk if it is found that such model has capabilities or an impact equivalent to those captured by the set threshold.”[ref 232]
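The complementary structure just described can be sketched as a simple disjunction between a compute-based route and a designation-based route; the sketch below is an illustrative simplification of the EU AI Act’s approach, not a restatement of its legal test.

# Illustrative sketch of a two-route scoping rule, loosely modeled on the EU AI Act's
# treatment of general-purpose AI models with systemic risk. Not the actual legal test.
SYSTEMIC_RISK_COMPUTE_THRESHOLD = 1e25  # FLOP figure used in the Act

def has_systemic_risk(training_compute_flop: float, designated_by_commission: bool) -> bool:
    # Route 1: the model exceeds the compute threshold.
    # Route 2: the Commission designates the model based on equivalent capabilities or impact.
    return training_compute_flop > SYSTEMIC_RISK_COMPUTE_THRESHOLD or designated_by_commission

print(has_systemic_risk(3e25, False))   # True via the compute route
print(has_systemic_risk(5e24, True))    # True via designation despite lower compute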
Nonetheless, there are several downsides to depending on capabilities alone. First, model capabilities are difficult to measure.[ref 233] Benchmark results can be affected by factors other than capabilities, such as benchmark data being included during training[ref 234] and model sensitivity to small changes in prompting.[ref 235] Downstream capabilities of a model may also differ from those during evaluation due to changes in dataset distribution.[ref 236] Some threats, such as misuse of a model to develop a biological weapon, may be particularly difficult to evaluate due to the domain expertise required, the sensitivity of information related to national security, and the complexity of the task.[ref 237] For dangerous capabilities such as deception and manipulation, the nature of the capability makes it difficult to assess,[ref 238] although some evaluations have already been developed.[ref 239] Furthermore, while evaluations can point to what capabilities do exist, it is far more difficult to prove that a model does not possess a given capability. Over time, new capabilities may even emerge and improve due to prompting techniques, tools, and other post-training enhancements.
Second, and compounding the issue, there is no standard method for evaluating model capabilities.[ref 240] While benchmarks allow for comparison across models, there are competing benchmarks for similar capabilities; with none adopted as standard by developers or the research community, evaluators could select different benchmark tests entirely.[ref 241] Red-teaming, while more in-depth and responsive to differences in models, is even less standardized and provides less comparable results. Similarly, no standard exists for when during the AI lifecycle a model is evaluated, even though fine-tuning and other post-training enhancements can have a significant impact on capabilities. Nevertheless, there have been some efforts toward standardization, including work by the U.S. National Institute of Standards and Technology to develop guidelines and benchmarks for evaluating AI capabilities, among them guidance on red-teaming.[ref 242]
Third, it is much more difficult to externally verify model evaluations. Since evaluation methods are not standardized, different evaluators and methods may come to different conclusions, and even a small difference could determine whether a model falls within the scope of regulation. This makes external verification simultaneously more important and more challenging. In addition to the technical challenge of how to consistently verify model evaluations, there is also a practical challenge: certain methods, such as red-teaming and audits, depend on far greater access to a model and information about its development. Developers have been reluctant to grant permissive access,[ref 243] which has contributed to numerous calls to mandate external evaluations.[ref 244]
Fourth, model evaluations may be circumvented. For red-teaming and more comprehensive audits, evaluations for a given model may reasonably reach different conclusions, which allows room for an evaluator to deliberately shape results through their choice of methods and interpretation. Careful institutional design is needed to ensure that evaluations are robust to conflicts of interest, perverse incentives, and other limitations.[ref 245] If known benchmarks are used to determine whether a model is subject to regulation, developers might train models to achieve specific scores without affecting capabilities, whether to improve performance on safety measures or strategically underperform on certain measures of dangerous capabilities.
Finally, capability evaluations entail more uncertainty and expense. Currently, the capabilities of a model can only reliably be determined ex post,[ref 246] making it difficult for developers to predict whether it will fall within the scope of applicable law. More in-depth model evaluations such as red-teaming and audits are expensive and time-consuming, which may constrain small organizations, academics, and individuals.[ref 247]
Capability evaluations can thus be viewed as a complementary tool for estimating model risk. Training compute makes an excellent initial threshold for regulatory oversight because it is an objective and quantifiable measure that can be estimated prior to training and verified afterwards; capabilities, however, correspond more closely to risk. Capability evaluations provide more information and can be completed after fine-tuning and other post-training enhancements, but are more expensive, difficult to carry out, and less standardized. Both are important components of AI governance but serve different roles.
IV. Conclusion
More powerful AI could bring transformative changes in society. It promises extraordinary opportunities and benefits across a wide range of sectors, with the potential to improve public health, make new scientific discoveries, improve productivity and living standards, and accelerate economic growth. However, the very same advanced capabilities could result in tremendous harms that are difficult to control or remedy after they have occurred. AI could fail in critical infrastructure, further concentrate wealth and increase inequality, or be misused for more effective disinformation, surveillance, cyberattacks, and development of chemical and biological weapons.
In order to prevent these potential harms, laws that govern AI must identify models that pose the greatest threat. The obvious answer would be to evaluate the dangerous capabilities of frontier models; however, state-of-the-art model evaluations are subjective and unable to reliably predict downstream capabilities, and they can take place only after the model has been developed with a substantial investment.
This is where training compute thresholds come into play. Training compute can operate as an initial threshold for estimating the performance and capabilities of a model and, thus, the potential risk it poses. Despite its limitations, it may be the most effective option we have to identify potentially dangerous AI that warrants further scrutiny. However, compute thresholds alone are not sufficient. They must be used alongside other tools to mitigate and respond to risk, such as capability evaluations, post-market monitoring, and incident reporting. Further research could help develop better governance via compute thresholds; open questions include:
- What amount of training compute corresponds to future systems of concern? What threshold is appropriate for different regulatory targets, and how can we identify that threshold in advance? What are the downstream effects of different compute thresholds?
- Are compute thresholds appropriate for different stages of the AI lifecycle? For example, could thresholds for compute used for post-training enhancements or during inference be used alongside a training compute threshold, given the ability to significantly improve capabilities at these stages?
- Should domain-specific compute thresholds be established, and if so, to address which risks? If domain-specific compute thresholds are established, such as in President Biden’s Executive Order 14,110, how can competent authorities determine if a system is domain-specific and verify the training data?
- How should compute usage be reported, monitored, and audited?
- How should a compute threshold be updated over time? What is the likelihood of future frontier systems being developed using less (or far less) compute than is used today? Does growth or slowdown in compute usage, hardware improvement, or algorithmic efficiency warrant an update, or should it correspond solely to an increase in capabilities? Relatedly, what kind of framework would allow a regulatory agency to respond to developments effectively (e.g., with adequate information and the ability to update rapidly)?
- How could a capabilities-based threshold complement or replace a compute threshold, and what would be necessary (e.g., improved model evaluations for dangerous capabilities and alignment)?
- How should the law mitigate risks from AI systems that sit below the training compute threshold?
What should be internationalised in AI governance?
Abstract
As artificial intelligence (AI) advances, states increasingly recognise the need for international governance to address shared benefits and challenges. However, international cooperation is complex and costly, and not all AI issues require cooperation at the international level. This paper presents a novel framework to identify and prioritise AI governance issues warranting internationalisation. We analyse nine critical policy areas across data, compute, and model governance using four factors which broadly incentivise states to internationalise governance efforts: cross-border externalities, regulatory arbitrage, uneven governance capacity, and interoperability. We find strong benefits of internationalisation in compute-provider oversight, content provenance, model evaluations, incident monitoring, and risk management protocols. In contrast, the benefits of internationalisation are lower or mixed in data privacy, data provenance, chip distribution, and bias mitigation. These results can guide policymakers and researchers in prioritising international AI governance efforts.
The governance misspecification problem
Abstract
Legal rules promulgated to govern emerging technologies often rely on proxy terms and metrics in order to indirectly effectuate background purposes. A common failure mode for this kind of rule occurs when, due to incautious drafting or unforeseen technological developments, a proxy ceases to function as intended and renders a rule ineffective or counterproductive. Borrowing a concept from the technical AI safety literature, we call this phenomenon the “governance misspecification problem.” This article draws on existing legal-philosophical discussions of the nature of rules to define governance misspecification, presents several historical case studies to demonstrate how and why rules become misspecified, and suggests best practices for designing legal rules to avoid misspecification or mitigate its negative effects. Additionally, we examine a few proxy terms used in existing AI governance regulations, such as “frontier AI” and “compute thresholds,” and discuss the significance of the problem of misspecification in the AI governance context.
In technical Artificial Intelligence (“AI”) safety research, the term “specification” refers to the problem of defining the purpose of an AI system so that the system behaves in accordance with the true wishes of its designer.[ref 1] Technical researchers have suggested three categories of specification: “ideal specification,” “design specification,” and “revealed specification.”[ref 2] The ideal specification, in this framework, is a hypothetical specification that would create an AI system completely and perfectly aligned with the desires of its creators. The design specification is the specification that is actually used to build a given AI system. The revealed specification is the specification that best describes the actual behavior of the completed AI system. “Misspecification” occurs whenever the revealed specification of an AI system diverges from the ideal specification—i.e., when an AI system does not perform in accordance with the intentions of its creators.
The fundamental problem of specification is that “it is often difficult or infeasible to capture exactly what we want an agent to do, and as a result we frequently end up using imperfect but easily measured proxies.”[ref 3] Thus, in a famous example from 2016, researchers at OpenAI attempted to train a reinforcement learning agent to play the boat-racing video game CoastRunners, the goal of which is to finish a race quickly and ahead of other players.[ref 4] Instead of basing the AI agent’s reward function on how it placed in the race, however, the researchers used a proxy goal that was easier to implement and rewarded the agent for maximizing the number of points it scored. The researchers mistakenly assumed that the agent would pursue this proxy goal by trying to complete the course quickly. Instead, the AI discovered that it could achieve a much higher score by refusing to complete the course and instead driving in tight circles in such a way as to repeatedly collect a series of power-ups while crashing into other boats and occasionally catching on fire.[ref 5] In other words, the design specification (“collect as many points as possible”) did not correspond well to the ideal specification (“win the race”), leading to a disastrous and unexpected revealed specification (crashing repeatedly and failing to finish the race).
This article applies the misspecification framework to the problem of AI governance. The resulting concept, which we call the “governance misspecification problem,” can be briefly defined as occurring when a legal rule relies unsuccessfully on proxy terms or metrics. By framing this new concept in terms borrowed from the technical AI safety literature, we hope to incorporate valuable insights from that field into legal-philosophical discussions around the nature of rules and, importantly, to help technical researchers understand the philosophical and policymaking challenges that AI governance legislation and regulation poses.
It is generally accepted among legal theorists that at least some legal rules can be said to have a purpose or purposes and that these purposes should inform the interpretation of textually ambiguous rules.[ref 6] The least ambitious version of this claim is simply an acknowledgment of the fact that statutes often contain a discrete textual provision entitled “Purpose,” which is intended to inform the interpretation and enforcement of the statute’s substantive provisions.[ref 7] More controversially, some commentators have argued that all or many legal rules have, or should be constructively understood as having, an underlying “true purpose,” which may or may not be fully discoverable and articulable.[ref 8]
The purpose of a legal rule is analogous to the “ideal specification” discussed in the technical AI safety literature. Like the ideal specification of an AI system, a rule’s purpose may be difficult or impossible to perfectly articulate or operationalize, and rulemakers may choose to rely on a legal regime that incorporates “imperfect but easily measured proxies”—essentially, a design specification. “Governance misspecification” occurs when the real-world effects of the legal regime (analogous to the design specification) as interpreted and enforced (analogous to the revealed specification) fail to effectuate the rule’s intended purpose (analogous to the ideal specification).
Consider the hypothetical legal rule prohibiting the presence of “vehicles” in a public park, famously described by the legal philosopher H.L.A. Hart.[ref 9] The term “vehicles,” in this rule, is presumably a proxy term intended to serve some ulterior purpose,[ref 10] although fully discovering and articulating that purpose may be infeasible. For example, the rule might be intended to ensure the safety of pedestrians in the park, or to safeguard the health of park visitors by improving the park’s air quality, or to improve the park’s atmosphere by preventing excessive noise levels. More realistically, the purpose of the rule might be some complex weighted combination of all of these and numerous other more or less important goals. Whether the rule is misspecified depends on whether the rule’s purpose, whatever it is, is furthered by the use of the proxy term “vehicle.”
Hart used the “no vehicles in the park” rule in an attempt to show that the word “vehicle” had a core of concrete and settled linguistic meaning (an automobile is a vehicle) as well as a semantic “penumbra” containing more or less debatable cases such as bicycles, toy cars, and airplanes. The rule, in other words, is textually ambiguous, although this does not necessarily mean that it is misspecified.[ref 11] Because the rule is ambiguous, a series of difficult interpretive decisions may have to be made regarding whether a given item is or is not a vehicle. At least some of these decisions, and the costs associated with them, could have been avoided if the rulemaker had chosen to use a more detailed formulation in lieu of the term “vehicle,”[ref 12] or if the rulemaker had issued a statement clarifying the purpose of the rule.[ref 13]
Although the concept of misspecification is generally applicable to legal rules, misspecification tends to occur particularly frequently and with serious consequences in the context of laws and regulations governing poorly-understood emerging technologies such as artificial intelligence. Again, consider “no vehicles in the park.” Many legal rules, once established, persist indefinitely even as the technology they govern changes fundamentally.[ref 14] The objects to which the proxy term “vehicle” can be applied will change over time; electric wheelchairs, for example, may not have existed when the rule was originally drafted, and airborne drones may not have been common. The introduction of these new potential “vehicles” is extremely difficult to account for in an original design specification.[ref 15]
The governance misspecification problem is particularly relevant to the governance of AI systems. Unlike most other emerging technologies, frontier AI systems are, in key respects, not only poorly understood but fundamentally uninterpretable by existing methods.[ref 16] This problem of interpretability is a major focus area for technical AI safety researchers.[ref 17] The widespread use of proxy terms and metrics in existing AI governance policies and proposals is, therefore, a cause for concern.[ref 18]
In Section I, this article draws on existing legal-philosophical discussions of the nature of rules to further explain the problem of governance misspecification and situates the concept in the existing public policy literature. Sections II and III make the case for the importance of the problem by presenting a series of case studies to show that rules aimed at governing emerging technologies are often misspecified and that misspecified rules can cause serious problems for the regulatory regime they contribute to, for courts, and for society generally. Section IV offers a few suggestions for reducing the risk of and mitigating the harm from misspecified rules, including eschewing or minimizing the use of proxy terms, rapidly updating and frequently reviewing the effectiveness of regulations, and including specific and clear statements of the purpose of a legal rule in the text of the rule. Section V applies the conclusions of the previous Sections prospectively to several specific challenges in the field of AI governance, including the use of compute thresholds, semiconductor export controls, and the problem of defining “frontier” AI systems. Section VI concludes.
I. The Governance Misspecification Problem in Legal Philosophy and Public Policy
A number of publications in the field of legal philosophy have discussed the nature of legal rules and arrived at conclusions helpful to fleshing out the contours of the governance misspecification problem.[ref 19] Notably, Schauer (1991) suggests the useful concepts of over- and under-inclusiveness, which can be understood as two common ways in which legal rules can become misspecified.[ref 20] Overinclusive rules prohibit or prescribe actions that an ideally specified rule would not apply to, while underinclusive rules fail to prohibit or prescribe actions that an ideally specified rule would apply to. So, in Hart’s “no vehicles in the park” hypothetical, suppose that the sole purpose of the rule was to prevent park visitors from being sickened by diesel fumes. If this were the case, the rule would be overinclusive, because it would pointlessly prohibit many vehicles that do not emit diesel fumes. If, on the other hand, the purpose of the rule was to prevent music from being played loudly in the park on speakers, the rule would be underinclusive, as it fails to prohibit a wide range of speakers that are not installed in a vehicle.
Ideal specification is rarely feasible, and practical considerations may dictate that a well-specified rule should rely on proxy terms that are under- or overinclusive to some extent. As Schauer (1991) explains, “Speed Limit 55” is a much easier rule to follow and enforce consistently than “drive safely,” despite the fact that the purpose of the speed limit is to promote safe driving and despite the fact that some safe driving can occur at speeds above 55 miles per hour and some dangerous driving can occur at speeds below 55 miles per hour.[ref 21] In other words, the benefits of creating a simple and easily followed and enforced rule outweigh the costs of over- and under-inclusiveness in many cases.[ref 22]
In the public policy literature, the existing concept that bears the closest similarity to governance misspecification is “policy design fit.”[ref 23] Policy design is currently understood as including a mix of interrelated policy goals and the instruments through which those goals are accomplished, including legal, financial, and communicative mechanisms.[ref 24] A close fit between policy goals and the means used to accomplish those goals has been shown to increase the effectiveness of policies.[ref 25] The governance misspecification problem can be understood as a particular species of failure of policy design fit—a failure of congruence between a policy goal and a proxy term in the legal rule which is the means used to further that goal.[ref 26]
II. Legal Rules Governing Emerging Technologies Are Often Misspecified
Misspecification occurs frequently in both domestic and international law and in both reactive and anticipatory regulations directed at new technologies. In order to illustrate how misspecification happens, and to give a sense of the significance of the problem in legal rules addressing emerging technologies, this Section discusses three historical examples of the phenomenon in the contexts of cyberlaw, copyright law, and nuclear arms control treaties.
Section 1201(a)(2) of the Digital Millennium Copyright Act of 1998 (DMCA) prohibits the distribution of any “technology, product, service, device, component, or part thereof” primarily designed to decrypt copyrighted material.[ref 27] Congressman Howard Coble, one of the architects of the DMCA, stated that this provision was “drafted carefully to target ‘black boxes’”—physical devices with “virtually no legitimate uses,” useful only for facilitating piracy.[ref 28] The use of “black boxes” for the decryption of digital works was not widespread in 1998, but the drafters of the DMCA predicted that such devices would soon become an issue. In 1998, this prediction seemed a safe bet, as previous forms of piracy decryption had relied on specialized tools—the phrase “black box” is a reference to one such tool, also known as a “descrambler” and used to decrypt premium cable television channels.[ref 29]
However, the feared black boxes never arrived. Instead, pirates relied on software, using decryption programs distributed for free online to circumvent anti-piracy encryptions.[ref 30] Courts found the distribution of such programs, and even the posting of hyperlinks leading to websites containing such programs, to be violations of the DMCA.[ref 31] In light of earlier cases holding that computer code was a form of expression entitled to First Amendment protection, this interpretation placed the DMCA into tension with the First Amendment.[ref 32] This tension was ultimately resolved in favor of the DMCA, and the distribution of decryption programs used for piracy was prohibited.[ref 33]
No one in Congress anticipated that the statute which had been “carefully drafted to target ‘black boxes’” would be used to prohibit the distribution of lines of computer code, or that this would raise serious concerns regarding freedom of speech. Section 1201(a)(2), in other words, was misspecified; by prohibiting the distribution of any “technology” or “service” designed for piracy, as well as any “device,” the framers of the DMCA banned more than they intended to ban and created unforeseen constitutional issues.
Misspecification also occurs in international law. The Treaty on Principles Governing the Activities of States in the Exploration and Use of Outer Space, which the United States and the Soviet Union entered into in 1967, obligated the parties “not to place in orbit around the Earth any objects carrying nuclear weapons…”[ref 34] Shortly after the treaty was entered into, however, it became clear that the Soviet Union planned to take advantage of a loophole in the misspecified prohibition. The Fractional Orbital Bombardment System (FOBS) placed missiles into orbital trajectories around the earth, but then redirected them to strike a target on the earth’s surface before they completed a full orbit.[ref 35] An object is not “in orbit” until it has circled the earth at least once; therefore, FOBS did not violate the 1967 Treaty, despite the fact that it allowed the Soviet Union to strike at the U.S. from space and thereby evade detection by the U.S.’s Ballistic Missile Early Warning System.[ref 36] The U.S. eventually neutralized this advantage by expanding the coverage and capabilities of early warning systems so that FOBS missiles could be detected and tracked, and in 1979 the Soviets agreed to a better-specified ban which prohibited “fractional orbital missiles” as well as other space-based weapons.[ref 37] Still, the U.S.’s agreement to use the underinclusive proxy term “in orbit” allowed the Soviet Union to temporarily gain a potentially significant first-strike advantage.
Misspecification occurs in laws and regulations directed towards existing and well-understood technologies as well as in anticipatory regulations. Take, for example, the Computer Fraud and Abuse Act (CFAA), 18 U.S.C. § 1030, which has been called “the worst law in technology.”[ref 38] The CFAA was originally enacted in 1984, but has since been amended several times, most recently in 2020.[ref 39] Among other provisions, the CFAA criminalizes “intentionally access[ing] a computer without authorization or exceed[ing] authorized access, and thereby obtain[ing]… information from any protected computer.”[ref 40] The currently operative language for this provision was introduced in 1996,[ref 41] by which point the computer was hardly an emerging technology, and slightly modified in 2008.[ref 42]
Read literally, the CFAA’s prohibition on unauthorized access criminalizes both (a) violating a website’s terms of service while using the internet, and (b) using an employer’s computer or network for personal reasons, in violation of company policy.[ref 43] In other words, a literal reading of the CFAA would mean that hundreds of millions of Americans commit crimes every week by, e.g., sharing a password with a significant other or accessing social media at work.[ref 44] Court decisions eventually established narrower definitions of the key statutory terms (“without authorization” and “exceeds authorized access”),[ref 45] but not before multiple defendants were prosecuted for violating the CFAA by failing to comply with a website’s terms of service[ref 46] or accessing an employer’s network for personal reasons in violation of workplace rules.[ref 47]
Critics of the CFAA have discussed its flaws in terms of the constitutional law doctrines of “vagueness”[ref 48] and “overbreadth.”[ref 49] These flaws can also be conceptualized in terms of misspecification. The phrases “intentionally accesses without authorization” and “exceeds authorized access,” and the associated statutory definitions, are poor proxies for the range of behavior that an ideally specified version of the CFAA would have criminalized. The proxies criminalized a great deal of conduct that none of the stakeholders who drafted, advocated for, or voted to enact the law wanted to criminalize[ref 50] and created substantial legal and political backlash against the law. This backlash led to a series of losses for federal prosecutors as courts rejected their broad proposed interpretations of the key proxy terms because, as the Ninth Circuit Court of Appeals put it, “ubiquitous, seldom-prosecuted crimes invite arbitrary and discriminatory enforcement.”[ref 51] The issues caused by poorly selected proxy terms in the CFAA, the Outer Space Treaty, and the DMCA demonstrate that important legal rules drafted for the regulation of emerging technologies are prone to misspecification, in both domestic and international law contexts and for both anticipatory and reactive rules. These case studies were chosen because they are representative of how legal rules become misspecified; if space allowed, numerous additional examples of misspecified rules directed towards new technologies could be offered.[ref 52]
III. Consequences of Misspecification in the Regulation of Emerging Technologies
The case studies examined in the previous Section established that legal rules are often misspecified and illustrated the manner in which the problem of governance misspecification typically arises. This Section attempts to show that misspecification, when it occurs, can cause serious issues both for the regulatory regime that the misspecified rule is part of and for society writ large. Three potential consequences of misspecification are discussed and illustrated with historical examples involving the regulation of emerging technologies.
A. Underinclusive Rules Can Create Exploitable Gaps in a Regulatory Regime
When misspecification results in an underinclusive rule, exploitable gaps can arise in a regulatory regime. The Outer Space Treaty of 1967, discussed above, is one example of this phenomenon. Another example, which demonstrates how completely the use of a misspecified proxy term can defeat the effectiveness of a law, is the Audio Home Recording Act of 1992.[ref 53] That statute was designed to regulate home taping, i.e., the creation by consumers of analog or digital copies of musical recordings. The legal status of home taping had been a matter of debate for years, with record companies arguing that it was illegal and taping hardware manufacturers defending its legality.[ref 54] Congress attempted to resolve the debate by creating a safe harbor for home taping that allowed for the creation of any number of analog or digital copies of a piece of music, with the caveat that royalties would have to be paid as part of the purchase price of any equipment used to create digital copies.[ref 55]
Congress designed the AHRA under the assumption that digital audio tape recorders (DATs) were the wave of the future and would shortly become a ubiquitous home audio appliance.[ref 56] The statute lays out, in painstaking detail, a complex regulatory framework governing “digital audio recording devices,” which the statute defines to require the capability to create reproductions of “digital musical recordings.”[ref 57] Bizarrely, however, the AHRA explicitly provides that the term “digital musical recording” does not encompass recordings stored on any object “in which one or more computer programs are fixed”—i.e., computer hard drives.[ref 58]
Of course, the DAT did not become a staple of the American household. And when the RIAA sued the manufacturer of the “Rio,” an early mp3 player, for failing to comply with the AHRA’s requirements, the Ninth Circuit found that the device was not subject to the AHRA.[ref 59] Because the Rio was designed solely to download mp3 files from a computer hard drive, it was not capable of copying “digital musical recordings” under the AHRA’s underinclusive definition of that phrase.[ref 60] The court noted that its decision would “effectively eviscerate the Act,” because “[a]ny recording device could evade […] regulation simply by passing the music through a computer and ensuring that the MP3 file resided momentarily on the hard drive,” but nevertheless rejected the creative alternative interpretations suggested by the music industry as contrary to the plain language of the statute.[ref 61] As a result, the AHRA was rendered obsolete less than six years after being enacted.[ref 62]
Clearly, Congress acted with insufficient epistemic humility by creating legislation confidently designed to address one specific technology that had not, at the time of legislation, been adopted by any significant portion of the population. But this failure of humility manifested as a failure of specification. The purpose of the statute, as articulated in a Senate report, included the introduction of a “serial copy management system that would prohibit the digital serial copying of copyrighted music.”[ref 63] By crafting a law that applied only to “digital audio recording devices” and defining that proxy term in an insufficiently flexible way, Congress completely failed to accomplish those purposes. If the proxy in question had not been defined to exclude any recording acquired through a computer, the Rio and eventually the iPod might well have fallen under the AHRA’s royalty scheme, and music copyright law in the U.S. might have developed down a course more consistent with the ideal specification of the AHRA.
B. Overinclusive Rules Can Create Pushback and Enforcement Challenges
Misspecification can also create overinclusive rules, like the Computer Fraud and Abuse Act and § 1201(a)(2) of the Digital Millennium Copyright Act, discussed above in Section II. As those examples showed, overinclusive rules may give rise to legal and political challenges, difficulties with enforcement, and other unintended and undesirable results. These effects can, in some cases, be so severe that they require a total repeal of the rule in question.
This was the case with a 2011 Nevada statute authorizing and regulating driverless cars. AB511, which was the first law of its kind enacted in the U.S.,[ref 64] initially defined “autonomous vehicle” to mean “a motor vehicle that uses artificial intelligence, sensors and global positioning system coordinates to drive itself without the active intervention of a human operator,” and further defined “artificial intelligence” to mean “the use of computers and related equipment to enable a machine to duplicate or mimic the behavior of human beings.”[ref 65]
Shortly after AB511 was enacted, however, several commentators noted that the statute’s definition of “autonomous vehicle” technically included vehicles that incorporated automatic collision avoidance or any of a number of other advanced driver-assistance systems common in new cars in 2011.[ref 66] These systems used computers to temporarily control the operation of a vehicle without the intervention of the human driver, so any vehicle that incorporated them was technically subject to the onerous regulatory scheme that Nevada’s legislature had intended to impose only on fully autonomous vehicles. In order to avoid effectively banning most new model cars, Nevada’s legislature was forced to repeal its new law and enact a replacement that incorporated a more detailed definition of “autonomous vehicle.”[ref 67]
C. Technological Change Can Repeatedly Render a Proxy Metric Obsolete
Finally, a misspecified rule may lose its effectiveness over time as technological advances render it obsolete, necessitating repeated updates and patches to the fraying regulatory regime. Consider, for example, the export controls imposed on high performance computers in the 1990s. The purpose of these controls was to prevent the export of powerful computers to countries where they might be used in ways that threatened U.S. national security, such as to design missiles and nuclear weapons.[ref 68] The government placed restrictions on the export of “supercomputers” and defined “supercomputer” in terms of the number of millions of theoretical operations per second (MTOPS) the computer could perform.[ref 69] In 1991, “supercomputer” was defined to mean any computer capable of exceeding 195 MTOPS.[ref 70] As the 90s progressed, however, the processing power of commercially available computers manufactured outside of the U.S. increased rapidly, reducing the effectiveness of U.S. export controls.[ref 71] Restrictions that prevented U.S. companies from selling their computers globally imposed costs on the U.S. economy and harmed the international competitiveness of the restricted companies.[ref 72] The Clinton administration responded by raising the threshold at which export restrictions began to apply to 1500 MTOPS in 1994, to 7000 MTOPS in 1996, to 12,300 MTOPS in 1999, and three times in the year 2000 to 20,000, 28,000, and finally 85,000 MTOPS.[ref 73]
In the late 1990s, technological advances made it possible to link large numbers of commercially available computers together into “clusters” which could outperform most supercomputers.[ref 74] At this point, it was clear that MTOPS-based export controls were no longer effective, as computers that exceeded any limit imposed could easily be produced by anyone with access to a supply of less powerful computers which would not be subject to export controls.[ref 75] Even so, MTOPS-based export controls continued in force until 2006, when they were replaced by regulations that imposed controls based on performance in terms of Weighted TeraFLOPS, i.e., trillions of floating point operations per second.[ref 76]
Thus, while the use of MTOPS thresholds as proxies initially resulted in well-specified export controls that effectively prevented U.S. adversaries from acquiring supercomputers, rapid technological progress repeatedly rendered the controls overinclusive and necessitated a series of amendments and revisions. The end result was a period of nearly seven years during which the existing export controls were badly misspecified due to the use of a proxy metric, MTOPS, which no longer bore any significant relation to the regime’s purpose. During this period, the U.S. export control regime for high performance computers was widely considered to be ineffective and perhaps even counterproductive.[ref 77]
IV. Mitigating Risks from Misspecification
In light of the frequency with which misspecification occurs in the regulation of emerging technology and the potential severity of its consequences, this Section suggests a few techniques for designing legal rules in such a way as to reduce the risk of misspecification and mitigate its ill effects.
The simplest way to avoid misspecification is to eschew or minimize the use of proxy terms and metrics. This is not always practicable or desirable. “No vehicles in the park” is a better rule than “do not unreasonably annoy or endanger the safety of park visitors,” in part because it reduces the cognitive burden of following, enforcing, and interpreting the rule and reduces the risk of decision maker error by limiting the discretion of the parties charged with enforcement and interpretation.[ref 78] Nevertheless, there are successful legal rules that pursue their purposes directly. U.S. antitrust law, for example, grew out of the Sherman Antitrust Act,[ref 79] § 1 of which simply states that any combination or contract in restraint of trade “is declared to be illegal.”
Where use of a proxy is appropriate, it is often worthwhile to state explicitly that a proxy is being used, which reduces the likelihood that decision makers will fall victim to Goodhart’s law[ref 80] and treat the regulation of the proxy as an end in itself.[ref 81] Alternatively, the most direct way to avoid confusion regarding the underlying purpose of a rule is to simply include an explanation of the purpose in the text of the rule itself. This can be accomplished through the addition of a purpose clause (sometimes referred to as a legislative preamble or a policy statement). For example, one purpose of the Nuclear Energy Innovation and Modernization Act of 2019 is to “provide… a program to develop the expertise and regulatory processes necessary to allow innovation and the commercialization of advanced nuclear reactors.”
Purpose clauses can also incorporate language emphasizing that every provision of a rule should be construed in order to effectuate its purpose. This amounts to a legislatively prescribed rule of statutory interpretation, instructing courts to adopt a purposivist interpretive approach.[ref 82] When confronted with an explicit textual command to this effect, even strict textualists are obligated to interpret a rule purposively.[ref 83] The question of whether such an approach is generally desirable is hotly debated,[ref 84] but in the context of AI governance the flexibility that purposivism provides is a key advantage. The ability to flexibly update and adapt a rule in response to changes in the environment in which the rule will apply is unusually important in the regulation of emerging technologies.[ref 85] While there is little empirical evidence for or against the effectiveness of purpose clauses, they have played a key role in the legal reasoning relied on in a number of important court decisions.[ref 86]
A regulatory regime can also require periodic efforts to evaluate whether a rule is achieving its purpose.[ref 87] These efforts can provide an early warning system for misspecification by facilitating awareness of whether the proxy terms or metrics relied upon still correspond well to the purpose of the rule. Existing periodic review requirements are often ineffective,[ref 88] treated by agencies as box-checking activities rather than genuine opportunities for careful retrospective analysis of the effects of regulations.[ref 89] However, many experts continue to recommend well-implemented retrospective review requirements as an effective tool for improving policy decisions.[ref 90] The Administrative Conference of the United States has repeatedly pushed for increased use of retrospective review, as has the internationally-focused Organization for Economic Co-Operation and Development (OECD).[ref 91] Additionally, retrospective review of regulations often works well in countries outside of the U.S.[ref 92]
As the examples in Sections II and III demonstrate, rules governing technology tend to become misspecified over time as the regulated technology evolves. The Outer Space Treaty of 1967, § 1201(a)(2) of the DMCA, and the Clinton Administration’s supercomputer export controls were all well-specified and effective when adopted, but each measure became ineffective or counterproductive soon afterwards because the proxies relied upon became obsolete. Ideally, rulemaking would move at the pace of technological improvement, but there are a number of institutional and structural barriers to this sort of rapid updating of regulations. Notably, the Administrative Procedure Act requires a lengthy “notice and comment” process for rulemaking and a 30-day waiting period after publication of a regulation in the Federal Register before the regulation can go into effect.[ref 93] There are ways to waive or avoid these requirements, including regulating via the issuance of nonbinding guidance documents rather than binding rules,[ref 94] issuing an immediately effective “interim final rule” and then satisfying the APA’s requirements at a later time,[ref 95] waiving the publication or notice and comment requirements for “good cause,”[ref 96] or legislatively imposing regulatory deadlines.[ref 97] Many of these workarounds are limited in their scope or effectiveness, or vulnerable to legal challenges if pursued too ambitiously, but finding some way to update a regulatory regime quickly is critical to mitigating the damage caused by misspecification.[ref 98]
There is reason to believe that some agencies, recognizing the importance of AI safety to national security, will be willing to rapidly update regulations despite the legal and procedural difficulties. Consider the Commerce Department’s recent response to repeated attempts by semiconductor companies to design chips for the Chinese market that comply with U.S. export control regulations while still providing significant utility to purchasers in China looking to train advanced AI models. After Commerce initially imposed a license requirement on the export of advanced AI-relevant chips to China in October 2022, Nvidia modified its market-leading A100 and H100 chips to comply with the regulations and proceeded to sell the modified A800 and H800 chips in China.[ref 99] On October 17, 2023, the Commerce Department’s Bureau of Industry and Security announced a new interim final rule that would prohibit the sale of A800 and H800 chips in China and waived the normal 30-day waiting period so that the rule became effective less than a week after it was announced.[ref 100] Commerce Secretary Gina Raimondo stated publicly that “[i]f [semiconductor companies] redesign a chip around a particular cut line that enables them to do AI, I’m going to control it the very next day.”[ref 101]
V. The Governance Misspecification Problem and Artificial Intelligence
While the framework of governance misspecification is applicable to a wide range of policy measures, it is particularly well-suited to describing issues that arise regarding legal rules governing emerging technologies. H.L.A. Hart’s prohibition on “vehicles in the park” could conceivably have been framed by an incautious drafter who did not anticipate that using “vehicle” instead of some more detailed proxy term would create ambiguity. Avoiding this kind of misspecification is simply a matter of careful drafting. Suppose, however, that the rule was formulated at a point in time when “vehicle” was an appropriate proxy for a well-understood category of object, and the rule later became misspecified as new potential vehicles that had not been conceived of when the rule was drafted were introduced. A rule drafted at a historical moment when all vehicles move on either land or water is unlikely to adequately account for the issues created by airplanes or flying drones.[ref 102]
In other words, rules created to govern emerging technologies are especially prone to misspecification because they are created in the face of a high degree of uncertainty regarding the nature of the subject matter to be regulated, and rulemaking under uncertainty is difficult.[ref 103] Furthermore, as the case studies discussed in Sections II and III show, the nature of this difficulty is such that it tends to result in misspecification. For instance, misspecification will usually result when an overconfident rulemaker makes a specific and incorrect prediction about the future and issues an underinclusive rule based on that prediction. This was the case when Congress addressed the AHRA exclusively to digital audio tape recorders and ignored computers. Rules created by rulemakers who want to regulate a certain technology but have only a vague and uncertain understanding of the purpose they are pursuing are also likely to be misspecified.[ref 104] Hence the CFAA, which essentially prohibited “doing bad things with a computer,” with disastrous results.
The uncertainties associated with emerging technologies and the associated risk of misspecification increase when the regulated technology is poorly understood. Rulemakers may simply overlook something about the chosen proxy due to a lack of understanding of the proxy or the underlying technology, or due to a lack of experience drafting the kinds of regulations required. The first-of-its-kind Nevada law intended to regulate fully autonomous vehicles that accidentally regulated a broad range of features common in many new cars is an example of this phenomenon. So is the DMCA provision that was intended to regulate “black box” devices but, by its terms, also applied to raw computer code.
If the difficulty of making well-specified rules to govern emerging technologies increases when the technology is fast-developing and poorly understood, advanced AI systems are something of a perfect storm for misspecification problems. Cutting-edge deep learning AI systems differ from other emerging technologies in that their workings are poorly understood, not just by legislators and the public, but by their creators.[ref 105] Their capabilities are an emergent property of the interaction between their architecture and the vast datasets on which they are trained. Moreover, the opacity of these models is arguably different in kind from the unsolved problems associated with past technological breakthroughs, because the models may be fundamentally uninterpretable rather than merely difficult to understand.[ref 106] Under these circumstances, defining an ideal specification in very general terms may be simple enough, but designing legal rules to operationalize any such specification will require extensive reliance on rough proxies. This is fertile ground for misspecification.
There are a few key proxy terms that recur often in existing AI governance proposals and regulations. For example, a number of policy proposals have suggested that regulations should focus on “frontier” AI models.[ref 107] When Google, Anthropic, OpenAI, and Microsoft created an industry-led initiative to promote AI safety, they named it the Frontier Model Forum.[ref 108] Sam Altman, the CEO of OpenAI, has expressed support for regulating “frontier systems.”[ref 109] The government of the U.K. has established a “Frontier AI Taskforce” dedicated to evaluating risks “at the frontier of AI.”[ref 110]
In each of these proposals, the word “frontier” is a proxy term that stands for something like “highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety.”[ref 111] Any legislation or regulation that relied on the term “frontier” would also likely include a statutory definition of the word,[ref 112] but as several of the historical examples discussed in Sections II and III showed, statutory definitions can themselves incorporate proxies that result in misspecification. The above definition, for instance, may be underinclusive because some models that cannot be classified as “highly capable” or as “foundation models” might also pose severe risks to public safety.
The most significant AI-related policy measure that has been issued in the U.S. to date is Executive Order (EO) 14110 on the “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.”[ref 113] Among many other provisions, the EO imposes reporting requirements on certain AI models and directs the Department of Commerce to define the category of models to which the reporting requirements will apply.[ref 114] Prior to the issuance of Commerce’s definition, the EO provides that the reporting requirements apply to models “trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations,” as well as certain computing clusters.[ref 115] In other words, the EO uses operations as a proxy metric for determining which AI systems are sufficiently capable and/or dangerous that they should be regulated. This kind of metric, which is based on the amount of computing power used to train a model, is known as a “compute threshold” in the AI governance literature.[ref 116]
A proxy metric such as an operations-based compute threshold is almost certainly necessary to the operationalization of the EO’s regulatory scheme for governing frontier models.[ref 117] Even so, the example of the U.S. government’s ultimately ineffective and possibly counterproductive attempts to regulate exports of high-performance computers using MTOPS is a cautionary tale about how quickly a compute-based proxy can be rendered obsolete by technological progress. The price of computing resources has, historically, fallen rapidly, with the amount of compute available for a given sum of money doubling approximately every two years as predicted by Moore’s Law.[ref 118] Additionally, because of improvements in algorithmic efficiency, the amount of compute required to train a model to a given level of performance has historically decreased over time as well.[ref 119] Because of these two factors, the cost of training AI models to a given level of capability has fallen precipitously over time; for instance, between 2017 and 2021, the cost of training a rudimentary model to classify images correctly with 93% accuracy on the image database ImageNet fell from $1000 to $5.[ref 120] This phenomenon presents a dilemma for regulators: the cost of acquiring computational resources exceeding a given threshold will generally decrease over time even as the capabilities of models trained on a below-threshold amount of compute rise. In other words, any well-specified legal rule that uses a compute threshold is likely to be rendered both overinclusive and underinclusive soon after being implemented.
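To make this dynamic concrete, the following sketch (in Python) shows how a fixed compute threshold can drift: the dollar cost of a threshold-sized training run falls as hardware prices decline, while assumed algorithmic progress lets below-threshold runs match the capabilities of yesterday’s above-threshold runs. Every numeric parameter in the sketch is an illustrative assumption, not a figure drawn from the sources cited above.

```python
# Illustrative sketch (not drawn from the cited sources): how a fixed compute
# threshold drifts as hardware prices fall and algorithms improve.
# Every numeric value below is an assumption chosen purely for illustration.

THRESHOLD_FLOP = 1e26           # hypothetical regulatory compute threshold
COST_PER_FLOP_2024 = 3e-18      # assumed dollars per FLOP in the base year
PRICE_HALVING_YEARS = 2.0       # assumed Moore's-Law-style price decline
ALGO_DOUBLING_YEARS = 2.0       # assumed algorithmic-efficiency doubling time

def cost_to_hit_threshold(year: int, base_year: int = 2024) -> float:
    """Dollar cost of a training run at exactly the threshold in a given year."""
    halvings = (year - base_year) / PRICE_HALVING_YEARS
    return THRESHOLD_FLOP * COST_PER_FLOP_2024 * 0.5 ** halvings

def equivalent_2024_flop(raw_flop: float, year: int, base_year: int = 2024) -> float:
    """Raw FLOP scaled by assumed algorithmic progress: a below-threshold run in
    a later year can match the capability of a larger base-year run."""
    doublings = (year - base_year) / ALGO_DOUBLING_YEARS
    return raw_flop * 2 ** doublings

for year in (2024, 2028, 2032):
    print(year,
          f"cost of a threshold-sized run: ${cost_to_hit_threshold(year):,.0f}",
          f"| a 5e25 FLOP run 'feels like' {equivalent_2024_flop(5e25, year):.1e} 2024-FLOP")
```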
Export controls intended to prevent the proliferation of the advanced chips used to train frontier AI models face a similar problem. Like the Clinton Administration’s supercomputer export controls, the Biden administration’s export controls on chips like the Nvidia A800 and H800 are likely to become misspecified over time. As algorithmic efficiency increases and powerful chips become cheaper and easier to acquire, existing semiconductor export controls will gradually become both overinclusive (because they pointlessly prohibit the export of chips that are already freely available overseas) and underinclusive (because powerful AI models can be trained using chips not covered by the export controls).
The question of precisely how society should respond to these developments over time is beyond the scope of this paper. However, to delay the onset of misspecification and mitigate its effects, policymakers setting legal rules for AI governance should consider the recommendations outlined in Section IV, above. Accordingly, the specifications for export controls on semiconductors—proxies for something like “chips that can be used to create dangerously powerful AI models”—should be updated quickly and frequently as needed, to prevent them from becoming ineffective or counterproductive. The Bureau of Industry and Security has already shown some willingness to pursue this kind of frequent, flexible updating.[ref 121] More generally, given the particular salience of the governance misspecification problem to AI governance, legislators should consider mandating frequent review of the effectiveness of important AI regulations and empowering administrative agencies to update regulations rapidly as necessary. Rules setting compute thresholds that are likely to be the subject of litigation should incorporate clear purpose statements articulating the underlying purpose behind the use of a compute threshold as a proxy, and should be interpreted consistently with those statements. And where it is possible to eschew the use of proxies without compromising the enforceability or effectiveness of a rule, legislators and regulators should consider doing so.
VI. Conclusion
This article has attempted to elucidate a newly developed concept in governance: the problem of governance misspecification. In presenting this concept along with empirical insights from representative case studies, we hope to inform contemporary debates around AI governance by demonstrating one common and impactful way in which legal rules can fail to effect their purposes. By framing this problem in terms of “misspecification,” a concept borrowed from the technical AI safety literature, this article aims both to bring valuable insights from that field to scholars of legal philosophy and public policy and to acquaint technical researchers with some of the more practically salient legal-philosophical and governance-related challenges involved in AI legislation and regulation. Additionally, we have offered a few specific suggestions for avoiding or mitigating the harms of misspecification in the AI governance context: eschewing proxy terms and metrics where feasible, including clear statements of statutory purpose, and adopting flexibly applied, rapidly updated, periodically reviewed regulations.
A great deal of conceptual and empirical work remains to be done regarding the nature and effects of the governance misspecification problem and best practices for avoiding and responding to it. For instance, this article does not contain any in-depth comparison of the incidence and seriousness of misspecification outside of the context of rules governing emerging technologies. Additionally, empirical research analyzing whether and how purpose clauses and similar provisions can effectively further the purposes of legal rules would be of significant practical value.
Legal considerations for defining “frontier model”
Abstract
Many proposed laws and rules for the regulation of artificial intelligence would distinguish between a category consisting of the most advanced models—often called “frontier models”—and all other AI systems. Legal rules that make this distinction will typically need to include or reference a definition of “frontier model” or whatever analogous term is used. The task of creating this definition implicates several important legal considerations. The role of statutory and regulatory definitions in the overall definitional scheme should be considered, as should the advantages and disadvantages of incorporating elements such as technical inputs, capability metrics, epistemic elements, and deployment context into a definition. Additionally, existing legal obstacles to the rapid updating of regulatory definitions should be taken into account—including recent doctrinal developments in administrative law such as the elimination of Chevron deference and the introduction of the major questions doctrine.
I. Introduction
One of the few concrete proposals on which AI governance stakeholders in industry[ref 1] and government[ref 2] have mostly[ref 3] been able to agree is that AI legislation and regulation should recognize a distinct category consisting of the most advanced AI systems. The executive branch of the U.S. federal government refers to these systems, in Executive Order 14110 and related regulations, as “dual-use foundation models.”[ref 4] The European Union’s AI Act refers to a similar class of models as “general-purpose AI models with systemic risk.”[ref 5] And many researchers, as well as leading AI labs and some legislators, use the term “frontier models” or some variation thereon.[ref 6]
These phrases are not synonymous, but they are all attempts to address the same issue—namely that the most advanced AI systems present additional regulatory challenges distinct from those posed by less sophisticated models. Frontier models are expected to be highly capable across a broad variety of tasks and are also expected to have applications and capabilities that are not readily predictable prior to development, nor even immediately known or knowable after development.[ref 7] It is likely that not all of these applications will be socially desirable; some may even create significant risks for users or for the general public.
The question of precisely how frontier models should be regulated is contentious and beyond the scope of this paper. But any law or regulation that distinguishes between “frontier models” (or “dual-use foundation models,” or “general-purpose AI models with systemic risk”) and other AI systems will first need to define the chosen term. A legal rule that applies to a certain category of product cannot be effectively enforced or complied with unless there is some way to determine whether a given product falls within the regulated category. Laws that fail to carefully define ambiguous technical terms often fail in their intended purposes, sometimes with disastrous results.[ref 8] Because the precise meaning of the phrase “frontier model” is not self-evident,[ref 9] the scope of a law or regulation that targeted frontier models without defining that term would be unacceptably uncertain. This uncertainty would impose unnecessary costs on regulated companies (who might overcomply out of an excess of caution or unintentionally undercomply and be punished for it) and on the public (from, e.g., decreased compliance, increased enforcement costs, less risk protection, and more litigation over the scope of the rule).
The task of defining “frontier model” implicates both legal and policy considerations. This paper provides a brief overview of some of the most relevant legal considerations for the benefit of researchers, policymakers, and anyone else with an interest in the topic.
II. Statutory and Regulatory Definitions
Two related types of legal definition—statutory and regulatory—are relevant to the task of defining “frontier model.” A statutory definition is a definition that appears in a statute enacted by a legislative body such as the U.S. Congress or one of the 50 state legislatures. A regulatory definition, on the other hand, appears in a regulation promulgated by a government agency such as the U.S. Department of Commerce or the California Department of Technology (or, less commonly, in an executive order).
Regulatory definitions have both advantages and disadvantages relative to statutory definitions. Legislation is generally a more difficult and resource-intensive process than agency rulemaking, with additional veto points and failure modes.[ref 10] Agencies are therefore capable of putting into effect more numerous and detailed legal rules than Congress can,[ref 11] and can update those rules more quickly and easily than Congress can amend laws.[ref 12] Additionally, executive agencies are often more capable of acquiring deep subject-matter expertise in highly specific fields than are congressional offices due to Congress’s varied responsibilities and resource constraints.[ref 13] This means that regulatory definitions can benefit from agency subject-matter expertise to a greater extent than can statutory definitions, and can also be updated far more easily and often.
The immense procedural and political costs associated with enacting a statute do, however, purchase a greater degree of democratic legitimacy and legal resiliency than a comparable regulation would enjoy. A number of legal challenges that might persuade a court to invalidate a regulatory definition would not be available for the purpose of challenging a statute.[ref 14] And since the rulemaking power exercised by regulatory agencies is generally delegated to them by Congress, most regulations must be authorized by an existing statute. A regulatory definition generally cannot eliminate or override a statutory definition[ref 15] but can clarify or interpret it. Often, a regulatory regime will include both a statutory definition and a more detailed regulatory definition for the same term.[ref 16] This can allow Congress to get the best of both worlds, establishing a threshold definition with the legitimacy and clarity of an act of Congress while empowering an agency to issue and subsequently update a more specific and technically informed regulatory definition.
III. Existing Definitions
This section discusses five noteworthy attempts to define phrases analogous to “frontier model” from three different existing measures. Executive Order 14110 (“EO 14110”), which President Biden issued in October 2023, includes two complementary definitions of the term “dual-use foundation model.” Two definitions of “covered model” from different versions of the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act, a California bill that was recently vetoed by Governor Newsom, are also discussed, along with the EU AI Act’s definition of “general-purpose AI model with systemic risk.”
A. Executive Order 14110
EO 14110 defines “dual-use foundation model” as:
an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits, or could be easily modified to exhibit, high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters, such as by:
(i) substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons;
(ii) enabling powerful offensive cyber operations through automated vulnerability discovery and exploitation against a wide range of potential targets of cyber attacks; or
(iii) permitting the evasion of human control or oversight through means of deception or obfuscation.
Models meet this definition even if they are provided to end users with technical safeguards that attempt to prevent users from taking advantage of the relevant unsafe capabilities.[ref 17]
The executive order imposes certain reporting requirements on companies “developing or demonstrating an intent to develop” dual-use foundation models,[ref 18] and for purposes of these requirements it instructs the Department of Commerce to “define, and thereafter update as needed on a regular basis, the set of technical conditions for models and computing clusters that would be subject to the reporting requirements.”[ref 19] In other words, EO 14110 contains both a high-level quasi-statutory[ref 20] definition and a directive to an agency to promulgate a more detailed regulatory definition. The EO also provides a second definition that acts as a placeholder until the agency’s regulatory definition is promulgated:
any model that was trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10²³ integer or floating-point operations[ref 21]
Unlike the first definition, which relies on subjective evaluations of model characteristics,[ref 22] this placeholder definition provides a simple set of objective technical criteria that labs can consult to determine whether the reporting requirements apply. For general-purpose models, the sole test is whether the model was trained on computing power greater than 10²⁶ integer or floating-point operations (FLOP); only models that exceed this compute threshold[ref 23] are deemed “dual-use foundation models” for purposes of the reporting requirements mandated by EO 14110.
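As a rough illustration of how mechanical this placeholder test is, the sketch below encodes the two-pronged compute test described above. The function name and inputs are hypothetical conveniences, not anything specified in the EO or its implementing regulations.

```python
# Minimal sketch of the EO 14110 placeholder test described above; the function
# name and inputs are hypothetical conveniences, not part of the EO or any rule.

GENERAL_THRESHOLD_FLOP = 1e26
BIOSEQ_THRESHOLD_FLOP = 1e23

def covered_by_placeholder(training_flop: float, primarily_bio_sequence_data: bool) -> bool:
    """Apply the EO's interim, purely objective reporting-requirement test."""
    if primarily_bio_sequence_data:
        return training_flop > BIOSEQ_THRESHOLD_FLOP
    return training_flop > GENERAL_THRESHOLD_FLOP

# A general-purpose model trained on 5e25 FLOP falls below the general threshold.
print(covered_by_placeholder(5e25, primarily_bio_sequence_data=False))  # False
# A biological-sequence model trained on 5e23 FLOP exceeds the lower threshold.
print(covered_by_placeholder(5e23, primarily_bio_sequence_data=True))   # True
```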
B. California’s “Safe and Secure Innovation for Frontier Artificial Intelligence Act” (SB 1047)
California’s recently vetoed “Safe and Secure Innovation for Frontier Artificial Intelligence Models Act” (“SB 1047”) focused on a category that it referred to as “covered models.”[ref 24] The version of SB 1047 passed by the California Senate in May 2024 defined “covered model” to include models meeting either of the following criteria:
(1) The artificial intelligence model was trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations.
(2) The artificial intelligence model was trained using a quantity of computing power sufficiently large that it could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations in 2024 as assessed using benchmarks commonly used to quantify the general performance of state-of-the-art foundation models.[ref 25]
This definition resembles the placeholder definition in EO 14110 in that it primarily consists of a training compute threshold of 10²⁶ FLOP. However, SB 1047 added an alternative capabilities-based threshold to capture future models which “could reasonably be expected” to be as capable as models trained on 10²⁶ FLOP in 2024. This addition was intended to “future-proof”[ref 26] SB 1047 by addressing one of the main disadvantages of training compute thresholds—their tendency to become obsolete over time as advances in algorithmic efficiency produce highly capable models trained on relatively small amounts of compute.[ref 27]
Following pushback from stakeholders who argued that SB 1047 would stifle innovation,[ref 28] the bill was amended repeatedly in the California State Assembly. The final version defined “covered model” in the following way:
(A) Before January 1, 2027, “covered model” means either of the following:
(i) An artificial intelligence model trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations, the cost of which exceeds one hundred million dollars[ref 29] ($100,000,000) when calculated using the average market prices of cloud compute at the start of training as reasonably assessed by the developer.
(ii) An artificial intelligence model created by fine-tuning a covered model using a quantity of computing power equal to or greater than three times 10²⁵ integer or floating-point operations, the cost of which, as reasonably assessed by the developer, exceeds ten million dollars ($10,000,000) if calculated using the average market price of cloud compute at the start of fine-tuning.
(B) (i) Except as provided in clause (ii), on and after January 1, 2027, “covered model” means any of the following:
(I) An artificial intelligence model trained using a quantity of computing power determined by the Government Operations Agency pursuant to Section 11547.6 of the Government Code, the cost of which exceeds one hundred million dollars ($100,000,000) when calculated using the average market price of cloud compute at the start of training as reasonably assessed by the developer.
(II) An artificial intelligence model created by fine-tuning a covered model using a quantity of computing power that exceeds a threshold determined by the Government Operations Agency, the cost of which, as reasonably assessed by the developer, exceeds ten million dollars ($10,000,000) if calculated using the average market price of cloud compute at the start of fine-tuning.
(ii) If the Government Operations Agency does not adopt a regulation governing subclauses (I) and (II) of clause (i) before January 1, 2027, the definition of “covered model” in subparagraph (A) shall be operative until the regulation is adopted.
This new definition was more complex than its predecessor. Subsection (A) introduced an initial definition slated to apply until at least 2027, which relied on a training compute threshold of 10²⁶ FLOP paired with a training cost floor of $100,000,000.[ref 30] Subsection (B), in turn, provided for the eventual replacement of the training compute thresholds used in the initial definition with new thresholds to be determined (and presumably updated) by a regulatory agency.
The most significant change in the final version of SB 1047’s definition was the replacement of the capability threshold with a $100,000,000 cost threshold. Because it would currently cost more than $100,000,000 to train a model using >10²⁶ FLOP, the addition of the cost threshold did not change the scope of the definition in the short term. However, the cost of compute has historically fallen precipitously over time in accordance with Moore’s Law.[ref 31] This may mean that models trained using significantly more than 10²⁶ FLOP will cost significantly less than the inflation-adjusted equivalent of 100 million 2024 dollars to create at some point in the future.
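A back-of-the-envelope calculation illustrates the point. The sketch below assumes a roughly $150 million cost for a 10²⁶ FLOP run in 2024 and a two-year halving time for compute prices; both figures are illustrative assumptions rather than estimates from the bill or its sources.

```python
# Back-of-the-envelope sketch, not a forecast: both the 2024 cost figure and the
# price-halving time below are illustrative assumptions.

BASE_COST_2024 = 1.5e8        # assumed 2024 cost of a 1e26 FLOP training run ($)
COST_FLOOR = 1e8              # SB 1047's fixed $100,000,000 cost threshold
PRICE_HALVING_YEARS = 2.0     # assumed compute price-performance halving time

year, cost = 2024, BASE_COST_2024
while cost >= COST_FLOOR:
    year += 1
    cost = BASE_COST_2024 * 0.5 ** ((year - 2024) / PRICE_HALVING_YEARS)

print(f"Under these assumptions, a 1e26 FLOP run drops below $100M around {year} "
      f"(≈${cost:,.0f}), taking it outside the final definition.")
```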
The old capability threshold expanded the definition of “covered model” because it was an alternative to the compute threshold—models that exceeded either of the two thresholds would have been “covered.” The newer cost threshold, on the other hand, restricted the scope of the definition because it was linked conjunctively to the compute threshold, meaning that only models that exceeded both thresholds were covered. In other words, where the May 2024 definition of “covered model” future-proofed itself against the risk of becoming underinclusive by including highly capable low-compute models, the final definition instead guarded against the risk of becoming overinclusive by excluding low-cost models trained on large amounts of compute. Furthermore, the final cost threshold was baked into the bill text and could only have been changed by passing a new statute—unlike the compute threshold, which could have been specified and updated by a regulator.
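The structural difference between the two drafts can be summarized as a change in Boolean logic, sketched below; the function names and example inputs are illustrative stand-ins, not statutory language.

```python
# Sketch contrasting the two drafts described above: the May 2024 draft treated
# the capability element disjunctively (OR), while the final draft linked the
# cost floor conjunctively (AND). Names and example inputs are illustrative.

def covered_may_2024(training_flop: float, matches_2024_frontier_performance: bool) -> bool:
    # Disjunctive test: either prong suffices.
    return training_flop > 1e26 or matches_2024_frontier_performance

def covered_final(training_flop: float, training_cost_usd: float) -> bool:
    # Conjunctive test: both the compute threshold and the $100M cost floor must be met.
    return training_flop > 1e26 and training_cost_usd > 1e8

# A future low-compute but highly capable model: covered only under the May draft.
print(covered_may_2024(4e25, matches_2024_frontier_performance=True))  # True
print(covered_final(4e25, training_cost_usd=2e7))                      # False
# A future cheap but compute-heavy model: excluded under the final draft.
print(covered_final(2e26, training_cost_usd=6e7))                      # False
```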
Compared with the overall definitional scheme in EO 14110, SB 1047’s definition was simpler, easier to operationalize, and less flexible. SB 1047 lacked a broad, high-level risk-based definition like the first definition in EO 14110. SB 1047 did resemble EO 14110 in its use of a “placeholder” definition, but where EO 14110 confers broad discretion on the regulator to choose the “set of technical conditions” that will comprise the regulatory definition, SB 1047 only authorized the regulator to set and adjust the numerical value of the compute thresholds in an otherwise rigid statutory definition.
C. EU Artificial Intelligence Act
The EU AI Act classifies AI systems according to the risks they pose. It prohibits systems that engage in certain practices, such as exploiting the vulnerabilities of elderly or disabled people,[ref 32] and regulates but does not ban so-called “high-risk” systems.[ref 33] While this classification system does not map neatly onto U.S. regulatory efforts, the EU AI Act does include a category conceptually similar to the EO’s “dual-use foundation model”: the “general-purpose AI model with systemic risk.”[ref 34] The statutory definition for this category includes a given general-purpose model[ref 35] if:
a. it has high impact capabilities[ref 36] evaluated on the basis of appropriate technical tools and methodologies, including indicators and benchmarks; [or]
b. based on a decision of the Commission,[ref 37] ex officio or following a qualified alert from the scientific panel, it has capabilities or an impact equivalent to those set out in point (a) having regard to the criteria set out in Annex XIII.
Additionally, models are presumed to have “high impact capabilities” if they were trained on >10²⁵ FLOP.[ref 38] The seven “criteria set out in Annex XIII” to be considered in evaluating model capabilities include a variety of technical inputs (such as the model’s number of parameters and the size or quality of the dataset used in training the model), the model’s performance on benchmarks and other capabilities evaluations, and other considerations such as the number of users the model has.[ref 39] When necessary, the European Commission is authorized to amend the compute threshold and “supplement benchmarks and indicators” in response to technological developments, such as “algorithmic improvements or increased hardware efficiency.”[ref 40]
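Putting these pieces together, the Act’s classification logic can be loosely sketched as follows. This is a simplification for illustration only: it omits the Annex XIII criteria and the procedural details of a Commission designation, and the names and inputs are illustrative rather than the Act’s own terms.

```python
# Loose sketch of the classification logic described above; a simplification for
# illustration that omits the Annex XIII criteria and the procedural details of
# a Commission designation. Names and inputs are illustrative, not the Act's text.

PRESUMPTION_THRESHOLD_FLOP = 1e25

def systemic_risk(training_flop: float,
                  high_impact_per_benchmarks: bool,
                  commission_designation: bool) -> bool:
    # Training compute above 1e25 FLOP triggers a presumption of high-impact
    # capabilities; benchmark-based evaluation or a Commission decision can also
    # bring a general-purpose model into the category.
    presumed_high_impact = training_flop > PRESUMPTION_THRESHOLD_FLOP
    return presumed_high_impact or high_impact_per_benchmarks or commission_designation

print(systemic_risk(3e25, high_impact_per_benchmarks=False, commission_designation=False))  # True
```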
The EU Act definition resembles the initial, broad definition in the EO in that they both take diverse factors like the size and quality of the dataset used to train the model, the number of parameters, and the model’s capabilities into account. However, the EU Act definition is likely much broader than either EO definition. The training compute threshold in the EU Act is sufficient, but not necessary, to classify models as systemically risky, whereas the (much higher) threshold in the EO’s placeholder definition is both necessary and sufficient. And the first EO definition includes only models that exhibit a high level of performance on tasks that pose serious risks to national security, while the EU Act includes all general-purpose models with “high impact capabilities,” which it defines as including any model trained on more than 10²⁵ FLOP.
The EU Act definition resembles the final SB 1047 definition of “covered model” in that both definitions authorize a regulator to update their thresholds in response to changing circumstances. It also resembles SB 1047’s May 2024 definition in that both definitions incorporate a training compute threshold and a capabilities-based element.
IV. Elements of Existing Definitions
As the examples discussed above demonstrate, legal definitions of “frontier model” can consist of one or more of a number of criteria. This section discusses a few of the most promising definitional elements.
A. Technical inputs and characteristics
A definition may classify AI models according to their technical characteristics or the technical inputs used in training the model, such as training compute, parameter count, and dataset size and type. These elements can be used in either statutory or regulatory definitions.
Training compute thresholds are a particularly attractive option for policymakers,[ref 41] as evidenced by the three examples discussed above. “Training compute” refers to the computational power used to train a model, often measured in integer or floating-point operations (OP or FLOP).[ref 42] Training compute thresholds function as a useful proxy for model capabilities because capabilities tend to increase as computational resources used to train the model increase.[ref 43]
One advantage of using a compute threshold is that training compute is a straightforward metric that is quantifiable and can be readily measured, monitored, and verified.[ref 44] Because of these characteristics, determining with high certainty whether a given model exceeds a compute threshold is relatively easy. This, in turn, facilitates enforcement of and compliance with regulations that rely on a compute-based definition. Since the amount of training compute (and other technical inputs) can be estimated prior to the training run,[ref 45] developers can predict whether a model will be covered earlier in development.
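For instance, a developer might estimate training compute before a run using the widely cited rule of thumb that training a dense transformer requires roughly six FLOP per parameter per training token. The sketch below applies that approximation to a purely hypothetical model configuration; the rule of thumb is an approximation, not an exact accounting method.

```python
# Hedged sketch of a pre-training compute estimate, using the widely cited rule
# of thumb of roughly 6 FLOP per parameter per training token for dense
# transformers. The rule of thumb is an approximation, and the model
# configuration below is purely hypothetical.

def estimated_training_flop(parameters: float, training_tokens: float) -> float:
    return 6 * parameters * training_tokens

THRESHOLD_FLOP = 1e26

# Hypothetical example: a 1-trillion-parameter model trained on 20 trillion tokens.
flop = estimated_training_flop(parameters=1e12, training_tokens=2e13)
print(f"Estimated ≈{flop:.1e} FLOP; exceeds the 1e26 threshold: {flop > THRESHOLD_FLOP}")
```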
One disadvantage of a compute-based definition is that compute thresholds are a proxy for model capabilities, which are in turn a proxy for risk. Definitions that make use of multiple nested layers of proxy terms in this manner are particularly prone to becoming untethered from their original purpose.[ref 46] This can be caused, for example, by the operation of Goodhart’s Law, which suggests that “when a measure becomes a target, it ceases to be a good measure.”[ref 47] Particularly problematic, especially for statutory definitions that are more difficult to update, is the possibility that a compute threshold may become underinclusive over time as improvements in algorithmic efficiency allow for the development of highly capable models trained on below-threshold levels of compute.[ref 48] This possibility is one reason why SB 1047 and the EU AI Act both supplement their compute thresholds with alternative, capabilities-based elements.
In addition to training compute, two other model characteristics correlated with capabilities are the number of model parameters[ref 49] and the size of the dataset on which the model was trained.[ref 50] Either or both of these characteristics can be used as an element of a definition. A definition can also rely on training data characteristics other than size, such as the quality or type of the data used; the placeholder definition in EO 14110, for example, contains a lower compute threshold for models “trained… using primarily biological sequence data.”[ref 51] EO 14110 requires a dual-use foundation model to contain “at least tens of billions of parameters,”[ref 52] and the “number of parameters of the model” is a criterion to be considered under the EU AI Act.[ref 53] EO 14110 specified that only models “trained on broad data” could be dual-use foundation models,[ref 54] and the EU AI Act includes “the quality or size of the data set, for example measured through tokens” as one criterion for determining whether an AI model poses systemic risks.[ref 55]
Dataset size and parameter count share many of the pros and cons of training compute. Like training compute, they are objective metrics that can be measured and verified, and they serve as proxies for model capabilities.[ref 56] Training compute is often considered the best and most reliable proxy of the three, in part because it is the most closely correlated with performance and is difficult to manipulate.[ref 57] However, partially redundant backup metrics can still be useful.[ref 58] Dataset characteristics other than size are typically less quantifiable and harder to measure but are also capable of capturing information that the quantifiable metrics cannot.
B. Capabilities
Frontier models can also be defined in terms of their capabilities. A capabilities-based definition element typically sets a threshold level of competence that a model must achieve to be considered “frontier,” either in one or more specific domains or across a broad range of domains. A capabilities-based definition can provide specific, objective criteria for measuring a model’s capabilities,[ref 59] or it can describe the capabilities required in more general terms and leave the task of evaluation to the discretion of future interpreters.[ref 60] The former approach might be better suited to a regulatory definition, especially if the criteria used will have to be updated frequently, whereas the latter approach would be more typical of a high-level statutory definition.
Basing a definition on capabilities, rather than relying on a proxy for capabilities like training compute, eliminates the risk that the chosen proxy will cease to be a good measure of capabilities over time. Therefore, a capabilities-based definition is more likely than, e.g., a compute threshold to remain robust over time in the face of improvements in algorithmic efficiency. This was the rationale for the May 2024 version of SB 1047’s capabilities element tethered to a compute threshold (“similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10²⁶ integer or floating-point operations in 2024”)—it was an attempt to capture some of the benefits of an input-based definition while also guarding against the possibility that models trained on less than 10²⁶ FLOP may become far more capable in the future than they are in 2024.
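The sketch below illustrates how such a 2024-anchored capability element behaves if algorithmic efficiency improves at an assumed constant rate: the quantity of compute needed to match the performance of a 10²⁶ FLOP model trained in 2024 falls each year. The efficiency-doubling time is an illustrative assumption, not an empirical estimate from the bill or its sources.

```python
# Sketch of how a 2024-anchored capability element behaves if algorithmic
# efficiency improves at an assumed constant rate; the doubling time is an
# illustrative assumption, not an estimate from the bill or its sources.

ANCHOR_FLOP_2024 = 1e26
EFFICIENCY_DOUBLING_YEARS = 2.0   # assumed rate of algorithmic progress

def flop_matching_2024_anchor(year: int) -> float:
    """Compute needed in a later year to match a 1e26-FLOP 2024 model's performance."""
    doublings = (year - 2024) / EFFICIENCY_DOUBLING_YEARS
    return ANCHOR_FLOP_2024 / 2 ** doublings

for year in (2024, 2026, 2028, 2030):
    print(year, f"≈{flop_matching_2024_anchor(year):.1e} FLOP")
```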
However, capabilities are far more difficult than compute to accurately measure. Whether a model has demonstrated “high levels of performance at tasks that pose a serious risk to security” under the EO’s broad capabilities-based definition is not something that can be determined objectively and to a high degree of certainty like the size of a dataset in tokens or the total FLOP used in a training run. Model capabilities are often measured using benchmarks (standardized sets of tasks or questions),[ref 61] but creating benchmarks that accurately measure the complex and diverse capabilities of general-purpose foundation models[ref 62] is notoriously difficult.[ref 63]
Additionally, model capabilities (unlike the technical inputs discussed above) are generally not measurable until after the model has been trained.[ref 64] This makes it difficult to regulate the development of frontier models using capabilities-based definitions, although post-development, pre-release regulation is still possible.
C. Risk
Some researchers have suggested the possibility of defining frontier AI systems on the basis of the risks they pose to users or to public safety instead of or in addition to relying on a proxy metric, like capabilities, or a proxy for a proxy, such as compute.[ref 65] The principal advantage of this direct approach is that it can, in theory, allow for better-targeted regulations—for instance, by allowing a definition to exclude highly capable but demonstrably low-risk models. The principal disadvantage is that measuring risk is even more difficult than measuring capabilities.[ref 66] The science of designing rigorous safety evaluations for foundation models is still in its infancy.[ref 67]
Of the three real-world measures discussed in Section III, only EO 14110 mentions risk directly. The broad initial definition of “dual-use foundation model” includes models that exhibit “high levels of performance at tasks that pose a serious risk to security,” such as “enabling powerful offensive cyber operations through automated vulnerability discovery” or making it easier for non-experts to design chemical weapons. This is a capability threshold combined with a risk threshold; the tasks at which a dual-use foundation model must be highly capable are those that pose a “serious risk” to security, national economic security, and/or national public health or safety. As EO 14110 shows, risk-based definition elements can specify the type of risk that a frontier model must create instead of addressing the severity of the risks created.
D. Epistemic elements
One of the primary justifications for recognizing a category of “frontier models” is the likelihood that broadly capable AI models that are more advanced than previous generations of models will have capabilities and applications that are not readily predictable ex ante.[ref 68] As the word “frontier” implies, lawmakers and regulators focusing on frontier models are interested in targeting models that break new ground and push into the unknown.[ref 69] This was, at least in part, the reason for the inclusion of training compute thresholds of 10²⁶ FLOP in EO 14110 and SB 1047—since the most capable current models were trained on 5×10²⁵ or fewer FLOP,[ref 70] a model trained on 10²⁶ FLOP would represent a significant step forward into uncharted territory.
While it is possible to target models that advance the state of the art by setting and adjusting capability or compute thresholds, a more direct alternative approach would be to include an epistemic element in a statutory definition of “frontier model.” An epistemic element would distinguish between “known” and “unknown” models, i.e., between well-understood models that pose only known risks and poorly understood models that may pose unfamiliar and unpredictable risks.[ref 71]
This kind of distinction between known and unknown risks has a long history in U.S. regulation.[ref 72] For instance, the Toxic Substances Control Act (TSCA) prohibits the manufacturing of any “new chemical substance” without a license.[ref 73] The EPA keeps and regularly updates a list of chemical substances which are or have been manufactured in the U.S., and any substance not included on this list is “new” by definition.[ref 74] In other words, the TSCA distinguishes between chemicals (including potentially dangerous chemicals) that are familiar to regulators and unfamiliar chemicals that pose unknown risks.
One advantage of an epistemic element is that it allows a regulator to address “unknown unknowns” separately from better-understood risks that can be evaluated and mitigated more precisely.[ref 75] Additionally, the scope of an epistemic definition, unlike that of most input- and capability-based definitions, would change over time as regulators became familiar with the capabilities of and risks posed by new models.[ref 76] Models would drop out of the “frontier” category once regulators became sufficiently familiar with their capabilities and risks.[ref 77] Like a capabilities- or risk-based definition, however, an epistemic definition might be difficult to operationalize.[ref 78] To determine whether a given model was “frontier” under an epistemic definition, it would probably be necessary to either rely on a proxy for unknown capabilities or authorize a regulator to categorize eligible models according to a specified process.[ref 79]
E. Deployment context
The context in which an AI system is deployed can serve as an element in a definition. The EU AI Act, for example, takes the number of registered end users and the number of registered EU business users a model has into account as factors to be considered in determining whether a model is a “general-purpose AI model with systemic risk.”[ref 80] Deployment context typically does not in and of itself provide enough information about the risks posed by a model to function as a stand-alone definitional element, but it can be a useful proxy for the kind of risk posed by a given model. Some models may cause harms in proportion to their number of users, and the justification for aggressively regulating these models grows stronger the more users they have. A model that will only be used by government agencies, or by the military, creates a different set of risks than a model that is made available to the general public.
V. Updating Regulatory Definitions
A recurring theme in the scholarly literature on the regulation of emerging technologies is the importance of regulatory flexibility.[ref 81] Because of the rapid pace of technological progress, legal rules designed to govern emerging technologies like AI tend to quickly become outdated and ineffective if they cannot be rapidly and frequently updated in response to changing circumstances.[ref 82] For this reason, it may be desirable to authorize an executive agency to promulgate and update a regulatory definition of “frontier model,” since regulatory definitions can typically be updated more frequently and more easily than statutory definitions under U.S. law.[ref 83]
Historically, failing to quickly update regulatory definitions in the context of emerging technologies has often led to the definitions becoming obsolete or counterproductive. For example, U.S. export controls on supercomputers in the 1990s and early 2000s defined “supercomputer” in terms of the number of millions of theoretical operations per second (MTOPS) the computer could perform.[ref 84] Rapid advances in the processing power of commercially available computers soon rendered the initial definition obsolete, however, and the Clinton administration was forced to revise the MTOPS threshold repeatedly to avoid harming the competitiveness of the American computer industry.[ref 85] Eventually, the MTOPS metric itself was rendered obsolete, leading to a period of several years in which supercomputer export controls were ineffective at best.[ref 86]
There are a number of legal considerations that may prevent an agency from quickly updating a regulatory definition and a number of measures that can be taken to streamline the process. One important aspect of the rulemaking process is the Administrative Procedure Act’s “notice and comment” requirement.[ref 87] In order to satisfy this requirement, agencies are generally obligated to publish notice of any proposed amendment to an existing regulation in the Federal Register, allow time for the public to comment on the proposal, respond to public comments, publish a final version of the new rule, and then allow at least 30–60 days before the rule goes into effect.[ref 88] From the beginning of the notice-and-comment process to the publication of a final rule, this process can take anywhere from several months to several years.[ref 89] However, an agency can waive the 30–60 day publication period or even the entire notice-and-comment requirement for “good cause” if observing the standard procedures would be “impracticable, unnecessary, or contrary to the public interest.”[ref 90] Of course, the notice-and-comment process has benefits as well as costs; public input can be substantively valuable and informative for agencies, and also increases the democratic accountability of agencies and the transparency of the rulemaking process. In certain circumstances, however, the costs of delay can outweigh the benefits. U.S. agencies have occasionally demonstrated a willingness to waive procedural rulemaking requirements in order to respond to emergency AI-related developments. The Bureau of Industry and Security (“BIS”), for example, waived the normal 30-day waiting period for an interim rule prohibiting the sale of certain advanced AI-relevant chips to China in October 2023.[ref 91]
Another way to encourage quick updating for regulatory definitions is for Congress to statutorily authorize agencies to eschew or limit the length of notice and comment, or to compel agencies to promulgate a final rule by a specified deadline.[ref 92] Because notice and comment is a statutory requirement, it can be adjusted as necessary by statute.
For regulations exceeding a certain threshold of economic significance, another substantial source of delay is OIRA review. OIRA, the Office of Information and Regulatory Affairs, is an office within the White House that oversees interagency coordination and undertakes centralized cost-benefit analysis of important regulations.[ref 93] Like notice and comment, OIRA review can have significant benefits—such as improving the quality of regulations and facilitating interagency cooperation—but it also delays the implementation of significant rules, typically by several months.[ref 94] OIRA review can be waived either by statutory mandate or by OIRA itself.[ref 95]
VI. Deference, Delegation, and Regulatory Definitions
Recent developments in U.S. administrative law may make it more difficult for Congress to effectively delegate the task of defining “frontier model” to a regulatory agency. A number of recent Supreme Court cases signal an ongoing shift in U.S. administrative law doctrine intended to limit congressional delegations of rulemaking authority.[ref 96] Whether this development is good or bad on net is a matter of perspective; libertarian-minded observers who believe that the U.S. has too many legal rules already[ref 97] and that overregulation is a bigger problem than underregulation have welcomed the change,[ref 98] while pro-regulation observers predict that it will significantly reduce the regulatory capacity of agencies in a number of important areas.[ref 99]
Regardless of where one falls on that spectrum of opinion, the relevant takeaway for efforts to define “frontier model” is that it will likely become somewhat more difficult for agencies to promulgate and update regulatory definitions without a clear statutory authorization to do so. If Congress still wishes to authorize the creation of regulatory definitions, however, it can protect agency definitions from legal challenges by clearly and explicitly authorizing agencies to exercise discretion in promulgating and updating definitions of specific terms.
A. Loper Bright and deference to agency interpretations
In a recent decision in the combined cases of Loper Bright Enterprises v. Raimondo and Relentless v. Department of Commerce, the Supreme Court overruled a longstanding legal doctrine known as Chevron deference.[ref 100] Under Chevron, federal courts were required to defer to certain agency interpretations of federal statutes when (1) the relevant part of the statute being interpreted was genuinely ambiguous and (2) the agency’s interpretation was reasonable. After Loper Bright, courts are no longer required to defer to these interpretations—instead, under a doctrine known as Skidmore deference,[ref 101] agency interpretations will prevail in court only to the extent that courts are persuaded by them.[ref 102]
Justice Elena Kagan’s dissenting opinion in Loper Bright argues that the decision will harm the regulatory capacity of agencies by reducing the ability of agency subject-matter experts to promulgate regulatory definitions of ambiguous statutory phrases in “scientific or technical” areas.[ref 103] The dissent specifically warns that, after Loper Bright, courts will “play a commanding role” in resolving questions like “[w]hat rules are going to constrain the development of A.I.?”[ref 104]
Justice Kagan’s dissent probably somewhat overstates the significance of Loper Bright to AI governance for rhetorical effect.[ref 105] The end of Chevron deference does not mean that Congress has completely lost the ability to authorize regulatory definitions; where Congress has explicitly directed an agency to define a specific statutory term, Loper Bright will not prevent the agency from doing so.[ref 106] An agency’s authority to promulgate a regulatory definition under a statute resembling EO 14110, which explicitly directs the Department of Commerce to define “dual-use foundation model,” would likely be unaffected. However, Loper Bright has created a great deal of uncertainty regarding the extent to which courts will accept agency claims that Congress has implicitly authorized the creation of regulatory definitions.[ref 107]
To better understand how this uncertainty might affect efforts to define “frontier model,” consider the following real-life example. The Energy Policy and Conservation Act (“EPCA”) includes a statutory definition of the term “small electric motor.”[ref 108] Like many statutory definitions, however, this definition is not detailed enough to resolve all disputes about whether a given product is or is not a “small electric motor” for purposes of EPCA. In 2010, the Department of Energy (“DOE”), which is authorized under EPCA to promulgate energy efficiency standards governing “small electric motors,”[ref 109] issued a regulatory definition of “small electric motor” specifying that the term referred to motors with power outputs between 0.25 and 3 horsepower.[ref 110] The National Electrical Manufacturers Association (“NEMA”), a trade association of electronics manufacturers, sued to challenge the rule, arguing that motors with between 1 and 3 horsepower were too powerful to be “small electric motors” and that the DOE was exceeding its statutory authority by attempting to regulate them.[ref 111]
In a 2011 opinion that utilized the Chevron framework, the federal court that decided NEMA’s lawsuit considered the language of EPCA’s statutory definition and concluded that EPCA was ambiguous as to whether motors with between 1 and 3 horsepower could be “small electric motors.”[ref 112] The court then found that the DOE’s regulatory definition was a reasonable interpretation of EPCA’s statutory definition, deferred to the DOE under Chevron, and upheld the challenged regulation.[ref 113]
Under Chevron, federal courts were required to assume that Congress had implicitly authorized agencies like the DOE to resolve ambiguities in a statute, as the DOE did in 2010 by promulgating its regulatory definition of “small electric motor.” After Loper Bright, courts will recognize fewer implicit delegations of definition-making authority. For instance, while EPCA requires the DOE to prescribe “testing requirements” and “energy conservation standards” for small electric motors, it does not explicitly authorize the DOE to promulgate a regulatory definition of “small electric motor.” If a rule like the one challenged by NEMA were challenged today, the DOE could still argue that Congress implicitly authorized the creation of such a rule by giving the DOE authority to prescribe standards and testing requirements—but such an argument would probably be less likely to succeed than the Chevron argument that saved the rule in 2011.
Today, a court that did not find an implicit delegation of rulemaking authority in EPCA would not defer to the DOE’s interpretation. Instead, the court would simply compare the DOE’s regulatory definition of “small electric motor” with NEMA’s proposed definition and decide which of the two was a more faithful interpretation of EPCA’s statutory definition.[ref 114] Similarly, when or if some future federal statute uses the phrase “frontier model” or any analogous term, agency attempts to operationalize the statute by enacting detailed regulatory definitions that are not explicitly authorized by the statute will be easier to challenge after Loper Bright than they would have been under Chevron.
Congress can avoid Loper Bright issues by using clear and explicit statutory language to authorize agencies to promulgate and update regulatory definitions of “frontier model” or analogous phrases. However, it is often difficult to predict in advance whether or how a statutory definition will become ambiguous over time. This is especially true in the context of emerging technologies like AI, where the rapid pace of technological development and the poorly understood nature of the technology often eventually render carefully crafted definitions obsolete.[ref 115]
Suppose, for example, that a federal statute resembling the May 2024 draft of SB 1047 was enacted. The statutory definition would include future models trained on a quantity of compute such that they “could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using [>10²⁶ FLOP] in 2024.” If the statute did not contain an explicit authorization for some agency to determine the quantity of compute that qualified in a given year, any attempt to set and enforce updated regulatory compute thresholds could be challenged in court.
The enforcing agency could argue that the statute included an implied authorization for the agency to promulgate and update the regulatory definitions at issue. This argument might succeed or fail, depending on the language of the statute, the nature of the challenged regulatory definitions, and the judicial philosophy of the deciding court. But regardless of the outcome of any individual case, challenges to impliedly authorized regulatory definitions will probably be more likely to succeed after Loper Bright than they would have been under Chevron. Perhaps more importantly, agencies will be aware that regulatory definitions will no longer receive the benefit of Chevron deference and may regulate more cautiously in order to avoid being sued.[ref 116] Moreover, even if the statute did explicitly authorize an agency to issue updated compute thresholds, such an authorization might not allow the agency to respond to future technological breakthroughs by considering some factor other than the quantity of training compute used.
In other words, a narrow congressional authorization to regulatorily define “frontier model” may prove insufficiently flexible after Loper Bright. Congress could attempt to address this possibility by instead enacting a very broad authorization.[ref 117] An overly broad authorization, however, may be undesirable for reasons of democratic accountability, as it would give unelected agency officials discretionary control over which models to regulate as “frontier.” Moreover, an overly broad delegation might risk running afoul of two related constitutional doctrines that limit the ability of Congress to delegate rulemaking authority to agencies—the major questions doctrine and the nondelegation doctrine.
B. The nondelegation doctrine
Under the nondelegation doctrine, which arises from the constitutional principle of separation of powers, Congress may not constitutionally delegate legislative power to executive branch agencies. In its current form, this doctrine has little relevance to efforts to define “frontier model.” Under current law, Congress can validly delegate rulemaking authority to an agency as long as the statute in which the delegation occurs includes an “intelligible principle” that provides adequate guidance for the exercise of that authority.[ref 118] In practice, this is an easy standard to satisfy—even vague and general legislative guidance, such as directing agencies to regulate in a way that “will be generally fair and equitable and will effectuate the purposes of the Act,” has been held to contain an intelligible principle.[ref 119] The Supreme Court has used the nondelegation doctrine to strike down statutes only twice, in two 1935 decisions invalidating sweeping New Deal laws.[ref 120]
However, some commentators have suggested that the Supreme Court may revisit the nondelegation doctrine in the near future,[ref 121] perhaps by discarding the “intelligible principle” test in favor of something like the standard suggested by Justice Gorsuch in his 2019 dissent in Gundy v. United States.[ref 122] In Gundy, Justice Gorsuch suggested that the nondelegation doctrine, properly understood, requires Congress to make “all the relevant policy decisions” and delegate to agencies only the task of “filling up the details” via regulation.[ref 123]
Therefore, if the Supreme Court does significantly strengthen the nondelegation doctrine, it is possible that a statute authorizing an agency to create a regulatory definition of “frontier model” would need to include meaningful guidance as to what the definition should look like. This is most likely to be the case if the regulatory definition in question is a key part of an extremely significant regulatory scheme, because “the degree of agency discretion that is acceptable varies according to the power congressionally conferred.”[ref 124] Congress generally “need not provide any direction” to agencies regarding the manner in which it defines specific and relatively unimportant technical terms,[ref 125] but must provide “substantial guidance” for extremely important and complex regulatory tasks that could significantly impact the national economy.[ref 126]
C. The major questions doctrine
Like the nondelegation doctrine, the major questions doctrine is a constitutional limitation on Congress’s ability to delegate rulemaking power to agencies. Like the nondelegation doctrine, it addresses concerns about the separation of powers and the increasingly prominent role executive branch agencies have taken on in the creation of important legal rules. Unlike the nondelegation doctrine, however, the major questions doctrine is a recent innovation. The Supreme Court acknowledged it by name for the first time in the 2022 case West Virginia v. Environmental Protection Agency,[ref 127] where it was used to strike down an EPA rule regulating power plant carbon dioxide emissions. Essentially, the major questions doctrine provides that courts will not accept an interpretation of a statute that grants an agency authority over a matter of great “economic or political significance” unless there is a “clear congressional authorization” for the claimed authority.[ref 128] Whereas the nondelegation doctrine provides a way to strike down statutes as unconstitutional, the major questions doctrine only affects the way that statutes are interpreted.
Supporters of the major questions doctrine argue that it helps to rein in excessively broad delegations of legislative power to the administrative state and serves a useful separation-of-powers function. The doctrine’s critics, however, have argued that it limits Congress’s ability to set up flexible regulatory regimes that allow agencies to respond quickly and decisively to changing circumstances.[ref 129] According to this school of thought, requiring a clear statement authorizing each economically significant agency action inhibits Congress’s ability to communicate broad discretion in handling problems that are difficult to foresee in advance.
This difficulty is particularly salient in the context of regulatory regimes for the governance of emerging technologies.[ref 130] Justice Kagan made this point in her dissent from the majority opinion in West Virginia, where she argued that the statute at issue was broadly worded because Congress had known that “without regulatory flexibility, changing circumstances and scientific developments would soon render the Clean Air Act obsolete.”[ref 131] Because advanced AI systems are likely to have a significant impact on the U.S. economy in the coming years,[ref 132] it is plausible that the task of choosing which systems should be categorized as “frontier” and subject to increased regulatory scrutiny will be an issue of great “economic and political significance.” If it is, then the major questions doctrine could be invoked to invalidate agency efforts to promulgate or amend a definition of “frontier model” to address previously unforeseen unsafe capabilities.
For example, consider a hypothetical federal statute instituting a licensing regime for frontier models that includes a definitional scheme similar to EO 14110’s (empowering the Bureau of Industry and Security to “define, and thereafter update as needed on a regular basis, the set of technical conditions [that determine whether a model is a frontier model].”). Suppose that BIS initially defined “frontier model” under this statute using a regularly updated compute threshold, but that ten years after the statute’s enactment a new kind of AI system was developed that could be trained to exhibit cutting-edge capabilities using a relatively small quantity of training compute. If BIS attempted to amend its regulatory definition of “frontier model” to include a capabilities threshold that would cover this newly developed and economically significant category of AI system, that new regulatory definition might be challenged under the major questions doctrine. In that situation, a court with deregulatory inclinations might not view the broad congressional authorization for BIS to define “frontier model” as a sufficiently clear statement of congressional intent to allow BIS to later institute a new and expanded licensing regime based on less objective technical criteria.[ref 133]
VI. Conclusion
One of the most common mistakes that nonlawyers make when reading a statute or regulation is to assume that each word of the text carries its ordinary English meaning. This error occurs because legal rules, unlike most writing encountered in everyday life, are often written in a sort of simple code where a number of the terms in a given sentence are actually stand-ins for much longer phrases catalogued elsewhere in a “definitions” section.
This tendency to overlook the role that definitions play in legal rules has an analogue in a widespread tendency to overlook the importance of well-crafted definitions to a regulatory scheme. The object of this paper, therefore, has been to explain some of the key legal considerations relevant to the task of defining “frontier model” or any of the analogous phrases used in existing laws and regulations.
One such consideration is the respective roles of statutory and regulatory definitions, which can be used independently or in conjunction with each other to create a definition that is both technically sound and democratically legitimate. Another is the selection and combination of potential definitional elements, including technical inputs, capabilities metrics, risk, deployment context, and familiarity, which can likewise be combined into a single statutory or regulatory definition. Legal mechanisms for facilitating rapid and frequent updating of regulations targeting emerging technologies also merit attention. Finally, the nondelegation and major questions doctrines, together with the recent elimination of Chevron deference, may limit the scope of discretion that can be conferred for the creation and updating of regulatory definitions.
Beyond a piecemeal approach: prospects for a framework convention on AI
Abstract
Solving many of the challenges presented by artificial intelligence (AI) requires international coordination and cooperation. In response, the past years have seen multiple global initiatives to govern AI. However, very few proposals have discussed treaty models or design for AI governance and have therefore neglected the study of framework conventions: generally, multilateral law-making treaties that establish a two-step regulatory process through which initially underspecified obligations and implementation mechanisms are subsequently specified via protocols. This chapter asks whether or how a Framework Convention on AI (FCAI) might serve as a regulatory tool for global AI governance, in contrast with the more traditional piecemeal approach based on individual treaties that govern isolated issues and have no subsequent regime. To answer these questions, the chapter first briefly sets out the recent context of global AI governance, and the governance gaps that remain to be filled. It then explores the elements, definition, and general role of framework conventions as an international regulatory instrument. On this basis, the chapter considers the structural trade-offs and challenges that an FCAI would face, before discussing key ways in which it could be designed to address these concerns. We argue that, while imperfect, an FCAI may be the most tractable and appropriate solution for the international governance of AI if it follows a hybrid model that combines a wide scope with specific obligations and implementation mechanisms concerning issues on which states already converge.
The future of international scientific assessments of AI’s risks
Abstract
Effective international coordination to address AI’s global impacts demands a shared, scientifically rigorous understanding of AI risks. This paper examines the challenges and opportunities in establishing international scientific consensus in this domain. It analyzes current efforts, including the UK-led International Scientific Report on the Safety of Advanced AI and emerging UN initiatives, identifying key limitations and tradeoffs. The authors propose a two-track approach: 1) a UN-led process focusing on broad AI issues and engaging member states, and 2) an independent annual report specifically focused on advanced AI risks. The paper recommends careful coordination between these efforts to leverage their respective strengths while maintaining their independence. It also evaluates potential hosts for the independent report, including the network of AI Safety Institutes, the OECD, and scientific organizations like the International Science Council. The proposed framework aims to balance scientific rigor, political legitimacy, and timely action to facilitate coordinated international action on AI risks.
Computing power and the governance of artificial intelligence
Abstract
Computing power, or “compute,” is crucial for the development and deployment of artificial intelligence (AI) capabilities. As a result, governments and companies have started to leverage compute as a means to govern AI. For example, governments are investing in domestic compute capacity, controlling the flow of compute to competing countries, and subsidizing compute access to certain sectors. However, these efforts only scratch the surface of how compute can be used to govern AI development and deployment. Relative to other key inputs to AI (data and algorithms), AI-relevant compute is a particularly effective point of intervention: it is detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain. These characteristics, alongside the singular importance of compute for cutting-edge AI models, suggest that governing compute can contribute to achieving common policy objectives, such as ensuring the safety and beneficial use of AI. More precisely, policymakers could use compute to facilitate regulatory visibility of AI, allocate resources to promote beneficial outcomes, and enforce restrictions against irresponsible or malicious AI development and usage. However, while compute-based policies and technologies have the potential to assist in these areas, there is significant variation in their readiness for implementation. Some ideas are currently being piloted, while others are hindered by the need for fundamental research. Furthermore, naïve or poorly scoped approaches to compute governance carry significant risks in areas like privacy, economic impacts, and centralization of power. We end by suggesting guardrails to minimize these risks from compute governance.