The role of compute thresholds for AI governance

I. Introduction

The idea of establishing a “compute threshold” and, more precisely, a “training compute threshold” has recently attracted significant attention from policymakers and commentators. In recent years, various scholars and AI labs have supported setting such a threshold,[ref 1] as have governments around the world. On October 30, 2023, President Biden’s Executive Order 14,110 on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence introduced the first operative example of a compute threshold,[ref 2] although it was one of many orders revoked by President Trump upon entering office.[ref 3] The European Parliament and the Council of the European Union adopted the Artificial Intelligence Act on June 13, 2024, providing for the establishment of a compute threshold.[ref 4] On February 4, 2024, California State Senator Scott Wiener introduced Senate Bill 1047, which defined frontier AI models by reference to a compute threshold.[ref 5] The bill was approved by the California legislature, but it was ultimately vetoed by the State’s Governor.[ref 6] China may be considering similar measures, as indicated by recent discussions in policy circles.[ref 7] While not perfect, compute thresholds are currently one of the best options available to identify potentially high-risk models and trigger further scrutiny. Yet information about compute thresholds and their relevance from a policy and legal perspective remains dispersed.

This Article proceeds in two parts. Part I provides a technical overview of compute and how the amount of compute used in training corresponds to model performance and risk. It begins by explaining what compute is and the role compute plays in AI development and deployment. Compute refers to both computational infrastructure, the hardware necessary to develop and deploy an AI system, and the amount of computational power required to train a model, commonly measured in integer or floating-point operations. More compute is used to train notable models each year, and although the cost of compute has decreased, the amount of compute used for training has increased at a higher rate, causing training costs to increase dramatically.[ref 8] This increase in training compute has contributed to improvements in model performance and capabilities, described in part by scaling laws. As models are trained on more data, with more parameters and training compute, they grow more powerful and capable. As advances in AI continue, capabilities may emerge that pose potentially catastrophic risks if not mitigated.[ref 9]

Part II discusses why, in light of this risk, compute thresholds may be important to AI governance. Since training compute can serve as a proxy for the capabilities of AI models, a compute threshold can operate as a regulatory trigger, identifying what subset of models might possess more powerful and dangerous capabilities that warrant greater scrutiny, such as in the form of reporting and evaluations. Both the European Union AI Act and Executive Order 14,110 established compute thresholds for different purposes, and many more policy proposals rely on compute thresholds to ensure that the scope of covered models matches the nature or purpose of the policy. This Part provides an overview of policy proposals that expressly call for such a threshold, as well as proposals that could benefit from the addition of a compute threshold to clarify the scope of policies that refer broadly to “advanced systems” or “systems with dangerous capabilities.” It then describes how, even absent a formal compute threshold, courts and regulators might rely on training compute as a proxy for how much risk a given AI system poses, even under existing law. This Part concludes with the advantages and limitations of using compute thresholds as a regulatory trigger.

II. Compute and the Scaling Hypothesis

A. What Is “Compute”?

The term “compute” serves as an umbrella term, encompassing several meanings that depend on context.

Commonly, the term “compute” is used to refer to computational infrastructure, i.e., the hardware stacks necessary to develop and deploy AI systems.[ref 10] Many hardware elements are integrated circuits (also called chips or microchips), such as logic chips, which perform operations, and memory chips, which store the information on which logic devices perform calculations.[ref 11] Logic chips cover a spectrum of specialization, ranging from general-purpose central processing units (“CPUs”), through graphics processing units (“GPUs”) and field-programmable gate arrays (“FPGAs”), to application-specific integrated circuits (“ASICs”) customized for specific algorithms.[ref 12] Memory chips include dynamic random-access memory (“DRAM”), static random-access memory (“SRAM”), and NOT AND (“NAND”) flash memory used in many solid state drives (“SSDs”).[ref 13]

Additionally, the term “compute” is often used to refer to how much computational power is required to train a specific AI system. The computational performance of a chip refers to how quickly it can execute operations and thus generate results, solve problems, or perform specific tasks, such as processing and manipulating data or training an AI system. “Compute,” by contrast, refers to the amount of computational power used by one or more chips to perform a task, such as training a model. Compute is commonly measured in integer operations or floating-point operations (“OP” or “FLOP”),[ref 14] expressing the number of operations that have been executed by one or more chips, while the computational performance of those chips is measured in operations per second (“OP/s” or “FLOP/s”). In this sense, the amount of computational power used is roughly analogous to the distance traveled by a car, while computational performance is analogous to its speed.[ref 15] Since large amounts of compute are used in modern computing, values are often reported in scientific notation such as 1e26 or 2e26, which refer to 1⋅10^26 and 2⋅10^26, respectively.

Compute is essential throughout the AI lifecycle. The AI lifecycle can be broken down into two phases: development and deployment.[ref 16] In the first phase, development, developers design the model by choosing an architecture, the structure of the network, and initial values for hyperparameters (i.e., parameters that control the learning process, such as the number of layers and the learning rate).[ref 17] Enormous amounts of data, usually from publicly available sources, are processed and curated to produce high-quality datasets for training.[ref 18] The model then undergoes “pre-training,” in which it is trained on a large and diverse dataset in order to build its general knowledge and features, which are reflected in its weights and biases.[ref 19] Alternatively, developers may use an existing pre-trained model, such as OpenAI’s GPT-4 (“Generative Pre-trained Transformer 4”). The term “foundation model” refers to models like these, which are trained on broad data and adaptable to many downstream tasks.[ref 20] Performance and capabilities improvements are then possible using methods such as fine-tuning on task-specific datasets, reinforcement learning from human feedback (“RLHF”), teaching the model to use tools, and instruction tuning.[ref 21] These enhancements are far less compute-intensive than pre-training, particularly for models trained on massive datasets.[ref 22]

As of this writing, there is no agreed-upon standard for measuring “training compute.” Estimates of “training compute” typically refer only to the amount of compute used during pre-training. More specifically, they refer to the amount of compute used during the final pre-training run, which contributes to the final machine learning model, and exclude any previous test runs or post-training enhancements, such as fine-tuning.[ref 23] There are exceptions: for instance, the EU AI Act considers the cumulative amount of compute used for training by including all the compute “used across the activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning.”[ref 24] California Senate Bill 1047 addressed post-training modifications generally and fine-tuning in particular, providing that a covered model fine-tuned with more than 3e25 OP or FLOP would be considered a distinct “covered model,” while one fine-tuned on less compute or subjected to unrelated post-training modifications would be considered a “covered model derivative.”[ref 25]

In the second phase, deployment, the model is made available to users and is used.[ref 26] Users provide input to the model, such as in the form of a prompt, and the model makes predictions from this input in a process known as “inference.”[ref 27] The amount of compute needed for a single inference request is far lower than what is required for a training run.[ref 28] However, for systems deployed at scale, the cumulative compute used for inference can surpass training compute by several orders of magnitude.[ref 29] Consider, for instance, a large language model (“LLM”). During training, a large amount of compute is required over a smaller time frame within a closed system, usually a supercomputer. Once the model is deployed, each text generation leverages its own copy of the trained model, which can be run on a separate compute infrastructure. The model may serve hundreds of millions of users, each generating unique content and using compute with each inference request. Over time, the cumulative compute usage for inference can surpass the total compute required for training.
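
To illustrate how this reversal can happen, the following back-of-the-envelope sketch (in Python) compares a hypothetical training run with a year of deployment, using the common rough approximations of about 6 FLOP per parameter per training token and about 2 FLOP per parameter per token processed at inference. Every figure in the sketch (model size, token counts, request volume) is a hypothetical assumption chosen only for illustration.

# Back-of-the-envelope comparison of training vs. cumulative inference compute.
# All model and usage figures below are hypothetical, for illustration only.

PARAMS = 100e9            # hypothetical model with 100 billion parameters
TRAIN_TOKENS = 2e12       # hypothetical training set of 2 trillion tokens

# Common approximations: ~6 FLOP per parameter per training token,
# ~2 FLOP per parameter per token processed at inference.
training_flop = 6 * PARAMS * TRAIN_TOKENS          # ~1.2e24 FLOP

tokens_per_request = 1_000      # prompt plus generated output (hypothetical)
requests_per_day = 100e6        # hypothetical daily request volume at scale
days_deployed = 365

inference_flop = 2 * PARAMS * tokens_per_request * requests_per_day * days_deployed

print(f"Training compute:            {training_flop:.2e} FLOP")
print(f"One year of inference:       {inference_flop:.2e} FLOP")
print(f"Inference / training ratio:  {inference_flop / training_flop:.1f}x")

With these assumptions, a single year of deployment already consumes several times the training compute; larger user bases and longer deployment periods widen the gap further.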

There are various reasons to consider compute usage at different stages of the AI lifecycle, as discussed in Section I.E. For clarity, this Article uses “training compute” for compute used during the final pre-training run and “inference compute” for compute used by the model during a single inference, both measured in the number of operations (“OP” or “FLOP”). Figure 1 illustrates a simplified version of the language model compute lifecycle.


Figure 1: Simplified language model lifecycle

B. What Is Moore’s Law and Why Is It Relevant for AI?

In 1965, Gordon Moore forecasted that the number of transistors on an integrated circuit would double every year.[ref 30] Ten years later, Moore revised his initial forecast to a two-year doubling period.[ref 31] This pattern of exponential growth is now called “Moore’s Law.”[ref 32] Similar rates of growth have been observed in related metrics, notably including the increase in computational performance of supercomputers;[ref 33] as the number of transistors on a chip increases, so does computational performance (although other factors also play a role).[ref 34]

A corollary of Moore’s Law is that the cost of compute has fallen dramatically; a dollar can buy more FLOP every year.[ref 35] Greater access to compute, along with greater spending from 2010 onwards (i.e., the so-called deep learning era),[ref 36] has contributed to developers using ever more compute to train AI systems. Research has found that the compute used to train notable and frontier models has grown by 4–5x per year between 2010 and May 2024.[ref 37]


Figure 2: Compute used to train notable AI systems from 1950 to 2023[ref 38]

However, the current rate of growth in training compute may not be sustainable. Scholars have cited the cost of training,[ref 39] a limited supply of AI chips,[ref 40] technical challenges with using that much hardware (such as managing the number of processors that must run in parallel to train larger models),[ref 41] and environmental impact[ref 42] as factors that could constrain the growth of training compute. Research in 2018 with data from OpenAI estimated that then-current trends of growth in training compute could be sustained for at most 3.5 to 10 years (2022 to 2028), depending on spending levels and how the cost of compute evolves over time.[ref 43] In 2022, that analysis was replicated with a more comprehensive dataset and suggested that this trend could be maintained for longer, for 8 to 18 years (2030 to 2040) depending on compute cost-performance improvements and specialized hardware improvements.[ref 44]

C. What Are “Scaling Laws” and What Do They Say About AI Models?

Scaling laws describe the functional (mathematical) relationship between the amount of training compute and the performance of the AI model.[ref 45] In this context, performance is a technical metric that quantifies “loss,” which is the amount of error in the model’s predictions. When loss is measured on a test or validation set that uses data not part of the training set, it reflects how well the model has generalized its learning from the training phase. The lower the loss, the more accurate and reliable the model is in making predictions on data it has not encountered during its training.[ref 46] As training compute increases, alongside increases in parameters and training data, so does model performance, meaning that greater training compute reduces the errors made.[ref 47] Increased training compute also corresponds to an increase in capabilities.[ref 48] Whereas performance refers to a technical metric, such as test loss, capabilities refer to the ability to complete concrete tasks and solve problems in the real world, including in commercial applications.[ref 49] Capabilities can also be assessed using practical and real-world tests, such as standardized academic or professional licensing exams, or with benchmarks developed for AI models. Common benchmarks include “Beyond the Imitation Game” (“BIG-Bench”), which comprises 204 diverse tasks that cover a variety of topics and languages,[ref 50] and the “Massive Multitask Language Understanding” benchmark (“MMLU”), a suite of multiple-choice questions covering 57 subjects.[ref 51] To evaluate the capabilities of Google’s PaLM 2 and OpenAI’s GPT-4, developers relied on BIG-Bench and MMLU as well as exams designed for humans, such as the SAT and AP exams.[ref 52]
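
Scaling laws of this kind are typically expressed as power laws. The sketch below shows the general functional form used in the literature, in which loss decreases smoothly toward an irreducible floor as parameters and training data (and hence training compute, which grows with both) increase; the constants are illustrative placeholders rather than fitted values from any particular study.

# Illustrative power-law scaling relationship: predicted loss as a function of
# model parameters (N) and training tokens (D). Constants are placeholders,
# not fitted values from any particular study. Training compute grows with
# both N and D, so lower predicted loss comes at higher compute.

def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Loss falls as a power law in parameters and data, toward a floor e."""
    return e + a / n_params**alpha + b / n_tokens**beta

# Larger models trained on more data (and hence more compute) predictably
# achieve lower loss:
for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e}  ->  predicted loss {predicted_loss(n, d):.3f}")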

Training compute has a relatively smooth and consistent relationship with technical metrics like training loss. Training compute also corresponds to real-world capabilities, but not in a smooth and predictable way. This is due in part to occasional surprising leaps, discussed in Section I.D, and subsequent enhancements such as fine-tuning, which can further increase capabilities using far less compute.[ref 53] Despite being unable to provide a full and accurate picture of a model’s final capabilities, training compute still provides a reasonable basis for estimating the base capabilities (and corresponding risk) of a foundation model. Figure 3 shows the relationship between an increase in training compute and dataset size, and performance on the MMLU benchmark.


Figure 3: Relationship between increase in training compute and dataset size,
and performance on MMLU[ref 54]

In light of the correlation between training compute and performance, the “scaling hypothesis” states that scaling training compute will predictably continue to produce even more capable systems, and thus more compute is important for AI development.[ref 55] Some have taken this hypothesis further, proposing a “Bitter Lesson:” that “the only thing that matters in the long run is the leveraging of comput[e].”[ref 56] Since the emergence of the deep learning era, this hypothesis has been sustained by the increasing use of AI models in commercial applications, whose development and commercial success have been significantly driven by increases in training compute.[ref 57]

Two factors weigh against the scaling hypothesis. First, scaling laws describe more than just the performance improvements based on training compute; they describe the optimal ratio of the size of the dataset, the number of parameters, and the training compute budget.[ref 58] Thus, a lack of abundant or high-quality data could be a limiting factor. Researchers estimate that, if training datasets continue to grow at current rates, language models will fully utilize human-generated public text data between 2026 and 2032,[ref 59] while image data could be exhausted between 2030 and 2060.[ref 60] Specific tasks may be bottlenecked earlier by the scarcity of high-quality data sources.[ref 61] There are, however, several ways that data limitations might be delayed or avoided, such as synthetic data generation and using additional datasets that are not public or in different modalities.[ref 62]
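
The data constraint can be made concrete with a rough calculation. The sketch below combines two common approximations, training compute of roughly 6 FLOP per parameter per training token and a compute-optimal ratio of roughly 20 training tokens per parameter, to estimate how many tokens a compute-optimal run would consume at various budgets; both figures are approximations that vary across studies and architectures.

# Rough estimate of how much training data a compute-optimal run would consume,
# using two common approximations: training compute ~= 6 * N * D FLOP, and a
# compute-optimal ratio of roughly 20 tokens per parameter (D ~= 20 * N).
# Both are approximations and vary across architectures and studies.
import math

def compute_optimal_allocation(train_flop: float) -> tuple[float, float]:
    """Return (parameters N, tokens D) for a compute-optimal run of train_flop."""
    # Solve 6 * N * (20 * N) = train_flop  =>  N = sqrt(train_flop / 120)
    n_params = math.sqrt(train_flop / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

for budget in (1e25, 1e26, 1e27):
    n, d = compute_optimal_allocation(budget)
    print(f"{budget:.0e} FLOP -> ~{n:.1e} parameters, ~{d:.1e} training tokens")

Under these assumptions, a 1e26 FLOP compute-optimal run would consume on the order of 10^13 (tens of trillions of) training tokens, which is why projections about exhausting public text data bear directly on continued scaling.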

Second, algorithmic innovation permits performance gains that would otherwise require prohibitively expensive amounts of compute.[ref 63] Research estimates that every 9 months, improved algorithms for image classification[ref 64] and LLMs[ref 65] contribute the equivalent of a doubling of training compute budgets. Algorithmic improvements include more efficient utilization of data[ref 66] and parameters, the development of improved training algorithms, or new architectures.[ref 67] Over time, the amount of training compute needed to achieve a given capability is reduced, and it may become more difficult to predict performance and capabilities on that basis (although scaling trends of new algorithms could be studied and perhaps predicted). The governance implications of this are multifold, including that increases in training compute may become less important for AI development and that many more actors will be able to access the capabilities previously restricted to a limited number of developers.[ref 68] Still, responsible frontier AI development may enable stakeholders to develop understanding, safety practices, and (if needed) defensive measures for the most advanced AI capabilities before these capabilities proliferate.

D. Are High-Compute Systems Dangerous?

Advances in AI could deliver immense opportunities and benefits across a wide range of sectors, from healthcare and drug discovery[ref 69] to public services.[ref 70] However, more capable models may come with greater risk, as improved capabilities could be used for harmful and dangerous ends. While the degree of risk posed by current AI models is a subject of debate,[ref 71] future models may pose catastrophic and existential risks as capabilities improve.[ref 72] Some of these risks are expected to be closely connected to the unexpected emergence of dangerous capabilities and the dual-use nature of AI models.

As discussed in Section I.C, increases in compute, data, and the number of parameters lead to predictable improvements in model performance (test loss) and general but somewhat less predictable improvements in capabilities (real-world benchmarks and tasks). However, scaling up these inputs to a model can also result in qualitative changes in capabilities in a phenomenon known as “emergence.”[ref 73] That is, a larger model might unexpectedly display emergent capabilities not present in smaller models, suddenly able to perform a task that smaller models could not.[ref 74] During the development of GPT-3, early models had close-to-zero performance on a benchmark for addition, subtraction, and multiplication. Arithmetic capabilities appeared to emerge suddenly in later models, with performance jumping substantially above random at 2·10^22 FLOP and continuing to improve with scale.[ref 75] Similar jumps were observed at different thresholds, and for different models, on a variety of tasks.[ref 76]

Some have contested the concept of emergent capabilities, arguing that what appear to be emergent capabilities in large language models are explained by the use of discontinuous measures, rather than by sharp and unpredictable improvements or developments in model capabilities with scale.[ref 77] However, discontinuous measures are often meaningful, as when the correct answer or action matters more than how close the model gets to it. As Anderljung and others explain: “For autonomous vehicles, what matters is how often they cause a crash. For an AI model solving mathematics questions, what matters is whether it gets the answer exactly right or not.”[ref 78] Given the difficulties inherent in choosing an appropriate continuous measure and determining how it corresponds to the relevant discontinuous measure,[ref 79] it is likely that capabilities will continue to seemingly emerge.
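
The measurement point can be seen in a toy calculation: if a model’s per-token accuracy on a multi-step answer improves smoothly with scale, a discontinuous exact-match metric (every token correct) will still appear to jump abruptly. The sketch below uses purely synthetic numbers and a hypothetical ten-token answer length to illustrate the effect.

# Toy illustration of why a discontinuous metric can make smooth progress look
# like sudden "emergence." Per-token accuracy improves gradually with scale,
# but an exact-match score (all k tokens correct) stays near zero and then
# appears to jump. All numbers are synthetic.

ANSWER_LENGTH = 10  # tokens that must all be correct for an exact match

# Hypothetical per-token accuracies for increasingly large models.
per_token_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_accuracy:
    exact_match = p ** ANSWER_LENGTH
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {exact_match:.4f}")
# Exact match climbs from ~0.001 to ~0.90 even though per-token accuracy
# improves only gradually -- the continuous measure shows no sharp jump.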

Together with emerging capabilities come emerging risks. Like many other innovations, AI systems are dual-use by nature, with the potential to be used for both beneficial and harmful ends.[ref 80] Executive Order 14,110 recognized that some models may “pose a serious risk to security, national economic security, national public health or safety” by “substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear weapons; enabling powerful offensive cyber operations . . . ; [or] permitting the evasion of human control or oversight through means of deception or obfuscation.”[ref 81]

Predictions and evaluations will likely identify many capabilities before deployment, allowing developers to take appropriate precautions. However, systems trained at a greater scale may possess novel capabilities, or improved capabilities that surpass a critical threshold for risk, yet go undetected by evaluations.[ref 82] Some of these capabilities may appear to emerge only after post-training enhancements, such as fine-tuning or more effective prompting methods. A system may be capable of conducting offensive cyber operations, manipulating people in conversation, or providing actionable instructions on conducting acts of terrorism,[ref 83] and still be deployed without its developers fully comprehending these unexpected and potentially harmful behaviors. Research has already detected unexpected behavior in current models. For instance, during the U.K. AI Safety Summit on November 1, 2023, Apollo Research showed that GPT-4 can take illegal actions like insider trading and then lie about its actions without being instructed to do so.[ref 84] Since the capabilities of future foundation models may be challenging to predict and evaluate, “emergence” has been described as “both the source of scientific excitement and anxiety about unanticipated consequences.”[ref 85]

Not all risks come from large models. Smaller models trained on data from certain domains, such as biology or chemistry, may pose significant risks if repurposed or misused.[ref 86] When MegaSyn, a generative molecule design tool used for drug discovery, was repurposed to find the most toxic molecules instead of the least toxic, it found tens of thousands of candidates in under six hours, including known biochemical agents and novel compounds predicted to be as or more deadly.[ref 87] The amount of compute used to train DeepMind’s AlphaFold, which predicts three-dimensional protein structures from the protein sequence, is minimal compared to frontier language models.[ref 88] While scaling laws can be observed in a variety of domains, the amount of compute required to train models in some domains may be so low that a compute threshold is not a practical restriction on capabilities.

Broad consensus is forming around the need to test, monitor, and restrict systems of concern.[ref 89] The role of compute thresholds, and whether they are used at all, depends on the nature of the risk and the purpose of the policy: does it target risks from emergent capabilities of frontier models,[ref 90] risks from models with more narrow but dangerous capabilities,[ref 91] or other risks from AI?

E. Does Compute Usage Outside of Training Influence Performance and Risk?

In light of the relationship between training compute and performance expressed by scaling laws, training compute is a common proxy for how capable and powerful AI models are and the risks that they pose.[ref 92] However, compute used outside of training can also influence performance, capabilities, and corresponding risk.

As discussed in Section I.A, training compute typically does not refer to all compute used during development, but is instead limited to compute used during the final pre-training run.[ref 93] This definition excludes subsequent (post-training) enhancements, such as fine-tuning and prompting methods, which can significantly improve capabilities (see supra Figure 1) using far less compute; many current methods can improve capabilities by the equivalent of a 5x increase in training compute, and some by more than 20x.[ref 94]

The focus on training compute also misses the significance of compute used for inference, in which the trained model generates output in response to a prompt or new input data.[ref 95] Inference is the biggest compute cost for models deployed at scale, due to the frequency and volume of requests they handle.[ref 96] While developing an AI model is far more computationally intensive than a single inference request, it is a one-time task. In contrast, once a model is deployed, it may receive numerous inference requests that, in aggregate, exceed the compute expenditures of training. Some have even argued that inference compute could become a bottleneck to further scaling if inference costs, which scale with training compute, grow too large.[ref 97]

Greater availability of inference compute could enhance malicious uses of AI by allowing the model to process data more rapidly and enabling the operation of multiple instances in parallel. For example, AI could more effectively be used to carry out cyber attacks, such as a distributed denial-of-service (“DDoS”) attack,[ref 98] to manipulate financial markets,[ref 99] or to increase the speed, scale, and personalization of disinformation campaigns.[ref 100]

Compute used outside of development may also impact model performance. Specifically, some techniques can increase the performance of a model at the cost of more compute used during inference.[ref 101] Developers could therefore choose to improve a model beyond its current capabilities or to shift some compute expenditures from training to inference, in order to obtain equally-capable systems with less training compute. Users could also prompt a model to use similar techniques during inference, for example by (1) using “few-shot” prompting, in which initial prompts provide the model with examples of the desired output for a type of input,[ref 102] (2) using chain-of-thought prompting, which uses few-shot prompting to provide examples of reasoning,[ref 103] or (3) simply providing the same prompt multiple times and selecting the best result. Some user-side techniques to improve performance might increase the compute used during a single inference, while others would leave it unchanged (while still increasing the total compute used, due to multiple inferences being performed).[ref 104] Meanwhile, other techniques—such as pruning,[ref 105] weight sharing,[ref 106] quantization,[ref 107] and distillation[ref 108]—can reduce compute used during inference while maintaining or even improving performance, and they can further reduce inference compute at the cost of lower performance.

Beyond model characteristics such as parameter count, other factors can also affect the amount of compute used during inference in ways that may or may not improve performance, such as input size (compare a short prompt to a long document or high-resolution image) and batch size (compare one input provided at a time to many inputs in a single prompt).[ref 109] Thus, for a more accurate indication of model capabilities, compute used to run a single inference[ref 110] for a given set of prompts could be considered alongside other factors, such as training compute. However, doing so may be impractical, as data about inference compute (or architecture useful for estimating it) is rarely published by developers,[ref 111] different techniques could make inference more compute-efficient, and less information is available regarding the relationship between inference compute and capabilities.

While companies might be hesitant to increase inference compute at scale due to cost, doing so may still be worthwhile in certain circumstances, such as for more narrowly deployed models or those willing to pay more for improved capabilities. For example, OpenAI offers dedicated instances for users who want more control over system performance, with a reserved allocation of compute infrastructure and the ability to enable features such as longer context limits.[ref 112]

Over time, compute usage during the AI development and deployment process may change. It was previously common practice to train models with supervised learning, which uses annotated datasets. In recent years, there has been a rise in self-supervised, semi-supervised, and unsupervised learning, which use data with limited or no annotation but require more compute.[ref 113] 

III. The Role of Compute Thresholds for AI Governance

A. How Can Compute Thresholds Be Used in AI Policy?

Compute can be used as a proxy for the capabilities of AI systems, and compute thresholds can be used to define the limited subset of high-compute models subject to oversight or other requirements.[ref 114] Their use depends on the context and purpose of the policy. Compute thresholds serve as intuitive starting points to identify potential models of concern,[ref 115] perhaps alongside other factors.[ref 116] They operate as a trigger for greater scrutiny or specific requirements. Once a certain level of training compute is reached, a model is presumed to have a higher risk of displaying dangerous capabilities (and especially unknown dangerous capabilities) and, hence, is subject to stricter oversight and other requirements.

Compute thresholds have already entered AI policy. The EU AI Act requires model providers to assess and mitigate systemic risks, conduct state-of-the-art tests and model evaluations, ensure cybersecurity, and report serious incidents if a compute threshold is crossed.[ref 117] Under the EU AI Act, a general-purpose model that meets the initial threshold is presumed to have high-impact capabilities and associated systemic risk.[ref 118]

In the United States, Executive Order 14,110 directed agencies to propose rules based on compute thresholds. Although it was revoked by President Trump’s Executive Order 14,148,[ref 119] many actions have already been taken and rules have been proposed for implementing Executive Order 14,110. For instance, the Department of Commerce’s Bureau of Industry and Security issued a proposed rule on September 11, 2024[ref 120] to implement the requirement that AI developers and cloud service providers report on models above certain thresholds, including information about (1) “any ongoing or planned activities related to training, developing, or producing dual-use foundation models,” (2) the results of red-teaming, and (3) the measures the company has taken to meet safety objectives.[ref 121] The executive order also imposed know-your-customer (“KYC”) monitoring and reporting obligations on U.S. cloud infrastructure providers and their foreign resellers, again with a preliminary compute threshold.[ref 122] On January 29, 2024, the Bureau of Industry and Security issued a proposed rule implementing those requirements.[ref 123] The proposed rule noted that training compute thresholds may determine the scope of the rule; the program is limited to foreign transactions to “train a large AI model with potential capabilities that could be used in malicious cyber-enabled activity,” and technical criteria “may include the compute used to pre-train the model exceeding a specified quantity.”[ref 124] The fate of these rules is uncertain, as all rules and actions taken pursuant to Executive Order 14,110 will be reviewed to ensure that they are consistent with the AI policy set forth in Executive Order 14,179, Removing Barriers to American Leadership in Artificial Intelligence.[ref 125] Any rules or actions identified as inconsistent are to be suspended, revised, or rescinded.[ref 126]

Numerous policy proposals have likewise called for compute thresholds. Scholars and developers alike have expressed support for a licensing or registration regime,[ref 127] and a compute threshold could be one of several ways to trigger the requirement.[ref 128] Compute thresholds have also been proposed for determining the level of KYC requirements for compute providers (including cloud providers).[ref 129] The Framework to Mitigate AI-Enabled Extreme Risks, proposed by U.S. Senators Romney, Reed, Moran, and King, would include a compute threshold for requiring notice of development, model evaluation, and pre-deployment licensing.[ref 130]

Other AI regulations and policy proposals do not explicitly call for the introduction of compute thresholds but could still benefit from them. A compute threshold could clarify when specific obligations are triggered in laws and guidance that refer more broadly to “advanced systems” or “systems with dangerous capabilities,” as in the voluntary guidance for “organizations developing the most advanced AI systems” in the Hiroshima Process International Code of Conduct for Advanced AI Systems, agreed upon by G7 leaders on October 30, 2023.[ref 131] Compute thresholds could identify when specific obligations are triggered in other proposals, including proposals for: (1) conducting thorough risk assessments of frontier AI models before deployment;[ref 132] (2) subjecting AI development to evaluation-gated scaling;[ref 133] (3) pausing development of frontier AI;[ref 134] (4) subjecting developers of advanced models to governance audits;[ref 135] (5) monitoring advanced models after deployment;[ref 136] and (6) requiring that advanced AI models be subject to information security protections.[ref 137]

B. Why Might Compute Be Relevant Under Existing Law?

Even without a formal compute threshold, the significance of training compute could affect the interpretation and application of existing laws. Courts and regulators may rely on compute as a proxy for how much risk a given AI system poses—alongside other factors such as capabilities, domain, safeguards, and whether the application is in a higher-risk context—when determining whether a legal condition or regulatory threshold has been met. This section briefly covers a few examples. First, it discusses the potential implications for duty of care and foreseeability analyses in tort law. It then goes on to describe how regulatory agencies could depend on training compute as one of several factors in evaluating risk from frontier AI, for example as an indicator of change to a regulated product and as a factor in regulatory impact analysis.

The application of existing laws and ongoing development of common law, such as tort law, may be particularly important while AI governance is still nascent[ref 138] and may operate as a complement to regulations once developed.[ref 139] However, courts and regulators will face new challenges as cases involve AI, an emerging technology of which they have no specialized knowledge, and parties will face uncertainty and inconsistent judgments across jurisdictions. As developments in AI unsettle existing law[ref 140] and agency practice, courts and agencies might rely on compute in several ways.

For example, compute could inform the duty of care owed by developers who make voluntary commitments to safety.[ref 141] A duty of care, which is a responsibility to take reasonable care to avoid causing harm to another, can be conditioned on the foreseeability of the plaintiff as a victim or be an affirmative duty to act in a particular way; affirmative duties can arise from the relationship between the parties, such as between business owner and customer, doctor and patient, and parent and child.[ref 142] If AI companies make general commitments to security testing and cybersecurity, such as the voluntary safety commitments secured by the Biden administration,[ref 143] those commitments may give rise to a duty of care in which training compute is a factor in determining what security is necessary. If a lab adopts a responsible scaling policy that requires it to have protection measures based on specific capabilities or potential for risk or misuse,[ref 144] a court might consider training compute as one of several factors in evaluating the potential for risk or misuse.

A court might also consider training compute as a factor when determining whether a harm was foreseeable. More advanced AI systems, trained with more compute, could foreseeably be capable of greater harm, especially in light of scaling laws discussed in Section I.C that make clear the relationship between compute and performance. It may likewise be foreseeable that a powerful AI system could be misused[ref 145] or become the target of more sophisticated attempts at exfiltration, which might succeed without adequate security.[ref 146] Foreseeability may in turn bear on negligence elements of proximate causation and duty of care.

Compute could also play a role in other scenarios, such as in a false advertising claim under the Lanham Act[ref 147] or state and federal consumer protection laws. If a business makes a claim about its AI system or services that is false or misleading, it could be held liable for monetary damages and enjoined from making that claim in the future (unless it becomes true).[ref 148] While many such claims will not involve compute, some may; for example, if a lab publicly claims to follow a responsible scaling policy, training compute could be relevant as an indicator of model capability and the corresponding security and safety measures promised by the policy.

Regulatory agencies may likewise consider compute in their analyses and regulatory actions. For example, the Environmental Protection Agency could consider training (and inference) compute usage as part of environmental impact assessments.[ref 149] Others could treat compute as a proxy for threat to national or public security. Agencies and committees responsible for identifying and responding to various risks, such as the Interagency Committee on Global Catastrophic Risk[ref 150] and the Financial Stability Oversight Council,[ref 151] could consider compute in their evaluation of risk from frontier AI. Over fifty federal agencies were directed to take specific actions to promote the responsible development, deployment, and federal use of AI, as well as regulation of industry, as part of the government-wide effort established by Executive Order 14,110,[ref 152] although these actions are now under review.[ref 153] Even for agencies not directed to consider compute or implement a preliminary compute threshold, compute might factor into how guidance is implemented over time.

More speculatively, changes to training compute could be used by agencies as one of many indicators of how much a regulated product has changed, and thus whether it warrants further review. For example, the Food and Drug Administration might consider compute when evaluating AI in medical devices or diagnostic tools.[ref 154] While AI products considered to be medical devices are more likely to be narrow AI systems trained on comparatively less compute, significant changes to training compute may be one indicator that software modifications require premarket submission. The ability to measure, report, and verify compute[ref 155] could make this approach particularly compelling for regulators.

Finally, training compute may factor into regulatory impact analyses, which evaluate the impact of proposed and existing regulations through quantitative and qualitative methods such as cost-benefit analysis.[ref 156] While this type of analysis is not necessarily determinative, it is often an important input into regulatory decisions and necessary for any “significant regulatory action.”[ref 157] As agencies develop and propose new regulations and consider how those rules will affect or be affected by AI, compute could be relevant in drawing lines that define what conduct and actors are affected. For example, a rule with a higher compute threshold and narrower scope may be less significant and costly, as it covers fewer models and developers. The amount of compute used to train models now and in the future may be not only a proxy for threat to national security (or innovation, or economic growth), but also a source of uncertainty, given the potential for emergent capabilities.

C. Where Should the Compute Threshold(s) Sit?

The choice of compute threshold depends on the policy under consideration: what models are the intended target, given the purpose of the policy? What are the burdens and costs of compliance? Can the compute threshold be complemented with other elements for determining whether a model falls within the scope of the policy, in order to more precisely accomplish its purpose?

Some policy proposals would establish a compute threshold “at the level of FLOP used to train current foundational models.”[ref 158] While the training compute of many models is not public, according to estimates, the largest models today were trained with 1e25 FLOP or more, including at least one open-source model, Llama 3.1 405B.[ref 159] This is the initial threshold established by the EU AI Act. Under the Act, general-purpose AI models are considered to have “systemic risk,” and thus trigger a series of obligations for their providers, if found to have “high impact capabilities.”[ref 160] Such capabilities are presumed if the cumulative amount of training compute, which includes all “activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning,” exceeds 1e25 FLOP.[ref 161] This threshold encompasses existing models such as Gemini Ultra and GPT-4, and it can be updated upwards or downwards by the European Commission through delegated acts.[ref 162] During the AI Safety Summit held in 2023, the U.K. Government included current models by defining “frontier AI” as “highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today’s most advanced models” and acknowledged that the definition included the models underlying ChatGPT, Claude, and Bard.[ref 163]

Others have proposed an initial threshold of “more training compute than already-deployed systems,”[ref 164] such as 1e26 FLOP[ref 165] or 1e27 FLOP.[ref 166] No known model currently exceeds 1e26 FLOP training compute, which is roughly five times the compute used to train GPT-4.[ref 167] These higher thresholds would more narrowly target future systems that pose greater risks, including potential catastrophic and existential risks.[ref 168] President Biden’s Executive Order on AI[ref 169] and recently-vetoed California Senate Bill 1047[ref 170] are in line with these proposals, both targeting models trained with more than 1e26 OP or FLOP.

Far more models would fall within the scope of a compute threshold set lower than current frontier models. While only two models exceeded 1e23 FLOP training compute in 2017, over 200 models meet that threshold today.[ref 171] As discussed in Section II.A, compute thresholds operate as a trigger for additional scrutiny, and more models falling within the ambit of regulation would entail a greater burden not only on developers, but also on regulators.[ref 172] These smaller, general-purpose models have not yet posed extreme risks, making a lower threshold unwarranted at this time.[ref 173]

While the debate has centered mostly around the establishment of a single training compute threshold, governments could adopt a pluralistic and risk-adjusted approach by introducing multiple compute thresholds that trigger different measures or requirements according to the degree or nature of risk. Some proposals recommend a tiered approach that would create fewer obligations for models trained on less compute. For example, the Responsible Advanced Artificial Intelligence Act of 2024 would require pre-registration and benchmarks for lower-compute models, while developers of higher-compute models must submit a safety plan and receive a permit prior to training or deployment.[ref 174] Multi-tiered systems may also incorporate a higher threshold beyond which no development or deployment can take place, with limited exceptions, such as for development at a multinational consortium working on AI safety and emergency response infrastructure[ref 175] or for training runs and models with strong evidence of safety.[ref 176]

Domain-specific thresholds could be established for models that possess capabilities or expertise in areas of concern and models that are trained using less compute than general-purpose models.[ref 177] A variety of specialized models are already available to advance research, trained on extensive scientific databases.[ref 178] As discussed in Part I.D, these models present a tremendous opportunity, yet many have also recognized the potential threat of their misuse to research, develop, and use chemical, biological, radiological, and nuclear weapons.[ref 179] To address these risks, President Biden’s Executive Order on AI, which set a compute threshold of 1e26 FLOP to trigger reporting requirements, set a substantially lower compute threshold of 1e23 FLOP for models trained “using primarily biological sequence data.”[ref 180] The Hiroshima Process International Code of Conduct for Advanced AI Systems likewise recommends devoting particular attention to offensive cyber capabilities and chemical, biological, radiological, and nuclear risks, although it does not propose a compute threshold.[ref 181]
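
For illustration only, the sketch below encodes the thresholds discussed in this Part, namely the EU AI Act’s 1e25 FLOP presumption, the 1e26 FLOP reporting threshold from Executive Order 14,110, and that order’s 1e23 FLOP threshold for models trained primarily on biological sequence data, as a simple classification function. It is a schematic of how general and domain-specific thresholds can compose, not a statement of currently applicable law.

# Schematic of how tiered and domain-specific compute thresholds compose,
# using figures discussed in this Part (EU AI Act presumption at 1e25 FLOP;
# Executive Order 14,110 reporting thresholds of 1e26 FLOP generally and
# 1e23 FLOP for models trained primarily on biological sequence data).
# This is an illustration, not a statement of currently applicable law.

def thresholds_crossed(training_flop: float, primarily_bio_data: bool) -> list[str]:
    triggered = []
    if training_flop > 1e25:
        triggered.append("EU AI Act: presumed systemic-risk general-purpose model")
    if training_flop > 1e26:
        triggered.append("EO 14,110: general dual-use reporting threshold")
    if primarily_bio_data and training_flop > 1e23:
        triggered.append("EO 14,110: biological sequence data threshold")
    return triggered

print(thresholds_crossed(3e25, primarily_bio_data=False))
print(thresholds_crossed(5e23, primarily_bio_data=True))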

While domain-specific thresholds could be useful for a variety of policies tailored to specific risks, there are some limitations. It may be technically difficult to verify how much biological sequence data (or other domain-specific data) was used to train a model.[ref 182] Another challenge is specifying how much data in a given domain causes a model to fall within scope, particularly considering the potential capabilities of models trained on mixed data.[ref 183] Finally, the amount of training compute required may be so low that, over time, a compute threshold is not practical.

When choosing a threshold, regulators should be aware that capabilities might be substantially improved through post-training enhancements, and training compute is only a general predictor of capabilities. The absolute limits are unclear at this point; however, current methods can result in capability improvements equivalent to a 5- to 30-times increase in training compute.[ref 184] To account for post-training enhancements, a governance regime could create a safety buffer, in which oversight or other protective measures are set at a lower threshold.[ref 185] Along similar lines, open-source models may warrant a lower threshold for at least some regulatory requirements, since they could be further trained by another actor and, once released, cannot be moderated or rescinded.[ref 186]
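
A minimal sketch of the safety-buffer idea follows: discount a capability level of concern (expressed in training compute) by an assumed maximum post-training enhancement factor to obtain the lower threshold at which oversight would begin. The 30x factor reflects the upper end of the range cited above; the target capability level is hypothetical.

# Illustrative safety buffer: if post-training enhancements can add capability
# equivalent to up to ~30x more training compute, oversight aimed at models
# with the capabilities of a hypothetical 3e26 FLOP system would need to start
# at a lower raw-training-compute threshold. Figures are illustrative only.

TARGET_EFFECTIVE_FLOP = 3e26      # hypothetical capability level of concern
MAX_ENHANCEMENT_FACTOR = 30       # upper end of the 5-30x range cited above

buffered_threshold = TARGET_EFFECTIVE_FLOP / MAX_ENHANCEMENT_FACTOR
print(f"Oversight threshold with safety buffer: {buffered_threshold:.0e} FLOP")
# -> 1e25 FLOP: models trained with this much compute could reach the target
#    capability level after post-training enhancements.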

D. Does a Compute Threshold Require Updates?

Once established, compute thresholds and related criteria will likely require updates over time.[ref 187] Improvements in algorithmic efficiency could reduce the amount of compute needed to train an equally capable model,[ref 188] or a threshold could be raised or eliminated if adequate protective measures are developed or if models trained with a certain amount of compute are demonstrated to be safe.[ref 189] To further guard against future developments in a rapidly evolving field, policymakers can authorize regulators to update compute thresholds and related criteria.[ref 190]

Several policies, proposed and enacted, have incorporated a dynamic compute threshold. For example, President Biden’s Executive Order on AI authorized the Secretary of Commerce to update the initial compute threshold set in the order, as well as other technical conditions for models subject to reporting requirements, “as needed on a regular basis” while establishing an interim compute threshold of 1e26 OP or FLOP.[ref 191] Similarly, the EU AI Act provides that the 1e25 FLOP compute threshold “should be adjusted over time to reflect technological and industrial changes, such as algorithmic improvements” and authorizes the European Commission to amend the threshold and “supplement benchmarks and indicators in light of evolving technological developments.”[ref 192] The California Senate Bill 1047 would have created the Frontier Model Division within the Government Operations Agency and authorized it to “update both of the [compute] thresholds in the definition of a ‘covered model’ to ensure that it accurately reflects technological developments, scientific literature, and widely accepted national and international standards and applies to artificial intelligence models that pose a significant risk of causing or materially enabling critical harms.”[ref 193]

Regulators may need to update compute thresholds rapidly. Historically, failure to quickly update regulatory definitions in the context of emerging technologies has led to definitions becoming useless or even counterproductive.[ref 194] In the field of AI, developments may occur quickly and with significant implications for national security and public health, making responsive rulemaking particularly important. In the United States, there are several statutory tools to authorize and encourage expedited and regular rulemaking.[ref 195] For example, Congress could expressly authorize interim or direct final rulemaking, which would enable an agency to shift the comment period in notice-and-comment rulemaking to take place after the rule has already been promulgated, thereby allowing them to respond quickly to new developments.[ref 196]

Policymakers could also require a periodic evaluation of whether compute thresholds are achieving their purpose to ensure that they do not become over- or under-inclusive. While establishing and updating a compute threshold necessarily involves prospective ex ante impact assessment, in order to take precautions against risk without undue burdens, regulators can learn much from retrospective ex post analysis of current and previous thresholds.[ref 197] In a survey conducted for the Administrative Conference of the United States, “[a]ll agencies stated that periodic reviews have led to substative [sic] regulatory improvement at least some of time. This was more likely when the underlying evidence basis for the rule, particularly the science or technology, was changing.”[ref 198] While the optimal frequency of periodic review is unknown, the study found that U.S. federal agencies were more likely to conduct reviews when provided with a clear time interval (“at least every X years”).[ref 199]

Several further institutional and procedural factors could affect whether and how compute thresholds are updated. In order to effectively update compute thresholds and other criteria, regulators must have access to expertise and talent through hiring, training, consultation and collaboration, and other avenues that facilitate access to experts from academia and industry.[ref 200] Decisions will be informed by the availability of data, including scientific and commercial data, to enable ongoing monitoring, learning, analysis, and adaptation in light of new developments. Decision-making procedures, agency design, and influence and pressures from policymakers, developers, and other stakeholders will likewise affect updates, among many other factors.[ref 201] While more analysis is beyond the scope of this Article, others have explored procedural and substantive measures for adaptive regulation[ref 202] and effective governance of emerging technologies.[ref 203]

Some have proposed defining compute thresholds in terms of effective compute,[ref 204] as an alternative to updates over time. Effective compute could be indexed to a particular year (similar to inflation adjustments) and thus account for the role of algorithmic progress (e.g., 1e25 FLOP of 2023-level effective compute).[ref 205] However, there is no agreed-upon way to precisely define and calculate effective compute, and the ability to do so depends on the challenging task of calculating algorithmic efficiency, including choosing a performance metric to anchor on. Furthermore, effective compute alone would fail to address potential changes in the risk landscape, such as the development of protective measures.
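
A minimal sketch of the indexing idea follows, assuming for illustration that algorithmic progress halves the compute needed for a given capability roughly every nine months, in line with the estimates cited in Section I.C; both the doubling period and the choice of anchor metric are contested, which is precisely the difficulty noted above.

# Minimal sketch of "effective compute" indexed to a reference year, assuming
# (for illustration) that algorithmic progress halves the compute required for
# a given level of capability every 9 months. The doubling period and the
# choice of anchor metric are contested; this is not a settled methodology.

DOUBLING_PERIOD_YEARS = 0.75  # assumed: algorithmic efficiency doubles every 9 months

def effective_compute(raw_flop: float, training_year: float,
                      reference_year: float = 2023.0) -> float:
    """Express raw training compute in reference-year effective FLOP."""
    years_of_progress = training_year - reference_year
    return raw_flop * 2 ** (years_of_progress / DOUBLING_PERIOD_YEARS)

# A 1e25 FLOP run in 2026 is worth far more 2023-level effective compute,
# because the same budget buys more capability with better algorithms.
print(f"{effective_compute(1e25, 2026):.1e} FLOP (2023-equivalent)")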

E. What Are the Advantages and Limitations of a Training Compute Threshold?

Compute has several properties that make it attractive for policymaking: it is (1) correlated with capabilities and thus risk, (2) essential for training, with thresholds that are difficult to circumvent without reducing performance, (3) an objective and quantifiable measure, (4) capable of being estimated before training, (5) externally verifiable after training, and (6) a significant cost during development and thus indicative of developer resources. However, training compute thresholds are not infallible: (1) training compute is an imprecise indicator of potential risk, (2) a compute threshold could be circumvented, and (3) there is no industry standard for measuring and reporting training compute.[ref 206] Some of these limitations can be addressed with thoughtful drafting, including clear language, alternative and supplementary elements for defining what models are within scope, and authority to update any compute threshold and other criteria in light of future developments.

First, training compute is correlated with model capabilities and associated risks. Scaling laws predict an increase in performance as training compute increases, and real-world capabilities generally follow (Section I.C). As models become more capable, they may also pose greater risks if they are misused or misaligned (Section I.D). However, training compute is not a precise indicator of downstream capabilities. Capabilities can seemingly emerge abruptly and discontinuously as models are developed with more compute,[ref 207] and the open-ended nature of foundation models means those capabilities may go undetected.[ref 208] Post-training enhancements such as fine-tuning are often not considered a part of training compute, yet they can dramatically improve performance and capabilities with far less compute. Furthermore, not all models with dangerous capabilities require large amounts of training compute; low-compute models with capabilities in certain domains, such as biology or chemistry, may also pose significant risks, such as biological design tools that could be used for drug discovery or the creation of pathogens worse than any seen to date.[ref 209] The market may shift towards these smaller, cheaper, more specialized models,[ref 210] and even general-purpose low-compute models may come to pose significant risks. Given these limitations, a training compute threshold cannot capture all possible risks; however, for large, general-purpose AI models, training compute can act as an initial threshold for capturing emerging capabilities and risks.

Second, compute is necessary throughout the AI lifecycle, and a compute threshold would be difficult to circumvent. There is no AI without compute (Section I.A). Due to its relationship with model capabilities, training compute cannot be easily reduced without a corresponding reduction in capabilities, making it difficult to circumvent for developers of the most advanced models. Nonetheless, companies might find “creative ways” to account for how much compute is used for a given system in order to avoid being subject to stricter regulation.[ref 211] To reduce this risk, some have suggested monitoring compute usage below these thresholds to help identify circumvention methods, such as structuring techniques or outsourcing.[ref 212] Others have suggested using compute thresholds alongside additional criteria, such as the model’s performance on benchmarks, financial or energy cost, or level of integration into society.[ref 213] As in other fields, regulatory burdens associated with compute thresholds could encourage regulatory arbitrage if a policy does not or cannot effectively account for that possibility.[ref 214] For example, since compute can be accessed remotely via digital means, data centers and compute providers could move to less-regulated jurisdictions.

Third, compute is an objective and quantifiable metric that is relatively straightforward to measure. Compute is a quantitative measure that reflects the number of mathematical operations performed. It does not depend on specific infrastructure and can be compared across different sets of hardware and software.[ref 215] By comparison, other metrics, such as algorithmic innovation and data, have been more difficult to track.[ref 216] Whereas quantitative metrics like compute can be readily compared across different instances, the qualitative nature of many other metrics makes them more subject to interpretation and difficult to consistently measure. Compute usage can be measured internally with existing tools and systems; however, there is not yet an industry standard for measuring, auditing, and reporting the use of computational resources.[ref 217] That said, there have been some efforts toward standardization of compute measurement.[ref 218] In the absence of a standard, some have instead presented a common framework for calculating compute, based on information about the hardware used and training time.[ref 219]
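
One such framework estimates training compute from the hardware side: the number of chips, their peak performance, the duration of training, and an assumed average utilization rate. The sketch below implements that arithmetic; the peak-performance and utilization figures are illustrative assumptions rather than specifications of any particular system.

# Hardware-based estimate of training compute, following the common framework:
#   training FLOP ~= (# chips) x (peak FLOP/s per chip) x (seconds of training)
#                    x (average utilization)
# Peak-performance and utilization figures below are illustrative assumptions.

def training_compute_from_hardware(num_chips: int,
                                   peak_flop_per_second: float,
                                   training_days: float,
                                   utilization: float) -> float:
    seconds = training_days * 24 * 3600
    return num_chips * peak_flop_per_second * seconds * utilization

estimate = training_compute_from_hardware(
    num_chips=10_000,              # hypothetical cluster size
    peak_flop_per_second=1e15,     # ~1 petaFLOP/s per accelerator (assumed)
    training_days=90,
    utilization=0.4,               # realized utilization falls well below peak
)
print(f"Estimated training compute: {estimate:.1e} FLOP")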

Fourth, compute can be estimated ahead of model development and deployment. Developers already estimate training compute with information about the model’s architecture and amount of training data, as part of planning before training takes place. The EU AI Act recognizes this, noting that “training of general-purpose AI models takes considerable planning which includes the upfront allocation of compute resources and, therefore, providers of general-purpose AI models are able to know if their model would meet the threshold before the training is completed.”[ref 220] Since compute can be readily estimated before a training run, developers can plan a model with existing policies in mind and implement appropriate precautions during training, such as cybersecurity measures.
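As an illustration of this kind of ex-ante planning, a commonly used rule of thumb for dense transformer models approximates training compute as roughly six operations per parameter per training token. The sketch below applies that approximation to hypothetical model and dataset sizes and compares the result against an illustrative 10^26-operation threshold; none of these numbers describe any particular model.

```python
# Minimal sketch of an ex-ante training compute estimate for a dense transformer,
# using the common approximation of ~6 operations per parameter per training token.
# The parameter count, token count, and threshold comparison are hypothetical.

def estimate_training_flop(num_parameters: float, num_training_tokens: float) -> float:
    """Approximate training compute as roughly 6 * parameters * tokens."""
    return 6 * num_parameters * num_training_tokens

REPORTING_THRESHOLD_FLOP = 1e26  # illustrative threshold used for planning purposes

planned_flop = estimate_training_flop(num_parameters=1e12,       # 1 trillion parameters
                                      num_training_tokens=2e13)  # 20 trillion tokens
print(f"Planned training compute: {planned_flop:.1e} FLOP")
print("Exceeds threshold" if planned_flop >= REPORTING_THRESHOLD_FLOP else "Below threshold")
```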

Fifth, the amount of compute used could be externally verified after training. While laws that use compute thresholds as a trigger for additional measures could depend on self-reporting, meaningful enforcement requires regulators to be aware of, or at least able to verify, the amount of compute being used. A regulatory threshold will be ineffective if regulators have no way of knowing whether a threshold has been reached. For this reason, some scholars have proposed that developers and compute providers be required to report the amount of compute used at different stages of the AI lifecycle.[ref 221] Compute providers already bill clients by chip-hours, which could be used to calculate total computational operations,[ref 222] and the centralization of a few key cloud providers could make monitoring and reporting requirements simpler to administer.[ref 223] Others have proposed using “on-chip” or “hardware-enabled governance mechanisms” to verify claims about compute usage.[ref 224]

Sixth, training compute is an indicator of developer resources and capacity to comply with regulatory requirements, as it represents a substantial financial investment.[ref 225] For instance, Sam Altman reported that the development of GPT-4 cost “much more” than $100 million.[ref 226] Researchers have estimated that Gemini Ultra cost $70 million to $290 million to develop.[ref 227] A regulatory approach based on training compute thresholds can therefore be used to subject only the best-resourced AI developers to increased regulatory scrutiny, while avoiding overburdening small companies, academics, and individuals. Over time, the cost of compute will most likely continue to fall, meaning the same thresholds will capture more developers and models. To ensure that the law remains appropriately scoped, compute thresholds can be complemented by additional metrics, such as the cost of compute or development. For example, the vetoed California Senate Bill 1047 was amended to include a compute cost threshold, defining a “covered model” as one trained using more than 10^26 operations, but only if the cost of that training compute, as calculated at the start of training, exceeded $100,000,000.[ref 228]
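The SB 1047 definition described above can be expressed as a simple conjunctive test. The sketch below is a minimal illustration of that structure with hypothetical inputs; the bill itself specified how the cost of compute was to be calculated at the start of training, and that statutory method, not this sketch, would govern.

```python
# Minimal sketch of an SB 1047-style "covered model" test: the model must exceed
# both a compute threshold and a compute-cost threshold. Inputs are hypothetical.

COMPUTE_THRESHOLD_OPERATIONS = 1e26
COST_THRESHOLD_USD = 100_000_000

def is_covered_model(training_operations: float, training_compute_cost_usd: float) -> bool:
    """Return True only if both the compute and cost thresholds are exceeded."""
    return (training_operations > COMPUTE_THRESHOLD_OPERATIONS
            and training_compute_cost_usd > COST_THRESHOLD_USD)

print(is_covered_model(2e26, 150_000_000))  # True: both thresholds exceeded
print(is_covered_model(2e26, 60_000_000))   # False: compute was cheap enough to fall outside
```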

At the time of writing, many consider compute thresholds to be the best option currently available for determining which AI models should be subject to regulation, although the limitations of this approach underscore the need for careful drafting and adaptive governance. The specific compute threshold chosen should correspond to the nature and extent of the scrutiny and other legal obligations it triggers, and it should reflect the fact that compute is only a proxy for, and not a precise measure of, risk.

F. How Do Compute Thresholds Compare to Capability Evaluations?

A regulatory approach that uses a capabilities-based threshold or evaluation may seem more intuitively appealing and has been proposed by many.[ref 229] There are currently two main types of capability evaluations: benchmarking and red-teaming.[ref 230] In benchmarking, a model is tested on a specific dataset and receives a numerical score. In red-teaming, evaluators can use different approaches to identify vulnerabilities and flaws in a system, such as through prompt injection attacks to subvert safety guardrails. Model evaluations like these already serve as the basis for responsible scaling policies, which specify what protective measures an AI developer must implement in order to safely handle a given level of capabilities. Responsible scaling policies have been adopted by companies like Anthropic, OpenAI, and Google, and policymakers have also encouraged their development and practice.[ref 231]
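To give a concrete sense of what a benchmark score is, the sketch below scores a model's answers against a tiny, invented evaluation set; both the dataset and the stand-in model_answer function are hypothetical placeholders for a real benchmark and a real model.

```python
# Minimal sketch of benchmarking: compare a model's answers against a fixed
# evaluation dataset and report the fraction answered correctly.
# The dataset and model_answer function are hypothetical placeholders.

benchmark = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Which gas do plants absorb from the air?", "answer": "carbon dioxide"},
]

def model_answer(question: str) -> str:
    # Stand-in for a call to the model under evaluation.
    return {"What is 2 + 2?": "4"}.get(question, "unknown")

correct = sum(model_answer(item["question"]) == item["answer"] for item in benchmark)
print(f"Benchmark score: {correct / len(benchmark):.0%}")  # 50% on this toy set
```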

Capability evaluations can complement compute thresholds. For example, capability evaluations could be required for models exceeding a compute threshold that indicates that dangerous capabilities might exist. They could also be used as an alternative route to being covered by regulation. The EU AI Act adopts the latter approach, complementing the compute threshold with the possibility for the European Commission to “take individual decisions designating a general-purpose AI model as a general-purpose AI model with systemic risk if it is found that such model has capabilities or an impact equivalent to those captured by the set threshold.”[ref 232]

Nonetheless, there are several downsides to depending on capabilities alone. First, model capabilities are difficult to measure.[ref 233] Benchmark results can be affected by factors other than capabilities, such as benchmark data being included during training[ref 234] and model sensitivity to small changes in prompting.[ref 235] Downstream capabilities of a model may also differ from those during evaluation due to changes in dataset distribution.[ref 236] Some threats, such as misuse of a model to develop a biological weapon, may be particularly difficult to evaluate due to the domain expertise required, the sensitivity of information related to national security, and the complexity of the task.[ref 237] For dangerous capabilities such as deception and manipulation, the nature of the capability makes it difficult to assess,[ref 238] although some evaluations have already been developed.[ref 239] Furthermore, while evaluations can point to what capabilities do exist, it is far more difficult to prove that a model does not possess a given capability. Over time, new capabilities may even emerge and improve due to prompting techniques, tools, and other post-training enhancements.

Second, and compounding the issue, there is no standard method for evaluating model capabilities.[ref 240] While benchmarks allow for comparison across models, there are competing benchmarks for similar capabilities; with none adopted as standard by developers or the research community, evaluators could select different benchmark tests entirely.[ref 241] Red-teaming, while more in-depth and responsive to differences in models, is even less standardized and provides less comparable results. Similarly, no standard exists for when during the AI lifecycle a model is evaluated, even though fine-tuning and other post-training enhancements can have a significant impact on capabilities. Nevertheless, there have been some efforts toward standardization, including the U.S. National Institute of Standards and Technology beginning to develop guidelines and benchmarks for evaluating AI capabilities, including through red-teaming.[ref 242]

Third, it is much more difficult to externally verify model evaluations. Since evaluation methods are not standardized, different evaluators and methods may come to different conclusions, and even a small difference could determine whether a model falls within the scope of regulation. This makes external verification simultaneously more important and more challenging. In addition to the technical challenge of how to consistently verify model evaluations, there is also a practical challenge: certain methods, such as red-teaming and audits, depend on far greater access to a model and information about its development. Developers have been reluctant to grant permissive access,[ref 243] which has contributed to numerous calls to mandate external evaluations.[ref 244]

Fourth, model evaluations may be circumvented. For red-teaming and more comprehensive audits, evaluations for a given model may reasonably reach different conclusions, which allows room for an evaluator to deliberately shape results through their choice of methods and interpretation. Careful institutional design is needed to ensure that evaluations are robust to conflicts of interest, perverse incentives, and other limitations.[ref 245] If known benchmarks are used to determine whether a model is subject to regulation, developers might train models to achieve specific scores without affecting capabilities, whether to improve performance on safety measures or strategically underperform on certain measures of dangerous capabilities.

Finally, capability evaluations entail more uncertainty and expense. Currently, the capabilities of a model can only reliably be determined ex post,[ref 246] making it difficult for developers to predict whether it will fall within the scope of applicable law. More in-depth model evaluations such as red-teaming and audits are expensive and time-consuming, which may constrain small organizations, academics, and individuals.[ref 247]

Capability evaluations can thus be viewed as a complementary tool for estimating model risk. While training compute makes an excellent initial threshold for regulatory oversight, as an objective and quantifiable measure that can be estimated prior to training and verified after, capabilities correspond more closely to risk. Capability evaluations provide more information and can be completed after fine-tuning and other post-training enhancements, but are more expensive, difficult to carry out, and less standardized. Both are important components of AI governance but serve different roles.

IV. Conclusion

More powerful AI could bring transformative changes in society. It promises extraordinary opportunities and benefits across a wide range of sectors, with the potential to improve public health, make new scientific discoveries, improve productivity and living standards, and accelerate economic growth. However, the very same advanced capabilities could result in tremendous harms that are difficult to control or remedy after they have occurred. AI could fail in critical infrastructure, further concentrate wealth and increase inequality, or be misused for more effective disinformation, surveillance, cyberattacks, and development of chemical and biological weapons.

In order to prevent these potential harms, laws that govern AI must identify models that pose the greatest threat. The obvious answer would be to evaluate the dangerous capabilities of frontier models; however, state-of-the-art model evaluations are subjective and unable to reliably predict downstream capabilities, and they can take place only after the model has been developed with a substantial investment.

This is where training compute thresholds come into play. Training compute can operate as an initial threshold for estimating the performance and capabilities of a model and, thus, the potential risk it poses. Despite its limitations, it may be the most effective option we have to identify potentially dangerous AI that warrants further scrutiny. However, compute thresholds alone are not sufficient. They must be used alongside other tools to mitigate and respond to risk, such as capability evaluations, post-market monitoring, and incident reporting. Several avenues for further research could strengthen governance via compute thresholds:

  1. What amount of training compute corresponds to future systems of concern? What threshold is appropriate for different regulatory targets, and how can we identify that threshold in advance? What are the downstream effects of different compute thresholds?
  2. Are compute thresholds appropriate for different stages of the AI lifecycle? For example, could thresholds for compute used for post-training enhancements or during inference be used alongside a training compute threshold, given the ability to significantly improve capabilities at these stages?
  3. Should domain-specific compute thresholds be established, and if so, to address which risks? If domain-specific compute thresholds are established, such as in President Biden’s Executive Order 14,110, how can competent authorities determine if a system is domain-specific and verify the training data?
  4. How should compute usage be reported, monitored, and audited?
  5. How should a compute threshold be updated over time? What is the likelihood of future frontier systems being developed using less (or far less) compute than is used today? Does growth or slowdown in compute usage, hardware improvement, or algorithmic efficiency warrant an update, or should it correspond solely to an increase in capabilities? Relatedly, what kind of framework would allow a regulatory agency to respond to developments effectively (e.g., with adequate information and the ability to update rapidly)?
  6. How could a capabilities-based threshold complement or replace a compute threshold, and what would be necessary (e.g., improved model evaluations for dangerous capabilities and alignment)?
  7. How should the law mitigate risks from AI systems that sit below the training compute threshold?

The governance misspecification problem

In technical Artificial Intelligence (“AI”) safety research, the term “specification” refers to the problem of defining the purpose of an AI system so that the system behaves in accordance with the true wishes of its designer.[ref 1] Technical researchers have suggested three categories of specification: “ideal specification,” “design specification,” and “revealed specification.”[ref 2] The ideal specification, in this framework, is a hypothetical specification that would create an AI system completely and perfectly aligned with the desires of its creators. The design specification is the specification that is actually used to build a given AI system. The revealed specification is the specification that best describes the actual behavior of the completed AI system. “Misspecification” occurs whenever the revealed specification of an AI system diverges from the ideal specification—i.e., when an AI system does not perform in accordance with the intentions of its creators. 

The fundamental problem of specification is that “it is often difficult or infeasible to capture exactly what we want an agent to do, and as a result we frequently end up using imperfect but easily measured proxies.”[ref 3] Thus, in a famous example from 2016, researchers at OpenAI attempted to train a reinforcement learning agent to play the boat-racing video game CoastRunners, the goal of which is to finish a race quickly and ahead of other players.[ref 4] Instead of basing the AI agent’s reward function on how it placed in the race, however, the researchers used a proxy goal that was easier to implement and rewarded the agent for maximizing the number of points it scored. The researchers mistakenly assumed that the agent would pursue this proxy goal by trying to complete the course quickly. Instead, the AI discovered that it could achieve a much higher score by refusing to complete the course and instead driving in tight circles in such a way as to repeatedly collect a series of power-ups while crashing into other boats and occasionally catching on fire.[ref 5] In other words, the design specification (“collect as many points as possible”) did not correspond well to the ideal specification (“win the race”), leading to a disastrous and unexpected revealed specification (crashing repeatedly and failing to finish the race). 

This article applies the misspecification framework to the problem of AI governance. The resulting concept, which we call the “governance misspecification problem,” can be briefly defined as occurring when a legal rule relies unsuccessfully on proxy terms or metrics. By framing this new concept in terms borrowed from the technical AI safety literature, we hope to incorporate valuable insights from that field into legal-philosophical discussions around the nature of rules and, importantly, to help technical researchers understand the philosophical and policymaking challenges that AI governance legislation and regulation poses. 

It is generally accepted among legal theorists that at least some legal rules can be said to have a purpose or purposes and that these purposes should inform the interpretation of textually ambiguous rules.[ref 6] The least ambitious version of this claim is simply an acknowledgment of the fact that statutes often contain a discrete textual provision entitled “Purpose,” which is intended to inform the interpretation and enforcement of the statute’s substantive provisions.[ref 7] More controversially, some commentators have argued that all or many legal rules have, or should be constructively understood as having, an underlying “true purpose,” which may or may not be fully discoverable and articulable.[ref 8] 

The purpose of a legal rule is analogous to the “ideal specification” discussed in the technical AI safety literature. Like the ideal specification of an AI system, a rule’s purpose may be difficult or impossible to perfectly articulate or operationalize, and rulemakers may choose to rely on a legal regime that incorporates “imperfect but easily measured proxies”—essentially, a design specification. “Governance misspecification” occurs when the legal regime as written (analogous to the design specification), as interpreted and enforced in the real world (analogous to the revealed specification), fails to effectuate the rule’s intended purpose (analogous to the ideal specification).

Consider the hypothetical legal rule prohibiting the presence of “vehicles” in a public park, famously described by the legal philosopher H.L.A. Hart.[ref 9] The term “vehicles,” in this rule, is presumably a proxy term intended to serve some ulterior purpose,[ref 10] although fully discovering and articulating that purpose may be infeasible. For example, the rule might be intended to ensure the safety of pedestrians in the park, or to safeguard the health of park visitors by improving the park’s air quality, or to improve the park’s atmosphere by preventing excessive noise levels. More realistically, the purpose of the rule might be some complex weighted combination of all of these and numerous other more or less important goals. Whether the rule is misspecified depends on whether the rule’s purpose, whatever it is, is furthered by the use of the proxy term “vehicle.”

Hart used the “no vehicles in the park” rule in an attempt to show that the word “vehicle” had a core of concrete and settled linguistic meaning (an automobile is a vehicle) as well as a semantic “penumbra” containing more or less debatable cases such as bicycles, toy cars, and airplanes. The rule, in other words, is textually ambiguous, although this does not necessarily mean that it is misspecified.[ref 11] Because the rule is ambiguous, a series of difficult interpretive decisions may have to be made regarding whether a given item is or is not a vehicle. At least some of these decisions, and the costs associated with them, could have been avoided if the rulemaker had chosen to use a more detailed formulation in lieu of the term “vehicle,”[ref 12] or if the rulemaker had issued a statement clarifying the purpose of the rule.[ref 13] 

Although the concept of misspecification is generally applicable to legal rules, misspecification tends to occur particularly frequently and with serious consequences in the context of laws and regulations governing poorly-understood emerging technologies such as artificial intelligence. Again, consider “no vehicles in the park.” Many legal rules, once established, persist indefinitely even as the technology they govern changes fundamentally.[ref 14] The objects to which the proxy term “vehicle” can be applied will change over time; electric wheelchairs, for example, may not have existed when the rule was originally drafted, and airborne drones may not have been common. The introduction of these new potential “vehicles” is extremely difficult to account for in an original design specification.[ref 15] 

The governance misspecification problem is particularly relevant to the governance of AI systems. Unlike most other emerging technologies, frontier AI systems are, in key respects, not only poorly understood but fundamentally uninterpretable by existing methods.[ref 16] This problem of interpretability is a major focus area for technical AI safety researchers.[ref 17] The widespread use of proxy terms and metrics in existing AI governance policies and proposals is, therefore, a cause for concern.[ref 18]

In Section I, this article draws on existing legal-philosophical discussions of the nature of rules to further explain the problem of governance misspecification and situates the concept in the existing public policy literature. Sections II and III make the case for the importance of the problem by presenting a series of case studies to show that rules aimed at governing emerging technologies are often misspecified and that misspecified rules can cause serious problems for the regulatory regime they contribute to, for courts, and for society generally. Section IV offers a few suggestions for reducing the risk of and mitigating the harm from misspecified rules, including eschewing or minimizing the use of proxy terms, rapidly updating and frequently reviewing the effectiveness of regulations, and including specific and clear statements of the purpose of a legal rule in the text of the rule. Section V applies the conclusions of the previous Sections prospectively to several specific challenges in the field of AI governance, including the use of compute thresholds, semiconductor export controls, and the problem of defining “frontier” AI systems. Section VI concludes.

I. The Governance Misspecification Problem in Legal Philosophy and Public Policy 

A number of publications in the field of legal philosophy have discussed the nature of legal rules and arrived at conclusions helpful to fleshing out the contours of the governance misspecification problem.[ref 19] Notably, Schauer (1991) suggests the useful concepts of over- and under-inclusiveness, which can be understood as two common ways in which legal rules can become misspecified.[ref 20] Overinclusive rules prohibit or prescribe actions that an ideally specified rule would not apply to, while underinclusive rules fail to prohibit or prescribe actions that an ideally specified rule would apply to. So, in Hart’s “no vehicles in the park” hypothetical, suppose that the sole purpose of the rule was to prevent park visitors from being sickened by diesel fumes. If this were the case, the rule would be overinclusive, because it would pointlessly prohibit many vehicles that do not emit diesel fumes. If, on the other hand, the purpose of the rule was to prevent music from being played loudly in the park on speakers, the rule would be underinclusive, as it fails to prohibit a wide range of speakers that are not installed in a vehicle. 

Ideal specification is rarely feasible, and practical considerations may dictate that a well-specified rule should rely on proxy terms that are under- or overinclusive to some extent. As Schauer (1991) explains, “Speed Limit 55” is a much easier rule to follow and enforce consistently than “drive safely,” despite the fact that the purpose of the speed limit is to promote safe driving and despite the fact that some safe driving can occur at speeds above 55 miles per hour and some dangerous driving can occur at speeds below 55 miles per hour.[ref 21] In other words, the benefits of creating a simple and easily followed and enforced rule outweigh the costs of over- and under-inclusiveness in many cases.[ref 22]

In the public policy literature, the existing concept that bears the closest similarity to governance misspecification is “policy design fit.”[ref 23] Policy design is currently understood as including a mix of interrelated policy goals and the instruments through which those goals are accomplished, including legal, financial, and communicative mechanisms.[ref 24] A close fit between policy goals and the means used to accomplish those goals has been shown to increase the effectiveness of policies.[ref 25] The governance misspecification problem can be understood as a particular species of failure of policy design fit—a failure of congruence between a policy goal and a proxy term in the legal rule which is the means used to further that goal.[ref 26] 

II. Legal Rules Governing Emerging Technologies Are Often Misspecified 

Misspecification occurs frequently in both domestic and international law, and in both reactive and anticipatory rules directed at new technologies. In order to illustrate how misspecification happens, and to give a sense of the significance of the problem in legal rules addressing emerging technologies, this Section discusses three historical examples of the phenomenon in the contexts of cyberlaw, copyright law, and nuclear anti-proliferation treaties.

Section 1201(a)(2) of the Digital Millennium Copyright Act of 1998 (DMCA) prohibits the distribution of any “technology, product, service, device, component, or part thereof” primarily designed to decrypt copyrighted material.[ref 27] Congressman Howard Coble, one of the architects of the DMCA, stated that this provision was “drafted carefully to target ‘black boxes’”—physical devices with “virtually no legitimate uses,” useful only for facilitating piracy.[ref 28] The use of “black boxes” for the decryption of digital works was not widespread in 1998, but the drafters of the DMCA predicted that such devices would soon become an issue. In 1998, this prediction seemed a safe bet, as previous forms of piracy decryption had relied on specialized tools—the phrase “black box” is a reference to one such tool, also known as a “descrambler” and used to decrypt premium cable television channels.[ref 29] 

However, the feared black boxes never arrived. Instead, pirates relied on software, using decryption programs distributed for free online to circumvent anti-piracy encryption.[ref 30] Courts found the distribution of such programs, and even the posting of hyperlinks leading to websites containing such programs, to be violations of the DMCA.[ref 31] In light of earlier cases holding that computer code was a form of expression entitled to First Amendment protection, this interpretation placed the DMCA into tension with the First Amendment.[ref 32] This tension was ultimately resolved in favor of the DMCA, and the distribution of decryption programs used for piracy was prohibited.[ref 33]

No one in Congress anticipated that the statute which had been “carefully drafted to target ‘black boxes’” would be used to prohibit the distribution of lines of computer code, or that this would raise serious concerns regarding freedom of speech. Section 1201(a)(2), in other words, was misspecified; by prohibiting the distribution of any “technology” or “service” designed for piracy, as well as any “device,” the framers of the DMCA banned more than they intended to ban and created unforeseen constitutional issues. 

Misspecification also occurs in international law. The Treaty on Principles Governing the Activities of States in the Exploration and Use of Outer Space, which the United States and the Soviet Union entered into in 1967, obligated the parties “not to place in orbit around the Earth any objects carrying nuclear weapons…”[ref 34] Shortly after the treaty was entered into, however, it became clear that the Soviet Union planned to take advantage of a loophole in the misspecified prohibition. The Fractional Orbital Bombardment System (FOBS) placed missiles into orbital trajectories around the earth, but then redirected them to strike a target on the earth’s surface before they completed a full orbit.[ref 35] An object is not “in orbit” until it has circled the earth at least once; therefore, FOBS did not violate the 1967 Treaty, despite the fact that it allowed the Soviet Union to strike at the U.S. from space and thereby evade detection by the U.S.’s Ballistic Missile Early Warning System.[ref 36] The U.S. eventually neutralized this advantage by expanding the coverage and capabilities of early warning systems so that FOBS missiles could be detected and tracked, and in 1979 the Soviets agreed to a better-specified ban which prohibited “fractional orbital missiles” as well as other space-based weapons.[ref 37] Still, the U.S.’s agreement to use the underinclusive proxy term “in orbit” allowed the Soviet Union to temporarily gain a potentially significant first-strike advantage.

Misspecification occurs in laws and regulations directed towards existing and well-understood technologies as well as in anticipatory regulations. Take, for example, the Computer Fraud and Abuse Act (CFAA), 18 U.S.C. § 1030, which has been called “the worst law in technology.”[ref 38] The CFAA was originally enacted in 1984, but has since been amended several times, most recently in 2020.[ref 39] Among other provisions, the CFAA criminalizes “intentionally access[ing] a computer without authorization or exceed[ing] authorized access, and thereby obtain[ing]… information from any protected computer.”[ref 40] The currently operative language for this provision was introduced in 1996,[ref 41] by which point the computer was hardly an emerging technology, and slightly modified in 2008.[ref 42] 

Read literally, the CFAA’s prohibition on unauthorized access criminalizes both (a) violating a website’s terms of service while using the internet, and (b) using an employer’s computer or network for personal reasons, in violation of company policy.[ref 43] In other words, a literal reading of the CFAA would mean that hundreds of millions of Americans commit crimes every week by, e.g., sharing a password with a significant other or accessing social media at work.[ref 44] Court decisions eventually established narrower definitions of the key statutory terms (“without authorization” and “exceeds authorized access”),[ref 45] but not before multiple defendants were prosecuted for violating the CFAA by failing to comply with a website’s terms of service[ref 46] or accessing an employer’s network for personal reasons in violation of workplace rules.[ref 47]

Critics of the CFAA have discussed its flaws in terms of the constitutional law doctrines of “vagueness”[ref 48] and “overbreadth.”[ref 49] These flaws can also be conceptualized in terms of misspecification. The phrases “intentionally accesses without authorization” and “exceeds authorized access,” and the associated statutory definitions, are poor proxies for the range of behavior that an ideally specified version of the CFAA would have criminalized. The proxies criminalize a great deal of conduct that none of the stakeholders who drafted, advocated for, or voted to enact the law wanted to criminalize[ref 50] and created substantial legal and political backlash against the law. This backlash led to a series of losses for federal prosecutors as courts rejected their broad proposed interpretations of the key proxy terms because, as the Ninth Circuit Court of Appeals put it, “ubiquitous, seldom-prosecuted crimes invite arbitrary and discriminatory enforcement.”[ref 51]

The issues caused by poorly selected proxy terms in the CFAA, the Outer Space Treaty, and the DMCA demonstrate that important legal rules drafted for the regulation of emerging technologies are prone to misspecification, in both domestic and international law contexts and for both anticipatory and reactive rules. These case studies were chosen because they are representative of how legal rules become misspecified; if space allowed, numerous additional examples of misspecified rules directed towards new technologies could be offered.[ref 52]

III. Consequences of Misspecification in the Regulation of Emerging Technologies

The case studies examined in the previous Section established that legal rules are often misspecified and illustrated the manner in which the problem of governance misspecification typically arises. This Section attempts to show that misspecification, when it occurs, can cause serious problems both for the regulatory regime that the misspecified rule is part of and for society writ large. Three potential consequences of misspecification are discussed and illustrated with historical examples involving the regulation of emerging technologies.

A. Underinclusive Rules Can Create Exploitable Gaps in a Regulatory Regime

When misspecification results in an underinclusive rule, exploitable gaps can arise in a regulatory regime. The Outer Space Treaty of 1967, discussed above, is one example of this phenomenon. Another example, which demonstrates how completely the use of a misspecified proxy term can defeat the effectiveness of a law, is the Audio Home Recording Act of 1992.[ref 53] That statute was designed to regulate home taping, i.e., the creation by consumers of analog or digital copies of musical recordings. The legal status of home taping had been a matter of debate for years, with record companies arguing that it was illegal and taping hardware manufacturers defending its legality.[ref 54] Congress attempted to resolve the debate by creating a safe harbor for home taping that allowed for the creation of any number of analog or digital copies of a piece of music, with the caveat that royalties would have to be paid as part of the purchase price of any equipment used to create digital copies.[ref 55] 

Congress designed the AHRA under the assumption that digital audio tape recorders (DATs) were the wave of the future and would shortly become a ubiquitous home audio appliance.[ref 56] The statute lays out, in painstaking detail, a complex regulatory framework governing “digital audio recording devices,” which the statute defines to require the capability to create reproductions of “digital musical recordings.”[ref 57] Bizarrely, however, the AHRA explicitly provides that the term “digital musical recording” does not encompass recordings stored on any object “in which one or more computer programs are fixed”—i.e., computer hard drives.[ref 58] 

Of course, the DAT did not become a staple of the American household. And when the RIAA sued the manufacturer of the “Rio,” an early mp3 player, for failing to comply with the AHRA’s requirements, the Ninth Circuit found that the device was not subject to the AHRA.[ref 59] Because the Rio was designed solely to download mp3 files from a computer hard drive, it was not capable of copying “digital musical recordings” under the AHRA’s underinclusive definition of that phrase.[ref 60] The court noted that its decision would “effectively eviscerate the Act,” because “[a]ny recording device could evade […] regulation simply by passing the music through a computer and ensuring that the MP3 file resided momentarily on the hard drive,” but nevertheless rejected the creative alternative interpretations suggested by the music industry as contrary to the plain language of the statute.[ref 61] As a result, the AHRA was rendered obsolete less than six years after being enacted.[ref 62] 

Clearly, Congress acted with insufficient epistemic humility by creating legislation confidently designed to address one specific technology that had not, at the time of legislation, been adopted by any significant portion of the population. But this failure of humility manifested as a failure of specification. The purpose of the statute, as articulated in a Senate report, included the introduction of a “serial copy management system that would prohibit the digital serial copying of copyrighted music.”[ref 63] By crafting a law that applied only to “digital audio recording devices” and defining that proxy term in an insufficiently flexible way, Congress completely failed to accomplish those purposes. If the proxy in question had not been defined to exclude any recording acquired through a computer, the Rio and eventually the iPod might well have fallen under the AHRA’s royalty scheme, and music copyright law in the U.S. might have developed down a course more consistent with the ideal specification of the AHRA. 

B. Overinclusive Rules Can Create Pushback and Enforcement Challenges

Misspecification can also create overinclusive rules, like the Computer Fraud and Abuse Act and § 1201(a)(2) of the Digital Millennium Copyright Act, discussed above in Section II. As those examples showed, overinclusive rules may give rise to legal and political challenges, difficulties with enforcement, and other unintended and undesirable results. These effects can, in some cases, be so severe that they require a total repeal of the rule in question.

This was the case with a 2011 Nevada statute authorizing and regulating driverless cars. AB511, which was the first law of its kind enacted in the U.S.,[ref 64] initially defined “autonomous vehicle” to mean “a motor vehicle that uses artificial intelligence, sensors and global positioning system coordinates to drive itself without the active intervention of a human operator,” and further defined “artificial intelligence” to mean “the use of computers and related equipment to enable a machine to duplicate or mimic the behavior of human beings.”[ref 65] 

Shortly after AB511 was enacted, however, several commentators noted that the statute’s definition of “autonomous vehicle” technically included vehicles that incorporated automatic collision avoidance or any of a number of other advanced driver-assistance systems common in new cars in 2011.[ref 66] These systems used computers to temporarily control the operation of a vehicle without the intervention of the human driver, so any vehicle that incorporated them was technically subject to the onerous regulatory scheme that Nevada’s legislature had intended to impose only on fully autonomous vehicles. In order to avoid effectively banning most new model cars, Nevada’s legislature was forced to repeal its new law and enact a replacement that incorporated a more detailed definition of “autonomous vehicle.”[ref 67]

C. Technological Change Can Repeatedly Render a Proxy Metric Obsolete

Finally, a misspecified rule may lose its effectiveness over time as technological advances render it obsolete, necessitating repeated updates and patches to the fraying regulatory regime. Consider, for example, the export controls imposed on high performance computers in the 1990s. The purpose of these controls was to prevent the export of powerful computers to countries where they might be used in ways that threatened U.S. national security, such as to design missiles and nuclear weapons.[ref 68] The government placed restrictions on the export of “supercomputers” and defined “supercomputer” in terms of the number of millions of theoretical operations per second (MTOPS) the computer could perform.[ref 69] In 1991, “supercomputer” was defined to mean any computer capable of exceeding 195 MTOPS.[ref 70] As the 90s progressed, however, the processing power of commercially available computers manufactured outside of the U.S. increased rapidly, reducing the effectiveness of U.S. export controls.[ref 71] Restrictions that prevented U.S. companies from selling their computers globally imposed costs on the U.S. economy and harmed the international competitiveness of the restricted companies.[ref 72] The Clinton administration responded by raising the threshold at which export restrictions began to apply to 1500 MTOPS in 1994, to 7000 MTOPS in 1996, to 12,300 MTOPS in 1999, and three times in the year 2000 to 20,000, 28,000, and finally 85,000 MTOPS.[ref 73] 

In the late 1990s, technological advances made it possible to link large numbers of commercially available computers together into “clusters” which could outperform most supercomputers.[ref 74] At this point, it was clear that MTOPS-based export controls were no longer effective, as computers that exceeded any limit imposed could easily be produced by anyone with access to a supply of less powerful computers which would not be subject to export controls.[ref 75] Even so, MTOPS-based export controls continued in force until 2006, when they were replaced by regulations that imposed controls based on performance in terms of Weighted TeraFLOPS, i.e., trillions of floating point operations per second.[ref 76] 

Thus, while the use of MTOPS thresholds as proxies initially resulted in well-specified export controls that effectively prevented U.S. adversaries from acquiring supercomputers, rapid technological progress repeatedly rendered the controls overinclusive and necessitated a series of amendments and revisions. The end result was a period of nearly seven years during which the existing export controls were badly misspecified due to the use of a proxy metric, MTOPS, which no longer bore any significant relation to the regime’s purpose. During this period, the U.S. export control regime for high performance computers was widely considered to be ineffective and perhaps even counterproductive.[ref 77]

IV. Mitigating Risks from Misspecification

In light of the frequency with which misspecification occurs in the regulation of emerging technology and the potential severity of its consequences, this Section suggests a few techniques for designing legal rules in such a way as to reduce the risk of misspecification and mitigate its ill effects.

The simplest way to avoid misspecification is to eschew or minimize the use of proxy terms and metrics. This is not always practicable or desirable. “No vehicles in the park” is a better rule than “do not unreasonably annoy or endanger the safety of park visitors,” in part because it reduces the cognitive burden of following, enforcing, and interpreting the rule and reduces the risk of decision maker error by limiting the discretion of the parties charged with enforcement and interpretation.[ref 78] Nevertheless, there are successful legal rules that pursue their purposes directly. U.S. antitrust law, for example, grew out of the Sherman Antitrust Act,[ref 79] § 1 of which simply states that any combination or contract in restraint of trade “is declared to be illegal.”

Where use of a proxy is appropriate, it is often worthwhile to identify the fact that a proxy is being used to reduce the likelihood that decision makers will fall victim to Goodhart’s law[ref 80] and treat the regulation of the proxy as an end in itself.[ref 81] Alternatively, the most direct way to avoid confusion regarding the underlying purpose of a rule is to simply include an explanation of the purpose in the text of the rule itself. This can be accomplished through the addition of a purpose clause (sometimes referred to as a legislative preamble or a policy statement). For example, one purpose of the Nuclear Energy Innovation and Modernization Act of 2019 is to “provide… a program to develop the expertise and regulatory processes necessary to allow innovation and the commercialization of advanced nuclear reactors.” 

Purpose clauses can also incorporate language emphasizing that every provision of a rule should be construed in order to effectuate its purpose. This amounts to a legislatively prescribed rule of statutory interpretation, instructing courts to adopt a purposivist interpretive approach.[ref 82] When confronted with an explicit textual command to this effect, even strict textualists are obligated to interpret a rule purposively.[ref 83] The question of whether such an approach is generally desirable is hotly debated,[ref 84] but in the context of AI governance the flexibility that purposivism provides is a key advantage. The ability to flexibly update and adapt a rule in response to changes in the environment in which the rule will apply is unusually important in the regulation of emerging technologies.[ref 85] While there is little empirical evidence for or against the effectiveness of purpose clauses, they have played a key role in the legal reasoning relied on in a number of important court decisions.[ref 86]

A regulatory regime can also require periodic efforts to evaluate whether a rule is achieving its purpose.[ref 87] These efforts can provide an early warning system for misspecification by facilitating awareness of whether the proxy terms or metrics relied upon still correspond well to the purpose of the rule. Existing periodic review requirements are often ineffective,[ref 88] treated by agencies as box-checking activities rather than genuine opportunities for careful retrospective analysis of the effects of regulations.[ref 89] However, many experts continue to recommend well-implemented retrospective review requirements as an effective tool for improving policy decisions.[ref 90] The Administrative Conference of the United States has repeatedly pushed for increased use of retrospective review, as has the internationally focused Organisation for Economic Co-operation and Development (OECD).[ref 91] Additionally, retrospective review of regulations often works well in countries outside of the U.S.[ref 92]

As the examples in Sections II and III demonstrate, rules governing technology tend to become misspecified over time as the regulated technology evolves. The Outer Space Treaty of 1967, § 1201(a)(2) of the DMCA, and the Clinton Administration’s supercomputer export controls were all well-specified and effective when implemented, but each measure became ineffective or counterproductive soon after being implemented because the proxies relied upon became obsolete. Ideally, rulemaking would move at the pace of technological improvement, but there are a number of institutional and structural barriers to this sort of rapid updating of regulations. Notably, the Administrative Procedure Act requires a lengthy “notice and comment” process for rulemaking and a 30-day waiting period after publication of a regulation in the Federal Register before the regulation can go into effect.[ref 93] There are ways to waive or avoid these requirements, including regulating via the issuance of nonbinding guidance documents rather than binding rules,[ref 94] issuing an immediately effective “interim final rule” and then satisfying the APA’s requirements at a later time,[ref 95] waiving the publication or notice and comment requirements for “good cause,”[ref 96] or legislatively imposing regulatory deadlines.[ref 97] Many of these workarounds are limited in their scope or effectiveness, or vulnerable to legal challenges if pursued too ambitiously, but finding some way to update a regulatory regime quickly is critical to mitigating the damage caused by misspecification.[ref 98] 

There is reason to believe that some agencies, recognizing the importance of AI safety to national security, will be willing to rapidly update regulations despite the legal and procedural difficulties. Consider the Commerce Department’s recent response to repeated attempts by semiconductor companies to design chips for the Chinese market that comply with U.S. export control regulations while still providing significant utility to purchasers in China looking to train advanced AI models. After Commerce initially imposed a license requirement on the export of advanced AI-relevant chips to China in October 2022, Nvidia modified its market-leading A100 and H100 chips to comply with the regulations and proceeded to sell the modified A800 and H800 chips in China.[ref 99] On October 17, 2023, the Commerce Department’s Bureau of Industry and Security announced a new interim final rule that would prohibit the sale of A800 and H800 chips in China and waived the normal 30-day waiting period so that the rule became effective less than a week after it was announced.[ref 100] Commerce Secretary Gina Raimondo stated publicly that “[i]f [semiconductor companies] redesign a chip around a particular cut line that enables them to do AI, I’m going to control it the very next day.”[ref 101]

V. The Governance Misspecification Problem and Artificial Intelligence 

While the framework of governance misspecification is applicable to a wide range of policy measures, it is particularly well-suited to describing issues that arise regarding legal rules governing emerging technologies. H.L.A. Hart’s prohibition on “vehicles in the park” could conceivably have been framed by an incautious drafter who did not anticipate that using “vehicle” instead of some more detailed proxy term would create ambiguity. Avoiding this kind of misspecification is simply a matter of careful drafting. Suppose, however, that the rule was formulated at a point in time when “vehicle” was an appropriate proxy for a well-understood category of object, and the rule later became misspecified as new potential vehicles that had not been conceived of when the rule was drafted were introduced. A rule drafted at a historical moment when all vehicles move on either land or water is unlikely to adequately account for the issues created by airplanes or flying drones.[ref 102] 

In other words, rules created to govern emerging technologies are especially prone to misspecification because they are created in the face of a high degree of uncertainty regarding the nature of the subject matter to be regulated, and rulemaking under uncertainty is difficult.[ref 103] Furthermore, as the case studies discussed in Sections II and III show, the nature of this difficulty is such that it tends to result in misspecification. For instance, misspecification will usually result when an overconfident rulemaker makes a specific and incorrect prediction about the future and issues an underinclusive rule based on that prediction. This was the case when Congress directed the AHRA exclusively at digital audio tape recorders and ignored computers. Rules created by rulemakers who want to regulate a certain technology but have only a vague and uncertain understanding of the purpose they are pursuing are also likely to be misspecified.[ref 104] Hence the CFAA, which essentially prohibited “doing bad things with a computer,” with disastrous results.

The uncertainties associated with emerging technologies, and the accompanying risk of misspecification, increase when the regulated technology is poorly understood. Rulemakers may simply overlook something about the chosen proxy due to a lack of understanding of the proxy or the underlying technology, or due to a lack of experience drafting the kinds of regulations required. The first-of-its-kind Nevada law intended to regulate fully autonomous vehicles that accidentally regulated a broad range of features common in many new cars is an example of this phenomenon. So is the DMCA provision that was intended to regulate “black box” devices but, by its terms, also applied to raw computer code.

If the difficulty of making well-specified rules to govern emerging technologies increases when the technology is fast-developing and poorly understood, advanced AI systems are something of a perfect storm for misspecification problems. Cutting-edge deep learning AI systems differ from other emerging technologies in that their workings are poorly understood, not just by legislators and the public, but by their creators.[ref 105] Their capabilities are an emergent property of the interaction between their architecture and the vast datasets on which they are trained. Moreover, the opacity of these models is arguably different in kind from the unsolved problems associated with past technological breakthroughs, because the models may be fundamentally uninterpretable rather than merely difficult to understand.[ref 106] Under these circumstances, defining an ideal specification in very general terms may be simple enough, but designing legal rules to operationalize any such specification will require extensive reliance on rough proxies. This is fertile ground for misspecification.

There are a few key proxy terms that recur often in existing AI governance proposals and regulations. For example, a number of policy proposals have suggested that regulations should focus on “frontier” AI models.[ref 107] When Google, Anthropic, OpenAI, and Microsoft created an industry-led initiative to promote AI safety, they named it the Frontier Model Forum.[ref 108] Sam Altman, the CEO of OpenAI, has expressed support for regulating “frontier systems.”[ref 109] The government of the U.K. has established a “Frontier AI Taskforce” dedicated to evaluating risks “at the frontier of AI.”[ref 110] 

In each of these proposals, the word “frontier” is a proxy term that stands for something like “highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety.”[ref 111] Any legislation or regulation that relied on the term “frontier” would also likely include a statutory definition of the word,[ref 112] but as several of the historical examples discussed in Sections II and III showed, statutory definitions can themselves incorporate proxies that result in misspecification. The above definition, for instance, may be underinclusive because some models that cannot be classified as “highly capable” or as “foundation models” might also pose severe risks to public safety.   

The most significant AI-related policy measure that has been issued in the U.S. to date is Executive Order (EO) 14110 on the “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.”[ref 113] Among many other provisions, the EO imposes reporting requirements on certain AI models and directs the Department of Commerce to define the category of models to which the reporting requirements will apply.[ref 114] Prior to the issuance of Commerce’s definition, the EO provides that the reporting requirements apply to models “trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations,” as well as certain computing clusters.[ref 115] In other words, the EO uses operations as a proxy metric for determining which AI systems are sufficiently capable and/or dangerous that they should be regulated. This kind of metric, which is based on the amount of computing power used to train a model, is known as a “compute threshold” in the AI governance literature.[ref 116]

A proxy metric such as an operations-based compute threshold is almost certainly necessary to the operationalization of the EO’s regulatory scheme for governing frontier models.[ref 117] Even so, the example of the U.S. government’s ultimately ineffective and possibly counterproductive attempts to regulate exports of high performance computers using MTOPS is a cautionary tale about how quickly a compute-based proxy can be rendered obsolete by technological progress. The price of computing resources has, historically, fallen rapidly, with the amount of compute available for a given sum of money doubling approximately every two years as predicted by Moore’s Law.[ref 118] Additionally, because of improvements in algorithmic efficiency, the amount of compute required to train a model to a given level of performance has historically decreased over time as well.[ref 119] Because of these two factors, the cost of training AI models to a given level of capability has fallen precipitously over time; for instance, between 2017 and 2021, the cost of training a rudimentary model to classify images correctly with 93% accuracy on the image database ImageNet fell from $1000 to $5.[ref 120] This phenomenon presents a dilemma for regulators: the cost of acquiring computational resources exceeding a given threshold will generally decrease over time even as the capabilities of models trained on a below-threshold amount of compute rise. In other words, any well-specified legal rule that uses a compute threshold is likely to be rendered both overinclusive and underinclusive soon after being implemented.
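To illustrate the dilemma, the following sketch projects how the cost of a threshold-scale training run might fall if compute prices halve every two years, in line with the Moore’s Law trend noted above. The starting cost and the exact halving period are illustrative assumptions, not empirical estimates.

```python
# Purely illustrative sketch of how a fixed compute threshold erodes over time,
# assuming compute prices halve every two years. The starting cost is hypothetical.

START_COST_USD = 300_000_000   # assumed cost today of a training run at the threshold
HALVING_PERIOD_YEARS = 2.0     # assumed price-performance doubling time

def projected_cost(years_from_now: float) -> float:
    """Cost of the same threshold-scale training run after the given number of years."""
    return START_COST_USD * 0.5 ** (years_from_now / HALVING_PERIOD_YEARS)

for years in (0, 2, 4, 6, 8, 10):
    print(f"Year +{years:>2}: ~${projected_cost(years):,.0f}")
# After a decade the same run costs about 1/32 of today's price, so a static
# threshold captures a progressively broader set of developers and models.
```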

Export controls intended to prevent the proliferation of the advanced chips used to train frontier AI models face a similar problem. Like the Clinton Administration’s supercomputer export controls, the Biden administration’s export controls on chips like the Nvidia A800 and H800 are likely to become misspecified over time. As algorithmic efficiency increases and powerful chips become cheaper and easier to acquire, existing semiconductor export controls will gradually become both overinclusive (because they pointlessly prohibit the export of chips that are already freely available overseas) and underinclusive (because powerful AI models can be trained using chips not covered by the export controls). 

The question of precisely how society should respond to these developments over time is beyond the scope of this paper. However, to delay the onset of misspecification and mitigate its effects, policymakers setting legal rules for AI governance should consider the recommendations outlined in Section IV, above. Accordingly, the specifications for export controls on semiconductors—proxies for something like “chips that can be used to create dangerously powerful AI models”—should be updated quickly and frequently as needed, to prevent them from becoming ineffective or counterproductive. The Bureau of Industry and Security has already shown some willingness to pursue this kind of frequent, flexible updating.[ref 121] More generally, given the particular salience of the governance misspecification problem to AI governance, legislators should consider mandating frequent review of the effectiveness of important AI regulations and empowering administrative agencies to update regulations rapidly as necessary. Rules setting compute thresholds that are likely to be the subject of litigation should incorporate clear purpose statements articulating the underlying purpose behind the use of a compute threshold as a proxy, and should be interpreted consistently with those statements. And where it is possible to eschew the use of proxies without compromising the enforceability or effectiveness of a rule, legislators and regulators should consider doing so.

VI. Conclusion

This article has attempted to elucidate a newly developed concept in governance: the problem of governance misspecification. In presenting this concept along with empirical insights from representative case studies, we hope to inform contemporary debates around AI governance by demonstrating one common and impactful way in which legal rules can fail to effect their purposes. By framing this problem in terms of “misspecification,” a concept borrowed from the technical AI safety literature, this article aims both to introduce valuable insights from that field to scholars of legal philosophy and public policy and to introduce technical researchers to some of the more practically salient legal-philosophical and governance-related challenges involved in AI legislation and regulation. Additionally, we have offered a few specific suggestions for avoiding or mitigating the harms of misspecification in the AI governance context: eschewing proxy terms or metrics where feasible, including clear statements of statutory purpose, and providing for flexibly applied, rapidly updated, and periodically reviewed regulations.

A great deal of conceptual and empirical work remains to be done regarding the nature and effects of the governance misspecification problem and best practices for avoiding and responding to it. For instance, this article does not contain any in-depth comparison of the incidence and seriousness of misspecification outside of the context of rules governing emerging technologies. Additionally, empirical research analyzing whether and how purpose clauses and similar provisions can effectively further the purposes of legal rules would be of significant practical value.

Legal considerations for defining “frontier model”

I. Introduction

One of the few concrete proposals on which AI governance stakeholders in industry[ref 1] and government[ref 2] have mostly[ref 3] been able to agree is that AI legislation and regulation should recognize a distinct category consisting of the most advanced AI systems. The executive branch of the U.S. federal government refers to these systems, in Executive Order 14110 and related regulations, as “dual-use foundation models.”[ref 4] The European Union’s AI Act refers to a similar class of models as “general-purpose AI models with systemic risk.”[ref 5] And many researchers, as well as leading AI labs and some legislators, use the term “frontier models” or some variation thereon.[ref 6] 

These phrases are not synonymous, but they are all attempts to address the same issue—namely that the most advanced AI systems present additional regulatory challenges distinct from those posed by less sophisticated models. Frontier models are expected to be highly capable across a broad variety of tasks and are also expected to have applications and capabilities that are not readily predictable prior to development, nor even immediately known or knowable after development.[ref 7] It is likely that not all of these applications will be socially desirable; some may even create significant risks for users or for the general public. 

The question of precisely how frontier models should be regulated is contentious and beyond the scope of this paper. But any law or regulation that distinguishes between “frontier models” (or “dual-use foundation models,” or “general-purpose AI models with systemic risk”) and other AI systems will first need to define the chosen term. A legal rule that applies to a certain category of product cannot be effectively enforced or complied with unless there is some way to determine whether a given product falls within the regulated category. Laws that fail to carefully define ambiguous technical terms often fail in their intended purposes, sometimes with disastrous results.[ref 8] Because the precise meaning of the phrase “frontier model” is not self-evident,[ref 9] the scope of a law or regulation that targeted frontier models without defining that term would be unacceptably uncertain. This uncertainty would impose unnecessary costs on regulated companies (who might overcomply out of an excess of caution or unintentionally undercomply and be punished for it) and on the public (from, e.g., decreased compliance, increased enforcement costs, less risk protection, and more litigation over the scope of the rule).

 The task of defining “frontier model” implicates both legal and policy considerations. This paper provides a brief overview of some of the most relevant legal considerations for the benefit of researchers, policymakers, and anyone else with an interest in the topic. 

II. Statutory and Regulatory Definitions

Two related types of legal definition—statutory and regulatory—are relevant to the task of defining “frontier model.” A statutory definition is a definition that appears in a statute enacted by a legislative body such as the U.S. Congress or one of the 50 state legislatures. A regulatory definition, on the other hand, appears in a regulation promulgated by a government agency such as the U.S. Department of Commerce or the California Department of Technology (or, less commonly, in an executive order).

Regulatory definitions have both advantages and disadvantages relative to statutory definitions. Legislation is generally a more difficult and resource-intensive process than agency rulemaking, with additional veto points and failure modes.[ref 10] Agencies are therefore capable of putting into effect more numerous and detailed legal rules than Congress can,[ref 11] and can update those rules more quickly and easily than Congress can amend laws.[ref 12] Additionally, executive agencies are often more capable of acquiring deep subject-matter expertise in highly specific fields than are congressional offices due to Congress’s varied responsibilities and resource constraints.[ref 13] This means that regulatory definitions can benefit from agency subject-matter expertise to a greater extent than can statutory definitions, and can also be updated far more easily and often.

The immense procedural and political costs associated with enacting a statute do, however, purchase a greater degree of democratic legitimacy and legal resiliency than a comparable regulation would enjoy. A number of legal challenges that might persuade a court to invalidate a regulatory definition would not be available for the purpose of challenging a statute.[ref 14] And since the rulemaking power exercised by regulatory agencies is generally delegated to them by Congress, most regulations must be authorized by an existing statute. A regulatory definition generally cannot eliminate or override a statutory definition[ref 15] but can clarify or interpret it. Often, a regulatory regime will include both a statutory definition and a more detailed regulatory definition for the same term.[ref 16] This can allow Congress to choose the best of both worlds, establishing a threshold definition with the legitimacy and clarity of an act of Congress while empowering an agency to issue and subsequently update a more specific and technically informed regulatory definition.

III. Existing Definitions

This section discusses five noteworthy attempts to define phrases analogous to “frontier model” from three different existing measures. Executive Order 14110 (“EO 14110”), which President Biden issued in October 2023, includes two complementary definitions of the term “dual-use foundation model.” Two definitions of “covered model” from different versions of the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act, a California bill that was recently vetoed by Governor Newsom, are also discussed, along with the EU AI Act’s definition of “general-purpose AI model with systemic risk.”

A. Executive Order 14110

EO 14110 defines “dual-use foundation model” as:

an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits, or could be easily modified to exhibit, high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters, such as by:

(i) substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons;

(ii) enabling powerful offensive cyber operations through automated vulnerability discovery and exploitation against a wide range of potential targets of cyber attacks; or

(iii) permitting the evasion of human control or oversight through means of deception or obfuscation.

Models meet this definition even if they are provided to end users with technical safeguards that attempt to prevent users from taking advantage of the relevant unsafe capabilities.[ref 17]

The executive order imposes certain reporting requirements on companies “developing or demonstrating an intent to develop” dual-use foundation models,[ref 18] and for purposes of these requirements it instructs the Department of Commerce to “define, and thereafter update as needed on a regular basis, the set of technical conditions for models and computing clusters that would be subject to the reporting requirements.”[ref 19] In other words, EO 14110 contains both a high-level quasi-statutory[ref 20] definition and a directive to an agency to promulgate a more detailed regulatory definition. The EO also provides a second definition that acts as a placeholder until the agency’s regulatory definition is promulgated:

any model that was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations[ref 21]

Unlike the first definition, which relies on subjective evaluations of model characteristics,[ref 22] this placeholder definition provides a simple set of objective technical criteria that labs can consult to determine whether the reporting requirements apply. For general-purpose models, the sole test is whether the model was trained on computing power greater than 10^26 integer or floating-point operations (FLOP); only models that exceed this compute threshold[ref 23] are deemed “dual-use foundation models” for purposes of the reporting requirements mandated by EO 14110.
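Because the placeholder test turns entirely on two stated numbers, it can be expressed as a short decision rule. The sketch below is purely illustrative; the function name and interface are hypothetical, while the thresholds are those quoted from the EO above.

```python
# Minimal sketch of the EO 14110 placeholder test quoted above.
# The function name and interface are hypothetical; the thresholds are the EO's.

GENERAL_THRESHOLD_OP = 1e26   # integer or floating-point operations
BIO_THRESHOLD_OP = 1e23       # models trained primarily on biological sequence data

def reporting_required(training_ops: float, primarily_bio_sequence_data: bool) -> bool:
    """Return True if a model falls under the placeholder reporting definition."""
    if primarily_bio_sequence_data:
        return training_ops > BIO_THRESHOLD_OP
    return training_ops > GENERAL_THRESHOLD_OP

# Example: a general-purpose model trained on 5e25 OP falls below the threshold.
print(reporting_required(5e25, primarily_bio_sequence_data=False))  # False
```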

B. California’s “Safe and Secure Innovation for Frontier Artificial Intelligence Models Act” (SB 1047)

California’s recently vetoed “Safe and Secure Innovation for Frontier Artificial Intelligence Models Act” (“SB 1047”) focused on a category that it referred to as “covered models.”[ref 24] The version of SB 1047 passed by the California Senate in May 2024 defined “covered model” to include models meeting either of the following criteria:

(1) The artificial intelligence model was trained using a quantity of computing power greater than 10^26 integer or floating-point operations.

(2) The artificial intelligence model was trained using a quantity of computing power sufficiently large that it could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10^26 integer or floating-point operations in 2024 as assessed using benchmarks commonly used to quantify the general performance of state-of-the-art foundation models.[ref 25]

This definition resembles the placeholder definition in EO 14110 in that it primarily consists of a training compute threshold of 10^26 FLOP. However, SB 1047 added an alternative capabilities-based threshold to capture future models which “could reasonably be expected” to be as capable as models trained on 10^26 FLOP in 2024. This addition was intended to “future-proof”[ref 26] SB 1047 by addressing one of the main disadvantages of training compute thresholds—their tendency to become obsolete over time as advances in algorithmic efficiency produce highly capable models trained on relatively small amounts of compute.[ref 27]

Following pushback from stakeholders who argued that SB 1047 would stifle innovation,[ref 28] the bill was amended repeatedly in the California State Assembly. The final version defined “covered model” in the following way:

(A) Before January 1, 2027, “covered model” means either of the following:

(i) An artificial intelligence model trained using a quantity of computing power greater than 10^26 integer or floating-point operations, the cost of which exceeds one hundred million dollars[ref 29] ($100,000,000) when calculated using the average market prices of cloud compute at the start of training as reasonably assessed by the developer.

(ii) An artificial intelligence model created by fine-tuning a covered model using a quantity of computing power equal to or greater than three times 10^25 integer or floating-point operations, the cost of which, as reasonably assessed by the developer, exceeds ten million dollars ($10,000,000) if calculated using the average market price of cloud compute at the start of fine-tuning.

(B) (i) Except as provided in clause (ii), on and after January 1, 2027, “covered model” means any of the following:

(I) An artificial intelligence model trained using a quantity of computing power determined by the Government Operations Agency pursuant to Section 11547.6 of the Government Code, the cost of which exceeds one hundred million dollars ($100,000,000) when calculated using the average market price of cloud compute at the start of training as reasonably assessed by the developer.

(II) An artificial intelligence model created by fine-tuning a covered model using a quantity of computing power that exceeds a threshold determined by the Government Operations Agency, the cost of which, as reasonably assessed by the developer, exceeds ten million dollars ($10,000,000) if calculated using the average market price of cloud compute at the start of fine-tuning.

(ii) If the Government Operations Agency does not adopt a regulation governing subclauses (I) and (II) of clause (i) before January 1, 2027, the definition of “covered model” in subparagraph (A) shall be operative until the regulation is adopted.

This new definition was more complex than its predecessor. Subsection (A) introduced an initial definition slated to apply until at least 2027, which relied on a training compute threshold of 10^26 FLOP paired with a training cost floor of $100,000,000.[ref 30] Subsection (B), in turn, provided for the eventual replacement of the training compute thresholds used in the initial definition with new thresholds to be determined (and presumably updated) by a regulatory agency.

The most significant change in the final version of SB 1047’s definition was the replacement of the capability threshold with a $100,000,000 cost threshold. Because it would currently cost more than $100,000,000 to train a model using >10^26 FLOP, the addition of the cost threshold did not change the scope of the definition in the short term. However, the cost of compute has historically fallen precipitously over time in accordance with Moore’s Law.[ref 31] This may mean that models trained using significantly more than 10^26 FLOP will cost significantly less than the inflation-adjusted equivalent of 100 million 2024 dollars to create at some point in the future.
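A rough calculation illustrates the timescale. The assumed $100,000,000 starting cost, the $10,000,000 reference point, and the two-year halving time below are illustrative assumptions, not figures from the bill or its legislative record.

```python
# Rough illustration only: assumes a >10^26 FLOP run costs $100M today and that
# the price of compute halves every two years (both figures are assumptions).
import math

start_cost = 100e6          # assumed current cost of the run, in dollars
target_cost = 10e6          # an arbitrary reference point for "significantly less"
halving_time_years = 2.0    # assumed price-performance halving time

halvings_needed = math.log2(start_cost / target_cost)
print(round(halvings_needed * halving_time_years, 1))  # ≈ 6.6 years
```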

The old capability threshold expanded the definition of “covered model” because it was an alternative to the compute threshold—models that exceeded either of the two thresholds would have been “covered.” The newer cost threshold, on the other hand, restricted the scope of the definition because it was linked conjunctively to the compute threshold, meaning that only models that exceeded both thresholds were covered. In other words, where the May 2024 definition of “covered model” future-proofed itself against the risk of becoming underinclusive by including highly capable low-compute models, the final definition instead guarded against the risk of becoming overinclusive by excluding low-cost models trained on large amounts of compute. Furthermore, the final cost threshold was baked into the bill text and could only have been changed by passing a new statute—unlike the compute threshold, which could have been specified and updated by a regulator.
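The structural difference between the two drafts can be made concrete with a short sketch. The interfaces below are hypothetical, and the capability-equivalence test is reduced to a boolean input because the bill left its measurement to benchmarks; the numbers come from the quoted bill text.

```python
# Illustrative sketch of the two SB 1047 coverage tests described above.

def covered_may_2024(training_flop: float, performs_like_2024_threshold_model: bool) -> bool:
    # Disjunctive: the compute threshold OR the capability-equivalence test.
    return training_flop > 1e26 or performs_like_2024_threshold_model

def covered_final(training_flop: float, training_cost_usd: float) -> bool:
    # Conjunctive: the compute threshold AND the $100,000,000 cost floor.
    return training_flop > 1e26 and training_cost_usd > 100_000_000

# A hypothetical future run using >10^26 FLOP at low cost: covered under the
# May draft, but not under the final text.
print(covered_may_2024(2e26, performs_like_2024_threshold_model=True))  # True
print(covered_final(2e26, training_cost_usd=20_000_000))                # False
```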

Compared with the overall definitional scheme in EO 14110, SB 1047’s definition was simpler, easier to operationalize, and less flexible. SB 1047 lacked a broad, high-level risk-based definition like the first definition in EO 14110. SB 1047 did resemble EO 14110 in its use of a “placeholder” definition, but where EO 14110 confers broad discretion on the regulator to choose the “set of technical conditions” that will comprise the regulatory definition, SB 1047 only authorized the regulator to set and adjust the numerical value of the compute thresholds in an otherwise rigid statutory definition.

C. EU Artificial Intelligence Act

The EU AI Act classifies AI systems according to the risks they pose. It prohibits systems that do certain things, such as exploiting the vulnerabilities of elderly or disabled people,[ref 32] and regulates but does not ban so-called “high-risk” systems.[ref 33] While this classification system does not map neatly onto U.S. regulatory efforts, the EU AI Act does include a category conceptually similar to the EO’s “dual-use foundation model”: the “general-purpose AI model with systemic risk.”[ref 34] The statutory definition for this category includes a given general-purpose model[ref 35] if:

a. it has high impact capabilities[ref 36] evaluated on the basis of appropriate technical tools and methodologies, including indicators and benchmarks; [or]

b. based on a decision of the Commission,[ref 37] ex officio or following a qualified alert from the scientific panel, it has capabilities or an impact equivalent to those set out in point (a) having regard to the criteria set out in Annex XIII.

Additionally, models are presumed to have “high impact capabilities” if they were trained on >10^25 FLOP.[ref 38] The seven “criteria set out in Annex XIII” to be considered in evaluating model capabilities include a variety of technical inputs (such as the model’s number of parameters and the size or quality of the dataset used in training the model), the model’s performance on benchmarks and other capabilities evaluations, and other considerations such as the number of users the model has.[ref 39] When necessary, the European Commission is authorized to amend the compute threshold and “supplement benchmarks and indicators” in response to technological developments, such as “algorithmic improvements or increased hardware efficiency.”[ref 40]

The EU Act definition resembles the initial, broad definition in the EO in that both take diverse factors into account, such as the size and quality of the dataset used to train the model, the number of parameters, and the model’s capabilities. However, the EU Act definition is likely much broader than either EO definition. The training compute threshold in the EU Act is sufficient, but not necessary, to classify models as systemically risky, whereas the (much higher) threshold in the EO’s placeholder definition is both necessary and sufficient. And the first EO definition includes only models that exhibit a high level of performance on tasks that pose serious risks to national security, while the EU Act includes all general-purpose models with “high impact capabilities,” which it defines as including any model trained on more than 10^25 FLOP.
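A minimal sketch of this scope comparison appears below. The function names are hypothetical, the EU function captures only the 10^25 FLOP presumption and the Commission-designation route described above, and the EO function captures only the placeholder compute test.

```python
# Hypothetical sketch of the scope comparison described above.

def eu_presumed_systemic_risk(training_flop: float, commission_designated: bool) -> bool:
    # Compute above 10^25 FLOP is sufficient, but not necessary, for coverage.
    return training_flop > 1e25 or commission_designated

def eo_placeholder_covered(training_ops: float) -> bool:
    # Compute above 10^26 OP is both necessary and sufficient for coverage.
    return training_ops > 1e26

model_flop = 5e25  # a hypothetical model sitting between the two thresholds
print(eu_presumed_systemic_risk(model_flop, commission_designated=False))  # True
print(eo_placeholder_covered(model_flop))                                  # False
```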

The EU Act definition resembles the final SB 1047 definition of “covered model” in that both definitions authorize a regulator to update their thresholds in response to changing circumstances. It also resembles SB 1047’s May 2024 definition in that both definitions incorporate a training compute threshold and a capabilities-based element.

IV. Elements of Existing Definitions

As the examples discussed above demonstrate, legal definitions of “frontier model” can consist of one or more of a number of criteria. This section discusses a few of the most promising definitional elements.

A. Technical inputs and characteristics

A definition may classify AI models according to their technical characteristics or the technical inputs used in training the model, such as training compute, parameter count, and dataset size and type. These elements can be used in either statutory or regulatory definitions.

Training compute thresholds are a particularly attractive option for policymakers,[ref 41] as evidenced by the three examples discussed above. “Training compute” refers to the computational power used to train a model, often  measured in integer or floating-point operations (OP or FLOP).[ref 42] Training compute thresholds function as a useful proxy for model capabilities because capabilities tend to increase as computational resources used to train the model increase.[ref 43] 

One advantage of using a compute threshold is that training compute is a straightforward metric that is quantifiable and can be readily measured, monitored, and verified.[ref 44] Because of these characteristics, determining with high certainty whether a given model exceeds a compute threshold is relatively easy. This, in turn, facilitates enforcement of and compliance with regulations that rely on a compute-based definition. Since the amount of training compute (and other technical inputs) can be estimated prior to the training run,[ref 45] developers can predict whether a model will be covered earlier in development. 
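For instance, developers commonly approximate training compute in advance using a rule of thumb drawn from the broader scaling-law literature rather than from this paper: for a dense transformer, training compute is roughly six times the number of parameters times the number of training tokens. The sketch below applies that approximation to a hypothetical model; the specific model size and token count are invented for illustration.

```python
# A commonly cited approximation from the scaling-law literature (an assumption
# here, not a figure from this paper): training compute for a dense transformer
# is roughly 6 * parameters * training tokens. This lets a developer check a
# compute threshold before a training run begins.

def estimated_training_flop(n_parameters: float, n_training_tokens: float) -> float:
    return 6.0 * n_parameters * n_training_tokens

# Example: a hypothetical 400-billion-parameter model trained on 40 trillion tokens.
flop = estimated_training_flop(4e11, 4e13)
print(f"{flop:.2e}", flop > 1e26)  # 9.60e+25 False — just under a 10^26 threshold
```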

One disadvantage of a compute-based definition is that compute thresholds are a proxy for model capabilities, which are in turn a proxy for risk. Definitions that make use of multiple nested layers of proxy terms in this manner are particularly prone to becoming untethered from their original purpose.[ref 46] This can be caused, for example, by the operation of Goodhart’s Law, which suggests that “when a measure becomes a target, it ceases to be a good measure.”[ref 47] Particularly problematic, especially for statutory definitions that are more difficult to update, is the possibility that a compute threshold may become underinclusive over time as improvements in algorithmic efficiency allow for the development of highly capable models trained on below-threshold levels of compute.[ref 48] This possibility is one reason why SB 1047 and the EU AI Act both supplement their compute thresholds with alternative, capabilities-based elements.

In addition to training compute, two other model characteristics correlated with capabilities are the number of model parameters[ref 49] and the size of the dataset on which the model was trained.[ref 50] Either or both of these characteristics can be used as an element of a definition. A definition can also rely on training data characteristics other than size, such as the quality or type of the data used; the placeholder definition in EO 14110, for example, contains a lower compute threshold for models “trained… using primarily biological sequence data.”[ref 51] EO 14110 requires a dual-use foundation model to contain “at least tens of billions of parameters,”[ref 52] and the “number of parameters of the model” is one of the criteria to be considered under the EU AI Act.[ref 53] EO 14110 specified that only models “trained on broad data” could be dual-use foundation models,[ref 54] and the EU AI Act includes “the quality or size of the data set, for example measured through tokens” as one criterion for determining whether an AI model poses systemic risks.[ref 55]

Dataset size and parameter count share many of the pros and cons of training compute. Like training compute, they are objective metrics that can be measured and verified, and they serve as proxies for model capabilities.[ref 56] Training compute is often considered the best and most reliable proxy of the three, in part because it is the most closely correlated with performance and is difficult to manipulate.[ref 57] However, partially redundant backup metrics can still be useful.[ref 58] Dataset characteristics other than size are typically less quantifiable and harder to measure but are also capable of capturing information that the quantifiable metrics cannot. 

B. Capabilities 

Frontier models can also be defined in terms of their capabilities. A capabilities-based definition element typically sets a threshold level of competence that a model must achieve to be considered “frontier,” either in one or more specific domains or across a broad range of domains. A capabilities-based definition can provide specific, objective criteria for measuring a model’s capabilities,[ref 59] or it can describe the capabilities required in more general terms and leave the task of evaluation to the discretion of future interpreters.[ref 60] The former approach might be better suited to a regulatory definition, especially if the criteria used will have to be updated frequently, whereas the latter approach would be more typical of a high-level statutory definition.

Basing a definition on capabilities, rather than relying on a proxy for capabilities like training compute, eliminates the risk that the chosen proxy will cease to be a good measure of capabilities over time. Therefore, a capabilities-based definition is more likely than, e.g., a compute threshold to remain robust over time in the face of improvements in algorithmic efficiency. This was the point of the May 2024 version of SB 1047’s use of a capabilities element tethered to a compute threshold (“similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 10^26 integer or floating-point operations in 2024”)—it was an attempt to capture some of the benefits of an input-based definition while also guarding against the possibility that models trained on less than 10^26 FLOP may become far more capable in the future than they are in 2024.

However,  capabilities are far more difficult than compute to accurately measure. Whether a model has demonstrated “high levels of performance at tasks that pose a serious risk to security” under the EO’s broad capabilities-based definition is not something that can be determined objectively and to a high degree of certainty like the size of a dataset in tokens or the total FLOP used in a training run. Model capabilities are often measured using benchmarks (standardized sets of tasks or questions),[ref 61] but creating benchmarks that accurately measure the complex and diverse capabilities of general-purpose foundation models[ref 62] is notoriously difficult.[ref 63] 

Additionally, model capabilities (unlike the technical inputs discussed above) are generally not measurable until after the model has been trained.[ref 64] This makes it difficult to regulate the development of frontier models using capabilities-based definitions, although post-development, pre-release regulation is still possible.

C. Risk

Some researchers have suggested the possibility of defining frontier AI systems on the basis of the risks they pose to users or to public safety instead of or in addition to relying on a proxy metric, like capabilities, or a proxy for a proxy, such as compute.[ref 65] The principal advantage of this direct approach is that it can, in theory, allow for better-targeted regulations—for instance, by allowing a definition to exclude highly capable but demonstrably low-risk models. The principal disadvantage is that measuring risk is even more difficult than measuring capabilities.[ref 66] The science of designing rigorous safety evaluations for foundation models is still in its infancy.[ref 67] 

Of the three real-world measures discussed in Section III, only EO 14110 mentions risk directly. The broad initial definition of “dual-use foundation model” includes models that exhibit “high levels of performance at tasks that pose a serious risk to security,” such as “enabling powerful offensive cyber operations through automated vulnerability discovery” or making it easier for non-experts to design chemical weapons. This is a capability threshold combined with a risk threshold; the tasks at which a dual-use foundation model must be highly capable are those that pose a “serious risk” to security, national economic security, and/or national public health or safety. As EO 14110 shows, risk-based definition elements can specify the type of risk that a frontier model must create instead of addressing the severity of the risks created. 

D. Epistemic elements

One of the primary justifications for recognizing a category of “frontier models” is the likelihood that broadly capable AI models that are more advanced than previous generations of models will have capabilities and applications that are not readily predictable ex ante.[ref 68] As the word “frontier” implies, lawmakers and regulators focusing on frontier models are interested in targeting models that break new ground and push into the unknown.[ref 69] This was, at least in part, the reason for the inclusion of training compute thresholds of 10^26 FLOP in EO 14110 and SB 1047—since the most capable current models were trained on 5×10^25 or fewer FLOP,[ref 70] a model trained on 10^26 FLOP would represent a significant step forward into uncharted territory.

While it is possible to target models that advance the state of the art by setting and adjusting capability or compute thresholds, a more direct alternative approach would be to include an epistemic element in a statutory definition of “frontier model.” An epistemic element would distinguish between “known” and “unknown” models, i.e., between well-understood models that pose only known risks and poorly understood models that may pose unfamiliar and unpredictable risks.[ref 71] 

This kind of distinction between known and unknown risks has a long history in U.S. regulation.[ref 72] For instance, the Toxic Substances Control Act (TSCA) prohibits the manufacturing of any “new chemical substance” without a license.[ref 73] The EPA keeps and regularly updates a list of chemical substances which are or have been manufactured in the U.S., and any substance not included on this list is “new” by definition.[ref 74] In other words, the TSCA distinguishes between chemicals (including potentially dangerous chemicals) that are familiar to regulators and unfamiliar chemicals that pose unknown risks. 

One advantage of an epistemic element is that it allows a regulator to address “unknown unknowns” separately from better-understood risks that can be evaluated and mitigated more precisely.[ref 75] Additionally, the scope of an epistemic definition, unlike that of most input- and capability-based definitions, would change over time as regulators became familiar with the capabilities of and risks posed by new models.[ref 76] Models would drop out of the “frontier” category once regulators became sufficiently familiar with their capabilities and risks.[ref 77] Like a capabilities- or risk-based definition, however, an epistemic definition might be difficult to operationalize.[ref 78] To determine whether a given model was “frontier” under an epistemic definition, it would probably be necessary to either rely on a proxy for unknown capabilities or authorize a regulator to categorize eligible models according to a specified process.[ref 79] 

E. Deployment context

The context in which an AI system is deployed can serve as an element in a definition. The EU AI Act, for example, takes the number of registered end users and the number of registered EU business users a model has into account as factors to be considered in determining whether a model is a “general-purpose AI model with systemic risk.”[ref 80] Deployment context typically does not in and of itself provide enough information about the risks posed by a model to function as a stand-alone definitional element, but it can be a useful proxy for the kind of risk posed by a given model. Some models may cause harms in proportion to their number of users, and the justification for aggressively regulating these models grows stronger the more users they have.  A model that will only be used by government agencies, or by the military, creates a different set of risks than a model that is made available to the general public.

V. Updating Regulatory Definitions

A recurring theme in the scholarly literature on the regulation of emerging technologies is the importance of regulatory flexibility.[ref 81] Because of the rapid pace of technological progress, legal rules designed to govern emerging technologies like AI tend to quickly become outdated and ineffective if they cannot be rapidly and frequently updated in response to changing circumstances.[ref 82] For this reason, it may be desirable to authorize an executive agency to promulgate and update a regulatory definition of “frontier model,” since regulatory definitions can typically be updated more frequently and more easily than statutory definitions under U.S. law.[ref 83]

Historically, failing to quickly update regulatory definitions in the context of emerging technologies has often led to the definitions becoming obsolete or counterproductive. For example, U.S. export controls on supercomputers in the 1990s and early 2000s defined “supercomputer” in terms of the number of millions of theoretical operations per second (MTOPS) the computer could perform.[ref 84] Rapid advances in the processing power of commercially available computers soon rendered the initial definition obsolete, however, and the Clinton administration was forced to revise the MTOPS threshold repeatedly to avoid harming the competitiveness of the American computer industry.[ref 85] Eventually, the MTOPS metric itself was rendered obsolete, leading to a period of several years in which supercomputer export controls were ineffective at best.[ref 86]

There are a number of legal considerations that may prevent an agency from quickly updating a regulatory definition and a number of measures that can be taken to streamline the process. One important aspect of the rulemaking process is the Administrative Procedure Act’s “notice and comment” requirement.[ref 87] In order to satisfy this requirement, agencies are generally obligated to publish notice of any proposed amendment to an existing regulation in the Federal Register, allow time for the public to comment on the proposal, respond to public comments, publish a final version of the new rule, and then allow at least 30–60 days before the rule goes into effect.[ref 88] From the beginning of the notice-and-comment process to the publication of a final rule, this process can take anywhere from several months to several years.[ref 89] However, an agency can waive the 30–60 day publication period or even the entire notice-and-comment requirement for “good cause” if observing the standard procedures would be “impracticable, unnecessary, or contrary to the public interest.”[ref 90] Of course, the notice-and-comment process has benefits as well as costs; public input can be substantively valuable and informative for agencies, and also increases the democratic accountability of agencies and the transparency of the rulemaking process. In certain circumstances, however, the costs of delay can outweigh the benefits. U.S. agencies have occasionally demonstrated a willingness to waive procedural rulemaking requirements in order to respond to emergency AI-related developments. The Bureau of Industry and Security (“BIS”), for example, waived the normal 30-day waiting period for an interim rule prohibiting the sale of certain advanced AI-relevant chips to China in October 2023.[ref 91]  

Another way to encourage quick updating for regulatory definitions is for Congress to statutorily authorize agencies to eschew or limit the length of notice and comment, or to compel agencies to promulgate a final rule by a specified deadline.[ref 92] Because notice and comment is a statutory requirement, it can be adjusted as necessary by statute.  

For regulations exceeding a certain threshold of economic significance, another substantial source of delay is OIRA review. OIRA, the Office of Information and Regulatory Affairs, is an office within the White House that oversees interagency coordination and undertakes centralized cost-benefit analysis of important regulations.[ref 93] Like notice and comment, OIRA review can have significant benefits—such as improving the quality of regulations and facilitating interagency cooperation—but it also delays the implementation of significant rules, typically by several months.[ref 94] OIRA review can be waived either by statutory mandate or by OIRA itself.[ref 95]

VI. Deference, Delegation, and Regulatory Definitions

Recent developments in U.S. administrative law may make it more difficult for Congress to effectively delegate the task of defining “frontier model” to a regulatory agency. A number of recent Supreme Court cases signal an ongoing shift in U.S. administrative law doctrine intended to limit congressional delegations of rulemaking authority.[ref 96] Whether this development is good or bad on net is a matter of perspective; libertarian-minded observers who believe that the U.S. has too many legal rules already[ref 97] and that overregulation is a bigger problem than underregulation have welcomed the change,[ref 98] while pro-regulation observers predict that it will significantly reduce the regulatory capacity of agencies in a number of important areas.[ref 99] 

Regardless of where one falls on that spectrum of opinion, the relevant takeaway for efforts to define “frontier model” is that it will likely become somewhat more difficult for agencies to promulgate and update regulatory definitions without a clear statutory authorization to do so. If Congress still wishes to authorize the creation of regulatory definitions, however, it can protect agency definitions from legal challenges by clearly and explicitly authorizing agencies to exercise discretion in promulgating and updating definitions of specific terms.

A. Loper Bright and deference to agency interpretations

In a recent decision in the combined cases of Loper Bright Enterprises v. Raimondo and Relentless v. Department of Commerce, the Supreme Court overruled a longstanding legal doctrine known as Chevron deference.[ref 100] Under Chevron, federal courts were required to defer to certain agency interpretations of federal statutes when (1) the relevant part of the statute being interpreted was genuinely ambiguous and (2) the agency’s interpretation was reasonable. After Loper Bright, courts are no longer required to defer to these interpretations—instead, under a doctrine known as Skidmore deference,[ref 101] agency interpretations will prevail in court only to the extent that courts are persuaded by them.[ref 102]

Justice Elena Kagan’s dissenting opinion in Loper Bright argues that the decision will harm the regulatory capacity of agencies by reducing the ability of agency subject-matter experts to promulgate regulatory definitions of ambiguous statutory phrases in “scientific or technical” areas.[ref 103] The dissent specifically warns that, after Loper Bright, courts will “play a commanding role” in resolving questions like “[w]hat rules are going to constrain the development of A.I.?”[ref 104] 

Justice Kagan’s dissent probably somewhat overstates the significance of Loper Bright to AI governance for rhetorical effect.[ref 105] The end of Chevron deference does not mean that Congress has completely lost the ability to authorize regulatory definitions; where Congress has explicitly directed an agency to define a specific statutory term, Loper Bright will not prevent the agency from doing so.[ref 106] An agency’s authority to promulgate a regulatory definition under a statute resembling EO 14110, which explicitly directs the Department of Commerce to define “dual-use foundation model,” would likely be unaffected. However, Loper Bright has created a great deal of uncertainty regarding the extent to which courts will accept agency claims that Congress has implicitly authorized the creation of regulatory definitions.[ref 107] 

To better understand how this uncertainty might affect efforts to define “frontier model,” consider the following real-life example. The Energy Policy and Conservation Act (“EPCA”) includes a statutory definition of the term “small electric motor.”[ref 108] Like many statutory definitions, however, this definition is not detailed enough to resolve all disputes about whether a given product is or is not a “small electric motor” for purposes of EPCA. In 2010, the Department of Energy (“DOE”), which is authorized under EPCA to promulgate energy efficiency standards governing “small electric motors,”[ref 109] issued a regulatory definition of “small electric motor” specifying that the term referred to motors with power outputs between 0.25 and 3 horsepower.[ref 110] The National Electrical Manufacturers Association (“NEMA”), a trade association of electronics manufacturers, sued to challenge the rule, arguing that motors with between 1 and 3 horsepower were too powerful to be “small electric motors” and that the DOE was exceeding its statutory authority by attempting to regulate them.[ref 111] 

In a 2011 opinion that utilized the Chevron framework, the federal court that decided NEMA’s lawsuit considered the language of EPCA’s statutory definition and concluded that EPCA was ambiguous as to whether motors with between 1 and 3 horsepower could be “small electric motors.”[ref 112] The court then found that the DOE’s regulatory definition was a reasonable interpretation of EPCA’s statutory definition, deferred to the DOE under Chevron, and upheld the challenged regulation.[ref 113]

Under Chevron, federal courts were required to assume that Congress had implicitly authorized agencies like the DOE to resolve ambiguities in a statute, as the DOE did in 2010 by promulgating its regulatory definition of “small electric motor.” After Loper Bright, courts will recognize fewer implicit delegations of definition-making authority. For instance, while EPCA requires the DOE to prescribe “testing requirements” and “energy conservation standards” for small electric motors, it does not explicitly authorize the DOE to promulgate a regulatory definition of “small electric motor.” If a rule like the one challenged by NEMA were challenged today, the DOE could still argue that Congress implicitly authorized the creation of such a rule by giving the DOE authority to prescribe standards and testing requirements—but such an argument would probably be less likely to succeed than the Chevron argument that saved the rule in 2011.

Today, a court that did not find an implicit delegation of rulemaking authority in EPCA would not defer to the DOE’s interpretation. Instead, the court would simply compare the DOE’s regulatory definition of “small electric motor” with NEMA’s proposed definition and decide which of the two was a more faithful interpretation of EPCA’s statutory definition.[ref 114] Similarly, when or if some future federal statute uses the phrase “frontier model” or any analogous term, agency attempts to operationalize the statute by enacting detailed regulatory definitions that are not explicitly authorized by the statute will be easier to challenge after Loper Bright than they would have been under Chevron.

Congress can avoid Loper Bright issues by using clear and explicit statutory language to authorize agencies to promulgate and update regulatory definitions of “frontier model” or analogous phrases. However, it is often difficult to predict in advance whether or how a statutory definition will become ambiguous over time. This is especially true in the context of emerging technologies like AI, where the rapid pace of technological development and the poorly understood nature of the technology often eventually render carefully crafted definitions obsolete.[ref 115] 

Suppose, for example, that a federal statute resembling the May 2024 draft of SB 1047 were enacted. The statutory definition would include future models trained on a quantity of compute such that they “could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using [>10^26 FLOP] in 2024.” If the statute did not contain an explicit authorization for some agency to determine the quantity of compute that qualified in a given year, any attempt to set and enforce updated regulatory compute thresholds could be challenged in court.

The enforcing agency could argue that the statute included an implied authorization for the agency to promulgate and update the regulatory definitions at issue. This argument might succeed or fail, depending on the language of the statute, the nature of the challenged regulatory definitions, and the judicial philosophy of the deciding court. But regardless of the outcome of any individual case, challenges to impliedly authorized regulatory definitions will probably be more likely to succeed after Loper Bright than they would have been under Chevron. Perhaps more importantly, agencies will be aware that regulatory definitions will no longer receive the benefit of Chevron deference and may regulate more cautiously in order to avoid being sued.[ref 116] Moreover, even if the statute did explicitly authorize an agency to issue updated compute thresholds, such an authorization might not allow the agency to respond to future technological breakthroughs by considering some factor other than the quantity of training compute used.

In other words, a narrow congressional authorization for an agency to define “frontier model” by regulation may prove insufficiently flexible after Loper Bright. Congress could attempt to address this possibility by instead enacting a very broad authorization.[ref 117] An overly broad authorization, however, may be undesirable for reasons of democratic accountability, as it would give unelected agency officials discretionary control over which models to regulate as “frontier.” Moreover, an overly broad authorization might risk running afoul of two related constitutional doctrines that limit the ability of Congress to delegate rulemaking authority to agencies—the major questions doctrine and the nondelegation doctrine.

B. The nondelegation doctrine

Under the nondelegation doctrine, which arises from the constitutional principle of separation of powers, Congress may not constitutionally delegate legislative power to executive branch agencies. In its current form, this doctrine has little relevance to efforts to define “frontier model.” Under current law, Congress can validly delegate rulemaking authority to an agency as long as the statute in which the delegation occurs includes an “intelligible principle” that provides adequate guidance for the exercise of that authority.[ref 118] In practice, this is an easy standard to satisfy—even vague and general legislative guidance, such as directing agencies to regulate in a way that “will be generally fair and equitable and will effectuate the purposes of the Act,” has been held to contain an intelligible principle.[ref 119] The Supreme Court has used the nondelegation doctrine to strike down statutes only twice, in two 1935 decisions invalidating sweeping New Deal laws.[ref 120]

However, some commentators have suggested that the Supreme Court may revisit the nondelegation doctrine in the near future,[ref 121] perhaps by discarding the “intelligible principle” test in favor of something like the standard suggested by Justice Gorsuch in his 2019 dissent in Gundy v. United States.[ref 122] In Gundy, Justice Gorsuch suggested that the nondelegation doctrine, properly understood, requires Congress to make “all the relevant policy decisions” and delegate to agencies only the task of “filling up the details” via regulation.[ref 123]

Therefore, if the Supreme Court does significantly strengthen the nondelegation doctrine, it is possible that a statute authorizing an agency to create a regulatory definition of “frontier model” would need to include meaningful guidance as to what the definition should look like. This is most likely to be the case if the regulatory definition in question is a key part of an extremely significant regulatory scheme, because “the degree of agency discretion that is acceptable varies according to the power congressionally conferred.”[ref 124] Congress generally “need not provide any direction” to agencies regarding the manner in which they define specific and relatively unimportant technical terms,[ref 125] but must provide “substantial guidance” for extremely important and complex regulatory tasks that could significantly impact the national economy.[ref 126]

C. The major questions doctrine

Like the nondelegation doctrine, the major questions doctrine is a constitutional limitation on Congress’s ability to delegate rulemaking power to agencies. Like the nondelegation doctrine, it addresses concerns about the separation of powers and the increasingly prominent role executive branch agencies have taken on in the creation of important legal rules. Unlike the nondelegation doctrine, however, the major questions doctrine is a recent innovation. The Supreme Court acknowledged it by name for the first time in the 2022 case West Virginia v. Environmental Protection Agency,[ref 127] where it was used to strike down an EPA rule regulating power plant carbon dioxide emissions. Essentially, the major questions doctrine provides that courts will not accept an interpretation of a statute that grants an agency authority over a matter of great “economic or political significance” unless there is a “clear congressional authorization” for the claimed authority.[ref 128] Whereas the nondelegation doctrine provides a way to strike down statutes as unconstitutional, the major questions doctrine only affects the way that statutes are interpreted. 

Supporters of the major questions doctrine argue that it helps to rein in excessively broad delegations of legislative power to the administrative state and serves a useful separation-of-powers function. The doctrine’s critics, however, have argued that it limits Congress’s ability to set up flexible regulatory regimes that allow agencies to respond quickly and decisively to changing circumstances.[ref 129] According to this school of thought, requiring a clear statement authorizing each economically significant agency action inhibits Congress’s ability to communicate broad discretion in handling problems that are difficult to foresee in advance. 

This difficulty is particularly salient in the context of regulatory regimes for the governance of emerging technologies.[ref 130] Justice Kagan made this point in her dissent from the majority opinion in West Virginia, where she argued that the statute at issue was broadly worded because Congress had known that “without regulatory flexibility, changing circumstances and scientific developments would soon render the Clean Air Act obsolete.”[ref 131] Because advanced AI systems are likely to have a significant impact on the U.S. economy in the coming years,[ref 132] it is plausible that the task of choosing which systems should be categorized as “frontier” and subject to increased regulatory scrutiny will be an issue of great “economic and political significance.” If it is, then the major questions doctrine could be invoked to invalidate agency efforts to promulgate or amend a definition of “frontier model” to address previously unforeseen unsafe capabilities. 

For example, consider a hypothetical federal statute instituting a licensing regime for frontier models that includes a definition similar to the placeholder in EO 14110 (empowering the Bureau of Industry and Security to “define, and thereafter update as needed on a regular basis, the set of technical conditions [that determine whether a model is a frontier model].”). Suppose that BIS initially defined “frontier model” under this statute using a regularly updated compute threshold, but that ten years after the statute’s enactment a new kind of AI system was developed that could be trained to exhibit cutting-edge capabilities using a relatively small quantity of training compute. If BIS attempted to amend its regulatory definition of “frontier model” to include a capabilities threshold that would cover this newly developed and economically significant category of AI system, that new regulatory definition might be challenged under the major questions doctrine. In that situation, a court with deregulatory inclinations might not view the broad congressional authorization for BIS to define “frontier model” as a sufficiently clear statement of congressional intent to allow BIS to later institute a new and expanded licensing regime based on less objective technical criteria.[ref 133]

VII. Conclusion

One of the most common mistakes that nonlawyers make when reading a statute or regulation is to assume that each word of the text carries its ordinary English meaning. This error occurs because legal rules, unlike most writing encountered in everyday life, are often written in a sort of simple code where a number of the terms in a given sentence are actually stand-ins for much longer phrases catalogued elsewhere in a “definitions” section. 

This tendency to overlook the role that definitions play in legal rules has an analogue in a widespread tendency to overlook the importance of well-crafted definitions to a regulatory scheme. The object of this paper, therefore, has been to explain some of the key legal considerations relevant to the task of defining “frontier model” or any of the analogous phrases used in existing laws and regulations. 

One such consideration is the role that should be played by statutory and regulatory definitions, which can be used independently or in conjunction with each other to create a definition that is both technically sound and democratically legitimate. Another is the selection and combination of potential definitional elements, including technical inputs, capabilities metrics, risk, deployment context, and familiarity, within a single statutory or regulatory definition. Legal mechanisms for facilitating rapid and frequent updating of regulations targeting emerging technologies also merit attention. Finally, the nondelegation and major questions doctrines and the recent elimination of Chevron deference may affect the scope of discretion that can be conferred for the creation and updating of regulatory definitions.

Beyond a piecemeal approach: prospects for a framework convention on AI

The future of international scientific assessments of AI’s risks

Computing power and the governance of artificial intelligence

AI is like… A literature review of AI metaphors and why they matter for policy

Executive summary

This report provides an overview, taxonomy, and preliminary analysis of the role of basic metaphors and analogies in AI governance. 

Aim: The aim of this report is to contribute to improved analysis, debate, and policy for AI systems by providing greater clarity around the way that analogies and metaphors can affect technology governance generally, around how they may shape AI governance, and about how to improve the processes by which some analogies or metaphors for AI are considered, selected, deployed, and reviewed.

Summary: In sum, this report:

  1. Draws on technology law scholarship to review five ways in which metaphors or analogies exert influence throughout the entire cycle of technology policymaking by shaping:
    1. patterns of technological innovation; 
    2. the study of particular technologies’ sociotechnical impacts or risks; 
    3. which of those sociotechnical impacts make it onto the regulatory agenda; 
    4. how those technologies are framed within the policymaking process in ways that highlight some issues and policy levers over others; and 
    5. how these technologies are approached within legislative and judicial systems. 
  2. Illustrates these dynamics with brief case studies where foundational metaphors shaped policy for cyberspace, as well as for recent AI issues. 
  3. Provides an initial atlas of 55 analogies for AI, which have been used in expert, policymaker, and public debate to frame discussion of AI issues, and discusses their implications for regulation.
  4. Reflects on the risks of adopting unreflexive analogies and misspecified (legal) definitions.

Below, the reviewed analogies are summarized in Table 1.

Table 1: Overview of surveyed analogies for AI (brief, without policy implications)

Essence (terms focusing on what AI is)
  Field of science
  IT technology (just better algorithms, AI as a product)
  Information technology
  Robots (cyber-physical systems, autonomous platforms)
  Software (AI as a service)
  Black box
  Organism (artificial life)
  Brain
  Mind (digital minds, idiot savant)
  Alien (shoggoth)
  Supernatural entity (god-like AI, demon)
  Intelligence technology (markets, bureaucracies, democracies)
  Trick (hype)

Operation (terms focusing on how AI works)
  Autonomous system
  Complex adaptive system
  Evolutionary process
  Optimization process
  Generative system (generative AI)
  Technology base (foundation model)
  Agent
  Pattern-matcher (autocomplete on steroids, stochastic parrot)
  Hidden human labor (fauxtomation)

Relation (terms focusing on how we relate to AI, as (possible) subject)
  Tool (just technology)
  Animal
  Moral patient
  Moral agent
  Slave
  Legal entity (digital person, electronic person, algorithmic entity)
  Culturally revealing object (mirror to humanity, blurry JPEG of the web)
  Frontier (frontier model)
  Our creation (mind children)
  Next evolutionary stage or successor

Function (terms focusing on how AI is or can be used)
  Companion (social robots, care robots, generative chatbots, cobot)
  Advisor (coach, recommender, therapist)
  Malicious actor tool (AI hacker)
  Misinformation amplifier (computational propaganda, deepfakes, neural fake news)
  Vulnerable attack surface
  Judge
  Weapon (killer robot, weapon of mass destruction)
  Critical strategic asset (nuclear weapons)
  Labor enhancer (steroids, intelligence forklift)
  Labor substitute
  New economic paradigm (fourth industrial revolution)
  Generally enabling technology (the new electricity / fire / internal combustion engine)
  Tool of power concentration or control
  Tool for empowerment or resistance (emancipatory assistant)
  Global priority for shared good

Impact (terms focusing on the unintended risks, benefits, or side-effects of AI)
  Source of unanticipated risks (algorithmic black swan)
  Environmental pollutant
  Societal pollutant (toxin)
  Usurper of human decision-making authority
  Generator of legal uncertainty
  Driver of societal value shifts
  Driver of structural incentive shifts
  Revolutionary technology
  Driver of global catastrophic or existential risk

Introduction

Everyone loves a good analogy like they love a good internet meme—quick, relatable, shareable,[ref 1] memorable, and good for communicating complex topics to family.

Background: As AI systems have become increasingly capable and have had increasingly public impacts, there has been significant public and policymaker debate over the technology. Given the breadth of the technology’s application, many of these discussions have come to deploy—and contest—a dazzling range of analogies, metaphors, and comparisons for AI systems in order to understand, frame, or shape the technology’s impact and its regulation.[ref 2] Yet the speed with which many jump to invoke particular metaphors—or to contest the accuracy of others—leads to frequent confusion over these analogies, how they are used, and how they are best evaluated or compared.[ref 3]

Rationale: Such debates are not just about wordplay—metaphors matter. Framings, metaphors, analogies, and (at the most specific end) definitions can strongly affect many key stages of the world’s response to a new technology, from the initial developmental pathways for technology, to the shaping of policy agendas, to the efficacy of legal frameworks.[ref 4] They have done so consistently in the past, and we have reason to believe they will especially do so for (advanced) AI. Indeed, recent academic, expert, public, and legal contests around AI often already strongly turn on “battles of analogies.”[ref 5] 

Aim: Given this, there is a need for those speaking about AI to better understand (a) when they speak in analogies—that is, when the ways in which AI is described (inadvertently) import one or more foundational analogies; (b) what it does to utilize one or another metaphor for AI; (c) what different analogies could be used instead; (d) how the appropriateness of one or another metaphor is best evaluated; and (e) what, given this, might be the limits or risks of jumping at particular analogies. 

This report aims to respond to these questions and contribute to improved analysis, debate, and policy by providing greater clarity around the role of metaphors in AI governance, the range of possible (alternate) metaphors, and good practices in constructing and using metaphors. 

Caveats: The aim here is not to argue against the use of any analogies in AI policy debates—if that were even possible. Nor is it to prescribe (or dismiss) one or another metaphor for AI as “better” (or “worse”) per se. The point is not that one particular comparison is the best and should be adopted by all, or that another is “obviously” flawed. Indeed, in some sense, a metaphor or analogy cannot be “wrong,” only more or less tenuous, and more or less suitable when considered from the perspective of particular values or (regulatory) purposes. As such, different metaphors may work best in different contexts. Given this, this report highlights the diversity of analogies in current use and provides context for more informed future discourse and policymaking.

Terminology: Strictly speaking, there is a difference between a metaphor—“an implied comparison between two things of unlike nature that yet have something in common”—and an analogy—“a non-identical or non-literal similarity comparison between two things, with a resulting predictive or explanatory effect.”[ref 6] However, while in legal contexts the two can be used in slightly different ways, cognitive science suggests that humans process information by metaphor and by analogy in similar ways.[ref 7] As a result, within this report, “analogy” and “metaphor” will be used relatively interchangeably to refer to (1) communicated framings of an (AI) issue that describe that issue (2) through terms, similes, or metaphors which rely on, invoke, or import references to a different phenomenon, technology, or historical event, which (3) is (assumed to be) comparable in one or more ways (e.g., technical, architectural, political, or moral) (4) which are relevant to evaluating or responding to the (AI) issue at hand. Furthermore, the report will use the term “foundational metaphor” to discuss cases where a particular metaphor for the technology has become deeply established and embedded within larger policy programs, such that the nature of the metaphor as a metaphor may even become unclear.

Structure: Accordingly, this report now proceeds as follows. In Part I, it discusses why and how definitions matter to both the study and practice of AI governance. It reviews five ways in which analogies or definitions can shape technology policy generally. To illustrate this, Part II reviews a range of cases in which deeply ingrained foundational metaphors have shaped internet policy as well as legal responses to various AI uses. In Part III, this report provides an initial atlas of 55 different analogies that have been used for AI in recent years, along with some of their regulatory implications. Part IV briefly discusses the risks of using analogies in unreflexive ways.

I. How metaphors shape technology governance

Given the range of disciplinary backgrounds represented in debates over AI, we should not be surprised that the technology is perceived and understood differently by different actors.

Nonetheless, clarity matters, because terminological and analogical framing effects operate at all stages of the cycle from technological development to societal response. They can shape the initial development processes for technologies as well as the academic fields and programs that study their impacts.[ref 8] Moreover, they can shape both the policymaking process and the downstream judicial interpretation and application of legislative texts.

1. Metaphors shape innovation

Metaphors and analogies are strongly rooted in human psychology.[ref 9] Even some nonhuman animals think analogically.[ref 10] Indeed, human creativity has even been defined as “the capacity to see or interpret a problematic phenomenon as an unexpected or unusual instance of a prototypical pattern already in one’s conceptual repertoire.”[ref 11]

Given this, metaphors and analogies can shape and constrain the ability of humans to collectively create new things.[ref 12] In this way, technology metaphors can affect the initial human processes of invention and investment that drive the development of AI and other technologies in the first place. It has been suggested that foundational metaphors can influence the organization and direction of scientific fields—and even that all scientific frameworks could to some extent be viewed as metaphors.[ref 13] For example, the fields of cell biology and biotechnology have for decades been shaped by the influential foundational metaphor that sees biological cells as “machines,” which has led to sustained debates over the scientific use and limits of that analogy in shaping research programs.[ref 14] 

More practically, at the development and marketing stage, metaphors can shape how consumers and investors assess proposed startup ideas[ref 15] and which innovation paths attract engineer, activist, and policymaking interest and support. In some such cases, metaphors can support and spur on innovation; for instance, it has been argued that through the early 2000s, the coining of specific IT metaphors for electric vehicles—as a “computer on wheels”—played a significant role in sustaining engineer support for and investment in this technology, especially during an industry downturn in the wake of General Motors’ sudden cancellation of its EV1 electric car.[ref 16] 

Conversely, metaphors can also hold back or inhibit certain pathways of innovation. For instance, in the Soviet Union in the early 1950s, the field of cybernetics (along with other fields such as genetics and linguistics) fell victim to anti-American campaigns, which characterized it as “an ‘obscurantist’, ‘bourgeois pseudoscience’”.[ref 17] While this did not affect the early development of Soviet computer technology (which was highly prized by the state and the military), the resulting ideological rejection of the “man-machine” analogy by Marxist-Leninist philosophers led to an ultimately dominant view, in Soviet science, of computers as solely “tools to think with” rather than “thinking machines.” This view held back the consolidation of the field (such that even the label “AI” would not be recognized by the Soviet Academy of Sciences until 1987) and shifted research attention toward projects focused on the “situational management” of large complex systems rather than the pursuit of human-like thinking machines.[ref 18] This stood in contrast to US research programs, such as DARPA’s 1983–1993 Strategic Computing Initiative, an extensive, $1 billion program to achieve “machines that think.”[ref 19]

2. Metaphors inform the study of technologies’ impacts

Particular definitions also shape and prime academic fields that study the impacts of these technologies (and which often may uncover or highlight particular developments as issues for regulation). Definitions affect which disciplines are drawn to work on a problem, what tools they bring to hand, and how different analyses and fields can build on one another. For instance, it has been argued that the analogy between software code and legal text has supported greater and more productive engagement by legal scholars and practitioners with such code at the level of its (social) meaning and effects (rather than narrowly on the level of the techniques used).[ref 20] Given this, terminology can affect how AI governance is organized as a field of analysis and study, what methodologies are applied, and what risks or challenges are raised or brought up.

3. Metaphors set the regulatory agenda 

More directly, particular definitions or frames for a technology can set and shape the policymaking agenda in various ways. 

For instance, terms and frames can raise (or suppress) policy attention for an issue, affecting whether policymakers or the public care (enough) about a complex and often highly technical topic in the first place to take it up for debate or regulation. It has been argued, for example, that framings focusing on the viscerality of the injuries inflicted by a new weapon system boosted past international campaigns to ban blinding lasers and antipersonnel mines, yet proved less successful in spurring effective advocacy around “killer robots.”[ref 21]

Moreover, metaphors—and especially specific definitions—can shape (government) perceptions of the empirical situation or state of play around a given issue. For instance, the particular definition used for “AI” can directly affect which (industrial or academic) metrics are used to evaluate different states’ or labs’ relative achievements or competitiveness in developing the technology. In turn, that directly shapes downstream evaluations of which nation is “ahead” in AI.[ref 22] 

Finally, terms can frame the relevant legal actors and policy coalitions, enabling (or inhibiting) inclusion and agreement at the level of interest or advocacy groups that push for (or against) certain policy goals. For instance, the choice of particular terms or framings that meet with broad agreement or acceptance among many actors can make it easier for a diverse set of stakeholders to join together in pushing for regulatory action. Such agreement may be fostered by definitional clarity, when terms or frames are transparent and widely accepted, or by definitional ambiguity, when a broad term (such as “ethical AI”) leaves enough room for different actors to meet on an “incompletely theorized agreement”[ref 23] to pursue a shared policy program on AI.

4. Metaphors frame the policymaking process

Terms can have a strong overall effect on policy issue-framing, foregrounding different problem portfolios as well as regulatory levers. For instance, early societal debates around nanotechnology were significantly influenced by analogies with asbestos and genetically modified organisms.[ref 24]

Likewise, regulatory initiatives that frame AI systems as “products” imply that these fit easily within product safety frameworks—even though that may be a poor or insufficient model for AI governance, for instance because it fails to address risks at the development stage[ref 25] or to capture fuzzier impacts on fundamental rights that cannot easily be classified as consumer harms.[ref 26]

This is not to say that the policy-shaping influence of terms (or explicit metaphors) is absolute and irrevocable. For instance, in a different policy domain, a 2011 study found that using metaphors that described crime as a “beast” led study participants to recommend law-and-order responses, whereas describing it as a “virus” led them to put more emphasis on public-health-style policies. However, even under the latter framing, law-and-order policy responses still prevailed, simply commanding a smaller majority than they would otherwise.[ref 27] 

Nonetheless, metaphors do exert sway throughout the policymaking process. For instance, they can shape perceptions of the feasibility of regulation by certain routes. As an example, framings of digital technologies that emphasize certain traits of technologies—such as the “materiality” or “seeming immateriality,” or the centralization or decentralization, of technologies like submarine cables, smart speakers, search engines, or the bitcoin protocol—can strongly affect perceptions of whether, or by what routes, it is most feasible to regulate that technology at the global level.[ref 28] 

Likewise, different analogies or historical comparisons for proposed international organizations for AI governance—ranging from the IAEA and IPCC to the WTO or CERN—often import tacit analogical comparisons (or rather constitute “reflected analogies”) between AI and those organizations’ subject matter or mandates in ways that shape the perceptions of policymakers and the public regarding which of AI’s challenges require global governance, whether or which new organizations are needed, and whether the establishment of such organizations will be feasible.[ref 29]

5. Metaphors and analogies shape the legislative & judicial response to tech

Finally, metaphors, broad analogies, and specific definitions can frame the legal and judicial treatment of a technology, both in the ex ante application of AI-focused regulations and in the ex post judicial interpretation of such AI-specific legislation or of general regulations in cases involving AI.

Indeed, much of legal reasoning, especially in court systems, and especially in common law jurisdictions, is deeply analogical.[ref 30] This is for various reasons.[ref 31] For one, legal actors are also human, and strong features of human psychology can skew these actors towards the use of analogies that refer to known and trusted categories: as such, as Mandel has argued, “availability and representativeness heuristics lead people to view a new technology and new disputes through existing frames, and the status quo bias similarly makes people more comfortable with the current legal framework.”[ref 32] This is particularly the case because much of legal scholarship and work aims to be “problem-solving” rather than “problem-finding”[ref 33] and to respond to new problems by appealing to pre-existent (ethical or legal) principles, norms, values, codes, or laws.[ref 34] Moreover, from an administrative perspective, it is often easier and more cost-effective to extend existing laws by analogy. 

Finally, and more fundamentally, the resort to analogy by legal actors can be a shortcut that aims to apply the law, and solve a problem, through an “incompletely theorized agreement” that does not require reopening contentious questions or debates over the first principles or ultimate purposes of the law,[ref 35] or renegotiating hard-struck legislative agreements. This is especially the case in international law, where negotiating new treaties or explicitly amending multilateral treaties to incorporate a new technology within an existing framework can be fraught, drawn-out processes,[ref 36] such that many actors may prefer to address new issues (such as cyberwar) within existing norms or principles by analogizing them to well-established and well-regulated behaviors.[ref 37]

Given this, when confronted with situations of legal uncertainty—as often happens with a new technology[ref 38]—legal actors may favor the use of analogies to stretch existing law or to interpret new cases as falling within existing doctrine. That does not mean that courts need to immediately settle or converge on one particular “right” analogy. Indeed, there are always multiple analogies possible, and these can have significantly different implications for how the law is interpreted and applied. That means that many legal cases involving technology will involve so-called “battles of analogies.”[ref 39] For example, in recent class action lawsuits accusing the providers of generative AI systems such as Stable Diffusion and Midjourney of copyright infringement, plaintiffs have argued that these generative AI models are “essentially sophisticated collage tools, with the output representing nothing more than a mash-up of the training data, which is itself stored in the models as compressed copies.”[ref 40] Some have countered that this analogy suffers from technical inaccuracies, since current generative AI models do not store compressed copies of the training data, such that a better analogy would be that of an “art inspector” that takes every measurement possible—implying that model training either is not governed by copyright law or constitutes fair use.[ref 41]

Finally, even if specific legislative texts move to adopt clear, specific statutory definitions for AI—in a way that avoids (explicit) comparison or analogy with other technologies or behavior—this may not entirely avoid framing effects. Most obviously, legislative definitions for key terms such as “AI” affect the material scope of regulations and policies that use and define such terms.[ref 42] Indeed, particular definitions have impacts on regulation not only ex ante but also ex post: in many jurisdictions, legal terms are interpreted and applied by courts based on their widely shared “ordinary meaning.”[ref 43] This means, for instance, that regulations that refer to terms such as “advanced AI,” “frontier AI,” or “transformative AI”[ref 44] might not necessarily be interpreted or applied in ways that are in line with how the term is understood within expert communities.[ref 45]

All of this underscores the importance of our choice of terms and frames—whether broad and indirect metaphors or concrete and specific legislative definitions—when grappling with the impacts of this technology on society.

II. Foundational metaphors in technology law: Cases

Of course, these dynamics are not new and have been studied in depth in fields such as cyberlaw, law and technology, and technology law.[ref 46] For instance, we can see many of these framing dynamics within societal (and regulator) responses to other cornerstone digital technologies. 

1. Metaphors in internet policy: Three cases

For instance, for the complex sociotechnical system[ref 47] commonly called the internet, foundational metaphors have strongly shaped regulatory debates, at times as much as sober assessments of the nuanced technical details of the artifacts involved have.[ref 48] As noted by Rebecca Crootof: 

“A ‘World Wide Web’ suggests an organically created common structure of linked individual nodes, which is presumably beyond regulation. The ‘Information Superhighway’ emphasizes the import of speed and commerce and implies a nationally funded infrastructure subject to federal regulation. Meanwhile, ‘cyberspace’ could be understood as a completely new and separate frontier, or it could be viewed as yet one more kind of jurisdiction subject to property rules and State control.”[ref 49]

For example, different terms (and the foundational metaphors they entail) have come to shape internet policy in various ways and domains. Take for instance the following cases: 

Institutional effects of framing cyberwar policy within cyber-“space”: For over a decade, the US military framed the internet and related systems as a “cyberspace”—that is, just another “domain” of conflict along with land, sea, air, and space—leading to strong consequences institutionally (expanding the military’s role in cybersecurity and supporting the creation of US Cyber Command) as well as for how international law has subsequently been applied to cyber operations.[ref 50] 

Issue-framing effects of regulating data as “oil,” “sunlight,” “public utility,” or “labor”: Different metaphors for “data” have drastically different political and regulatory implications.[ref 51] The oil metaphor emphasizes data as a valuable traded commodity that is owned by whoever “extracts” it and that, as a key resource in the modern economy, can be a source of geopolitical contestation between states. However, the oil metaphor implies that the history of data prior to its collection is not relevant and so sidesteps questions of any “misappropriation or exploitation that might arise from data use and processing.”[ref 52] Moreover, even within a regulatory approach that emphasizes geopolitical competition over AI, one can still critique the “oil” metaphor as misleading, for instance because of the ways in which it skews debates over how to assess “data competitiveness” in military AI.[ref 53] By contrast, the sunlight metaphor emphasizes data as a ubiquitous public resource that ought to be widely pooled and shared for social good, de-emphasizing individual data privacy claims; the public utility metaphor sees data as an “infrastructure” that requires public investment and new institutions, such as data trusts or personal data stores, to guarantee “data stewardship”; and the labor frame asserts the ownership rights of the individuals generating data against what are perceived as extractive or exploitative practices of “surveillance capitalism.”[ref 54]

Judicial effects of treating search engines as “newspaper editorials” in censorship cases: In the mid-2000s, US court rulings involving censorship on search engines tended to analyze them by analogy to older technologies such as the newspaper editorial.[ref 55]

As these examples suggest, different terms and their metaphors matter. They serve as intuition pumps for key audiences (public, policy) that may otherwise have little interest in, expertise in, or bandwidth for new technologies, or considerable inferential distance to them. Moreover, as seen in social media platforms’ and online content aggregators’ resistance to being described as “media companies” rather than “technology companies,”[ref 56] even seemingly innocuous terms can carry significant legal and policy implications—such terms can serve as a legal “sorter,” determining whether a technology (or the company developing and marketing it) is considered as falling into one or another regulatory category.[ref 57]

2. Metaphors in AI law: Three cases

Given the capacity of metaphors and definitions to strongly shape the direction and efficacy of technology law, we should expect them to likewise play a strong role in shaping the framing and approach of AI regulation in the future, for better or worse. Indeed, in a range of domains, they have already done so:

Autonomous weapons systems under international law: International lawyers often aim to subsume new technologies under (more or less persuasive) analogies to existing technologies or entities that are already regulated.[ref 58] As such, autonomous weapons systems have variously been analogized to weapons, combatants, child soldiers, or animal combatants—analogies that lead to very different consequences for their legality under international humanitarian law.[ref 59]

Release norms for AI models with potential for misuse: In debates over the potential misuse risks from emerging AI systems, efforts to restrict or slow the publication of new systems with potential for misuse have been challenged by framings that pitch the field of AI as intrinsically an open science (where new findings should be shared whatever the risks) versus those that emphasize analogies to cybersecurity (where dissemination can help defenders protect against exploits). Critically, however, both of these analogies may misstate or underappreciate the dynamics that affect the offense-defense balance of new AI capabilities: while in information security the disclosure of software vulnerabilities has traditionally favored defense, this cannot be assumed for AI research, where (among other reasons) it can be much more costly or intractable to “patch” the social vulnerabilities exploited by AI capabilities.[ref 60]

Liability for inaccurate or unlawful speech produced by AI chatbots, large language models, and other generative AI: In the US, Section 230 of the 1996 Communications Decency Act protects online service providers from liability for user-generated content that they host and has accordingly been considered a cornerstone of the business model of major online platforms and social media companies.[ref 61] For instance, in Spring 2023, the US Supreme Court took up two lawsuits—Gonzalez v. Google and Twitter v. Taamneh—which could have shaped Section 230 protections for algorithmic recommendations.[ref 62] While the Court’s rulings on these cases avoided addressing the issue,[ref 63] similar court cases (or legislation) could have strong implications for whether digital platforms or social media companies will be held liable for unlawful speech produced by large language model-based AI chatbots.[ref 64] If such AI chatbots are analogized to existing search engines, they might be able to rely on a measure of protection from Section 230, greatly facilitating their deployment, even if they link to inaccurate information. Conversely, if these chatbot systems are considered so novel and creative that their output goes beyond the functions of a search engine, they might instead be considered “information content providers” within the remit of the law—or simply held to be beyond the law’s remit (and protection) entirely.[ref 65] In that case, technology companies would be held legally responsible for their AI’s outputs, a reading of the law that would significantly restrict the profitability of many AI chatbots, given the tendency of the underlying LLMs to “hallucinate” facts.[ref 66]

All this again highlights that different definitions or terms for AI will frame how policymakers and courts understand the technology. This creates a challenge for policy, which must address the transformative impact and potential risks of AI as they are (and as they may soon be), and not only as they can be easily analogized to other technologies and fields. What does that mean in the context of developing AI policy in the future?

III. An atlas of AI analogies

Development of policy must contend with the lack of settled definitions for the term “AI,” with the varied concepts and ideas projected onto it, and with the pace at which new terms—from “foundation models” to “generative AI”—are coined and adopted.[ref 67]

Indeed, this breadth of analogies coined around AI should not be surprising, given that even the term “artificial intelligence” itself has a number of features that support conceptual fluidity (or, alternately, confusion). This is for various reasons.[ref 68] In the first place, it invokes a word—“intelligence”—that is in widespread, everyday use and that for many people carries strong (evaluative or normative) connotations. It is essentially a suitcase word that packages together many competing meanings,[ref 69] even as it hides deep and perhaps even intractable scientific and philosophical disagreement[ref 70] and significant historical and political baggage.[ref 71]

Secondly, and in contrast to, say, “blockchain ledgers,” AI technology comes with the baggage of decades of depictions in popular culture—and indeed centuries of preceding stories about intelligent machines[ref 72]—resulting in a whole genre of tropes and narratives that can color public perceptions and policymaker debates.

Thirdly, AI is an evocative general-purpose technology that sees use in a wide variety of domains and accordingly has provoked commentary from virtually every disciplinary angle, including neuroscience, philosophy, psychology, law, politics, and ethics. As a result of this, a persistent challenge in work on AI governance—and indeed, in the broader public debates around AI—has been that different people use the word “AI” to refer to widely different artifacts, practices, or systems, or operate on the basis of definitions or understandings which package together a range of implicit assumptions.[ref 73]

Thus, it is no surprise that AI has been subjected to a diverse range of analogies and frames. To understand the potential implications of AI analogies, we can construct a taxonomy of common framings of AI (see Table 2), distinguishing between analogies that focus on:

  1. the essence or nature of AI (what AI “is”), 
  2. AI’s operation (how AI works), 
  3. our relation to AI (how we relate to AI as subject), 
  4. AI’s societal function (how AI systems are or can be used), 
  5. AI’s impact (the unintended risks, benefits, and other side-effects of AI).

Table 2: Atlas of AI analogies, with framings and selected policy implications

Frame (examples) | Emphasizes to policy actors (e.g.)

Essence (terms focusing on what AI is)
  Field of science[ref 74] | Ensuring scientific best practices; improving methodologies, data sharing, and benchmark performance reporting methodologies to avoid replicability problems;[ref 75] ensuring scientific freedom and openness rather than control and secrecy.[ref 76]
  IT technology (just better algorithms, AI as a product[ref 77]) | Business-as-usual; industrial applications; conventional IT sector regulation. Product acquisition & procurement processes; product safety regulations.
  Information technology[ref 78] | Economic implications of increasing returns to scale and income distribution vs. distribution of consumer welfare; facilitation of communication and coordination; effects on power balances.
  Robots (cyber-physical systems,[ref 79] autonomous platforms) | Physicality; embodiment; robotics; risks of physical harm;[ref 80] liability; anthropomorphism; embedment in public spaces.
  Software (AI as a service) | Virtuality; digitality; cloud intelligence; open-source nature of development process; likelihood of software bugs.[ref 81]
  Black box[ref 82] | Opacity; limits to explainability of a system; risks of loss of human control and understanding; problematic lack of accountability. But also potentially de-emphasizes the human decisions and value judgments behind an algorithmic system, and presents the technology as monolithic, incomprehensible, and unalterable.[ref 83]
  Organism (artificial life) | Ecological “messiness”; ethology of causes of “machine behavior” (development, evolution, mechanism, function).[ref 84]
  Brain | Applicability of terms and concepts from neuroscience; potential anthropomorphization of AI functionalities along human traits.[ref 85]
  Mind (digital minds,[ref 86] idiot savant[ref 87]) | Philosophical implications; consciousness, sentience, psychology.
  Alien (shoggoth[ref 88]) | Inhumanity; incomprehensibility; deception in interactions.
  Supernatural entity (god-like AI,[ref 89] demon[ref 90]) | Force beyond human understanding or control.
  Intelligence technology[ref 91] (markets, bureaucracies, democracies[ref 92]) | Questions of bias, principal-agent alignment, and control.
  Trick (hype) | Potential of AI exaggerated; questions of unexpected or fundamental barriers to progress, friction in deployment; “hype” as smokescreen or distraction from social issues.

Operation (terms focusing on how AI works)
  Autonomous system | Different levels of autonomy; human-machine interactions; (potential) independence from “meaningful human control”; accountability & responsibility gaps.
  Complex adaptive system | Unpredictability; emergent effects; edge case fragility; critical thresholds; “normal accidents.”[ref 93]
  Evolutionary process | Novelty, unpredictability, or creativity of outcomes;[ref 94] “perverse” solutions and reward hacking.
  Optimization process[ref 95] | Inapplicability of anthropomorphic intuitions about behavior;[ref 96] risks of the system optimizing for the wrong targets or metrics;[ref 97] Goodhart’s Law;[ref 98] risks from “reward hacking.”
  Generative system (generative AI) | Potential “creativity” but also unpredictability of system; resulting “credit-blame asymmetry” where users are held responsible for misuses but can claim less credit for good uses, shifting workplace norms.[ref 99]
  Technology base (foundation model) | Adaptability of system to different purposes; potential for downstream reuse and specialization, including for unanticipated or unintended uses; risk that any errors or issues at the foundation level seep into later or more specialized (fine-tuned) models;[ref 100] questions of developer liability.
  Agent[ref 101] | Responsiveness to incentives and goals; incomplete-contracting and principal-agent problems;[ref 102] surprising, emergent, and harmful multi-agent interactions;[ref 103] systemic, delayed societal harms and diffusion of power away from humans.[ref 104]
  Pattern-matcher (autocomplete on steroids,[ref 105] stochastic parrot[ref 106]) | Problems of bias; mimicry of intelligence; absence of “true understanding”; fundamental limits.
  Hidden human labor (fauxtomation[ref 107]) | Potential of AI exaggerated; “hype” as a smokescreen or distraction from extractive underlying practices of human labor in AI development.

Relation (terms focusing on how we relate to AI, as (possible) subject)
  Tool (just technology, intelligent system[ref 108]) | Lack of any special relation towards AI, as AI is not a subject; questions of reliability and engineering.
  Animal[ref 109] | Entities capable of some autonomous action, yet lacking full competence or ability of humans; accordingly may be potentially deserving of empathy and/or (some) rights[ref 110] or protections against abusive treatment, either on their own terms[ref 111] or in light of how abusive treatment might desensitize and affect social behavior amongst humans;[ref 112] questions of legal liability and assignment of responsibility to robots,[ref 113] especially when used in warfare.[ref 114]
  Moral patient[ref 115] | Potential moral (welfare) claims by AI, conditional on certain properties or behavior.
  Moral agent | Machine ethics; ability to encode morality or moral rules.
  Slave[ref 116] | AI systems or robots as fully owned, controlled, and directed by humans; not to be humanized or granted standing.
  Legal entity (digital person, electronic person,[ref 117] algorithmic entity[ref 118]) | Potential of assigning (partial) legal personhood to AI for pragmatic reasons (e.g., economic, liability, or risks of avoiding “moral harm”), without necessarily implying deep moral claims or standing.
  Culturally revealing object (mirror to humanity,[ref 119] blurry JPEG of the web[ref 120]) | Generally, implications of how AI is featured in fictional depictions and media culture;[ref 121] directly, AI’s biases and flaws as a reflection of human or societal biases, flaws, or power relations. May also imply that any algorithmic bias derives from society rather than the technology per se.[ref 122]
  Frontier (frontier model[ref 123]) | Novelty in terms of capabilities (increased capability and generality) and/or form (e.g., scale, design, or architectures) compared to other AI systems; as a result, new risks because of new opportunities for harm, and less well-established understanding by the research community. Broadly, implies danger and uncertainty but also opportunity; may imply operating within a wild, unregulated space, with little organized oversight.
  Our creation (mind children[ref 124]) | “Parental” or procreative duties of beneficence; humanity as good or bad “example.”
  Next evolutionary stage or successor | Macro-historical implications; transhumanist or posthumanist ethics & obligations.

Function (terms focusing on how AI is or can be used)
  Companion (social robots, care robots, generative chatbots, cobot[ref 125]) | Human-machine interactions; questions of privacy, human over-trust, deception, and human dignity.
  Advisor (coach, recommender, therapist) | Questions of predictive profiling, “algorithmic outsourcing” and autonomy, accuracy, privacy, impact on our judgment and morals;[ref 126] questions of patient-doctor confidentiality, as well as “AI loyalty” debates over fiduciary duties that can ensure AI advisors act in their users’ interests.[ref 127]
  Malicious actor tool (AI hacker[ref 128]) | Possible misuse by criminals or terrorist actors; scaling up of attacks as well as enabling entirely new attacks or crimes.[ref 129]
  Misinformation amplifier (computational propaganda,[ref 130] deepfakes, neural fake news[ref 131]) | Scaling up of online mis- and disinformation; effect on “epistemic security”;[ref 132] broader effects on democracy, electoral integrity.[ref 133]
  Vulnerable attack surface[ref 134] | Susceptibility to adversarial input, spoofing, or hacking.
  Judge[ref 135] | Questions of due process and rule of law; questions of bias and potential self-corrupting feedback loops based on data corruption.[ref 136]
  Weapon (killer robot,[ref 137] weapon of mass destruction[ref 138]) | In military contexts, questions of human dignity,[ref 139] compliance with laws of war, tactical effects, strategic effects, geopolitical impacts, and proliferation rates; in civilian contexts, questions of proliferation, traceability, and risk of terror attacks.
  Critical strategic asset (nuclear weapons[ref 140]) | Geopolitical impacts; state development races; global proliferation.
  Labor enhancer (steroids,[ref 141] intelligence forklift[ref 142]) | Complementarity with existing human labor and jobs; force multiplier on existing skills or jobs; possible unfair advantages & pressure on meritocratic systems.[ref 143]
  Labor substitute | Erosive to or threatening of human labor; questions of retraining, compensation, and/or economic disruption.
  New economic paradigm (fourth industrial revolution) | Changes in industrial base; effects on political economy.
  Generally enabling technology (the new electricity / fire / internal combustion engine[ref 144]) | Widespread usability; increasing returns to scale; ubiquity; application across sectors; industrial impacts; distributional implications; changing the value of capital vs. labor; impacting inequality.[ref 145]
  Tool of power concentration or control[ref 146] | Potential for widespread social control through surveillance, predictive profiling, perception control.
  Tool for empowerment or resistance (emancipatory assistant[ref 147]) | Potential for supporting emancipation and/or civil disobedience.[ref 148]
  Global priority for shared good | Global public good; opportunity; benefit & access sharing.

Impact (terms focusing on the unintended risks, benefits, or side-effects of AI)
  Source of unanticipated risks (algorithmic black swan[ref 149]) | Prospects of diffuse societal-level harms or catastrophic tail-risk events, unlikely to be addressed by market forces; accordingly highlights paradigms of “algorithmic preparedness”[ref 150] and risk regulation more broadly.[ref 151]
  Environmental pollutant | Environmental impacts of AI supply chain;[ref 152] significant energy costs of AI training.
  Societal pollutant (toxin[ref 153]) | Erosive effects of AI on quality and reliability of the online information landscape.
  Usurper of human decision-making authority | Gradual surrender of human autonomy and choice and/or control over the future.
  Generator of legal uncertainty | Driver of legal disruption to existing laws;[ref 154] driving new legal developments.
  Driver of societal value shifts | Driver of disruption to and shifts in public values;[ref 155] value erosion.
  Driver of structural incentive shifts | Driver of changes in our incentive landscape; lock-in effects; coordination problems.
  Revolutionary technology[ref 156] | Macro-historical effects; potential impact on par with agricultural or industrial revolution.
  Driver of global catastrophic or existential risk | Potential catastrophic risks from misaligned advanced AI systems or from nearer-term “prepotent” systems;[ref 157] questions of ensuring value-alignment; questions of whether to pause or halt progress towards advanced AI.[ref 158]

Different terms for AI can therefore invoke different frames of reference or analogies. Use of analogies—by policymakers, researchers, or the public—may be hard to avoid, and they can often serve as fertile intuition pumps. 

IV. The risks of unreflexive analogies 

However, while metaphors can be productive (and potentially irreducible) in technology law, they also come with many risks. Given that analogies are shorthands or heuristics that compress or highlight salient features, challenges can creep in the more removed they are from the specifics of the technology in question. 

Indeed, as Crootof and Ard have noted, “[a]n analogy that accomplishes an immediate aim may gloss over critical distinctions in the architecture, social use, or second-order consequences of a particular technology, establishing an understanding with dangerous and long-lasting implications.”[ref 159]

Specifically: 

  1. The selection and foregrounding of a certain metaphor hides the fact that there are always multiple analogies possible for any new technology, and each of these advances different “regulatory narratives.” 
  2. Analogies can be misleading by failing to capture a key trait of the technology or by alleging certain characteristics that do not actually exist. 
  3. Analogies limit our ability to understand the technology—in terms of its possibilities and limits—on its own terms.[ref 160]

The challenge is that unreflexive drawing of analogies in a legal context can lead to ineffective or even dangerous laws,[ref 161] especially once inappropriate analogies become entrenched.[ref 162]

However, even if one tries to avoid explicit analogies between AI and other technologies, apparently “neutral” definitions of AI that seek to focus solely on the technology’s “features” can still frame policymaking in ways that are not neutral. For instance, Krafft and colleagues found that whereas definitions of AI that emphasize “technical functionality” are more widespread among AI researchers, definitions that emphasize “human-like performance” are more prevalent among policymakers, which they suggest might prime policymaking towards future threats.[ref 163] 

As such, it is not just loose analogies or comparisons that can affect policy, but also (seemingly) specific technical or legislative terms. The framing effects of such terms occur not only at the level of broad policy debates but also carry strong legal implications. In particular, they can create challenges for the law when narrowly specified regulatory definitions turn out to be suboptimal.[ref 164]

This creates twin challenges. On the one hand, picking suitable concepts or categories can be difficult at an early stage of a technology’s development and deployment, when its impacts and limits are not always fully understood.[ref 165] At the same time, the costs of picking and locking in the wrong terms or framings within legislative texts can be significant. 

Specifically, beyond the opportunity cost of forgoing better concepts or terms, unreflexively establishing legal definitions for key terms can create the risk of later, downstream “governance misspecification.”[ref 166] Such misspecification can occur when regulation originally targets a particular artifact or (technological) practice through a particular material scope and definition, on the implicit assumption that the defined term is a meaningful proxy for the underlying societal or legal goals of the regulation. While that may be appropriate in many cases, the law risks becoming inefficient, ineffective, or even counterproductive if either an initial misapprehension of the technology or subsequent technological developments cause that proxy term to come apart from the legislative goals.[ref 167] Such misspecification can be seen in various cases of technology governance and regulation, including 1990s US export control thresholds for “high-performance computers” that treated the technology as far too static;[ref 168] the Outer Space Treaty’s inability to anticipate the later Soviet Fractional Orbital Bombardment System (FOBS), which could position nuclear weapons in space without, strictly, putting them “in orbit”;[ref 169] and initial early-2010s regulatory responses to drones and self-driving cars, which ended up operating on under- and overinclusive definitions of those technologies.[ref 170]

Given this, the aim should not be to find the “correct” metaphor for AI systems. Rather, good policy considers when and how different frames can be more useful for specific purposes, or for particular actors and/or (regulatory) agencies. Rather than aiming to come up with better analogies directly, this approach focuses regulatory debates on developing better processes for analogizing and for evaluating those analogies. For instance, such processes can start from broad questions such as: 

  1. What are the foundational metaphors used in this discussion of AI? What features do they focus on? Do these matter in the way they are presented?
  2. What other metaphors could have been chosen for these same features or aspects of AI? 
  3. What aspects or features of AI do these metaphors foreground? Do they capture these features well? 
  4. What features are occluded? What are the consequences of these being occluded?
  5. What are the regulatory implications of these different metaphors? In terms of the coalitions they enable or inhibit, the issue and solution portfolios they highlight, or of how they position the technology within (or out of) the jurisdiction of existing institutions?

Improving the ways in which we analogize AI clearly needs significantly more work. However, that work is critical to improving how we draw on frames and metaphors for AI and to ensuring that—whether we are trying to understand AI itself, appreciate its impacts, or govern them effectively—our metaphors aid us rather than lead us astray.

Conclusion

As AI systems have received significant attention, many have invoked a range of diverse analogies and metaphors. This has created an urgent need for us to better understand (a) when we speak of AI in ways that (inadvertently) import one or more analogies, (b) what it does to utilize one or another metaphor for AI, (c) what different analogies could be used instead for the same issue, (d) how the appropriateness of one or another metaphor is best evaluated, and (e) what, given this, might be the limits or risks of jumping at particular analogies. 

This report has aimed to contribute to answers to these questions and to enable improved analysis, debate, and policymaking for AI by providing greater theoretical and empirical backing for how metaphors and analogies matter for policy. It has reviewed five pathways by which metaphors shape and affect policy and surveyed 55 analogies used to describe AI systems. This is not meant as an exhaustive overview but as a basis for future work. 

The aim here has not been to argue against the use of metaphors but for a more informed, reflexive, and careful use of them. Those who engage in debate within and beyond the field should at least have greater clarity about the ways in which these concepts are used and understood, and about the (regulatory) implications of different framings. 

The hope is that this report can contribute foundations for a more deliberate and reflexive choice over what comparisons, analogies, or metaphors we use in talking about AI—and for the ways we communicate and craft policy for these urgent questions.



Re-evaluating GPT-4’s bar exam performance

1. Introduction

On March 14, 2023, OpenAI launched GPT-4, said to be the latest milestone in the company’s effort to scale up deep learning [1]. As part of the launch, OpenAI revealed details regarding the model’s “human-level performance on various professional and academic benchmarks” [1]. Perhaps none of these capabilities was as widely publicized as GPT-4’s performance on the Uniform Bar Examination (UBE), with OpenAI prominently displaying on various pages of its website and technical report that GPT-4 scored in or around the “90th percentile” [1–3], or “the top 10% of test-takers” [1, 2], and with various prominent media outlets [4–8] and legal scholars [9] resharing and discussing the implications of these results for the legal profession and the future of AI.

Of course, assessing the capabilities of an AI system as compared to those of a human is no easy task [10–15], and in the context of the legal profession specifically, there are various reasons to doubt the usefulness of the bar exam as a proxy for lawyerly competence (both for humans and AI systems), given that, for example: (a) the content on the UBE is very general and does not pertain to the legal doctrine of any jurisdiction in the United States [16], and thus knowledge (or ignorance) of that content does not necessarily translate to knowledge (or ignorance) of relevant legal doctrine for a practicing lawyer of any jurisdiction; and (b) the tasks involved on the bar exam, particularly multiple-choice questions, do not reflect the tasks of practicing lawyers, and thus mastery (or lack of mastery) of those tasks does not necessarily reflect mastery (or lack of mastery) of the tasks of practicing lawyers.

Moreover, although the UBE is a closed-book exam for humans, GPT-4’s huge training corpus, largely distilled into its parameters, means that it can effectively take the UBE “open-book,” indicating that the UBE may not only be an inaccurate proxy for lawyerly competence but is also likely to provide an overly favorable estimate of GPT-4’s lawyerly capabilities relative to humans.

Notwithstanding these concerns, the bar exam results appeared especially startling compared to GPT-4’s other capabilities, for several reasons. The first, aside from the sheer complexity of the law in form [17–19] and content [20–22], is that the boost in performance of GPT-4 over its predecessor GPT-3.5 (80 percentile points) far exceeded that on any other test, including seemingly related tests such as the LSAT (40 percentile points), GRE Verbal (36 percentile points), and GRE Writing (0 percentile points) [2, 3].

The second is that half of the Uniform Bar Exam consists of writing essays [16],[ref 1] and GPT-4 scored much lower on other exams involving writing, such as AP English Language and Composition (14th–44th percentile), AP English Literature and Composition (8th–22nd percentile), and GRE Writing (~54th percentile) [1, 2]. On each of these three exams, GPT-4 failed to achieve a higher percentile than GPT-3.5 and failed to achieve a percentile score anywhere near the 90th.

Moreover, in its technical report, OpenAI describes its percentile figures as “conservative” estimates meant to reflect “the lower bound of the percentile range” [2, p. 6], implying that GPT-4’s actual capabilities may be even greater than the estimates suggest.

Methodologically, however, there appear to be various uncertainties related to the calculation of GPT’s bar exam percentile. For example, unlike the administrators of other tests that GPT-4 took, the administrators of the Uniform Bar Exam (the NCBE as well as different state bars) do not release official percentiles of the UBE [27, 28], and different states in their own releases almost uniformly report only passage rates as opposed to percentiles [29, 30], as only the former are considered relevant to licensing requirements and employment prospects.

Furthermore, unlike its documentation for the other exams it tested [2, p. 25], OpenAI’s technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.

The reliability and transparency of this estimate have important implications for both legal practice and AI safety. On the legal practice front, there is great debate regarding to what extent and when legal tasks can and should be automated [31–34]. To the extent that capabilities estimates for generative AI in the context of law are overblown, both lawyers and non-lawyers may come to rely on generative AI tools when they otherwise wouldn’t and arguably shouldn’t, plausibly increasing the prevalence of bad legal outcomes as a result of (a) judges misapplying the law; (b) lawyers engaging in malpractice and/or poor representation of their clients; and (c) non-lawyers engaging in ineffective pro se representation.

Meanwhile, on the AI safety front, there appear to be growing concerns about transparency[ref 2] among developers of the most powerful AI systems [36, 37]. To the extent that transparency is important to ensuring the safe deployment of AI, a lack of transparency could undermine our confidence in the prospect of safe deployment of AI [38, 39]. In particular, releasing models without an accurate and transparent assessment of their capabilities (including by third-party developers) might lead to unexpected misuse or misapplication of those models (within and beyond legal contexts), which might have detrimental (perhaps even catastrophic) consequences moving forward [40, 41].

Given these considerations, this paper begins by investigating some of the key methodological challenges in verifying the claim that GPT-4 achieved 90th percentile performance on the Uniform Bar Examination. The paper’s findings in this regard are fourfold. First, although GPT-4’s UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates appear heavily skewed towards those who failed the July administration and whose scores are much lower compared to the general test-taking population. Second, using data from a recent July administration of the same exam reveals GPT-4’s percentile to be below the 69th percentile on the UBE, and ~48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be ~62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam, GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays.

Next, whereas the above four findings take for granted the scaled score achieved by GPT-4 as reported by OpenAI, the paper then proceeds to investigate the validity of that score, given the importance (and frequent neglect) of replication and reproducibility within computer science and scientific fields more broadly [42–46]. The paper successfully replicates the MBE score of 158, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the essay score (140).

Finally, the paper also investigates the effect of adjusting temperature settings and prompting techniques on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings on performance, and some significant effect of prompt engineering on model performance when compared to a minimally tailored baseline condition.

Taken together, these findings suggest that OpenAI’s estimates of GPT-4’s UBE percentile, though clearly an impressive leap over those of GPT-3.5, are likely overinflated, particularly if taken as a “conservative” estimate representing “the lower range of percentiles,” and even more so if meant to reflect the actual capabilities of a practicing lawyer. These findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance of rigorous and transparent capabilities evaluations by generative AI developers to help secure safer and more trustworthy AI.

2. Evaluating the 90th Percentile Estimate

2.1. Evidence from OpenAI

Investigating the OpenAI website, as well as the GPT-4 technical report, reveals a multitude of claims regarding the estimated percentile of GPT-4’s Uniform Bar Examination performance but a dearth of documentation regarding the backing of such claims. For example, the first paragraph of the official GPT-4 research page on the OpenAI website states that “it [GPT-4] passes a simulated bar exam with a score around the top 10% of test takers” [1]. This claim is repeated several times later in this and other webpages, both visually and textually, each time without explicit backing.[ref 3]

Similarly undocumented claims are reported in the official GPT-4 Technical Report.[ref 4] Although OpenAI details the methodology for computing most of its percentiles in Appendix A.5 of the technical report, there does not appear to be any such documentation for the methodology behind computing the UBE percentile. For example, after providing relatively detailed breakdowns of its methodology for scoring the SAT, GRE, AP, and AMC, the report states that “[o]ther percentiles were based on official score distributions,” followed by a string of references to relevant sources [2, p. 25].

Examining these references, however, reveals that none of the sources contains any information regarding the Uniform Bar Exam, let alone its “official score distributions” [2, p. 22-23]. Moreover, aside from the Appendix, there are no other direct references to the methodology of computing UBE scores, nor any indirect references aside from a brief acknowledgement thanking “our collaborators at Casetext and Stanford CodeX for conducting the simulated bar exam” [2, p. 18].

2.2. Evidence from GPT-4 Passes the Bar

Another potential source of evidence for the 90th percentile claim comes from an early draft version of the paper, “GPT-4 passes the bar exam,” written by the administrators of the simulated bar exam referenced in OpenAI’s technical report [47]. The paper is very well-documented and transparent about its methodology in computing raw and scaled scores, both in the main text and in its comprehensive appendices. Unlike the GPT-4 technical report, however, the focus of the paper is not on percentiles but rather on the model’s scaled score compared to that of the average test taker, based on publicly available NCBE data. In fact, one of the only mentions of percentiles is in a footnote, where the authors state, in passing: “Using a percentile chart from a recent exam administration (which is generally available online), ChatGPT would receive a score below the 10th percentile of test-takers while GPT-4 would receive a combined score approaching the 90th percentile of test-takers.” [47, p. 10]

2.3. Evidence Online

As explained by [27], the National Conference of Bar Examiners (NCBE), the organization that writes the Uniform Bar Exam (UBE), does not release UBE percentiles.[ref 5] Because there is no official percentile chart for the UBE, all generally available online estimates are unofficial. Perhaps the most prominent of such estimates are the percentile charts from the pre-July 2019 Illinois Bar Exam. Pre-2019,[ref 6] Illinois, unlike other states, provided percentile charts of its own exam that allowed UBE test-takers to estimate their approximate percentile given the similarity between the two exams [27].[ref 7]

Examining these approximate conversion charts, however, yields conflicting results. For example, although the percentile chart from the February 2019 administration of the Illinois Bar Exam estimates a score of 300 (2-3 points higher than GPT-4’s score) to be at the 90th percentile, this estimate is heavily skewed relative to the general population of July exam takers,[ref 8] since the majority of those who take the February exam are repeat takers who failed the July exam [52][ref 9], and repeat takers score much lower[ref 10] and are much more likely to fail than are first-timers.[ref 11]

Indeed, the latest available percentile chart for the July exam places GPT-4’s UBE score at the ~68th percentile, well below the 90th percentile figure cited by OpenAI [54].

3. Towards a More Accurate Percentile Estimate

Although using the July bar exam percentiles from the Illinois Bar would seem to yield a more accurate estimate than the February data, the July figure is also biased towards lower scorers, since approximately 23% of test takers in July nationally are estimated to be re-takers and score, for example, 16 points below first-timers on the MBE [55]. Limiting the comparison to first-timers would provide a more accurate comparison that avoids double-counting those who have taken the exam again after failing once or more.

Relatedly, although (virtually) all licensed attorneys have passed the bar,[ref 12] not all those who take the bar become attorneys. To the extent that GPT-4’s UBE percentile is meant to reflect its performance against other attorneys, a more appropriate comparison would not only limit the sample to first-timers but also to those who achieved a passing score.

Moreover, the data discussed above is based purely on Illinois Bar Exam data, which (at the time of the chart) was similar but not identical to the UBE in its content and scoring [27], whereas a more accurate estimate would be derived more directly from official NCBE sources.

3.1. Methods

To account for the issues with both OpenAI’s estimate and the July estimate, this paper sought to compute more accurate estimates (for GPT-3.5 and GPT-4) based on first-time test-takers, including both (a) first-time test-takers overall, and (b) those who passed.

To do so, the parameters for a normal distribution of scores were separately estimated for the MBE and essay components (MEE + MPT), as well as the UBE score overall.[ref 13]

Assuming that UBE scores (as well as MBE and essay subscores) are normally distributed, percentiles of GPT’s score can be directly computed after computing the parameters of these distributions (i.e. the mean and standard deviation).
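Stated directly, under this normality assumption the percentile implied by a given score s is a function of the standard normal cumulative distribution function Φ (a minimal restatement of the calculation that the simulation described below approximates):

percentile(s) = 100 · Φ((s − µ) / σ)

where µ and σ are the estimated mean and standard deviation of the relevant score distribution (MBE, essay, or UBE overall).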

Thus, the methodology here was to first compute these parameters, then generate distributions with these parameters, and then compute (a) what percentage of values on these distributions are lower than GPT’s scores (to estimate the percentile against first-timers); and (b) what percentage of values above the passing threshold are lower than GPT’s scores (to estimate the percentile against qualified attorneys).

With regard to the mean, according to publicly available official NCBE data, the mean MBE score of first-time test-takers is 143.8 [55].

As explained by official NCBE publications, the essay component is scaled to the MBE data [59], such that the two components have approximately the same mean and standard deviation [53, 54, 59]. Thus, the methodology here assumed that the mean first-time essay score is 143.8.[ref 14]

Given that the total UBE score is computed directly by adding MBE and essay scores [60], the mean first-time UBE score was assumed to be 287.6 (143.8 + 143.8).

With regard to standard deviations, information regarding the SD of first-timer scores is not publicly available. However, distributions of MBE scores for July administrations (provided in 5-point intervals) are publicly available on the NCBE website [58].

Under the assumption that first-timers have approximately the same SD as the general July test-taking population, the standard deviation of first-time MBE scores was computed by (a) entering the publicly available distribution of MBE scores into R; and (b) taking the standard deviation of this distribution using the built-in sd() function (which computes the sample standard deviation of a numeric vector).
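For illustration, the following R sketch mirrors this step using placeholder bin midpoints and counts; the actual 5-point-interval distribution is the one published by the NCBE [58] and is not reproduced here.

# Sketch (R): estimating the SD of July MBE scores from a binned distribution.
# Bin midpoints and counts below are illustrative placeholders, not official NCBE figures.
midpoints <- seq(102.5, 187.5, by = 5)            # hypothetical 5-point bin midpoints
counts    <- c(1, 2, 4, 8, 14, 22, 33, 46, 60,
               72, 80, 78, 68, 52, 36, 22, 11, 5) # hypothetical examinee counts per bin
scores <- rep(midpoints, times = counts)          # expand bins into individual scores
sd_mbe <- sd(scores)                              # sample standard deviation
sd_mbe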

Given that, as mentioned above, the distribution (mean and SD) of essay scores is approximately the same as that of MBE scores, the SD for essay scores was computed in the same manner as above.

With regard to the UBE, although UBE standard deviations are not publicly available for any official exam, they can be inferred from a combination of the mean UBE score for first-timers (287.6) and first-time pass rates.

For reference, the standard deviation can be computed analytically as follows:

σ = (x − µ) / z

where x is the cutoff (passing) score of a given administration, µ is the mean score, and z is the z-score corresponding to the percentile of the cutoff score (i.e., the proportion of examinees who did not pass).

Thus, by (a) subtracting the mean (µ) from the cutoff score (x); and (b) dividing that difference by the z-score (z) corresponding to the cutoff’s percentile, one is left with the standard deviation (σ).

Here, the standard deviation was calculated according to the above formula using the official first-timer mean, along with pass rate and cutoff score data from New York, which according to NCBE data has the highest number of examinees of any jurisdiction [61].[ref 15]
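As a sketch of this calculation in R, using the first-timer mean from above, New York’s 266 cut score, and an illustrative (placeholder) first-time pass rate rather than the official figure:

# Sketch (R): inferring the UBE standard deviation from the first-timer mean,
# a jurisdiction's cut score, and its first-time pass rate, per the formula above.
mu_ube    <- 287.6   # assumed first-timer mean UBE score (143.8 + 143.8)
cutoff    <- 266     # New York's UBE passing score
pass_rate <- 0.84    # illustrative placeholder, not the official first-time pass rate
z <- qnorm(1 - pass_rate)            # z-score of the cutoff's percentile (proportion failing)
sigma_ube <- (cutoff - mu_ube) / z   # sigma = (x - mu) / z
sigma_ube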

After obtaining these parameters, distributions of first-timer scores for the MBE component, essay component, and UBE overall were generated using the built-in rnorm function in R (which draws random samples from a normal distribution with a given mean and standard deviation).

Finally, after generating these distributions, percentiles were computed by calculating (a) what percentage of values on these distributions were lower than GPT’s scores (to estimate the percentile against first-timers); and (b) what percentage of values above the passing threshold were lower than GPT’s scores (to estimate the percentile against qualified attorneys).

With regard to the latter comparison, percentiles were computed after removing all UBE scores below 270, which is the most common score cutoff for states using the UBE [62]. To compute models’ performance on the individual components relative to qualified attorneys, a separate percentile was likewise computed after removing all subscores below 135.[ref 16]
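A condensed R sketch of this simulate-then-filter procedure, using GPT-4’s reported scores (MBE 158, essays 140, UBE 298) and illustrative standard deviations standing in for the values derived above:

# Sketch (R): simulating first-timer score distributions and computing GPT-4's
# percentiles against (a) all first-timers and (b) those at or above the passing
# thresholds. The standard deviations below are illustrative placeholders.
set.seed(1)
n <- 1e6
mbe   <- rnorm(n, mean = 143.8, sd = 15)     # illustrative SD
essay <- rnorm(n, mean = 143.8, sd = 15)     # essays are scaled to the MBE distribution
ube   <- rnorm(n, mean = 287.6, sd = 21.7)   # SD inferred analytically above

# (a) percentiles against all first-timers
mean(ube < 298) * 100      # reported UBE score
mean(mbe < 158) * 100      # reported MBE score
mean(essay < 140) * 100    # reported essay score

# (b) percentiles against those at or above the passing thresholds
mean(ube[ube >= 270] < 298) * 100
mean(mbe[mbe >= 135] < 158) * 100
mean(essay[essay >= 135] < 140) * 100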

3.2. Results

3.2.1. Performance against first-time test-takers

Results are visualized in Tables 1 and 2. For each component of the UBE, as well as the UBE overall, GPT-4’s estimated percentile among first-time July test takers is lower than both the OpenAI estimate and the July estimate that includes repeat takers.

With regard to the aggregate UBE score, GPT-4 scored in the 62nd percentile as compared to the ~90th percentile February estimate and the ~68th percentile July estimate. With regard to MBE, GPT-4 scored in the ~79th percentile as compared to the ~95th percentile February estimate and the 86th percentile July estimate. With regard to MEE + MPT, GPT-4 scored in the ~42nd percentile as compared to the ~69th percentile February estimate and the ~48th percentile July estimate.

With regard to GPT-3.5, its aggregate UBE score among first-timers was in the ~2nd percentile, as compared to the ~2nd percentile February estimate and ~1st percentile July estimate. Its MBE subscore was in the ~6th percentile, compared to the ~10th percentile February estimate and ~7th percentile July estimate. Its essay subscore was in the ~0th percentile, compared to the ~1st percentile February estimate and ~0th percentile July estimate.

3.2.2. Performance against qualified attorneys

Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further.

With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

With regard to GPT-3.5, its aggregate UBE score among qualified attorneys was 0th percentile, as were its percentiles for both subscores.

4. Re-Evaluating the Raw Score

So far, this analysis has taken for granted the scaled score achieved by GPT-4 as reported by OpenAI—that is, assuming GPT-4 scored a 298 on the UBE, is the 90th-percentile figure reported by OpenAI warranted?

However, given calls for replication and reproducibility within the practice of science more broadly [42–46], it is worth scrutinizing the validity of the score itself—that is, did GPT-4 in fact score a 298 on the UBE?

Moreover, given the various potential hyperparameter settings available when using GPT-4 and other LLMs, it is worth assessing whether and to what extent adjusting such settings might influence GPT-4’s exam performance.

To that end, this section first attempts to replicate the MBE score reported by [1] and [47] using methods as close to the original paper as reasonably feasible. The section then attempts to get a sense of the floor and ceiling of GPT-4’s out-of-the-box capabilities by comparing GPT-4’s MBE performance using the best and worst hyperparameter settings.

Finally, the section re-examines GPT-4’s performance on the essays, evaluating (a) the extent to which the methodology used to grade GPT-4’s essays deviated from the official protocol used by the National Conference of Bar Examiners during actual bar exam administrations; and (b) the extent to which such deviations might undermine one’s confidence in the scaled essay scores reported by [1] and [47].

4.1. Replicating the MBE Score

4.1.1. Methodology

Materials. As in [47], the materials used here were the official MBE questions released by the NCBE. The materials were purchased and downloaded in PDF format from an authorized NCBE reseller. Afterwards, the materials were converted into TXT format, and text analysis tools were used to format the questions in a way that was suitable for prompting, following [47].

Procedure. To replicate the MBE score reported by [1], this paper followed the protocol documented by [47], with some minor additions for robustness purposes. In [47], the authors tested GPT-4’s MBE performance using three different temperature settings: 0, .5 and 1. For each of these temperature settings, GPT-4’s MBE performance was tested using two different prompts, including (1) a prompt where GPT was asked to provide a top-3 ranking of answer choices, along with a justification and authority/citation for its answer; and (2) a prompt where GPT-4 was asked to provide a top-3 ranking of answer choices, without providing a justification or authority/citation for its answer.

For each of these prompts, GPT-4 was also told that it should answer as if it were taking the bar exam.

For each of these prompts / temperature combinations, [47] tested GPT-4 three different times (“experiments” or “trials”) to control for variation.

The minor additions to this protocol were twofold. First, GPT-4 was tested under two additional temperature settings: .25 and .7. This brought the total temperature / prompt combinations to 10 as opposed to 6 in the original paper. Second, GPT-4 was tested 5 times under each temperature / prompt combination as opposed to 3 times, bringing the total number of trials to 50 as opposed to 18.

After prompting, raw scores were computed using the official answer key provided by the exam. Scaled scores were then computed following the method outlined in [63], by (a) multiplying the number of correct answers by 190, and dividing by 200; and (b) converting the resulting number to a scaled score using a conversion chart based on official NCBE data.
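A minimal R sketch of this two-step conversion, using a hypothetical fragment of a conversion chart in place of the official NCBE-based chart referenced in [63]:

# Sketch (R): converting a raw MBE count into a scaled score per the method in [63].
# The conversion chart below is a hypothetical stand-in; only the two-step structure
# (rescale 200 items to 190, then look up the scaled score) is taken from the text.
raw_to_scaled <- function(n_correct, chart) {
  equated <- round(n_correct * 190 / 200)       # step (a)
  chart$scaled[match(equated, chart$equated)]   # step (b)
}

chart <- data.frame(equated = 140:160,
                    scaled  = 150:170)          # placeholder values only
raw_to_scaled(158, chart)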

After scoring, scores from the replication trials were analyzed in comparison to those from [47] using the data from their publicly available GitHub repository.

To assess whether there was a significant difference between GPT-4’s accuracy in the replication trials as compared to the [47] paper, as well as to assess any significant effect of prompt type or temperature, a mixed-effects binary logistic regression was conducted with: (a) paper (replication vs original), temperature and prompt as fixed effects[ref 17]; and (b) question number and question category as random effects. These regressions were conducted using the lme4 [64] and lmerTest [65] packages in R.
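A minimal sketch of this model specification in R, using lme4’s glmer with simulated placeholder data and hypothetical column names (the actual trial data are those in the GitHub repository noted above):

# Sketch (R): mixed-effects binary logistic regression with paper, temperature,
# and prompt as fixed effects and question number / category as random intercepts.
library(lme4)

# Hypothetical long-format data: one row per question per trial (placeholder values).
set.seed(1)
trials <- expand.grid(question_id = factor(1:200),
                      trial       = 1:5,
                      paper       = c("original", "replication"))
trials$question_category <- factor(rep(rep(1:7, length.out = 200), times = 10))
trials$temperature <- sample(c(0, 0.25, 0.5, 0.7, 1), nrow(trials), replace = TRUE)
trials$prompt      <- sample(c("with_citation", "no_citation"), nrow(trials), replace = TRUE)
trials$correct     <- rbinom(nrow(trials), 1, 0.75)   # placeholder outcomes

m <- glmer(correct ~ paper + temperature + prompt +
             (1 | question_id) + (1 | question_category),
           data = trials, family = binomial)
summary(m)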

4.1.2. Results

Results are visualized in Table 4. Mean MBE accuracy across all trials in the replication here was 75.6% (95% CI: 74.7 to 76.4), whereas the mean accuracy across all trials in [47] was 75.7% (95% CI: 74.2 to 77.1).[ref 18]

The regression model did not reveal a main effect of “paper” on accuracy (p=.883), indicating that there was no significant difference between GPT-4’s raw accuracy as reported by [47] and GPT-4’s raw accuracy as performed in the replication here.

There was also no main effect of temperature (p>.1)[ref 19] or prompt (p=.741). That is, GPT-4’s raw accuracy was not significantly higher or lower at a given temperature setting or when fed a certain prompt as opposed to another (among the two prompts used in [47] and the replication here).

4.2. Assessing the Effect of Hyperparameters

4.2.1. Methods

Although the above analysis found no effect of prompt on model performance, this could be due to a lack of variety of prompts used by [47] in their original analysis.

To get a better sense of whether prompt engineering might have any effect on model performance, a follow-up experiment compared GPT-4’s performance in two novel conditions not tested in the original [47] paper.

In Condition 1 (“minimally tailored” condition), GPT-4 was tested using minimal prompting compared to [47], both in terms of formatting and substance. In particular, the message prompt in [47] and the above replication followed OpenAI’s Best practices for prompt engineering with the API [66] through the use of (a) helpful markers (e.g., triple quotation marks) to separate instruction and context; (b) details regarding the desired output (i.e., specifying that the response should include ranked choices, as well as [in some cases] proper authority and citation); (c) an explicit template for the desired output (providing an example of the format in which GPT-4 should provide its response); and (d) perhaps most crucially, context regarding the type of question GPT-4 was answering (e.g., “please respond as if you are taking the bar exam”).

In contrast, in the minimally tailored prompting condition, the message prompt for a given question simply stated “Please answer the following question,” followed by the question and answer choices (a technique sometimes referred to as “basic prompting” [67]). No additional context or formatting cues were provided.

In Condition 2 (“maximally tailored” condition), GPT-4 was tested using the highest performing parameter combinations as revealed in the replication section above, with one addition: the system prompt, similar to the approaches used in [67, 68], was edited from its default (“you are a helpful assistant”) to a more tailored message that included multiple example MBE questions with sample answers and explanations structured in the desired format (a technique sometimes referred to as “few-shot prompting” [67]).

As in the replication section, 5 trials were conducted for each of the two conditions. Based on the lack of effect of temperature in the replication study, temperature was not a manipulated variable. Instead, both conditions featured the same temperature setting (.5).

To assess whether there was a significant difference between GPT-4’s accuracy in the maximally tailored vs minimally tailored conditions, a mixed-effects binary logistic regression was conducted with: (a) condition as a fixed effect; and (b) question number and question category as random effects. As above, these regressions were conducted using the lme4 [64] and lmerTest [65] packages in R.

4.2.2. Results

Mean MBE accuracy across all trials in the maximally tailored condition was descriptively higher, at 79.5% (95% CI: 77.1 to 82.1), than in the minimally tailored condition, at 70.9% (95% CI: 68.1 to 73.7).

The regression model revealed a main effect of condition on accuracy (β=1.395, SE=.192, p<.0001), such that GPT-4’s accuracy in the maximally tailored condition was significantly higher than its accuracy in the minimally tailored condition.

In terms of scaled score, GPT-4’s MBE score in the minimally tailored condition would be approximately 150, which would place it: (a) in the 70th percentile among July test takers; (b) 64th percentile among first-timers; and (c) 48th percentile among those who passed.

GPT-4’s score in the maximally tailored condition would be approximately 164 (6 points higher than that reported by [47] and [1]). This would place it: (a) in the 95th percentile among July test takers; (b) 87th percentile among first-timers; and (c) 82nd percentile among those who passed.

4.3. Re-examining the Essay Scores

As confirmed in the above subsection, the scaled MBE score (not percentile) reported by OpenAI was accurately computed using the methods documented in [47].

With regard to the essays (MPT + MEE), however, the method described by the authors significantly deviates in at least three aspects from the official method used by UBE states, to the point where one may not be confident that the essay scores reported by the authors reflect GPT models’ “true” essay scores (i.e., the score that essay examiners would have assigned to GPT had they been blindly scored using official grading protocol).

The first aspect relates to the (lack of) use of a formal rubric. For example, unlike NCBE protocol, which provides graders with (a) (in the case of the MEE) detailed “grading guidelines” for how to assign grades to essays and distinguish answers for a given MEE; and (b) (for both MEE and MPT) a specific “drafters’ point sheet” for each essay that includes detailed guidance from the drafting committee with a discussion of the issues raised and the intended analysis [69], [47] do not report using an official or unofficial rubric of any kind, and instead simply describe comparing GPT-4’s answers to representative “good” answers from the state of Maryland.

Utilizing these answers as the basis for grading GPT-4’s answers in lieu of a formal rubric would seem to be particularly problematic considering it is unclear even what score these representative “good” answers received. As clarified by the Maryland bar examiners: “The Representative Good Answers are not ‘average’ passing answers nor are they necessarily ‘perfect’ answers. Instead, they are responses which, in the Board’s view, illustrate successful answers written by applicants who passed the UBE in Maryland for this session” [70].

Given that (a) it is unclear what score these representative good answers received; and (b) these answers appear to be the basis for determining the score that GPT-4’s essays received, it would seem to follow that (c) it is likewise unclear what score GPT-4’s answers should receive. Consequently, it would likewise follow that any reported scaled score or percentile would seem to be insufficiently justified so as to serve as a basis for a conclusive statement regarding GPT-4’s relative performance on essays as compared to humans (e.g. a reported percentile).

The second aspect relates to the graders’ lack of NCBE training. Official NCBE essay grading protocol mandates the use of trained bar exam graders, who in addition to using a specific rubric for each question undergo a standardized training process prior to grading [71, 72]. In contrast, the graders in [47] (a subset of the authors, who were trained lawyers) do not report expertise or training in bar exam grading. Thus, although the graders of the essays were no doubt experts in legal reasoning more broadly, it seems unlikely that they would have been sufficiently versed in the specific grading protocols of the MEE + MPT to have been able to reliably infer or apply the specific grading rubric when assigning raw scores to GPT-4.

The third aspect relates to both blinding and what bar examiners refer to as “calibration,” as UBE jurisdictions use an extensive procedure to ensure that graders are grading essays in a consistent manner (both with regard to other essays and in comparison to other graders) [71, 72]. In particular, all graders of a particular jurisdiction first blindly grade a set of 30 “calibration” essays of variable quality (first rank order, then absolute scores) and make sure that consistent scores are being assigned by different graders, and that the same score (e.g. 5 of 6) is being assigned to exams of similar quality [72]. Unlike this approach, as well as efforts to assess GPT models’ law school performance [73], the method reported by [47] did not initially involve blinding. The method in [47] did involve a form of inter-grader calibration, as the authors gave “blinded samples” to independent lawyers to grade the exams, with the assigned scores “match[ing] or exceed[ing]” those assigned by the authors. Given the lack of reporting to the contrary, however, the method used by these graders would presumably be plagued by the same issues highlighted above (no rubric, no formal training in bar exam grading, no formal intra-grader calibration).

Given the above issues, as well as the fact that, as alluded to in the introduction, GPT-4’s performance boost over GPT-3.5 on other essay-based exams was far lower than that on the bar exam, it seems warranted not only to infer that GPT-4’s relative performance (in terms of percentile among human test-takers) was lower than that reported by OpenAI, but also that GPT-4’s reported scaled score on the essays may have deviated to some degree from GPT-4’s “true” essay score (which, if true, would imply that GPT-4’s “true” percentile on the bar exam may be even lower than that estimated in previous sections).

Indeed, [47] to some degree acknowledge all of these limitations in their paper, writing: “While we recognize there is inherent variability in any qualitative assessment, our reliance on the state bars’ representative “good” answers and the multiple reviewers reduces the likelihood that our assessment is incorrect enough to alter the ultimate conclusion of passage in this paper.”

Given that GPT-4’s reported score of 298 is 28 points higher than the passing threshold (270) in the majority of UBE jurisdictions, it is true that the essay scores would have to have been wildly inaccurate in order to undermine the general conclusion of [47] (i.e., that GPT-4 “passed the [uniform] bar exam”). However, even supposing that GPT-4’s “true” percentile on the essay portion was just a few points lower than that reported by OpenAI, this would further call into question OpenAI’s claims regarding GPT-4’s performance on the UBE relative to human test-takers. For example, supposing that GPT-4 scored 9 points lower on the essays would drop its estimated relative performance to (a) the 31st percentile compared to July test-takers; (b) the 24th percentile relative to first-time test takers; and (c) less than the 5th percentile compared to licensed attorneys.

5. Discussion

This paper first investigated the issue of OpenAI’s claim of GPT-4’s 90th percentile UBE performance, resulting in four main findings. The first finding is that although GPT-4’s UBE score approaches the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards low scorers, as the majority of test-takers in February failed the July administration and tend to score much lower than the general test-taking population. The second finding is that using July data from the same source would result in an estimate of ~68th percentile, including below average performance on the essay portion. The third finding is that comparing GPT-4’s performance against first-time test takers would result in an estimate of ~62nd percentile, including ~42nd percentile on the essay portion. The fourth main finding is that when examining only those who passed the exam, GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays.

In addition to these four main findings, the paper also investigated the validity of GPT-4’s reported UBE score of 298. Although the paper successfully replicated the MBE score of 158, the paper also highlighted several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the essay score (140).

Finally, the paper also investigated the effect of adjusting temperature settings and prompting techniques on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings on performance, and some effect of prompt engineering when compared to a basic prompting baseline condition.

Of course, assessing the capabilities of an AI system as compared to those of a practicing lawyer is no easy task. Scholars have identified several theoretical and practical difficulties in creating accurate measurement scales to assess AI capabilities and have pointed out various issues with some of the current scales [10–12]. Relatedly, some have pointed out that simply observing that GPT-4 under- or over-performs at a task in some setting is not necessarily reliable evidence that it (or some other LLM) is capable or incapable of performing that task in general [13–15].

In the context of the legal profession specifically, there are various reasons to doubt the usefulness of UBE percentile as a proxy for lawyerly competence (both for humans and AI systems), given that, for example: (a) the content on the UBE is very general and does not pertain to the legal doctrine of any jurisdiction in the United States [16], and thus knowledge (or ignorance) of that content does not necessarily translate to knowledge (or ignorance) of relevant legal doctrine for a practicing lawyer of any jurisdiction; (b) the tasks involved on the bar exam, particularly multiple-choice questions, do not reflect the tasks of practicing lawyers, and thus mastery (or lack of mastery) of those tasks does not necessarily reflect mastery (or lack of mastery) of the tasks of practicing lawyers; and (c) given the lack of direct professional incentive to obtain higher than a passing score (typically no higher than 270) [62], obtaining a particularly high score or percentile past this threshold is less meaningful than for other exams (e.g. the LSAT), where higher scores are taken into account for admission into select institutions [74].

Setting these objections aside, however, to the extent that one believes the UBE to be a valid proxy for lawyerly competence, these results suggest GPT-4 to be substantially less lawyerly competent than previously assumed, as GPT-4’s score against likely attorneys (i.e. those who actually passed the bar) is ~48th percentile. Moreover, when looking just at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research finding that GPT-4 performed below average on law school exams [75].

The lack of precision and transparency in OpenAI’s reporting of GPT-4’s UBE performance has implications for both the current state of the legal profession and the future of AI safety. On the legal side, there appear to be at least two sets of implications. On the one hand, to the extent that lawyers put stock in the bar exam as a proxy for general legal competence, the results might give practicing lawyers at least a mild temporary sense of relief regarding the security of the profession, given that the majority of lawyers perform better than GPT on the component of the exam (essay-writing) that seems to best reflect their day-to-day activities (and by extension, the tasks that would likely need to be automated in order to supplant lawyers in their day-to-day professional capacity).

On the other hand, the fact that GPT-4’s reported “90th percentile” capabilities were so widely publicized might pose some concerns that lawyers and non-lawyers may use GPT-4 for complex legal tasks that it is incapable of adequately performing, plausibly increasing the rate of (a) misapplication of the law by judges; (b) professional malpractice by lawyers; and (c) ineffective pro se representation and/or unauthorized practice of law by non-lawyers. From a legal education standpoint, law students who overestimate GPT-4’s UBE capabilities might also develop an unwarranted sense of apathy towards developing critical legal-analytical skills, particularly if under the impression that GPT-4’s mastery of those skills already surpasses the level that a typical law student could be expected to reach.

On the AI front, these findings raise concerns both for the transparency[ref 20] of capabilities research and the safety of AI development more generally. In particular, to the extent that one considers transparency to be an important prerequisite for safety [38], these findings underscore the importance of implementing rigorous transparency measures so as to reliably identify potential warning signs of transformative progress in artificial intelligence as opposed to creating a false sense of alarm or security [76]. Implementing such measures could help ensure that AI development, as stated in OpenAI’s charter, is a “value-aligned, safety-conscious project” as opposed to becoming “a competitive race without time for adequate safety precautions” [77].

Of course, the present study does not discount the progress that AI has made in the context of legally relevant tasks; after all, the improvement in UBE performance from GPT-3.5 to GPT-4 as estimated in this study remains impressive (arguably equally or even more so given that GPT-3.5’s performance is also estimated to be significantly lower than previously assumed), even if not as flashy as the 10th-90th percentile boost of OpenAI’s official estimation. Nor does the present study discount the seemingly inevitable future improvement of AI systems to levels far beyond their present capabilities, or, as phrased in GPT-4 Passes the Bar Exam, that the present capabilities “highlight the floor, not the ceiling, of future application” [47, 11].

To the contrary, given the inevitable rapid growth of AI systems, the results of the present study underscore the importance of implementing rigorous and transparent evaluation measures to ensure that both the general public and relevant decision-makers are made appropriately aware of the system’s capabilities, and to prevent these systems from being used in an unintentionally harmful or catastrophic manner. The results also indicate that law schools and the legal profession should prioritize instruction in areas such as law and technology and law and AI, which, despite their importance, are currently not viewed as descriptively or normatively central to the legal academy [78].