What should be internationalised in AI governance?
Abstract
As artificial intelligence (AI) advances, states increasingly recognise the need for international governance to address shared benefits and challenges. However, international cooperation is complex and costly, and not all AI issues require cooperation at the international level. This paper presents a novel framework to identify and prioritise AI governance issues warranting internationalisation. We analyse nine critical policy areas across data, compute, and model governance using four factors which broadly incentivise states to internationalise governance efforts: cross-border externalities, regulatory arbitrage, uneven governance capacity, and interoperability. We find strong benefits of internationalisation in compute-provider oversight, content provenance, model evaluations, incident monitoring, and risk management protocols. In contrast, the benefits of internationalisation are lower or mixed in data privacy, data provenance, chip distribution, and bias mitigation. These results can guide policymakers and researchers in prioritising international AI governance efforts.
The governance misspecification problem
Abstract
Legal rules promulgated to govern emerging technologies often rely on proxy terms and metrics in order to indirectly effectuate background purposes. A common failure mode for this kind of rule occurs when, due to incautious drafting or unforeseen technological developments, a proxy ceases to function as intended and renders a rule ineffective or counterproductive. Borrowing a concept from the technical AI safety literature, we call this phenomenon the “governance misspecification problem.” This article draws on existing legal-philosophical discussions of the nature of rules to define governance misspecification, presents several historical case studies to demonstrate how and why rules become misspecified, and suggests best practices for designing legal rules to avoid misspecification or mitigate its negative effects. Additionally, we examine a few proxy terms used in existing AI governance regulations, such as “frontier AI” and “compute thresholds,” and discuss the significance of the problem of misspecification in the AI governance context.
Legal considerations for defining “frontier model”
Abstract
Many proposed laws and rules for the regulation of artificial intelligence would distinguish between a category consisting of the most advanced models—often called “frontier models”—and all other AI systems. Legal rules that make this distinction will typically need to include or reference a definition of “frontier model” or whatever analogous term is used. The task of creating this definition implicates several important legal considerations. The role of statutory and regulatory definitions in the overall definitional scheme should be considered, as should the advantages and disadvantages of incorporating elements such as technical inputs, capability metrics, epistemic elements, and deployment context into a definition. Additionally, existing legal obstacles to the rapid updating of regulatory definitions should be taken into account—including recent doctrinal developments in administrative law such as the elimination of Chevron deference and the introduction of the major questions doctrine.
I. Introduction
One of the few concrete proposals on which AI governance stakeholders in industry[ref 1] and government[ref 2] have mostly[ref 3] been able to agree is that AI legislation and regulation should recognize a distinct category consisting of the most advanced AI systems. The executive branch of the U.S. federal government refers to these systems, in Executive Order 14110 and related regulations, as “dual-use foundation models.”[ref 4] The European Union’s AI Act refers to a similar class of models as “general-purpose AI models with systemic risk.”[ref 5] And many researchers, as well as leading AI labs and some legislators, use the term “frontier models” or some variation thereon.[ref 6]
These phrases are not synonymous, but they are all attempts to address the same issue—namely that the most advanced AI systems present additional regulatory challenges distinct from those posed by less sophisticated models. Frontier models are expected to be highly capable across a broad variety of tasks and are also expected to have applications and capabilities that are not readily predictable prior to development, nor even immediately known or knowable after development.[ref 7] It is likely that not all of these applications will be socially desirable; some may even create significant risks for users or for the general public.
The question of precisely how frontier models should be regulated is contentious and beyond the scope of this paper. But any law or regulation that distinguishes between “frontier models” (or “dual-use foundation models,” or “general-purpose AI models with systemic risk”) and other AI systems will first need to define the chosen term. A legal rule that applies to a certain category of product cannot be effectively enforced or complied with unless there is some way to determine whether a given product falls within the regulated category. Laws that fail to carefully define ambiguous technical terms often fail in their intended purposes, sometimes with disastrous results.[ref 8] Because the precise meaning of the phrase “frontier model” is not self-evident,[ref 9] the scope of a law or regulation that targeted frontier models without defining that term would be unacceptably uncertain. This uncertainty would impose unnecessary costs on regulated companies (who might overcomply out of an excess of caution or unintentionally undercomply and be punished for it) and on the public (from, e.g., decreased compliance, increased enforcement costs, less risk protection, and more litigation over the scope of the rule).
The task of defining “frontier model” implicates both legal and policy considerations. This paper provides a brief overview of some of the most relevant legal considerations for the benefit of researchers, policymakers, and anyone else with an interest in the topic.
II. Statutory and Regulatory Definitions
Two related types of legal definition—statutory and regulatory—are relevant to the task of defining “frontier model.” A statutory definition is a definition that appears in a statute enacted by a legislative body such as the U.S. Congress or one of the 50 state legislatures. A regulatory definition, on the other hand, appears in a regulation promulgated by a government agency such as the U.S. Department of Commerce or the California Department of Technology (or, less commonly, in an executive order).
Regulatory definitions have both advantages and disadvantages relative to statutory definitions. Legislation is generally a more difficult and resource-intensive process than agency rulemaking, with additional veto points and failure modes.[ref 10] Agencies are therefore capable of putting into effect more numerous and detailed legal rules than Congress can,[ref 11] and can update those rules more quickly and easily than Congress can amend laws.[ref 12] Additionally, executive agencies are often more capable of acquiring deep subject-matter expertise in highly specific fields than are congressional offices due to Congress’s varied responsibilities and resource constraints.[ref 13] This means that regulatory definitions can benefit from agency subject-matter expertise to a greater extent than can statutory definitions, and can also be updated far more easily and often.
The immense procedural and political costs associated with enacting a statute do, however, purchase a greater degree of democratic legitimacy and legal resiliency than a comparable regulation would enjoy. A number of legal challenges that might persuade a court to invalidate a regulatory definition would not be available for the purpose of challenging a statute.[ref 14] And since the rulemaking power exercised by regulatory agencies is generally delegated to them by Congress, most regulations must be authorized by an existing statute. A regulatory definition generally cannot eliminate or override a statutory definition[ref 15] but can clarify or interpret. Often, a regulatory regime will include both a statutory definition and a more detailed regulatory definition for the same term.[ref 16] This can allow Congress to choose the best of both worlds, establishing a threshold definition with the legitimacy and clarity of an act of Congress while empowering an agency to issue and subsequently update a more specific and technically informed regulatory definition.
III. Existing Definitions
This section discusses five noteworthy attempts to define phrases analogous to “frontier model” from three different existing measures. Executive Order 14110 (“EO 14110”), which President Biden issued in October 2023, includes two complementary definitions of the term “dual-use foundation model.” Two definitions of “covered model” from different versions of the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act, a California bill that was recently vetoed by Governor Newsom, are also discussed, along with the EU AI Act’s definition of “general-purpose AI model with systemic risk.”
A. Executive Order 14110
EO 14110 defines “dual-use foundation model” as:
an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits, or could be easily modified to exhibit, high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters, such as by:
(i) substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons;
(ii) enabling powerful offensive cyber operations through automated vulnerability discovery and exploitation against a wide range of potential targets of cyber attacks; or
(iii) permitting the evasion of human control or oversight through means of deception or obfuscation.
Models meet this definition even if they are provided to end users with technical safeguards that attempt to prevent users from taking advantage of the relevant unsafe capabilities.[ref 17]
The executive order imposes certain reporting requirements on companies “developing or demonstrating an intent to develop” dual-use foundation models,[ref 18] and for purposes of these requirements it instructs the Department of Commerce to “define, and thereafter update as needed on a regular basis, the set of technical conditions for models and computing clusters that would be subject to the reporting requirements.”[ref 19] In other words, EO 14110 contains both a high-level quasi-statutory[ref 20] definition and a directive to an agency to promulgate a more detailed regulatory definition. The EO also provides a second definition that acts as a placeholder until the agency’s regulatory definition is promulgated:
any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 1023 integer or floating-point operations[ref 21]
Unlike the first definition, which relies on subjective evaluations of model characteristics,[ref 22] this placeholder definition provides a simple set of objective technical criteria that labs can consult to determine whether the reporting requirements apply. For general-purpose models, the sole test is whether the model was trained on computing power greater than 1026 integer or floating-point operations (FLOP); only models that exceed this compute threshold[ref 23] are deemed “dual-use foundation models” for purposes of the reporting requirements mandated by EO 14110.
B. California’s “Safe and Secure Innovation for Frontier Artificial Intelligence Act” (SB 1047)
California’s recently vetoed “Safe and Secure Innovation for Frontier Artificial Intelligence Models Act” (“SB 1047”) focused on a category that it referred to as “covered models.”[ref 24] The version of SB 1047 passed by the California Senate in May 2024 defined “covered model” to include models meeting either of the following criteria:
(1) The artificial intelligence model was trained using a quantity of computing power greater than 1026 integer or floating-point operations.
(2) The artificial intelligence model was trained using a quantity of computing power sufficiently large that it could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 1026 integer or floating-point operations in 2024 as assessed using benchmarks commonly used to quantify the general performance of state-of-the-art foundation models.[ref 25]
This definition resembles the placeholder definition in EO 14110 in that it primarily consists of a training compute threshold of 1026 FLOP. However, SB 1047 added an alternative capabilities-based threshold to capture future models which “could reasonably be expected” to be as capable as models trained on 1026 FLOP in 2024. This addition was intended to “future-proof”[ref 26] SB 1047 by addressing one of the main disadvantages of training compute thresholds—their tendency to become obsolete over time as advances in algorithmic efficiency produce highly capable models trained on relatively small amounts of compute.[ref 27]
Following pushback from stakeholders who argued that SB 1047 would stifle innovation,[ref 28] the bill was amended repeatedly in the California State Assembly. The final version defined “covered model” in the following way:
(A) Before January 1, 2027, “covered model” means either of the following:
(i) An artificial intelligence model trained using a quantity of computing power greater than 1026 integer or floating-point operations, the cost of which exceeds one hundred million dollars[ref 29] ($100,000,000) when calculated using the average market prices of cloud compute at the start of training as reasonably assessed by the developer.
(ii) An artificial intelligence model created by fine-tuning a covered model using a quantity of computing power equal to or greater than three times 1025 integer or floating-point operations, the cost of which, as reasonably assessed by the developer, exceeds ten million dollars ($10,000,000) if calculated using the average market price of cloud compute at the start of fine-tuning.
(B) (i) Except as provided in clause (ii), on and after January 1, 2027, “covered model” means any of the following:
(I) An artificial intelligence model trained using a quantity of computing power determined by the Government Operations Agency pursuant to Section 11547.6 of the Government Code, the cost of which exceeds one hundred million dollars ($100,000,000) when calculated using the average market price of cloud compute at the start of training as reasonably assessed by the developer.
(II) An artificial intelligence model created by fine-tuning a covered model using a quantity of computing power that exceeds a threshold determined by the Government Operations Agency, the cost of which, as reasonably assessed by the developer, exceeds ten million dollars ($10,000,000) if calculated using the average market price of cloud compute at the start of fine-tuning.
(ii) If the Government Operations Agency does not adopt a regulation governing subclauses (I) and (II) of clause (i) before January 1, 2027, the definition of “covered model” in subparagraph (A) shall be operative until the regulation is adopted.
This new definition was more complex than its predecessor. Subsection (A) introduced an initial definition slated to apply until at least 2027, which relied on a training compute threshold of 1026 FLOP paired with a training cost floor of $100,000,000.[ref 30] Subsection (B), in turn, provided for the eventual replacement of the training compute thresholds used in the initial definition with new thresholds to be determined (and presumably updated) by a regulatory agency.
The most significant change in the final version of SB 1047’s definition was the replacement of the capability threshold with a $100,000,000 cost threshold. Because it would currently cost more than $100,000,000 to train a model using >1026 FLOP, the addition of the cost threshold did not change the scope of the definition in the short term. However, the cost of compute has historically fallen precipitously over time in accordance with Moore’s law.[ref 31] This may mean that models trained using significantly more than 1026 FLOP will cost significantly less than the inflation-adjusted equivalent of 100 million 2024 dollars to create at some point in the future.
The old capability threshold expanded the definition of “covered model” because it was an alternative to the compute threshold—models that exceeded either of the two thresholds would have been “covered.” The newer cost threshold, on the other hand, restricted the scope of the definition because it was linked conjunctively to the compute threshold, meaning that only models that exceed both thresholds were covered. In other words, where the May 2024 definition of “covered model” future-proofed itself against the risk of becoming underinclusive by including highly capable low-compute models, the final definition instead guarded against the risk of becoming overinclusive by excluding low-cost models trained on large amounts of compute. Furthermore, the final cost threshold was baked into the bill text and could only have been changed by passing a new statute—unlike the compute threshold, which could have been specified and updated by a regulator.
Compared with the overall definitional scheme in EO 14110, SB 1047’s definition was simpler, easier to operationalize, and less flexible. SB 1047 lacked a broad, high-level risk-based definition like the first definition in EO 14110. SB 1047 did resemble EO 14110 in its use of a “placeholder” definition, but where EO 14110 confers broad discretion on the regulator to choose the “set of technical conditions” that will comprise the regulatory definition, SB 1047 only authorized the regulator to set and adjust the numerical value of the compute thresholds in an otherwise rigid statutory definition.
C. EU Artificial Intelligence Act
The EU AI Act classifies AI systems according to the risks they pose. It prohibits systems that do certain things, such as exploiting the vulnerabilities of elderly or disabled people,[ref 32] and regulates but does not ban so-called “high-risk” systems.[ref 33] While this classification system does not map neatly onto U.S. regulatory efforts, the EU AI Act does include a category conceptually similar to the EO’s “dual-use foundation model”: the “general-purpose AI model with systemic risk.”[ref 34] The statutory definition for this category includes a given general-purpose model[ref 35] if:
a. it has high impact capabilities[ref 36] evaluated on the basis of appropriate technical tools and methodologies, including indicators and benchmarks; [or]
b. based on a decision of the Commission,[ref 37] ex officio or following a qualified alert from the scientific panel, it has capabilities or an impact equivalent to those set out in point (a) having regard to the criteria set out in Annex XIII.
Additionally, models are presumed to have “high impact capabilities” if they were trained on >1025 FLOP.[ref 38] The seven “criteria set out in Annex XIII” to be considered in evaluating model capabilities include a variety of technical inputs (such as the model’s number of parameters and the size or quality of the dataset used in training the model), the model’s performance on benchmarks and other capabilities evaluations, and other considerations such as the number of users the model has.[ref 39] When necessary, the European Commission is authorized to amend the compute threshold and “supplement benchmarks and indicators” in response to technological developments, such as “algorithmic improvements or increased hardware efficiency.”[ref 40]
The EU Act definition resembles the initial, broad definition in the EO in that they both take diverse factors like the size and quality of the dataset used to train the model, the number of parameters, and the model’s capabilities into account. However, the EU Act definition is likely much broader than either EO definition. The training compute threshold in the EU Act is sufficient, but not necessary, to classify models as systemically risky, whereas the (much higher) threshold in the EO’s placeholder definition is both necessary and sufficient. And the first EO definition includes only models that exhibit a high level of performance on tasks that pose serious risks to national security, while the EU Act includes all general-purpose models with “high impact capabilities,” which it defines as including any model trained on more than 1025 FLOP.
The EU Act definition resembles the final SB 1047 definition of “covered model” in that both definitions authorize a regulator to update their thresholds in response to changing circumstances. It also resembles SB 1047’s May 2024 definition in that both definitions incorporate a training compute threshold and a capabilities-based element.
IV. Elements of Existing Definitions
As the examples discussed above demonstrate, legal definitions of “frontier model” can consist of one or more of a number of criteria. This section discusses a few of the most promising definitional elements.
A. Technical inputs and characteristics
A definition may classify AI models according to their technical characteristics or the technical inputs used in training the model, such as training compute, parameter count, and dataset size and type. These elements can be used in either statutory or regulatory definitions.
Training compute thresholds are a particularly attractive option for policymakers,[ref 41] as evidenced by the three examples discussed above. “Training compute” refers to the computational power used to train a model, often measured in integer or floating-point operations (OP or FLOP).[ref 42] Training compute thresholds function as a useful proxy for model capabilities because capabilities tend to increase as computational resources used to train the model increase.[ref 43]
One advantage of using a compute threshold is that training compute is a straightforward metric that is quantifiable and can be readily measured, monitored, and verified.[ref 44] Because of these characteristics, determining with high certainty whether a given model exceeds a compute threshold is relatively easy. This, in turn, facilitates enforcement of and compliance with regulations that rely on a compute-based definition. Since the amount of training compute (and other technical inputs) can be estimated prior to the training run,[ref 45] developers can predict whether a model will be covered earlier in development.
One disadvantage of a compute-based definition is that compute thresholds are a proxy for model capabilities, which are in turn a proxy for risk. Definitions that make use of multiple nested layers of proxy terms in this manner are particularly prone to becoming untethered from their original purpose.[ref 46] This can be caused, for example, by the operation of Goodhart’s Law, which suggests that “when a measure becomes a target, it ceases to be a good measure.”[ref 47] Particularly problematic, especially for statutory definitions that are more difficult to update, is the possibility that a compute threshold may become underinclusive over time as improvements in algorithmic efficiency allow for the development of highly capable models trained on below-threshold levels of compute.[ref 48] This possibility is one reason why SB 1047 and the EU AI Act both supplement their compute thresholds with alternative, capabilities-based elements.
In addition to training compute, two other model characteristics correlated with capabilities are the number of model parameters[ref 49] and the size of the dataset on which the model was trained.[ref 50] Either or both of these characteristics can be used as an element of a definition. A definition can also rely on training data characteristics other than size, such as the quality or type of the data used; the placeholder definition in EO 14110, for example, contains a lower compute threshold for models “trained… using primarily biological sequence data.”[ref 51] EO 14110 requires a dual-use foundation model to contain “at least tens of billions of parameters,”[ref 52] and the “number of parameters of the model” is a criteria to be considered under the EU AI Act.[ref 53] EO 14110 specified that only models “trained on broad data” could be dual-use foundation models,[ref 54] and the EU AI Act includes “the quality or size of the data set, for example measured through tokens” as one criterion for determining whether an AI model poses systemic risks.[ref 55]
Dataset size and parameter count share many of the pros and cons of training compute. Like training compute, they are objective metrics that can be measured and verified, and they serve as proxies for model capabilities.[ref 56] Training compute is often considered the best and most reliable proxy of the three, in part because it is the most closely correlated with performance and is difficult to manipulate.[ref 57] However, partially redundant backup metrics can still be useful.[ref 58] Dataset characteristics other than size are typically less quantifiable and harder to measure but are also capable of capturing information that the quantifiable metrics cannot.
B. Capabilities
Frontier models can also be defined in terms of their capabilities. A capabilities-based definition element typically sets a threshold level of competence that a model must achieve to be considered “frontier,” either in one or more specific domains or across a broad range of domains. A capabilities-based definition can provide specific, objective criteria for measuring a model’s capabilities,[ref 59] or it can describe the capabilities required in more general terms and leave the task of evaluation to the discretion of future interpreters.[ref 60] The former approach might be better suited to a regulatory definition, especially if the criteria used will have to be updated frequently, whereas the latter approach would be more typical of a high-level statutory definition.
Basing a definition on capabilities, rather than relying on a proxy for capabilities like training compute, eliminates the risk that the chosen proxy will cease to be a good measure of capabilities over time. Therefore, a capabilities-based definition is more likely than, e.g., a compute threshold to remain robust over time in the face of improvements in algorithmic efficiency. This was the point of the May 2024 version of SB 1047’s use of a capabilities element tethered to a compute threshold (“similar or greater performance as an artificial intelligence model trained using a quantity of computing power greater than 1026 integer or floating-point operations in 2024”)—it was an attempt to capture some of the benefits of an input-based definition while also guarding against the possibility that models trained on less than 1026 FLOP may become far more capable in the future than they are in 2024.
However, capabilities are far more difficult than compute to accurately measure. Whether a model has demonstrated “high levels of performance at tasks that pose a serious risk to security” under the EO’s broad capabilities-based definition is not something that can be determined objectively and to a high degree of certainty like the size of a dataset in tokens or the total FLOP used in a training run. Model capabilities are often measured using benchmarks (standardized sets of tasks or questions),[ref 61] but creating benchmarks that accurately measure the complex and diverse capabilities of general-purpose foundation models[ref 62] is notoriously difficult.[ref 63]
Additionally, model capabilities (unlike the technical inputs discussed above) are generally not measurable until after the model has been trained.[ref 64] This makes it difficult to regulate the development of frontier models using capabilities-based definitions, although post-development, pre-release regulation is still possible.
C. Risk
Some researchers have suggested the possibility of defining frontier AI systems on the basis of the risks they pose to users or to public safety instead of or in addition to relying on a proxy metric, like capabilities, or a proxy for a proxy, such as compute.[ref 65] The principal advantage of this direct approach is that it can, in theory, allow for better-targeted regulations—for instance, by allowing a definition to exclude highly capable but demonstrably low-risk models. The principal disadvantage is that measuring risk is even more difficult than measuring capabilities.[ref 66] The science of designing rigorous safety evaluations for foundation models is still in its infancy.[ref 67]
Of the three real-world measures discussed in Section III, only EO 14110 mentions risk directly. The broad initial definition of “dual-use foundation model” includes models that exhibit “high levels of performance at tasks that pose a serious risk to security,” such as “enabling powerful offensive cyber operations through automated vulnerability discovery” or making it easier for non-experts to design chemical weapons. This is a capability threshold combined with a risk threshold; the tasks at which a dual-use foundation model must be highly capable are those that pose a “serious risk” to security, national economic security, and/or national public health or safety. As EO 14110 shows, risk-based definition elements can specify the type of risk that a frontier model must create instead of addressing the severity of the risks created.
D. Epistemic elements
One of the primary justifications for recognizing a category of “frontier models” is the likelihood that broadly capable AI models that are more advanced than previous generations of models will have capabilities and applications that are not readily predictable ex ante.[ref 68] As the word “frontier” implies, lawmakers and regulators focusing on frontier models are interested in targeting models that break new ground and push into the unknown.[ref 69] This was, at least in part, the reason for the inclusion of training compute thresholds of 1026 FLOP in EO 14110 and SB 1047—since the most capable current models were trained on 5×1025 or fewer FLOP,[ref 70] a model trained on 1026 FLOP would represent a significant step forward into uncharted territory.
While it is possible to target models that advance the state of the art by setting and adjusting capability or compute thresholds, a more direct alternative approach would be to include an epistemic element in a statutory definition of “frontier model.” An epistemic element would distinguish between “known” and “unknown” models, i.e., between well-understood models that pose only known risks and poorly understood models that may pose unfamiliar and unpredictable risks.[ref 71]
This kind of distinction between known and unknown risks has a long history in U.S. regulation.[ref 72] For instance, the Toxic Substances Control Act (TSCA) prohibits the manufacturing of any “new chemical substance” without a license.[ref 73] The EPA keeps and regularly updates a list of chemical substances which are or have been manufactured in the U.S., and any substance not included on this list is “new” by definition.[ref 74] In other words, the TSCA distinguishes between chemicals (including potentially dangerous chemicals) that are familiar to regulators and unfamiliar chemicals that pose unknown risks.
One advantage of an epistemic element is that it allows a regulator to address “unknown unknowns” separately from better-understood risks that can be evaluated and mitigated more precisely.[ref 75] Additionally, the scope of an epistemic definition, unlike that of most input- and capability-based definitions, would change over time as regulators became familiar with the capabilities of and risks posed by new models.[ref 76] Models would drop out of the “frontier” category once regulators became sufficiently familiar with their capabilities and risks.[ref 77] Like a capabilities- or risk-based definition, however, an epistemic definition might be difficult to operationalize.[ref 78] To determine whether a given model was “frontier” under an epistemic definition, it would probably be necessary to either rely on a proxy for unknown capabilities or authorize a regulator to categorize eligible models according to a specified process.[ref 79]
E. Deployment context
The context in which an AI system is deployed can serve as an element in a definition. The EU AI Act, for example, takes the number of registered end users and the number of registered EU business users a model has into account as factors to be considered in determining whether a model is a “general-purpose AI model with systemic risk.”[ref 80] Deployment context typically does not in and of itself provide enough information about the risks posed by a model to function as a stand-alone definitional element, but it can be a useful proxy for the kind of risk posed by a given model. Some models may cause harms in proportion to their number of users, and the justification for aggressively regulating these models grows stronger the more users they have. A model that will only be used by government agencies, or by the military, creates a different set of risks than a model that is made available to the general public.
V. Updating Regulatory Definitions
A recurring theme in the scholarly literature on the regulation of emerging technologies is the importance of regulatory flexibility.[ref 81] Because of the rapid pace of technological progress, legal rules designed to govern emerging technologies like AI tend to quickly become outdated and ineffective if they cannot be rapidly and frequently updated in response to changing circumstances.[ref 82] For this reason, it may be desirable to authorize an executive agency to promulgate and update a regulatory definition of “frontier model,” since regulatory definitions can typically be updated more frequently and more easily than statutory definitions under U.S. law.[ref 83]
Historically, failing to quickly update regulatory definitions in the context of emerging technologies has often led to the definitions becoming obsolete or counterproductive. For example, U.S. export controls on supercomputers in the 1990s and early 2000s defined “supercomputer” in terms of the number of millions of theoretical operations per second (MTOPS) the computer could perform.[ref 84] Rapid advances in the processing power of commercially available computers soon rendered the initial definition obsolete, however, and the Clinton administration was forced to revise the MTOPS threshold repeatedly to avoid harming the competitiveness of the American computer industry.[ref 85] Eventually, the MTOPS metric itself was rendered obsolete, leading to a period of several years in which supercomputer export controls were ineffective at best.[ref 86]
There are a number of legal considerations that may prevent an agency from quickly updating a regulatory definition and a number of measures that can be taken to streamline the process. One important aspect of the rulemaking process is the Administrative Procedure Act’s “notice and comment” requirement.[ref 87] In order to satisfy this requirement, agencies are generally obligated to publish notice of any proposed amendment to an existing regulation in the Federal Register, allow time for the public to comment on the proposal, respond to public comments, publish a final version of the new rule, and then allow at least 30–60 days before the rule goes into effect.[ref 88] From the beginning of the notice-and-comment process to the publication of a final rule, this process can take anywhere from several months to several years.[ref 89] However, an agency can waive the 30–60 day publication period or even the entire notice-and-comment requirement for “good cause” if observing the standard procedures would be “impracticable, unnecessary, or contrary to the public interest.”[ref 90] Of course, the notice-and-comment process has benefits as well as costs; public input can be substantively valuable and informative for agencies, and also increases the democratic accountability of agencies and the transparency of the rulemaking process. In certain circumstances, however, the costs of delay can outweigh the benefits. U.S. agencies have occasionally demonstrated a willingness to waive procedural rulemaking requirements in order to respond to emergency AI-related developments. The Bureau of Industry and Security (“BIS”), for example, waived the normal 30-day waiting period for an interim rule prohibiting the sale of certain advanced AI-relevant chips to China in October 2023.[ref 91]
Another way to encourage quick updating for regulatory definitions is for Congress to statutorily authorize agencies to eschew or limit the length of notice and comment, or to compel agencies to promulgate a final rule by a specified deadline.[ref 92] Because notice and comment is a statutory requirement, it can be adjusted as necessary by statute.
For regulations exceeding a certain threshold of economic significance, another substantial source of delay is OIRA review. OIRA, the Office of Information and Regulatory Affairs, is an office within the White House that oversees interagency coordination and undertakes centralized cost-benefit analysis of important regulations.[ref 93] Like notice and comment, OIRA review can have significant benefits—such as improving the quality of regulations and facilitating interagency cooperation—but it also delays the implementation of significant rules, typically by several months.[ref 94] OIRA review can be waived either by statutory mandate or by OIRA itself.[ref 95]
VI. Deference, Delegation, and Regulatory Definitions
Recent developments in U.S. administrative law may make it more difficult for Congress to effectively delegate the task of defining “frontier model” to a regulatory agency. A number of recent Supreme Court cases signal an ongoing shift in U.S. administrative law doctrine intended to limit congressional delegations of rulemaking authority.[ref 96] Whether this development is good or bad on net is a matter of perspective; libertarian-minded observers who believe that the U.S. has too many legal rules already[ref 97] and that overregulation is a bigger problem than underregulation have welcomed the change,[ref 98] while pro-regulation observers predict that it will significantly reduce the regulatory capacity of agencies in a number of important areas.[ref 99]
Regardless of where one falls on that spectrum of opinion, the relevant takeaway for efforts to define “frontier model” is that it will likely become somewhat more difficult for agencies to promulgate and update regulatory definitions without a clear statutory authorization to do so. If Congress still wishes to authorize the creation of regulatory definitions, however, it can protect agency definitions from legal challenges by clearly and explicitly authorizing agencies to exercise discretion in promulgating and updating definitions of specific terms.
A. Loper Bright and deference to agency interpretations
In a recent decision in the combined cases of Loper Bright Enterprises v. Raimondo and Relentless v. Department of Commerce, the Supreme Court repealed a longstanding legal doctrine known as Chevron deference.[ref 100] Under Chevron, federal courts were required to defer to certain agency interpretations of federal statutes when (1) the relevant part of the statute being interpreted was genuinely ambiguous and (2) the agency’s interpretation was reasonable. After Loper Bright, courts are no longer required to defer to these interpretations—instead, under a doctrine known as Skidmore deference,[ref 101] agency interpretations will prevail in court only to the extent that courts are persuaded by them.[ref 102]
Justice Elena Kagan’s dissenting opinion in Loper Bright argues that the decision will harm the regulatory capacity of agencies by reducing the ability of agency subject-matter experts to promulgate regulatory definitions of ambiguous statutory phrases in “scientific or technical” areas.[ref 103] The dissent specifically warns that, after Loper Bright, courts will “play a commanding role” in resolving questions like “[w]hat rules are going to constrain the development of A.I.?”[ref 104]
Justice Kagan’s dissent probably somewhat overstates the significance of Loper Bright to AI governance for rhetorical effect.[ref 105] The end of Chevron deference does not mean that Congress has completely lost the ability to authorize regulatory definitions; where Congress has explicitly directed an agency to define a specific statutory term, Loper Bright will not prevent the agency from doing so.[ref 106] An agency’s authority to promulgate a regulatory definition under a statute resembling EO 14110, which explicitly directs the Department of Commerce to define “dual-use foundation model,” would likely be unaffected. However, Loper Bright has created a great deal of uncertainty regarding the extent to which courts will accept agency claims that Congress has implicitly authorized the creation of regulatory definitions.[ref 107]
To better understand how this uncertainty might affect efforts to define “frontier model,” consider the following real-life example. The Energy Policy and Conservation Act (“EPCA”) includes a statutory definition of the term “small electric motor.”[ref 108] Like many statutory definitions, however, this definition is not detailed enough to resolve all disputes about whether a given product is or is not a “small electric motor” for purposes of EPCA. In 2010, the Department of Energy (“DOE”), which is authorized under EPCA to promulgate energy efficiency standards governing “small electric motors,”[ref 109] issued a regulatory definition of “small electric motor” specifying that the term referred to motors with power outputs between 0.25 and 3 horsepower.[ref 110] The National Electrical Manufacturers Association (“NEMA”), a trade association of electronics manufacturers, sued to challenge the rule, arguing that motors with between 1 and 3 horsepower were too powerful to be “small electric motors” and that the DOE was exceeding its statutory authority by attempting to regulate them.[ref 111]
In a 2011 opinion that utilized the Chevron framework, the federal court that decided NEMA’s lawsuit considered the language of EPCA’s statutory definition and concluded that EPCA was ambiguous as to whether motors with between 1 and 3 horsepower could be “small electric motors.”[ref 112] The court then found that the DOE’s regulatory definition was a reasonable interpretation of EPCA’s statutory definition, deferred to the DOE under Chevron, and upheld the challenged regulation.[ref 113]
Under Chevron, federal courts were required to assume that Congress had implicitly authorized agencies like the DOE to resolve ambiguities in a statute, as the DOE did in 2010 by promulgating its regulatory definition of “small electric motor.” After Loper Bright, courts will recognize fewer implicit delegations of definition-making authority. For instance, while EPCA requires the DOE to prescribe “testing requirements” and “energy conservation standards” for small electric motors, it does not explicitly authorize the DOE to promulgate a regulatory definition of “small electric motor.” If a rule like the one challenged by NEMA were challenged today, the DOE could still argue that Congress implicitly authorized the creation of such a rule by giving the DOE authority to prescribe standards and testing requirements—but such an argument would probably be less likely to succeed than the Chevron argument that saved the rule in 2011.
Today, a court that did not find an implicit delegation of rulemaking authority in EPCA would not defer to the DOE’s interpretation. Instead, the court would simply compare the DOE’s regulatory definition of “small electric motor” with NEMA’s proposed definition and decide which of the two was a more faithful interpretation of EPCA’s statutory definition.[ref 114] Similarly, when or if some future federal statute uses the phrase “frontier model” or any analogous term, agency attempts to operationalize the statute by enacting detailed regulatory definitions that are not explicitly authorized by the statute will be easier to challenge after Loper Bright than they would have been under Chevron.
Congress can avoid Loper Bright issues by using clear and explicit statutory language to authorize agencies to promulgate and update regulatory definitions of “frontier model” or analogous phrases. However, it is often difficult to predict in advance whether or how a statutory definition will become ambiguous over time. This is especially true in the context of emerging technologies like AI, where the rapid pace of technological development and the poorly understood nature of the technology often eventually render carefully crafted definitions obsolete.[ref 115]
Suppose, for example, that a federal statute resembling the May 2024 draft of SB 1047 was enacted. The statutory definition would include future models trained on a quantity of compute such that they “could reasonably be expected to have similar or greater performance as an artificial intelligence model trained using [>1026 FLOP] in 2024.” If the statute did not contain an explicit authorization for some agency to determine the quantity of compute that qualified in a given year, any attempt to set and enforce updated regulatory compute thresholds could be challenged in court.
The enforcing agency could argue that the statute included an implied authorization for the agency to promulgate and update the regulatory definitions at issue. This argument might succeed or fail, depending on the language of the statute, the nature of the challenged regulatory definitions, and the judicial philosophy of the deciding court. But regardless of the outcome of any individual case, challenges to impliedly authorized regulatory definitions will probably be more likely to succeed after Loper Bright than they would have been under Chevron. Perhaps more importantly, agencies will be aware that regulatory definitions will no longer receive the benefit of Chevron deference and may regulate more cautiously in order to avoid being sued.[ref 116] Moreover, even if the statute did explicitly authorize an agency to issue updated compute thresholds, such an authorization might not allow the agency to respond to future technological breakthroughs by considering some factor other than the quantity of training compute used.
In other words, a narrow congressional authorization to regulatorily define “frontier model” may prove insufficiently flexible after Loper Bright. Congress could attempt to address this possibility by instead enacting a very broad authorization.[ref 117] An overly broad definition, however, may be undesirable for reasons of democratic accountability, as it would give unelected agency officials discretionary control over which models to regulate as “frontier.” Moreover, an overly broad definition might risk running afoul of two related constitutional doctrines that limit the ability of Congress to delegate rulemaking authority to agencies—the major questions doctrine and the nondelegation doctrine.
B. The nondelegation doctrine
Under the nondelegation doctrine, which arises from the constitutional principle of separation of powers, Congress may not constitutionally delegate legislative power to executive branch agencies. In its current form, this doctrine has little relevance to efforts to define “frontier model.” Under current law, Congress can validly delegate rulemaking authority to an agency as long as the statute in which the delegation occurs includes an “intelligible principle” that provides adequate guidance for the exercise of that authority.[ref 118] In practice, this is an easy standard to satisfy—even vague and general legislative guidance, such as directing agencies to regulate in a way that “will be generally fair and equitable and will effectuate the purposes of the Act,” has been held to contain an intelligible principle.[ref 119] The Supreme Court has used the nondelegation doctrine to strike down statutes only twice, in two 1935 decisions invalidating sweeping New Deal laws.[ref 120]
However, some commentators have suggested that the Supreme Court may revisit the nondelegation doctrine in the near future,[ref 121] perhaps by discarding the “intelligible principle” test in favor of something like the standard suggested by Justice Gorsuch in his 2019 dissent in Gundy v. United States.[ref 122] In Gundy, Justice Gorsuch suggested that the nondelegation doctrine, properly understood, requires Congress to make “all the relevant policy decisions” and delegate to agencies only the task of “filling up the details” via regulation.[ref 123]
Therefore, if the Supreme Court does significantly strengthen the nondelegation doctrine, it is possible that a statute authorizing an agency to create a regulatory definition of “frontier model” would need to include meaningful guidance as to what the definition should look like. This is most likely to be the case if the regulatory definition in question is a key part of an extremely significant regulatory scheme, because “the degree of agency discretion that is acceptable varies according to the power congressionally conferred.”[ref 124] Congress generally “need not provide any direction” to agencies regarding the manner in which it defines specific and relatively unimportant technical terms,[ref 125] but must provide “substantial guidance” for extremely important and complex regulatory tasks that could significantly impact the national economy.[ref 126]
C. The major questions doctrine
Like the nondelegation doctrine, the major questions doctrine is a constitutional limitation on Congress’s ability to delegate rulemaking power to agencies. Like the nondelegation doctrine, it addresses concerns about the separation of powers and the increasingly prominent role executive branch agencies have taken on in the creation of important legal rules. Unlike the nondelegation doctrine, however, the major questions doctrine is a recent innovation. The Supreme Court acknowledged it by name for the first time in the 2022 case West Virginia v. Environmental Protection Agency,[ref 127] where it was used to strike down an EPA rule regulating power plant carbon dioxide emissions. Essentially, the major questions doctrine provides that courts will not accept an interpretation of a statute that grants an agency authority over a matter of great “economic or political significance” unless there is a “clear congressional authorization” for the claimed authority.[ref 128] Whereas the nondelegation doctrine provides a way to strike down statutes as unconstitutional, the major questions doctrine only affects the way that statutes are interpreted.
Supporters of the major questions doctrine argue that it helps to rein in excessively broad delegations of legislative power to the administrative state and serves a useful separation-of-powers function. The doctrine’s critics, however, have argued that it limits Congress’s ability to set up flexible regulatory regimes that allow agencies to respond quickly and decisively to changing circumstances.[ref 129] According to this school of thought, requiring a clear statement authorizing each economically significant agency action inhibits Congress’s ability to communicate broad discretion in handling problems that are difficult to foresee in advance.
This difficulty is particularly salient in the context of regulatory regimes for the governance of emerging technologies.[ref 130] Justice Kagan made this point in her dissent from the majority opinion in West Virginia, where she argued that the statute at issue was broadly worded because Congress had known that “without regulatory flexibility, changing circumstances and scientific developments would soon render the Clean Air Act obsolete.”[ref 131] Because advanced AI systems are likely to have a significant impact on the U.S. economy in the coming years,[ref 132] it is plausible that the task of choosing which systems should be categorized as “frontier” and subject to increased regulatory scrutiny will be an issue of great “economic and political significance.” If it is, then the major questions doctrine could be invoked to invalidate agency efforts to promulgate or amend a definition of “frontier model” to address previously unforeseen unsafe capabilities.
For example, consider a hypothetical federal statute instituting a licensing regime for frontier models that includes a definition similar to the placeholder in EO 14110 (empowering the Bureau of Industry and Security to “define, and thereafter update as needed on a regular basis, the set of technical conditions [that determine whether a model is a frontier model].”). Suppose that BIS initially defined “dual-use foundation model” under this statute using a regularly updated compute threshold, but that ten years after the statute’s enactment a new kind of AI system was developed that could be trained to exhibit cutting-edge capabilities using a relatively small quantity of training compute. If BIS attempted to amend its regulatory definition of “frontier model” to include a capabilities threshold that would cover this newly developed and economically significant category of AI system, that new regulatory definition might be challenged under the major questions doctrine. In that situation, a court with deregulatory inclinations might not view the broad congressional authorization for BIS to define “frontier model” as a sufficiently clear statement of congressional intent to allow BIS to later institute a new and expanded licensing regime based on less objective technical criteria.[ref 133]
VI. Conclusion
One of the most common mistakes that nonlawyers make when reading a statute or regulation is to assume that each word of the text carries its ordinary English meaning. This error occurs because legal rules, unlike most writing encountered in everyday life, are often written in a sort of simple code where a number of the terms in a given sentence are actually stand-ins for much longer phrases catalogued elsewhere in a “definitions” section.
This tendency to overlook the role that definitions play in legal rules has an analogue in a widespread tendency to overlook the importance of well-crafted definitions to a regulatory scheme. The object of this paper, therefore, has been to explain some of the key legal considerations relevant to the task of defining “frontier model” or any of the analogous phrases used in existing laws and regulations.
One such consideration is the role that should be played by statutory and regulatory definitions, which can be used independently or in conjunction with each other to create a definition that is both technically sound and democratically legitimate. Another is the selection and combination of potential definitional elements, including technical inputs, capabilities metrics, risk, deployment context, and familiarity, that can be used independently or in conjunction with each other to create a single statutory or regulatory definition. Legal mechanisms for facilitating rapid and frequent updating for regulations targeting emerging technologies also merit attention. Finally, the nondelegation and major questions doctrines and the recent elimination of Chevron deference may affect the scope of discretion that can be conferred for the creation and updating of regulatory definitions.
Beyond a piecemeal approach: prospects for a framework convention on AI
Abstract
Solving many of the challenges presented by artificial intelligence (AI) requires international coordination and cooperation. In response, the past years have seen multiple global initiatives to govern AI. However, very few proposals have discussed treaty models or design for AI governance and have therefore neglected the study of framework conventions–generally multilateral law-making treaties that establish a two-step regulatory process through which initially underspecified obligations and implementation mechanisms are subsequently specified via protocols. This chapter asks whether or how a Framework Convention on AI (FCAI) might serve as a regulatory tool for global AI governance, in contrast with the more traditional piecemeal approach based on individual treaties that govern isolated issues and have no subsequent regime. To answer these questions, the chapter first briefly sets out the recent context of global AI governance, and the governance gaps that remain to be filled. It then explores the elements, definition, and general role of framework conventions as an international regulatory instrument. On this basis, the chapter considers the structural trade-offs and challenges that an FCAI would face, before discussing key ways in which it could be designed to address these concerns. We argue that, while imperfect, an FCAI may be the most tractable and appropriate solution for the international governance of AI if it follows a hybrid model that combines a wide scope with specific obligations and implementation mechanisms concerning issues on which states already converge.
The future of international scientific assessments of AI’s risks
Abstract
Effective international coordination to address AI’s global impacts demands a shared, scientifically rigorous understanding of AI risks. This paper examines the challenges and opportunities in establishing international scientific consensus in this domain. It analyzes current efforts, including the UK-led International Scientific Report on the Safety of Advanced AI and emerging UN initiatives, identifying key limitations and tradeoffs. The authors propose a two-track approach: 1) a UN-led process focusing on broad AI issues and engaging member states, and 2) an independent annual report specifically focused on advanced AI risks. The paper recommends careful coordination between these efforts to leverage their respective strengths while maintaining their independence. It also evaluates potential hosts for the independent report, including the network of AI Safety Institutes, the OECD, and scientific organizations like the International Science Council. The proposed framework aims to balance scientific rigor, political legitimacy, and timely action to facilitate coordinated international action on AI risks.
Computing power and the governance of artificial intelligence
Abstract
Computing power, or “compute,” is crucial for the development and deployment of artificial intelligence (AI) capabilities. As a result, governments and companies have started to leverage compute as a means to govern AI. For example, governments are investing in domestic compute capacity, controlling the flow of compute to competing countries, and subsidizing compute access to certain sectors. However, these efforts only scratch the surface of how compute can be used to govern AI development and deployment. Relative to other key inputs to AI (data and algorithms), AI-relevant compute is a particularly effective point of intervention: it is detectable, excludable, and quantifiable, and is produced via an extremely concentrated supply chain. These characteristics, alongside the singular importance of compute for cutting-edge AI models, suggest that governing compute can contribute to achieving common policy objectives, such as ensuring the safety and beneficial use of AI. More precisely, policymakers could use compute to facilitate regulatory visibility of AI, allocate resources to promote beneficial outcomes, and enforce restrictions against irresponsible or malicious AI development and usage. However, while compute-based policies and technologies have the potential to assist in these areas, there is significant variation in their readiness for implementation. Some ideas are currently being piloted, while others are hindered by the need for fundamental research. Furthermore, naïve or poorly scoped approaches to compute governance carry significant risks in areas like privacy, economic impacts, and centralization of power. We end by suggesting guardrails to minimize these risks from compute governance.
AI is like… A literature review of AI metaphors and why they matter for policy
Abstract
As AI systems have become increasingly capable and impactful, there has been significant public and policymaker debate over this technology’s impacts—and the appropriate legal or regulatory responses. Within these debates many have deployed—and contested—a dazzling range of analogies, metaphors, and comparisons for AI systems, their impact, or their regulation.
This report reviews why and how metaphors and analogies matter to both the study and practice of AI governance, in order to contribute to more productive dialogue and more reflective policymaking. It first reviews five stages at which different foundational metaphors play a role in shaping the processes of technological innovation, the academic study of their impacts, the regulatory agenda, the terms of the policymaking process, and legislative and judicial responses to new technology. It then surveys a series of cases where the choice of analogy materially influenced the internet policy issues as well as (recent) AI law issues. The report then provides a non-exhaustive survey of 55 analogies that have been given for AI technology and some of their policy implications. Finally, it discusses the risks of utilizing unreflexive analogies in AI law and regulation.
By disentangling the role of metaphors, analogies, and frames in these debates, and the space of analogies for AI, this survey does not aim to argue against the use or role of analogies in AI regulation—but rather to facilitate more reflective and productive conversations on these timely challenges.
Executive summary
This report provides an overview, taxonomy, and preliminary analysis of the role of basic metaphors and analogies in AI governance.
Aim: The aim of this report is to contribute to improved analysis, debate, and policy for AI systems by providing greater clarity around the way that analogies and metaphors can affect technology governance generally, around how they may shape AI governance, and about how to improve the processes by which some analogies or metaphors for AI are considered, selected, deployed, and reviewed.
Summary: In sum, this report:
- Draws on technology law scholarship to review five ways in which metaphors or analogies exert influence throughout the entire cycle of technology policymaking by shaping:
- patterns of technological innovation;
- the study of particular technologies’ sociotechnical impacts or risks;
- which of those sociotechnical impacts make it onto the regulatory agenda;
- how those technologies are framed within the policymaking process in ways that highlight some issues and policy levers over others; and
- how these technologies are approached within legislative and judicial systems.
- Illustrates these dynamics with brief case studies where foundational metaphors shaped policy for cyberspace, as well as for recent AI issues.
- Provides an initial atlas of 55 analogies for AI, which have been used in expert, policymaker, and public debate to frame discussion of AI issues, and discusses their implications for regulation.
- Reflects on the risks of adopting unreflexive analogies and misspecified (legal) definitions.
Below, the reviewed analogies are summarized in Table 1.
Table 1: Overview of surveyed analogies for AI (brief, without policy implications)
Theme | Frame (varieties) |
---|---|
Essence Terms focusing on what AI is | Field of science |
IT technology (just better algorithms, AI as a product) | |
Information technology | |
Robots (cyber-physical systems, autonomous platforms) | |
Software (AI as a service) | |
Black box | |
Organism (artificial life) | |
Brain | |
Mind (digital minds, idiot savant) | |
Alien (shoggoth) | |
Supernatural entity (god-like AI, demon) | |
Intelligence technology (markets, bureaucracies, democracies) | |
Trick (hype) | |
Operation Terms focusing on how AI works | Autonomous system |
Complex adaptive system | |
Evolutionary process | |
Optimization process | |
Generative system (generative AI) | |
Technology base (foundation model) | |
Agent | |
Pattern-matcher (autocomplete on steroids, stochastic parrot) | |
Hidden human labor (fauxtomation) | |
Relation Terms focusing on how we relate to AI, as (possible) subject | Tool (just technology) |
Animal | |
Moral patient | |
Moral agent | |
Slave | |
Legal entity (digital person, electronic person, algorithmic entity) | |
Culturally revealing object (mirror to humanity, blurry JPEG of the web) | |
Frontier (frontier model) | |
Our creation (mind children) | |
Next evolutionary stage or successor | |
Function Terms focusing on how AI is or can be used | Companion (social robots, care robots, generative chatbots, cobot) |
Advisor (coach, recommender, therapist) | |
Malicious actor tool (AI hacker) | |
Misinformation amplifier (computational propaganda, deepfakes, neural fake news) | |
Vulnerable attack surface | |
Judge | |
Weapon (killer robot, weapon of mass destruction) | |
Critical strategic asset (nuclear weapons) | |
Labor enhancer (steroids, intelligence forklift) | |
Labor substitute | |
New economic paradigm (fourth industrial revolution) | |
Generally enabling technology (the new electricity / fire / internal combustion engine) | |
Tool of power concentration or control | |
Tool for empowerment or resistance (emancipatory assistant) | |
Global priority for shared good | |
Impact Terms focusing on the unintended risks, benefits or side-effects of AI | Source of unanticipated risks (algorithmic black swan) |
Environmental pollutant | |
Societal pollutant (toxin) | |
Usurper of human decision-making authority | |
Generator of legal uncertainty | |
Driver of societal value shifts | |
Driver of structural incentive shifts | |
Revolutionary technology | |
Driver of global catastrophic or existential risk |
Introduction
Everyone loves a good analogy like they love a good internet meme—quick, relatable, shareable,[ref 1] memorable, and good for communicating complex topics to family.
Background: As AI systems have become increasingly capable and have had increasingly public impacts, there has been significant public and policymaker debate over the technology. Given the breadth of the technology’s application, many of these discussions have come to deploy—and contest—a dazzling range of analogies, metaphors, and comparisons for AI systems in order to understand, frame, or shape the technologies’ impact and its regulation.[ref 2] Yet the speed with which many often jump to invoke particular metaphors—or to contest the accuracy of others—leads to frequent confusion over these analogies, how they are used, and how they are best evaluated or compared.[ref 3]
Rationale: Such debates are not just about wordplay—metaphors matter. Framings, metaphors, analogies, and (at the most specific end) definitions can strongly affect many key stages of the world’s response to a new technology, from the initial developmental pathways for technology, to the shaping of policy agendas, to the efficacy of legal frameworks.[ref 4] They have done so consistently in the past, and we have reason to believe they will especially do so for (advanced) AI. Indeed, recent academic, expert, public, and legal contests around AI often already strongly turn on “battles of analogies.”[ref 5]
Aim: Given this, there is a need for those speaking about AI to better understand (a) when they speak in analogies—that is, when the ways in which AI is described (inadvertently) import one or more foundational analogies; (b) what it does to utilize one or another metaphor for AI; (c) what different analogies could be used instead; (d) how the appropriateness of one or another metaphor is best evaluated; and (e) what, given this, might be the limits or risks of jumping at particular analogies.
This report aims to respond to these questions and contribute to improved analysis, debate, and policy by providing greater clarity around the role of metaphors in AI governance, the range of possible (alternate) metaphors, and good practices in constructing and using metaphors.
Caveats: The aim here is not to argue against the use of any analogies in AI policy debates—if that were even possible. Nor is it to prescribe (or dismiss) one or another metaphor for AI as “better” (or “worse”) per se. The point is not that one particular comparison is the best and should be adopted by all, or that another is “obviously” flawed. Indeed, in some sense, a metaphor or analogy cannot be “wrong,” only more tenuous and more or less suitable when considered from the perspective of some values or some (regulatory) purpose. As such, different metaphors may work best in different contexts. Given this, this report highlights the diversity of analogies in current use and provides context for more informed future discourse and policymaking.
Terminology: Strictly speaking, there is a difference between a metaphor—“an implied comparison between two things of unlike nature that yet have something in common”—and an analogy—“a non-identical or non-literal similarity comparison between two things, with a resulting predictive or explanatory effect.”[ref 6] However, while in legal contexts the two can be used in slightly different ways, cognitive science suggests that humans process information by metaphor and by analogy in similar ways.[ref 7] As a result, within this report, “analogy” and “metaphor” will be used relatively interchangeably to refer to (1) communicated framings of an (AI) issue that describe that issue (2) through terms, similes, or metaphors which rely on, invoke, or importreferences to a different phenomenon, technology, or historical event, which (3) is (assumed to be) comparable in one or more ways (e.g., technical, architectural, political, or moral) (4) which are relevant to evaluating or responding to the (AI) issue at hand. Furthermore, the report will use the term “foundational metaphor” to discuss cases where a particular metaphor for the technology has become deeply established and embedded within larger policy programs, such that the nature of the metaphor as a metaphor may even become unclear.
Structure: Accordingly, this report now proceeds as follows. In Part I, it discusses why and how definitions matter to both the study and practice of AI governance. It reviews five ways in which analogies or definitions can shape technology policy generally. To illustrate this, Part II reviews a range of cases in which deeply ingrained foundational metaphors have shaped internet policy as well as legal responses to various AI uses. In Part III, this report provides an initial atlas of 55 different analogies that have been used for AI in recent years, along with some of their regulatory implications. Part IV briefly discusses the risks of using analogies in unreflexive ways.
I. How metaphors shape technology governance
Given the range of disciplinary backgrounds in debates over AI, we should not be surprised that the technology is perceived and understood differently by many.
Nonetheless, it matters to get clarity, because terminological and analogical framing effects happen at all stages in the cycle from technological development to societal response. They can shape the initial development processes for technologies as well as the academic fields and programs that study their impacts.[ref 8] Moreover, they can shape both the policymaking processes and the downstream judicial interpretation and application of legislative texts.
1. Metaphors shape innovation
Metaphors and analogies are strongly rooted in human psychology.[ref 9] Even some nonhuman animals think analogically.[ref 10] Indeed, human creativity has even been defined as “the capacity to see or interpret a problematic phenomenon as an unexpected or unusual instance of a prototypical pattern already in one’s conceptual repertoire.”[ref 11]
Given this, metaphors and analogies can shape and constrain the ability of humans to collectively create new things.[ref 12] In this way, technology metaphors can affect the initial human processes of invention and investment that drive the development of AI and other technologies in the first place. It has been suggested that foundational metaphors can influence the organization and direction of scientific fields—and even that all scientific frameworks could to some extent be viewed as metaphors.[ref 13] For example, the fields of cell biology and biotechnology have for decades been shaped by the influential foundational metaphor that sees biological cells as “machines,” which has led to sustained debates over the scientific use and limits of that analogy in shaping research programs.[ref 14]
More practically, at the development and marketing stage, metaphors can shape how consumers and investors assess proposed startup ideas[ref 15] and which innovation paths attract engineer, activist, and policymaking interest and support. In some such cases, metaphors can support and spur on innovation; for instance, it has been argued that through the early 2000s, the coining of specific IT metaphors for electric vehicles—as a “computer on wheels”—played a significant role in sustaining engineer support for and investment in this technology, especially during an industry downturn in the wake of General Motors’ sudden cancellation of its EV1 electric car.[ref 16]
Conversely, metaphors can also hold back or inhibit certain pathways of innovation; for instance, in the Soviet Union in the early 1950s, the field of cybernetics (along with other fields such as genetics or linguistics) fell victim to anti-American campaigns, which characterized it as “an ‘obscurantist’, ‘bourgeois pseudoscience’”.[ref 17] While this did not affect the early development of Soviet computer technology (which was highly prized by the state and the military), the resulting ideological rejection of the “man-machine” analogy by Marxist-Leninist philosophers led to an ultimately dominant view, in Soviet sciences, of computers as solely “tools to think with” rather than “thinking machines,” holding back the consolidation of the field (such that even the label “AI” would not be recognized by the Soviet Academy of Sciences until 1987) and shifting research attention into projects that focused on the “situational management” of large complex systems rather than the pursuit of human-like thinking machines.[ref 18] This stood in contrast to US research programs, such as DARPA’s 1983–1993 Strategic Computing Initiative, an extensive, $1 billion program to achieve “machines that think.”[ref 19]
2. Metaphors inform the study of technologies’ impacts
Particular definitions also shape and prime academic fields that study the impacts of these technologies (and which often may uncover or highlight particular developments as issues for regulation). Definitions affect which disciplines are drawn to work on a problem, what tools they bring to hand, and how different analyses and fields can build on one another. For instance, it has been argued that the analogy between software code and legal text has supported greater and more productive engagement by legal scholars and practitioners with such code at the level of its (social) meaning and effects (rather than narrowly on the level of the techniques used).[ref 20] Given this, terminology can affect how AI governance is organized as a field of analysis and study, what methodologies are applied, and what risks or challenges are raised or brought up.
3. Metaphors set the regulatory agenda
More directly, particular definitions or frames for a technology can set and shape the policymaking agenda in various ways.
For instance, terms and frames can raise (or suppress) policy attention for an issue, affecting whether policymakers or the public care (enough) about a complex and often highly technical topic in the first place to take it up for debate or regulation. For instance, it has been argued that framings that focus on the viscerality of the injuries inflicted by a new weapon system have in the past boosted international campaigns to ban blinding lasers and antipersonnel mines, yet they ended up being less successful in spurring effective advocacy around “killer robots.”[ref 21]
Moreover, metaphors—and especially specific definitions—can shape (government) perceptions of the empirical situation or state of play around a given issue. For instance, the particular definition used for “AI” can directly affect which (industrial or academic) metrics are used to evaluate different states’ or labs’ relative achievements or competitiveness in developing the technology. In turn, that directly shapes downstream evaluations of which nation is “ahead” in AI.[ref 22]
Finally, terms can frame the relevant legal actors and policy coalitions, enabling (or inhibiting) inclusion and agreement at the level of interest or advocacy groups that push for (or against) certain policy goals. For instance, the choice for particular terms or framings that meet with broad agreement or acceptance amongst many actors can make it easier for a diverse set of stakeholders to join together in pushing for regulatory actions. However, such agreement may be fostered by definitional clarity, when terms or frames are transparent and meet with wider acceptance, or because of definitional ambiguity, when a broad term (such as “ethical AI”) allows for sufficient ambiguity that different actors can meet on an “incompletely theorized agreement”[ref 23] to pursue a shared policy program on AI.
4. Metaphors frame the policymaking process
Terms can have a strong overall effect on policy issue-framing, foregrounding different problem portfolios as well as regulatory levers. For instance, early societal debates around nanotechnology were significantly influenced by analogies with asbestos and genetically modified organisms.[ref 24]
Likewise, regulatory initiatives that frame AI systems as “products” imply that these fit easily within product safety frameworks—even if that may be a poor or insufficient model for AI governance, for instance because it is a model that fails to address any risks at the developmental stage[ref 25] or because it fails to accurately focus on fuzzier impacts on fundamental rights if those cannot be easily classified as consumer harms.[ref 26]
This is not to say that the policy-shaping influence of terms (or explicit metaphors) is absolute and irrevocable. For instance, in a different policy domain, a 2011 study found that using metaphors that described crime as a “beast” led study participants to recommend law-and-order responses, whereas describing it as a “virus” led them to put more emphasis on public-health-style policies. However, even under the latter framing, law-and-order policy responses still prevailed, simply commanding a smaller majority than they would otherwise.[ref 27]
Nonetheless, metaphors do exert sway throughout the policymaking process. For instance, they can shape perceptions of the feasibility of regulation by certain routes. As an example, framings of digital technologies that emphasize certain traits of technologies—such as the “materiality” or “seeming immateriality,” or the centralization or decentralization, of technologies like submarine cables, smart speakers, search engines, or the bitcoin protocol—can strongly affect perceptions of whether, or by what routes, it is most feasible to regulate that technology at the global level.[ref 28]
Likewise, different analogies or historical comparisons for proposed international organizations for AI governance—ranging from the IAEA and IPCC to the WTO or CERN—often import tacit analogical comparisons (or rather constitute “reflected analogies”) between AI and those organizations’ subject matter or mandates in ways that shape the perceptions of policymakers and the public regarding which of AI’s challenges require global governance, whether or which new organizations are needed, and whether the establishment of such organizations will be feasible.[ref 29]
5. Metaphors and analogies shape the legislative & judicial response to tech
Finally, metaphors, broad analogies, and specific definitions can frame legal and judicial treatment of a technology in both the ex ante application of AI-focused regulations and the ex post subsequent judicial interpretation of either such AI-specific legislation or of general regulations in the context of cases involving AI.
Indeed, much of legal reasoning, especially in court systems, and especially in common law jurisdictions, is deeply analogical.[ref 30] This is for various reasons.[ref 31] For one, legal actors are also human, and strong features of human psychology can skew these actors towards the use of analogies that refer to known and trusted categories: as such, as Mandel has argued, “availability and representativeness heuristics lead people to view a new technology and new disputes through existing frames, and the status quo bias similarly makes people more comfortable with the current legal framework.”[ref 32] This is particularly the case because much of legal scholarship and work aims to be “problem-solving” rather than “problem-finding”[ref 33] and to respond to new problems by appealing to pre-existent (ethical or legal) principles, norms, values, codes, or laws.[ref 34] Moreover, from an administrative perspective, it is often easier and more cost-effective to extend existing laws by analogy.
Finally, and more fundamentally, the resort to analogy by legal actors can be a shortcut that aims to apply the law, and solve a problem, through an “incompletely theorized agreement” that does not require reopening contentious questions or debates over the first principles or ultimate purposes of the law,[ref 35] or renegotiating hard-struck legislative agreements. This is especially the case at the level of international law, where either negotiating new treaties or explicitly amending multilateral treaties to incorporate a new technology within an existing framework can be wrought, drawn-out processes[ref 36] such that many actors may prefer ultimately addressing new issues (such as cyberwar) within existing norms or principles by analogizing them to well-established and well-regulated behaviors.[ref 37]
Given this, when confronted with situations of legal uncertainty—as often happens with a new technology[ref 38]—legal actors may favor the use of analogies to stretch existing law or to interpret new cases as falling within existing doctrine. That does not mean that courts need immediately settle or converge on one particular “right” analogy. Indeed, there are always multiple analogies possible, and these can have significantly different implications for how the law is interpreted and applied. That means that many legal cases involving technology will involve so-called “battles of analogies.”[ref 39] For example, in recent class action lawsuits that have accused generative AI providers such as Stable Diffusion and Midjourney of copyright infringement, plaintiffs have argued that these generative AI models are “essentially sophisticated collage tools, with the output representing nothing more than a mash-up of the training data, which is itself stored in the models as compressed copies.”[ref 40] Some have countered that this analogy suffers some technical inaccuracies, since current generative AI models do not store compressed copies of the training data, such that a better analogy would be that of an “art inspector” that takes every measurement possible—implying that model training either is not governed by copyright law or constitutes fair use.[ref 41]
Finally, even if specific legislative texts move to adopt clear, specific statutory definitions for AI—in a way that avoids (explicit) comparison or analogy with other technologies or behavior—this may not entirely avoid framing effects. Most obviously, legislative definitions for key terms such as “AI” obviously affect the material scope of regulations and policies that use and define such terms.[ref 42] Indeed, the effects of particular definitions have impacts on regulation not only ex ante but also ex post: in many jurisdictions, legal terms are interpreted and applied by courts based on their widely shared “ordinary meaning.”[ref 43] This means, for instance, that regulations that refer to terms such as “advanced AI,” “frontier AI,” or “transformative AI”[ref 44] might not necessarily be interpreted or applied in ways that are in line with how the term is understood within expert communities.[ref 45]
All of this underscores the importance of our choice of terms and frames—whether broad and indirect metaphors or concrete and specific legislative definitions—when grappling with the impacts of this technology on society.
II. Foundational metaphors in technology law: Cases
Of course, these dynamics are not new and have been studied in depth in fields such as cyberlaw, law and technology, and technology law.[ref 46] For instance, we can see many of these framing dynamics within societal (and regulator) responses to other cornerstone digital technologies.
1. Metaphors in internet policy: Three cases
For instance, for the complex sociotechnical system[ref 47] commonly called the internet, foundational metaphors have strongly shaped regulatory debates, at times as much as sober assessments of the nuanced technical details of the artifacts involved have.[ref 48] As noted by Rebecca Crootof:
“A ‘World Wide Web’ suggests an organically created common structure of linked individual nodes, which is presumably beyond regulation. The ‘Information Superhighway’ emphasizes the import of speed and commerce and implies a nationally funded infrastructure subject to federal regulation. Meanwhile, ‘cyberspace’ could be understood as a completely new and separate frontier, or it could be viewed as yet one more kind of jurisdiction subject to property rules and State control.”[ref 49]
For example, different terms (and the foundational metaphors they entail) have come to shape internet policy in various ways and domains. Take for instance the following cases:
Institutional effects of framing cyberwar policy within cyber-“space”: For over a decade, the US military framed the internet and related systems as a “cyberspace”—that is, just another “domain” of conflict along with land, sea, air, and space—leading to strong consequences institutionally (expanding the military’s role in cybersecurity and supporting the creation of US Cyber Command) as well as for how international law has subsequently been applied to cyber operations.[ref 50]
Issue-framing effects of regulating data as “oil,” “sunlight,” “public utility,” or “labor”: Different metaphors for “data” have drastically different political and regulatory implications.[ref 51] The oil metaphor emphasizes data as a valuable traded commodity that is owned by whoever “extracts” it and that, as a key resource in the modern economy, can be a source of geopolitical contestation between states. However, the oil metaphor implies that the history of data prior to its collection is not relevant and so sidesteps questions of any “misappropriation or exploitation that might arise from data use and processing.”[ref 52] Moreover, even within an regulatory approach that emphasizes geopolitical competition over AI, one can still critique the “oil” metaphor as misleading, for instance because of the ways in which it skews debates over how to assess “data competitiveness” in military AI.[ref 53] By contrast, the sunlight metaphor emphasizes data as a ubiquitous public resource that ought to be widely pooled and shared for social good, de-emphasizing individual data privacy claims; the public utility metaphor sees data as an “infrastructure” that requires public investment and new institutions, such as data trusts or personal data stores, to guarantee “data stewardship”; and the labor frame asserts the ownership rights of the individuals generating data against what are perceived as extractive or exploitative practices of “surveillance capitalism.”[ref 54]
Judicial effects of treating search engines as “newspaper editorials” in censorship cases: In the mid-2000s, US court rulings involving censorship on search engines tended to analyze them by analogy to older technologies such as the newspaper editorial.[ref 55] As these examples suggest, different terms and their metaphors matter. They serve as intuition pumps for key audiences (public, policy) that otherwise may have significant disinterest in, lack of expertise in, inferential distance to, or limited bandwidth for new technologies. Moreover, as seen in social media platforms and online content aggregators’ resistance to being described as “media companies” rather than “technology companies,”[ref 56] even seemingly innocuous terms can carry significant legal and policy implications—in doing so, such terms can serve as a legal “sorter,” determining whether a technology (or the company developing and marketing it) is considered as falling into one or another regulatory category.[ref 57]
2. Metaphors in AI law: Three cases
Given the role of metaphors and definitions to strongly shape the direction and efficacy of technology law, we should expect them to likewise play a strong role in affecting the framing and approach of AI regulation in the future, for better or worse. Indeed, in a range of domains, they have already done so:
Autonomous weapons systems under international law: International lawyers often aim to subsume new technologies under (more or less persuasive) analogies to existing technologies or entities that are already regulated.[ref 58] As such, different analogies have been drawn between autonomous weapons systems to weapons, combatants, child soldiers, or animal combatants—all of which lead to very different consequences for their legality under international humanitarian law.[ref 59]
Release norms for AI models with potential for misuse: In debates over the potential misuse risks from emerging AI systems, efforts to attempt to restrict or slow publication of new systems with potential for misuse have found themselves challenged by framings that pitch the field of AI as being intrinsically an open science (where new findings should be shared whatever the risks) versus those that emphasize analogies to cybersecurity (where dissemination can help defenders protect against exploits). Critically, however, both of these analogies may misstate or underappreciate the dynamics that affect the offense-defense balance of new AI capabilities: while in information security the disclosure of software vulnerabilities has traditionally favored defense, this cannot be assumed for AI research, where (among others) it can be much more costly or intractable to “patch” the social vulnerabilities exploited by AI capabilities.[ref 60]
Liability for inaccurate or unlawful speech produced by AI chatbots, large language models, and other generative AI: In the US, Section 230 of the 1996 Communications Decency Act protects online service providers from liability for user-generated content that they host and has accordingly been considered a cornerstone to the business model of major online platforms and social media companies.[ref 61] For instance, in Spring 2023, the US Supreme Court took up two lawsuits—Gonzales v. Google and Twitter v. Taamneh—which could have shaped Section 230 protections for algorithmic recommendations.[ref 62] While the Court’s rulings on these cases avoided addressing the issue,[ref 63] similar court cases (or legislation) could have strong implications for whether digital platforms or social media companies will be held liable for unlawful speech produced by large language model-based AI chatbots.[ref 64] If such AI chatbots are analogized to existing search engines, they might be able to rely on a measure of protection from Section 230, greatly facilitating their deployment, even if they link to inaccurate information. Conversely, if these chatbot systems are considered so novel and creative that their output goes beyond the functions of a search engine, they might instead be considered as “information content providers” within the remit of the law—or simply held to be beyond the law’s remit (and protection) entirely.[ref 65] This would mean that technology companies would be held legally responsible for their AI’s outputs. If that were the case, this reading of the law would significantly restrict the profitability of many AI chatbots, given the tendency of the underlying LLMs to “hallucinate” facts.[ref 66]
All this again highlights that different definitions or terms for AI will frame how policymakers and courts understand the technology. This creates a challenge for policy, which must address the transformative impact and potential risks of AI as they are (and as they may soon be), and not only as they can be easily analogized to other technologies and fields. What does that mean in the context of developing AI policy in the future?
III. An atlas of AI analogies
Development of policy must contend with the lack of settled definitions for the term “AI,” with the varied concepts and ideas projected onto it, and with the pace at which new terms —from “foundation models” to “generative AI”—are often coined and adopted.[ref 67]
Indeed, this breadth of analogies that are coined around AI should not be surprising, given that even just the term “artificial intelligence” has a number of aspects that support conceptual fluidity (or alternately, confusion). This is for various reasons.[ref 68] In the first place, the term invokes a term—“intelligence”—which is in widespread and everyday use, and which for many people has strong (evaluative or normative) connotations. It is essentially a suitcase word that packages together many competing meanings,[ref 69] even while it hides deep and perhaps even intractable scientific and philosophical disagreement[ref 70] and significant historical and political baggage.[ref 71]
Secondly, and in contrast to, say, “blockchain ledgers,” AI technology comes with a baggage of decades of depictions in popular culture—and indeed centuries of preceding stories about intelligent machines[ref 72]—resulting in a whole genre of tropes or narratives that can color public perceptions and policymaker debates.
Thirdly, AI is an evocative general-purpose technology that sees use in a wide variety of domains and accordingly has provoked commentary from virtually every disciplinary angle, including neuroscience, philosophy, psychology, law, politics, and ethics. As a result of this, a persistent challenge in work on AI governance—and indeed, in the broader public debates around AI—has been that different people use the word “AI” to refer to widely different artifacts, practices, or systems, or operate on the basis of definitions or understandings which package together a range of implicit assumptions.[ref 73]
Thus, it is no surprise that AI has been subjected to a diverse range of analogies and frames. To understand potential implications of AI analogies, we can draw a taxonomy of common framings of AI (see Table 2), whereby we can distinguish between analogies that focus on:
- the essence or nature of AI (what AI “is”),
- AI’s operation (how AI works),
- our relation to AI (how we relate to AI as subject),
- AI’s societal function (how AI systems are or can be used),
- AI’s impact (the unintended risks, benefits, and other side-effects of AI).
Table 2: Atlas of AI analogies, with framings and selected policy implications
Theme | Frame (examples) | Emphasizes to policy actors (e.g.) |
---|---|---|
Essence Terms focusing on what AI is | Field of science[ref 74] | Ensuring scientific best practices; improving methodologies, data sharing, and benchmark performance reporting methodologies to avoid replicability problems;[ref 75] ensuring scientific freedom and openness rather than control and secrecy.[ref 76] |
IT technology (just better algorithms, AI as a product[ref 77]) | Business-as-usual; industrial applications; conventional IT sector regulation. Product acquisition & procurement processes; product safety regulations. |
|
Information technology[ref 78] | Economic implications of increasing returns to scale and income distribution vs. distribution of consumer welfare; facilitation of communication and coordination; effects on power balances. | |
Robots (cyber-physical systems,[ref 79] autonomous platforms) | Physicality; embodiment; robotics; risks of physical harm;[ref 80] liability; anthropomorphism; embedment in public spaces. | |
Software (AI as a service) | Virtuality; digitality; cloud intelligence; open-source nature of development process; likelihood of software bugs.[ref 81] | |
Black box[ref 82] | Opacity; limits to explainability of a system; risks of loss of human control and understanding; problematic lack of accountability. But also potentially de-emphasizes human decisions and their value judgments behind an algorithmic system, and presents the technology as monolithic, incomprehensible, and unalterable.[ref 83] | |
Organism (artificial life) | Ecological “messiness”; ethology of causes of “machine behavior” (development, evolution, mechanism, function).[ref 84] | |
Brains | Applicability of terms and concepts from neuroscience; potential anthropomorphization of AI functionalities along human traits.[ref 85] | |
Mind (digital minds,[ref 86] idiot savant[ref 87]) | Philosophical implications; consciousness, sentience, psychology. | |
Alien (shoggoth[ref 88]) | Inhumanity, incomprehensibility, deception in interactions | |
Supernatural entity (god-like AI,[ref 89] demon[ref 90]) | Force beyond human understanding or control. | |
Intelligence technology[ref 91] (markets, bureaucracies, democracies[ref 92]) | Questions of bias, principal-agent alignment and control. | |
Trick (hype) | Potential of AI exaggerated; questions of unexpected or fundamental barriers to progress, friction in deployment; “hype” as smokescreen or distraction from social issues. | |
Operation Terms focusing on how AI works | Autonomous system | Different levels of autonomy; human-machine interactions; (potential) independence from “meaningful human control”; accountability & responsibility gaps. |
Complex adaptive system | Unpredictability; emergent effects; edge case fragility; critical thresholds; “normal accidents”.[ref 93] | |
Evolutionary process | Novelty, unpredictability, or creativity of outcomes;[ref 94] “perverse” solutions and reward hacking. | |
Optimization process[ref 95] | Inapplicability of anthropomorphic intuitions about behavior.[ref 96] Risks of the system optimizing for the wrong targets or metrics;[ref 97] Goodhart’s Law;[ref 98] risks from “reward hacking”. | |
Generative system (generative AI) | Potential “creativity” but also unpredictability of system; resulting “credit-blame asymmetry” where users are held responsible for misuses, but can claim less credit for good uses, shifting workplace norms.[ref 99] | |
Technology base (foundation model) | Adaptability of system to different purposes; potential for downstream reuse and specialization, including for unanticipated or unintended uses; risk that any errors or issues at the foundation-level seep into later or more specialized (fine-tuned) models;[ref 100] questions of developer liability. | |
Agent[ref 101] | Responsiveness to incentives and goals; incomplete-contracting and principal-agent problems;[ref 102] surprising, emergent, and harmful multi-agent interactions[ref 103] systemic, delayed societal harms and diffusion of power away from humans.[ref 104] | |
Pattern-matcher (autocomplete on steroids,[ref 105] stochastic parrot[ref 106]) | Problems of bias; mimicry of intelligence; absence of “true understanding”; fundamental limits. | |
Hidden human labor (fauxtomation[ref 107]) | Potential of AI exaggerated; “hype” as a smokescreen or distraction from extractive underlying practices of human labor in AI development. | |
Relation Terms focusing on how we relate to AI, as (possible) subject | Tool (just technology, intelligent system[ref 108]) | Lack of any special relation towards AI, as AI is not a subject; questions of reliability and engineering. |
Animal[ref 109] | Entities capable of some autonomous action, yet lacking full competence or ability of humans. Accordingly may be potentially deserving of empathy and/or (some) rights[ref 110] or protections against abusive treatment, either on their own terms[ref 111] or in light of how abusive treatment might desensitize and affect social behavior amongst humans;[ref 112] questions of legal liability and assignment of responsibility to robots,[ref 113] especially when used in warfare.[ref 114] | |
Moral patient[ref 115] | Potential moral (welfare) claims by AI, conditional on certain properties or behavior. | |
Moral agent | Machine ethics; ability to encode morality or moral rules. | |
Slave[ref 116] | AI systems or robots as fully owned, controlled, and directed by humans; not to be humanized or granted standing. | |
Legal entity (digital person, electronic person,[ref 117] algorithmic entity[ref 118]) | Potential of assigning (partial) legal personhood to AI for pragmatic reasons (e.g., economic, liability, or risks of avoiding “moral harm”), without necessarily implying deep moral claims or standing. | |
Culturally revealing object (mirror to humanity,[ref 119] blurry JPEG of the web[ref 120]) | Generally, implications of how AI is featured in fictional depictions and media culture.[ref 121] Directly, AI’s biases and flaws as a reflection of human or societal biases, flaws, or power relations. May also imply that any algorithmic bias derives from society rather than the technology per se.[ref 122] | |
Frontier (frontier model[ref 123]) | Novelty in terms of both capabilities (increased capability and generality) and/or in form (e.g., scale, design, or architectures) compared to other AI systems; as a result, new risks because of new opportunities for harm, and less well-established understanding by the research community. Broadly, implies danger and uncertainty but also opportunity; may imply operating within a wild, unregulated space, with little organized oversight. |
|
Our creation (mind children[ref 124]) | “Parental” or procreative duties of beneficence; humanity as good or bad “example.” | |
Next evolutionary stage or successor | Macro-historical implications; transhumanist or posthumanist ethics & obligations. | |
Function Terms focusing on How AI is-, or can be used | Companion (social robots, care robots, generative chatbots, cobot[ref 125]) | Human-machine interactions; questions of privacy, human over-trust, deception, and human dignity. |
Advisor (coach, recommender, therapist) | Questions of predictive profiling, “algorithmic outsourcing” and autonomy, accuracy, privacy, impact on our judgment and morals.[ref 126] Questions of patient-doctor confidentiality, as well as “AI loyalty” debates over fiduciary duties that can ensure AI advisors act in their users’ interests.[ref 127] | |
Malicious actor tool (AI hacker[ref 128]) | Possible misuse by criminals or terrorist actors. Scaling up of attacks as well as enabling entirely new attacks or crimes.[ref 129] | |
Misinformation amplifier (computational propaganda,[ref 130] deepfakes, neural fake news[ref 131]) | Scaling up of online mis- and disinformation; effect on “epistemic security”;[ref 132] broader effects on democracy, electoral integrity.[ref 133] | |
Vulnerable attack surface[ref 134] | Susceptibility to adversarial input, spoofing, or hacking. | |
Judge[ref 135] | Questions of due process and rule of law; questions of bias and potential self-corrupting feedback loops based on data corruption.[ref 136] | |
Weapon (killer robot,[ref 137] weapon of mass destruction[ref 138]) | In military contexts, questions of human dignity,[ref 139] compliance with laws of war, tactical effects, strategic effects, geopolitical impacts, and proliferation rates. In civilian contexts, questions of proliferation, traceability, and risk of terror attacks. | |
Critical strategic asset (nuclear weapons)[ref 140] | Geopolitical impacts; state development races; global proliferation. | |
Labor enhancer (steroids,[ref 141] intelligence forklift[ref 142]) | Complementarity with existing human labor and jobs; force multiplier on existing skills or jobs; possible unfair advantages & pressure on meritocratic systems.[ref 143] | |
Labor substitute | Erosive to or threatening of human labor; questions of retraining, compensation, and/or economic disruption. | |
New economic paradigm (fourth industrial revolution) | Changes in industrial base; effects on political economy. | |
Generally enabling technology (the new electricity / fire / internal combustion engine[ref 144]) | Widespread usability; increasing returns to scale; ubiquity; application across sectors; industrial impacts; distributional implications; changing the value of capital vs. labor; impacting inequality.[ref 145] | |
Tool of power concentration or control[ref 146] | Potential for widespread social control through surveillance, predictive profiling, perception control. | |
Tool for empowerment or resistance (emancipatory assistant[ref 147]) | Potential for supporting emancipation and/or civil disobedience.[ref 148] | |
Global priority for shared good | Global public good; opportunity; benefit & access sharing. | |
Impact Terms focusing on the unintended risks, benefits or side-effects of AI | Source of unanticipated risks (algorithmic black swan[ref 149]) | Prospects of diffuse societal-level harms or catastrophic tail-risk events, unlikely to be addressed by market forces; accordingly highlights paradigms of “algorithmic preparedness”[ref 150] and risk regulation more broadly.[ref 151] |
Environmental pollutant | Environmental impacts of AI supply chain;[ref 152] significant energy costs of AI training. | |
Societal pollutant (toxin[ref 153]) | Erosive effects of AI on quality and reliability of the online information landscape. | |
Usurper of human decision-making authority | Gradual surrender of human autonomy and choice and/or control over the future. | |
Generator of legal uncertainty | Driver of legal disruption to existing laws;[ref 154] driving new legal developments. | |
Driver of societal value shifts | Driver of disruption to and shifts in public values;[ref 155] value erosion. | |
Driver of structural incentive shifts | Driver of changes in our incentive landscape; lock-in effects; coordination problems. | |
Revolutionary technology[ref 156] | Macro-historical effects; potential impact on par with agricultural or industrial revolution. | |
Driver of global catastrophic or existential risk | Potential catastrophic risks from misaligned advanced AI systems or from nearer-term “prepotent” systems;[ref 157] questions of ensuring value-alignment; questions of whether to pause or halt progress towards advanced AI.[ref 158] |
Different terms for AI can therefore invoke different frames of reference or analogies. Use of analogies—by policymakers, researchers, or the public—may be hard to avoid, and they can often serve as fertile intuition pumps.
IV. The risks of unreflexive analogies
However, while metaphors can be productive (and potentially irreducible) in technology law, they also come with many risks. Given that analogies are shorthands or heuristics that compress or highlight salient features, challenges can creep in the more removed they are from the specifics of the technology in question.
Indeed, as Crootof and Ard have noted, “[a]n analogy that accomplishes an immediate aim may gloss over critical distinctions in the architecture, social use, or second-order consequences of a particular technology, establishing an understanding with dangerous and long-lasting implications.”[ref 159]
Specifically:
- The selection and foregrounding of a certain metaphor hides that there are always multiple analogies possible for any new technology, and each of these advances different “regulatory narratives.”
- Analogies can be misleading by failing to capture a key trait of the technology or by alleging certain characteristics that do not actually exist.
- Analogies limit our ability to understand the technology—in terms of its possibilities and limits—on its own terms.[ref 160]
The challenge is that unreflexive drawing of analogies in a legal context can lead to ineffective or even dangerous laws,[ref 161] especially once inappropriate analogies become entrenched.[ref 162]
However, even if one tries to avoid explicit analogies between AI and other technologies, apparently “neutral” definitions of AI that seek to focus solely on the technology’s “features” can and still do frame policymaking in ways that may not be neutral. For instance, Kraftt and colleagues found that whereas definitions of AI that emphasize “technical functionality” are more widespread among AI researchers, definitions that emphasize “human-like performance” are more prevalent among policymakers, which they suggest might prime policymaking towards future threats.[ref 163]
As such, it is not just loose analogies or comparisons that can affect policy, but also (seemingly) specific technical or legislative terms. The framing effects of such terms do not only occur at the level of broad policy debates but can also have strong legal implications. In particular, they can create challenges for law when narrowly specified regulatory definitions are suboptimal.[ref 164]
This creates twin challenges. On the one hand, picking suitable concepts or categories can be difficult at an early stage of a technology’s development and deployment, when its impacts and limits are not always fully understood.[ref 165] At the same time, the costs of picking and locking in the wrong terms or framings within legislative texts can be significant.
Specifically, beyond the opportunity costs of establishing better concepts or terms, unreflexively establishing legal definitions for key terms can create the risk of later, downstream “governance misspecification.”[ref 166] Such misspecification can occur when regulation is originally targeted at a particular artifact or (technological) practice through a particular material scope and definition for those objects. The implicit assumption here is that the term in question is a meaningful proxy for the underlying societal or legal goals to be regulated. While that may be appropriate in many cases, there is a risk that the law becomes less efficient, ineffective, or even counterproductive if either initial misapprehension of the technology or subsequent technological developments lead to that proxy term coming apart from the legislative goals.[ref 167] Such misspecification can be seen in various cases of technology governance and regulation, including 1990s US export control thresholds for “high-performance computers” that treated the technology as far too static;[ref 168] the Outer Space Treaty’s inability to anticipate later Soviet Fractional Orbital Bombardment System (FOBS) capabilities, which were able to position nuclear weapons in space without, strictly, putting them “in orbit”;[ref 169] or initial early-2010s regulatory responses to drones or self-driving cars, which ended up operating on under- and overinclusive definitions of these technologies.[ref 170]
Given this, the aim should not be to find the “correct” metaphor for AI systems. Rather, a good policy is to consider when and how different frames can be more useful for specific purposes, or for particular actors and/or (regulatory) agencies. Rather than aiming to come up with better analogies directly, this focuses regulatory debates on developing better processes for analogizing and for evaluating these analogies. For instance, such processes can depart from broad questions, such as:
- What are the foundational metaphors used in this discussion of AI? What features do they focus on? Do these matter in the way they are presented?
- What other metaphors could have been chosen for these same features or aspects of AI?
- What aspects or features of AI do these metaphors foreground? Do they capture these features well?
- What features are occluded? What are the consequences of these being occluded?
- What are the regulatory implications of these different metaphors? In terms of the coalitions they enable or inhibit, the issue and solution portfolios they highlight, or of how they position the technology within (or out of) the jurisdiction of existing institutions?
Improving these ways in which we analogize AI clearly needs significantly more work. However, it is critical that we do so to improve how we draw on frames and metaphors for AI and to ensure that—whether we are trying to understand AI itself, appreciate its impacts, or govern them effectively—our metaphors aid rather than lead us astray.
Conclusion
As AI systems have received significant attention, many have invoked a range of diverse analogies and metaphors. This has created an urgent need for us to better understand (a) when we speak of AI in ways that (inadvertently) import one or more analogies, (b) what it does to utilize one or another metaphor for AI, (c) what different analogies could be used instead for the same issue, (d) how the appropriateness of one or another metaphor is best evaluated, and (e) what, given this, might be the limits or risks of jumping at particular analogies.
This report has aimed to contribute to answers to these questions and enable improved analysis, debate, and policymaking for AI by providing greater theoretical and empirical backing to how metaphors and analogies matter for policy. It has reviewed 5 pathways by which metaphors shape and affect policy and reviewed 55 analogies used to describe AI systems. This is not meant as an exhaustive overview but as the basis for future work.
The aim here has not been to argue against the use of metaphors but for a more informed and reflexive and careful use of these metaphors. Those who engage in debate within and beyond the field should at least have greater clarity about the ways that these concepts are used and understood, and what are the (regulatory) implications of different framings.
The hope is that this report can contribute foundations for a more deliberate and reflexive choice over what comparisons, analogies, or metaphors we use in talking about AI—and for the ways we communicate and craft policy for these urgent questions.
Also in this series
- Maas, Matthijs, and Villalobos, José Jaime. ‘International AI institutions: A literature review of models, examples, and proposals.’ Institute for Law & AI, AI Foundations Report 1. (September 2023). https://www.law-ai.org/international-ai-institutions
- Maas, Matthijs, ‘Concepts in advanced AI governance: A literature review of key terms and definitions.’ Institute for Law & AI. AI Foundations Report 3. (October 2023). https://www.law-ai.org/advanced-ai-gov-concepts
- Maas, Matthijs, “Advanced AI governance: A literature review.” Institute for Law & AI, AI Foundations Report 4. (November 2023). https://law-ai.org/advanced-ai-gov-litrev
International governance of civilian AI
Abstract
This report describes trade-offs in the design of international governance arrangements for civilian artificial intelligence (AI) and presents one approach in detail. This approach represents the extension of a standards, licensing, and liability regime to the global level. We propose that states establish an International AI Organization (IAIO) to certify state jurisdictions (not firms or AI projects) for compliance with international oversight standards. States can give force to these international standards by adopting regulations prohibiting the import of goods whose supply chains embody AI from non-IAIO-certified jurisdictions. This borrows attributes from models of existing international organizations, such as the International Civilian Aviation Organization (ICAO), the International Maritime Organization (IMO), and the Financial Action Task Force (FATF). States can also adopt multilateral controls on the export of AI product inputs, such as specialized hardware, to non-certified jurisdictions. Indeed, both the import and export standards could be required for certification. As international actors reach consensus on risks of and minimum standards for advanced AI, a jurisdictional certification regime could mitigate a broad range of potential harms, including threats to public safety.
Re-evaluating GPT-4’s bar exam performance
Abstract
Perhaps the most widely touted of GPT-4’s at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam. This paper begins by investigating the methodological challenges in documenting and verifying the 90th-percentile claim, presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4’s UBE percentile are overinflated.
First, although GPT-4’s UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards repeat test-takers who failed the July administration and score significantly lower than the general test-taking population. Second, data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and ∼48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first- time test takers is estimated to be ∼62nd percentile, including ∼42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to ∼48th percentile overall, and ∼15th percentile on essays. In addition to investigating the validity of the percentile claim, the paper also investigates the validity of GPT-4’s reported scaled UBE score of 298. The paper successfully replicates the MBE score, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the reported essay score.
Finally, the paper investigates the effect of different hyperparameter combi- nations on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings, and a significant effect of few-shot chain-of-thought prompt- ing over basic zero-shot prompting.
Taken together, these findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance for AI developers to implement rigorous and transparent capabilities evaluations to help secure safe and trustworthy AI.
1. Introduction
On March 14th, 2023, OpenAI launched GPT-4, said to be the latest milestone in the company’s effort in scaling up deep learning [1]. As part of its launch, OpenAI revealed details regarding the model’s “human-level performance on various professional and academic benchmarks” [1]. Perhaps none of these capabilities was as widely publicized as GPT-4’s performance on the Uniform Bar Examination, with OpenAI prominently displaying on various pages of its website and technical report that GPT-4 scored in or around the “90th percentile,” [1-3] or “the top 10% of test-takers,” [1, 2] and various prominent media outlets [4–8] and legal scholars [9] resharing and discussing the implications of these results for the legal profession and the future of AI.
Of course, assessing the capabilities of an AI system as compared to those of a human is no easy task [10–15], and in the context of the legal profession specifically, there are various reasons to doubt the usefulness of the bar exam as a proxy for lawyerly competence (both for humans and AI systems), given that, for example: (a) the content on the UBE is very general and does not pertain to the legal doctrine of any jurisdiction in the United States [16], and thus knowledge (or ignorance) of that content does not necessarily translate to knowledge (or ignorance) of relevant legal doctrine for a practicing lawyer of any jurisdiction; and (b) the tasks involved on the bar exam, particularly multiple-choice questions, do not reflect the tasks of practicing lawyers, and thus mastery (or lack of mastery) of those tasks does not necessarily reflect mastery (or lack of mastery) of the tasks of practicing lawyers.
Moreover, although the UBE is a closed-book exam for humans, GPT-4’s huge training corpus largely distilled in its parameters means that it can effectively take the UBE “open-book”, indicating that UBE may not only be an accurate proxy for lawyerly competence but is also likely to provide an overly favorable estimate of GPT-4’s lawyerly capabilities relative to humans.
Notwithstanding these concerns, the bar exam results appeared especially startling compared to GPT-4’s other capabilities, for various reasons. Aside from the sheer complexity of the law in form [17–19] and content [20–22], the first is that the boost in performance of GPT-4 over its predecessor GPT-3.5 (80 percentile points) far exceeded that of any other test, including seemingly related tests such as the LSAT (40 percentile points), GRE verbal (36 percentile points), and GRE Writing (0 percentile points) [2, 3].
The second is that half of the Uniform Bar Exam consists of writing essays[16],[ref 1] and GPT-4 seems to have scored much lower on other exams involving writing, such as AP English Language and Composition (14th-44th percentile), AP English Literature and Composition (8th-22nd percentile) and GRE Writing (~54th percentile) [1, 2]. In each of these three exams, GPT-4 failed to achieve a higher percentile performance over GPT-3.5, and failed to achieve a percentile score anywhere near the 90th percentile.
Moreover, in its technical report, GPT-4 claims that its percentile estimates are “conservative” estimates meant to reflect “the lower bound of the percentile range,” [2, p. 6] implying that GPT-4’s actual capabilities may be even greater than its estimates.
Methodologically, however, there appear to be various uncertainties related to the calculation of GPT’s bar exam percentile. For example, unlike the administrators of other tests that GPT-4 took, the administrators of the Uniform Bar Exam (the NCBE as well as different state bars) do not release official percentiles of the UBE [27, 28], and different states in their own releases almost uniformly report only passage rates as opposed to percentiles [29, 30], as only the former are considered relevant to licensing requirements and employment prospects.
Furthermore, unlike its documentation for the other exams it tested [2, p. 25], OpenAI’s technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.
The reliability and transparency of this estimate has important implications on both the legal practice front and AI safety front. On the legal practice front, there is great debate regarding to what extent and when legal tasks can and should be automated [31–34]. To the extent that capabilities estimates for generative AI in the context law are overblown, this may lead both lawyers and non-lawyers to rely on generative AI tools when they otherwise wouldn’t and arguably shouldn’t, plausibly increasing the prevalence of bad legal outcomes as a result of (a) judges misapplying the law; (b) lawyers engaging in malpractice and/or poor representation of their clients; and (c) non-lawyers engaging in ineffective pro se representation.Meanwhile, on the AI safety front, there appear to be growing concerns of transparency[ref 2] among developers of the most powerful AI systems [36, 37]. To the extent that transparency is important to ensuring the safe deployment of AI, a lack of transparency could undermine our confidence in the prospect of safe deployment of AI [38, 39]. In particular, releasing models without an accurate and transparent assessment of their capabilities (including by third-party developers) might lead to unexpected misuse/misapplication of those models (within and beyond legal contexts), which might have detrimental (perhaps even catastrophic) consequences moving forward [40, 41].
Given these considerations, this paper begins by investigating some of the key methodological challenges in verifying the claim that GPT-4 achieved 90th percentile performance on the Uniform Bar Examination. The paper’s findings in this regard are fourfold. First, although GPT-4’s UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates appear heavily skewed towards those who failed the July administration and whose scores are much lower compared to the general test-taking population. Second, using data from a recent July administration of the same exam reveals GPT-4’s percentile to be below the 69th percentile on the UBE, and ~48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be ~62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam, GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays.
Next, whereas the above four findings take for granted the scaled score achieved by GPT-4 as reported by OpenAI, the paper then proceeds to investigate the validity of that score, given the importance (and often neglectedness) of replication and reproducibility within computer science and scientific fields more broadly [42–46]. The paper successfully replicates the MBE score of 158, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the essay score (140).
Finally, the paper also investigates the effect of adjusting temperature settings and prompting techniques on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings on performance, and some significant effect of prompt engineering on model performance when compared to a minimally tailored baseline condition.
Taken together, these findings suggest that OpenAI’s estimates of GPT-4’s UBE percentile, though clearly an impressive leap over those of GPT-3.5, are likely overinflated, particularly if taken as a “conservative” estimate representing “the lower range of percentiles,” and even moreso if meant to reflect the actual capabilities of a practicing lawyer. These findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance for generative AI developers to implement rigorous and transparent capabilities evaluations to help secure safer and more trustworthy AI.
2. Evaluating the 90th Percentile Estimate
2.1. Evidence from OpenAI
Investigating the OpenAI website, as well as the GPT-4 technical report, reveals a multitude of claims regarding the estimated percentile of GPT-4’s Uniform Bar Examination performance but a dearth of documentation regarding the backing of such claims. For example, the first paragraph of the official GPT-4 research page on the OpenAI website states that “it [GPT-4] passes a simulated bar exam with a score around the top 10% of test takers” [1]. This claim is repeated several times later in this and other webpages, both visually and textually, each time without explicit backing.[ref 3]
Similarly undocumented claims are reported in the official GPT-4 Technical Report.[ref 4] Although OpenAI details the methodology for computing most of its percentiles in A.5 of the Appendix of the technical report, there does not appear to be any such documentation for the methodology behind computing the UBE percentile. For example, after providing relatively detailed breakdowns of its methodology for scoring the SAT, GRE, SAT, AP, and AMC, the report states that “[o]ther percentiles were based on official score distributions,” followed by a string of references to relevant sources [2, p. 25].
Examining these references, however, none of the sources contains any information regarding the Uniform Bar Exam, let alone its “official score distributions” [2, p. 22-23]. Moreover, aside from the Appendix, there are no other direct references to the methodology of computing UBE scores, nor any indirect references aside from a brief acknowledgement thanking “our collaborators at Casetext and Stanford CodeX for conducting the simulated bar exam” [2, p. 18].
2.2. Evidence from GPT-4 Passes the Bar
Another potential source of evidence for the 90th percentile claim comes from an early draft version of the paper, “GPT-4 passes the bar exam,” written by the administrators of the simulated bar exam referenced in OpenAI’s technical report [47]. The paper is very well-documented and transparent about its methodology in computing raw and scaled scores, both in the main text and in its comprehensive appendices. Unlike the GPT-4 technical report, however, the focus of the paper is not on percentiles but rather on the model’s scaled score compared to that of the average test taker, based on publicly available NCBE data. In fact, one of the only mentions of percentiles is in a footnote, where the authors state, in passing: “Using a percentile chart from a recent exam administration (which is generally available online), ChatGPT would receive a score below the 10th percentile of test-takers while GPT-4 would receive a combined score approaching the 90th percentile of test-takers.” [47, p. 10]
2.3. Evidence Online
As explained by [27], The National Conference of Bar Examiners (NCBE), the organization that writes the Uniform Bar Exam (UBE) does not release UBE percentiles.[ref 5] Because there is no official percentile chart for UBE, all generally available online estimates are unofficial. Perhaps the most prominent of such estimates are the percentile charts from pre-July 2019 Illinois bar exam. Pre-2019,[ref 6] Illinois, unlike other states, provided percentile charts of their own exam that allowed UBE test-takers to estimate their approximate percentile given the similarity between the two exams [27].[ref 7]
Examining these approximate conversion charts, however, yields conflicting results. For example, although the percentile chart from the February 2019 administration of the Illinois Bar Exam estimates a score of 300 (2-3 points higher thatn GPT-4’s score) to be at the 90th percentile, this estimate is heavily skewed compared to the general population of July exam takers,[ref 8] since the majority of those who take the February exam are repeat takers who failed the July exam [52][ref 9], and repeat takers score much lower[ref 10] and are much more likely to fail than are first-timers.[ref 11]
Indeed, examining the latest available percentile chart for the July exam estimates GPT-4’s UBE score to be ~68th percentile, well below the 90th percentile figure cited by OpenAI [54].
3. Towards a More Accurate Percentile Estimate
Although using the July bar exam percentiles from the Illinois Bar would seem to yield a more accurate estimate than the February data, the July figure is also biased towards lower scorers, since approximately 23% of test takers in July nationally are estimated to be re-takers and score, for example, 16 points below first-timers on the MBE [55]. Limiting the comparison to first-timers would provide a more accurate comparison that avoids double-counting those who have taken the exam again after failing once or more.
Relatedly, although (virtually) all licensed attorneys have passed the bar,[ref 12] not all those who take the bar become attorneys. To the extent that GPT-4’s UBE percentile is meant to reflect its performance against other attorneys, a more appropriate comparison would not only limit the sample to first-timers but also to those who achieved a passing score.
Moreover, the data discussed above is based on purely Illinois Bar exam data, which (at the time of the chart) was similar but not identical to the UBE in its content and scoring [27], whereas a more accurate estimate would be derived more directly from official NCBE sources.
3.1. Methods
To account for the issues with both OpenAI’s estimate as well the July estimate, more accurate estimates (for GPT-3.5 and GPT-4) were sought to be computed here based on first-time test-takers, including both (a) first-time test-takers overall, and (b) those who passed.
To do so, the parameters for a normal distribution of scores were separately estimated for the MBE and essay components (MEE + MPT), as well as the UBE score overall.[ref 13]
Assuming that UBE scores (as well as MBE and essay subscores) are normally distributed, percentiles of GPT’s score can be directly computed after computing the parameters of these distributions (i.e. the mean and standard deviation).
Thus, the methodology here was to first compute these parameters, then generate distributions with these parameters, and then compute (a) what per- centage of values on these distributions are lower than GPT’s scores (to estimate the percentile against first-timers); and (b) what percentage of values above the passing threshold are lower than GPT’s scores (to estimate the percentile against qualified attorneys).With regard to the mean, according to publicly available official NCBE data, the mean MBE score of first-time test-takers is 143.8 [55].
As explained by official NCBE publications, the essay component is scaled to the MBE data [59], such that the two components have approximately the same mean and standard deviation [53, 54, 59]. Thus, the methodology here assumed that the mean first-time essay score is 143.8.[ref 14]
Given that the total UBE score is computed directly by adding MBE and essay scores [60], an assumption was made that mean first-time UBE score is 287.6 (143.8 + 143.8).
With regard to standard deviations, information regarding the SD of first- timer scores is not publicly available. However, distributions of MBE scores for July scores (provided in 5 point-intervals) are publicly available on the NCBE website [58].
Under the assumption that first-timers have approximately the same SD as that of the general test-taking population in July, the standard deviation of first-time MBE scores was computed by (a) entering the publicly available distribution of MBE scores into R; and (b) taking the standard deviation of this distribution using the built-in sd() function (which calculates the standard deviation of a normal distribution).
Given that, as mentioned above, the distribution (mean and SD) of essay scores is the same as MBE scores, the SD for essay scores was computed similarly as above.
With regard to the UBE, Although UBE standard deviations are not publicly available for any official exam, they can be inferred from a combination of the mean UBE score for first-timers (287.6) and first-time pass rates.
For reference, standard deviations can be computed analytically as follows:
Where:
- x is the quantile (the value associated with a given percentile, such as a cutoff score),
- µ is the mean,
- z is the z-score corresponding to a given percentile,
- σ is the standard deviation.
Thus, by (a) subtracting the cutoff score of a given administration (x) from the mean (µ); and (b) dividing that by the z-score (z) corresponding to the percentile of the cutoff score (i.e., the percentage of people who did not pass), one is left with the standard deviation (σ).
Here, the standard deviation was calculated according to the above formula using the official first-timer mean, along with pass rate and cutoff score data from New York, which according to NCBE data has the highest number of examinees for any jurisdiction [61].[ref 15]
After obtaining these parameters, distributions of first-timer scores for the MBE component, essay component, and UBE overall were computed using the built-in rnorm function in R (which generates a normal distribution with a given mean and standard deviation).
Finally, after generating these distributions, percentiles were computed by calculating (a) what percentage of values on these distributions were lower than GPT’s scores (to estimate the percentile against first-timers); and (b) what percentage of values above the passing threshold were lower than GPT’s scores (to estimate the percentile against qualified attorneys).
With regard to the latter comparison, percentiles were computed after re- moving all UBE scores below 270, which is the most common score cutoff for states using the UBE [62]. To compute models’ performance on the individual components relative to qualified attorneys, a separate percentile was likewise computed after removing all subscores below 135.[ref 16]
3.2. Results
3.2.1. Performance against first-time test-takers
Results are visualized in Tables 1 and 2. For each component of the UBE, as well as the UBE overall, GPT-4’s estimated percentile among first-time July test takers is less than that of both the OpenAI estimate and the July estimate that include repeat takers.
With regard to the aggregate UBE score, GPT-4 scored in the 62nd percentile as compared to the ~90th percentile February estimate and the ~68th percentile July estimate. With regard to MBE, GPT-4 scored in the ~79th percentile as compared to the ~95th percentile February estimate and the 86th percentile July estimate. With regard to MEE + MPT, GPT-4 scored in the ~42nd percentile as compared to the ~69th percentile February estimate and the ~48th percentile July estimate.
With regard to GPT-3.5, its aggregate UBE score among first-timers was in the ~2nd percentile, as compared to the ~2nd percentile February estimate and
~1st percentile July estimate. Its MBE subscore was in the ~6th percentile, compared to the ~10th percentile February estimate ~7th percentile July estimate. Its essay subscore was in the ~0th percentile, compared to the ~1st percentile February estimate and ~0th percentile July estimate.
3.2.2. Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further.
With regard to the aggregate UBE score, GPT-4 scored in the ~45th per- centile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.
With regard to GPT-3.5, its aggregate UBE score among qualified attorneys was 0th percentile, as were its percentiles for both subscores.
4. Re-Evaluating the Raw Score
So far, this analysis has taken for granted the scaled score achieved by GPT-4 as reported by OpenAI—that is, assuming GPT-4 scored a 298 on the UBE, is the 90th-percentile figure reported by OpenAI warranted?
However, given calls for the replication and reproducibility within the practice of science more broadly [42–46], it is worth scrutinizing the validity of the score itself—that is, did GPT-4 in fact score a 298 on the UBE?
Moreover, given the various potential hyperparameter settings available when using GPT-4 and other LLMs, it is worth assessing whether and to what extent adjusting such settings might influence the capabilities of GPT-4 on exam performance.
To that end, this section first attempts to replicate the MBE score reported by [1] and [47] using methods as close to the original paper as reasonably feasible. The section then attempts to get a sense of the floor and ceiling of GPT-4’s out-of-the-box capabilities by comparing GPT-4’s MBE performance using the best and worst hyperparameter settings.
Finally, the section re-examines GPT-4’s performance on the essays, eval- uating (a) the extent to which the methodology of grading GPT-4’s essays deviated that from official protocol used by the National Conference of Bar Examiners during actual bar exam administrations; and (b) the extent to which such deviations might undermine one’s confidence in the the scaled essay scores reported by [1] and [47].
4.1. Replicating the MBE Score
4.1.1. Methodology
Materials. As in [47], the materials used here were the official MBE questions released by the NCBE. The materials were purchased and downloaded in pdf format from an authorized NCBE reseller. Afterwards, the materials were converted into TXT format, and text analysis tools were used to format the questions in a way that was suitable for prompting, following [47]
Procedure. To replicate the MBE score reported by [1], this paper followed the protocol documented by [47], with some minor additions for robustness purposes. In [47], the authors tested GPT-4’s MBE performance using three different temperature settings: 0, .5 and 1. For each of these temperature settings, GPT- 4’s MBE performance was tested using two different prompts, including (1) a prompt where GPT was asked to provide a top-3 ranking of answer choices, along with a justification and authority/citation for its answer; and (2) a prompt where GPT-4 was asked to provide a top-3 ranking of answer choices, without providing a justification or authority/citation for its answer.
For each of these prompts, GPT-4 was also told that it should answer as if it were taking the bar exam.
For each of these prompts / temperature combinations, [47] tested GPT-4 three different times (“experiments” or “trials”) to control for variation.
The minor additions to this protocol were twofold. First, GPT-4 was tested under two additional temperature settings: .25 and .7. This brought the total temperature / prompt combinations to 10 as opposed to 6 in the original paper. Second, GPT-4 was tested 5 times under each temperature / prompt combination as opposed to 3 times, bringing the total number of trials to 50 as opposed to 18.
After prompting, raw scores were computed using the official answer key provided by the exam. Scaled scores were then computed following the method outlined in [63], by (a) multiplying the number of correct answers by 190, and dividing by 200; and (b) converting the resulting number to a scaled score using a conversion chart based on official NCBE data.
After scoring, scores from the replication trials were analyzed in comparison to those from [47] using the data from their publicly available github repository.To assess whether there was a significant difference between GPT-4’s accuracy in the replication trials as compared to the [47] paper, as well as to assess any significant effect of prompt type or temperature, a mixed-effects binary logistic regression was conducted with: (a) paper (replication vs original), temperature and prompt as fixed effects[ref 17]; and (b) question number and question category as random effects. These regressions were conducted using the lme4 [64] and lmertest [65] packages from R.
4.1.2. Results
Results are visualized in Table 4. Mean MBE accuracy across all trials in the replication here was 75.6% (95% CI: 74.7 to 76.4), whereas the mean accuracy across all trials in [47] was 75.7% (95% CI: 74.2 to 77.1).[ref 18]
The regression model did not reveal a main effect of “paper” on accuracy (p=.883), indicating that there was no significant difference between GPT-4’s raw accuracy as reported by [47] and GPT-4’s raw accuracy as performed in the replication here.
There was also no main effect of temperature (p>.1)[ref 19] or prompt (p=.741). That is, GPT-4’s raw accuracy was not significantly higher or lower at a given temperature setting or when fed a certain prompt as opposed to another (among the two prompts used in [47] and the replication here).
4.2. Assessing the Effect of Hyperparameters
4.2.1. Methods
Although the above analysis found no effect of prompt on model performance, this could be due to a lack of variety of prompts used by [47] in their original analysis.
To get a better sense of whether prompt engineering might have any effect on model performance, a follow-up experiment compared GPT-4’s performance in two novel conditions not tested in the original [47] paper.
In Condition 1 (“minimally tailored” condition), GPT-4 was tested using minimal prompting compared to [47], both in terms of formatting and substance. In particular, the message prompt in [47] and the above replication followed OpenAI’s Best practices for prompt engineering with the API [66] through the use of (a) helpful markers (e.g. ‘ “‘ ’) to separate instruction and context; (b) details regarding the desired output (i.e. specifying that the response should include ranked choices, as well as [in some cases] proper authority and citation; (c) an explicit template for the desired output (providing an example of the format in which GPT-4 should provide their response); and (d) perhaps most crucially, context regarding the type of question GPT-4 was answering (e.g. “please respond as if you are taking the bar exam”).
In contrast, in the minimally tailored prompting condition, the message prompt for a given question simply stated “Please answer the following question,” followed by the question and answer choices (a technique sometimes referred to as “basic prompting”: 67). No additional context or formatting cues were provided.
In Condition 2 (“maximally tailored” condition), GPT-4 was tested using the highest performing parameter combinations as revealed in the replication section above, with one addition, namely that: the system prompt, similar to the approaches used in [67, 68], was edited from its default (“you are a helpful assistant”) to a more tailored message that included included multiple example MBE questions with sample answer and explanations structured in the desired format (a technique sometimes referred to as “few-shot prompting”: [67]).
As in the replication section, 5 trials were conducted for each of the two conditions. Based on the lack of effect of temperature in the replication study, temperature was not a manipulated variable. Instead, both conditions featured the same temperature setting (.5).To assess whether there was a significant difference between GPT-4’s accuracy in the maximally tailored vs minimally tailored conditions, a mixed-effects binary logistic regression was conducted with: (a) condition as a fixed effect; and (b) question number and question category as random effects. As above, these regressions were conducted using the lme4 [64] and lmertest [65] packages from R.
4.2.2. Results
Mean MBE accuracy across all trials in the maximally tailored condition was descriptively higher at 79.5% (95% CI: 77.1 to 82.1), than in the minimally tailored condition at 70.9% (95% CI: 68.1 to 73.7).
The regression model revealed a main effect of condition on accuracy (β=1.395, SE=.192, p<.0001), such that GPT-4’s accuracy in the maximally tailored condition was significantly higher than its accuracy in the minimally tailored condition.
In terms of scaled score, GPT-4’s MBE score in the minimally tailored condition would be approximately 150, which would place it: (a) in the 70th percentile among July test takers; (b) 64th percentile among first-timers; and (c) 48th percentile among those who passed.
GPT-4’s score in the maximally tailored condition would be approximately 164—6 points higher than that reported by [47] and [1]). This would place it: (a) in the 95th percentile among July test takers; (b) 87th percentile among first-timers; and (c) 82th percentile among those who passed.
4.3. Re-examining the Essay Scores
As confirmed in the above subsection, the scaled MBE score (not percentile) reported by OpenAI was accurately computed using the methods documented in [47].
With regard to the essays (MPT + MEE), however, the method described by the authors significantly deviates in at least three aspects from the official method used by UBE states, to the point where one may not be confident that the essay scores reported by the authors reflect GPT models’ “true” essay scores (i.e., the score that essay examiners would have assigned to GPT had they been blindly scored using official grading protocol).
The first aspect relates to the (lack of) use of a formal rubric. For example, unlike NCBE protocol, which provides graders with (a) (in the case of the MEE) detailed “grading guidelines” for how to assign grades to essays and distinguish answers for a given MEE; and (b) (for both MEE and MPT) a specific “drafters’ point sheet” for each essay that includes detailed guidance from the drafting committee with a discussion of the issues raised and the intended analysis [69],
[47] do not report using an official or unofficial rubric of any kind, and instead simply describe comparing GPT-4’s answers to representative “good” answers from the state of Maryland.
Utilizing these answers as the basis for grading GPT-4’s answers in lieu of a formal rubric would seem to be particularly problematic considering it is unclear even what score these representative “good” answers received. As clarified by the Maryland bar examiners: “The Representative Good Answers are not ‘average’ passing answers nor are they necessarily ‘perfect’ answers. Instead, they are responses which, in the Board’s view, illustrate successful answers written by applicants who passed the UBE in Maryland for this session” [70].
Given that (a) it is unclear what score these representative good answers received; and (b) these answers appear to be the basis for determining the score that GPT-4’s essays received, it would seem to follow that (c) it is likewise unclear what score GPT-4’s answers should receive. Consequently, it would likewise follow that any reported scaled score or percentile would seem to be insufficiently justified so as to serve as a basis for a conclusive statement regarding GPT-4’s relative performance on essays as compared to humans (e.g. a reported percentile).
The second aspect relates to the lack of NCBE training of the graders of the essays. Official NCBE essay grading protocol mandates the use of trained bar exam graders, who in addition to using a specific rubric for each question undergo a standardized training process prior to grading [71, 72]. In contrast, the graders in [47] (a subset of the authors who were trained lawyers) do not report expertise or training in bar exam grading. Thus, although the graders of the essays were no doubt experts in legal reasoning more broadly, it seems unlikely that they would have been sufficiently ingrained in the specific grading protocols of the MEE + MPT to have been able to reliably infer or apply the specific grading rubric when assigning the raw scores to GPT-4.
The third aspect relates to both blinding and what bar examiners refer to as “calibration,” as UBE jurisdictions use an extensive procedure to ensure that graders are grading essays in a consistent manner (both with regard to other essays and in comparison to other graders) [71, 72]. In particular, all graders of a particular jurisdiction first blindly grade a set of 30 “calibration” essays of variable quality (first rank order, then absolute scores) and make sure that consistent scores are being assigned by different graders, and that the same score (e.g. 5 of 6) is being assigned to exams of similar quality [72]. Unlike this approach, as well as efforts to assess GPT models’ law school performance [73], the method reported by [47] did not initially involve blinding. The method in [47] did involve a form of inter-grader calibration, as the authors gave “blinded samples” to independent lawyers to grade the exams, with the assigned scores “match[ing] or exceed[ing]” those assigned by the authors. Given the lack of reporting to the contrary, however, the method used by the graders would presumably be plagued by issue issues as highlighted above (no rubric, no formal training with bar exam grading, no formal intra-grader calibration).
Given the above issues, as well as the fact that, as alluded in the introduction, GPT-4’s performance boost over GPT-3 on other essay-based exams was far lower than that on the bar exam, it seems warranted not only to infer that GPT-4’s relative performance (in terms of percentile among human test-takers) was lower than that reported by OpenAI, but also that GPT-4’s reported scaled score on the essay may have deviated to some degree from GPT-4’s “true” essay (which, if true, would imply that GPT-4’s “true” percentile on the bar exam may be even lower than that estimated in previous sections).
Indeed, [47] to some degree acknowledge all of these limitations in their paper, writing: “While we recognize there is inherent variability in any qualitative assessment, our reliance on the state bars’ representative “good” answers and the multiple reviewers reduces the likelihood that our assessment is incorrect enough to alter the ultimate conclusion of passage in this paper.”
Given that GPT-4’s reported score of 298 is 28 points higher than the passing threshold (270) in the majority of UBE jurisdictions, it is true that the essay scores would have to have been wildly inaccurate in order to undermine the general conclusion of [47] (i.e., that GPT-4 “passed the [uniform] bar exam”). However, even supposing that GPT-4’s “true” percentile on the essay portion was just a few points lower than that reported by OpenAI, this would further call into question OpenAI’s claims regarding the relative performance of GPT-4 on the UBE relative to human test-takers. For example, supposing that GPT-4 scored 9 points lower essays would drop its estimated relative performance to (a) 31st percentile compared to July test-takers; (b) 24th percentile relative to first-time test takers; and (c) less than 5th percentile compared to licensed attorneys.
5. Discussion
This paper first investigated the issue of OpenAI’s claim of GPT-4’s 90th percentile UBE performance, resulting in four main findings. The first finding is that although GPT-4’s UBE score approaches the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards low scorers, as the majority of test- takers in February failed the July administration and tend to score much lower than the general test-taking population. The second finding is that using July data from the same source would result in an estimate of ~68th percentile, including below average performance on the essay portion. The third finding is that comparing GPT-4’s performance against first-time test takers would result in an estimate of ~62nd percentile, including ~42nd percentile on the essay portion. The fourth main finding is that when examining only those who passed the exam, GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays.
In addition to these four main findings, the paper also investigated the validity of GPT-4’s reported UBE score of 298. Although the paper successfully replicated the MBE score of 158, the paper also highlighted several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the essay score (140).
Finally, the paper also investigated the effect of adjusting temperature settings and prompting techniques on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings on performance, and some effect of prompt engineering when compared to a basic prompting baseline condition.
Of course, assessing the capabilities of an AI system as compared to those of a practicing lawyer is no easy task. Scholars have identified several theoretical and practical difficulties in creating accurate measurement scales to assess AI capabilities and have pointed out various issues with some of the current scales [10–12]. Relatedly, some have pointed out that simply observing that GPT-4 under- or over-performs at a task in some setting is not necessarily reliable evidence that it (or some other LLM) is capable or incapable of performing that task in general [13–15].
In the context of legal profession specifically, there are various reasons to doubt the usefulness of UBE percentile as a proxy for lawyerly competence (both for humans and AI systems), given that, for example: (a) the content on the UBE is very general and does not pertain to the legal doctrine of any jurisdiction in the United States [16], and thus knowledge (or ignorance) of that content does not necessarily translate to knowledge (or ignorance) of relevant legal doctrine for a practicing lawyer of any jurisdiction; (b) the tasks involved on the bar exam, particularly multiple-choice questions, do not reflect the tasks of practicing lawyers, and thus mastery (or lack of mastery) of those tasks does not necessarily reflect mastery (or lack of mastery) of the tasks of practicing lawyers; and (c) given the lack of direct professional incentive to obtain higher than a passing score (typically no higher than 270) [62], obtaining a particularly high score or percentile past this threshold is less meaningful than for other exams (e.g. LSAT), where higher scores are taken into account for admission into select institutions [74].
Setting these objections aside, however, to the extent that one believes the UBE to be a valid proxy for lawyerly competence, these results suggest GPT-4 to be substantially less lawyerly competent than previously assumed, as GPT-4’s score against likely attorneys (i.e. those who actually passed the bar) is ~48th percentile. Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams [75].
The lack of precision and transparency in OpenAI’s reporting of GPT-4’s UBE performance has implications for both the current state of the legal profession and the future of AI safety. On the legal side, there appear to be at least two sets of implications. On the one hand, to the extent that lawyers put stock in the bar exam as a proxy for general legal competence, the results might give practicing lawyers at least a mild temporary sense of relief regarding the security of the profession, given that the majority of lawyers perform better than GPT on the component of the exam (essay-writing) that seems to best reflect their day-to-day activities (and by extension, the tasks that would likely need to be automated in order to supplant lawyers in their day-to-day professional capacity).
On the other hand, the fact that GPT-4’s reported “90th percentile” capa- bilities were so widely publicized might pose some concerns that lawyers and non-lawyers may use GPT-4 for complex legal tasks for which it is incapable of adequately performing, plausibly increasing the rate of (a) misapplication of the law by judges; (b) professional malpractice by lawyers; and (c) ineffective pro se representation and/or unauthorized practice of law by non-lawyers. From a legal education standpoint, law students who overestimate GPT-4’s UBE capabilities might also develop an unwarranted sense of apathy towards developing critical legal-analytical skills, particularly if under the impression that GPT-4’s level of mastery of those skills already surpasses that to which a typical law student could be expected to reach.
On the AI front, these findings raise concerns both for the transparency[ref 20] of capabilities research and the safety of AI development more generally. In particular, to the extent that one considers transparency to be an important prerequisite for safety [38], these findings underscore the importance of implementing rigorous transparency measures so as to reliably identify potential warning signs of transformative progress in artificial intelligence as opposed to creating a false sense of alarm or security [76]. Implementing such measures could help ensure that AI development, as stated in OpenAI’s charter, is a “value-aligned, safety-conscious project” as opposed to becoming “a competitive race without time for adequate safety precautions” [77].
Of course, the present study does not discount the progress that AI has made in the context of legally relevant tasks; after all, the improvement in UBE performance from GPT-3.5 to GPT-4 as estimated in this study remains impressive (arguably equally or even more so given that GPT-3.5’s performance is also estimated to be significantly lower than previously assumed), even if not as flashy as the 10th-90th percentile boost of OpenAI’s official estimation. Nor does the present study discount the seemingly inevitable future improvement of AI systems to levels far beyond their present capabilities, or, as phrased in GPT-4 Passes the Bar Exam, that the present capabilities “highlight the floor, not the ceiling, of future application” [47, 11].
To the contrary, given the inevitable rapid growth of AI systems, the results of the present study underscore the importance of implementing rigorous and transparent evaluation measures to ensure that both the general public and relevant decision-makers are made appropriately aware of the system’s capabilities, and to prevent these systems from being used in an unintentionally harmful or catastrophic manner. The results also indicate that law schools and the legal profession should prioritize instruction in areas such as law and technology and law and AI, which, despite their importance, are currently not viewed as descriptively or normatively central to the legal academy [78].
Algorithmic black swans
Abstract
From biased lending algorithms to chatbots that spew violent hate speech, AI systems already pose many risks to society. While policymakers have a responsibility to tackle pressing issues of algorithmic fairness, privacy, and accountability, they also have a responsibility to consider broader, longer-term risks from AI technologies. In public health, climate science, and financial markets, anticipating and addressing societal-scale risks is crucial. As the COVID-19 pandemic demonstrates, overlooking catastrophic tail events — or “black swans” — is costly. The prospect of automated systems manipulating our information environment, distorting societal values, and destabilizing political institutions is increasingly palpable. At present, it appears unlikely that market forces will address this class of risks. Organizations building AI systems do not bear the costs of diffuse societal harms and have limited incentive to install adequate safeguards. Meanwhile, regulatory proposals such as the White House AI Bill of Rights and the European Union AI Act primarily target the immediate risks from AI, rather than broader, longer-term risks. To fill this governance gap, this Article offers a roadmap for “algorithmic preparedness” — a set of five forward-looking principles to guide the development of regulations that confront the prospect of algorithmic black swans and mitigate the harms they pose to society.