What might the end of Chevron deference mean for AI governance?

In January of this year, the Supreme Court heard oral argument in two cases—Relentless, Inc. v. Department of Commerce and Loper Bright Enterprises, Inc. v. Raimondo—that will decide the fate of a longstanding legal doctrine known as “Chevron deference.” During the argument, Justice Elena Kagan spoke at some length about her concern that eliminating Chevron deference would impact the U.S. federal government’s ability to “capture the opportunities, but also meet the challenges” presented by advances in Artificial Intelligence (AI) technology.

Eliminating Chevron deference would dramatically impact the ability of federal agencies to regulate in a number of important areas, from health care to immigration to environmental protection. But Justice Kagan chose to focus on AI for a reason. In addition to being a hot topic in government at the moment—more than 80 items of AI-related legislation have been proposed in the current Session of the U.S. Congress—AI governance could prove to be an area where the end of Chevron deference will be particularly impactful.

The Supreme Court will issue a decision in Relentless and Loper Bright at some point before the end of June 2024. Most commentators expect the Court’s conservative majority to eliminate (or at least to significantly weaken) Chevron deference, notwithstanding the objections of Justice Kagan and the other two members of the Court’s liberal minority. But despite the potential significance of this change, relatively little has been written about what it means for the future of AI governance. Accordingly, this blog post offers a brief overview of what Chevron deference is and what its elimination might mean for AI governance efforts.

What is Chevron deference?

Chevron U.S.A., Inc. v. Natural Resources Defense Council, Inc. is a 1984 Supreme Court case in which the Court laid out a framework for evaluating agency regulations interpreting federal statutes (i.e., laws). Under Chevron, federal courts defer to agency interpretations when: (1) the relevant part of the statute being interpreted is genuinely ambiguous, and (2) the agency’s interpretation is reasonable.

As an example of how this deference works in practice, consider the case National Electrical Manufacturers Association v. Department of Energy. There, a trade association of electronics manufacturers (NEMA) challenged a Department of Energy (DOE) regulation that imposed energy conservation standards on electric induction motors with power outputs between 0.25 and 3 horsepower. The DOE claimed that this regulation was authorized by a statute that empowered the DOE to create energy conservation standards for “small electric motors.” NEMA argued that motors with between 1 and 3 horsepower were too powerful to be “small electric motors” and that the DOE was therefore exceeding its statutory authority by attempting to regulate them. A federal court considered the language of the statute and concluded that the statute was ambiguous as to whether 1-3 horsepower motors could be “small electric motors.” The court also found that the DOE’s interpretation of the statute was reasonable. Therefore, the court deferred to the DOE’s interpretation under Chevron and the challenged regulation was upheld.

What effect would overturning Chevron have on AI governance efforts?

Consider the electric motor case discussed above. In a world without Chevron deference, the question considered by the court would have been “does the best interpretation of the statute allow DOE to regulate 1-3 horsepower motors?” rather than “is the DOE’s interpretation of this statute reasonable?” Under the new standard, lawsuits like NEMA’s would probably be more likely to succeed than they have been in recent decades under Chevron.

Eliminating Chevron would essentially take some amount of interpretive authority away from federal agencies and transfer it to federal courts. This would make it easier for litigants to successfully challenge agency actions, and could also have a chilling effect on agencies’ willingness to adopt potentially controversial interpretations. Simply put, no Chevron means fewer and less aggressive regulations. To libertarian-minded observers like Justice Neil Gorsuch, who has been strongly critical of the modern administrative state, this would be a welcome change—less regulation would mean smaller government, increased economic growth, and more individual freedom.1 Those who favor a laissez-faire approach to AI governance, therefore, should welcome the end of Chevron. Many commentators, however, have suggested that a robust federal regulatory response is necessary to safely develop advanced AI systems without creating unacceptable risks. Those who subscribe to this view would probably share Justice Kagan’s concern that degrading the federal government’s regulatory capacity will seriously impede AI governance efforts.

Furthermore, AI governance may be more susceptible to the potential negative effects of Chevron repeal than other areas of regulation. Under current law, the degree of deference accorded to agency interpretations “is particularly great where … the issues involve a high level of technical expertise in an area of rapidly changing technological and competitive circumstances.”2 This is because the regulation of emerging technologies is an area where two of the most important policy justifications for Chevron deference are at their most salient. Agencies, according to Chevron’s proponents, are (a) better than judges at marshaling deep subject matter expertise and hands-on experience, and (b) better than Congress at responding quickly and flexibly to changed circumstances. These considerations are particularly important for AI governance because AI is, in some ways, particularly poorly understood and unusually prone to manifesting unexpected capabilities and behaving in unexpected ways even in comparison to other emerging technologies.

Overturning Chevron would also make it more difficult for agencies to regulate AI under existing authorities by issuing new rules based on old statutes. The Federal Trade Commission, for example, does not necessarily need additional authorization to issue regulations intended to protect consumers from harms such as deceptive advertising using AI. It already has some authority to issue such regulations under § 5 of the FTC Act, which authorizes the FTC to issue regulations aimed at preventing “unfair or deceptive acts or practices in or affecting commerce.” But disputes will inevitably arise, as they often have in the past, over the exact meaning of statutory language like “unfair or deceptive acts or practices” and “in or affecting commerce.” This is especially likely to happen when old statutes (the “unfair or deceptive acts or practices” language in the FTC Act dates from 1938) are leveraged to regulate technologies that could not possibly have been foreseen when the statutes were drafted. Statutes that predate the technologies to which they are applied will necessarily be full of gaps and ambiguities, and in the past Chevron deference has allowed agencies to regulate more or less effectively by filling in those gaps. If Chevron is overturned, challenges to this kind of regulation will be more likely to succeed. For instance, the anticipated legal challenge to the Biden administration’s use of the Defense Production Act to authorize reporting requirements for AI labs developing dual-use foundation models may be more likely to succeed if Chevron is overturned.3

If Chevron is overturned, agency interpretations will still be entitled to a weaker form of deference known as Skidmore deference, after the 1944 Supreme Court case Skidmore v. Swift & Co. Skidmore requires courts to give respectful consideration to an agency’s interpretation, taking into account the agency’s expertise and knowledge of the policy context surrounding the statute. But Skidmore deference is not really deference at all; agency interpretations under Skidmore influence a court’s decision only to the extent that they are persuasive. In other words, replacing Chevron with Skidmore would require courts only to consider the agency’s interpretation along with other arguments and authorities raised by the parties to a lawsuit in the course of choosing the best interpretation of a statute.

How can legislators respond to the elimination of Chevron?

Chevron deference was not originally created by Congress—rather, it was created by the Supreme Court in 1984. This means that Congress could probably4 codify Chevron into law, if the political will to do so existed. However, past attempts to codify Chevron have mostly failed, and the difficulty of enacting controversial new legislation in the current era of partisan gridlock makes codifying Chevron an unlikely prospect in the short term. 

However, codifying Chevron as a universal principle of judicial interpretation is not the only option. Congress can alternatively codify Chevron on a narrower basis, by including, in individual laws for which Chevron deference would be particularly useful, provisions directing courts to defer to specified agencies’ reasonable interpretations of specified statutory provisions. This approach could address Justice Kagan’s concerns about the desirability of flexible rulemaking in highly technical and rapidly evolving regulatory areas while also making concessions to conservative concerns about the constitutional legitimacy of the modern administrative state.

While codifying Chevron could be controversial, there are also some uncontroversial steps that legislators can take to shore up new legislation against post-Chevron legal challenges. Conservative and liberal jurists agree that statutes can legitimately confer discretion on agencies to choose between different available policy options. So, returning to the small electric motor example discussed above, a statute that explicitly granted the DOE broad discretion to define “small electric motor” in accordance with the DOE’s policy judgment about which motors should be regulated would confer that discretion regardless of Chevron’s fate. The same would be true for, e.g., a law authorizing the Department of Commerce to exercise discretion in defining the phrase “frontier model.”5 A reviewing court would then ask whether the challenged agency interpretation fell within the agency’s discretion, rather than asking whether the interpretation was the best interpretation possible.

Conclusion

If the Supreme Court eliminates Chevron deference in the coming months, that decision will have profound implications for the regulatory capacity of executive-branch agencies generally and for AI governance specifically. However, there are concrete steps that can be taken to mitigate the impact of Chevron repeal on AI governance policy. Governance researchers and policymakers should not underestimate the potential significance of the end of Chevron and should take it into consideration while proposing legislative and regulatory strategies for AI governance.

AI Insight Forum – privacy and liability

Summary

On November 8, our Head of Strategy, Mackenzie Arnold, spoke before the US Senate’s bipartisan AI Insight Forum on Privacy and Liability, convened by Senate Majority Leader Chuck Schumer. We presented our perspective on how Congress can meet the unique challenges that AI presents to liability law.1

In our statement, we describe the distinct challenges that AI poses to existing liability law and make several recommendations for how Congress could respond to those challenges. The full statement follows.

Dear Senate Majority Leader Schumer, Senators Rounds, Heinrich, and Young, and distinguished members of the U.S. Senate, thank you for the opportunity to speak with you about this important issue. Liability is a critical tool for addressing risks posed by AI systems today and in the future. In some respects, existing law will function well, compensating victims, correcting market inefficiencies, and driving safety innovation. However, artificial intelligence also presents unusual challenges to liability law that may lead to inconsistency and uncertainty, penalize the wrong actors, and leave victims uncompensated. Courts, limited to the specific cases and facts at hand, may be slow to respond. It is in this context that Congress has an opportunity to act. 

Problem 1: Existing law will under-deter malicious and criminal misuse of AI. 

Many have noted the potential for AI systems to increase the risk of various hostile threats, ranging from biological and chemical weapons to attacks on critical infrastructure like energy, elections, and water systems. AI’s unique contribution to these risks goes beyond simply identifying dangerous chemicals and pathogens; advanced systems may help plan, design, and execute complex research tasks or help criminals operate on a vastly greater scale. With this in mind, President Biden’s recent Executive Order has called upon federal agencies to evaluate and respond to systems that may “substantially lower[] the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons.” While large-scale malicious threats have yet to materialize, many AI systems are inherently dual-use. If AI is capable of tremendous innovation, it may also be capable of tremendous, real-world harms. In many cases, the benefits of these systems will outweigh the risks, but the law can take steps to minimize misuse while preserving benefits.

Existing criminal, civil, and tort law will penalize malevolent actors for the harms they cause; however, liability is insufficient to deter those who know they are breaking the law. AI developers and some deployers will have the most control over whether powerful AI systems fall into the wrong hands, yet they may escape liability (or believe and act as if they will). Unfortunately, existing law may treat malevolent actors’ intentional bad acts or alterations to models as intervening causes that sever the causal chain and preclude liability, and the law leaves unclear what obligations companies have to secure their models. Victims will go uncompensated if their only source of recourse is small, hostile actors with limited funds. Reform is needed to make clear that those with the greatest ability to protect and compensate victims will be responsible for preventing malicious harms. 

Recommendations

(1.1) Hold AI developers and some deployers strictly liable for attacks on critical infrastructure and harms that result from biological, chemical, radiological, or nuclear weapons.

The law has long recognized that certain harms are so egregious that those who create them should internalize their cost by default. Harms caused by biological, chemical, radiological, and nuclear weapons fit these criteria, as do harms caused by attacks on critical infrastructure. Congress has addressed similar harms before, for example, creating strict liability for releasing hazardous chemicals into the environment. 

(1.2) Consider (a) holding developers strictly liable for harms caused by malicious use of exfiltrated systems and open-sourced weights or (b) creating a duty to ensure the security of model weights.

Access to model weights increases malicious actors’ ability to enhance dangerous capabilities and remove critical safeguards. And once model weights are out, companies cannot regain control or restrict malicious use. Despite this, existing information security norms are insufficient, as evidenced by the leak of Meta’s LLaMA model just one week after it was announced and significant efforts by China to steal intellectual property from key US tech companies. Congress should create strong incentives to secure and protect model weights. 

Getting this balance right will be difficult. Open-sourcing is a major source of innovation, and even the most scrupulous information security practices will sometimes fail. Moreover, penalizing exfiltration without restricting the open-sourcing of weights may create perverse incentives to open-source weights in order to avoid liability—what has been published openly can’t be stolen. To address these tradeoffs, Congress could pair strict liability with the ability to apply for safe harbor or limit liability to only the largest developers, who have the resources to secure the most powerful systems, while excluding smaller and more decentralized open-source platforms. At the very least, Congress should create obligations for leading developers to maintain adequate security practices and empower a qualified agency to update these duties over time. Congress could also support open-source development through secure, subsidized platforms like NAIRR or investigate other alternatives for safe access.

(1.3) Create duties to (a) identify and test for model capabilities that could be misused and (b) design and implement safeguards that consistently prevent misuse and cannot be easily removed. 

Leading AI developers are best positioned to secure their models and identify dangerous misuse capabilities before they cause harm. The latter requires evaluation and red-teaming before deployment, as acknowledged in President Biden’s recent Executive Order, and continued testing and updates after deployment. Congress should codify clear minimum standards for identifying capabilities and preventing misuse and should grant a qualified agency authority to update these duties over time.

Problem 2: Existing law will under-compensate harms from models with unexpected capabilities and failure modes. 

A core characteristic of modern AI systems is their tendency to display rapid capability jumps and unexpected emergent behaviors. While many of these advances have been benign, when unexpected capabilities cause harm, courts may treat them as unforeseeable and decline to impose liability. Other failures may occur when AI systems are integrated into new contexts, such as healthcare, employment, and agriculture, where integration presents both great upside and novel risks. Developers of frontier systems and deployers introducing AI into novel contexts will be best positioned to develop containment methods and detect and correct harms that emerge.

Recommendations

(2.1) Adjust the timing of obligations to account for redressability. 

To balance innovation and risk, liability law can create obligations at different stages of the product development cycle. For harms that are difficult to control or remedy after they have occurred, like harms that upset complex financial systems or that result from uncontrolled model behavior, Congress should impose greater ex-ante obligations that encourage the proactive identification of potential risks. For harms that are capable of containment and remedy, obligations should instead encourage rapid detection and remedy. 

(2.2) Create a duty to test for emergent capabilities, including agentic behavior and its precursors. 

Developers will be best positioned to identify new emergent behaviors, including agentic behavior. While today’s systems have not displayed such qualities, there are strong theoretical reasons to believe that autonomous capabilities may emerge in the future, as acknowledged by the actions of key AI developers like Anthropic and OpenAI. As techniques develop, Congress should ensure that those working on frontier systems utilize these tools rigorously and consistently. Here too, Congress should authorize a qualified agency to update these duties over time as new best practices emerge.

(2.3) Create duties to monitor, report, and respond to post-deployment harms, including taking down or fixing models that pose an ongoing risk. 

If, as we expect, emergent capabilities are difficult to predict, it will be important to identify them even after deployment. In many cases, the only actors with sufficient information and technical insight to do so will be major developers of cutting-edge systems. Monitoring helps only insofar as it is accompanied by duties to report or respond. In at least some contexts, corporations already have a duty to report security breaches and respond to continuing risks of harm, but legal uncertainty limits the effectiveness of these obligations and puts safe actors at a competitive disadvantage. By clarifying these duties, Congress can ensure that all major developers meet a minimum threshold of safety. 

(2.4) Create strict liability for harms that result from agentic model behavior such as self-exfiltration, self-alteration, self-proliferation, and self-directed goal-seeking. 

Developers and deployers should maintain control over the systems they create. Behaviors that enable models to act on their own—without human oversight—should be disincentivized through liability for any resulting harms. “The model did it” is an untenable defense in a functioning liability system, and Congress should ensure that, where intent or personhood requirements would stand in the way, the law imputes liability to a responsible human or corporate actor.

Problem 3: Existing law may struggle to allocate costs efficiently. 

The AI value chain is complex, often involving a number of different parties who help develop, train, integrate, and deploy systems. Because those later in the value chain are more proximate to the harms that occur, they may be the first to be brought to court. But these smaller, less-resourced actors will often have less ability to prevent harm. Disproportionately penalizing these actors will further concentrate power and diminish safety incentives for large, capable developers. Congress can ensure that responsibility lies with those most able to prevent harm. 

Recommendations

(3.1) Establish joint and several liability for harms involving AI systems. 

Victims will have limited information about who in the value chain is responsible for their injuries. Joint and several liability would allow victims to bring any responsible party to court for the full value of the injury. This would limit the burden on victims and allow better-resourced corporate actors to quickly and efficiently bargain toward a fair allocation of blame. 

(3.2) Limit indemnification of liability by developers. 

Existing law may allow wealthy developers to escape liability by contractually transferring blame to smaller third parties with neither the control to prevent nor the assets to remedy harms. Because cutting-edge systems will be so desirable, a small number of powerful AI developers will have considerable leverage to extract concessions from third parties and users. Congress should limit indemnification clauses that help the wealthiest players avoid internalizing the costs of their products while still permitting them to voluntarily indemnify users.

(3.3) Clarify that AI systems are products under products liability law. 

For over a decade, courts have refused to answer whether AI systems are software or products. This leaves critical ambiguity in existing law. The EU has proposed to resolve this uncertainty by declaring that AI systems are products. Though products liability is primarily developed through state law, a definitive federal answer to this question may spur quick resolution at the state level. Products liability has some notable advantages, focusing courts’ attention on the level of safety that is technically feasible, directly weighing risks and benefits, and applying liability across the value chain. Some have argued that this creates clearer incentives to proactively identify and invest in safer technology and limits temptations to go through the motions of adopting safety procedures without actually limiting risk. Products liability has its limitations, particularly in dealing with defects that emerge after deployment or alteration, but clarifying that AI systems are products is a good start. 

Problem 4: Federal law may obstruct the functioning of liability law. 

Parties are likely to argue that federal law preempts state tort and civil law and that Section 230 shields generative AI models from liability. Both would be unfortunate results that would prevent the redress of individual harms through state tort law and provide sweeping immunity to the very largest AI developers.

Recommendations

(4.1) Add a savings clause to any federal legislation to avoid preemption. 

Congress regularly adds express statements that federal law does not eliminate, constrain, or preempt existing remedies under state law. Congress should do the same here. While federal law will provide much-needed ex-ante requirements, state liability law will serve a critical role in compensating victims and will be more responsive to harms that occur as AI develops by continuing to adjust obligations and standards of care. 

(4.2) Clarify that Section 230 does not apply to generative AI. 

The most sensible reading of Section 230 suggests that generative AI is a content creator. It creates novel and creative outputs rather than merely hosting existing information. But absent Congressional intervention, this ambiguity may persist. Congress should provide a clear answer: Section 230 does not apply to generative AI.

Defining “frontier AI”

What are legislative and administrative definitions?

Congress usually defines key terms like “Frontier AI” in legislation to establish the scope of agency authorization. The agency then implements the law through regulations that more precisely set forth what is regulated, in terms sufficiently concrete to give notice to those subject to the regulation. In doing so, the agency may provide administrative definitions of key terms and provide specific examples or mechanisms.

Who can update these definitions?

Congress can amend legislation and might do so to supersede regulatory or judicial interpretations of the legislation. The agency can amend regulations to update its own definitions and implementation of the legislative definition.

Congress can also expressly authorize an agency to further define a term. For example, the Federal Insecticide, Fungicide, and Rodenticide Act defines “pest” to include any organism “the Administrator declares to be a pest” pursuant to 7 U.S.C. § 136.

What is the process for updating administrative definitions?

For a definition to be legally binding, by default an agency must follow the rulemaking process in the Administrative Procedure Act (APA). Typically, this requires that the agency go through specific notice-and-comment proceedings (informal rulemaking). 

Congress can change the procedures an agency must follow to make rules, for example by dictating the frequency of updates or by authorizing interim final rulemaking, which permits the agency to accept comments after the rule is issued instead of before.

Can a technical standard be incorporated by reference into regulations and statutes?

Yes, but incorporation by reference in regulations is limited. The agency must specify what version of the standard is being incorporated, and regulations cannot dynamically update with a standard. Incorporation by reference in federal regulations is also subject to other requirements. When Congress codifies a standard in a statute, it may incorporate future versions directly, as it did in the Federal Food, Drug, and Cosmetic Act, defining “drug” with reference to the United States Pharmacopoeia. 21 U.S.C. § 321(g). Congress can instead require that an agency use a particular standard. For example, the U.S. Consumer Product Safety Improvement Act effectively adopted ASTM International Standards on toy safety as consumer product safety standards and required the Consumer Product Safety Commission to incorporate future revisions into consumer product safety rules. 15 U.S.C. § 2056b(a) & (g).

How frequently could the definition be updated?

By default the rulemaking process is time-consuming. While the length of time needed to issue a rule varies, estimates from several agencies range from 6 months to over 4 years; the internal estimate of the average for the Food and Drug Administration (FDA) is 3.5 years and for the Department of Transportation is 1.5 years. Less significant updates, such as minor changes to a definition or list of regulated models, might take less time. However, legislation could impliedly or expressly allow updates to be made in a shorter time frame than permitted by the APA.

An agency may bypass some or all of the notice-and-comment process “for good cause” if to do otherwise would be “impracticable, unnecessary, or contrary to the public interest,” 5 U.S.C. § 553(b)(3)(B), such as in the interest of an emergent national security issue or to prevent widespread disruption of flights. It may also bypass the process if the time required would harm the public or subvert the underlying statutory scheme, such as when an agency relied on the exemption for decades to issue weekly rules on volume restrictions for agricultural commodities because it could not reasonably “predict market and weather conditions more than a month in advance” as the 30-day advance notice would require (Riverbend Farms, 9th Cir. 1992).

Congress can also implicitly or explicitly waive the APA requirements. While mere existence of a statutory deadline is not sufficient, a stringent deadline that makes compliance impractical might constitute good cause. 

What existing regulatory regimes may offer some guidance?

  1. The Federal Select Agents Program (FSAP) regulates biological agents that threaten public health, maintains a database of such agents, and inspects entities using such agents. FSAP also works with the FBI to evaluate entity-specific security risks. Finally, FSAP investigates incidents of non-compliance. FSAP provides a model for regulating technology as well as labs. The Program has some drawbacks worthy of study, including risks of regulatory capture (entity investigations are often not done by an independent examiner), prioritization issues (high-risk activities are often not prioritized), and resource allocation (entity investigations are often slow and tedious).
  2. The FDA approves generic drugs by comparing their similarity in composition and risk to existing, approved drugs. Generic drug manufacturers attempt to show sufficient similarity to an approved drug so as to warrant a less rigorous review by the FDA. This framework has parallels with a relative, comparative definition of Frontier AI.

What are the potential legal challenges?

  1. Under the major questions doctrine, courts will not accept an agency interpretation of a statute that grants the agency authority over a matter of great “economic or political significance” unless there is a “clear congressional authorization” for the claimed authority. Defining “frontier AI” in certain regulatory contexts could plausibly qualify as a “major question.” Thus, an agency definition of “Frontier AI” could be challenged under the major questions doctrine if issued without congressional authorization.
  2. The regulation could face a non-delegation doctrine challenge, which limits Congress’s ability to delegate its legislative power. The doctrine requires Congress to include an “intelligible principle” to guide the agency’s exercise of the delegated authority. In practice, this is a lenient standard; however, some commentators believe that the Supreme Court may strengthen the doctrine in the near future. Legislation that provides more specific guidance regarding policy decisions is less problematic from a nondelegation perspective than legislation that confers a great deal of discretion on the agency and provides little or no guidance on how the agency should exercise it.

LawAI’s thoughts on proposed updates to U.S. federal benefit-cost analysis

This analysis is based on a comment submitted in response to the Request for Comment on proposed Circular A-4, “Regulatory Analysis”.

We support the many important and substantial reforms to the regulation review process in the proposed Circular A-4. The reforms, if adopted, would reduce the odds of regulations imposing undue costs on vulnerable, underrepresented, and disadvantaged communities both now and well into the future. In this piece, we outline a few additional changes that would further reduce those odds: expanding the scope of analysis to include catastrophic and existential risks, including those far in the future; including future generations in distributional analysis; providing more guidance regarding model uncertainty and regulations that involve irreversible outcomes; and lowering the discount rate to zero for irreversible effects in a narrow set of cases or, minimally, lowering the discount rate in proportion to the temporal scope of a regulation.

1. Circular A-4 contains many improvements, including consideration of global impacts, expanding the temporal scope of analysis, and recommendations on developing an analytical baseline.

Circular A-4 contains many improvements on the current approach to benefit-cost analysis (BCA). In particular, the proposed reforms would allow for a more comprehensive understanding of the myriad risks posed by any regulation. The guidance for analysis to include global impacts1 will more accurately account for the effects of a regulation on increasingly interconnected and interdependent economic, political, and environmental systems. Many global externalities, such as pandemics and climate change, require international regulatory cooperation; in these cases, efficient allocation of global resources, which benefits the United States and its citizens and residents, requires all countries to consider global costs and benefits.2

The instruction to tailor the time scope of analysis to “encompass all the important benefits and costs likely to result from regulation” will likewise bolster the quality of a risk assessment3—though, as mentioned below, a slight modification to this instruction could aid regulators in identifying and mitigating existential risks posed by regulations. 

The recommendations on developing an analytic baseline have the potential to increase the accuracy and comprehensiveness of BCA by ensuring that analysts integrate current and likely technological developments and the resulting harms of those developments into their baseline.4

A number of other proposals would also qualify as improvements on the status quo. A litany of commenters have discussed those proposals, so the remainder of this piece is reserved for suggested amendments and recommendations for topics worthy of additional consideration.

2. The footnote considering catastrophic risks is a welcome addition that could be further strengthened with a minimum time frame of analysis and clear inclusion of catastrophic and existential threats in “important” and “likely” benefits and costs.

The proposed language will lead to a more thorough review of the benefits and costs of a regulation by expanding the time horizon over which those effects are assessed.5 We particularly welcome the footnote encouraging analysts to consider whether a regulation that involves a catastrophic risk may impose costs on future generations.6

We offer two suggestions to further strengthen this footnote’s purpose of encouraging the consideration of catastrophic and existential risks and the long-run effects of related regulation. First, we recommend mandating consideration of the long-run effects of a regulation.7 Given the economic significance of a regulation that triggers review under Executive Orders 12866 and 13563, as supplemented and reaffirmed by Executive Order 14094, the inevitable long-term impacts deserve consideration—especially because regulations of such size and scope could affect catastrophic and existential risks that imperil future generations. Thus, the Office should consider establishing a minimum time frame of analysis to ensure that long-run benefits and costs are adequately considered, even if they are sometimes found to be negligible or highly uncertain.

Second, the final draft should clarify what constitutes an “important” benefit and cost as well as when those effects will be considered “likely”.8 We recommend that those concepts clearly encompass potential catastrophic or existential threats, even those that have very low likelihood.9 An expansive definition of both qualifiers would allow the BCA to provide stakeholders with a more complete picture of the regulation’s short- and long-term impact.

3. Distributional analysis should become the default of regulatory review and include future generations as a group under consideration.

The potential for disparate effects of regulations on vulnerable, underrepresented, and disadvantaged groups merits analysis in all cases. Along with several other commenters, we recommend that distributional analysis become the default of any regulatory review. When possible, we further recommend that such analysis include future generations among the demographic categories.10 Future generations have no formal representation and will bear the costs imposed by any regulation for longer than other groups.11

The Office should also consider making this analysis mandatory, with no exceptions. Such a mandate would reduce the odds of any group unexpectedly bearing a disproportionate and unjust share of the costs of a regulation. The information generated by this analysis would also give groups a more meaningfully informed opportunity to engage in the review of regulations. 

4. Treatment of uncertainty is crucial for evaluating long-term impacts and should include more guidance regarding models, model uncertainty, and regulations that involve irreversible outcomes.

Circular A-4 directs agencies to seek out and respond to several different types of uncertainty from the outset of their analysis.12 This direction will allow for a more complete understanding of the impacts of a regulation in both the short and long term. Greater direction would accentuate those benefits.

The current model uncertainty guidance, largely confined to a footnote, nudges agencies to “consider multiple models to establish robustness and reduce model uncertainty.”13 The brevity of this instruction conflicts with the complexity of this process. Absent more guidance, agencies may be poorly equipped to assess and treat uncertainty, which will frustrate the provision of “useful information to decision makers and the public about the effects and the uncertainties of alternative regulatory actions.”14 A more participatory, equitable, and robust regulation review process hinges on that information. 

We encourage the agency to provide further examples and guidance on how to prepare models and address model uncertainty, in particular regarding catastrophic and existential risks, as well as significant benefits and costs in the far future.15 A more robust approach to responding to uncertainty would include explicit instructions on how to identify, evaluate, and report uncertainty regarding the future. Several commenters highlighted that estimates of costs and benefits become more uncertain over time. We echo and amplify concerns that regulations with forecasted effects on future generations will require more rigorous treatment of uncertainty.

We similarly recommend that more guidance be offered with respect to regulations that involve irreversible outcomes, such as exhaustion of resources or extinction of a species.16 The Circular notes that such regulations may benefit from a “real options” analysis; however, this simple guidance is inadequate for the significance of the topic. The Circular acknowledges that “[t]he costs of shifting the timing of regulatory effects further into the future may be especially high when regulating to protect against irreversible harms.” We agree that preserving option value for future generations is of immense value. How to value those options should receive more attention in subsequent drafts. Likewise, guidance on how to identify irreversible outcomes and conduct real options analysis merits more attention in forthcoming iterations.

We recommend similar caution for regulations involving harms that are persistent and challenging to reverse, but not irreversible.

5. A lower discount rate and declining discount rate are necessary to account for the impact of regulations with significant and long-term effects on future generations.

The discount rate in a BCA is one signal of how much a society values the future. We join a chorus of commenters in applauding both the overall lowering of the discount rate and the idea of a declining discount rate schedule.

The diversity of perspectives in those comments, however, indicates that this topic merits further consideration. In particular, we would welcome further discussion on the merits of a zero discount rate. Though sometimes characterized as a blunt tool to attempt to assist future generations,17 zero discount rates may become necessary when evaluating regulations that involve irreversible harm.18 In cases involving irreversibility, a fundamental assumption about discounting breaks down—specifically, that the discounted resource has more value in the present because it can be invested and, as a result, generate more resources in subsequent periods.19 If the regulation involves the elimination of certain resources, such as nonrenewable resources, rather than their preservation or investment, then the value of the resources remains constant across time periods.20 Several commenters indicated that they share our concern about such harms, suggesting that they would welcome this narrow use case for zero discount rates.21
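
As a stylized restatement of that reasoning (our illustration, not language from the proposed Circular): under standard exponential discounting, a benefit or cost V_t realized in year t is valued today at

\[
PV = \frac{V_t}{(1+r)^{t}},
\]

and the usual justification for a positive r is that a resource worth V_0 today could instead be invested and grow to roughly V_0(1+r)^t by year t. When a harm is irreversible, for instance the permanent exhaustion of a resource that cannot be replaced or reinvested, that justification falls away and the value at stake is effectively constant across periods, which is the case we argue is best handled with a discount rate of zero.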

We likewise support the general concept of declining discount rates and further conversations regarding the declining discount rate (DDR) schedule,22 given the importance of such schedules in accounting for the impact of regulations with significant and long-term effects on future generations.23 US adoption of a DDR schedule would bring us into alignment with two peers—namely, the UK and France.24 The UK’s approach, which is based on the Ramsey formula rather than a fixed DDR schedule like the one proposed here, deserves particular attention given that it estimates the rate of time preference ρ as the sum of “pure time preference (δ, delta) and catastrophic risk (L)”,25 defined in the previous Green Book as the “likelihood that there will be some event so devastating that all returns from policies, programmes or projects are eliminated”.26 This approach to a declining discount schedule demonstrates the sort of risk aversion, considering catastrophic and existential risk, that is necessary in light of regulations that present significant uncertainty.
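
For reference, the Green Book’s social time preference rate (STPR) takes the standard Ramsey form; the formula and the central parameter values below reflect our understanding of the UK guidance and are included only as an illustration:

\[
\mathrm{STPR} = \rho + \mu \cdot g, \qquad \rho = \delta + L,
\]

where g is the expected growth rate of per-capita consumption and μ is the elasticity of the marginal utility of consumption. On our reading, the UK’s central values are roughly δ = 0.5%, L = 1%, μ = 1, and g = 2%, yielding the headline 3.5% rate, which itself declines for horizons beyond about 30 years. The point relevant to this comment is that catastrophic risk enters the discount rate explicitly through L rather than being left out of the analysis.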

6. Regulations that relate to irreversible outcomes, catastrophic risk, or existential risk warrant review as being significant under Section 3(f)(1).

In establishing thresholds for which regulations will undergo regulatory analysis, Section 3(f)(1) of Executive Order 12866 includes a number of sufficient criteria in addition to the increased monetary threshold. We note that regulations that might increase or reduce catastrophic or existential risk should be reviewed as having the potential to “adversely affect in a material way the economy, a sector of the economy, productivity, competition, jobs, the environment, public health or safety, or State, local, territorial, or tribal governments or communities.”27 Even “minor” regulations can have unintended consequences with major ramifications on our institutions, systems, and norms—those that might influence such grave risks are of particular import. For similar reasons, the Office should also review any regulation that has a reasonable chance of causing irreversible harm to future generations.28

7. Conclusion

Circular A-4 contains important and substantial reforms to the regulation review process. The reforms, if adopted, would reduce the odds of regulations imposing undue costs on vulnerable, underrepresented, and disadvantaged communities both now and well into the future. A few additional changes would further reduce those odds—specifically, expanding the scope of analysis to include catastrophic and existential risks, including those far in the future; including future generations in distributional analysis; providing more guidance regarding model uncertainty and regulations that involve irreversible outcomes; and lowering the discount rate to zero for irreversible effects in a narrow set of cases or, minimally, lowering the discount rate in proportion to the temporal scope of a regulation.

Re-evaluating GPT-4’s bar exam performance

1. Introduction

On March 14th, 2023, OpenAI launched GPT-4, said to be the latest milestone in the company’s effort in scaling up deep learning [1]. As part of its launch, OpenAI revealed details regarding the model’s “human-level performance on various professional and academic benchmarks” [1]. Perhaps none of these capabilities was as widely publicized as GPT-4’s performance on the Uniform Bar Examination, with OpenAI prominently displaying on various pages of its website and technical report that GPT-4 scored in or around the “90th percentile,” [1-3] or “the top 10% of test-takers,” [1, 2] and various prominent media outlets [4–8] and legal scholars [9] resharing and discussing the implications of these results for the legal profession and the future of AI.

Of course, assessing the capabilities of an AI system as compared to those of a human is no easy task [10–15], and in the context of the legal profession specifically, there are various reasons to doubt the usefulness of the bar exam as a proxy for lawyerly competence (both for humans and AI systems), given that, for example: (a) the content on the UBE is very general and does not pertain to the legal doctrine of any jurisdiction in the United States [16], and thus knowledge (or ignorance) of that content does not necessarily translate to knowledge (or ignorance) of relevant legal doctrine for a practicing lawyer of any jurisdiction; and (b) the tasks involved on the bar exam, particularly multiple-choice questions, do not reflect the tasks of practicing lawyers, and thus mastery (or lack of mastery) of those tasks does not necessarily reflect mastery (or lack of mastery) of the tasks of practicing lawyers.

Moreover, although the UBE is a closed-book exam for humans, GPT-4’s huge training corpus, largely distilled in its parameters, means that it can effectively take the UBE “open-book”, indicating that the UBE may not only be an inaccurate proxy for lawyerly competence but is also likely to provide an overly favorable estimate of GPT-4’s lawyerly capabilities relative to humans.

Notwithstanding these concerns, the bar exam results appeared especially startling compared to GPT-4’s other capabilities, for several reasons beyond the sheer complexity of the law in form [17–19] and content [20–22]. The first is that the boost in performance of GPT-4 over its predecessor GPT-3.5 (80 percentile points) far exceeded that of any other test, including seemingly related tests such as the LSAT (40 percentile points), GRE Verbal (36 percentile points), and GRE Writing (0 percentile points) [2, 3].

The second is that half of the Uniform Bar Exam consists of writing essays [16],1 and GPT-4 seems to have scored much lower on other exams involving writing, such as AP English Language and Composition (14th-44th percentile), AP English Literature and Composition (8th-22nd percentile), and GRE Writing (~54th percentile) [1, 2]. In each of these three exams, GPT-4 failed to achieve a higher percentile than GPT-3.5, and failed to achieve a percentile score anywhere near the 90th percentile.

Moreover, in its technical report, OpenAI claims that its percentile estimates are “conservative” estimates meant to reflect “the lower bound of the percentile range” [2, p. 6], implying that GPT-4’s actual capabilities may be even greater than the reported estimates.

Methodologically, however, there appear to be various uncertainties related to the calculation of GPT’s bar exam percentile. For example, unlike the administrators of other tests that GPT-4 took, the administrators of the Uniform Bar Exam (the NCBE as well as different state bars) do not release official percentiles of the UBE [27, 28], and different states in their own releases almost uniformly report only passage rates as opposed to percentiles [29, 30], as only the former are considered relevant to licensing requirements and employment prospects.

Furthermore, unlike its documentation for the other exams it tested [2, p. 25], OpenAI’s technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.

The reliability and transparency of this estimate has important implications on both the legal practice front and the AI safety front. On the legal practice front, there is great debate regarding to what extent and when legal tasks can and should be automated [31–34]. To the extent that capabilities estimates for generative AI in the context of law are overblown, this may lead both lawyers and non-lawyers to rely on generative AI tools when they otherwise wouldn’t and arguably shouldn’t, plausibly increasing the prevalence of bad legal outcomes as a result of (a) judges misapplying the law; (b) lawyers engaging in malpractice and/or poor representation of their clients; and (c) non-lawyers engaging in ineffective pro se representation.

Meanwhile, on the AI safety front, there appear to be growing concerns regarding transparency2 among developers of the most powerful AI systems [36, 37]. To the extent that transparency is important to ensuring the safe deployment of AI, a lack of transparency could undermine our confidence in the prospect of safe deployment of AI [38, 39]. In particular, releasing models without an accurate and transparent assessment of their capabilities (including by third-party developers) might lead to unexpected misuse/misapplication of those models (within and beyond legal contexts), which might have detrimental (perhaps even catastrophic) consequences moving forward [40, 41].

Given these considerations, this paper begins by investigating some of the key methodological challenges in verifying the claim that GPT-4 achieved 90th percentile performance on the Uniform Bar Examination. The paper’s findings in this regard are fourfold. First, although GPT-4’s UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates appear heavily skewed towards those who failed the July administration and whose scores are much lower compared to the general test-taking population. Second, using data from a recent July administration of the same exam reveals GPT-4’s percentile to be below the 69th percentile on the UBE, and ~48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be ~62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam, GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays.

Next, whereas the above four findings take for granted the scaled score achieved by GPT-4 as reported by OpenAI, the paper then proceeds to investigate the validity of that score, given the importance (and often neglectedness) of replication and reproducibility within computer science and scientific fields more broadly [42–46]. The paper successfully replicates the MBE score of 158, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the essay score (140).

Finally, the paper also investigates the effect of adjusting temperature settings and prompting techniques on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings on performance, and some significant effect of prompt engineering on model performance when compared to a minimally tailored baseline condition.

Taken together, these findings suggest that OpenAI’s estimates of GPT-4’s UBE percentile, though clearly an impressive leap over those of GPT-3.5, are likely overinflated, particularly if taken as a “conservative” estimate representing “the lower range of percentiles,” and even more so if meant to reflect the actual capabilities of a practicing lawyer. These findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance of rigorous and transparent capabilities evaluations by generative AI developers to help secure safer and more trustworthy AI.

2. Evaluating the 90th Percentile Estimate

2.1. Evidence from OpenAI

Investigating the OpenAI website, as well as the GPT-4 technical report, reveals a multitude of claims regarding the estimated percentile of GPT-4’s Uniform Bar Examination performance but a dearth of documentation regarding the backing of such claims. For example, the first paragraph of the official GPT-4 research page on the OpenAI website states that “it [GPT-4] passes a simulated bar exam with a score around the top 10% of test takers” [1]. This claim is repeated several times later on this and other webpages, both visually and textually, each time without explicit backing.3

Similarly undocumented claims are reported in the official GPT-4 Technical Report.4 Although OpenAI details the methodology for computing most of its percentiles in A.5 of the Appendix of the technical report, there does not appear to be any such documentation for the methodology behind computing the UBE percentile. For example, after providing relatively detailed breakdowns of its methodology for scoring the SAT, GRE, AP, and AMC exams, the report states that “[o]ther percentiles were based on official score distributions,” followed by a string of references to relevant sources [2, p. 25].

Examining these references, however, none of the sources contains any information regarding the Uniform Bar Exam, let alone its “official score distributions” [2, p. 22-23]. Moreover, aside from the Appendix, there are no other direct references to the methodology of computing UBE scores, nor any indirect references aside from a brief acknowledgement thanking “our collaborators at Casetext and Stanford CodeX for conducting the simulated bar exam” [2, p. 18].

2.2. Evidence from GPT-4 Passes the Bar

Another potential source of evidence for the 90th percentile claim comes from an early draft version of the paper, “GPT-4 passes the bar exam,” written by the administrators of the simulated bar exam referenced in OpenAI’s technical report [47]. The paper is very well-documented and transparent about its methodology in computing raw and scaled scores, both in the main text and in its comprehensive appendices. Unlike the GPT-4 technical report, however, the focus of the paper is not on percentiles but rather on the model’s scaled score compared to that of the average test taker, based on publicly available NCBE data. In fact, one of the only mentions of percentiles is in a footnote, where the authors state, in passing: “Using a percentile chart from a recent exam administration (which is generally available online), ChatGPT would receive a score below the 10th percentile of test-takers while GPT-4 would receive a combined score approaching the 90th percentile of test-takers.” [47, p. 10]

2.3. Evidence Online

As explained by [27], the National Conference of Bar Examiners (NCBE), the organization that writes the Uniform Bar Exam (UBE), does not release UBE percentiles.5 Because there is no official percentile chart for the UBE, all generally available online estimates are unofficial. Perhaps the most prominent of such estimates are the percentile charts from the pre-July 2019 Illinois Bar Exam. Pre-2019,6 Illinois, unlike other states, provided percentile charts of its own exam that allowed UBE test-takers to estimate their approximate percentile given the similarity between the two exams [27].7

Examining these approximate conversion charts, however, yields conflicting results. For example, although the percentile chart from the February 2019 administration of the Illinois Bar Exam estimates a score of 300 (2-3 points higher than GPT-4’s score) to be at the 90th percentile, this estimate is heavily skewed compared to the general population of July exam takers,8 since the majority of those who take the February exam are repeat takers who failed the July exam [52]9, and repeat takers score much lower10 and are much more likely to fail than are first-timers.11

Indeed, the latest available percentile chart for the July exam places GPT-4’s UBE score at the ~68th percentile, well below the 90th percentile figure cited by OpenAI [54].

3. Towards a More Accurate Percentile Estimate

Although using the July bar exam percentiles from the Illinois Bar would seem to yield a more accurate estimate than the February data, the July figure is also biased towards lower scorers, since approximately 23% of July test takers nationally are estimated to be repeat takers, who score, for example, 16 points lower than first-timers on the MBE [55]. Limiting the comparison to first-timers avoids double-counting those who retake the exam after failing once or more.

Relatedly, although (virtually) all licensed attorneys have passed the bar,12 not all those who take the bar become attorneys. To the extent that GPT-4’s UBE percentile is meant to reflect its performance relative to attorneys, a more appropriate comparison would limit the sample not only to first-timers but also to those who achieved a passing score.

Moreover, the data discussed above is based purely on Illinois bar exam data, which (at the time of the chart) was similar but not identical to the UBE in content and scoring [27]; a more accurate estimate would be derived more directly from official NCBE sources.

3.1. Methods

To account for the issues with both OpenAI’s estimate and the July estimate, more accurate percentile estimates for GPT-3.5 and GPT-4 were computed here relative to first-time test-takers, including both (a) first-time test-takers overall, and (b) those who passed.

To do so, the parameters for a normal distribution of scores were separately estimated for the MBE and essay components (MEE + MPT), as well as the UBE score overall.13

Assuming that UBE scores (as well as MBE and essay subscores) are normally distributed, percentiles of GPT’s scores can be computed directly once the parameters of these distributions (i.e., the mean and standard deviation) have been estimated.
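As an illustration of this step, the following minimal R sketch computes a percentile under the normality assumption. The mean below is the NCBE first-timer MBE mean used later in this section; the standard deviation is a placeholder, since the actual value is estimated in the following paragraphs.

mbe_mean <- 143.8   # mean first-timer MBE score (NCBE data, discussed below)
mbe_sd   <- 15      # placeholder SD for illustration; estimated from NCBE distributions below
gpt4_mbe <- 158     # GPT-4's reported scaled MBE score

# Percentile = proportion of the assumed normal distribution falling below GPT-4's score
percentile <- pnorm(gpt4_mbe, mean = mbe_mean, sd = mbe_sd) * 100
round(percentile, 1)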

Thus, the methodology here was to first compute these parameters, then generate distributions with these parameters, and then compute (a) what percentage of values on these distributions are lower than GPT’s scores (to estimate the percentile against first-timers); and (b) what percentage of values above the passing threshold are lower than GPT’s scores (to estimate the percentile against qualified attorneys).

With regard to the mean, according to publicly available official NCBE data, the mean MBE score of first-time test-takers is 143.8 [55].

As explained by official NCBE publications, the essay component is scaled to the MBE data [59], such that the two components have approximately the same mean and standard deviation [53, 54, 59]. Thus, the methodology here assumed that the mean first-time essay score is 143.8.14

Given that the total UBE score is computed directly by adding the MBE and essay scores [60], the mean first-time UBE score was assumed to be 287.6 (143.8 + 143.8).

With regard to standard deviations, information regarding the SD of first-timer scores is not publicly available. However, distributions of MBE scores for July administrations (provided in 5-point intervals) are publicly available on the NCBE website [58].

Under the assumption that first-timers have approximately the same SD as the general test-taking population in July, the standard deviation of first-time MBE scores was computed by (a) entering the publicly available distribution of MBE scores into R; and (b) taking the standard deviation of this distribution using the built-in sd() function (which computes the sample standard deviation of a vector of values).
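A minimal sketch of this computation in R, using hypothetical interval midpoints and counts in place of the actual NCBE figures, might look as follows:

# Hypothetical binned MBE score distribution (midpoints of 5-point intervals and counts);
# the actual values are taken from the NCBE website [58].
midpoints <- seq(102.5, 187.5, by = 5)
counts    <- c(10, 25, 60, 130, 260, 480, 800, 1200, 1600, 1900,
               1950, 1800, 1500, 1100, 700, 400, 180, 70)

# Expand the binned data into a vector of (approximate) individual scores
scores <- rep(midpoints, times = counts)

# Sample standard deviation of the expanded distribution
mbe_sd <- sd(scores)
mbe_sd

Using the interval midpoints is a simplification; it slightly understates within-bin variability but closely approximates the SD when the bins are narrow relative to the spread of scores.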

Given that, as mentioned above, the distribution (mean and SD) of essay scores is approximately the same as that of MBE scores, the SD for essay scores was computed in the same manner.

With regard to the UBE, although UBE standard deviations are not publicly available for any official exam, they can be inferred from a combination of the mean UBE score for first-timers (287.6) and first-time pass rates.

For reference, the standard deviation can be computed analytically as follows:

σ = (x − µ) / z

Where:

σ = the standard deviation;
x = the cutoff (passing) score for a given administration;
µ = the mean score; and
z = the z-score corresponding to the percentile of the cutoff score (i.e., the proportion of examinees who did not pass).

Thus, by (a) subtracting the mean (µ) from the cutoff score of a given administration (x); and (b) dividing that difference by the z-score (z) corresponding to the percentile of the cutoff score, one is left with the standard deviation (σ).

Here, the standard deviation was calculated according to the above formula using the official first-timer mean, along with pass rate and cutoff score data from New York, which according to NCBE data has the highest number of examinees for any jurisdiction [61].15
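Applying the formula above, the computation can be sketched in R as follows. New York’s 266 passing score is used for the cutoff; the pass rate below is a hypothetical placeholder rather than the official figure used in the analysis.

ube_mean  <- 287.6   # assumed mean first-timer UBE score (143.8 + 143.8)
ny_cutoff <- 266     # New York's UBE passing score
pass_rate <- 0.80    # hypothetical first-time pass rate, for illustration only

# z-score corresponding to the percentile of the cutoff (the proportion who did not pass)
z <- qnorm(1 - pass_rate)

# sigma = (x - mu) / z
ube_sd <- (ny_cutoff - ube_mean) / z
ube_sd

Note that z is negative whenever the pass rate exceeds 50%, which keeps the inferred standard deviation positive given that the cutoff lies below the mean.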

After obtaining these parameters, distributions of first-timer scores for the MBE component, essay component, and UBE overall were generated using the built-in rnorm function in R (which draws random samples from a normal distribution with a given mean and standard deviation).

Finally, after generating these distributions, percentiles were computed by calculating (a) what percentage of values on these distributions were lower than GPT’s scores (to estimate the percentile against first-timers); and (b) what percentage of values above the passing threshold were lower than GPT’s scores (to estimate the percentile against qualified attorneys).

With regard to the latter comparison, percentiles were computed after removing all UBE scores below 270, which is the most common score cutoff for states using the UBE [62]. To compute models’ performance on the individual components relative to qualified attorneys, a separate percentile was likewise computed after removing all subscores below 135.16
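A condensed sketch of this simulation-based approach in R, under the assumptions above (with an illustrative standard deviation standing in for the value estimated from the New York data), might look as follows:

set.seed(1)          # for reproducibility of the simulated distribution

ube_mean <- 287.6    # assumed mean first-timer UBE score
ube_sd   <- 25       # illustrative SD; the analysis uses the value inferred above
gpt4_ube <- 298      # GPT-4's reported UBE score

# Simulate a large sample of first-timer UBE scores
sim_scores <- rnorm(1e6, mean = ube_mean, sd = ube_sd)

# (a) Percentile against first-time test-takers
mean(sim_scores < gpt4_ube) * 100

# (b) Percentile against those who passed (scores at or above the 270 cutoff)
passers <- sim_scores[sim_scores >= 270]
mean(passers < gpt4_ube) * 100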

3.2. Results

Table 1: Estimated percentile of GPT-4’s uniform bar examination performance
Table 2: Estimated percentiles of MBE, essay, and total UBE scores among first-time test takers of uniform bar exam

3.2.1. Performance against first-time test-takers

Results are presented in Tables 1 and 2. For each component of the UBE, as well as the UBE overall, GPT-4’s estimated percentile among first-time July test takers is lower than both the OpenAI estimate and the July estimate, each of which includes repeat takers.

With regard to the aggregate UBE score, GPT-4 scored in the 62nd percentile as compared to the ~90th percentile February estimate and the ~68th percentile July estimate. With regard to MBE, GPT-4 scored in the ~79th percentile as compared to the ~95th percentile February estimate and the 86th percentile July estimate. With regard to MEE + MPT, GPT-4 scored in the ~42nd percentile as compared to the ~69th percentile February estimate and the ~48th percentile July estimate.

With regard to GPT-3.5, its aggregate UBE score among first-timers was in the ~2nd percentile, as compared to the ~2nd percentile February estimate and ~1st percentile July estimate. Its MBE subscore was in the ~6th percentile, compared to the ~10th percentile February estimate and ~7th percentile July estimate. Its essay subscore was in the ~0th percentile, compared to the ~1st percentile February estimate and ~0th percentile July estimate.

3.2.2. Performance against qualified attorneys

Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further.

With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

With regard to GPT-3.5, its aggregate UBE score among qualified attorneys was 0th percentile, as were its percentiles for both subscores.

Table 3: Estimated percentile leap from GPT-3.5 to GPT-4 on uniform bar examination

4. Re-Evaluating the Raw Score

So far, this analysis has taken for granted the scaled score achieved by GPT-4 as reported by OpenAI—that is, assuming GPT-4 scored a 298 on the UBE, is the 90th-percentile figure reported by OpenAI warranted?

Table 4: Comparison of estimated percentiles of UBE Scores for different groups. February and July scores are based on data from Illinois bar exam [53, 54]. First-timer and Attorney percentiles are based on original calculations here. Attorney percentiles are based on a UBE cutoff score of 270, which is the most common cutoff score in UBE jurisdictions.

However, given calls for replication and reproducibility within science more broadly [42–46], it is worth scrutinizing the validity of the score itself: did GPT-4 in fact score a 298 on the UBE?

Moreover, given the various potential hyperparameter settings available when using GPT-4 and other LLMs, it is worth assessing whether and to what extent adjusting such settings might influence the capabilities of GPT-4 on exam performance.

To that end, this section first attempts to replicate the MBE score reported by [1] and [47] using methods as close to the original paper as reasonably feasible. The section then attempts to get a sense of the floor and ceiling of GPT-4’s out-of-the-box capabilities by comparing GPT-4’s MBE performance using the best and worst hyperparameter settings.

Finally, the section re-examines GPT-4’s performance on the essays, evaluating (a) the extent to which the methodology used to grade GPT-4’s essays deviated from the official protocol used by the National Conference of Bar Examiners during actual bar exam administrations; and (b) the extent to which such deviations might undermine one’s confidence in the scaled essay scores reported by [1] and [47].

4.1. Replicating the MBE Score

4.1.1. Methodology

Materials. As in [47], the materials used here were the official MBE questions released by the NCBE. The materials were purchased and downloaded in PDF format from an authorized NCBE reseller. Afterwards, the materials were converted into TXT format, and text analysis tools were used to format the questions in a way that was suitable for prompting, following [47].

Procedure. To replicate the MBE score reported by [1], this paper followed the protocol documented by [47], with some minor additions for robustness purposes. In [47], the authors tested GPT-4’s MBE performance using three different temperature settings: 0, .5 and 1. For each of these temperature settings, GPT-4’s MBE performance was tested using two different prompts, including (1) a prompt where GPT-4 was asked to provide a top-3 ranking of answer choices, along with a justification and authority/citation for its answer; and (2) a prompt where GPT-4 was asked to provide a top-3 ranking of answer choices, without providing a justification or authority/citation for its answer.

For each of these prompts, GPT-4 was also told that it should answer as if it were taking the bar exam.

For each of these prompt/temperature combinations, [47] tested GPT-4 three separate times (“experiments” or “trials”) to control for variation.

The minor additions to this protocol were twofold. First, GPT-4 was tested under two additional temperature settings: .25 and .7. This brought the total number of temperature/prompt combinations to 10, as opposed to 6 in the original paper. Second, GPT-4 was tested 5 times under each temperature/prompt combination as opposed to 3 times, bringing the total number of trials to 50 as opposed to 18.

After prompting, raw scores were computed using the official answer key provided by the exam. Scaled scores were then computed following the method outlined in [63], by (a) multiplying the number of correct answers by 190, and dividing by 200; and (b) converting the resulting number to a scaled score using a conversion chart based on official NCBE data.
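In R, this two-step conversion can be sketched roughly as follows; the number of correct answers and the conversion chart below are hypothetical placeholders, since the actual chart is based on official NCBE data and is not reproduced here.

n_correct <- 150   # hypothetical number of correct answers out of 200

# (a) Rescale to the 190 scored questions
raw_190 <- n_correct * 190 / 200

# (b) Convert to a scaled score via a (hypothetical, truncated) raw-to-scaled chart
conversion_chart <- data.frame(
  raw    = 130:145,
  scaled = seq(150, 165, by = 1)   # placeholder values, not the official chart
)
scaled_score <- conversion_chart$scaled[which.min(abs(conversion_chart$raw - raw_190))]
scaled_score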

After scoring, scores from the replication trials were analyzed in comparison to those from [47] using the data from their publicly available GitHub repository.

To assess whether there was a significant difference between GPT-4’s accuracy in the replication trials as compared to the [47] paper, as well as to assess any significant effect of prompt type or temperature, a mixed-effects binary logistic regression was conducted with: (a) paper (replication vs. original), temperature, and prompt as fixed effects17; and (b) question number and question category as random effects. These regressions were conducted using the lme4 [64] and lmerTest [65] packages in R.
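A sketch of what such a model might look like in R follows. The data frame, its column names, and the simulated values are hypothetical stand-ins for the real trial-level data; for a binomial outcome the relevant lme4 function is glmer, with p-values for fixed effects taken from the Wald z-tests in summary() (lmerTest’s Satterthwaite machinery applies to linear mixed models).

library(lme4)

# Hypothetical long-format data: one row per question-level response
set.seed(1)
mbe_trials <- data.frame(
  correct           = rbinom(600, 1, 0.75),
  paper             = rep(c("replication", "original"), each = 300),
  temperature       = factor(rep(c(0, 0.5, 1), times = 200)),
  prompt            = rep(c("with_citation", "without_citation"), times = 300),
  question_id       = factor(rep(1:100, times = 6)),
  question_category = factor(rep(1:10, times = 60))
)

# Mixed-effects binary logistic regression: paper, temperature, and prompt as
# fixed effects; question number and question category as random intercepts
m <- glmer(
  correct ~ paper + temperature + prompt +
    (1 | question_id) + (1 | question_category),
  data = mbe_trials, family = binomial
)
summary(m)  # Wald z-tests for the fixed effects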

4.1.2. Results

Results are visualized in Table 5. Mean MBE accuracy across all trials in the replication here was 75.6% (95% CI: 74.7 to 76.4), whereas the mean accuracy across all trials in [47] was 75.7% (95% CI: 74.2 to 77.1).18

The regression model did not reveal a main effect of “paper” on accuracy (p=.883), indicating that there was no significant difference between GPT-4’s raw accuracy as reported by [47] and GPT-4’s raw accuracy as performed in the replication here.

There was also no main effect of temperature (p>.1)19 or prompt (p=.741). That is, GPT-4’s raw accuracy was not significantly higher or lower at a given temperature setting or when fed a certain prompt as opposed to another (among the two prompts used in [47] and the replication here).

Table 5: GPT-4’s MBE performance across temperature and prompt settings

4.2. Assessing the Effect of Hyperparameters

4.2.1. Methods

Although the above analysis found no effect of prompt on model performance, this could be due to a lack of variety of prompts used by [47] in their original analysis.

To get a better sense of whether prompt engineering might have any effect on model performance, a follow-up experiment compared GPT-4’s performance in two novel conditions not tested in the original [47] paper.

In Condition 1 (“minimally tailored” condition), GPT-4 was tested using minimal prompting compared to [47], both in terms of formatting and substance. In particular, the message prompt in [47] and the above replication followed OpenAI’s Best practices for prompt engineering with the API [66] through the use of: (a) helpful markers (e.g., “””) to separate instruction and context; (b) details regarding the desired output (i.e., specifying that the response should include ranked choices, as well as [in some cases] proper authority and citation); (c) an explicit template for the desired output (providing an example of the format in which GPT-4 should provide its response); and (d) perhaps most crucially, context regarding the type of question GPT-4 was answering (e.g., “please respond as if you are taking the bar exam”).

In contrast, in the minimally tailored prompting condition, the message prompt for a given question simply stated “Please answer the following question,” followed by the question and answer choices (a technique sometimes referred to as “basic prompting” [67]). No additional context or formatting cues were provided.

In Condition 2 (“maximally tailored” condition), GPT-4 was tested using the highest performing parameter combinations as revealed in the replication section above, with one addition: the system prompt, similar to the approaches used in [67, 68], was edited from its default (“You are a helpful assistant”) to a more tailored message that included multiple example MBE questions with sample answers and explanations structured in the desired format (a technique sometimes referred to as “few-shot prompting” [67]).
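To make the contrast between the two conditions concrete, the following R sketch shows how the two sets of chat messages might be constructed. The example question, few-shot material, and exact wording are hypothetical illustrations, not the actual prompts used in [47] or in this study.

question_text <- "<MBE question text and answer choices (A)-(D) go here>"

# Condition 1: minimally tailored ("basic") prompting, default system prompt
minimal_messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user",   content = paste("Please answer the following question.",
                                        question_text))
)

# Condition 2: maximally tailored prompting, few-shot examples in the system prompt
tailored_system <- paste(
  "You are taking the Multistate Bar Examination.",
  "Here are example questions with sample answers and explanations in the desired format:",
  "<several worked MBE examples go here>"
)
tailored_messages <- list(
  list(role = "system", content = tailored_system),
  list(role = "user",   content = paste("Please respond as if you are taking the bar exam.",
                                        question_text))
)

# Either message list would then be sent to the chat completions API
# (e.g., via httr::POST to https://api.openai.com/v1/chat/completions)
# with model = "gpt-4" and temperature = 0.5.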

As in the replication section, 5 trials were conducted for each of the two conditions. Based on the lack of effect of temperature in the replication study, temperature was not a manipulated variable. Instead, both conditions featured the same temperature setting (.5).

To assess whether there was a significant difference between GPT-4’s accuracy in the maximally tailored vs. minimally tailored conditions, a mixed-effects binary logistic regression was conducted with: (a) condition as a fixed effect; and (b) question number and question category as random effects. As above, these regressions were conducted using the lme4 [64] and lmerTest [65] packages in R.

4.2.2. Results

Mean MBE accuracy across all trials was descriptively higher in the maximally tailored condition, at 79.5% (95% CI: 77.1 to 82.1), than in the minimally tailored condition, at 70.9% (95% CI: 68.1 to 73.7).

Fig. 1: GPT-4’s MBE Accuracy in minimally tailored vs. maximally tailored prompting conditions. Bars reflect the mean accuracy. Lines correspond to 95% bootstrapped confidence intervals

The regression model revealed a main effect of condition on accuracy (β=1.395, SE=.192, p<.0001), such that GPT-4’s accuracy in the maximally tailored condition was significantly higher than its accuracy in the minimally tailored condition.

In terms of scaled score, GPT-4’s MBE score in the minimally tailored condition would be approximately 150, which would place it: (a) in the 70th percentile among July test takers; (b) 64th percentile among first-timers; and (c) 48th percentile among those who passed.

GPT-4’s score in the maximally tailored condition would be approximately 164 (6 points higher than the 158 reported by [47] and [1]). This would place it: (a) in the 95th percentile among July test takers; (b) 87th percentile among first-timers; and (c) 82nd percentile among those who passed.

4.3. Re-examining the Essay Scores

As confirmed in the above subsection, the scaled MBE score (not percentile) reported by OpenAI was accurately computed using the methods documented in [47].

With regard to the essays (MPT + MEE), however, the method described by the authors deviates significantly in at least three aspects from the official method used by UBE states, to the point where one may not be confident that the essay scores reported by the authors reflect GPT models’ “true” essay scores (i.e., the scores that official examiners would have assigned to GPT’s essays had they been blindly scored under the official grading protocol).

The first aspect relates to the (lack of) use of a formal rubric. Unlike NCBE protocol, which provides graders with (a) (in the case of the MEE) detailed “grading guidelines” for how to assign grades to essays and distinguish answers for a given MEE; and (b) (for both MEE and MPT) a specific “drafters’ point sheet” for each essay that includes detailed guidance from the drafting committee with a discussion of the issues raised and the intended analysis [69], [47] do not report using an official or unofficial rubric of any kind, and instead simply describe comparing GPT-4’s answers to representative “good” answers from the state of Maryland.

Utilizing these answers as the basis for grading GPT-4’s answers in lieu of a formal rubric would seem to be particularly problematic considering it is unclear even what score these representative “good” answers received. As clarified by the Maryland bar examiners: “The Representative Good Answers are not ‘average’ passing answers nor are they necessarily ‘perfect’ answers. Instead, they are responses which, in the Board’s view, illustrate successful answers written by applicants who passed the UBE in Maryland for this session” [70].

Given that (a) it is unclear what score these representative good answers received; and (b) these answers appear to be the basis for determining the score that GPT-4’s essays received, it follows that (c) it is likewise unclear what score GPT-4’s answers should receive. Consequently, any reported scaled score or percentile seems insufficiently justified to serve as the basis for a conclusive statement regarding GPT-4’s essay performance relative to humans (e.g., a reported percentile).

The second aspect relates to the lack of NCBE training of the graders of the essays. Official NCBE essay grading protocol mandates the use of trained bar exam graders, who in addition to using a specific rubric for each question undergo a standardized training process prior to grading [71, 72]. In contrast, the graders in [47] (a subset of the authors, who were trained lawyers) do not report expertise or training in bar exam grading. Thus, although the graders of the essays were no doubt experts in legal reasoning more broadly, it seems unlikely that they would have been sufficiently versed in the specific grading protocols of the MEE + MPT to reliably infer or apply the specific grading rubric when assigning raw scores to GPT-4.

The third aspect relates to both blinding and what bar examiners refer to as “calibration,” as UBE jurisdictions use an extensive procedure to ensure that graders grade essays in a consistent manner (both across essays and across graders) [71, 72]. In particular, all graders in a given jurisdiction first blindly grade a set of 30 “calibration” essays of variable quality (first rank-ordering them, then assigning absolute scores) to ensure that different graders assign consistent scores and that the same score (e.g., 5 of 6) is assigned to exams of similar quality [72]. Unlike this approach, as well as efforts to assess GPT models’ law school performance [73], the method reported by [47] did not initially involve blinding. The method in [47] did involve a form of inter-grader calibration, as the authors gave “blinded samples” to independent lawyers to grade, with the assigned scores “match[ing] or exceed[ing]” those assigned by the authors. Given the lack of reporting to the contrary, however, this grading would presumably be subject to the same issues highlighted above (no rubric, no formal training in bar exam grading, no formal intra-grader calibration).

Given the above issues, as well as the fact that, as alluded to in the introduction, GPT-4’s performance boost over GPT-3.5 on other essay-based exams was far lower than its boost on the bar exam, it seems warranted to infer not only that GPT-4’s relative performance (in terms of percentile among human test-takers) was lower than that reported by OpenAI, but also that GPT-4’s reported scaled score on the essays may have deviated to some degree from GPT-4’s “true” essay score (which, if true, would imply that GPT-4’s “true” percentile on the bar exam may be even lower than that estimated in previous sections).

Indeed, [47] to some degree acknowledge all of these limitations in their paper, writing: “While we recognize there is inherent variability in any qualitative assessment, our reliance on the state bars’ representative “good” answers and the multiple reviewers reduces the likelihood that our assessment is incorrect enough to alter the ultimate conclusion of passage in this paper.”

Given that GPT-4’s reported score of 298 is 28 points higher than the passing threshold (270) in the majority of UBE jurisdictions, it is true that the essay scores would have to have been wildly inaccurate to undermine the general conclusion of [47] (i.e., that GPT-4 “passed the [uniform] bar exam”). However, even supposing that GPT-4’s “true” percentile on the essay portion was just a few points lower than that reported by OpenAI, this would further call into question OpenAI’s claims regarding GPT-4’s performance on the UBE relative to human test-takers. For example, supposing that GPT-4 scored 9 points lower on the essays, its estimated relative performance would drop to (a) the 31st percentile compared to July test-takers; (b) the 24th percentile relative to first-time test takers; and (c) less than the 5th percentile compared to licensed attorneys.

5. Discussion

This paper first investigated the issue of OpenAI’s claim of GPT-4’s 90th percentile UBE performance, resulting in four main findings. The first finding is that although GPT-4’s UBE score approaches the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards low scorers, as the majority of test-takers in February failed the July administration and tend to score much lower than the general test-taking population. The second finding is that using July data from the same source would result in an estimate of ~68th percentile, including below average performance on the essay portion. The third finding is that comparing GPT-4’s performance against first-time test takers would result in an estimate of ~62nd percentile, including ~42nd percentile on the essay portion. The fourth main finding is that when examining only those who passed the exam, GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays.

In addition to these four main findings, the paper also investigated the validity of GPT-4’s reported UBE score of 298. Although the paper successfully replicated the MBE score of 158, the paper also highlighted several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the essay score (140).

Finally, the paper also investigated the effect of adjusting temperature settings and prompting techniques on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings on performance, and some effect of prompt engineering when compared to a basic prompting baseline condition.

Of course, assessing the capabilities of an AI system as compared to those of a practicing lawyer is no easy task. Scholars have identified several theoretical and practical difficulties in creating accurate measurement scales to assess AI capabilities and have pointed out various issues with some of the current scales [10–12]. Relatedly, some have pointed out that simply observing that GPT-4 under- or over-performs at a task in some setting is not necessarily reliable evidence that it (or some other LLM) is capable or incapable of performing that task in general [13–15].

In the context of the legal profession specifically, there are various reasons to doubt the usefulness of the UBE percentile as a proxy for lawyerly competence (both for humans and AI systems), given that, for example: (a) the content on the UBE is very general and does not pertain to the legal doctrine of any jurisdiction in the United States [16], and thus knowledge (or ignorance) of that content does not necessarily translate to knowledge (or ignorance) of relevant legal doctrine for a practicing lawyer of any jurisdiction; (b) the tasks involved on the bar exam, particularly multiple-choice questions, do not reflect the tasks of practicing lawyers, and thus mastery (or lack of mastery) of those tasks does not necessarily reflect mastery (or lack of mastery) of the tasks of practicing lawyers; and (c) given the lack of direct professional incentive to obtain higher than a passing score (typically no higher than 270) [62], obtaining a particularly high score or percentile past this threshold is less meaningful than for other exams (e.g., the LSAT), where higher scores are taken into account for admission into select institutions [74].

Setting these objections aside, however, to the extent that one believes the UBE to be a valid proxy for lawyerly competence, these results suggest GPT-4 to be substantially less lawyerly competent than previously assumed, as GPT-4’s score against likely attorneys (i.e. those who actually passed the bar) is ~48th percentile. Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams [75].

The lack of precision and transparency in OpenAI’s reporting of GPT-4’s UBE performance has implications for both the current state of the legal profession and the future of AI safety. On the legal side, there appear to be at least two sets of implications. On the one hand, to the extent that lawyers put stock in the bar exam as a proxy for general legal competence, the results might give practicing lawyers at least a mild temporary sense of relief regarding the security of the profession, given that the majority of lawyers perform better than GPT on the component of the exam (essay-writing) that seems to best reflect their day-to-day activities (and by extension, the tasks that would likely need to be automated in order to supplant lawyers in their day-to-day professional capacity).

On the other hand, the fact that GPT-4’s reported “90th percentile” capabilities were so widely publicized might raise concerns that lawyers and non-lawyers alike may use GPT-4 for complex legal tasks that it is incapable of performing adequately, plausibly increasing the rate of (a) misapplication of the law by judges; (b) professional malpractice by lawyers; and (c) ineffective pro se representation and/or unauthorized practice of law by non-lawyers. From a legal education standpoint, law students who overestimate GPT-4’s UBE capabilities might also develop an unwarranted sense of apathy towards developing critical legal-analytical skills, particularly if they are under the impression that GPT-4’s mastery of those skills already surpasses the level that a typical law student could be expected to reach.

On the AI front, these findings raise concerns both for the transparency20 of capabilities research and the safety of AI development more generally. In particular, to the extent that one considers transparency to be an important prerequisite for safety [38], these findings underscore the importance of implementing rigorous transparency measures so as to reliably identify potential warning signs of transformative progress in artificial intelligence as opposed to creating a false sense of alarm or security [76]. Implementing such measures could help ensure that AI development, as stated in OpenAI’s charter, is a “value-aligned, safety-conscious project” as opposed to becoming “a competitive race without time for adequate safety precautions” [77].

Of course, the present study does not discount the progress that AI has made on legally relevant tasks; after all, the improvement in UBE performance from GPT-3.5 to GPT-4 as estimated in this study remains impressive (arguably equally or even more so, given that GPT-3.5’s performance is also estimated to be significantly lower than previously assumed), even if not as flashy as the 10th-to-90th percentile leap suggested by OpenAI’s official estimates. Nor does the present study discount the seemingly inevitable future improvement of AI systems to levels far beyond their present capabilities, or, as phrased in GPT-4 Passes the Bar Exam, that the present capabilities “highlight the floor, not the ceiling, of future application” [47, p. 11].

To the contrary, given the rapid pace of improvement in AI systems, the results of the present study underscore the importance of implementing rigorous and transparent evaluation measures to ensure that both the general public and relevant decision-makers are made appropriately aware of such systems’ capabilities, and to prevent these systems from being used in an unintentionally harmful or catastrophic manner. The results also indicate that law schools and the legal profession should prioritize instruction in areas such as law and technology and law and AI, which, despite their importance, are currently not viewed as descriptively or normatively central to the legal academy [78].
