Research Article | February 2025

The role of compute thresholds for AI governance

Matteo Pistillo, Suzanne Van Arsdale, Lennart Heim, Christoph Winter

This piece was originally published in The George Washington Journal of Law & Technology.

I. Introduction

The idea of establishing a “compute threshold” and, more precisely, a “training compute threshold” has recently attracted significant attention from policymakers and commentators. In recent years, various scholars and AI labs have supported setting such a threshold,1 as have governments around the world. On October 30, 2023, President Biden’s Executive Order 14,110 on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence introduced the first operative example of a compute threshold,2 although it was one of many orders revoked by President Trump upon taking office.3 On June 13, 2024, the European Parliament and the Council of the European Union adopted the Artificial Intelligence Act, which provides for the establishment of a compute threshold.4 On February 4, 2024, California State Senator Scott Wiener introduced Senate Bill 1047, which defined frontier AI models by reference to a compute threshold.5 The bill was approved by the California Legislature but ultimately vetoed by the State’s Governor.6 China may be considering similar measures, as indicated by recent discussions in policy circles.7 While not perfect, compute thresholds are currently one of the best options available to identify potentially high-risk models and trigger further scrutiny. Yet, despite this, information about compute thresholds and their relevance from a policy and legal perspective remains dispersed.

This Article proceeds in two parts. Part I provides a technical overview of compute and how the amount of compute used in training corresponds to model performance and risk. It begins by explaining what compute is and the role compute plays in AI development and deployment. Compute refers to both computational infrastructure, the hardware necessary to develop and deploy an AI system, and the amount of computational power required to train a model, commonly measured in integer or floating-point operations. More compute is used to train notable models each year, and although the cost of compute has decreased, the amount of compute used for training has increased at a higher rate, causing training costs to increase dramatically.8 This increase in training compute has contributed to improvements in model performance and capabilities, described in part by scaling laws. As models are trained on more data, with more parameters and training compute, they grow more powerful and capable. As advances in AI continue, capabilities may emerge that pose potentially catastrophic risks if not mitigated.9

Part II discusses why, in light of this risk, compute thresholds may be important to AI governance. Since training compute can serve as a proxy for the capabilities of AI models, a compute threshold can operate as a regulatory trigger, identifying what subset of models might possess more powerful and dangerous capabilities that warrant greater scrutiny, such as in the form of reporting and evaluations. Both the European Union AI Act and Executive Order 14,110 established compute thresholds for different purposes, and many more policy proposals rely on compute thresholds to ensure that the scope of covered models matches the nature or purpose of the policy. This Part provides an overview of policy proposals that expressly call for such a threshold, as well as proposals that could benefit from the addition of a compute threshold to clarify the scope of policies that refer broadly to “advanced systems” or “systems with dangerous capabilities.” It then describes how, even absent a formal compute threshold, courts and regulators might rely on training compute as a proxy for how much risk a given AI system poses, even under existing law. This Part concludes with the advantages and limitations of using compute thresholds as a regulatory trigger.

II. Compute and the Scaling Hypothesis

A. What Is “Compute”?

The term “compute” serves as an umbrella term, encompassing several meanings that depend on context.

Commonly, the term “compute” is used to refer to computational infrastructure, i.e., the hardware stacks necessary to develop and deploy AI systems.10 Many hardware elements are integrated circuits (also called chips or microchips), such as logic chips, which perform operations, and memory chips, which store the information on which logic devices perform calculations.11 Logic chips cover a spectrum of specialization, ranging from general-purpose central processing units (“CPUs”), through graphics processing units (“GPUs”) and field-programmable gate arrays (“FPGAs”), to application-specific integrated circuits (“ASICs”) customized for specific algorithms.12 Memory chips include dynamic random-access memory (“DRAM”), static random-access memory (“SRAM”), and NOT AND (“NAND”) flash memory used in many solid state drives (“SSDs”).13

Additionally, the term “compute” is often used to refer to how much computational power is required to train a specific AI system. Whereas the computational performance of a chip refers to how quickly it can execute operations and thus generate results, solve problems, or perform specific tasks, such as processing and manipulating data or training an AI system, “compute” refers to the amount of computational power used by one or more chips to perform a task, such as training a model. Compute is commonly measured in integer operations or floating-point operations (“OP” or “FLOP”),14 expressing the number of operations that have been executed by one or more chips, while the computational performance of those chips is measured in operations per second (“OP/s” or “FLOP/s”). In this sense, the amount of computational power used is roughly analogous to the distance traveled by a car.15 Since large amounts of compute are used in modern computing, values are often reported in scientific notation such as 1e26 or 2e26, which refer to 1·10^26 and 2·10^26, respectively.
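
For a concrete sense of the units, the following sketch (a minimal Python example with purely hypothetical numbers) treats total compute as the product of a sustained processing rate and time, mirroring the speed-and-distance analogy above.

    # Minimal sketch with illustrative (hypothetical) numbers: total compute is the
    # product of a sustained processing rate (FLOP/s) and time, just as distance is
    # the product of speed and time.
    sustained_rate_flop_per_s = 5e17   # assumed cluster-wide sustained rate
    training_days = 90                 # assumed training duration

    total_training_flop = sustained_rate_flop_per_s * training_days * 24 * 3600
    print(f"Total training compute: {total_training_flop:.1e} FLOP")  # ~3.9e24 FLOP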

Compute is essential throughout the AI lifecycle. The AI lifecycle can be broken down into two phases: development and deployment.16 In the first phase, development, developers design the model by choosing an architecture (the structure of the network) and initial values for hyperparameters (i.e., parameters that control the learning process, such as the number of layers and the learning rate).17 Enormous amounts of data, usually from publicly available sources, are processed and curated to produce high-quality datasets for training.18 The model then undergoes “pre-training,” in which the model is trained on a large and diverse dataset in order to build its general knowledge and features, which are reflected in the model’s weights and biases.19 Alternatively, developers may use an existing pre-trained model, such as OpenAI’s GPT-4 (“Generative Pre-trained Transformer 4”). The term “foundation model” refers to models like these, which are trained on broad data and adaptable to many downstream tasks.20 Performance and capabilities improvements are then possible using methods such as fine-tuning on task-specific datasets, reinforcement learning from human feedback (“RLHF”), teaching the model to use tools, and instruction tuning.21 These enhancements are far less compute-intensive than pre-training, particularly for models trained on massive datasets.22

As of this writing, there is no agreed-upon standard for measuring “training compute.” Estimates of “training compute” typically refer only to the amount of compute used during pre-training. More specifically, they refer to the amount of compute used during the final pre-training run (the run that produces the final machine learning model) and exclude any earlier test runs and post-training enhancements, such as fine-tuning.23 There are exceptions: for instance, the EU AI Act considers the cumulative amount of compute used for training by including all the compute “used across the activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning.”24 California Senate Bill 1047 addressed post-training modifications generally and fine-tuning in particular, providing that a covered model fine-tuned with more than 3e25 OP or FLOP would be considered a distinct “covered model,” while one fine-tuned on less compute or subjected to unrelated post-training modifications would be considered a “covered model derivative.”25

In the second phase, deployment, the model is made available to users and is used.26 Users provide input to the model, such as in the form of a prompt, and the model makes predictions from this input in a process known as “inference.”27 The amount of compute needed for a single inference request is far lower than what is required for a training run.28 However, for systems deployed at scale, the cumulative compute used for inference can surpass training compute by several orders of magnitude.29 Consider, for instance, a large language model (“LLM”). During training, a large amount of compute is used over a comparatively short period within a closed system, usually a supercomputer. Once the model is deployed, each text generation leverages its own copy of the trained model, which can be run on separate compute infrastructure. The model may serve hundreds of millions of users, each generating unique content and using compute with each inference request. Over time, the cumulative compute usage for inference can surpass the total compute required for training.
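
The following back-of-the-envelope sketch illustrates this dynamic; every figure is hypothetical and chosen only to show how cumulative inference compute can overtake training compute at scale.

    # Back-of-the-envelope comparison; every number below is hypothetical.
    training_flop = 1e25              # assumed training compute
    flop_per_request = 2e14           # assumed compute for one inference request
    requests_per_day = 5e8            # assumed daily request volume at scale

    daily_inference_flop = flop_per_request * requests_per_day
    days_to_match_training = training_flop / daily_inference_flop
    print(f"Days of deployment to match training compute: {days_to_match_training:.0f}")  # 100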

There are various reasons to consider compute usage at different stages of the AI lifecycle, as discussed in Section I.E. For clarity, this Article uses “training compute” for compute used during the final pre-training run and “inference compute” for compute used by the model during a single inference, measured in the number of operations (“OP” or “FLOP”). Figure 1 illustrates a simplified version of the language model compute lifecycle.


Figure 1: Simplified language model lifecycle

B. What Is Moore’s Law and Why Is It Relevant for AI?

In 1965, Gordon Moore forecasted that the number of transistors on an integrated circuit would double every year.30 Ten years later, Moore revised his initial forecast to a two-year doubling period.31 This pattern of exponential growth is now called “Moore’s Law.”32 Similar rates of growth have been observed in related metrics, notably including the increase in computational performance of supercomputers;33 as the number of transistors on a chip increases, so does computational performance (although other factors also play a role).34

A corollary of Moore’s Law is that the cost of compute has fallen dramatically; a dollar can buy more FLOP every year.35 Greater access to compute, along with greater spending from 2010 onwards (i.e., the so-called deep learning era),36 has contributed to developers using ever more compute to train AI systems. Research has found that the compute used to train notable and frontier models has grown by 4–5x per year between 2010 and May 2024.37


Figure 2: Compute used to train notable AI systems from 1950 to 202338

However, the current rate of growth in training compute may not be sustainable. Scholars have cited the cost of training,39 a limited supply of AI chips,40 technical challenges with using that much hardware (such as managing the number of processors that must run in parallel to train larger models),41 and environmental impact42 as factors that could constrain the growth of training compute. Research in 2018 with data from OpenAI estimated that then-current trends of growth in training compute could be sustained for at most 3.5 to 10 years (2022 to 2028), depending on spending levels and how the cost of compute evolves over time.43 In 2022, that analysis was replicated with a more comprehensive dataset and suggested that this trend could be maintained for longer, for 8 to 18 years (2030 to 2040) depending on compute cost-performance improvements and specialized hardware improvements.44

C. What Are “Scaling Laws” and What Do They Say About AI Models?

Scaling laws describe the functional (mathematical) relationship between the amount of training compute and the performance of the AI model.45 In this context, performance is a technical metric that quantifies “loss,” which is the amount of error in the model’s predictions. When loss is measured on a test or validation set that uses data not part of the training set, it reflects how well the model has generalized its learning from the training phase. The lower the loss, the more accurate and reliable the model is in making predictions on data it has not encountered during its training.46 As training compute increases, alongside increases in parameters and training data, so does model performance, meaning that greater training compute reduces the errors made.47 Increased training compute also corresponds to an increase in capabilities.48 Whereas performance refers to a technical metric, such as test loss, capabilities refer to the ability to complete concrete tasks and solve problems in the real world, including in commercial applications.49 Capabilities can also be assessed using practical and real-world tests, such as standardized academic or professional licensing exams, or with benchmarks developed for AI models. Common benchmarks include “Beyond the Imitation Game” (“BIG-Bench”), which comprises 204 diverse tasks that cover a variety of topics and languages,50 and the “Massive Multitask Language Understanding” benchmark (“MMLU”), a suite of multiple-choice questions covering 57 subjects.51 To evaluate the capabilities of Google’s PaLM 2 and OpenAI’s GPT-4, developers relied on BIG-Bench and MMLU as well as exams designed for humans, such as the SAT and AP exams.52
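
To illustrate what such a functional relationship can look like, one widely used form from the scaling-law literature (presented here only as an illustration, not as the specific formulation relied upon by the sources cited in this Article) models test loss L as a sum of power-law terms in the number of parameters N and the number of training tokens D, where E, A, B, α, and β are fitted constants and lower loss is better:

    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

As N and D grow together, which requires more training compute, the power-law terms shrink and the loss approaches the irreducible term E.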

Training compute has a relatively smooth and consistent relationship with technical metrics like training loss. Training compute also corresponds to real-world capabilities, but not in a smooth and predictable way. This is due in part to occasional surprising leaps, discussed in Section I.D, and subsequent enhancements such as fine-tuning, which can further increase capabilities using far less compute.53 Despite being unable to provide a full and accurate picture of a model’s final capabilities, training compute still provides a reasonable basis for estimating the base capabilities (and corresponding risk) of a foundation model. Figure 3 shows the relationship between an increase in training compute and dataset size, and performance on the MMLU benchmark.


Figure 3: Relationship between increase in training compute and dataset size,
and performance on MMLU54

In light of the correlation between training compute and performance, the “scaling hypothesis” states that scaling training compute will predictably continue to produce even more capable systems, and thus more compute is important for AI development.55 Some have taken this hypothesis further, proposing a “Bitter Lesson”: that “the only thing that matters in the long run is the leveraging of comput[e].”56 Since the emergence of the deep learning era, this hypothesis has been sustained by the increasing use of AI models in commercial applications, whose development and commercial success have been significantly driven by increases in training compute.57

Two factors weigh against the scaling hypothesis. First, scaling laws describe more than just the performance improvements based on training compute; they describe the optimal ratio of the size of the dataset, the number of parameters, and the training compute budget.58 Thus, a lack of abundant or high-quality data could be a limiting factor. Researchers estimate that, if training datasets continue to grow at current rates, language models will fully utilize human-generated public text data between 2026 and 2032,59 while image data could be exhausted between 2030 and 2060.60 Specific tasks may be bottlenecked earlier by the scarcity of high-quality data sources.61 There are, however, several ways that data limitations might be delayed or avoided, such as synthetic data generation and using additional datasets that are not public or in different modalities.62

Second, algorithmic innovation permits performance gains that would otherwise require prohibitively expensive amounts of compute.63 Research estimates that every 9 months, improved algorithms for image classification64 and LLMs65 contribute the equivalent of a doubling of training compute budgets. Algorithmic improvements include more efficient utilization of data66 and parameters, the development of improved training algorithms, or new architectures.67 Over time, the amount of training compute needed to achieve a given capability is reduced, and it may become more difficult to predict performance and capabilities on that basis (although scaling trends of new algorithms could be studied and perhaps predicted). The governance implications of this are multifold, including that increases in training compute may become less important for AI development and that many more actors will be able to access the capabilities previously restricted to a limited number of developers.68 Still, responsible frontier AI development may enable stakeholders to develop understanding, safety practices, and (if needed) defensive measures for the most advanced AI capabilities before these capabilities proliferate.

D. Are High-Compute Systems Dangerous?

Advances in AI could deliver immense opportunities and benefits across a wide range of sectors, from healthcare and drug discovery69 to public services.70 However, more capable models may come with greater risk, as improved capabilities could be used for harmful and dangerous ends. While the degree of risk posed by current AI models is a subject of debate,71 future models may pose catastrophic and existential risks as capabilities improve.72 Some of these risks are expected to be closely connected to the unexpected emergence of dangerous capabilities and the dual-use nature of AI models.

As discussed in Section I.C, increases in compute, data, and the number of parameters lead to predictable improvements in model performance (test loss) and general but somewhat less predictable improvements in capabilities (real-world benchmarks and tasks). However, scaling up these inputs to a model can also result in qualitative changes in capabilities in a phenomenon known as “emergence.”73 That is, a larger model might unexpectedly display emergent capabilities not present in smaller models, suddenly able to perform a task that smaller models could not.74 During the development of GPT-3, early models had close-to-zero performance on a benchmark for addition, subtraction, and multiplication. Arithmetic capabilities appeared to emerge suddenly in later models, with performance jumping substantially above random at 2·10^22 FLOP and continuing to improve with scale.75 Similar jumps were observed at different thresholds, and for different models, on a variety of tasks.76

Some have contested the concept of emergent capabilities, arguing that what appear to be emergent capabilities in large language models are explained by the use of discontinuous measures, rather than by sharp and unpredictable improvements or developments in model capabilities with scale.77 However, discontinuous measures are often meaningful, as when the correct answer or action matters more than how close the model gets to it. As Anderljung and others explain: “For autonomous vehicles, what matters is how often they cause a crash. For an AI model solving mathematics questions, what matters is whether it gets the answer exactly right or not.”78 Given the difficulties inherent in choosing an appropriate continuous measure and determining how it corresponds to the relevant discontinuous measure,79 it is likely that capabilities will continue to seemingly emerge.

Together with emerging capabilities come emerging risks. Like many other innovations, AI systems are dual-use by nature, with the potential to be used for both beneficial and harmful ends.80 Executive Order 14,110 recognized that some models may “pose a serious risk to security, national economic security, national public health or safety” by “substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear weapons; enabling powerful offensive cyber operations . . . ; [or] permitting the evasion of human control or oversight through means of deception or obfuscation.”81

Predictions and evaluations will likely identify many capabilities before deployment, allowing developers to take appropriate precautions. However, systems trained at a greater scale may possess novel capabilities, or improved capabilities that surpass a critical threshold for risk, yet go undetected by evaluations.82 Some of these capabilities may appear to emerge only after post-training enhancements, such as fine-tuning or more effective prompting methods. A system may be capable of conducting offensive cyber operations, manipulating people in conversation, or providing actionable instructions on conducting acts of terrorism,83 and still be deployed without the developers fully comprehending unexpected and potentially harmful behaviors. Research has already detected unexpected behavior in current models. For instance, during the U.K. AI Safety Summit on November 1, 2023, Apollo Research showed that GPT-4 can take illegal actions like insider trading and then lie about its actions without being instructed to do so.84 Since the capabilities of future foundation models may be challenging to predict and evaluate, “emergence” has been described as “both the source of scientific excitement and anxiety about unanticipated consequences.”85

Not all risks come from large models. Smaller models trained on data from certain domains, such as biology or chemistry, may pose significant risks if repurposed or misused.86 When MegaSyn, a generative molecule design tool used for drug discovery, was repurposed to find the most toxic molecules instead of the least toxic, it found tens of thousands of candidates in under six hours, including known biochemical agents and novel compounds predicted to be as deadly or deadlier.87 The amount of compute used to train DeepMind’s AlphaFold, which predicts three-dimensional protein structures from the protein sequence, is minimal compared to frontier language models.88 While scaling laws can be observed in a variety of domains, the amount of compute required to train models in some domains may be so low that a compute threshold is not a practical restriction on capabilities.

Broad consensus is forming around the need to test, monitor, and restrict systems of concern.89 The role of compute thresholds, and whether they are used at all, depends on the nature of the risk and the purpose of the policy: does it target risks from emergent capabilities of frontier models,90 risks from models with more narrow but dangerous capabilities,91 or other risks from AI?

E. Does Compute Usage Outside of Training Influence Performance and Risk?

In light of the relationship between training compute and performance expressed by scaling laws, training compute is a common proxy for how capable and powerful AI models are and the risks that they pose.92 However, compute used outside of training can also influence performance, capabilities, and corresponding risk.

As discussed in Section I.A, training compute typically does not refer to all compute used during development, but is instead limited to compute used during the final pre-training run.93 This definition excludes subsequent (post-training) enhancements, such as fine-tuning and prompting methods, which can significantly improve capabilities (see supra Figure 1) using far less compute; many current methods can improve capabilities by the equivalent of a 5x increase in training compute, while some can improve them by more than 20x.94

The focus on training compute also misses the significance of compute used for inference, in which the trained model generates output in response to a prompt or new input data.95 Inference is the biggest compute cost for models deployed at scale, due to the frequency and volume of requests they handle.96 While developing an AI model is far more computationally intensive than a single inference request, it is a one-time task. In contrast, once a model is deployed, it may receive numerous inference requests that, in aggregate, exceed the compute expenditures of training. Some have even argued that inference compute could become a bottleneck in scaling AI if inference costs, which scale alongside training compute, grow too large.97

Greater availability of inference compute could enhance malicious uses of AI by allowing the model to process data more rapidly and enabling the operation of multiple instances in parallel. For example, AI could more effectively be used to carry out cyber attacks, such as a distributed denial-of-service (“DDoS”) attack,98 to manipulate financial markets,99 or to increase the speed, scale, and personalization of disinformation campaigns.100

Compute used outside of development may also impact model performance. Specifically, some techniques can increase the performance of a model at the cost of more compute used during inference.101 Developers could therefore choose to improve a model beyond its current capabilities or to shift some compute expenditures from training to inference, in order to obtain equally-capable systems with less training compute. Users could also prompt a model to use similar techniques during inference, for example by (1) using “few-shot” prompting, in which initial prompts provide the model with examples of the desired output for a type of input,102 (2) using chain-of-thought prompting, which uses few-shot prompting to provide examples of reasoning,103 or (3) simply providing the same prompt multiple times and selecting the best result. Some user-side techniques to improve performance might increase the compute used during a single inference, while others would leave it unchanged (while still increasing the total compute used, due to multiple inferences being performed).104 Meanwhile, other techniques—such as pruning,105 weight sharing,106 quantization,107 and distillation108—can reduce compute used during inference while maintaining or even improving performance, and they can further reduce inference compute at the cost of lower performance.
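
As a simple illustration of the last of these techniques, the hypothetical sketch below submits the same prompt several times and keeps the highest-scoring completion; the generate and score functions are placeholders rather than any particular provider’s API, and each call to generate represents one additional inference.

    # Hypothetical best-of-n sampling: per-inference compute is unchanged, but total
    # inference compute grows roughly linearly with n. `generate` and `score` are
    # placeholders supplied by the caller, not a real API.
    def best_of_n(generate, score, prompt, n=5):
        candidates = [generate(prompt) for _ in range(n)]  # n separate inference requests
        return max(candidates, key=score)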

Beyond model characteristics such as parameter count, other factors can also affect the amount of compute used during inference in ways that may or may not improve performance, such as input size (compare a short prompt to a long document or high-resolution image) and batch size (compare one input provided at a time to many inputs in a single prompt).109 Thus, for a more accurate indication of model capabilities, compute used to run a single inference110 for a given set of prompts could be considered alongside other factors, such as training compute. However, doing so may be impractical, as data about inference compute (or architecture useful for estimating it) is rarely published by developers,111 different techniques could make inference more compute-efficient, and less information is available regarding the relationship between inference compute and capabilities.

While companies might be hesitant to increase inference compute at scale due to cost, doing so may still be worthwhile in certain circumstances, such as for more narrowly deployed models or for customers willing to pay more for improved capabilities. For example, OpenAI offers dedicated instances for users who want more control over system performance, with a reserved allocation of compute infrastructure and the ability to enable features such as longer context limits.112

Over time, compute usage during the AI development and deployment process may change. It was previously common practice to train models with supervised learning, which uses annotated datasets. In recent years, there has been a rise in self-supervised, semi-supervised, and unsupervised learning, which use data with limited or no annotation but require more compute.113 

III. The Role of Compute Thresholds for AI Governance

A. How Can Compute Thresholds Be Used in AI Policy?

Compute can be used as a proxy for the capabilities of AI systems, and compute thresholds can be used to define the limited subset of high-compute models subject to oversight or other requirements.114 Their use depends on the context and purpose of the policy. Compute thresholds serve as intuitive starting points to identify potential models of concern,115 perhaps alongside other factors.116 They operate as a trigger for greater scrutiny or specific requirements. Once a certain level of training compute is reached, a model is presumed to have a higher risk of displaying dangerous capabilities (and especially unknown dangerous capabilities) and, hence, is subject to stricter oversight and other requirements.

Compute thresholds have already entered AI policy. The EU AI Act requires model providers to assess and mitigate systemic risks, conduct state-of-the-art tests and model evaluations, ensure cybersecurity, and report serious incidents if a compute threshold is crossed.117 Under the EU AI Act, a general-purpose model that meets the initial threshold is presumed to have high-impact capabilities and associated systemic risk.118

In the United States, Executive Order 14,110 directed agencies to propose rules based on compute thresholds. Although it was revoked by President Trump’s Executive Order 14,148,119 many actions have already been taken and rules have been proposed for implementing Executive Order 14,110. For instance, the Department of Commerce’s Bureau of Industry and Security issued a proposed rule on September 11, 2024120 to implement the requirement that AI developers and cloud service providers report on models above certain thresholds, including information about (1) “any ongoing or planned activities related to training, developing, or producing dual-use foundation models,” (2) the results of red-teaming, and (3) the measures the company has taken to meet safety objectives.121 The executive order also imposed know-your-customer (“KYC”) monitoring and reporting obligations on U.S. cloud infrastructure providers and their foreign resellers, again with a preliminary compute threshold.122 On January 29, 2024, the Bureau of Industry and Security issued a proposed rule implementing those requirements.123 The proposed rule noted that training compute thresholds may determine the scope of the rule; the program is limited to foreign transactions to “train a large AI model with potential capabilities that could be used in malicious cyber-enabled activity,” and technical criteria “may include the compute used to pre-train the model exceeding a specified quantity.”124 The fate of these rules is uncertain, as all rules and actions taken pursuant to Executive Order 14,110 will be reviewed to ensure that they are consistent with the AI policy set forth in Executive Order 14,179, Removing Barriers to American Leadership in Artificial Intelligence.125 Any rules or actions identified as inconsistent are directed to be suspended, revised, or rescinded.126

Numerous policy proposals have likewise called for compute thresholds. Scholars and developers alike have expressed support for a licensing or registration regime,127 and a compute threshold could be one of several ways to trigger the requirement.128 Compute thresholds have also been proposed for determining the level of KYC requirements for compute providers (including cloud providers).129 The Framework to Mitigate AI-Enabled Extreme Risks, proposed by U.S. Senators Romney, Reed, Moran, and King, would include a compute threshold for requiring notice of development, model evaluation, and pre-deployment licensing.130

Other AI regulations and policy proposals do not explicitly call for the introduction of compute thresholds but could still benefit from them. A compute threshold could clarify when specific obligations are triggered in laws and guidance that refer more broadly to “advanced systems” or “systems with dangerous capabilities,” as in the voluntary guidance for “organizations developing the most advanced AI systems” in the Hiroshima Process International Code of Conduct for Advanced AI Systems, agreed upon by G7 leaders on October 30, 2023.131 Compute thresholds could identify when specific obligations are triggered in other proposals, including proposals for: (1) conducting thorough risk assessments of frontier AI models before deployment;132 (2) subjecting AI development to evaluation-gated scaling;133 (3) pausing development of frontier AI;134 (4) subjecting developers of advanced models to governance audits;135 (5) monitoring advanced models after deployment;136 and (6) requiring that advanced AI models be subject to information security protections.137

B. Why Might Compute Be Relevant Under Existing Law?

Even without a formal compute threshold, the significance of training compute could affect the interpretation and application of existing laws. Courts and regulators may rely on compute as a proxy for how much risk a given AI system poses—alongside other factors such as capabilities, domain, safeguards, and whether the application is in a higher-risk context—when determining whether a legal condition or regulatory threshold has been met. This section briefly covers a few examples. First, it discusses the potential implications for duty of care and foreseeability analyses in tort law. It then goes on to describe how regulatory agencies could depend on training compute as one of several factors in evaluating risk from frontier AI, for example as an indicator of change to a regulated product and as a factor in regulatory impact analysis.

The application of existing laws and ongoing development of common law, such as tort law, may be particularly important while AI governance is still nascent138 and may operate as a complement to regulations once developed.139 However, courts and regulators will face new challenges as cases involve AI, an emerging technology in which they may lack specialized expertise, and parties will face uncertainty and inconsistent judgments across jurisdictions. As developments in AI unsettle existing law140 and agency practice, courts and agencies might rely on compute in several ways.

For example, compute could inform the duty of care owed by developers who make voluntary commitments to safety.141 A duty of care, which is a responsibility to take reasonable care to avoid causing harm to another, can be conditioned on the foreseeability of the plaintiff as a victim or be an affirmative duty to act in a particular way; affirmative duties can arise from the relationship between the parties, such as between business owner and customer, doctor and patient, and parent and child.142 If AI companies make general commitments to security testing and cybersecurity, such as the voluntary safety commitments secured by the Biden administration,143 those commitments may give rise to a duty of care in which training compute is a factor in determining what security is necessary. If a lab adopts a responsible scaling policy that requires it to have protection measures based on specific capabilities or potential for risk or misuse,144 a court might consider training compute as one of several factors in evaluating the potential for risk or misuse.

A court might also consider training compute as a factor when determining whether a harm was foreseeable. More advanced AI systems, trained with more compute, could foreseeably be capable of greater harm, especially in light of scaling laws discussed in Section I.C that make clear the relationship between compute and performance. It may likewise be foreseeable that a powerful AI system could be misused145 or become the target of more sophisticated attempts at exfiltration, which might succeed without adequate security.146 Foreseeability may in turn bear on negligence elements of proximate causation and duty of care.

Compute could also play a role in other scenarios, such as in a false advertising claim under the Lanham Act147 or state and federal consumer protection laws. If a business makes a claim about its AI system or services that is false or misleading, it could be held liable for monetary damages and enjoined from making that claim in the future (unless it becomes true).148 While many such claims will not involve compute, some may; for example, if a lab publicly claims to follow a responsible scaling policy, training compute could be relevant as an indicator of model capability and the corresponding security and safety measures promised by the policy.

Regulatory agencies may likewise consider compute in their analyses and regulatory actions. For example, the Environmental Protection Agency could consider training (and inference) compute usage as part of environmental impact assessments.149 Others could treat compute as a proxy for threat to national or public security. Agencies and committees responsible for identifying and responding to various risks, such as the Interagency Committee on Global Catastrophic Risk150 and Financial Stability Oversight Council,151 could consider compute in their evaluation of risk from frontier AI. Over fifty federal agencies were directed to take specific actions to promote responsible development, deployment, and federal use of AI, as well as regulation of industry, in the government-wide effort established by Executive Order 14,110152—although these actions are now under review.153 Even for agencies not directed to consider compute or implement a preliminary compute threshold, compute might factor into how guidance is implemented over time.

More speculatively, changes to training compute could be used by agencies as one of many indicators of how much a regulated product has changed, and thus whether it warrants further review. For example, the Food and Drug Administration might consider compute when evaluating AI in medical devices or diagnostic tools.154 While AI products considered to be medical devices are more likely to be narrow AI systems trained on comparatively less compute, significant changes to training compute may be one indicator that software modifications require premarket submission. The ability to measure, report, and verify compute155 could make this approach particularly compelling for regulators.

Finally, training compute may factor into regulatory impact analyses, which evaluate the impact of proposed and existing regulations through quantitative and qualitative methods such as cost-benefit analysis.156 While this type of analysis is not necessarily determinative, it is often an important input into regulatory decisions and necessary for any “significant regulatory action.”157 As agencies develop and propose new regulations and consider how those rules will affect or be affected by AI, compute could be relevant in drawing lines that define what conduct and actors are affected. For example, a rule with a higher compute threshold and narrower scope may be less significant and costly, as it covers fewer models and developers. The amount of compute used to train models now and in the future may be not only a proxy for threat to national security (or innovation, or economic growth), but also a source of uncertainty, given the potential for emergent capabilities.

C. Where Should the Compute Threshold(s) Sit?

The choice of compute threshold depends on the policy under consideration: what models are the intended target, given the purpose of the policy? What are the burdens and costs of compliance? Can the compute threshold be complemented with other elements for determining whether a model falls within the scope of the policy, in order to more precisely accomplish its purpose?

Some policy proposals would establish a compute threshold “at the level of FLOP used to train current foundational models.”158 While the training compute of many models is not public, according to estimates, the largest models today were trained with 1e25 FLOP or more, including at least one open-source model, Llama 3.1 405B.159 This is the initial threshold established by the EU AI Act. Under the Act, general-purpose AI models are considered to have “systemic risk,” and thus trigger a series of obligations for their providers, if found to have “high impact capabilities.”160 Such capabilities are presumed if the cumulative amount of training compute, which includes all “activities and methods that are intended to enhance the capabilities of the model prior to deployment, such as pre-training, synthetic data generation and fine-tuning,” exceeds 1e25 FLOP.161 This threshold encompasses existing models such as Gemini Ultra and GPT-4, and it can be updated upwards or downwards by the European Commission through delegated acts.162 During the AI Safety Summit held in 2023, the U.K. Government included current models by defining “frontier AI” as “highly capable general-purpose AI models that can perform a wide variety of tasks and match or exceed the capabilities present in today’s most advanced models” and acknowledged that the definition included the models underlying ChatGPT, Claude, and Bard.163

Others have proposed an initial threshold of “more training compute than already-deployed systems,”164 such as 1e26 FLOP165 or 1e27 FLOP.166 No known model currently exceeds 1e26 FLOP training compute, which is roughly five times the compute used to train GPT-4.167 These higher thresholds would more narrowly target future systems that pose greater risks, including potential catastrophic and existential risks.168 President Biden’s Executive Order on AI169 and recently-vetoed California Senate Bill 1047170 are in line with these proposals, both targeting models trained with more than 1e26 OP or FLOP.

Far more models would fall within the scope of a compute threshold set lower than current frontier models. While only two models exceeded 1e23 FLOP training compute in 2017, over 200 models meet that threshold today.171 As discussed in Section II.A, compute thresholds operate as a trigger for additional scrutiny, and more models falling within the ambit of regulation would entail a greater burden not only on developers, but also on regulators.172 These smaller, general-purpose models have not yet posed extreme risks, making a lower threshold unwarranted at this time.173

While the debate has centered mostly around the establishment of a single training compute threshold, governments could adopt a pluralistic and risk-adjusted approach by introducing multiple compute thresholds that trigger different measures or requirements according to the degree or nature of risk. Some proposals recommend a tiered approach that would create fewer obligations for models trained on less compute. For example, the Responsible Advanced Artificial Intelligence Act of 2024 would require pre-registration and benchmarks for lower-compute models, while developers of higher-compute models must submit a safety plan and receive a permit prior to training or deployment.174 Multi-tiered systems may also incorporate a higher threshold beyond which no development or deployment can take place, with limited exceptions, such as for development at a multinational consortium working on AI safety and emergency response infrastructure175 or for training runs and models with strong evidence of safety.176

Domain-specific thresholds could be established for models that possess capabilities or expertise in areas of concern and models that are trained using less compute than general-purpose models.177 A variety of specialized models are already available to advance research, trained on extensive scientific databases.178 As discussed in Section I.D, these models present a tremendous opportunity, yet many have also recognized the potential threat of their misuse to research, develop, and use chemical, biological, radiological, and nuclear weapons.179 To address these risks, President Biden’s Executive Order on AI, which set a compute threshold of 1e26 FLOP to trigger reporting requirements, set a substantially lower compute threshold of 1e23 FLOP for models trained “using primarily biological sequence data.”180 The Hiroshima Process International Code of Conduct for Advanced AI Systems likewise recommends devoting particular attention to offensive cyber capabilities and chemical, biological, radiological, and nuclear risks, although it does not propose a compute threshold.181
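
As a rough illustration of how such a tiered, domain-specific threshold operates, the sketch below encodes the two interim reporting thresholds described above; it is a simplification for exposition, not a restatement of the Executive Order’s legal text.

    # Simplified illustration of the Executive Order's interim reporting thresholds.
    GENERAL_THRESHOLD_FLOP = 1e26   # dual-use foundation models generally
    BIO_THRESHOLD_FLOP = 1e23       # models trained primarily on biological sequence data

    def triggers_reporting(training_flop, primarily_bio_sequence_data):
        threshold = BIO_THRESHOLD_FLOP if primarily_bio_sequence_data else GENERAL_THRESHOLD_FLOP
        return training_flop > threshold

    # A 5e23 FLOP model trained primarily on biological sequence data crosses the lower
    # threshold; a 5e23 FLOP general-purpose model does not.
    print(triggers_reporting(5e23, True), triggers_reporting(5e23, False))  # True False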

While domain-specific thresholds could be useful for a variety of policies tailored to specific risks, there are some limitations. It may be technically difficult to verify how much biological sequence data (or other domain-specific data) was used to train a model.182 Another challenge is specifying how much data in a given domain causes a model to fall within scope, particularly considering the potential capabilities of models trained on mixed data.183 Finally, the amount of training compute required may be so low that, over time, a compute threshold is not practical.

When choosing a threshold, regulators should be aware that capabilities might be substantially improved through post-training enhancements, and training compute is only a general predictor of capabilities. The absolute limits are unclear at this point; however, current methods can result in capability improvements equivalent to a 5- to 30-times increase in training compute.184 To account for post-training enhancements, a governance regime could create a safety buffer, in which oversight or other protective measures are set at a lower threshold.185 Along similar lines, open-source models may warrant a lower threshold for at least some regulatory requirements, since they could be further trained by another actor and, once released, cannot be moderated or rescinded.186
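
The arithmetic behind such a buffer is straightforward. Using the upper end of the 5- to 30-times range cited above, the illustrative sketch below shows how a regulator concerned about models that behave like 1e26 FLOP systems after enhancement might place the trigger roughly thirty times lower.

    # Illustrative safety-buffer arithmetic (numbers are for exposition only).
    target_effective_flop = 1e26     # capability level of concern after enhancements
    max_enhancement_factor = 30      # upper end of the 5- to 30-times range cited above

    buffered_threshold = target_effective_flop / max_enhancement_factor
    print(f"Buffered threshold: {buffered_threshold:.1e} FLOP")  # ~3.3e24 FLOP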

D. Does a Compute Threshold Require Updates?

Once established, compute thresholds and related criteria will likely require updates over time.187 Improvements in algorithmic efficiency could reduce the amount of compute needed to train an equally capable model,188 or a threshold could be raised or eliminated if adequate protective measures are developed or if models trained with a certain amount of compute are demonstrated to be safe.189 To further guard against future developments in a rapidly evolving field, policymakers can authorize regulators to update compute thresholds and related criteria.190

Several policies, proposed and enacted, have incorporated a dynamic compute threshold. For example, President Biden’s Executive Order on AI authorized the Secretary of Commerce to update the initial compute threshold set in the order, as well as other technical conditions for models subject to reporting requirements, “as needed on a regular basis” while establishing an interim compute threshold of 1e26 OP or FLOP.191 Similarly, the EU AI Act provides that the 1e25 FLOP compute threshold “should be adjusted over time to reflect technological and industrial changes, such as algorithmic improvements” and authorizes the European Commission to amend the threshold and “supplement benchmarks and indicators in light of evolving technological developments.”192 The California Senate Bill 1047 would have created the Frontier Model Division within the Government Operations Agency and authorized it to “update both of the [compute] thresholds in the definition of a ‘covered model’ to ensure that it accurately reflects technological developments, scientific literature, and widely accepted national and international standards and applies to artificial intelligence models that pose a significant risk of causing or materially enabling critical harms.”193

Regulators may need to update compute thresholds rapidly. Historically, failure to quickly update regulatory definitions in the context of emerging technologies has led to definitions becoming useless or even counterproductive.194 In the field of AI, developments may occur quickly and with significant implications for national security and public health, making responsive rulemaking particularly important. In the United States, there are several statutory tools to authorize and encourage expedited and regular rulemaking.195 For example, Congress could expressly authorize interim or direct final rulemaking, which would enable an agency to shift the comment period in notice-and-comment rulemaking to take place after the rule has already been promulgated, thereby allowing it to respond quickly to new developments.196

Policymakers could also require a periodic evaluation of whether compute thresholds are achieving their purpose, to ensure that they do not become over- or under-inclusive. While establishing and updating a compute threshold necessarily involves prospective ex ante impact assessment, in order to take precautions against risk without undue burdens, regulators can learn much from retrospective ex post analysis of current and previous thresholds.197 In a survey conducted for the Administrative Conference of the United States, “[a]ll agencies stated that periodic reviews have led to substative [sic] regulatory improvement at least some of time. This was more likely when the underlying evidence basis for the rule, particularly the science or technology, was changing.”198 While the optimal frequency of periodic review is unknown, the study found that U.S. federal agencies were more likely to conduct reviews when provided with a clear time interval (“at least every X years”).199

Several further institutional and procedural factors could affect whether and how compute thresholds are updated. In order to effectively update compute thresholds and other criteria, regulators must have access to expertise and talent through hiring, training, consultation and collaboration, and other avenues that facilitate access to experts from academia and industry.200 Decisions will be informed by the availability of data, including scientific and commercial data, to enable ongoing monitoring, learning, analysis, and adaptation in light of new developments. Decision-making procedures, agency design, and influence and pressures from policymakers, developers, and other stakeholders will likewise affect updates, among many other factors.201 While more analysis is beyond the scope of this Article, others have explored procedural and substantive measures for adaptive regulation202 and effective governance of emerging technologies.203

Some have proposed defining compute thresholds in terms of effective compute,204 as an alternative to updates over time. Effective compute could be indexed to a particular year (similar to inflation adjustments) and thus account for the role that algorithmic progress plays (e.g., 1e25 FLOP of 2023-level effective compute).205 However, there is no agreed-upon way to precisely define and calculate effective compute, and the ability to do so depends on the challenging task of measuring algorithmic efficiency, including choosing a performance metric to anchor on. Furthermore, effective compute alone would fail to address potential changes in the risk landscape, such as the development of protective measures.
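
One simple way such an adjustment could work, assuming (consistent with the estimate discussed in Section I.C) that algorithmic improvements contribute the equivalent of a compute doubling roughly every nine months, is sketched below; it is one possible convention, not an established standard.

    # One possible (non-standard) effective-compute adjustment: index raw FLOP to a
    # reference year, assuming algorithmic progress equivalent to a compute doubling
    # roughly every nine months (0.75 years), per the estimate cited in Section I.C.
    def effective_compute(raw_flop, years_after_reference, doubling_time_years=0.75):
        return raw_flop * 2 ** (years_after_reference / doubling_time_years)

    # Example: 1e25 FLOP spent two years after the reference year counts as roughly
    # 6.3e25 FLOP of reference-year effective compute under this assumption.
    print(f"{effective_compute(1e25, 2):.1e}")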

E. What Are the Advantages and Limitations of a Training Compute Threshold?

Compute has several properties that make it attractive for policymaking: it is (1) correlated with capabilities and thus risk, (2) essential for training, with thresholds that are difficult to circumvent without reducing performance, (3) an objective and quantifiable measure, (4) capable of being estimated before training, (5) externally verifiable after training, and (6) a significant cost during development and thus indicative of developer resources. However, training compute thresholds are not infallible: (1) training compute is an imprecise indicator of potential risk, (2) a compute threshold could be circumvented, and (3) there is no industry standard for measuring and reporting training compute.206 Some of these limitations can be addressed with thoughtful drafting, including clear language, alternative and supplementary elements for defining what models are within scope, and authority to update any compute threshold and other criteria in light of future developments.

First, training compute is correlated with model capabilities and associated risks. Scaling laws predict an increase in performance as training compute increases, and real-world capabilities generally follow (Section I.C). As models become more capable, they may also pose greater risks if they are misused or misaligned (Section I.D). However, training compute is not a precise indicator of downstream capabilities. Capabilities can seemingly emerge abruptly and discontinuously as models are developed with more compute,207 and the open-ended nature of foundation models means those capabilities may go undetected.208 Post-training enhancements such as fine-tuning are often not considered a part of training compute, yet they can dramatically improve performance and capabilities with far less compute. Furthermore, not all models with dangerous capabilities require large amounts of training compute; low-compute models with capabilities in certain domains, such as biology or chemistry, may also pose significant risks, such as biological design tools that could be used for drug discovery or the creation of pathogens worse than any seen to date.209 The market may shift towards these smaller, cheaper, more specialized models,210 and even general-purpose low-compute models may come to pose significant risks. Given these limitations, a training compute threshold cannot capture all possible risks; however, for large, general-purpose AI models, training compute can act as an initial threshold for capturing emerging capabilities and risks.

Second, compute is necessary throughout the AI lifecycle, and a compute threshold would be difficult to circumvent. There is no AI without compute (Section I.A). Due to its relationship with model capabilities, training compute cannot be easily reduced without a corresponding reduction in capabilities, making it difficult to circumvent for developers of the most advanced models. Nonetheless, companies might find “creative ways” to account for how much compute is used for a given system in order to avoid being subject to stricter regulation.211 To reduce this risk, some have suggested monitoring compute usage below these thresholds to help identify circumvention methods, such as structuring techniques or outsourcing.212 Others have suggested using compute thresholds alongside additional criteria, such as the model’s performance on benchmarks, financial or energy cost, or level of integration into society.213 As in other fields, regulatory burdens associated with compute thresholds could encourage regulatory arbitrage if a policy does not or cannot effectively account for that possibility.214 For example, since compute can be accessed remotely via digital means, data centers and compute providers could move to less-regulated jurisdictions.

Third, compute is an objective and quantifiable metric that is relatively straightforward to measure. Compute is a quantitative measure that reflects the number of mathematical operations performed. It does not depend on specific infrastructure and can be compared across different sets of hardware and software.215 By comparison, other metrics, such as algorithmic innovation and data, have been more difficult to track.216 Whereas quantitative metrics like compute can be readily compared across different instances, the qualitative nature of many other metrics makes them more subject to interpretation and difficult to consistently measure. Compute usage can be measured internally with existing tools and systems; however, there is not yet an industry standard for measuring, auditing, and reporting the use of computational resources.217 That said, there have been some efforts toward standardization of compute measurement.218 In the absence of a standard, some have instead presented a common framework for calculating compute, based on information about the hardware used and training time.219
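
A minimal sketch of that kind of hardware-based accounting appears below: training compute is estimated from the number of chips, each chip’s peak throughput, the fraction of peak performance actually sustained, and training time. All values are hypothetical.

    # Hardware-based estimate of training compute (all values hypothetical).
    num_chips = 10_000
    peak_flop_per_s_per_chip = 1e15     # assumed peak throughput of one accelerator
    utilization = 0.4                   # assumed fraction of peak actually sustained
    training_seconds = 60 * 24 * 3600   # 60 days of training

    training_flop = num_chips * peak_flop_per_s_per_chip * utilization * training_seconds
    print(f"Estimated training compute: {training_flop:.1e} FLOP")  # ~2.1e25 FLOP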

Fourth, compute can be estimated ahead of model development and deployment. Developers already estimate training compute with information about the model’s architecture and amount of training data, as part of planning before training takes place. The EU AI Act recognizes this, noting that “training of general-purpose AI models takes considerable planning which includes the upfront allocation of compute resources and, therefore, providers of general-purpose AI models are able to know if their model would meet the threshold before the training is completed.”220 Since compute can be readily estimated before a training run, developers can plan a model with existing policies in mind and implement appropriate precautions during training, such as cybersecurity measures.

Fifth, the amount of compute used could be externally verified after training. While laws that use compute thresholds as a trigger for additional measures could depend on self-reporting, meaningful enforcement requires regulators to be aware of, or at least able to verify, the amount of compute being used. A regulatory threshold will be ineffective if regulators have no way of knowing whether it has been reached. For this reason, some scholars have proposed that developers and compute providers be required to report the amount of compute used at different stages of the AI lifecycle.221 Compute providers already bill clients by chip-hours, which could be used to calculate total computational operations,222 and the concentration of the market among a few key cloud providers could make monitoring and reporting requirements simpler to administer.223 Others have proposed using “on-chip” or “hardware-enabled governance mechanisms” to verify claims about compute usage.224
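A provider-side reconciliation of the kind described above might look like the following sketch, in which billed chip-hours are converted into an order-of-magnitude estimate of total operations and compared against a developer's self-reported figure. The billing figures, utilization assumption, and tolerance are all hypothetical.

```python
def flop_from_chip_hours(chip_hours: float,
                         peak_flops_per_chip: float,
                         utilization: float) -> float:
    """Convert billed chip-hours into an estimate of total operations performed."""
    return chip_hours * 3600 * peak_flops_per_chip * utilization

# Hypothetical billing record: 2.4 million chip-hours on accelerators with
# 3e14 FLOP/s peak throughput, assuming ~40% realized utilization.
provider_estimate = flop_from_chip_hours(2.4e6, 3e14, 0.40)
reported_by_developer = 9.5e23
print(f"Billing-derived estimate: {provider_estimate:.1e} FLOP")

# Flag large discrepancies between the self-reported figure and the billing-derived estimate.
if abs(reported_by_developer - provider_estimate) > 0.25 * provider_estimate:
    print("Reported compute diverges from billing records; further review warranted.")
```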

Sixth, training compute is an indicator of developer resources and capacity to comply with regulatory requirements, as it represents a substantial financial investment.225 For instance, Sam Altman reported that the development of GPT-4 cost “much more” than $100 million.226 Researchers have estimated that Gemini Ultra cost $70 million to $290 million to develop.227 A regulatory approach based on training compute thresholds can therefore be used to subject only the best-resourced AI developers to increased regulatory scrutiny, while avoiding overburdening small companies, academics, and individuals. Over time, the cost of compute will most likely continue to fall, meaning the same thresholds will capture more developers and models. To ensure that the law remains appropriately scoped, compute thresholds can be complemented by additional metrics, such as the cost of compute or development. For example, the vetoed California Senate Bill 1047 was amended to include a compute cost threshold, defining a “covered model” as one trained using more than 1e26 OP, but only if the cost of that training compute exceeded $100,000,000 at the start of training.228
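The dual-trigger structure of the vetoed bill can be sketched as a simple check. The function below illustrates that structure under the figures stated above; it is not the statutory text.

```python
def is_covered_model(training_operations: float, training_compute_cost_usd: float,
                     op_threshold: float = 1e26,
                     cost_threshold_usd: float = 100_000_000) -> bool:
    """A model is 'covered' only if it exceeds both the compute threshold
    and the compute-cost threshold (mirroring the amended SB 1047 structure)."""
    return (training_operations > op_threshold
            and training_compute_cost_usd > cost_threshold_usd)

print(is_covered_model(2e26, 60_000_000))    # False: large run, but the compute was cheap
print(is_covered_model(2e26, 150_000_000))   # True: exceeds both triggers
```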

At the time of writing, many consider compute thresholds to be the best option currently available for determining which AI models should be subject to regulation, although the limitations of this approach underscore the need for careful drafting and adaptive governance. The specific compute threshold chosen should correspond to the nature and extent of the legal obligations it triggers, and it should reflect the fact that compute is only a proxy for, not a precise measure of, risk.

F. How Do Compute Thresholds Compare to Capability Evaluations?

A regulatory approach that uses a capabilities-based threshold or evaluation may seem more intuitively appealing and has been proposed by many.229 There are currently two main types of capability evaluations: benchmarking and red-teaming.230 In benchmarking, a model is tested on a specific dataset and receives a numerical score. In red-teaming, evaluators can use different approaches to identify vulnerabilities and flaws in a system, such as through prompt injection attacks to subvert safety guardrails. Model evaluations like these already serve as the basis for responsible scaling policies, which specify what protective measures an AI developer must implement in order to safely handle a given level of capabilities. Responsible scaling policies have been adopted by companies like Anthropic, OpenAI, and Google, and policymakers have also encouraged their development and practice.231
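For readers unfamiliar with how benchmarking reduces a model to a single score, the following minimal sketch shows the basic mechanics. Here `model` and `dataset` are hypothetical placeholders (any callable mapping a prompt to an answer, and any list of prompt-answer pairs), not references to a particular benchmark.

```python
def benchmark_accuracy(model, dataset) -> float:
    """Score a model on a fixed set of prompt-answer pairs.
    The resulting single number is what a benchmark-based
    capability threshold would be compared against."""
    correct = sum(1 for prompt, answer in dataset if model(prompt) == answer)
    return correct / len(dataset)

# Hypothetical usage: `model` is any callable prompt -> answer.
# dataset = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
# print(benchmark_accuracy(model, dataset))
```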

Capability evaluations can complement compute thresholds. For example, capability evaluations could be required for models exceeding a compute threshold that indicates that dangerous capabilities might exist. They could also be used as an alternative route to being covered by regulation. The EU AI Act adopts the latter approach, complementing the compute threshold with the possibility for the European Commission to “take individual decisions designating a general-purpose AI model as a general-purpose AI model with systemic risk if it is found that such model has capabilities or an impact equivalent to those captured by the set threshold.”232

Nonetheless, there are several downsides to depending on capabilities alone. First, model capabilities are difficult to measure.233 Benchmark results can be affected by factors other than capabilities, such as benchmark data being included in training data234 and model sensitivity to small changes in prompting.235 A model’s downstream capabilities may also differ from those observed during evaluation due to shifts in data distribution.236 Some threats, such as misuse of a model to develop a biological weapon, may be particularly difficult to evaluate due to the domain expertise required, the sensitivity of information related to national security, and the complexity of the task.237 Dangerous capabilities such as deception and manipulation are difficult to assess by their very nature,238 although some evaluations have already been developed.239 Furthermore, while evaluations can show which capabilities a model does possess, it is far more difficult to prove that a model does not possess a given capability. Over time, new capabilities may even emerge and improve due to prompting techniques, tools, and other post-training enhancements.

Second, and compounding the issue, there is no standard method for evaluating model capabilities.240 While benchmarks allow for comparison across models, there are competing benchmarks for similar capabilities; with none adopted as standard by developers or the research community, evaluators could select entirely different benchmark tests.241 Red-teaming, while more in-depth and responsive to differences between models, is even less standardized and produces less comparable results. Similarly, no standard exists for when in the AI lifecycle a model should be evaluated, even though fine-tuning and other post-training enhancements can have a significant impact on capabilities. Nevertheless, there have been some efforts toward standardization: the U.S. National Institute of Standards and Technology, for example, has begun to develop guidelines and benchmarks for evaluating AI capabilities, including through red-teaming.242

Third, it is much more difficult to externally verify model evaluations. Because evaluation methods are not standardized, different evaluators and methods may come to different conclusions, and even a small difference could determine whether a model falls within the scope of regulation. This makes external verification simultaneously more important and more challenging. In addition to the technical challenge of verifying model evaluations consistently, there is a practical challenge: certain methods, such as red-teaming and audits, depend on far greater access to a model and information about its development. Developers have been reluctant to grant such access,243 which has contributed to numerous calls to mandate external evaluations.244

Fourth, model evaluations may be circumvented. For red-teaming and more comprehensive audits, evaluations of a given model may reasonably reach different conclusions, which leaves room for an evaluator to deliberately shape results through their choice of methods and interpretation. Careful institutional design is needed to ensure that evaluations are robust to conflicts of interest, perverse incentives, and other limitations.245 If known benchmarks are used to determine whether a model is subject to regulation, developers might train models to achieve specific scores without affecting underlying capabilities, whether to improve performance on safety measures or to strategically underperform on certain measures of dangerous capabilities.

Finally, capability evaluations entail more uncertainty and expense. Currently, the capabilities of a model can only reliably be determined ex post,246 making it difficult for developers to predict whether it will fall within the scope of applicable law. More in-depth model evaluations such as red-teaming and audits are expensive and time-consuming, which may constrain small organizations, academics, and individuals.247

Capability evaluations can thus be viewed as a complementary tool for estimating model risk. Training compute makes an excellent initial trigger for regulatory oversight because it is an objective, quantifiable measure that can be estimated before training and verified afterward; capabilities, however, correspond more closely to risk. Capability evaluations provide more information and can be conducted after fine-tuning and other post-training enhancements, but they are more expensive, more difficult to carry out, and less standardized. Both are important components of AI governance, but they serve different roles.

IV. Conclusion

More powerful AI could bring transformative changes in society. It promises extraordinary opportunities and benefits across a wide range of sectors, with the potential to improve public health, make new scientific discoveries, improve productivity and living standards, and accelerate economic growth. However, the very same advanced capabilities could result in tremendous harms that are difficult to control or remedy after they have occurred. AI could fail in critical infrastructure, further concentrate wealth and increase inequality, or be misused for more effective disinformation, surveillance, cyberattacks, and development of chemical and biological weapons.

In order to prevent these potential harms, laws that govern AI must identify the models that pose the greatest threat. The obvious answer would be to evaluate the dangerous capabilities of frontier models; however, state-of-the-art model evaluations are subjective and unable to reliably predict downstream capabilities, and they can take place only after the model has been developed and the substantial investment in it has already been made.

This is where training compute thresholds come into play. A training compute threshold can operate as an initial filter, using compute as a proxy for the performance and capabilities of a model and, thus, the potential risk it poses. Despite its limitations, it may be the most effective option we have to identify potentially dangerous AI that warrants further scrutiny. However, compute thresholds alone are not sufficient. They must be used alongside other tools to mitigate and respond to risk, such as capability evaluations, post-market monitoring, and incident reporting. Several avenues of further research could improve governance via compute thresholds:

  1. What amount of training compute corresponds to future systems of concern? What threshold is appropriate for different regulatory targets, and how can we identify that threshold in advance? What are the downstream effects of different compute thresholds?
  2. Are compute thresholds appropriate for different stages of the AI lifecycle? For example, could thresholds for compute used for post-training enhancements or during inference be used alongside a training compute threshold, given the ability to significantly improve capabilities at these stages?
  3. Should domain-specific compute thresholds be established, and if so, to address which risks? If domain-specific compute thresholds are established, such as in President Biden’s Executive Order 14,110, how can competent authorities determine if a system is domain-specific and verify the training data?
  4. How should compute usage be reported, monitored, and audited?
  5. How should a compute threshold be updated over time? What is the likelihood of future frontier systems being developed using less (or far less) compute than is used today? Does growth or slowdown in compute usage, hardware improvement, or algorithmic efficiency warrant an update, or should it correspond solely to an increase in capabilities? Relatedly, what kind of framework would allow a regulatory agency to respond to developments effectively (e.g., with adequate information and the ability to update rapidly)?
  6. How could a capabilities-based threshold complement or replace a compute threshold, and what would be necessary (e.g., improved model evaluations for dangerous capabilities and alignment)?
  7. How should the law mitigate risks from AI systems that sit below the training compute threshold?