Red Teaming AI Legislation: Lessons from SB 1047
AI models can be abused. But so can AI laws. Let's test both.
Red Teaming and the Duality of SB 1047’s Fiercest Supporters
Red teaming is a well-established approach to testing AI systems in which testers adopt the role of a motivated adversary trying to turn a well-intentioned service to an unintended or even nefarious purpose. Red teams systematically probe models, trying to break them to uncover vulnerabilities, biases, and potential misuse scenarios. Red teaming helps identify and mitigate risks before AI models are deployed, ultimately leading to more trustworthy and reliable AI solutions.
We ought to red team not just AI technology, but AI legislation. I am starting an initiative to do just that.
Red teaming legislation might be even more important than red teaming technology, because technology is easier to update. Engineers can usually quickly modify AI systems, even once deployed, to address newly discovered misuses. But after a legislature enacts a bill, any changes to the statute could take years.
And yet many AI safety advocates who have proposed AI legislation have not red teamed their proposals. People who are deeply familiar with and supportive of red teaming AI models seem at best uninterested in, and often hostile to, critiques of their legislative proposals that assume someone might try to misuse the laws.
Consider California’s Safe and Secure Innovation for Frontier Artificial Intelligence Models Act (SB 1047). The controversial bill, now nearing final votes in the state legislature, has been through many iterations, but it appears to have taken its final form.
A certain segment of very-online SB 1047 boosters have, since its introduction, spoken with absolute confidence about the legislation and what a future enforcer could and could not do under the proposed bill. Rather than consider how a bad actor might misuse this bill, they instead assumed that enforcers would be cautious, reasonable, and careful. They have pooh-poohed and even mocked those who pointed out the bill’s potential for overreach or abuse. They appear to believe that because the bill requires certain “reasonable” actions, only unreasonable actions will ever be prosecuted.
But as anyone who has encountered law enforcement can tell you, that’s not how law works in practice. The language of the statute, however well intentioned, is applied by humans who have their own various incentives and motivations. To understand the likely effects of legislation, then, it is vital to understand those various incentives and speculate as to how various stakeholders might apply a statute.
Taking government officials’ incentives seriously is a core tenet of a field of political economy known as public choice theory. Applying public choice analysis to proposed legislation is the policy version of red teaming.
AI safety proponents would never just trust users of a foundation model because the developer had good intentions. Why would they just trust enforcers and other stakeholders of a well-intentioned law?
Perhaps SB 1047 supporters’ pollyannaish attitude toward government is understandable. Few of them are lawyers, as far as I can tell, and most have probably never experienced the fierce (and expensive) debates that lawyers have over what seemingly simple words like “reasonable” and “material” mean. Most supporters appear to be white, well-off people who don’t own businesses, and so they have likely never personally encountered an overzealous criminal or civil law enforcement officer.
Why You Should Listen to Me
While obviously no one has direct experience with enforcement of SB 1047, I do have some experience with how enforcement of statutes with broad standards works in practice. I am a computer scientist, but I am also a lawyer; more specifically, I was a high-level employee at a federal law enforcement agency, the Federal Trade Commission.
The FTC’s consumer protection authority is particularly relevant to SB 1047. Congress in 1938 charged the FTC with a broad and rather abstract consumer protection mandate: preventing and prosecuting “unfair or deceptive acts or practices.” That so-called “UDAP” standard is, on its face, almost as general as the “reasonable,” “material,” and “good faith” language within SB 1047. I analyzed dozens and dozens of cases applying the FTC’s broad language to actual fact patterns in the world, often to identify potential unintended consequences. This experience has shown me how enforcement of such general principles can go astray.
Many other people have extensive experience with the FTC’s UDAP standards, but I haven’t seen anyone draw lessons from that experience for SB 1047. So here I am.
Public Choice Analysis 101 and the Power of Settlements
I’ll explain how my FTC experience applies further below. But first, a very simplified overview of public choice analysis focused on the incentives of various players.
When legislators write a statute, their primary incentive is to get it out the door so they can take credit for having done something about the problem. Credit for problem-solving helps them get reelected. Once they’ve passed a statute, their role is largely finished, although sometimes legislation includes oversight components such as reports to the legislature.
After passage, the statute must be enforced. The law enforcer’s incentive is to bring successful cases under the statute. Often, the bigger the target defendant, the better for the enforcer’s career. Of course, they don’t want to get embarrassed in court, so they’ll try to bring winnable cases.
Enforcers can win in several ways. First, they can win by convincing a court. Second, an individual enforcer can win by bringing cases that won’t be resolved until long after the individual who brought the case has moved on to their next position. Finally, and very commonly, enforcers can win by getting defendants to settle out of court. Defendants will settle even when they are convinced the law is on their side in order to avoid expensive litigation.
Because settlements are wins even though a court never weighs in, they are a powerful tool for law enforcers. An overwhelming majority of FTC UDAP cases are resolved through settlements. These settlements usually involve a “consent order” that binds the defendant to certain behavioral conditions, with a standard term of twenty years.
What the Improved SB 1047 Says
I have been a critic of SB 1047 from the start, in part because I believe the effective altruist / AI safety motivations behind it are deeply misguided, unsupportable, and inevitably tend toward authoritarian responses, as I have written. Frankly, I find it difficult to trust claims that SB 1047 won’t stop AI innovation from people who deeply believe that AI is poised to kill us all.
However, while the motivations of some sponsors remain deeply suspect, the bill itself has evolved significantly from its earliest versions. It is clearly much less authoritarian than it was.
Gone is the Frontier Model Division (FMD), although many of its authorities are ported to a Board of Frontier Models within the Government Operations Agency (GOA).
Gone is the provision that would have allowed the FMD to create arbitrary pre-release requirements for developers.
Gone is the “similar performance” criterion for covered models that could have empowered the FMD to reach far smaller model developers (although the Board of Frontier Models can still redefine the thresholds without clear limits).
Gone are potential criminal penalties.
Gone are fees on developers.
The bill has also improved how it allocates responsibility to developers for uses outside of their control.
Many of these changes have been adopted to address concerns that SB 1047 supporters repeatedly insisted were unjustified or fanciful. Often, their insistence that critics were misreading the bill demonstrated a deep resistance to a red teaming mindset for legislation.
After these changes, what does SB 1047 say?
SB 1047 seeks to prevent or reduce the likelihood of “critical harms” from the development or use of certain large AI models. But the bill’s authors, like everyone else, do not know what technical practices would achieve that goal. Rather than set out specific substantive requirements for how to train or use models so as to avoid such harms, the bill requires developers training covered models to adopt a range of processes, and to make promises about those processes.
For example, the bill requires developers training large models to “implement a written and separate safety and security protocol that” if followed “would successfully comply with the developer’s duty to take reasonable care to avoid producing a covered model or covered model derivative that poses an unreasonable risk of causing or materially enabling a critical harm.” (22603(a)(3))
The bill also requires developers to make promises about the future effects of those processes. Specifically, sections 22603(f) and 22603(h) require developers to submit a statement verifying their compliance with section 22603. Putting the process and promise components together, SB 1047 requires developers to promise that they have in place processes that fulfill their “duty to take reasonable care” to not produce an AI model with an “unreasonable risk of causing or materially enabling a critical harm.”
SB 1047 Looks Like an FTC Consent Order
The current version of SB 1047 resembles the kinds of provisions found in FTC consent orders. In both situations, broad legal standards (UDAP and “reasonable care”) underlie specific requirements intended to detail how to comply with those broad standards. FTC-imposed data security and privacy consent orders include similar process requirements, reporting requirements, auditing requirements, prohibitions on specific unfair or deceptive practices, document retention requirements, incident reporting requirements, and transparency requirements. (Interestingly, 22607(a)(3) adopts the standard of California’s “little FTC Act” to define “false and misleading statements,” making the FTC analogy almost literal.)
There are three important differences between FTC consent orders and SB 1047. First, and most importantly, such consent orders are entered only AFTER a company has allegedly violated the FTC Act (or other law under FTC jurisdiction) by harming or likely harming consumers. Second, consent orders apply only to the company that violated the law, whereas SB 1047 applies to every developer that fits its criteria. Third, consent orders are generally static, yet under SB 1047 the GOA can change the guidance for how to comply with the bill.
In short, SB 1047 looks like an industry-wide consent order enacted on the presumption that the industry is already guilty. I’ve seen very aggressive FTC enforcement of consent orders, and it is easy for me to imagine SB 1047 being deployed similarly.
Red Teaming SB 1047: Two Scenarios
With this overview, how might we red team SB 1047? There are many scenarios we could explore:
A disgruntled employee of a model developer weaponizing the whistleblower protections of the law;
A company using the law to slow a competing developer;
A bad actor seeking to avoid responsibility for criminal actions by scapegoating a model developer;
An auditor seeking to comply with the letter of the law while keeping their client happy.
Each of these scenarios could reveal important potential impacts of the law. But I will focus on two scenarios that, given my FTC experience, I find the most concerning: an aggressive prosecutor in the wake of an incident, and an anti-AI official seeking to halt or slow AI development by any means.
A Hypothetical Incident
Imagine you are a prosecutor in the year 2030. Security company MobHit uses a small subcontractor’s code quality assurance tool to test its security updates. That tool relies in part on Anthropomorphic’s Clod model, which is a covered model under SB 1047, to vet code for errors. Unfortunately, the tool and MobHit’s other review processes fail, and MobHit rolls out an update to millions of California computers that stops them from booting. Airlines and PG&E are among the businesses affected; the estimated economic damage is $400 million. One person suffered severe kidney damage because her home dialysis machine failed and couldn’t be restarted.
How might you, the prosecutor, bring a case under SB 1047? Remember that your motivations are to earn publicity and reputation by taking on big cases and winning. You should be extremely aggressive in applying the law, pushing the boundaries of its meaning and scope. (This helps stress test the statute’s potential to be abused.)
An investigation of all three involved companies is inevitable, but which is the juiciest target? MobHit didn’t develop the model; there might be other laws that apply, but SB 1047 seems like a bad fit. The code assurance tool company probably fine-tuned the model, so SB 1047 might apply, but the company is small and doesn’t have deep pockets. Anthropomorphic, however, created the model, has deep pockets, and is a big brand.
The first step will be to dig through every filing Anthropomorphic made under SB 1047. Ask for all relevant materials from the company. Did they check every box? This part of an investigation can be very time consuming and expensive for a company even without a formal charge.
But let’s assume that Anthropomorphic followed the required processes and made the right promises. Does the investigation stop once we’ve confirmed that all the procedurally required steps were met?
In my experience, absolutely not! After all, something really bad happened; as a prosecutor, it’s in your career interest to make sure someone pays. The investigation will move into the substance of the SB 1047 requirements, which (simplifying somewhat) turn on two critical questions: harm and reasonableness.
Was there a critical harm? The hypothetical harm falls below the $500 million threshold. But an aggressive prosecutor will jump straight to 22602(g)(1)(D), which defines “critical harm” to include “[o]ther grave harms to public safety and security that are of comparable severity to the harms described in subparagraphs (A) to (C), inclusive.” $400 million is of “comparable severity” to (g)(1)(B)’s $500 million threshold. Critical infrastructure was affected, and perhaps this even qualifies as a cyberattack. The (g)(2)(B) carve-out doesn’t apply here because although the model did not itself deploy the update that failed, the model arguably did materially contribute to the failure of the system as a whole.
Did the developer, before training, implement reasonable protection under 22603(a)(1) and implement an adequate safety and security protocol under 22603(a)(3)? Did they, before using or releasing the model, take reasonable care to implement appropriate safeguards under 22603(b)(3)? As a prosecutor, the biggest piece of evidence that Anthropomorphic failed these requirements is that a critical harm actually occurred. The bigger the harm, the easier it is to argue that the developer failed their duty to take reasonable care.
This is one of the most naive, or duplicitous, parts of SB 1047 defenders’ argument that SB 1047 isn’t a strict liability statute. That’s true, but if a covered AI model is involved in a significantly harmful incident, those same folks will undoubtedly point to the incident as proof of the need for SB 1047 and call for enforcement against the developer.
Compared to a court tort case, where plaintiffs have to plead a facially sufficient case to get to the discovery stage, developers under SB 1047 start on their back foot, having submitted significant information to the enforcer that can then be combed through for missteps if something bad happens. This is a major practical effect of SB 1047.
This, again, is similar to the positioning of companies under FTC settlements. Take, for example, the Cambridge Analytica situation that led the FTC to reach an unprecedented $5 billion settlement with Facebook. Cambridge Analytica provided data-driven analysis to political campaigns. Their services used data from approximately 50 million Facebook profiles. CA acquired that data from a university researcher who had used the Open Graph API that Facebook had offered at the time. The researcher’s initial use of the data seems to have been within Facebook’s terms of service, but the transfer to CA and CA’s use of that data violated Facebook’s terms of service. In other words, Facebook didn’t intend for the data to be used that way.
There is zero evidence that any user was ever injured from CA’s use of this data. But the amount of data and the political nature of the business caused a huge furor. Absent the CA incident, it’s highly unlikely any case would have been brought. But because Facebook was under a pre-existing consent agreement, the FTC was able to plausibly allege a wide range of procedural missteps including a failure to adequately vet third parties who had access to data. The end result? A big dollar settlement.
To be fair, the “reasonableness” analysis under the latest draft of SB 1047 is better than previous versions of the bill, which arguably discarded causation from the analysis. Under the current version, the “duty of care” language suggests a legal possibility of developers not being responsible for every bad thing, even if the practical effect remains that prosecutors will seek to impose legal liability.
In short, one effect of SB 1047 is that if anything bad happens involving a covered foundation model, model developers should anticipate that they will be treated and pursued as liable no matter the underlying facts.
An Anti-AI Enforcer
Having gamed through what might happen if there is an actual harm, let’s look at a different potential use of this statute: the use of it to slow or stop new model development even absent any harmful incident. Assume for argument's sake that the people leading the relevant parts of the Government Operations Agency want to halt or slow all new model development. How would they use SB 1047 to do so?
Previous versions of SB 1047 were extremely susceptible to this abuse. For example, earlier versions required developers, pre-training, to certify to the FMD their positive safety determination and explain how that determination was reached. An anti-AI enforcer could have straightforwardly interpreted this to mean that any such certification had to be not just filed, but also accepted by the FMD before training could begin. This process could be easily weaponized. The GOA or Board could declare that they need to approve the adequacy of a protocol before it is “accepted” and training can begin.
In the latest language, developers are required to transmit a copy of their safety and security protocol. “Transmit” undercuts the pre-approval interpretation above, but it doesn’t fully foreclose it. An aggressive agency could probably coax the small number of companies involved to play along. And even under the current language, an aggressively anti-AI GOA or Attorney General could immediately start an investigation into whether a transmitted safety and security protocol complies with the requirements of SB 1047, send the developer a demand letter for further information, and even potentially get an injunction to stop development until the investigation concludes.
An aggressive enforcer might also:
Issue guidance explaining that “tak[ing] reasonable care to implement appropriate safeguards” under 22603(b)(3) precludes release of open source models.
Start a program to groom whistleblowers within AI companies.
Start a rulemaking process under the authority in 11547.6(d)(1) to modify the threshold of “covered model” while asking whether the dollar thresholds ought to be interpreted broadly.
Incorporate many of the jettisoned provisions of earlier versions of SB 1047 into the binding auditing requirements under 11547.6(e)(1) or the “unreasonable risk” guidance under 11547.6(f)(1).
These types of actions might seem improbable or speculative. But considering edge cases is common and appropriate in red teaming. And the average person would be surprised at what a regulator can do with seemingly limited authority when the regulator’s incentives diverge from the law’s intended purpose.
Conclusion
SB 1047 could have benefited from early red teaming by its sponsors.
If passed, SB 1047 will certainly impose tens of millions of dollars in compliance and auditing costs. It will inevitably delay the release of frontier models above the threshold, probably by three to six months, as companies work to file the necessary paperwork. It will create additional uncertainty and risk for open source developers and those who rely on open weight or open source models. Perhaps, for those whose goal is to slow AI development rather than increase safety, these are the benefits, although I see them as costs.
But will SB 1047 actually make AI safer? This is unknowable, in part because the threat remains deeply hypothetical. There is exactly zero evidence that some special risk exists above the chosen compute threshold that is not already sufficiently constrained by traditional tort and consumer protection law.
Good red teaming requires creative thinking and an ability to see a system from multiple perspectives. The AI safety community has thought deeply about how to red team AI technology. Some of that same energy and creativity should be focused on how to red team AI legislation. Such an approach would result in more refined legislation and would improve communication across ideological groups in legislative fights.
I’m committed to this idea and am starting a legislative red teaming program for AI policy. If you are interested in participating, please contact me at neil@abundance.institute.