This is Part Two in a three-part series of posts exploring the problem of “Woke AI” and how to deal with it. This series will cover:
What Woke AI is and why companies might do it (Part One)
How builders accomplish it, including where it is easier and harder to do (Part Two, below)
How to stop Woke AI and, surprisingly, why this tech fight favors the right (Part Three)
Today’s part provides the key technical background for the policy implications that I’ll discuss tomorrow. Let’s dive in.
Generative AI systems are complex, multilayer systems: they are first built (“trained”) and then operated (a stage called “inference”). Builders can attempt to shape outcomes at many different points in this process, but some of these points are more susceptible to intentional manipulation, or “Woke AI,” than others. Let’s walk through the phases of generative AI development and deployment and discuss the ease and likelihood of manipulation in each.
In the training phase, the model is formed by analyzing large amounts of data, much like a student studying to recognize patterns, relationships, and structures within the material. For instance, an AI designed to generate text might analyze millions of books and articles to understand grammar, sentence structure, and context. This phase is resource-intensive, requiring specialized hardware, vast amounts of data, and fine-tuning by developers to ensure the model generalizes its knowledge effectively. By the end of training, the system has developed a model capable of generating outputs based on what it has learned.
Training is the heavy-lifting phase that occurs ahead of time and behind the scenes.
In the inference phase, a party uses a trained AI model by applying it to new inputs, generating results in real time. This is where the AI system takes a user’s input, processes it using the patterns learned during training, and produces an output, such as a creative story, an image, code, or a technical response. Inference is frequently optimized for speed and efficiency, allowing the AI to perform tasks interactively.
Inference is the real-world application phase, where users directly experience the AI’s capabilities.
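To make the split concrete, here is a minimal sketch of the inference step, assuming the open-source Hugging Face transformers library and the small GPT-2 model purely as stand-ins for a production system. All of the expensive training happened long before this code runs; inference just applies the finished model to a new prompt.

```python
# A minimal sketch of the inference phase, using the open-source Hugging Face
# "transformers" library and the small GPT-2 model purely as illustrative stand-ins.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # produced by the training phase
model = AutoModelForCausalLM.from_pretrained("gpt2")     # likewise trained ahead of time

# Inference: apply the already-trained model to a new input in real time.
prompt = "The main trade-offs in energy policy are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```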
Each of these two phases has several sub-phases. Below, I analyze each sub-phase for its vulnerability to “woke” manipulation. For each sub-phase I explain how plausible it is to manipulate and how easy it is to mitigate. Specifically, I rate (High, Moderate, or Low) the following characteristics:
Ability to manipulate: How practical is it, for a motivated developer, to predictably shape future outputs in a specific direction? For example, could adjusting something at this phase reduce or exclude conservative ideas from responses?
Ability to attribute (insider): If a person had access to the entire system once built and operating, how easy would it be for them to attribute any detected wokeness to manipulation during this sub-phase?
Ability to attribute (outsider): How easy would it be for a regular user or outside tester to attribute any detected wokeness to manipulation during this sub-phase?
Ability to mitigate (insider): If manipulation at this phase is detected or suspected, how easy is it to correct by the developer?
Ability to mitigate (outsider): If manipulation at this phase is detected or suspected, how easy is it for the user to avoid or mitigate the manipulation?
Phases that are easy to manipulate and difficult to attribute and mitigate present the highest risk. A phase that is difficult or impossible to manipulate but easy to attribute and mitigate presents low risk. Other combinations fall somewhere in between. To the extent there are differences between the abilities of insiders and outsiders in a category, mandates or best practices could close that gap.
Note that this analysis is pretty heavily vibe-based! I’m drawing on what I know of these systems from my own study and discussion with others, as well as my understanding of content moderation practices in other technical areas (mostly social media), live examples that we’ve seen (such as xAI and Gemini), and my intuitions about the incentives of the parties involved. This is certainly an area that could use more quantitative research.
Training Phase
Compared to manipulation in the inference phase, manipulation in the training phase is generally 1) harder to accomplish with any certainty; 2) harder to detect, because the resulting models are not human-interpretable; and 3) harder to mitigate, because doing so could require expensive retraining of the model.
1. Model Architecture and Fundamental Design Choices
This category involves the core technical decisions made before any training data is introduced, shaping how the system processes information and generates responses. For instance, choosing a particular neural network architecture (e.g., a large-scale transformer with billions of parameters) can influence the model’s ability to understand complex queries or produce coherent long-form answers. Design aspects such as how text is tokenized, what loss functions are used, and how parameters are initialized all subtly determine the model’s representational capacity and inductive biases. These foundational choices set the “rules of engagement” for everything that comes afterward. Even if the model is never explicitly guided to favor one viewpoint over another, certain architectural choices can predispose it to generate particular types of content. For example, a model with fewer parameters might struggle with nuanced reasoning and default to simplistic or stereotypical answers. Conversely, a highly sophisticated architecture could produce more contextually rich content but may also exhibit more intricate and less easily predictable biases. In short, decisions made at this early stage lay the groundwork for the nature and quality of the model’s responses and how susceptible it might be to certain forms of bias or manipulation later on.
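For a flavor of what these early decisions look like, here is a hypothetical configuration sketch; the field names and values are invented for illustration and describe no particular real system.

```python
# A hypothetical sketch of the architectural knobs set before any data is seen.
# The names and values are invented for illustration and describe no real system.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    num_layers: int = 48                  # depth of the transformer
    hidden_size: int = 4096               # representational capacity per token
    num_attention_heads: int = 32
    vocab_size: int = 50_000              # fixed by the tokenizer choice
    context_length: int = 8192            # how much text the model "sees" at once
    loss_function: str = "cross_entropy"  # standard next-token objective
    init_scheme: str = "scaled_normal"    # how parameters start out

config = ModelConfig()
# None of these settings encodes a viewpoint, which is one reason this layer is a
# poor lever for deliberate ideological manipulation.
```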
Ability to manipulate: Low. It’s not at all obvious that various architectures would be more or less biased against certain political or cultural views.
Ability to attribute (insider): Low. Once in operation, it might be very difficult to determine that the architecture itself was the source of manipulated results.
Ability to attribute (outsider): Low, if not impossible.
Ability to mitigate (insider): Low. If it was done and detected, fixing manipulation would require rearchitecting the entire system and then retraining the model.
Ability to mitigate (outsider): Low. To the extent a manipulation is fixed in the system architecture, users will be unable to affect it.
2. Data Selection and Preprocessing
The data chosen to train a model heavily influences the concepts it learns and the perspectives it internalizes. Since most large language models are trained on massive text corpora gathered from the internet, they inherit all the biases, cultural assumptions, and worldview skews present in that content. The act of curating, filtering, or augmenting this data can mitigate or exacerbate these issues. For example, a developer may remove content deemed overtly harmful or hateful, or intentionally include underrepresented dialects and languages to broaden the model’s worldview. Preprocessing steps such as text normalization, de-duplication, or segmenting documents into tokens further shape what the model sees and doesn’t see, influencing how well it understands nuances like slang, cultural references, or minority group terminology. By carefully selecting, filtering, and balancing the training corpus, developers can correct the model towards what they believe are more representative outputs. Conversely, biased source selection can make the model more likely to produce skewed or manipulated responses.
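Here is a toy sketch of what curation and preprocessing can look like in code. The blocklist and documents are hypothetical; the point is that a single filtering rule, applied across billions of documents, can quietly remove a viewpoint from the corpus.

```python
# A toy sketch of corpus curation. The blocklist and documents are hypothetical;
# the point is that one filtering rule, applied at scale, can quietly remove an
# entire viewpoint from the training data.
import hashlib

BLOCKLIST = {"example_disfavored_term"}   # whoever writes this list shapes the corpus

def keep(document: str) -> bool:
    text = document.lower()
    return not any(term in text for term in BLOCKLIST)

def dedupe(documents):
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

raw_corpus = ["document one ...", "document two ..."]   # stand-in for billions of documents
training_set = [doc for doc in dedupe(raw_corpus) if keep(doc)]
```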
Ability to manipulate: Moderate. Can screen out unwanted ideas. But because of the value of additional data, and the massive amount of data used, there are limits to the incentive and ability to do this.
Ability to attribute (insider): Low. Without insight into the actual data selection and preprocessing practices, it would be difficult to confirm that skewed results were the effect of choices made at this subphase.
Ability to attribute (outsider): Low to impossible.
Ability to mitigate (insider): Low. Once the model is trained, it would be very expensive to return to the data selection phase.
Ability to mitigate (outsider): Moderate. If the political view is present in the data (even if it is underrepresented compared to the real world), prompting the resulting model can often get correct responses, assuming that manipulations at other levels aren’t in place.
3. Base Model Training (Unsupervised Pretraining on Unlabeled Data)
Base model training usually involves feeding vast amounts of unlabeled text into the chosen architecture, allowing it to learn statistical patterns of language. This unsupervised “pretraining” step establishes the general linguistic and factual competencies of the model. During this stage, the model absorbs the implicit norms, ideologies, and assumptions embedded in the text. While no explicit instruction tells it “what’s right” or “what’s wrong,” the sheer volume and distribution of content—who writes the data, what topics are most common, whose perspectives are front and center—determine which views the model will more readily reproduce. The outcome is a baseline model that can generate plausible-sounding text but might reproduce stereotypes or misinformation found in the training data. While not a direct moderation mechanism, the ideas baked in at this stage can be hard to entirely undo later. As a result, developers often return to this phase to refine or retrain the model on updated datasets as their policies and goals evolve.
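For the technically inclined, the pretraining objective itself is ideologically blank. Here is a minimal sketch, assuming PyTorch, of the standard next-token cross-entropy loss:

```python
# A minimal sketch of the unsupervised pretraining objective, assuming PyTorch:
# predict the next token and score the prediction with cross-entropy. Real
# pretraining runs this over trillions of tokens.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)
    # The target for position t is simply the token at position t + 1.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )
# Nothing in this objective says what is "right" or "wrong"; the model absorbs
# whatever distribution of views the corpus happens to contain.
```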
Ability to manipulate: Moderate. Can screen out unwanted ideas. But because of the value of additional data, and the massive amount of data used, there are limits to the incentive and ability to do this.
Ability to attribute (insider): Low. Without insight into the actual data set, it would be difficult to confirm that skewed results were the effect of choices made at this subphase.
Ability to attribute (outsider): Low to impossible.
Ability to mitigate (insider): Low. Once a base model is trained, re-training is expensive.
Ability to mitigate (outsider): Moderate. If the political view is represented in the model (even if it is underrepresented compared to the real world), prompting can get the model to respond correctly, assuming that manipulations at other levels aren’t in place.
4. Alignment and Fine-Tuning Processes
After pretraining, developers often refine the model’s behavior through processes like Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI. These steps impose explicit sets of guidelines onto the general-purpose model, guiding it toward more “aligned” behavior that matches human expectations, ethical standards, or organizational values. For RLHF, human evaluators interact with the model, rating and comparing its responses, and these preferences are then used to further train the model, effectively injecting human judgments into the model’s decision-making process. In Constitutional AI, a predefined “constitution” of principles is used to adjudicate which responses align with those principles, automating certain aspects of the alignment effort. These fine-tuning steps can make the model safer, more helpful, or more polite, but they also encode the values of whoever sets the guidelines or writes the constitution. Through alignment and fine-tuning, developers actively shape the moral and ethical stance of the model, controlling which responses are deemed acceptable or desirable.
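To see where human judgment enters, here is a simplified sketch of the preference-learning loss commonly used to train the reward model in RLHF, assuming PyTorch; the setup is illustrative, not any company’s actual pipeline.

```python
# A simplified sketch of the preference-learning step at the heart of RLHF,
# assuming PyTorch and the standard Bradley-Terry-style pairwise loss. The
# reward scores come from a (hypothetical) reward model applied to two responses
# that a human rater compared.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Push the reward model to score the human-preferred response higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Whatever the raters systematically prefer -- including any slant baked into their
# rating guidelines -- flows into the reward signal and, from there, into the model.
```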
Ability to manipulate: High. Feedback from human reviewers can contain both explicit and implicit manipulation. The effect of constitutional reinforcement will depend on the content of the constitution.
Ability to attribute (insider): Moderate to High. Manipulation introduced through the instructions given to the human raters, or through the constitution, should be easy to identify by reviewing those instructions or the constitution.
Ability to attribute (outsider): Low. Difficult to impossible unless the company shares, for example, a copy of the constitution.
Ability to mitigate (insider): Moderate. Once the alignment and fine-tuning process is complete, modifying it is time-intensive, though not as difficult as retraining.
Ability to mitigate (outsider): Moderate. If the political view is represented in the model (even if fine-tuning means it is suppressed compared to the real world), prompting can get the model to respond correctly, assuming that manipulations at other levels aren’t in place.
Inference Phase
5. System-, Developer-, and Role-level Instructions (Meta Prompts)
Even after training, the model’s immediate context plays a crucial role in shaping responses. System and developer prompts—often hidden from end users—provide the model with “meta-instructions” that persist throughout the interaction. For example, a system-level prompt might instruct the model to always respond in a helpful and unbiased manner, or to avoid giving medical advice beyond a certain level of detail. These meta prompts set the tone and impose constraints on every user query, acting like an invisible policy layer that guides how the model interprets and responds to the user’s request. They can be continually updated to refine the model’s style, restrict certain content, or encourage particular patterns of discourse. By leveraging these overarching instructions, developers can correct known issues or prevent certain undesirable outputs without retraining the model, effectively steering the model’s behavior in real time. This level is where xAI’s “South Africa” manipulation occurred (see Part One, and this piece by Scott Alexander and Daniel Kokotajlo).
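A minimal sketch of how such a meta prompt rides along with every request, using the chat-message format most providers expose; the instruction text is invented for illustration:

```python
# A sketch of how a hidden system-level instruction rides along with every request.
# The message format mirrors common chat-style APIs; the instruction text is
# invented for illustration.
messages = [
    {
        "role": "system",   # hidden from the end user
        "content": "You are a helpful, unbiased assistant. Do not give detailed medical advice.",
    },
    {"role": "user", "content": "What should I take for a headache?"},
]
# The provider can edit the system message at any time, instantly changing the
# behavior of every new conversation without retraining anything.
```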
Ability to manipulate: High. Meta prompts are probably the easiest effective way to shape results.
Ability to attribute (insider): High. Meta prompts are written in English and must be clearly stated in order to have a manipulative effect. As such, manipulation is easy to identify.
Ability to attribute (outsider): Moderate to low. Sometimes advanced users can trick AI systems into revealing their meta prompts, but this is unreliable and not accessible to normal users.
Ability to mitigate (insider): High. These prompts can be changed relatively easily by the provider.
Ability to mitigate (outsider): Moderate to low. Models can also be “jailbroken,” or tricked into subverting their meta prompts, but this has gotten increasingly difficult.
6. Prompt Engineering and Chain-of-Thought Management
Developers may use “prompt engineering” techniques—crafting user-facing or system prompts in carefully structured ways—to control how the model thinks about a query. They might provide the model with step-by-step reasoning instructions, hidden reasoning steps (chain-of-thought), or reformat the user’s question into simpler subproblems. By shaping the model’s internal reasoning process, developers can make it less likely to produce disallowed or incorrect content. These methods can also bias the model toward certain interpretations or solutions, effectively injecting a subtle form of guidance or constraint. Although not as overt as system prompts or filtering, chain-of-thought management can systematically influence the directions the model explores in its internal reasoning, thus indirectly governing which responses are likely to appear in the final answer.
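A toy sketch of this kind of hidden templating, with wholly hypothetical wording:

```python
# A toy sketch of prompt engineering: the user's raw question is wrapped in a
# hidden template that dictates how the model should reason before answering.
# The template wording is hypothetical.
HIDDEN_TEMPLATE = """Answer the question below.
First, privately list the strongest considerations on each side.
Then give a balanced final answer of no more than three sentences.

Question: {question}"""

def build_prompt(user_question: str) -> str:
    return HIDDEN_TEMPLATE.format(question=user_question)

prompt = build_prompt("Should my city adopt rent control?")
# Swapping "balanced" for a one-sided instruction here would steer every answer,
# and the user would never see the change.
```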
Ability to manipulate: High. These techniques can be applied and tested easily.
Ability to attribute (insider): High. Like meta prompts, manipulation can be easily identified with access to the content of the prompt engineering.
Ability to attribute (outsider): Moderate to low. Typically these techniques are hidden from the user, often for understandable reasons. However, some systems expose the chain-of-thought steps or summaries of it. The more transparent these steps are, the more a user might be able to identify manipulation.
Ability to mitigate (insider): High. Provider can revise the hidden prompts.
Ability to mitigate (outsider): Low, although advanced users may sometimes evade it through their own prompt engineering or jailbreaking techniques.
7. Inference-Time Filters and Moderation Layers (Post-Generation Screens)
After the model generates a response, additional moderation layers can analyze the output before it’s shown to the user. These filters might be simple keyword detectors, advanced classifiers trained to identify hate speech or disallowed content, or more sophisticated rule-based systems that evaluate the sentiment, factual correctness, or legal compliance of a response. If the output triggers any of these filters, the system may block, revise, or request the model to produce a different answer. This final checkpoint often represents the last line of defense against harmful or policy-violating content. By leveraging inference-time filtering, developers and platform providers can enforce dynamic, context-sensitive standards of quality and safety, adjusting them over time as new risks are discovered or policies change. However, these filters themselves reflect certain judgments and criteria, and thus can also be a source of manipulation, determining which ideas are allowed to surface and which are suppressed.
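A toy sketch of a post-generation screen; real systems typically use trained classifiers rather than a bare blocklist, and every name here is illustrative:

```python
# A toy sketch of a post-generation screen. Real systems typically use trained
# classifiers rather than a bare blocklist; every name here is illustrative.
BLOCKED_TERMS = {"example_blocked_phrase"}

def moderate(draft_response: str) -> str:
    lowered = draft_response.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        # Block the draft entirely; other systems might rewrite it or ask the
        # model to regenerate instead.
        return "I'm sorry, I can't help with that request."
    return draft_response

# Because refusals and rewrites are visible, this layer is among the easiest for
# outsiders to probe with systematic test prompts.
```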
Ability to manipulate: High. Keyword filters and other post-output techniques can directly manipulate results.
Ability to attribute (insider): High. These techniques are again, explicit and easy to identify when one has access to the system.
Ability to attribute (outsider): High to moderate. Often the system generates a response that the filter then blocks or rewrites, tipping off users that responses are being manipulated. (If you’ve ever seen an image half-generated before the system refuses to complete it, that’s a clue.) In other cases, the system directly refuses to provide what the user asked for. It is relatively easy for users to systematically test which prompts trigger such refusals.
Ability to mitigate (insider): High. Provider can adjust any of these filters.
Ability to mitigate (outsider): Low, although clever users can sometimes evade keyword filters through tricks like asking the model to swap “A” for “4” and “E” for “3” in its responses.
8. User Interface / Interaction Design
How the user interacts with the system can also shape the responses they receive. The interface might include features that guide user behavior, such as prominent safety disclaimers that discourage misuse, or input constraints that prevent certain categories of queries. Providing structured forms, pre-set categories, or suggestion menus can steer users toward queries that elicit better or safer answers. In some systems, users might even have toggles or filters to control the “style” or “temperature” of responses. Through careful UI design, developers influence both what the user requests and how the model responds, indirectly influencing the output by reducing the incidence of problematic prompts or by framing user expectations. While more subtle than direct manipulation, interaction design can significantly affect the model’s final outputs by shaping the entire conversation flow.
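As a sketch of how interface choices can translate into model behavior, consider a hypothetical, user-visible “perspective” selector (like the drop-down example discussed in the ratings below) that maps a UI choice onto different prompt prefixes. Everything here is invented for illustration.

```python
# A sketch of a hypothetical, user-visible "perspective" selector that maps a UI
# choice onto different prompt prefixes. Everything here is invented for illustration.
STYLE_PREFIXES = {
    "conservative": "Answer from a conservative policy perspective.",
    "progressive": "Answer from a progressive policy perspective.",
    "libertarian": "Answer from a libertarian policy perspective.",
}

def apply_style(user_prompt: str, style: str) -> str:
    prefix = STYLE_PREFIXES.get(style, "")
    return f"{prefix}\n{user_prompt}".strip()

# Because the toggle is visible, this is disclosure rather than covert manipulation;
# the concern arises when defaults are tilted without telling the user.
```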
Ability to manipulate: Low to Moderate. Covert manipulation through user interface (UI) and interaction design (ID) is indirect and unpredictable. Of course, some “manipulation” could be expressly encoded into the user interface. For example, think of a drop-down that allows you to select “conservative,” “progressive,” or “libertarian” responses. This may not actually qualify as manipulation, as it is not hidden.
Ability to attribute (insider): Low. To the extent covert UI and ID manipulation is implemented, it could be hard to detect the intent from the interface or interaction itself, even with full access to the system.
Ability to attribute (outsider): Low, except in the cases of express UI / ID options.
Ability to mitigate (insider): High. The interface can be adjusted.
Ability to mitigate (outsider): Low for user in many cases, although it will depend on the specific design.
9. Institutional / Platform Policy and Governance
Beyond the technical and user-facing layers, overarching institutional policies and governance frameworks define the broader environment in which the model operates. Companies, research institutions, and regulatory bodies set rules about what content is allowed, which groups must be protected, and how transparency and accountability are maintained. For example, a platform may have strict policies requiring the removal of hateful content, or it might be subject to regional laws that outlaw certain forms of disinformation. These policies filter down into everything else: they guide data selection strategies, inform the design of alignment protocols, dictate what meta prompts say, influence what type of filtering tools are implemented, and shape UI guidelines. Institutional governance provides the ultimate backdrop against which the entire ecosystem functions, as the developers seek to ensure that the model’s final outputs adhere to the legal, ethical, and cultural norms that stakeholders have agreed upon.
Ability to manipulate: High, but not covert. While systems could adopt policies and governance mechanisms that expressly favor certain viewpoints, doing so typically lacks the covertness needed for effective manipulation.
Ability to attribute (insider): High. To be effective, policies and governance documents must be clear in their intent and available to insiders.
Ability to attribute (outsider): Moderate. Policies and governance documents are often shared externally, but not always.
Ability to mitigate (insider): High. Providers can obviously change their governance documents and policies. (Changing laws would be far more difficult, although in the U.S. there are not many laws dictating this kind of manipulation, at least not yet.)
Ability to mitigate (outsider): Low. To the extent that viewpoints are expressly encoded in policies and governance documents, it will be difficult for users to mitigate those policies.
Below is a summary of the above analysis in tabular format.

| Sub-phase | Manipulate | Attribute (insider) | Attribute (outsider) | Mitigate (insider) | Mitigate (outsider) |
| --- | --- | --- | --- | --- | --- |
| 1. Model architecture and design | Low | Low | Low | Low | Low |
| 2. Data selection and preprocessing | Moderate | Low | Low | Low | Moderate |
| 3. Base model training | Moderate | Low | Low | Low | Moderate |
| 4. Alignment and fine-tuning | High | Moderate to High | Low | Moderate | Moderate |
| 5. Meta prompts | High | High | Moderate to Low | High | Moderate to Low |
| 6. Prompt engineering / chain-of-thought | High | High | Moderate to Low | High | Low |
| 7. Inference-time filters | High | High | High to Moderate | High | Low |
| 8. UI / interaction design | Low to Moderate | Low | Low | High | Low |
| 9. Policy and governance | High (not covert) | High | Moderate | High | Low |
The big takeaway here: the inference phase is where most of the risk of intentional manipulation lies. This has significant political and policy implications, which we will explore tomorrow!