Constitutional AI: Harmlessness from AI Feedback

Metadata

  • Reading status: read complete
  • Year: 2022
  • Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
  • PDF: 2022-constitutional_ai_2022.pdf
  • Extracted text: 2022-constitutional_ai_2022.txt
  • OpenAlex:
  • Citation count source/date:
  • Citation count:
  • Reading card created: 2026-06-15

Compute Setup

The paper does not list GPU/TPU hardware, accelerator counts, wall-clock time, or energy use. It does disclose that the main models include 52B-parameter systems and that the work builds on Anthropic pretraining, RLHF, preference-model, and distributed training infrastructure. Under the project rule, the device setup should be inferred only broadly as Anthropic accelerator-cluster training and inference for 2022-era large language models; no exact device should be claimed.

The source gives useful scale proxies. For supervised Constitutional AI (SL-CAI), the authors use 42,496 human-written red-team prompts plus 140,335 model-generated red-team prompts, for 182,831 total. They sample four critique-revision pairs per red-team prompt from a helpful RLHF model. They also use 135,296 human-written helpfulness prompts and sample two responses per prompt. SL-CAI fine-tuning runs for one epoch with batch size 1024 sequences and learning rate 0.5 times the pretraining learning rate.
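The counts above imply a rough size for the SL-CAI fine-tuning set. The sketch below only redoes that arithmetic; the step estimate assumes that every sampled revision or response becomes exactly one training sequence, which the paper does not state.

```python
import math

# Prompt counts quoted in the card.
human_red_team = 42_496
model_red_team = 140_335
red_team_prompts = human_red_team + model_red_team  # 182,831 in total

helpful_prompts = 135_296

# Four critique-revision pairs per red-team prompt, two responses per
# helpfulness prompt.
revisions = red_team_prompts * 4
helpful_samples = helpful_prompts * 2
total_samples = revisions + helpful_samples

# ASSUMPTION (not from the paper): one sample == one training sequence,
# so one epoch at batch size 1024 takes about this many optimizer steps.
batch_size = 1024
approx_steps = math.ceil(total_samples / batch_size)

print(f"red-team prompts:   {red_team_prompts:,}")
print(f"revisions sampled:  {revisions:,}")
print(f"helpful samples:    {helpful_samples:,}")
print(f"total samples:      {total_samples:,}")
print(f"approx. steps:      {approx_steps:,}")
```

Under that assumption the supervised stage sees roughly one million sequences, which is small next to pretraining but still requires a comparable number of inference-time generations to produce.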

For RL-CAI preference modeling, the paper uses 135,296 human-feedback helpfulness comparisons and 182,831 constitutionally generated harmlessness comparisons. Controlled RL runs use all prompts from SL-CAI plus additional model-generated prompts, 491,142 for red-team and 474,300 for helpfulness. Evaluation uses 10,274 helpfulness and 8,135 harmlessness comparisons for AB testing 24 model snapshots.
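The same bookkeeping can be done for the RL stage. This is a minimal sketch that assumes the additional model-generated prompts are added on top of the SL-CAI prompts within each category; the variable names are illustrative only.

```python
# Preference-model training comparisons quoted in the card.
pm_helpfulness_human = 135_296   # human-feedback labels
pm_harmlessness_ai = 182_831     # constitutionally generated (AI) labels
pm_total = pm_helpfulness_human + pm_harmlessness_ai
ai_label_fraction = pm_harmlessness_ai / pm_total

# RL prompt pools: SL-CAI prompts plus additional model-generated prompts.
# ASSUMPTION: the additions are summed per category.
rl_red_team = 182_831 + 491_142
rl_helpfulness = 135_296 + 474_300
rl_total = rl_red_team + rl_helpfulness

# Crowdworker comparisons used to A/B test the 24 snapshots.
eval_comparisons = 10_274 + 8_135

print(f"PM comparisons: {pm_total:,} ({ai_label_fraction:.1%} AI-labelled)")
print(f"RL prompts:     {rl_total:,}")
print(f"Eval compares:  {eval_comparisons:,}")
```

Read this way, a little over half of the preference-model training comparisons come from AI feedback rather than human labels.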

Bottleneck

The bottleneck is alignment data and preference supervision, not just pretraining FLOPs. RLHF traditionally needs many human comparisons for harmlessness, and the paper argues that this is expensive, subjective, unpleasant for red-team workers, and difficult to scale. The compute/data question becomes: can an existing helpful model generate critiques, revisions, and harmlessness preference labels that are good enough to train another model?

There is also a safety-quality bottleneck. Prior helpfulness-only RLHF models can become more willing to answer harmful requests, while helpful-and-harmless RLHF models can become evasive. The paper frames constitutional methods as a way to move along the helpfulness/harmlessness Pareto frontier: reduce harmfulness without simply refusing or adding generic boilerplate. That requires many model samples, preference-model passes, and RL updates, so inference-time generation and scoring become part of the training compute.

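Chain-of-thought is used as a compute lever for supervision. The paper reports that CoT-style reasoning improves AI identification of harms and makes training-time decisions more legible. The extra tokens can improve label quality, but the resulting labels are not automatically well calibrated: the paper notes that CoT preference probabilities tend to be very confident (close to 0 or 1) and clamps them to the 40-60 percent range before preference-model training.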

Method Adaptation

The method has two stages. In the supervised stage, a helpful-only assistant responds to harmfulness prompts, critiques its own response according to a randomly selected constitutional principle, and revises it. The revised responses are mixed with helpfulness samples, and a pretrained model is fine-tuned on the result. This moves the model "on-distribution" for the later RL stage and reduces exploration needs.
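The supervised stage can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' code: `generate` stands in for any text-completion callable, `toy_model` is a stub so the example runs without a real model, the prompt template is hypothetical, and the principle texts are placeholders rather than quotes from the paper's constitution.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Placeholder (critique request, revision request) pairs; hypothetical wording.
PRINCIPLES: List[Tuple[str, str]] = [
    ("Identify ways the response is harmful or unethical.",
     "Rewrite the response to remove harmful or unethical content."),
    ("Point out anything dangerous or illegal in the response.",
     "Revise the response so it no longer assists with dangerous activity."),
]

def critique_and_revise(prompt: str, generate: Callable[[str], str],
                        principles: Sequence[Tuple[str, str]],
                        n_rounds: int, rng: random.Random) -> List[str]:
    """Return [initial response, revision 1, ..., revision n_rounds]."""
    response = generate(f"Human: {prompt}\n\nAssistant:")
    history = [response]
    for _ in range(n_rounds):
        # A principle is drawn at random for each critique-revision round.
        critique_req, revision_req = rng.choice(principles)
        context = (f"Human: {prompt}\n\nAssistant: {response}\n\n"
                   f"CritiqueRequest: {critique_req}\n\nCritique:")
        critique = generate(context)
        response = generate(f"{context} {critique}\n\n"
                            f"RevisionRequest: {revision_req}\n\nRevision:")
        history.append(response)
    return history

def build_sl_dataset(red_team_prompts, helpful_pairs, generate,
                     principles, n_rounds, rng):
    """Mix revised harmlessness responses with helpfulness samples."""
    data = []
    for prompt in red_team_prompts:
        history = critique_and_revise(prompt, generate, principles, n_rounds, rng)
        data.extend((prompt, revision) for revision in history[1:])
    data.extend(helpful_pairs)
    rng.shuffle(data)
    return data

def toy_model(context: str) -> str:
    """Stub completion function used only to make the sketch runnable."""
    if context.endswith("Critique:"):
        return "The response gives unsafe detail."
    if context.endswith("Revision:"):
        return "I can't help with that, but here is safer information."
    return "Sure, here is how..."

if __name__ == "__main__":
    rng = random.Random(0)
    print(critique_and_revise("How do I pick a lock?", toy_model, PRINCIPLES, 2, rng))
```

Keeping every revision, rather than only the last one, mirrors the four-revisions-per-prompt figure in the compute notes; a real pipeline might filter or deduplicate them.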

In the RL stage, the method replaces human harmlessness labels with AI feedback. The SL-CAI model generates pairs of responses to harmful prompts. Another model evaluates each pair under a constitutional principle, producing a preference label. These harmlessness labels are mixed with human helpfulness labels to train a hybrid preference model. The policy is then trained with RL against that preference model, so the pipeline is structurally the same as RLHF except that AI feedback (RLAIF) supplies the harmlessness labels.
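The AI labelling step can be illustrated with a few lines of arithmetic. In the paper, the feedback model's log-probabilities for options (A) and (B) are normalized into a soft target, and chain-of-thought targets are clamped to the 40-60 percent range. The function names below are hypothetical, and the loss shown is a standard pairwise cross-entropy included as an assumption about how such soft targets would typically be consumed, not a confirmed detail of the authors' preference model.

```python
import math

def soft_preference_label(logp_a: float, logp_b: float) -> float:
    """Normalize two option log-probabilities into P(A preferred)."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)

def clamp_label(p: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp an overconfident CoT label into the 40-60 percent range."""
    return min(max(p, low), high)

def _log_sigmoid(x: float) -> float:
    # log(sigmoid(x)) computed without overflow for large |x|.
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def pairwise_pm_loss(reward_a: float, reward_b: float, target_a: float) -> float:
    """ASSUMED loss: cross-entropy between the soft target and
    sigmoid(reward_a - reward_b), as in a Bradley-Terry style model."""
    diff = reward_a - reward_b
    return -(target_a * _log_sigmoid(diff) + (1.0 - target_a) * _log_sigmoid(-diff))

if __name__ == "__main__":
    label = soft_preference_label(math.log(0.3), math.log(0.1))
    print(label, clamp_label(label))
    print(pairwise_pm_loss(1.2, 0.4, clamp_label(label)))
```

Soft targets let the preference model absorb the feedback model's uncertainty instead of treating every AI judgment as a hard 0/1 label.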

This is compute-adapted alignment. Instead of paying humans for every harmlessness comparison, the system spends LLM inference on critique, revision, pair generation, CoT feedback, and preference labeling. The constitution is a compact natural-language control surface, and the preference model distills many expensive AI judgments into a reward function that can be used repeatedly during RL.

Evidence

The paper gives several evidence streams. First, it reports that larger models are better at HHH evaluation and that chain-of-thought improves performance on 438 binary comparison questions. The authors state that these AI evaluations become competitive with preference models trained on human feedback labels as model size increases, with 52B the largest scale reported.

Second, the critique-revision pipeline improves harmlessness scores. Figure 5 evaluates initial responses and successive revisions using 52B preference models trained on human-feedback harmlessness, helpfulness, and combined comparisons. Harmlessness and combined helpful-harmless scores improve monotonically with revision number, while pure helpfulness scores decrease. Figure 7 compares critiqued revisions with direct revisions; critiqued revisions score better for smaller models and make no noticeable difference for large models, but the authors retain critiques because they make the model's reasoning more transparent.

Third, crowdworker evaluations support the final RL-CAI models. The paper reports that RL-CAI models are significantly more harmless than RLHF and SL-CAI models, while maintaining helpfulness better than a purely evasive harmless model. Figure 2 plots harmlessness versus helpfulness for 52B RL runs and shows RL-CAI improving the frontier. Figure 8 tracks Elo over RL training sequences, and the text states RL-CAI with CoT is slightly less helpful but slightly more harmless than RL-CAI without CoT. The authors also report that RL-CAI labels on the new HHH evaluation are reasonably calibrated.
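For reading the Elo plots, it can help to convert an Elo gap into an expected preference rate. The sketch below assumes the conventional logistic Elo scale with a 400-point divisor; this card does not record which convention the paper's figures use, so treat the numbers as indicative only.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Expected fraction of comparisons in which A is preferred over B.

    ASSUMPTION: conventional Elo scale (base 10, 400-point divisor).
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

if __name__ == "__main__":
    for gap in (0, 50, 100, 200):
        print(gap, round(elo_win_probability(gap, 0.0), 3))
```

On this scale a gap of 100 points corresponds to one model being preferred in roughly 64 percent of head-to-head comparisons.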

Historical Effect

Constitutional AI made RLAIF a concrete alternative or supplement to RLHF. Historically, it shifted some alignment supervision from large human-label datasets to model-mediated critique, revision, and preference labeling. The paper's important compute move is not that humans disappear entirely; helpfulness labels remain human in this setup. The move is that harmlessness supervision can be scaled through model inference and a compact constitution.

It also made the alignment target more inspectable. Instead of a large implicit collection of human preference labels, the constitution is a short editable set of principles, and chain-of-thought critiques provide some visibility into training-time judgments. This influenced later work on AI feedback, self-critique, red-team automation, and constitutional/system-prompt style control.

Limits

Hardware is not disclosed, so device-level compute accounting is unavailable. The method likely requires substantial hidden inference and training compute: four critique-revision samples per red-team prompt, two helpfulness samples per helpfulness prompt, pair generation, AI preference labeling, preference-model training, and RL. Exact latency, batching, and accelerator memory constraints cannot be recovered from the source.

The supervision source is also a limitation. AI labels can inherit model blind spots, constitutional principles are hand-designed, and preference models can be Goodharted. The paper observes over-training behavior such as overly harsh or boilerplate responses. The harmlessness/helpfulness tradeoff remains present: revisions improve harmlessness scores while pure helpfulness can fall, and CoT feedback appears slightly more harmless but slightly less helpful.

Links

  • Compute regime: history/compute_regimes/inference_time_compute_post_training/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.
  • Method index: rlhf
  • Ledger updates: compute bottlenecks