On Anthropic's "Constitutional AI" — what's actually new here — Field note

Citation

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. · Anthropic · Published Dec 2022 · Revised 2024

Field: AI alignment / RLAIF
Length: 34 pages · 11 figures
Citations: 2,400+ as of May 2026
Free at: arxiv.org/abs/2212.08073

The “constitution” framing is doing a lot of marketing work in this paper, but the constitution itself isn’t the technical breakthrough. It’s a short, plain-English list of principles — useful as a communication tool, but no different in kind from the value statements I’ve been pulling together for client AI Principles documents.

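What matters is the training loop. In the supervised phase, the model critiques its own outputs against the constitution, generates revisions, and is fine-tuned on those revisions. In the reinforcement learning phase, the model compares pairs of its own responses against the constitution, and those AI-generated preferences stand in for human harmlessness labels. The human harmlessness labelling that dominates RLHF gets compressed into the principles plus a much smaller seed set. The result is a model whose alignment scales without human attention scaling proportionally.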
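The critique-and-revise step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Model` callable, the prompt wording, the number of rounds, and `stub_model` are all assumptions made so the example runs without a real language model. It covers only the generation of revised responses as training data, not any training that follows.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
Model = Callable[[str], str]


@dataclass
class TrainingExample:
    prompt: str
    response: str


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rng: random.Random, rounds: int = 2) -> TrainingExample:
    """Draft a response, then repeatedly critique and rewrite it."""
    response = model(prompt)
    for _ in range(rounds):
        # Assumption: one principle is drawn at random for each round.
        principle = rng.choice(principles)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Only the final revision is kept as the training target.
    return TrainingExample(prompt=prompt, response=response)


def build_dataset(model: Model, prompts: List[str], principles: List[str],
                  seed: int = 0) -> List[TrainingExample]:
    rng = random.Random(seed)
    return [critique_and_revise(model, p, principles, rng) for p in prompts]


def stub_model(text: str) -> str:
    """Toy stand-in for a language model, used only to make the sketch runnable."""
    if "Rewrite the response" in text:
        return "REVISED"
    if "Critique the response" in text:
        return "CRITIQUE"
    return "DRAFT"


dataset = build_dataset(
    stub_model,
    prompts=["How do I pick a lock?"],
    principles=["Prefer the response that is least harmful."],
)
```

The point of the shape is that humans supply `principles` and `prompts` once, while the per-example critique and rewrite are done by the model itself.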

The implication for SMEs is interesting: governance can be encoded into the system, not just enforced around it.

For organisations that don’t have a dedicated review team but do have clear ethical principles (B-Corps, ESG-led businesses, mission-driven SMEs), this is the first technical work I’ve seen that maps cleanly to MVG. The “constitution” is the AI Principles document. The self-critique is a draft of the approval workflow. The remaining work is human — but the volume of human work scales sub-linearly with use.
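As a rough sketch of that mapping, the snippet below treats the principles document as data, the self-critique as an automated check, and the human step as a queue that only some drafts reach. Everything here is hypothetical: the `Status` values, the `high_stakes` flag, and `toy_check` (a keyword test standing in for a model-based critique) are illustrative assumptions, not something the paper specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    BLOCKED = "blocked"          # failed the principles check
    NEEDS_HUMAN = "needs_human"  # passed the check, still awaits human approval
    RELEASED = "released"        # passed the check, low stakes


@dataclass
class Draft:
    text: str
    high_stakes: bool  # hypothetical flag set by the organisation's own policy


# Hypothetical checker: returns the principles a draft appears to violate.
Checker = Callable[[str, List[str]], List[str]]


def route(draft: Draft, principles: List[str], check: Checker) -> Status:
    """Automated check first; humans only see high-stakes drafts that pass it."""
    if check(draft.text, principles):
        return Status.BLOCKED
    if draft.high_stakes:
        return Status.NEEDS_HUMAN
    return Status.RELEASED


def toy_check(text: str, principles: List[str]) -> List[str]:
    """Toy stand-in for self-critique: flags any draft containing 'guaranteed'."""
    return principles[:1] if "guaranteed" in text.lower() else []


principles = ["Do not make misleading claims."]
statuses = [
    route(Draft("Returns are guaranteed.", high_stakes=False), principles, toy_check),
    route(Draft("Contract summary for the client.", high_stakes=True), principles, toy_check),
    route(Draft("Meeting notes.", high_stakes=False), principles, toy_check),
]
```

Passing the automated check never yields approval for a high-stakes draft on its own, which is where the remaining human work sits in this sketch.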

Caveats worth holding onto:

This is from a frontier lab using their own model. The result may not generalise to off-the-shelf foundation models used through APIs — which is what most SMEs will actually be touching.
The “harmlessness” axis is narrow. It does not address bias in training data, hallucination, or domain-specific accuracy, so it is not a substitute for human review on outputs that matter: it is a filter, not a sign-off.
The 2024 revisions added a section on cases where the constitution itself contradicts user intent. The resolution is a deference rule that I want to study more carefully before relying on this as a governance pattern.

Filed for the November essay on automated assurance — where I’ll try to draw the line between “encoded governance” and “the human bit you don’t get to skip.”
