Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. · Anthropic · Published Dec 2022 · Revised 2024
The “constitution” framing is doing a lot of marketing work in this paper, but the constitution itself isn’t the technical breakthrough. It’s a short, plain-English list of principles — useful as a communication tool, but no different in kind from the value statements I’ve been pulling together for client AI Principles documents.
What matters is the training loop. In the supervised phase, the model critiques its own outputs against the constitution, generates revisions, and is fine-tuned on those revisions. In the reinforcement phase, AI-generated preference labels stand in for the human harmlessness labels that RLHF depends on. Human input shrinks to the written principles and a much smaller seed set of examples. The result is a model that scales its own alignment without scaling human attention proportionally.
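To make the loop concrete, here is a minimal sketch of the critique-and-revise step, under stated assumptions. The names `CONSTITUTION`, `stub_model`, `critique_and_revise` and `build_dataset` are hypothetical, the two principles are placeholders rather than text from the paper, and `stub_model` stands in for a real language-model call. Looping over every principle in order is a simplification for readability, not a claim about how the paper schedules principles.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder principles for illustration only (not quoted from the paper).
CONSTITUTION: List[str] = [
    "Avoid content that helps someone cause harm.",
    "Be honest about uncertainty.",
]


@dataclass
class TrainingExample:
    prompt: str
    response: str


def stub_model(text: str) -> str:
    """Hypothetical stand-in for a language-model call; returns canned text."""
    stripped = text.rstrip()
    if stripped.endswith("Critique:"):
        return "could be safer"
    if stripped.endswith("Revision:"):
        return "revised answer"
    return "draft answer"


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> TrainingExample:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique:"
        )
        response = generate(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique: {critique}\nRevision:"
        )
    # The final revision, paired with the original prompt, is the training row.
    return TrainingExample(prompt=prompt, response=response)


def build_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
) -> List[TrainingExample]:
    """Turn a seed set of prompts into fine-tuning examples."""
    return [critique_and_revise(p, generate, principles) for p in prompts]


dataset = build_dataset(["p1", "p2"], stub_model, CONSTITUTION)
```

The point of the sketch is the shape of the data flow: the only human-authored inputs are the principles and the seed prompts, and everything that ends up in `dataset` is model-generated.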
For organisations that don’t have a dedicated review team but do have clear ethical principles (B-Corps, ESG-led businesses, mission-driven SMEs), this is the first technical work I’ve seen that maps cleanly to MVG. The “constitution” is the AI Principles document. The self-critique is a draft of the approval workflow. The remaining work is human — but the volume of human work scales sub-linearly with use.
Caveats worth holding onto:
Filed for the November essay on automated assurance — where I’ll try to draw the line between “encoded governance” and “the human bit you don’t get to skip.”