Underfold
Field note № 17
Posted 17 May 2026

On Anthropic's "Constitutional AI" — what's actually new here

Source type: Paper
Status: Currently reading
Tag: Alignment
Filed for: Nov essay
Crosspost: Substack · LinkedIn
Citation

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. · Anthropic · Published Dec 2022 · Revised 2024

Field: AI alignment / RLAIF
Length: 34 pages · 11 figures
Citations: 2,400+ as of May 2026
Free at: arxiv.org/abs/2212.08073

The “constitution” framing is doing a lot of marketing work in this paper, but the constitution itself isn’t the technical breakthrough. It’s a short, plain-English list of principles — useful as a communication tool, but no different in kind from the value statements I’ve been pulling together for client AI Principles documents.

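What matters is the training loop. The model critiques its own outputs against the constitution, generates revisions, and those revisions become supervised fine-tuning data. In a second stage, the model's own comparisons between pairs of responses stand in for human preference labels on harmlessness, which is the RLAIF part. Human labels are still used for helpfulness, so the human contribution is compressed into a much smaller seed set rather than removed. The result is a model that scales its own alignment without scaling human attention proportionally.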
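To make the critique-and-revise loop concrete for myself, here is a minimal sketch. Everything named in it is my own assumption: the two-line `CONSTITUTION`, the prompt wording, the `toy_model` stand-in and the choice to walk through every principle in order (a simplification on my part, not the paper's procedure). It only illustrates the shape of the loop, not Anthropic's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical principles for illustration; not the paper's constitution.
CONSTITUTION = [
    "Avoid content that is harmful or dangerous.",
    "Be honest about uncertainty.",
]


@dataclass
class TrainingExample:
    prompt: str
    response: str


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> TrainingExample:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            "Critique this response against the principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    # Only the final revision is kept as fine-tuning data.
    return TrainingExample(prompt=prompt, response=response)


def build_dataset(
    seed_prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
) -> List[TrainingExample]:
    """Turn a small seed set of prompts into (prompt, revised response) pairs."""
    return [critique_and_revise(p, generate, principles) for p in seed_prompts]


def toy_model(text: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if text.startswith("Critique"):
        return "Too blunt."
    if text.startswith("Rewrite"):
        return "revised answer"
    return "first draft"


dataset = build_dataset(["q1", "q2"], toy_model, CONSTITUTION)
```

The point of the sketch is where the humans sit: they write the seed prompts and the principles, and the model produces everything in between.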

The implication for SMEs is interesting: governance can be encoded into the system, not just enforced around it.

For organisations that don’t have a dedicated review team but do have clear ethical principles (B-Corps, ESG-led businesses, mission-driven SMEs), this is the first technical work I’ve seen that maps cleanly to MVG. The “constitution” is the AI Principles document. The self-critique is a draft of the approval workflow. The remaining work is human — but the volume of human work scales sub-linearly with use.

Caveats worth holding onto:

  1. This is from a frontier lab using their own model. The result may not generalise to off-the-shelf foundation models used through APIs — which is what most SMEs will actually be touching.
  2. The “harmlessness” axis is narrow. It does not handle bias-in-training-data, hallucination, or domain-specific accuracy. So this is not a substitute for human review on outputs that matter — it’s a filter, not a sign-off.
  3. The 2024 revisions added a section on cases where the constitution itself contradicts user intent. The resolution is a deference rule that I want to study more carefully before relying on this as a governance pattern.
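The "filter, not a sign-off" point in caveat 2 can be sketched as a routing rule. This is a hypothetical illustration: the `Route` names, the `high_stakes` flag and the keyword-based `toy_check` are all my own placeholders, and a real principles check would be a model call or a review rubric rather than a string match.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Route(Enum):
    BLOCKED = "blocked"
    HUMAN_REVIEW = "human_review"
    AUTO_RELEASE = "auto_release"


@dataclass
class Decision:
    route: Route
    reason: str


def route_output(
    text: str,
    passes_principles: Callable[[str], bool],
    high_stakes: bool,
) -> Decision:
    """Passing the automated check never counts as approval for high-stakes output."""
    if not passes_principles(text):
        return Decision(Route.BLOCKED, "failed principles check")
    if high_stakes:
        return Decision(Route.HUMAN_REVIEW, "passed filter; still needs human sign-off")
    return Decision(Route.AUTO_RELEASE, "passed filter; low stakes")


def toy_check(text: str) -> bool:
    """Placeholder principles check: reject any promise of guaranteed results."""
    return "guaranteed" not in text.lower()


blocked = route_output("Returns are guaranteed", toy_check, high_stakes=False)
reviewed = route_output("Quarterly summary", toy_check, high_stakes=True)
released = route_output("Quarterly summary", toy_check, high_stakes=False)
```

The design choice that matters is the middle branch: the automated check can only remove work from the human queue for low-stakes outputs, and it can never approve a high-stakes one on its own.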

Filed for the November essay on automated assurance — where I’ll try to draw the line between “encoded governance” and “the human bit you don’t get to skip.”


Authored by John · Claude as editor.
Field notes are first drafts of thinking. Expect typos, half-formed ideas, and an occasional reversal. The good ones become essays. The rest are still useful to me.
