Replace human annotators with a stronger model.
Same RLHF pipeline, different feedback source. Instead of humans comparing outputs, an AI judge generates the preference data. Cheaper, faster, more consistent. But it inherits the judge model’s blind spots and biases. If the judge can’t tell when an answer is wrong, the reward model won’t learn to penalize it.
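The judge-as-annotator step can be sketched roughly like this. Everything here is hypothetical scaffolding: `judge` stands in for a real LLM judge call, and the toy heuristic (prefer the longer answer) exists only so the sketch runs — a real judge's blind spots would be inherited exactly as described above.

```python
def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for an LLM judge. A real implementation would prompt a
    strong model to compare the answers; this toy heuristic just prefers
    the longer one, which also illustrates a classic judge bias."""
    return "a" if len(answer_a) >= len(answer_b) else "b"

def label_preferences(samples):
    """Turn (prompt, answer_a, answer_b) triples into (prompt, chosen,
    rejected) pairs — the same format a reward model trains on, except
    the labels come from the judge instead of human annotators."""
    pairs = []
    for prompt, a, b in samples:
        if judge(prompt, a, b) == "a":
            pairs.append((prompt, a, b))
        else:
            pairs.append((prompt, b, a))
    return pairs

samples = [("What is RLAIF?", "RL from AI feedback.", "RL.")]
print(label_preferences(samples))
```

Swapping the judge for a stronger model changes the labels but not the pipeline, which is the whole point: the reward-model and RL stages downstream are untouched.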
The most well-known RLAIF variant, introduced by Anthropic. Instead of learning from raw preference comparisons, the model critiques and revises its own outputs against a set of written principles. The "constitution" — that list of principles — replaces human preferences as the source of the reward signal.
The model generates a response, evaluates whether it violates any principle, rewrites if needed, and those revisions become training data. Reduces the annotation bottleneck, though the quality depends on how well the principles cover real failure modes.
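The generate–critique–revise loop above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `violates` and `revise` stand in for model-generated critiques and rewrites, using a toy banned-word check so the sketch is runnable.

```python
# Toy "constitution": principles mapped to detectable violations.
CONSTITUTION = {"Avoid insults.": ["idiot", "fool"]}

def violates(response: str, banned: list[str]) -> bool:
    """Stand-in for a model critique; here a simple substring check."""
    return any(word in response.lower() for word in banned)

def revise(response: str, banned: list[str]) -> str:
    """Stand-in for a model rewrite; here a crude word replacement."""
    for word in banned:
        response = response.replace(word, "person")
    return response

def critique_and_revise(response: str) -> str:
    """Check the response against each principle and revise on violation.
    The (original, revision) pairs would become training data."""
    for principle, banned in CONSTITUTION.items():
        if violates(response, banned):
            response = revise(response, banned)
    return response

print(critique_and_revise("You idiot!"))
```

The coverage caveat shows up directly here: a response that violates a principle the constitution never mentions passes through unrevised, so the training data never teaches the model to avoid it.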