
Alignment

Definition

Alignment is the challenge of ensuring AI systems pursue intended goals and behave according to human values and preferences, encompassing training techniques like RLHF and Constitutional AI that shape model behavior.

Why It Matters

Raw language models are like powerful tools without safety guards. They predict text that matches patterns in their training data, including harmful, biased, or unhelpful ones. Alignment is the process of shaping models to be helpful, harmless, and honest.

The practical impact: before alignment, GPT-3 would happily write instructions for harmful activities. After RLHF alignment, ChatGPT refuses those requests and tries to be genuinely helpful. This isn’t just safety theater. Alignment fundamentally changes how models behave.

For AI engineers, understanding alignment matters because it explains your tools’ behavior. Why does Claude refuse certain requests? Why does GPT-4 sometimes seem overly cautious? These behaviors come from alignment training. Understanding alignment helps you work with models effectively and build systems that complement their strengths.

Implementation Basics

Major Alignment Techniques

RLHF (Reinforcement Learning from Human Feedback): Humans rate model outputs. A reward model learns from these preferences. The language model is then trained to maximize predicted reward. This is how ChatGPT became ChatGPT.
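
A minimal sketch of the reward-model step, assuming PyTorch and an illustrative `reward_model` module that maps a batch of responses to scalar scores (the later RL step that optimizes the language model against this reward model is omitted):

    # Sketch only: pairwise (Bradley-Terry) loss for an RLHF reward model.
    # `reward_model`, `chosen`, and `rejected` are illustrative placeholders.
    import torch.nn.functional as F

    def reward_model_loss(reward_model, chosen, rejected):
        """The human-preferred ("chosen") response should receive a higher score."""
        r_chosen = reward_model(chosen)      # shape: (batch,)
        r_rejected = reward_model(rejected)  # shape: (batch,)
        # Minimizing this loss pushes chosen scores above rejected scores.
        return -F.logsigmoid(r_chosen - r_rejected).mean()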

Constitutional AI: The model critiques and revises its own outputs based on a “constitution” of principles. Reduces need for human feedback by using AI feedback, guided by explicit rules about helpful and harmless behavior.
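
A rough sketch of the critique-and-revise loop. Here `generate` stands in for any text-generation call, and the two principles are illustrative, not an actual published constitution:

    # Sketch only: self-critique and revision guided by explicit principles.
    CONSTITUTION = [
        "Choose the response that is most helpful to the user.",
        "Avoid responses that are harmful, deceptive, or biased.",
    ]

    def constitutional_revision(generate, prompt, draft):
        """Critique and rewrite a draft response against each principle in turn."""
        revised = draft
        for principle in CONSTITUTION:
            critique = generate(
                f"Request: {prompt}\nResponse: {revised}\n"
                f"Principle: {principle}\n"
                "Explain how the response could better follow the principle."
            )
            revised = generate(
                f"Request: {prompt}\nResponse: {revised}\nCritique: {critique}\n"
                "Rewrite the response so it addresses the critique."
            )
        return revised

In the original recipe, revisions produced this way become training data for further fine-tuning; the loop above only shows the data-generation side.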

DPO (Direct Preference Optimization): A simplified alternative to RLHF that optimizes the model directly on human preference data, without training a separate reward model. Increasingly popular for its simplicity.
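
The DPO objective itself fits in a few lines. This sketch assumes per-sequence log-probabilities have already been computed for the policy and for a frozen reference model; the names and the `beta` value are illustrative:

    # Sketch only: the DPO loss over a batch of preference pairs.
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # How much more (in log space) the policy prefers each response
        # than the frozen reference model does.
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        # Push the policy to favor the chosen response over the rejected one.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()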

Instruction Tuning: Fine-tuning on instruction-following examples. The model learns the format and style of helpful responses.
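
A minimal sketch of formatting one training example for supervised instruction tuning; the template is illustrative rather than any particular model’s chat format:

    # Sketch only: one instruction-tuning example as a prompt/completion pair.
    def format_example(instruction, response):
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
        # During fine-tuning, the loss is usually computed only on the
        # completion tokens, not the prompt tokens.
        return {"prompt": prompt, "completion": response}

    example = format_example(
        "Summarize the following paragraph in one sentence.",
        "The paragraph argues that alignment training shapes model behavior.",
    )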

Why Alignment Is Hard

Specification: What exactly should the model do? “Be helpful” is vague. Edge cases require nuanced judgment.

Evaluation: How do you measure alignment? Automated metrics miss important aspects. Human evaluation is expensive.

Robustness: Aligned behavior should persist under unusual inputs, adversarial attacks, and distribution shift.

Goodhart’s Law: When you optimize for a metric, the metric stops being a good measure. Models can learn to game alignment measures.

Practical Implications

For application developers:

  • Rely on aligned models but don’t assume they’re perfect
  • Add application-level guardrails (see the sketch after this list)
  • Test edge cases and adversarial scenarios
  • Provide clear instructions that align with how models were trained
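
As an example of the guardrail point above, here is a minimal sketch of pre- and post-checks around a model call. `call_model` and the blocked-topic list are hypothetical placeholders; real systems typically use classifiers or moderation endpoints rather than keyword matching:

    # Sketch only: application-level guardrails around an aligned model.
    BLOCKED_TOPICS = ["credit card number", "password"]

    def guarded_completion(call_model, user_input):
        lowered = user_input.lower()
        # Pre-check: refuse clearly out-of-scope requests before calling the model.
        if any(topic in lowered for topic in BLOCKED_TOPICS):
            return "Sorry, I can't help with that request."
        output = call_model(user_input)
        # Post-check: don't rely on model alignment alone; validate the output too.
        if any(topic in output.lower() for topic in BLOCKED_TOPICS):
            return "Sorry, I can't share that information."
        return output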

Alignment is an ongoing research area, not a solved problem. Models get better over time, but expect limitations and plan for failure modes.

Source

Training language models to follow instructions with human feedback (InstructGPT) demonstrated that RLHF can significantly improve alignment with user intent.

https://arxiv.org/abs/2203.02155