Training Language Models to Follow Instructions with Human Feedback
Summary (Overview)
- Key Contribution: Introduces InstructGPT, a method for aligning large language models with human intent by fine-tuning GPT-3 through a three-step process of Supervised Fine-Tuning (SFT), Reward Model (RM) training, and reinforcement learning via PPO, collectively known as Reinforcement Learning from Human Feedback (RLHF).
- Primary Finding: Despite having over 100x fewer parameters (1.3B vs. 175B), outputs from the 1.3B InstructGPT model are significantly preferred by human evaluators over outputs from the 175B GPT-3. The 175B InstructGPT model is preferred over the 175B GPT-3 85% ± 3% of the time.
- Improved Safety & Truthfulness: InstructGPT models show improvements in truthfulness (e.g., on TruthfulQA) and generate about 25% fewer toxic outputs when prompted respectfully, though improvements in bias are minimal.
- Mitigated Alignment Tax: The paper introduces PPO-ptx, a variant that mixes RLHF updates with pretraining objective updates, which successfully minimizes performance regressions on standard NLP benchmarks (e.g., SQuAD, DROP) while preserving alignment gains.
- Generalization: InstructGPT models demonstrate promising generalization to instructions outside the fine-tuning distribution, such as following non-English instructions or answering questions about code, despite these being rare in the training data.
Introduction and Theoretical Foundation
The core problem addressed is the misalignment between the standard language modeling objective (predicting the next token) and the desired objective of "following the user’s instructions helpfully and safely." Large language models (LMs) like GPT-3, while capable, often generate untruthful, biased, toxic, or simply unhelpful outputs that do not align with user intent.
This work builds on the framework of Reinforcement Learning from Human Feedback (RLHF), previously applied to domains like summarization. The goal is to align models to be helpful, honest, and harmless. The paper operationalizes alignment as training models to act in accordance with user intentions, which includes both explicit instructions and implicit desires like truthfulness and safety.
The research is motivated by the practical need to make deployed language models more reliable and controllable, and serves as an iterative, empirical step in the broader agenda of aligning increasingly capable AI systems.
Methodology
The methodology involves three key steps, as illustrated in Figure 2 of the paper:
Step 1: Supervised Fine-Tuning (SFT)
- A dataset of ~13k prompts with human-written demonstrations of desired outputs is collected.
- A pretrained GPT-3 model is fine-tuned on this data using supervised learning.
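The SFT objective is standard supervised learning: maximize the likelihood of the human-written demonstration tokens. A minimal sketch, where `token_logprobs` are placeholder values standing in for the model's per-token log-probabilities (not a real model call):

```python
def sft_loss(token_logprobs: list[float]) -> float:
    """Supervised fine-tuning loss for one demonstration: mean
    negative log-likelihood of the human-written target tokens.
    Each entry stands in for log p(token_t | prompt, tokens_<t)."""
    return -sum(token_logprobs) / len(token_logprobs)

# Illustrative values: higher log-probability on the demonstration
# tokens means lower loss.
print(round(sft_loss([-0.1, -0.5, -0.2]), 4))  # ≈ 0.2667
```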
Step 2: Reward Model (RM) Training
- A dataset of ~33k prompts is collected, where labelers rank 4 to 9 model outputs from best to worst.
- A reward model (6B parameters) is trained to predict human preferences. For a prompt $x$, preferred completion $y_w$, and dispreferred completion $y_l$, the loss is:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

where $r_\theta(x, y)$ is the scalar reward, $\sigma$ is the sigmoid function, $K$ is the number of ranked completions, and $D$ is the dataset of comparisons. All $\binom{K}{2}$ comparisons from the same prompt are treated as a single batch element to prevent overfitting.
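The pairwise comparison loss described above can be sketched in plain Python. The rewards below are placeholder scalars standing in for $r_\theta(x, y)$ outputs, not a real reward model:

```python
import math
from itertools import combinations

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rm_loss_for_prompt(rewards_best_to_worst: list[float]) -> float:
    """Pairwise preference loss for one prompt, averaged over all
    K-choose-2 comparisons (the 'single batch element' per prompt)."""
    K = len(rewards_best_to_worst)
    pairs = list(combinations(range(K), 2))  # (preferred_idx, dispreferred_idx)
    total = 0.0
    for w, l in pairs:
        r_w = rewards_best_to_worst[w]  # reward of preferred completion
        r_l = rewards_best_to_worst[l]  # reward of dispreferred completion
        total += -math.log(sigmoid(r_w - r_l))
    return total / len(pairs)

# Example with K = 4 ranked outputs: rewards that agree with the
# human ranking give a lower loss than rewards that reverse it.
print(round(rm_loss_for_prompt([2.0, 1.0, 0.0, -1.0]), 4))  # ≈ 0.207
```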
Step 3: Reinforcement Learning (RL) Fine-Tuning
- The SFT model is fine-tuned using the Proximal Policy Optimization (PPO) algorithm to maximize the reward output by the RM, plus a penalty for deviating too far from the SFT model (KL divergence).
- The PPO-ptx variant adds a term to preserve performance on the original pretraining distribution, mitigating the "alignment tax." The combined objective is:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\text{RL}}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right] + \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}}\left[\log \pi_\phi^{\text{RL}}(x)\right]$$

where $\beta$ controls the KL penalty strength and $\gamma$ controls the pretraining gradient strength ($\gamma = 0$ recovers the plain PPO objective).
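A per-example view of this combined objective can be sketched as follows. The log-probabilities and coefficient values are illustrative placeholders, not the paper's actual hyperparameters or model outputs:

```python
def ppo_ptx_objective(
    reward: float,         # r_theta(x, y) from the reward model
    logp_rl: float,        # log pi_RL(y | x) under the current policy
    logp_sft: float,       # log pi_SFT(y | x) under the frozen SFT model
    logp_pretrain: float,  # log pi_RL(x) on a pretraining example
    beta: float = 0.02,    # KL penalty strength (illustrative value)
    gamma: float = 1.0,    # pretraining coefficient (illustrative value)
) -> float:
    """Per-example PPO-ptx objective: RM reward, minus a KL penalty
    toward the SFT policy, plus a pretraining log-likelihood term.
    Setting gamma = 0 recovers the plain PPO objective."""
    kl_penalty = beta * (logp_rl - logp_sft)  # sample estimate of beta * KL
    ptx_bonus = gamma * logp_pretrain
    return reward - kl_penalty + ptx_bonus
```

When the RL policy matches the SFT policy on a sample, the KL penalty vanishes; as the policy drifts toward higher-probability-than-SFT completions, the penalty grows and pulls it back.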
Datasets & Human Data:
- Prompts come from the OpenAI API (Playground) and labeler-written tasks.
- A team of ~40 contractors provided demonstrations and comparisons. Inter-annotator agreement was ~72-77%.
Baselines: Models are compared to the base GPT-3, a few-shot prompted GPT-3, and GPT-3 fine-tuned on public instruction datasets (FLAN and T0).
Empirical Validation / Results
1. Human Preference on API Distribution:
- Labelers significantly prefer InstructGPT outputs over GPT-3 across all model sizes.
- The 175B PPO-ptx (InstructGPT) model is preferred over the 175B GPT-3 85% ± 3% of the time and over few-shot GPT-3 71% ± 4% of the time.
- InstructGPT outputs are rated as more appropriate, better at following explicit constraints, and less prone to "hallucination" on closed-domain tasks.
Table: Key Human Preference Results (Win rate vs 175B SFT baseline)
| Model Size | Model Type | Win Rate (Instruct Dist.) | Win Rate (GPT Dist.) |
|---|---|---|---|
| 1.3B | GPT-3 | ~0.25 | ~0.25 |
| 1.3B | GPT-3 (prompted) | ~0.35 | N/A |
| 1.3B | SFT | ~0.50 | ~0.40 |
| 1.3B | PPO-ptx (InstructGPT) | ~0.70 | ~0.55 |
| 175B | GPT-3 | ~0.20 | ~0.20 |
| 175B | PPO-ptx (InstructGPT) | ~0.75 | ~0.65 |
Data approximated from Figures 1 & 3.
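Intervals like "85% ± 3%" are consistent with a simple binomial standard error on the labeler win rate. A sketch of that calculation; the comparison counts below are made-up for illustration, not the paper's data:

```python
import math

def win_rate_with_se(wins: int, total: int) -> tuple[float, float]:
    """Win rate and its binomial standard error, sqrt(p * (1 - p) / n)."""
    p = wins / total
    se = math.sqrt(p * (1.0 - p) / total)
    return p, se

# Hypothetical counts: 425 preferred out of 500 pairwise comparisons.
p, se = win_rate_with_se(wins=425, total=500)
print(f"{p:.0%} ± {se:.1%}")  # 85% ± 1.6%
```

The error bar shrinks as the number of collected comparisons grows, which is why preference evaluations at this scale need hundreds of labeled pairs per model comparison.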
2. Truthfulness and Toxicity:
- Truthfulness: On TruthfulQA, InstructGPT generates truthful and informative answers about twice as often as GPT-3. On closed-domain API tasks, hallucination rates drop from 41% (GPT-3) to 21% (InstructGPT).
- Toxicity: When given a "respectful" instruction, the 175B InstructGPT generates ~25% fewer toxic outputs than GPT-3 according to human evaluators on RealToxicityPrompts. However, it shows no significant improvement on bias benchmarks (Winogender, CrowS-Pairs).
3. Performance on Public NLP Datasets:
- Standard RLHF (PPO) causes regressions on tasks like SQuAD and DROP ("alignment tax").
- The PPO-ptx variant successfully mitigates these regressions, recovering or even surpassing GPT-3 performance on most tasks while retaining alignment benefits.
4. Generalization and Limitations:
- Generalization: InstructGPT models show an emergent ability to follow instructions in non-English languages and to summarize/answer questions about code, despite minimal such data in training.
- Limitations: Models still make simple mistakes, such as following instructions that presuppose false premises, giving overly hedged answers, and failing to satisfy instructions with multiple explicit constraints.
Theoretical and Practical Implications
1. Implications for Alignment Research:
- Cost-Effectiveness: RLHF alignment is modest in cost compared to pretraining, suggesting it's a highly efficient way to improve model usefulness.
- Generalization of Alignment: Evidence that aligned behavior generalizes to unsupervised tasks is promising for scaling alignment techniques.
- Low Alignment Tax: The success of PPO-ptx demonstrates that alignment need not come at a high cost to general capabilities, making it more likely to be adopted.
- Real-World Validation: This work grounds alignment research in production-scale systems, providing a crucial empirical feedback loop.
2. Practical Implications and Deployment:
- Model Safety: While improvements are made, InstructGPT is not fully safe. It will follow harmful instructions, indicating that alignment is not a panacea and must be part of a broader safety ecosystem (e.g., use-case policies, monitoring).
- Defining "Alignment": The paper highlights that models are aligned to the specific preferences of the researchers, labelers, and API users involved in the process. This raises critical questions about whose values the model is aligned to and how to design fair, transparent, and participatory alignment processes.
- Dual-Use Nature: Improving instruction-following also makes models easier to misuse for generating misinformation or harmful content, underscoring the need for responsible deployment frameworks.
Conclusion
This paper demonstrates that fine-tuning large language models with human feedback (RLHF) is a highly effective and promising direction for aligning models with human intent. The resulting InstructGPT models are significantly preferred by humans, more truthful, and less toxic (when prompted appropriately) than the much larger GPT-3.
Key takeaways:
- Alignment is achievable and scalable with current techniques, offering large gains over simply scaling up model size.
- The PPO-ptx method is crucial for maintaining general capabilities while achieving alignment.
- Important open challenges remain: improving model safety against malicious instructions, reducing bias, defining whose preferences to align to, and understanding the limits of generalization.
- The work serves as a critical step in empirically advancing AI alignment research on real-world systems, providing a foundation for future work to build upon.