Summary of "Alignment Makes Language Models Normative, Not Descriptive"
Summary (Overview)
- Core Finding: Post-training alignment systematically shifts language models from being descriptive predictors of observed human behavior to being normative predictors of how people should behave according to idealized preferences. This creates a trade-off between optimizing models for human use and using them as accurate proxies for human behavior.
- Key Result: In multi-round strategic games (bargaining, persuasion, negotiation, repeated matrix games), base (pre-alignment) models significantly outperform their aligned counterparts at predicting real human decisions, with a win ratio of 9.7:1 (213 vs. 22 wins).
- Boundary Condition: This advantage reverses in simpler, one-shot settings where human behavior aligns more closely with normative theory. Aligned models outperform base models on one-shot 2x2 matrix games (4.1:1 win ratio) and non-strategic lottery choices (2.2:1 win ratio).
- Mechanism: The pattern is driven by history-dependent strategic dynamics. Aligned models perform well at the first round of multi-round games but lose their advantage as interaction history accumulates. This suggests alignment induces a normative bias, suppressing the prediction of complex, sometimes "uncooperative" behaviors like retaliation and reciprocity that emerge over repeated interactions.
- Robustness: The base model advantage is robust across 23 model families, 10+ prompt formulations, all game configuration parameters, and grows stronger with increased model scale.
Introduction and Theoretical Foundation
Large Language Models (LLMs) are increasingly used as proxies for human behavior (homo silicus) in social science research, from replicating psychological experiments to predicting strategic decisions. A critical, often implicit, assumption is that post-training alignment (e.g., via RLHF or DPO) is neutral or beneficial for this behavioral prediction task.
This paper challenges that assumption. Alignment optimizes models to generate responses that human evaluators approve of (cooperative, fair, helpful). However, human behavior in strategic settings is often not normatively "good"—people bluff, retaliate, and deviate from approved patterns. The authors hypothesize that alignment creates a normative bias, causing models to predict behavior people endorse rather than behavior they exhibit. This distinction between normative (prescriptive) and descriptive theories is foundational in behavioral sciences.
The paper tests the hypothesis that aligned models will predict human behavior well in simple, one-shot settings where behavior is relatively well-described by normative theory (e.g., Nash equilibrium), but poorly in multi-round strategic settings where behavior is shaped by complex, history-dependent dynamics like reciprocity, reputation, and adaptation.
Methodology
The study conducts a systematic comparison of 120 same-provider base–aligned model pairs from 23 families, evaluating their ability to predict 10,050 real human decisions.
- Game Families & Human Data:
- Bargaining: Alternating-offers game (Rubinstein model).
- Persuasion: Repeated cheap-talk game (Crawford and Sobel model).
- Negotiation: Bilateral price negotiation with outside options.
- Repeated Matrix Games: Prisoner's Dilemma and Battle of the Sexes (10 rounds).
- Data for the first three families comes from the GLEE benchmark, where humans played against LLM opponents without knowing their nature.
- Prediction Task: Framed as token probability extraction. For each human decision point, a prompt containing the game rules and dialogue history is fed to the model. The log-probabilities for decision tokens (e.g., "accept"/"reject") are extracted and normalized via a softmax:

  $$\hat{p}(a) = \frac{\exp(\ell_a)}{\sum_{t \in \mathcal{T}} \exp(\ell_t)},$$

  where $\ell_t$ is the log-probability of decision token $t$ and $t$ ranges over all decision tokens $\mathcal{T}$ for that game family. This yields a predicted probability for the affirmative action.
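The normalization step above can be sketched as follows; this is a minimal illustration, and the token names and log-probability values are invented, not taken from the paper:

```python
import math

def normalize_decision_probs(logprobs: dict[str, float]) -> dict[str, float]:
    """Softmax over decision-token log-probabilities so they sum to 1
    over the game family's decision vocabulary."""
    max_lp = max(logprobs.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(lp - max_lp) for tok, lp in logprobs.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# Hypothetical log-probs at a bargaining decision point
probs = normalize_decision_probs({"accept": -0.7, "reject": -1.9})
p_affirmative = probs["accept"]  # predicted probability of the affirmative action
```

In practice the raw log-probabilities would come from the model's output distribution at the decision position; here they are hard-coded for illustration.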
- Evaluation Metric: Pearson correlation between the model's predicted probability and the ground-truth human decision (coded as 1 for accept/yes/cooperate, 0 otherwise).
- Comparison: For each model pair in a game family, the correlations of the base and aligned models are compared, recording a "win" for the model with the higher correlation.
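The evaluation metric and the pairwise comparison can be sketched together; all prediction and decision values below are hypothetical, chosen only to illustrate the procedure:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between predicted probabilities and 0/1 decisions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predictions for the same human decisions from one base/aligned pair
human = [1, 0, 1, 1, 0]                   # ground-truth decisions (1 = accept)
base_pred = [0.8, 0.3, 0.7, 0.9, 0.2]     # base model's predicted probabilities
aligned_pred = [0.6, 0.5, 0.5, 0.6, 0.5]  # aligned model's predicted probabilities

# Record a "win" for the model whose predictions correlate better with behavior
winner = "base" if pearson(base_pred, human) > pearson(aligned_pred, human) else "aligned"
```

Repeating this per model pair and game family and tallying the wins yields the win ratios reported in the results.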
- Boundary Condition Tests: To test the limits of the base advantage, models are also evaluated on:
- One-shot 2x2 Matrix Games: 2,416 procedurally generated games (Zhu et al., 2025).
- Binary Lottery Choices: 1,001 non-strategic decision problems under risk (Marantz and Plonsky, 2025).
- Controls: The study controls for prompt format confounds by testing four variants: Base (native format), Aligned (native chat template), Base (with aligned chat template), and Aligned (with plain text format).
Empirical Validation / Results
1. Dominant Base Model Advantage in Multi-Round Games
- Base models won 213 out of 235 valid comparisons against their aligned counterparts across the four multi-round game families, a ratio of 9.7:1.
- The advantage was significant in every individual game family and consistent across all 23 model families.
Table: Base vs. Aligned Model Wins by Game Family
| Game Family | Base Wins | Aligned Wins | Win Ratio |
|---|---|---|---|
| Bargaining | 75 | 4 | 18.8:1 |
| Persuasion | 32 | 4 | 8.0:1 |
| Negotiation | 25 | 1 | 25.0:1 |
| Matrix Games | 81 | 13 | 6.2:1 |
| Total | 213 | 22 | 9.7:1 |
- Robustness: The advantage persisted when controlling for prompt format (base wins 5.0:1 with plain text, 5.3:1 with chat template) and across 10 different prompt formulations (base won 959 of 1,003 comparisons).
- Scaling: The base advantage grew with model size (see Figure 2 in the paper), suggesting it reflects richer pre-training representations of human behavior that alignment subsequently shifts toward normative patterns.
2. Reversal at Boundary Conditions
- One-shot Games: Aligned models won 57 vs. 14 comparisons (4.1:1). This advantage was consistent across all 12 game types tested.
- Lottery Choices: Aligned models won 62 vs. 28 comparisons (2.2:1).
3. Round-by-Round Dynamics within Multi-Round Games
A crucial finding explains the reversal: the base advantage is history-dependent.
- Round 1: Before interaction history develops, aligned models actually performed better in bargaining, negotiation, and persuasion.
- Round 2 Onward: The base model advantage emerged and grew as history accumulated. For example, in bargaining, the win ratio shifted from aligned-favored (61:32) at round 1 to strongly base-favored (82:4) from round 2 onward.
4. Mechanism: Alignment Shifts Predictions Toward Normative Patterns
Analysis of one-shot games showed that human aggregate choices correlated with Nash Equilibrium (NE) predictions. Aligned models' predictions were systematically closer to NE than base models' (aligned closer in 59 of 76 pairs). This confirms that alignment shifts model predictions toward normative, textbook solutions.
Theoretical and Practical Implications
- Theoretical Explanation: The results are consistent with the distributional narrowing induced by standard alignment methods like KL-regularized RLHF. The optimal policy under this framework is

  $$\pi^*(y \mid x) \propto \pi_{\text{base}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right),$$

  where $r$ is the learned reward model and $\beta$ is the KL-penalty coefficient. This represents an exponential tilt of the base distribution that concentrates probability mass on high-reward (annotator-approved) responses at the expense of the distribution's "tails." These tails contain the complex, sometimes normatively "bad" behaviors (retaliation, bluffing) that are crucial for predicting real human behavior in multi-round strategic settings.
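A toy numerical illustration of this exponential tilt; the response categories, base distribution, rewards, and β below are all invented for illustration, not drawn from the paper:

```python
import math

def kl_tilt(base_probs: list[float], rewards: list[float], beta: float) -> list[float]:
    """Exponential tilt of a base distribution: pi*(y) ∝ pi_base(y) * exp(r(y) / beta)."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    z = sum(unnorm)  # normalizing constant
    return [u / z for u in unnorm]

# Hypothetical base distribution over four response types:
# cooperative replies earn high annotator reward, retaliation/bluffing earn low reward.
base = [0.4, 0.3, 0.2, 0.1]        # cooperate, concede, retaliate, bluff
reward = [1.0, 0.8, -1.0, -1.5]    # annotator-style reward for each type
tilted = kl_tilt(base, reward, beta=0.5)
# Mass concentrates on the high-reward head; the "tail" behaviors shrink sharply.
```

Even with these small numbers, the retaliate/bluff tail collapses from 30% of the base mass to under 1% after tilting, which is the distributional narrowing the paper argues degrades behavioral prediction.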
- Practical Implications:
- For Behavioral Prediction: Researchers using LLMs as proxies for human behavior must choose models based on context. Base models are superior for predicting behavior in interactive, history-rich settings. Aligned models may be adequate for one-shot decisions or non-strategic tasks where behavior aligns with norms.
- Methodological Risk: Studies claiming "LLMs replicate human behavior" based on aligned models may actually be reporting that LLMs replicate normative behavior, with the gap invisible in simple settings. This poses a risk for social science simulations (e.g., of voters, consumers).
- Alignment Design: Current alignment methods create a trade-off between helpfulness/safety and behavioral fidelity. New methods are needed to preserve the empirical behavioral diversity of base models while adding desired assistant qualities.
Conclusion
The paper demonstrates a fundamental normative–descriptive trade-off induced by post-training alignment. Alignment optimizes LLMs to reflect human preferences for how one should act, which improves prediction in simple, normatively clear settings but systematically degrades prediction in complex, multi-round strategic interactions where human behavior is shaped by descriptive dynamics like reciprocity and history-dependent adaptation.
The choice between base and aligned models is therefore a substantive modeling assumption with significant consequences for predictive accuracy. Until alignment methods are developed that can preserve behavioral diversity, users must be aware that aligned models are better models for human use, while base models are better models of human behavior.
Future Directions include investigating which aspects of multi-round play drive the base advantage, testing the effect in other interactive domains (auctions, coalitions), and developing alignment techniques that mitigate this distributional collapse.