# Alignment Makes Language Models Normative, Not Descriptive

> Alignment makes language models predict how people should behave normatively rather than how they actually behave descriptively, especially in complex strategic interactions.

- **Source:** [arXiv](https://arxiv.org/abs/2603.17218)
- **Published:** 2026-03-20
- **Permalink:** https://picx.dev/p/TpCN66
- **Whiteboard:** https://picx.dev/p/TpCN66/image

## Summary

*   **Core Finding:** Post-training alignment systematically shifts language models from being **descriptive** predictors of observed human behavior to being **normative** predictors of how people *should* behave according to idealized preferences. This creates a trade-off between optimizing models for human use and using them as accurate proxies for human behavior.
*   **Key Result:** In multi-round strategic games (bargaining, persuasion, negotiation, repeated matrix games), **base (pre-alignment) models** significantly outperform their aligned counterparts at predicting real human decisions, with a win ratio of **9.7:1 (213 vs. 22 wins)**.
*   **Boundary Condition:** This advantage **reverses** in simpler, one-shot settings where human behavior aligns more closely with normative theory. Aligned models outperform base models on one-shot 2x2 matrix games (**4.1:1 win ratio**) and non-strategic lottery choices (**2.2:1 win ratio**).
*   **Mechanism:** The pattern is driven by **history-dependent strategic dynamics**. Aligned models perform well at the first round of multi-round games but lose their advantage as interaction history accumulates. This suggests alignment induces a **normative bias**, suppressing the prediction of complex, sometimes "uncooperative" behaviors like retaliation and reciprocity that emerge over repeated interactions.
*   **Robustness:** The base model advantage is robust across 23 model families, 10+ prompt formulations, all game configuration parameters, and grows stronger with increased model scale.

## Introduction and Theoretical Foundation
Large Language Models (LLMs) are increasingly used as proxies for human behavior (*homo silicus*) in social science research, from replicating psychological experiments to predicting strategic decisions. A critical, often implicit, assumption is that **post-training alignment** (e.g., via RLHF or DPO) is neutral or beneficial for this behavioral prediction task.

This paper challenges that assumption. Alignment optimizes models to generate responses that human evaluators **approve of** (cooperative, fair, helpful). However, human behavior in strategic settings is often not normatively "good"—people bluff, retaliate, and deviate from approved patterns. The authors hypothesize that alignment creates a **normative bias**, causing models to predict behavior people *endorse* rather than behavior they *exhibit*. This distinction between **normative** (prescriptive) and **descriptive** theories is foundational in behavioral sciences.

The paper tests the hypothesis that aligned models will predict human behavior well in simple, one-shot settings where behavior is relatively well-described by normative theory (e.g., Nash equilibrium), but poorly in **multi-round strategic settings** where behavior is shaped by complex, history-dependent dynamics like reciprocity, reputation, and adaptation.

## Methodology
The study conducts a systematic comparison of **120 same-provider base–aligned model pairs** from 23 families, evaluating their ability to predict **10,050 real human decisions**.

*   **Game Families & Human Data:**
    *   **Bargaining:** Alternating-offers game (Rubinstein model).
    *   **Persuasion:** Repeated cheap-talk game (Crawford and Sobel model).
    *   **Negotiation:** Bilateral price negotiation with outside options.
    *   **Repeated Matrix Games:** Prisoner's Dilemma and Battle of the Sexes (10 rounds).
    Data for the first three families comes from the GLEE benchmark, in which humans played against LLM opponents without being told that their opponents were LLMs.

*   **Prediction Task:** Framed as token probability extraction. For each human decision point, a prompt containing the game rules and dialogue history is fed to the model. The log-probabilities for decision tokens (e.g., "accept"/"reject") are extracted and normalized:
    $$
    p_{\text{accept}} = \frac{p(\text{yes})}{\sum_{d} p(d)}
    $$
    where "yes" denotes the affirmative decision token (e.g., "accept") and $d$ ranges over all decision tokens for that game family. This yields a predicted probability $p_{\text{accept}} \in [0, 1]$ for the affirmative action.

*   **Evaluation Metric:** **Pearson correlation** between the model's predicted $p_{\text{accept}}$ and the ground-truth human decision (coded as 1 for accept/yes/cooperate, 0 otherwise).

*   **Comparison:** For each model pair in a game family, the correlations of the base and aligned models are compared, recording a "win" for the model with the higher correlation.

*   **Boundary Condition Tests:** To test the limits of the base advantage, models are also evaluated on:
    1.  **One-shot 2x2 Matrix Games:** 2,416 procedurally generated games (Zhu et al., 2025).
    2.  **Binary Lottery Choices:** 1,001 non-strategic decision problems under risk (Marantz and Plonsky, 2025).

*   **Controls:** The study controls for prompt format confounds by testing four variants: Base (native format), Aligned (native chat template), Base (with aligned chat template), and Aligned (with plain text format).
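
The prediction, scoring, and comparison steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names (`p_affirmative`, `pearson`, `winner`), the token labels, and the toy log-probabilities are all assumptions, and a real pipeline would read the log-probabilities from a model.

```python
import math

def p_affirmative(logprobs, affirmative):
    """Share of probability mass on the affirmative token among the decision tokens."""
    probs = {token: math.exp(lp) for token, lp in logprobs.items()}
    return probs[affirmative] / sum(probs.values())

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def winner(base_r, aligned_r):
    """Record a win for whichever model of a pair has the higher correlation."""
    if base_r > aligned_r:
        return "base"
    if aligned_r > base_r:
        return "aligned"
    return "tie"

# Toy data, invented for illustration: four decisions coded 1 = accept, 0 = reject.
human = [1, 0, 1, 0]
base_logprobs = [
    {"accept": -0.3, "reject": -1.6},
    {"accept": -1.9, "reject": -0.4},
    {"accept": -0.5, "reject": -1.2},
    {"accept": -1.4, "reject": -0.6},
]
aligned_logprobs = [
    {"accept": -0.2, "reject": -2.0},
    {"accept": -0.6, "reject": -1.1},
    {"accept": -0.4, "reject": -1.5},
    {"accept": -0.5, "reject": -1.3},
]

base_r = pearson([p_affirmative(lp, "accept") for lp in base_logprobs], human)
aligned_r = pearson([p_affirmative(lp, "accept") for lp in aligned_logprobs], human)
print(f"base r = {base_r:.3f}, aligned r = {aligned_r:.3f}, win: {winner(base_r, aligned_r)}")
```

Renormalizing over the decision tokens only means that probability mass the model places on unrelated tokens does not affect the prediction.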

## Empirical Validation / Results

### 1. Dominant Base Model Advantage in Multi-Round Games
*   Base models won **213 out of 235** valid comparisons against their aligned counterparts across the four multi-round game families, a ratio of **9.7:1** ($p < 10^{-40}$).
*   The advantage was significant in every individual game family ($p < 10^{-6}$ each) and consistent across all 23 model families.

**Table: Base vs. Aligned Model Wins by Game Family**
| Game Family | Base Wins | Aligned Wins | Win Ratio | Significance (p) |
| :--- | :---: | :---: | :---: | :---: |
| Bargaining | 75 | 4 | 18.8:1 | $<10^{-40}$ |
| Persuasion | 32 | 4 | 8.0:1 | $<10^{-6}$ |
| Negotiation | 25 | 1 | 25.0:1 | $<10^{-6}$ |
| Matrix Games | 81 | 13 | 6.2:1 | $<10^{-6}$ |
| **Total** | **213** | **22** | **9.7:1** | **$<10^{-40}$** |

*   **Robustness:** The advantage persisted when controlling for prompt format (base wins 5.0:1 with plain text, 5.3:1 with chat template) and across 10 different prompt formulations (base won **959 of 1,003** comparisons, $p < 10^{-200}$).
*   **Scaling:** The base advantage grew with model size (see Figure 2 in the paper), suggesting that it reflects richer representations learned during pre-training, which alignment then shifts.
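
The win ratios in the table can be recomputed from the win counts, and the pooled result can be checked against a null of equally likely wins. The summary does not say which significance test the authors used, so the one-sided exact binomial sign test below is an assumption, as is the function name `sign_test_p`.

```python
from math import comb

def sign_test_p(wins, losses):
    """One-sided exact binomial p-value: P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n

# (base wins, aligned wins) per game family, copied from the table above.
counts = {
    "Bargaining": (75, 4),
    "Persuasion": (32, 4),
    "Negotiation": (25, 1),
    "Matrix Games": (81, 13),
}

for family, (base, aligned) in counts.items():
    print(f"{family}: {base / aligned:.1f}:1")

total_base = sum(base for base, _ in counts.values())
total_aligned = sum(aligned for _, aligned in counts.values())
print(f"Total: {total_base} vs {total_aligned}, ratio {total_base / total_aligned:.1f}:1")
print(f"Pooled sign-test p-value (assumed test): {sign_test_p(total_base, total_aligned):.2e}")
```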

### 2. Reversal at Boundary Conditions
*   **One-shot Games:** Aligned models won **57 vs. 14** comparisons (4.1:1, $p < 10^{-6}$). This advantage was consistent across all 12 game types tested.
*   **Lottery Choices:** Aligned models won **62 vs. 28** comparisons (2.2:1, $p = 2.19 \times 10^{-4}$).

### 3. Round-by-Round Dynamics within Multi-Round Games
A crucial finding explains the reversal: the base advantage is **history-dependent**.
*   **Round 1:** Before interaction history develops, aligned models actually performed better in bargaining, negotiation, and persuasion.
*   **Round 2 Onward:** The base model advantage emerged and grew as history accumulated. For example, in bargaining, the win ratio shifted from aligned-favored (61:32) at round 1 to strongly base-favored (82:4) from round 2 onward.

### 4. Mechanism: Alignment Shifts Predictions Toward Normative Patterns
Analysis of one-shot games showed that human aggregate choices correlated with Nash Equilibrium (NE) predictions ($r = 0.616$). Aligned models' predictions tracked NE more closely than base models' predictions did (mean $r = 0.41$ vs. $0.28$; aligned closer in 59 of 76 pairs, $p < 10^{-6}$). This indicates that alignment shifts model predictions toward normative, textbook solutions.
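
To make "NE prediction" concrete, the sketch below computes the interior mixed-strategy equilibrium of a 2x2 game from the standard indifference conditions. The summary does not describe how the authors derived their NE predictions (for example, how pure or multiple equilibria were handled), so the function `mixed_nash_2x2` and the example payoffs are illustrative assumptions.

```python
def mixed_nash_2x2(row_payoffs, col_payoffs):
    """Interior mixed-strategy Nash equilibrium of a 2x2 game.

    row_payoffs[i][j] and col_payoffs[i][j] are the payoffs when the row player
    picks action i and the column player picks action j.
    Returns (p, q): p = P(row plays action 0), q = P(column plays action 0),
    or None when no interior mixed equilibrium exists.
    """
    a, b = row_payoffs, col_payoffs
    # p makes the column player indifferent between its two actions.
    denom_p = b[0][0] - b[0][1] - b[1][0] + b[1][1]
    # q makes the row player indifferent between its two actions.
    denom_q = a[0][0] - a[0][1] - a[1][0] + a[1][1]
    if denom_p == 0 or denom_q == 0:
        return None
    p = (b[1][1] - b[1][0]) / denom_p
    q = (a[1][1] - a[0][1]) / denom_q
    if not (0 < p < 1 and 0 < q < 1):
        return None
    return p, q

# Battle of the Sexes with illustrative payoffs: each player prefers a different
# coordinated outcome, and miscoordination pays 0.
bos_row = [[2, 0], [0, 1]]
bos_col = [[1, 0], [0, 2]]
print(mixed_nash_2x2(bos_row, bos_col))

# Prisoner's Dilemma (action 0 = cooperate, 1 = defect): defection is dominant,
# so there is no interior mixed equilibrium.
pd_row = [[3, 0], [5, 1]]
pd_col = [[3, 5], [0, 1]]
print(mixed_nash_2x2(pd_row, pd_col))
```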

## Theoretical and Practical Implications
*   **Theoretical Explanation:** The results are consistent with the **distributional narrowing** induced by standard alignment methods like KL-regularized RLHF. The optimal policy under this framework is:
    $$
    \pi^*(x) \propto \pi_0(x) \exp(r(x)/\beta)
    $$
    Here $r(x)$ is the reward assigned to response $x$ and $\beta$ is the strength of the KL regularization toward the base distribution $\pi_0$. The optimal policy is an exponential tilt of $\pi_0$ that concentrates probability mass on high-reward (annotator-approved) responses at the expense of the distribution's "tails." These tails contain the complex, sometimes normatively "bad" behaviors (retaliation, bluffing) that are crucial for predicting real human behavior in multi-round strategic settings.
*   **Practical Implications:**
    *   **For Behavioral Prediction:** Researchers using LLMs as proxies for human behavior must choose models based on context. **Base models** are superior for predicting behavior in interactive, history-rich settings. **Aligned models** may be adequate for one-shot decisions or non-strategic tasks where behavior aligns with norms.
    *   **Methodological Risk:** Studies claiming "LLMs replicate human behavior" based on aligned models may actually be reporting that LLMs replicate *normative* behavior, with the gap invisible in simple settings. This poses a risk for social science simulations (e.g., of voters, consumers).
    *   **Alignment Design:** Current alignment methods create a trade-off between helpfulness/safety and behavioral fidelity. New methods are needed to preserve the empirical behavioral diversity of base models while adding desired assistant qualities.
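
The exponential tilt in the theoretical explanation above can be illustrated on a small discrete distribution. The action labels, base probabilities, and rewards below are invented for illustration and do not come from the paper; the sketch only shows how lowering $\beta$ moves mass away from low-reward actions and reduces entropy.

```python
import math

def tilt(base_probs, rewards, beta):
    """KL-regularized optimum: pi*(x) proportional to pi_0(x) * exp(r(x) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    z = sum(weights)  # normalizing constant
    return [w / z for w in weights]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical action set with made-up base probabilities and annotator rewards.
actions = ["cooperate", "concede", "bluff", "retaliate"]
base = [0.40, 0.25, 0.20, 0.15]
rewards = [1.0, 0.8, -0.5, -1.0]

print(f"base: entropy = {entropy(base):.3f}")
for beta in (10.0, 1.0, 0.25):
    tilted = tilt(base, rewards, beta)
    shares = ", ".join(f"{a} = {p:.3f}" for a, p in zip(actions, tilted))
    print(f"beta = {beta}: {shares}; entropy = {entropy(tilted):.3f}")
```

A large $\beta$ keeps the tilted policy close to the base distribution, while a small $\beta$ drives the probability of "bluff" and "retaliate" toward zero, which is the narrowing the summary describes.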

## Conclusion
The paper demonstrates a fundamental **normative–descriptive trade-off** induced by post-training alignment. Alignment optimizes LLMs to reflect human preferences for *how one should act*, which improves prediction in simple, normatively clear settings but systematically degrades prediction in complex, multi-round strategic interactions where human behavior is shaped by descriptive dynamics like reciprocity and history-dependent adaptation.

The choice between base and aligned models is therefore a substantive modeling assumption with significant consequences for predictive accuracy. Until alignment methods are developed that can preserve behavioral diversity, users must be aware that **aligned models are better models *for* human use, while base models are better models *of* human behavior.**

**Future Directions** include investigating which aspects of multi-round play drive the base advantage, testing the effect in other interactive domains (auctions, coalitions), and developing alignment techniques that mitigate this distributional collapse.

