Visual Summary | Human Psychometric Questionnaires Mischaracterize LLM Behavior

Summary (Overview)

This paper investigates whether human psychometric questionnaires (PVQ-40/21 for values, BFI-44/10 for personality) reliably characterize LLM behavior in realistic user interactions.
Comparing Likert self-reports with generation probability scores (log-probabilities over psychometrically validated scenario–response pairs) across eight open-source LLMs reveals substantial divergence in construct-level rankings and item-level structure.
The apparent internal consistency of questionnaire-derived profiles is attributed to item textual transparency: explicit lexical cues allow models to recognize the target construct and respond in socially desirable ways, whereas realistic user queries provide no such cues.
Demographic persona prompting (e.g., “elderly”, “right-wing”) shifts questionnaire responses in stereotype-consistent directions matching human patterns, but these shifts do not transfer to generation probabilities, indicating limited ability to simulate real-world demographic behaviors.
The study concludes that psychometric questionnaires are insufficient for predicting LLM behavior and recommends generation-probability-based profiling as a more ecologically valid alternative.

Introduction and Theoretical Foundation

Large language models (LLMs) are increasingly deployed in high-stakes tasks such as emotional support, ethical advice, and chatbots for children. To ensure behavioral predictability and safety, researchers have attempted to characterize LLM values and traits using human psychometric instruments like the Portrait Values Questionnaire (PVQ; Schwartz, 2012) and the Big Five Inventory (BFI; John et al., 1991; Soto and John, 2017). The underlying assumption is that psychological profiles derived from such tools can predict LLM behavior across diverse situations, analogous to their use in human psychology (Bardi and Schwartz, 2003).

Prior work has applied these questionnaires to LLMs and reported evidence of reliability and construct-related validity (e.g., Lee et al., 2025; Serapio-García et al., 2025; Jiang et al., 2023). However, a growing body of literature documents a gap between questionnaire responses and actual LLM behavior (Ai et al., 2024; Shen et al., 2025; Han et al., 2025b; Röttger et al., 2024; Jung et al., 2026; Zou et al., 2025; Taubenfeld et al., 2026). Table 1 in the paper compares these prior works. The authors note that previous studies either rely on behavioral probes that remain structurally close to questionnaires or use tasks distant from everyday user interactions, and often lack psychometrically validated realistic items. The present study addresses these gaps by:

Comparing two profiling methods: (1) self-reported Likert scores on established questionnaires, and (2) generation probability scores derived from log-probabilities over construct-annotated responses to real-world user queries (the Value Portrait dataset; Han et al., 2025a).
Investigating the mechanism behind questionnaire coherence (item transparency) and whether persona-induced shifts transfer to generation behavior.

The research is organized around four questions:

RQ1: Do established questionnaires and generation probability produce different LLM profiles?
RQ2: Does intra-construct response consistency hold in generation probability?
RQ3: Can LLMs recognize the target construct from item text?
RQ4: Do persona-induced profile shifts reflect human demographic patterns?

Methodology

Models: Eight open-source models across four families: Gemma3 (4B, 27B), Qwen 2.5 (7B, 72B), Qwen 3 (30B-A3B MoE, 235B-A22B MoE), and GPT-OSS (20B, 120B). Log-probability access is required, limiting the analysis to open-source models.

Established Questionnaire Profiling: Administered PVQ-40, PVQ-21 (Schwartz values), BFI-44, and BFI-10 (Big Five personality). Items are presented with gender-neutral pronouns and Likert scales (1–6 for PVQ, 1–5 for BFI). To mitigate option-order sensitivity, two prompt variants (high-to-low and low-to-high) are used and averaged. Construct scores are obtained by averaging across all items within a construct over both prompt variants.
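The averaging step described above can be illustrated with a minimal sketch. The function name, item identifiers, and ratings are hypothetical, and the sketch assumes every rating has already been parsed to a number on a common direction (any reverse-keyed items already recoded), which the summary does not detail.

```python
from statistics import mean

def construct_scores(responses_by_variant, item_to_construct):
    """Average Likert ratings per construct over all items and all prompt variants.

    responses_by_variant: list of dicts mapping item id -> numeric rating,
        one dict per prompt variant (e.g. high-to-low and low-to-high order).
    item_to_construct: dict mapping item id -> construct name.
    """
    pooled = {}
    for responses in responses_by_variant:
        for item, rating in responses.items():
            pooled.setdefault(item_to_construct[item], []).append(rating)
    return {construct: mean(ratings) for construct, ratings in pooled.items()}

# Toy data, invented for illustration only.
item_to_construct = {"q1": "Power", "q2": "Power", "q3": "Hedonism"}
high_to_low = {"q1": 2, "q2": 3, "q3": 5}
low_to_high = {"q1": 3, "q2": 2, "q3": 4}

scores = construct_scores([high_to_low, low_to_high], item_to_construct)
print(scores)
```

Pooling both variants before averaging is equivalent to averaging the two per-variant construct means whenever each variant covers the same items.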

Generation Probability Profiling: Uses the Value Portrait (VP) dataset (Han et al., 2025a), which consists of 520 query–response pairs from 104 real-world user queries (from ShareGPT, LMSYS, Reddit, Dear Abby). Each query has five candidate responses annotated for Schwartz values and Big Five traits, validated through a correlation study with 681 human raters (items with $r \geq 0.3$ are tagged). For each scenario $s$ and construct $c$, the log-probability of each tagged response $r$ is computed. The score for construct $c$ is defined as the mean across scenarios of the within-scenario mean log-probability:

$$\text{score}(c) = \frac{1}{|S_c|} \sum_{s \in S_c} \frac{1}{|R_{c,s}|} \sum_{r \in R_{c,s}} \log P(r \mid s)$$

where $S_c$ is the set of scenarios containing at least one response tagged with $c$, and $R_{c,s}$ are the tagged responses in scenario $s$. This macro-average gives equal weight to each scenario. To verify that VP responses lie within the generation distribution, 10 free-form responses were sampled per scenario; the highest-ranked VP candidate ranked 4th at the median, indicating the responses are plausible for the models.
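The macro-average in the score definition can be sketched as follows. The record layout and function name are hypothetical, and the sketch assumes each log P(r | s) has already been obtained from the model; how the token-level values are aggregated is not specified in this summary.

```python
from collections import defaultdict
from statistics import mean

def generation_scores(records):
    """Compute score(c): mean over scenarios of the within-scenario mean log-prob.

    records: iterable of (scenario_id, construct, log_prob) tuples, one per
        response tagged with that construct.
    """
    by_construct = defaultdict(lambda: defaultdict(list))
    for scenario, construct, log_prob in records:
        by_construct[construct][scenario].append(log_prob)
    return {
        construct: mean([mean(log_probs) for log_probs in scenarios.values()])
        for construct, scenarios in by_construct.items()
    }

# Toy log-probabilities, invented for illustration only.
records = [
    ("s1", "Power", -10.0),
    ("s1", "Power", -14.0),
    ("s2", "Power", -20.0),
    ("s1", "Hedonism", -8.0),
]
gen_scores = generation_scores(records)
print(gen_scores)
```

In the toy data, scenario s1 contributes two Power responses and s2 only one, yet both scenarios weigh equally in the result. A micro-average over the three responses would instead let s1 dominate.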

Prompt Design: Conversational scenarios are framed as user messages; advisory scenarios include a title and description. Both templates ask for a natural response without Likert-scale constraints.

Empirical Validation / Results

RQ1: Low ranking agreement between methods. Table 2 summarizes Spearman $\rho$ and NDCG. Within-method agreement (PVQ-40 ↔ PVQ-21, BFI-44 ↔ BFI-10) is high (mean $\rho = 0.74$ and $0.77$, respectively). Cross-method agreement is substantially lower: Generation ↔ PVQ-40 mean $\rho = 0.31$, Generation ↔ PVQ-21 mean $\rho = 0.28$, Generation ↔ BFI-44 mean $\rho = 0.26$, Generation ↔ BFI-10 mean $\rho = 0.11$. NDCG shows a smaller but consistent gap. Paired sign-flip permutation tests confirm the gap is significant ($p < 0.01$ for values, $p = 0.016$ for traits). The divergence varies across models; the only shifts shared by all models are that Hedonism falls and Power rises under generation probability.
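A rough sketch of the two statistics named above, Spearman's rank correlation and a paired sign-flip permutation test, is given below. The test statistic (absolute mean of the paired differences), the two-sided form, and all numbers are assumptions for illustration; the summary does not state the exact configuration used in the paper.

```python
import random
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

def sign_flip_p_value(differences, n_permutations=10000, seed=0):
    """Two-sided Monte Carlo sign-flip test on paired differences."""
    rng = random.Random(seed)
    observed = abs(mean(differences))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in differences]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Toy construct scores for one model under the two profiling methods.
questionnaire = [4.5, 3.0, 5.0, 2.0]
generation = [-12.0, -9.0, -15.0, -10.0]
rho = spearman(questionnaire, generation)

# Toy per-model gaps: within-method rho minus cross-method rho (eight models).
gaps = [0.41, 0.52, 0.38, 0.47, 0.55, 0.33, 0.49, 0.44]
p_value = sign_flip_p_value(gaps)
print(rho, p_value)
```

The sign-flip test suits paired per-model comparisons because, under the null hypothesis of no gap, each model's difference is equally likely to be positive or negative.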

RQ2: Item-level construct structure only appears in questionnaires. Table 3 reports $\eta^2$ (between-construct differentiation) and within-model variance (WMV, within-construct homogeneity). On established questionnaires, $\eta^2$ averages $0.526$ (PVQ-40) and $0.492$ (BFI-44), far above permutation baselines ($p < 0.01$). WMV averages $0.603$ and $0.592$. In generation probability scores, $\eta^2$ values are indistinguishable from baselines ($p = 0.604$ for PVQ-40; $p = 0.726$ for BFI-44), and WMV hovers near $1.0$, indicating no detectable construct structure.
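To make the $\eta^2$ comparison concrete, the sketch below uses the standard one-way definition (between-group sum of squares over total sum of squares) and a label-shuffling baseline. Both choices are assumptions: the summary does not give the paper's exact formula or permutation scheme, and WMV is omitted because its definition is not stated here.

```python
import random
from statistics import mean

def eta_squared(scores, labels):
    """Share of item-score variance explained by construct membership."""
    grand = mean(scores)
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total

def permutation_p_value(scores, labels, n_permutations=2000, seed=0):
    """Fraction of label shuffles whose eta^2 reaches the observed value."""
    rng = random.Random(seed)
    observed = eta_squared(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if eta_squared(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Toy item scores with clear construct structure, invented for illustration.
item_scores = [5, 5, 4, 5, 2, 1, 2, 1, 3, 3, 4, 3]
item_labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
eta = eta_squared(item_scores, item_labels)
p_structured = permutation_p_value(item_scores, item_labels)
print(eta, p_structured)
```

When item scores cluster by construct, as in the toy data, shuffled labels rarely reach the observed value and the p-value is small. When scores are unrelated to construct labels, the observed $\eta^2$ falls inside the shuffled distribution, which is the pattern reported for generation probability.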

RQ3: Established items are textually transparent; VP scenarios are not. Table 4 shows LLMs can correctly identify the construct measured by established items with mean $F_1$ from $0.69$ to $0.83$, while VP scenario recognition is near chance ($F_1 = 0.05$–$0.11$). Sentence embedding analysis (Figure 1) confirms that established items exhibit clear diagonal structure in item–definition similarity and within-construct similarity, whereas VP items show no such structure. A sentence transformer (all-mpnet-base-v2) assigns the correct construct to 77–81% of established items versus only 11–26% for VP scenarios. This suggests that the internal consistency of questionnaire profiles arises from models recognizing lexical cues and responding in alignment-consistent, socially desirable ways, rather than from stable dispositions.
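The embedding-based assignment can be pictured as a nearest-definition lookup: each item is given the construct whose definition embedding is most similar. The sketch below assumes cosine similarity and uses invented two-dimensional vectors in place of real sentence embeddings; the function names are hypothetical and the paper's exact procedure may differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_construct(item_vec, definition_vecs):
    """Return the construct whose definition vector is closest to the item."""
    return max(definition_vecs, key=lambda c: cosine(item_vec, definition_vecs[c]))

def assignment_accuracy(items, definition_vecs):
    """items: list of (embedding, gold construct) pairs."""
    correct = sum(
        assign_construct(vec, definition_vecs) == gold for vec, gold in items
    )
    return correct / len(items)

# Toy vectors standing in for sentence embeddings (illustration only).
definitions = {"Power": [1.0, 0.0], "Benevolence": [0.0, 1.0]}
toy_items = [
    ([0.9, 0.1], "Power"),
    ([0.2, 0.8], "Benevolence"),
    ([0.6, 0.4], "Benevolence"),  # lands closer to the Power definition
]
accuracy = assignment_accuracy(toy_items, definitions)
print(accuracy)
```

High accuracy under this kind of lookup means the item text alone reveals its target construct, which is the sense of transparency used in this section.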

RQ4: Persona shifts match human patterns only in questionnaires. Table 5 shows cosine similarity between persona-induced LLM deltas (averaged across seven models) and human value differences from the European Social Survey (ESS). For PVQ-40, cosine is positive in all eight conditions (mean $0.60$), with strongest alignment for Age and Political orientation. For PVQ-21, mean cosine is $0.47$. Direction match is $62/80$ and $55/80$, respectively ($p < 0.001$). In contrast, VP generation probability shifts show no coherent pattern: mean cosine is $-0.03$, aggregate cosine over the concatenated 80-dimensional delta is $+0.007$ (95% CI $[-0.22, +0.24]$), and direction match is $40/80$ (chance). Shift magnitude (normalized by within-profile standard deviation) is $0.67$ and $0.71$ for questionnaires (roughly $3$–$3.5$ times the human value of $0.20$), but only $0.37$ for VP. The exaggerated magnitude of the questionnaire shifts, together with the directional incoherence of the generation-probability shifts, indicates that persona prompting exploits item transparency but fails to produce behavior that matches real-world demographic patterns.
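The two alignment measures above, cosine similarity between delta vectors and the direction-match count, can be sketched as follows. The one-sided binomial test against a 0.5 chance rate is an assumption (the summary reports p-values without naming the test), and the delta vectors are invented.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def direction_match(llm_delta, human_delta):
    """Count dimensions where both deltas have the same sign.

    Zero is treated as non-positive, a simplification for this sketch.
    """
    return sum((a > 0) == (b > 0) for a, b in zip(llm_delta, human_delta))

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )

# Toy persona-induced shifts over four constructs (illustration only).
llm_delta = [0.5, -0.2, 0.1, -0.4]
human_delta = [0.2, -0.1, -0.3, -0.2]
similarity = cosine(llm_delta, human_delta)
matches = direction_match(llm_delta, human_delta)

# Tail probabilities for two of the direction-match counts reported above.
p_questionnaire = binomial_upper_tail(62, 80)
p_generation = binomial_upper_tail(40, 80)
print(similarity, matches, p_questionnaire, p_generation)
```

Under this assumed test, 62 matches out of 80 is far in the tail of the chance distribution, while 40 out of 80 sits at its center.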

Theoretical and Practical Implications

Implications for LLM evaluation: The strong divergence between questionnaire scores and generation probability profiles calls into question the widespread practice of using human psychometric instruments to characterize LLM values and personality. Researchers should be cautious when interpreting such profiles as indicative of real-world LLM behavior.
Mechanism of coherence: The finding that item transparency, rather than stable dispositions, drives the internal consistency of questionnaire profiles has implications for understanding how LLMs respond to self-report instruments. Models may be aligning responses with training objectives (e.g., helpfulness, safety) rather than expressing genuine psychological traits.
Demographic simulation: The inability of persona prompts to produce coherent demographic shifts in generation probabilities suggests that LLMs cannot reliably simulate the behavior of target demographics in realistic user interactions. This limits the utility of persona-based steering for fairness or personalization.
Methodological recommendation: The paper advocates for adopting generation-probability-based profiling (e.g., using the Value Portrait dataset) as a more ecologically valid alternative to established questionnaires when the goal is to predict or characterize LLM behavior in everyday user interactions.

Conclusion

The study demonstrates that human psychometric questionnaires mischaracterize LLM behavior. Across eight open-source LLMs, profiles derived from Likert self-reports diverge substantially from those derived from generation probabilities over realistic scenarios. The apparent construct structure in questionnaire responses is largely an artifact of item textual transparency, and demographic personas produce shifts that match human patterns only in questionnaires, not in generation behavior. These findings suggest that questionnaire scores are insufficient evidence of model-level psychological characteristics. Future work should supplement or replace established questionnaires with generation-probability-based evaluation. Limitations include the requirement of token-level log-probabilities (which limits the method's applicability to closed-source models), reliance on the VP dataset’s construct coverage, and the absence of a matched human baseline for RQ1–RQ3. Extending the approach to unconstrained generation and to additional psychological constructs is a natural next step.