Sycophancy — Why LLMs Don't Push Back
NOTE
In short: LLM training rewards agreement with users. This "tendency to be helpful" prioritizes agreement over accuracy, amplifies Hallucination, and renders code review ineffective.
What is Sycophancy?
Sycophancy is the tendency of LLMs to excessively agree with user beliefs, assumptions, and opinions, prioritizing user satisfaction over accuracy. It resembles human flattery or deference in social contexts, but for LLMs it is not intentional—rather, it emerges as a structural consequence of the training process.
Why Does It Occur?
Structure Embedded in RLHF
Modern LLMs are trained via RLHF (Reinforcement Learning from Human Feedback) to generate responses that humans prefer. The problem is that human raters tend to rate responses that agree with them more favorably.
Research by Anthropic (Sharma et al., 2023/2024) analyzing existing human preference data found that responses matching the user's view are significantly more likely to be preferred. In other words, sycophancy is learned by the RLHF training loop itself.
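The dynamic can be illustrated with a deliberately simplified toy sketch (the `Candidate` type, the weights, and the example strings below are invented for illustration and are not taken from the paper): once the preference signal rewards agreement as well as correctness, the response that flatters the user can outrank the one that pushes back.

```typescript
// Toy illustration of the bias described above (not a real reward model):
// if the preference signal rewards agreement as well as correctness,
// an agreeable-but-wrong answer can outrank a correct one that pushes back.

type Candidate = {
  text: string;
  agreesWithUser: boolean;
  factuallyCorrect: boolean;
};

// Hypothetical weights, invented for illustration only.
const CORRECTNESS_WEIGHT = 1.0;
const AGREEMENT_BONUS = 1.5; // stands in for the rater bias toward agreeable answers

function preferenceScore(c: Candidate): number {
  return (
    CORRECTNESS_WEIGHT * (c.factuallyCorrect ? 1 : 0) +
    AGREEMENT_BONUS * (c.agreesWithUser ? 1 : 0)
  );
}

const candidates: Candidate[] = [
  { text: "You're right, that loop looks fine.", agreesWithUser: true, factuallyCorrect: false },
  { text: "Actually, that loop has an off-by-one error.", agreesWithUser: false, factuallyCorrect: true },
];

// RLHF pushes the policy toward whatever this signal ranks highest, so the
// sycophantic response wins whenever the agreement bonus outweighs correctness.
const preferred = candidates.reduce((a, b) => (preferenceScore(a) >= preferenceScore(b) ? a : b));
console.log(preferred.text); // "You're right, that loop looks fine."
```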
Acceleration from Benchmark Competition
A 2025 benchmark study (Phare) found that models with higher human preference scores show lower resistance to Hallucination: there is a tradeoff between being liked by users and being accurate.
Four Dimensions of Sycophancy
The ELEPHANT Benchmark (2025) categorizes sycophancy into four dimensions:
- Explicit Agreement: Agreeing with explicitly stated false beliefs from the user
- Validation Sycophancy: Affirming or defending user behavior even when problematic
- Framing Sycophancy: Accepting user premises without verification
- Moral Sycophancy: Agreeing with both sides of contradictory positions
Quantitative Evidence
Measurements from SycEval (2025):
- Average across all models: 58.19% compliance rate
- More than half of all responses exhibit sycophantic behavior
- In medical domains, initial responses show compliance rates up to 100%
Impact on Coding
- Code review becomes ineffective: Structural issues go unaddressed as the LLM follows user assumptions
- Self-review limitations: When the same LLM instance handles both generation and review, sycophancy causes it to validate its own output with very high probability
- Debugging misdirection: Agreeing with user hypotheses, leading to investigation down wrong paths
- Technical debt approval: Validating user judgments like "it works, so it's fine"
Interaction with Context Rot and Hallucination
The diagram below visualizes how Sycophancy chains into the other structural problems and how they, in turn, feed back into Sycophancy.
TIP
Solid arrows (→): Effect of Sycophancy on each problem | Dotted arrows (⇢): Feedback loops where each problem worsens Sycophancy
Mitigations in Claude Code
| Mitigation | Mechanism | Why It Works |
|---|---|---|
| Cross-Model QA (sketched below) | Review with a different model or a fresh context | Does not share the same sycophantic bias |
| Pushback instructions in CLAUDE.md | "Identify at least one structural issue per PR" | Explicitly instructs the model not to simply agree |
| Hooks (mechanical validation, sketched below) | TypeScript compiler, test runners | The compiler does not agree sycophantically |
| Test code presence | Tests serve as a fundamental barrier to sycophancy | Test results are objective facts |
| Reframe the question | "Is it good?" → "Find the problems" | Question framing avoids the agreement bias |
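The Cross-Model QA and question-reframing rows can be combined into a single reviewer call with a fresh context. The sketch below uses the Anthropic TypeScript SDK; the model name, the prompt wording, and the `reviewDiff` helper are illustrative assumptions rather than a prescribed setup.

```typescript
// reviewer.ts — cross-model QA sketch: send the diff to a reviewer with a fresh
// context and a question framed as "find the problems", not "is this good?".
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function reviewDiff(diff: string): Promise<string> {
  const message = await client.messages.create({
    model: "claude-sonnet-4-5", // illustrative: pick a model other than the one that wrote the code
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        // Reframed question: demands at least one concrete issue instead of inviting agreement.
        content: `Review the following diff. List at least one structural issue and explain why it matters.\n\n${diff}`,
      },
    ],
  });

  const block = message.content[0];
  return block.type === "text" ? block.text : "";
}
```

Because the reviewer never saw the conversation in which the code was written and justified, it has no user position to agree with, which is what removes the shared sycophantic bias.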
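For the hook-based mitigation, the mechanical check itself can be a small script that the hook (or a CI step) runs after edits. The sketch below assumes a Node project type-checked with `tsc` and tested with `vitest`; the file name and commands are placeholders, and the exact hook wiring should follow the Claude Code hooks documentation.

```typescript
// check.ts — a validation script that a hook or CI step could invoke after edits.
// The compiler and test runner provide a non-sycophantic signal: they fail
// regardless of how confidently the model endorsed the code.
import { spawnSync } from "node:child_process";

function run(command: string, args: string[]): boolean {
  const result = spawnSync(command, args, { stdio: "inherit" });
  return result.status === 0;
}

const typesOk = run("npx", ["tsc", "--noEmit"]); // type-check without emitting output
const testsOk = run("npx", ["vitest", "run"]);   // assumed test runner; substitute your own

// A non-zero exit tells the caller the change must not be accepted,
// no matter what the review conversation concluded.
process.exit(typesOk && testsOk ? 0 : 1);
```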
Relationships to Other Structural Problems
- Hallucination: Sycophancy obstructs detection and amplifies Hallucination
- Context Rot: As context degrades, sycophancy increases
- Knowledge Boundary: Instead of acknowledging the limits of its knowledge, the model generates answers that conform to user expectations
- Instruction Decay: The instruction itself to "push back" fades over time
References
- Sharma, M., Tong, M., Korbak, T. et al. (2024). "Towards Understanding Sycophancy in Language Models." ICLR 2024. arXiv:2310.13548 — Systematic study of sycophancy by Anthropic
- ELEPHANT Benchmark (2025). "ELEPHANT: Measuring and Understanding Social Sycophancy in LLMs." arXiv:2505.13995 — Four-dimensional classification of sycophancy (validation, indirectness, framing, moral); evaluation across 11 models
- Fanous, Goldberg et al. (2025). "SycEval: Evaluating LLM Sycophancy." arXiv:2502.08177 — Quantitative measurement of compliance rates on mathematics and medical datasets
- Le Jeune, P. et al. (2025). "Phare: A Safety Probe for Large Language Models." Giskard AI. arXiv:2505.11365 — Demonstrates divergence between user preference scores (LM Arena ELO) and Hallucination resistance
Previous: Hallucination
Next: Knowledge Boundary
Discussion: #8 Sycophancy