SS
About Me
Frontier AI Paper BriefingsPokebowlClinical Trial EnrollerLittle Human Names
DisclaimersPrivacy PolicyTerms of Use
Privacy Policy·Terms of Use·Disclaimers

© 2026 Silvia Seceleanu

← Back to Explorer
Safety & Alignment·Anthropic·Sep 2025

56. Values in the Wild: Measuring Model Values and Perspectives

300,000+ queries testing value trade-offs across 16+ frontier models from four companies

Research Paper
Summary

Published at COLM 2025. Generated 300,000+ queries testing value trade-offs across 16+ frontier models from Anthropic, OpenAI, Google DeepMind, and xAI. Found distinct value prioritization patterns across model families — Claude models prioritized honesty and harm avoidance, GPT models prioritized helpfulness and user autonomy, Gemini models showed moderate balance. Revealed that RLHF training creates measurable 'value signatures' that differ systematically by lab, raising questions about whose values AI systems encode.

Key Concepts

Value signature: distinct value prioritization patterns created by RLHF

Anthropic generated 300,000+ synthetic queries probing value trade-offs: honesty vs. helpfulness, harm avoidance vs. user autonomy, fairness vs. efficiency. Claude models consistently prioritized honesty and harm avoidance even when this meant refusing requests. GPT models prioritized helpfulness and user autonomy, generating more refusals only when legal liability was high. Gemini models showed more balanced trade-offs. These patterns emerged not from explicit instructions but from the implicit values embedded in RLHF training data and reward model design. Each model family had a distinct "value signature."

Whose values? RLHF reflects the values of annotators, company leadership, and business incentives

RLHF training involves human raters ranking model responses. These raters hold particular values, selected by labs for cultural fit and philosophical alignment. Anthropic's raters skew toward emphasizing honesty and harm avoidance (reflecting the lab's founding focus on AI safety). OpenAI's raters skew toward helpfulness and user satisfaction (reflecting OpenAI's focus on user experience and commercial adoption). Gemini raters appear to optimize for balance and risk aversion (reflecting Google's institutional risk management). These value differences aren't bugs—they're core to how each lab approaches AI deployment—but they're rarely made explicit.

Cross-model value divergence raises governance questions about AI value representation

If Claude's value signature emphasizes honesty and harm avoidance, and GPT's emphasizes helpfulness and autonomy, which is correct? Both models are deployed at scale—Claude in some organizations, GPT in others. Different value signatures mean different organizations face fundamentally different AI systems with different behavioral defaults. This raises a governance question: should AI values be determined by individual labs, regulated by governments, or democratically deliberated? Currently, they're determined by frontier lab leadership and whoever those labs hire as raters.

Connections

56. Values in the Wi…Sep 202514. Claude’s CharacterApr 202452. The Claude Model…Jan 2026Influenced byInfluences
Influenced by
14. Claude’s Character
Apr 2024
Influences
52. The Claude Model Spec and Updated Constitution
Jan 2026