Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons.
ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts.
Classification
- Role
- measurement-piece
- Domain
- research
- Source type
- paper
- Harness types
validation-harness, ratification-harness
- Validation position
before-generation, post-deployment, continuous
- Validation mode
empirical, social, institutional, interpretive
- Prescription stance
- strongly-procedural
- Relation to argument
validation-is-constitutive, institutions-shape-capability, observability-matters
- Tags
measurement, construct-validity, evaluation, social-science, icml-2025, position-paper
Extended capability commentary
- Input legibility
- Task structure
- Reward richness
- Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability.
- Feedback latency
- Repairability
- Observability
- Offline evaluability
- Offline eval is only as good as the construct it ratifies.
- Institutional ratification
- The paper's four-level framework makes ratification a first-class object of inquiry.
Why it matters
This is the foundational pushback against treating any evaluation number as self-evidencing. If the measurement instrument doesn't validly pick out the construct (reasoning, helpfulness, safety, legal competence), a high score is not a capability claim.
Annotation
Sets up measurement and construct validity as prior to evaluation. A benchmark score is a claim about a construct, and the validity of that claim depends on whether the instrument actually measures the construct. The paper argues that most GenAI evaluation skips this step, producing a tangle of sloppy tests and apples-to-oranges comparisons.
The authors import a four-level framework from social-science measurement theory and apply it to GenAI. The argument is explicitly not that better metrics solve the problem — it is that capability claims depend on validity work that is social, interpretive, and institutional.
Placed against verifiable-reward framings (Royzen; Expanding RLVR), the tension is direct:
- Verifiable-reward: the reward is verifiable when the outcome is checkable.
- Measurement-validity: checkability of an outcome does not imply the outcome measures the construct you care about. The "verifiable" in verifiable reward is doing more work than it admits.
Both can be true at once. A narrow technical task (theorem proved, test suite passed) may have near-trivial validity. A broad capability claim (legal reasoning, medical judgment, general agentic competence) almost never does. The library preserves this disagreement structurally — entries can score high on reward_richness while scoring low on input_legibility and unknown on validity.
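The independence of these axes can be made concrete. A minimal sketch, assuming a hypothetical entry record: the field names follow the commentary above, but the score scale and the `Entry` type itself are illustrative assumptions, not the library's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical entry record: axis names follow the commentary above,
# but the 0-5 scale and field names are assumptions for illustration.
@dataclass
class Entry:
    title: str
    reward_richness: int                # 0-5: how rich the reward signal is
    input_legibility: int               # 0-5: how legible the inputs are
    construct_validity: Optional[int]   # None = unknown; validity is its own axis

# An entry can score high on reward richness while validity stays unknown:
entry = Entry("Standard Signal", reward_richness=5, input_legibility=1,
              construct_validity=None)
assert entry.reward_richness > entry.input_legibility
assert entry.construct_validity is None  # a rich signal is not a validity claim
```

Keeping validity as a separate, nullable field is what lets the library preserve the disagreement rather than collapse it into a single score.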
Related entries
- Measurement to Meaning (Salaudeen et al. 2025) — the validity-centered framework applied.
- Royzen: Standard Signal — poster case for reward richness.
Related entries
- Measurement to Meaning: A Validity-Centered Framework for AI Evaluation · Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo · 2025-05-14 · #measurement #construct-validity #evaluation · validation-is-constitutive, observability-matters, institutions-shape-capability · validation-harness, ratification-harness
- An open-source spec for Codex orchestration: Symphony · Alex Kotliarskyi, Victor Zhu, and Zach Brock · 2026-04-26 · validation-is-constitutive, observability-matters, institutions-shape-capability · validation-harness
- Deep Research Query: Work Registration and Collision Prevention · Daniel S. Griffin · 2026-05-05 · validation-is-constitutive, observability-matters, institutions-shape-capability · validation-harness
- Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · institutions-shape-capability, validation-is-constitutive · validation-harness, ratification-harness
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.
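The overlap rule above can be sketched in code. This is an assumption-laden illustration, not the library's actual implementation: the equal-weight Jaccard combination, the facet field names, and the example tag sets are all hypothetical; only the choice of facets (tags, relation-to-argument, harness types, with role and domain excluded) comes from the text.

```python
# Sketch of the described overlap: equal-weight Jaccard similarity over
# tags, relation-to-argument, and harness types. Weighting is an assumption;
# role and domain are deliberately excluded, per the note above.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap(e1: dict, e2: dict) -> float:
    facets = ("tags", "relations", "harnesses")  # no role, no domain
    return sum(jaccard(set(e1[f]), set(e2[f])) for f in facets) / len(facets)

this_paper = {
    "tags": {"measurement", "construct-validity", "evaluation"},
    "relations": {"validation-is-constitutive", "observability-matters",
                  "institutions-shape-capability"},
    "harnesses": {"validation-harness", "ratification-harness"},
}
royzen = {
    "tags": {"reward-richness"},  # hypothetical tag set for illustration
    "relations": {"institutions-shape-capability", "validation-is-constitutive"},
    "harnesses": {"validation-harness", "ratification-harness"},
}
score = overlap(this_paper, royzen)
assert 0.0 < score < 1.0  # a contrast entry still surfaces as a neighbour
```

Because the facets are scored independently, an entry with zero tag overlap can still rank highly on shared relations and harness types, which is exactly how contrasting entries end up as useful neighbours.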