This library entry is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons.

ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts.

Classification

Role
measurement-piece
Domain
research
Source type
paper
Harness types
validation-harness, ratification-harness
Validation position
before-generation, post-deployment, continuous
Validation mode
empirical, social, institutional, interpretive
Prescription stance
strongly-procedural
Relation to argument
validation-is-constitutive, institutions-shape-capability, observability-matters
Tags
measurement, construct-validity, evaluation, social-science, icml-2025, position-paper

Extended capability commentary

Input legibility
Task structure
Reward richness
Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability.
Feedback latency
Repairability
Observability
Offline evaluability
Offline eval is only as good as the construct it ratifies.
Institutional ratification
The paper's four-level framework makes ratification a first-class object of inquiry.

Why it matters

The foundational pushback against treating any evaluation number as self-evidencing. If the measurement instrument doesn't validly pick out the construct (reasoning, helpfulness, safety, legal competence), a high score is not a capability claim.

Annotation

Sets up measurement and construct validity as prior to evaluation. A benchmark score is a claim about a construct, and the validity of that claim depends on whether the instrument actually measures the construct. The paper argues that most GenAI evaluation skips this step, producing a tangle of sloppy tests and apples-to-oranges comparisons.

The authors import a four-level framework from social-science measurement theory and apply it to GenAI. The argument is explicitly not that better metrics solve the problem — it is that capability claims depend on validity work that is social, interpretive, and institutional.

Placed against verifiable-reward framings (Royzen; Expanding RLVR), the tension is direct:

  • Verifiable-reward: the reward is verifiable when the outcome is checkable.
  • Measurement-validity: checkability of an outcome does not imply the outcome measures the construct you care about. The "verifiable" in verifiable reward is doing more work than it admits.

Both can be true at once. A narrow technical task (theorem proved, test suite passed) may have near-trivial validity. A broad capability claim (legal reasoning, medical judgment, general agentic competence) almost never does. The library preserves this disagreement structurally — entries can score high on reward_richness while scoring low on input_legibility and unknown on validity.
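The contrast above can be made concrete with a toy sketch (not from the paper; the reward function, the broken sorter, and the test cases are all hypothetical illustrations). The reward is perfectly checkable, and it is a valid measure of the narrow task it checks — but a perfect score is not evidence for the broader construct ("sorts lists in general") that it is often taken to stand in for:

```python
# Hypothetical illustration of checkable-but-not-valid rewards.

def reward(candidate, tests):
    """A 'verifiable' reward: 1 if every test case is sorted correctly, else 0.

    The outcome is fully checkable -- there is no ambiguity about
    whether the reward fires.
    """
    return int(all(candidate(list(x)) == sorted(x) for x in tests))

def sorts_small_lists(xs):
    """A deliberately limited sorter: correct only up to length 2."""
    if len(xs) <= 2:
        return sorted(xs)
    return xs  # silently gives up on longer inputs

# The narrow check is passed with a perfect score...
narrow_tests = [[2, 1], [5], []]
assert reward(sorts_small_lists, narrow_tests) == 1

# ...but the broader construct is not measured by that score:
assert sorts_small_lists([3, 1, 2]) != sorted([3, 1, 2])
```

The reward here has near-trivial validity for "sorts these three inputs" and none for "sorts lists" — which is the paper's point restated in miniature: checkability of the outcome and validity for the construct are separate properties.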

Related entries
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.
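One plausible reading of the overlap rule above is set similarity over the three named fields. A minimal sketch, assuming Jaccard similarity averaged across fields (the actual weighting and field names are not stated; the entry dictionaries below are illustrative):

```python
# Hypothetical sketch of related-entry overlap: Jaccard per field,
# averaged over tags, relation-to-argument, and harness types.
# Role and domain are deliberately excluded, since contrasting
# entries are often the most useful neighbours.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as no signal (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def overlap(entry_a: dict, entry_b: dict) -> float:
    """Average Jaccard over the three fields used for neighbour ranking."""
    fields = ("tags", "relation_to_argument", "harness_types")
    scores = [jaccard(set(entry_a.get(f, ())), set(entry_b.get(f, ())))
              for f in fields]
    return sum(scores) / len(fields)

this_entry = {
    "tags": {"measurement", "construct-validity", "evaluation"},
    "relation_to_argument": {"validation-is-constitutive"},
    "harness_types": {"validation-harness", "ratification-harness"},
}
other_entry = {
    "tags": {"evaluation", "benchmarks"},
    "relation_to_argument": {"validation-is-constitutive"},
    "harness_types": {"validation-harness"},
}

# Per-field: tags 1/4, relation 1/1, harness 1/2 -> mean 0.583...
score = overlap(this_entry, other_entry)
```

Under this sketch, an entry with a different role and domain can still rank as a close neighbour if its tags and harness types align — which is exactly the design intent stated above.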