This library entry is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons.

ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts.

Classification

Role
measurement-piece
Domain
research
Source type
paper
Harness types
validation-harness, ratification-harness
Validation position
before-generation, post-deployment, continuous
Validation mode
empirical, social, institutional, interpretive
Prescription stance
strongly-procedural
Relation to argument
validation-is-constitutive, institutions-shape-capability, observability-matters
Tags
measurement, construct-validity, evaluation, social-science, icml-2025, position-paper

Extended capability commentary

Input legibility
Task structure
Reward richness
Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability.
Feedback latency
Repairability
Observability
Offline evaluability
Offline eval is only as good as the construct it ratifies.
Institutional ratification
The paper's four-level framework makes ratification a first-class object of inquiry.

Why it matters

The foundational pushback against treating any evaluation number as self-evidencing. If the measurement instrument doesn't validly pick out the construct (reasoning, helpfulness, safety, legal competence), a high score is not a capability claim.

Annotation

Sets up measurement and construct validity as prior to evaluation. A benchmark score is a claim about a construct, and the validity of that claim depends on whether the instrument actually measures the construct. The paper argues that most GenAI evaluation skips this step, producing a tangle of sloppy tests and apples-to-oranges comparisons.

The authors import a four-level framework from social-science measurement theory and apply it to GenAI. The argument is explicitly not that better metrics solve the problem — it is that capability claims depend on validity work that is social, interpretive, and institutional.

Placed against verifiable-reward framings (Royzen; Expanding RLVR), the tension is direct:

  • Verifiable-reward: the reward is verifiable when the outcome is checkable.
  • Measurement-validity: checkability of an outcome does not imply the outcome measures the construct you care about. The "verifiable" in verifiable reward is doing more work than it admits.

Both can be true at once. A narrow technical task (theorem proved, test suite passed) may have near-trivial validity. A broad capability claim (legal reasoning, medical judgment, general agentic competence) almost never does. The library preserves this disagreement structurally — entries can score high on reward_richness while scoring low on input_legibility and unknown on validity.
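The contrast above can be made concrete with a toy sketch (not from the paper; the reward function, the broken sorter, and the test cases are all hypothetical illustrations). The reward is perfectly checkable, and it is a valid measure of the narrow task it checks — but a perfect score is not evidence for the broader construct ("sorts lists in general") that it is often taken to stand in for:

```python
# Hypothetical illustration of checkable-but-not-valid rewards.

def reward(candidate, tests):
    """A 'verifiable' reward: 1 if every test case is sorted correctly, else 0.

    The outcome is fully checkable -- there is no ambiguity about
    whether the reward fires.
    """
    return int(all(candidate(list(x)) == sorted(x) for x in tests))

def sorts_small_lists(xs):
    """A deliberately limited sorter: correct only up to length 2."""
    if len(xs) <= 2:
        return sorted(xs)
    return xs  # silently gives up on longer inputs

# The narrow check is passed with a perfect score...
narrow_tests = [[2, 1], [5], []]
assert reward(sorts_small_lists, narrow_tests) == 1

# ...but the broader construct is not measured by that score:
assert sorts_small_lists([3, 1, 2]) != sorted([3, 1, 2])
```

The reward here has near-trivial validity for "sorts these three inputs" and none for "sorts lists" — which is the paper's point restated in miniature: checkability of the outcome and validity for the construct are separate properties.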

Related entries
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.
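One plausible reading of the overlap rule above is set similarity over the three named fields. A minimal sketch, assuming Jaccard similarity averaged across fields (the actual weighting and field names are not stated; the entry dictionaries below are illustrative):

```python
# Hypothetical sketch of related-entry overlap: Jaccard per field,
# averaged over tags, relation-to-argument, and harness types.
# Role and domain are deliberately excluded, since contrasting
# entries are often the most useful neighbours.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as no signal (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def overlap(entry_a: dict, entry_b: dict) -> float:
    """Average Jaccard over the three fields used for neighbour ranking."""
    fields = ("tags", "relation_to_argument", "harness_types")
    scores = [jaccard(set(entry_a.get(f, ())), set(entry_b.get(f, ())))
              for f in fields]
    return sum(scores) / len(fields)

this_entry = {
    "tags": {"measurement", "construct-validity", "evaluation"},
    "relation_to_argument": {"validation-is-constitutive"},
    "harness_types": {"validation-harness", "ratification-harness"},
}
other_entry = {
    "tags": {"evaluation", "benchmarks"},
    "relation_to_argument": {"validation-is-constitutive"},
    "harness_types": {"validation-harness"},
}

# Per-field: tags 1/4, relation 1/1, harness 1/2 -> mean 0.583...
score = overlap(this_entry, other_entry)
```

Under this sketch, an entry with a different role and domain can still rank as a close neighbour if its tags and harness types align — which is exactly the design intent stated above.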