Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
The paper provides a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence.
Proposes a validity-centered framework for AI evaluation that makes explicit which evaluative claims the evidence actually supports, with detailed vision and language case studies. The operational companion to the Wallach/Jacobs position paper.
Classification
- Role: measurement-piece
- Domain: research
- Source type: paper
- Harness types: validation-harness, ratification-harness
- Validation position: immediately-after-generation, post-deployment
- Validation mode: empirical, interpretive, social
- Prescription stance: strongly-procedural
- Relation to argument: validation-is-constitutive, observability-matters, institutions-shape-capability
- Tags: measurement, construct-validity, evaluation, validity-framework, claims-vs-evidence
Extended capability commentary
- Input legibility
- Task structure
- Reward richness
- Repairability
- Observability: in the measurement-theoretic sense, what does the evaluation actually let you see?
- Offline evaluability
- Institutional ratification
Why it matters
The applied-framework companion to the Wallach/Jacobs position paper. Where the position paper diagnoses the field, this paper hands you a tool: map the evidence you have to the claim you want to make, and refuse to make claims the evidence can't support.
Annotation
Where the Wallach/Jacobs position paper argues that generative-AI evaluation is a social-science measurement challenge, this paper supplies the operational framework. Two case studies (vision and language model evaluations) demonstrate how explicitly reasoning about validity strengthens or weakens the claims an evaluation can support.
The central move is refusing the shortcut from benchmark score to capability claim. A model that does well on a math benchmark may be good at that benchmark, not good at math. A model that does well on graduate-exam-style questions may be good at graduate-exam-style questions, not good at reasoning.
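To make the refusal concrete, here is a minimal sketch of a claims-vs-evidence check. The claim ladder and evidence categories below are illustrative assumptions of mine, not the paper's actual taxonomy; the point is only that a score licenses the weakest rung unless further validity evidence is supplied.

```python
# Hypothetical claim ladder: each rung presupposes the evidence of the
# rungs below it. These categories are illustrative, not from the paper.
CLAIM_LADDER = [
    "scores X on benchmark B",                       # licensed by a score alone
    "performs well on B-style items",                # needs held-out items from B's distribution
    "exhibits the construct B targets (e.g. math)",  # needs construct-validity evidence
    "will exhibit the construct in deployment",      # needs ecological-validity evidence
]

EVIDENCE_FOR_RUNG = {
    0: "benchmark-score",
    1: "held-out-items",
    2: "construct-validity",
    3: "ecological-validity",
}

def strongest_claim(evidence: set[str]) -> str:
    """Walk up the ladder only as far as the evidence reaches."""
    level = -1
    for rung in range(len(CLAIM_LADDER)):
        if EVIDENCE_FOR_RUNG[rung] in evidence:
            level = rung
        else:
            break  # a missing lower rung blocks all claims above it
    return CLAIM_LADDER[level] if level >= 0 else "no claim supported"

# A score alone licenses only the weakest claim:
print(strongest_claim({"benchmark-score"}))
# -> "scores X on benchmark B"

# Construct-validity evidence without held-out items still stops at rung 0,
# because each rung presupposes the ones below it:
print(strongest_claim({"benchmark-score", "construct-validity"}))
# -> "scores X on benchmark B"
```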
Useful against
- "Reward richness is the lever" framings — this paper asks which construct the reward even measures.
- "Thin harness, fat skills" — reminds that the skills you think you are pushing into the model are defined by the evaluations you check them with.
Useful for
- Anyone who wants to score a library entry on institutional_ratification or observability with conceptual grounding rather than intuition.
Related entries
- Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge · Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs · 2025-02-01 · #measurement #construct-validity #evaluation · validation-is-constitutive, institutions-shape-capability, observability-matters, validation-harness, ratification-harness
- An open-source spec for Codex orchestration: Symphony · Alex Kotliarskyi, Victor Zhu, and Zach Brock · 2026-04-26 · validation-is-constitutive, observability-matters, institutions-shape-capability, validation-harness
- Deep Research Query: Work Registration and Collision Prevention · Daniel S. Griffin · 2026-05-05 · validation-is-constitutive, observability-matters, institutions-shape-capability, validation-harness
- Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · institutions-shape-capability, validation-is-constitutive, validation-harness, ratification-harness
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.