Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods past the easy cases.
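A minimal sketch of the mechanism, assuming a rubric is a set of named, weighted criteria each scored independently by an LLM judge; every name and signature below is an illustrative assumption, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str      # the quality dimension / failure mode this criterion names
    weight: float  # relative importance in the aggregate reward

def judge_score(response: str, criterion: Criterion) -> float:
    """Stand-in for an LLM judge rating `response` against one criterion
    on [0, 1]; in practice this is a model call, not a verifier."""
    raise NotImplementedError  # hypothetical placeholder; the paper's judge prompt is not reproduced here

def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    """Collapse per-criterion judge scores into one scalar, usable wherever
    an RLVR-style verifier reward would go."""
    total = sum(c.weight for c in rubric)
    return sum(c.weight * judge_score(response, c) for c in rubric) / total
```

The scalar slots into an ordinary RL loop in place of a verifier; the per-criterion scores are what the repairability commentary below leans on.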
Classification
- Role
- measurement-piece
- Domain
- research
- Source type
- paper
- Harness types
- validation-harness, learning-harness
- Validation position
- immediately-after-generation
- Validation mode
- interpretive, social, empirical
- Prescription stance
- mixed
- Relation to argument
- reward-structure-matters, domain-structure-matters, validation-is-constitutive
- Tags
- rubrics-as-rewards, rar, non-verifiable-domains, reward-shaping, rlvr-adjacent
Extended capability commentary
- Input legibility
- Task structure
- Reward richness
- Rubric scores are richer than nothing, sparser than a verified pass/fail.
- Feedback latency
- Repairability
- Rubrics *name* failure modes; the feedback is diagnostic by construction, not merely verifiable (see the sketch after this list).
- Observability
- Offline evaluability
- Institutional ratification
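A sketch of how the two annotated dimensions come apart, referenced from the repairability note above. The `Feedback` shape and the 0.5 threshold are illustrative assumptions, not the library's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Feedback:
    reward: float  # what the learner optimises: the richness axis
    failed: list[str] = field(default_factory=list)  # named failure modes: the repairability axis

def verified_feedback(passed: bool) -> Feedback:
    # A verified outcome: one clean, trusted bit, but nothing that says *why* it failed.
    return Feedback(reward=1.0 if passed else 0.0)

def rubric_feedback(scores: dict[str, float], threshold: float = 0.5) -> Feedback:
    # A rubric: a noisier scalar, but every low-scoring dimension is named,
    # so a reviser knows which part of the response to repair.
    reward = sum(scores.values()) / len(scores)
    failed = [name for name, s in scores.items() if s < threshold]
    return Feedback(reward=reward, failed=failed)

# rubric_feedback({"cites-sources": 0.9, "calibrated-uncertainty": 0.2})
#   -> reward ≈ 0.55, failed=["calibrated-uncertainty"]
```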
Why it matters
The library's schema intentionally separates reward richness from repairability and input legibility. This entry is a technical illustration: a rubric can score lower than a verified outcome on richness while scoring *higher* on repairability — because the rubric names which dimension failed.
Annotation
An important entry for preserving the library's most subtle disagreement: verifiable outcome ≠ diagnostic feedback. Rubrics as Rewards operationalises that distinction by trading some of the "cleanness" of a verified outcome (one bit: passed / failed) for the structured diagnostics of a rubric (multi-dimensional, failure-mode-named, repairable).
Read alongside
- Expanding RLVR Across Diverse Domains — the verifiable-outcome pole.
- Royzen: Standard Signal — the domain where the outcome is unusually verifiable.
- Wallach/Jacobs et al. — the measurement critique that applies to both rubrics and verifiable rewards.
Related entries
- Expanding RL with Verifiable Rewards Across Diverse Domains · Ma et al. · 2025-03-30 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, learning-harness, validation-harness
- Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness, learning-harness
- OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data · Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · 2025-10-22 · validation-is-constitutive, reward-structure-matters, domain-structure-matters, validation-harness
- LLM Knowledge Bases · Andrej Karpathy · 2026-04-01 · validation-is-constitutive, domain-structure-matters, validation-harness, learning-harness
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.