Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods past the easy cases.
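A minimal sketch of the mechanism, assuming a rubric is a set of named, weighted criteria each scored independently by an LLM judge; every name and signature below is an illustrative assumption, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str      # the quality dimension / failure mode this criterion names
    weight: float  # relative importance in the aggregate reward

def judge_score(response: str, criterion: Criterion) -> float:
    """Stand-in for an LLM judge rating `response` against one criterion
    on [0, 1]; in practice this is a model call, not a verifier."""
    raise NotImplementedError  # hypothetical placeholder; the paper's judge prompt is not reproduced here

def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    """Collapse per-criterion judge scores into one scalar, usable wherever
    an RLVR-style verifier reward would go."""
    total = sum(c.weight for c in rubric)
    return sum(c.weight * judge_score(response, c) for c in rubric) / total
```

The scalar slots into an ordinary RL loop in place of a verifier; the per-criterion scores are what the repairability commentary below leans on.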
Classification
- Role
- measurement-piece
- Domain
- research
- Source type
- paper
- Harness types
- validation-harness, learning-harness
- Validation position
- immediately-after-generation
- Validation mode
- interpretive, social, empirical
- Prescription stance
- mixed
- Relation to argument
- reward-structure-matters, domain-structure-matters, validation-is-constitutive
- Tags
- rubrics-as-rewards, rar, non-verifiable-domains, reward-shaping, rlvr-adjacent
Extended capability commentary
- Input legibility
- Task structure
- Reward richness
- Rubric scores are richer than nothing, sparser than a verified pass/fail.
- Feedback latency
- Repairability
- Rubrics *name* failure modes; the feedback is diagnostic by construction, not merely verifiable (see the sketch after this list).
- Observability
- Offline evaluability
- Institutional ratification
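A sketch of how the two annotated dimensions come apart, referenced from the repairability note above. The `Feedback` shape and the 0.5 threshold are illustrative assumptions, not the library's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Feedback:
    reward: float  # what the learner optimises: the richness axis
    failed: list[str] = field(default_factory=list)  # named failure modes: the repairability axis

def verified_feedback(passed: bool) -> Feedback:
    # A verified outcome: one clean, trusted bit, but nothing that says *why* it failed.
    return Feedback(reward=1.0 if passed else 0.0)

def rubric_feedback(scores: dict[str, float], threshold: float = 0.5) -> Feedback:
    # A rubric: a noisier scalar, but every low-scoring dimension is named,
    # so a reviser knows which part of the response to repair.
    reward = sum(scores.values()) / len(scores)
    failed = [name for name, s in scores.items() if s < threshold]
    return Feedback(reward=reward, failed=failed)

# rubric_feedback({"cites-sources": 0.9, "calibrated-uncertainty": 0.2})
#   -> reward ≈ 0.55, failed=["calibrated-uncertainty"]
```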
Why it matters
The library's schema intentionally separates reward richness from repairability and input legibility. This entry is a technical illustration: a rubric can score lower than a verified outcome on richness while scoring *higher* on repairability — because the rubric names which dimension failed.
Annotation
An important entry for preserving the library's most subtle disagreement: verifiable outcome ≠ diagnostic feedback. Rubrics as Rewards operationalises that distinction by trading some of the "cleanness" of a verified outcome (one bit: passed / failed) for the structured diagnostics of a rubric (multi-dimensional, failure-mode-named, repairable).
Read alongside
- Expanding RLVR Across Diverse Domains — the verifiable-outcome pole.
- Royzen: Standard Signal — the domain where the outcome is unusually verifiable.
- Wallach/Jacobs et al. — the measurement critique that applies to both rubrics and verifiable rewards.
Related entries
- Expanding RL with Verifiable Rewards Across Diverse Domains · Ma et al. · 2025-03-30 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, learning-harness, validation-harness
- Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness, learning-harness
- OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data · Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · 2025-10-22 · validation-is-constitutive, reward-structure-matters, domain-structure-matters, validation-harness
- LLM Knowledge Bases · Andrej Karpathy · 2026-04-01 · validation-is-constitutive, domain-structure-matters, validation-harness, learning-harness
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.