This library entry is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Unknown (OpenReview: 21UFlJrmS2) · paper · source
Metadata unverified. OpenReview URL and title verified. Authors and exact date not confirmed; confirm from the forum page before citing.

Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods (reinforcement learning with verifiable rewards) past the easy cases.

Classification

Role
measurement-piece
Domain
research
Source type
paper
Harness types
validation-harness · learning-harness
Validation position
immediately-after-generation
Validation mode
interpretive · social · empirical
Prescription stance
mixed
Relation to argument
reward-structure-matters · domain-structure-matters · validation-is-constitutive
Tags
rubrics-as-rewards · rar · non-verifiable-domains · reward-shaping · rlvr-adjacent

Extended capability commentary

Input legibility
Task structure
Reward richness
Rubric scores are richer than no signal at all, but sparser than a verified pass/fail.
Feedback latency
Repairability
Rubrics *name* failure modes: the feedback is diagnostic by construction, not merely verifiable.
Observability
Offline evaluability
Institutional ratification

Why it matters

The library's schema intentionally separates reward richness from repairability and input legibility. This entry is a technical illustration: a rubric can score lower than a verified outcome on richness while scoring *higher* on repairability — because the rubric names which dimension failed.
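The richness/repairability split can be made concrete with a toy sketch. This is a hypothetical illustration under assumed names (`Criterion`, `rubric_reward`), not the paper's actual interface: a verified outcome yields one bit, while a rubric yields a graded score plus the names of the criteria that failed.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface, not the paper's implementation: a rubric is a
# set of named, weighted criteria, each with a check over the response.

@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # True if the response satisfies the criterion

def binary_reward(passed: bool) -> float:
    # RLVR-style verified outcome: a single bit, with no diagnosis attached.
    return 1.0 if passed else 0.0

def rubric_reward(response: str, rubric: list[Criterion]) -> tuple[float, list[str]]:
    # Rubric-style reward: a score in [0, 1] plus the names of failed criteria.
    total = sum(c.weight for c in rubric)
    score = sum(c.weight for c in rubric if c.check(response))
    failed = [c.name for c in rubric if not c.check(response)]
    return score / total, failed

# Toy rubric for a short explanation task (criteria invented for illustration).
rubric = [
    Criterion("cites-a-source", 1.0, lambda r: "http" in r),
    Criterion("states-limitation", 1.0, lambda r: "however" in r.lower()),
    Criterion("non-empty", 0.5, lambda r: len(r.strip()) > 0),
]

score, failed = rubric_reward("See http://example.com for details.", rubric)
# `score` is graded rather than binary, and `failed` names exactly which
# dimension to repair — the repairability the entry's commentary points at.
```

The point of the sketch is the return type, not the scoring rule: `failed` carries the diagnostic information that a verified pass/fail bit discards, even when the scalar score itself is noisier than a verified outcome.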

Annotation

An important entry for preserving the library's most subtle disagreement: verifiable outcome ≠ diagnostic feedback. Rubrics as Rewards operationalises that distinction by trading some of the "cleanness" of a verified outcome (one bit: passed / failed) for the structured richness of a rubric (multi-dimensional, failure-mode-named, repairable).

Read alongside

Related entries

Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.