This library entry is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

Expanding RL with Verifiable Rewards Across Diverse Domains

Ma et al. · paper · source
Metadata unverified. Title and URL verified. First author and full author list not confirmed; the arXiv date is a best estimate from the arXiv ID (2503 = March 2025). Confirm before citing.

arXiv paper investigating how reinforcement learning with verifiable rewards (RLVR) generalises beyond the easy cases (math, code) to more diverse domains. This is the technical paper in whose conceptual shadow Royzen's domain-claim entry sits.
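
To keep "verifiable reward" concrete, here is a minimal sketch of the mechanism in the easy cases: a programmatic check of the model's output against a reference. The function name and the '####' answer delimiter are illustrative assumptions, not the paper's implementation; the paper's question is which domains admit any such check at all.

    def verifiable_reward(response: str, ground_truth: str) -> float:
        # Binary outcome reward: 1.0 if the extracted final answer
        # matches the reference exactly, else 0.0. Math and code admit
        # this kind of mechanical check; "diverse domains" often do not.
        answer = response.split("####")[-1].strip()  # delimiter is an assumed convention
        return 1.0 if answer == ground_truth.strip() else 0.0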

Classification

Role
domain-claim
Domain
research
Source type
paper
Harness types
learning-harness · validation-harness
Validation position
immediately-after-generation · post-deployment
Validation mode
mechanical · empirical
Prescription stance
strongly-procedural
Relation to argument
reward-structure-matters · domain-structure-matters · validation-is-constitutive
Tags
rlvr · verifiable-rewards · reinforcement-learning · domain-generalisation

Extended capability commentary

Input legibility
Task structure
Reward richness
RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it.
Feedback latency
Repairability
A mark against reward-richness: a verifiable outcome signal can still be silent on the error mechanism (see the sketch after this list).
Observability
Offline evaluability
Institutional ratification
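
The repairability caveat above can be made concrete with a hedged sketch, here a unit-test outcome reward for code (names and test setup are illustrative, not from the paper). The scalar is mechanically verifiable and cheap, yet carries no information about where the failure lives.

    def outcome_reward(candidate, test_cases):
        # Mechanical outcome reward: fraction of test cases passed.
        # Dense and exact in the reward-richness sense, but silent on
        # the error mechanism: it verifies the outcome, not the cause.
        passed = sum(1 for x, expected in test_cases if candidate(x) == expected)
        return passed / len(test_cases)

    # A buggy absolute-value candidate scores 2/3, but the reward alone
    # does not localise the bug to the negative branch.
    buggy_abs = lambda x: x  # wrong for negative inputs
    print(outcome_reward(buggy_abs, [(3, 3), (0, 0), (-2, 2)]))  # ~0.667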

Why it matters

Grounds the 'verifiable-reward domain' framing in the ML research literature. Useful for readers who want the technical story behind practitioner claims that finance, code, and math are uniquely favourable domains for verifiable rewards.

Annotation

Technical complement to the practitioner entries on verifiable-reward domains. The paper asks the question the library should keep asking: which diverse domains does RLVR actually generalise to, and what breaks when it doesn't?

Rhetorically, this entry is included to prevent the library from collapsing "verifiable reward" into a slogan. There is a research program behind it with real empirical findings — both supporting and complicating the practitioner framings.

Read alongside

Related entries

Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.
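
One plausible reading of that computation, as a hedged sketch only (Jaccard similarity over the pooled field values; the library's actual metric and weighting are not specified in this entry):

    def related_entry_overlap(entry_a: dict, entry_b: dict) -> float:
        # Pool the three fields named above, ignoring role and domain,
        # then take the Jaccard similarity of the pooled value sets.
        fields = ("tags", "relation_to_argument", "harness_types")
        a = set().union(*(set(entry_a.get(f, ())) for f in fields))
        b = set().union(*(set(entry_b.get(f, ())) for f in fields))
        return len(a & b) / len(a | b) if (a | b) else 0.0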