Expanding RL with Verifiable Rewards Across Diverse Domains
arXiv paper investigating how reinforcement learning with verifiable rewards (RLVR) generalises beyond the easy cases (math, code) to more diverse domains. The technical paper in whose conceptual shadow Royzen's domain-claim entry sits.
Classification
- Role
- domain-claim
- Domain
- research
- Source type
- paper
- Harness types
- learning-harness, validation-harness
- Validation position
- immediately-after-generation, post-deployment
- Validation mode
- mechanical, empirical
- Prescription stance
- strongly-procedural
- Relation to argument
- reward-structure-matters, domain-structure-matters, validation-is-constitutive
- Tags
- rlvr, verifiable-rewards, reinforcement-learning, domain-generalisation
Extended capability commentary
- Input legibility
- Task structure
- Reward richness
- RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it (see the sketch after this list).
- Feedback latency
- Repairability
- Mark against reward-richness: a verifiable outcome signal can still be silent on the error mechanism.
- Observability
- Offline evaluability
- Institutional ratification
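To make the two marked dimensions concrete, here is a minimal sketch of what "verifiable reward" means operationally. The function name and exact-match rule are illustrative assumptions, not drawn from the paper; the point is that the signal is mechanical and binary, which is exactly why it is rich as a training reward and silent as a diagnosis.

```python
# Illustrative sketch, not the paper's implementation: a verifiable reward
# is a mechanical check that maps a model output to a scalar without a
# human or learned judge in the loop.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    High reward richness: every rollout gets a ground-truth-anchored signal.
    Low repairability: a 0.0 says nothing about *where* the reasoning failed,
    which is the mark recorded against reward-richness above.
    """
    return float(model_answer.strip().lower() == reference_answer.strip().lower())


if __name__ == "__main__":
    print(verifiable_reward("42", " 42 "))  # 1.0 -- mechanically checkable
    print(verifiable_reward("41", "42"))    # 0.0 -- silent on the error mechanism
```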
Why it matters
Grounds the 'verifiable-reward domain' framing in the ML research literature. Useful for readers who want the technical story behind practitioner claims that finance, code, and math are uniquely favourable.
Annotation
Technical complement to the practitioner entries on verifiable-reward domains. The paper asks the question the library should keep asking: which diverse domains does RLVR actually generalise to, and what breaks when it doesn't?
Rhetorically, this entry is included to prevent the library from collapsing "verifiable reward" into a slogan. There is a research program behind it with real empirical findings — both supporting and complicating the practitioner framings.
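One way to see "what breaks", sketched here as our own illustration rather than anything from the paper's experiments: a crisp-outcome check misfires once answers are free-form, as they are in most diverse domains, and that gap is what rubric- and judge-based rewards (see RaR below) try to close. The medical example is hypothetical.

```python
# Illustration only: the crisp check above misfires on free-form answers.

def crisp_reward(answer: str, reference: str) -> float:
    return float(answer.strip().lower() == reference.strip().lower())

# Crisp-outcome domain (math): the check is meaningful.
assert crisp_reward("42", " 42 ") == 1.0

# Free-form domain (hypothetical medical QA): a correct paraphrase scores 0.0,
# so RL against this reward punishes semantically right answers.
assert crisp_reward(
    "Amoxicillin is the usual first-line treatment.",
    "First-line treatment is amoxicillin.",
) == 0.0
```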
Read alongside
- Royzen: Standard Signal — the finance-domain-favourability claim.
- Rubrics as Rewards (RaR) — extending the framing past crisp-outcome domains.
- Wallach/Jacobs et al. — the measurement-validity pushback.
Related entries
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains · Unknown (OpenReview: 21UFlJrmS2) · 2025-08-31 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness, learning-harness
- Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness, learning-harness
- OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data · Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · 2025-10-22 · validation-is-constitutive, reward-structure-matters, domain-structure-matters, validation-harness
- LLM Knowledge Bases · Andrej Karpathy · 2026-04-01 · validation-is-constitutive, domain-structure-matters, validation-harness, learning-harness
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.