Compare entries

Side-by-side view of every entry across prescription stance and five load-bearing capability commentary fields.

The library is designed to preserve disagreement. Look for entries that talk about reward richness but stay thin on repairability, or foreground institutional ratification while saying little about feedback and learning. Those contrasts are what the schema exists to keep visible.

Fields shown for each entry: Stance · Input legibility · Reward richness · Repairability · Observability · Offline evaluability
Deep Research Query: Work Registration and Collision Prevention
Daniel S. Griffin · 2026-05-05 · framework-piece
Stance: strongly-procedural
Input legibility: The query turns a local operational pain into a concrete research object with constraints, existing tools, and desired decision outputs.
Reward richness: The target protocol can observe collisions and stale claims, but success is partly social adoption rather than a single crisp reward.
Repairability: A claim registry would make collisions diagnosable and recoverable, but only if the protocol records enough state to distinguish active, blocked, stale, and abandoned work.
Observability: The whole prompt is about making otherwise invisible concurrent agent work visible before new work begins.
Offline evaluability: The proposed solution space is local-first: SQLite, filesystem manifests, git worktrees, and reproducible preflight checks.
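The repairability and offline-evaluability notes above name concrete ingredients: a SQLite-backed claim registry, the states active/blocked/stale/abandoned, and a preflight check run before new work begins. A minimal sketch of that shape, assuming a single-table registry; the `claims` schema, column names, and the one-hour staleness threshold are illustrative assumptions, not the query's actual protocol.

```python
import sqlite3
import time

# Hypothetical single-table claim registry; table and column names are
# illustrative. States follow the entry's commentary: active, blocked,
# stale, abandoned.
SCHEMA = """
CREATE TABLE IF NOT EXISTS claims (
    path     TEXT PRIMARY KEY,   -- file or work area being claimed
    agent    TEXT NOT NULL,      -- who claimed it
    state    TEXT NOT NULL,      -- active | blocked | stale | abandoned
    claimed  REAL NOT NULL       -- unix timestamp of the claim
);
"""

STALE_AFTER = 60 * 60  # assumed staleness threshold: one hour

def preflight(db: sqlite3.Connection, path: str, agent: str) -> str:
    """Check for collisions before starting work; return a verdict string."""
    row = db.execute(
        "SELECT agent, state, claimed FROM claims WHERE path = ?", (path,)
    ).fetchone()
    if row is None or row[1] == "abandoned":
        db.execute(
            "INSERT OR REPLACE INTO claims VALUES (?, ?, 'active', ?)",
            (path, agent, time.time()),
        )
        db.commit()
        return "claimed"
    holder, state, claimed = row
    if state == "active" and time.time() - claimed > STALE_AFTER:
        return f"stale claim held by {holder}; confirm before overriding"
    return f"collision: {holder} holds this path ({state})"

db = sqlite3.connect("claims.db")
db.executescript(SCHEMA)
print(preflight(db, "src/worker.py", "agent-a"))
```

The point of the sketch is the preflight step: the check is reproducible and local-first, but whether agents actually run it is the social-adoption gap the reward-richness note flags.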
Hermes Agent README
Nous Research · 2026-04-28 · case-study
Stance: strongly-procedural
Input legibility: Slash commands, personalities, skills, memory, and cross-session search make user intent and prior context available to the model.
Reward richness: Hermes emphasizes learning from experience and skill improvement, but the README does not define a single reward signal.
Repairability: The system can improve skills during use, create skills from experience, search past sessions, and persist knowledge.
Observability: Terminal UI, command history, streaming tool output, diagnostics, and session search make agent behavior inspectable.
Offline evaluability: Research tooling and batch trajectory generation suggest evaluability, but this is not the main README argument.
An open-source spec for Codex orchestration: Symphony
Alex Kotliarskyi, Victor Zhu, and Zach Brock · 2026-04-26 · framework-piece
Stance: strongly-procedural
Input legibility: Issues, WORKFLOW.md, project state, and review packets turn ambiguous work into agent-readable objectives.
Reward richness: CI, reviews, issue state transitions, PR landing, videos, and human review all become feedback signals.
Repairability: The system rebases, resolves conflicts, retries flaky checks, restarts stalled agents, and feeds failures back into guardrails and skills.
Observability: Symphony foregrounds logs, status surfaces, review packets, videos, Linear state, and operator visibility.
Offline evaluability: Software tasks have tests, CI, smoke tests, Chrome DevTools checks, and reproducible workspaces.
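The repairability note mentions retrying flaky checks before escalating a failure. A generic sketch of that behavior, not Symphony's implementation; the command, attempt count, and backoff are assumptions.

```python
import subprocess
import time

def run_check_with_retries(cmd: list[str], attempts: int = 3, backoff: float = 10.0) -> bool:
    """Re-run a possibly flaky check a few times before reporting failure."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed:\n{result.stdout}\n{result.stderr}")
        if attempt < attempts:
            time.sleep(backoff * attempt)  # simple linear backoff between retries
    # Persistent failure: in the entry's framing this feeds back into
    # guardrails and skills rather than more retries.
    return False

if __name__ == "__main__":
    ok = run_check_with_retries(["pytest", "-q"])
    print("check passed" if ok else "check failed after retries")
```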
What Is an Agent Harness
Aparna Dhinakaran · 2026-04-21 · framework-piece
Stance: strongly-procedural
Input legibility: Project instruction files, context injection, skills, and tool discovery make the task environment legible to the model before and during work.
Reward richness: The source emphasizes act-observe-adjust feedback, but not explicit reward-model training or scalar reward design.
Repairability: Repair is central to the definition: the model can observe consequences and continue until the task is actually solved.
Observability: Hooks, session logs, context compression, and tool results make harness behavior inspectable, though the post is more architectural than telemetry-specific.
Offline evaluability: Coding agents inherit strong offline checks through tests, shell commands, diffs, and build outputs.
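The act-observe-adjust loop the reward-richness and repairability notes describe has a simple schematic form. A sketch under assumptions: the `model` object and its `next_command` method are hypothetical stand-ins, not the post's own API.

```python
import subprocess

def harness_loop(model, max_steps: int = 20) -> None:
    """Schematic act-observe-adjust loop for a coding agent harness."""
    observation = "start"
    for _ in range(max_steps):
        command = model.next_command(observation)   # act
        if command is None:                         # model decides the task is solved
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True
        )
        # observe: exit code, stdout, stderr (tests, diffs, build output)
        observation = f"exit={result.returncode}\n{result.stdout}\n{result.stderr}"
        # adjust happens inside the model on the next turn, given this observation
```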
From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agentic AI-Mediated Software Engineering
Dana Feng, Bhada Yun, April Yi Wang · 2026-04-13 · field-observation
Stance: mixed
Input legibility: Seniors shape inputs well — scoping, constraining, arriving with a plan. Juniors struggle because they do not know what to ask. The familiar/unfamiliar split (Figure 1) is essentially about whether the human can form a good first-mile input.
Reward richness: —
Repairability: What happens when the agent goes wrong. Seniors iterate and refine; juniors spiral. J6: "it started spiraling ... I just stopped it. The fix was a three-line change."
Observability: Prompt history review is a major theme. S7: "accept, accept, accept is a very different thing versus generating all that content and then not actually reading it." Seniors read diffs, reject bloat, cross-check with other models.
Offline evaluability: —
Thin Harness, Fat Skills
Garry Tan · 2026-04-10 · practitioner-note
Stance: anti-prescriptive
Input legibility: Assumes inputs are legible enough that heavy shaping is unnecessary — a domain-specific bet.
Reward richness: Does not foreground reward signal as the key lever.
Repairability: Thin-harness framing tends to under-specify where repair loops live.
Observability: Commentary present
Offline evaluability: —
LLM Knowledge Bases
Andrej Karpathy · 2026-04-01 · practitioner-note
Stance: mixed
Input legibility: The raw/ to wiki compilation process is explicitly about making heterogeneous documents legible to future LLM turns.
Reward richness: The workflow has useful signals from links, consistency, and answer quality, but not an explicit reward signal.
Repairability: Health checks, missing-data imputation, and filing outputs back into the wiki make the knowledge base incrementally repairable.
Observability: The wiki is human-readable markdown and images viewed in Obsidian, so the agent's knowledge substrate stays inspectable.
Offline evaluability: Some checks can be run offline over the wiki, but factual gaps still require web search or source refresh.
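One health check that can run offline over a markdown wiki is link integrity. A minimal sketch, assuming an Obsidian-style vault with [[wikilink]] syntax and one note per .md file; the vault path and the link convention are assumptions, not Karpathy's tooling.

```python
import re
from pathlib import Path

# Capture the target of [[wikilinks]], ignoring aliases (|) and headings (#).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(vault: Path) -> list[tuple[Path, str]]:
    """Return (page, missing target) pairs for wikilinks with no matching note."""
    notes = {p.stem for p in vault.rglob("*.md")}
    missing = []
    for page in vault.rglob("*.md"):
        for target in WIKILINK.findall(page.read_text(encoding="utf-8")):
            if target.strip() not in notes:
                missing.append((page, target.strip()))
    return missing

for page, target in broken_links(Path("wiki")):
    print(f"{page}: missing note [[{target}]]")
```

This is the shape of check that stays fully offline; the factual-gap checks the offline-evaluability note mentions still need web search or source refresh.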
Standard Signal: AI-native hedge fund announcement
Michael Royzen · 2026-02-28 · domain-claim
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: P&L is an unusually clean, cardinal, self-consistent reward. This is the library's own framing; Royzen does not use the phrase 'verifiable reward.'
Repairability: Critical tension: trading P&L tells you that a model is wrong but not *where* or *why*. Verifiable outcome ≠ diagnostic feedback.
Observability: Commentary present
Offline evaluability: Backtesting is real but regime-shift biased.
Skill Issue: Harness Engineering for Coding Agents
HumanLayer · 2026-02-28 · case-study
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Commentary present
Repairability: Back-pressure mechanisms are repair harness by another name.
Observability: Commentary present
Offline evaluability: Commentary present
OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · 2025-10-22 · measurement-piece
Stance: strongly-procedural
Input legibility: Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information.
Reward richness: Ground-truth distributions from observational data support accuracy and calibration metrics.
Repairability: The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths.
Observability: Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty.
Offline evaluability: The benchmark is explicitly offline and reproducible against dataset-derived ground truth.
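The observability note lists calibration error over elicited quantiles. A generic sketch of what such a metric can look like, not OpenEstimate's own scoring code; the quantile levels, the simplification to scalar ground-truth values, and the example numbers are invented.

```python
import numpy as np

def calibration_error(pred_quantiles: np.ndarray, truths: np.ndarray,
                      levels: tuple[float, ...] = (0.1, 0.25, 0.5, 0.75, 0.9)) -> float:
    """Mean absolute gap between nominal and empirical quantile coverage.

    pred_quantiles: shape (n_questions, n_levels), model-reported quantiles
    truths:         shape (n_questions,), dataset-derived ground-truth values
    """
    gaps = []
    for j, level in enumerate(levels):
        empirical = float(np.mean(truths <= pred_quantiles[:, j]))
        gaps.append(abs(empirical - level))
    return float(np.mean(gaps))

# Example with made-up numbers: three questions, five quantile levels each.
preds = np.array([[1, 2, 3, 4, 5], [0, 1, 2, 3, 4], [2, 3, 4, 5, 6]], dtype=float)
truth = np.array([3.5, 1.0, 6.5])
print(round(calibration_error(preds, truth), 3))
```

A systematic gap between nominal and empirical coverage is the overconfidence signal the repairability note says the benchmark surfaces without itself explaining.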
Resurrecting deceased darlings: The Missing Foreword to AI and the Art of Being Human
Andrew Maynard · 2025-10-18 · case-study
Stance: mixed
Input legibility: The authors built a library of resources and deep prompts over months before drafting.
Reward richness: The feedback signal is editorial and human, not mechanical or scalar.
Repairability: The post emphasizes manual refinement, removal of hallucinations, reduction of AI tells, and killing beloved text for reader flow.
Observability: The foreword makes the collaboration visible, including worries, Claude's failures, and the retained AI tell.
Offline evaluability: Quality is judged through reading, editing, credibility, and reader engagement rather than offline tests.
Equipping agents for the real world with Agent Skills
Anthropic · 2025-10-15 · framework-piece
Stance: mixed
Input legibility: Commentary present
Reward richness: Commentary present
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Commentary present
Claude Skills are awesome, maybe a bigger deal than MCP
Simon Willison · 2025-10-15 · synthesis-node
Stance: mixed
Input legibility: Progressive disclosure — scan metadata, load full skill on demand — is a legibility pattern.
Reward richness: Commentary present
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Commentary present
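The progressive-disclosure pattern in the input-legibility note has a small concrete shape: scan lightweight metadata for every skill up front, and load a full skill body only when it becomes relevant. A minimal sketch assuming a directory of skills each containing a SKILL.md with simple `name:` and `description:` front-matter lines; the naive parsing and directory layout are illustrative, not Anthropic's or Willison's code.

```python
from pathlib import Path

def scan_skill_metadata(skills_dir: Path) -> dict[str, str]:
    """Return {skill name: one-line description} without loading full bodies."""
    catalog = {}
    for skill_md in skills_dir.glob("*/SKILL.md"):
        meta = {}
        # Deliberately naive front-matter scan of the first few lines.
        for line in skill_md.read_text(encoding="utf-8").splitlines()[:10]:
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip().lower()] = value.strip()
        catalog[meta.get("name", skill_md.parent.name)] = meta.get("description", "")
    return catalog

def load_skill(skills_dir: Path, name: str) -> str:
    """Load the full skill body on demand, once it is judged relevant."""
    return (skills_dir / name / "SKILL.md").read_text(encoding="utf-8")
```

The legibility win is that the model sees every skill's name and description cheaply, and pays the context cost of a full body only for the skills it actually uses.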
Good and Bad Harness Engineering
Daniel Miessler · 2025-08-31 · framework-piece
Stance: mixed
Input legibility: Treats input formation as part of the engineered system, not preprocessing.
Reward richness: Commentary present
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Commentary present
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Unknown (OpenReview: 21UFlJrmS2) · 2025-08-31 · measurement-piece
Stance: mixed
Input legibility: Commentary present
Reward richness: Rubric scores are richer than nothing, sparser than a verified pass/fail.
Repairability: Rubrics *name* failure modes — that is diagnostic by construction, not just verifiable.
Observability: Commentary present
Offline evaluability: Commentary present
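The repairability note hinges on the fact that a rubric scores named criteria separately, so a low reward also says which criterion failed. A schematic sketch of that structure, not the paper's implementation; the criteria, weights, and judge function are placeholders.

```python
from typing import Callable

# Placeholder rubric: (criterion name, weight) pairs.
RUBRIC = [
    ("answers the question asked", 0.4),
    ("cites evidence for key claims", 0.3),
    ("no fabricated sources", 0.3),
]

def rubric_reward(response: str, judge: Callable[[str, str], bool]) -> tuple[float, dict[str, bool]]:
    """Return (scalar reward, per-criterion verdicts); the verdicts are the diagnostic part."""
    verdicts = {name: judge(response, name) for name, _ in RUBRIC}
    reward = sum(weight for name, weight in RUBRIC if verdicts[name])
    return reward, verdicts

# Usage: the scalar goes to the RL update; the verdicts name the failure mode.
reward, verdicts = rubric_reward("...", judge=lambda resp, criterion: True)
```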
Building an AI-ready public workforce
OECD · 2025-06-30 · governance-piece
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Public-sector outcomes rarely collapse to a cardinal reward.
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: —
Bitter Lesson Engineering
Daniel Miessler · 2025-05-31 · framework-piece
Stance: anti-prescriptive
Input legibility: Being specific about intent *is* input legibility. The whole prescription.
Reward richness: Commentary present
Repairability: Anti-prescriptive stances tend to underweight the value of diagnostic repair loops.
Observability: Commentary present
Offline evaluability: —
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo · 2025-05-14 · measurement-piece
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Commentary present
Repairability: Commentary present
Observability: Observability in the measurement-theoretic sense: what does the evaluation actually let you see?
Offline evaluability: Commentary present
Expanding RL with Verifiable Rewards Across Diverse Domains
Ma et al. · 2025-03-30 · domain-claim
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it.
Repairability: Mark against reward-richness: a verifiable outcome signal can still be silent on the error mechanism.
Observability: Commentary present
Offline evaluability: Commentary present
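The repairability mark is easiest to see in the simplest form a verifiable reward takes: a programmatic check of the final answer. A sketch assuming a math-style task with a numeric reference answer; the extraction pattern and task format are illustrative, not from the paper.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference, else 0.0."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", model_output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == reference_answer.strip() else 0.0

# The signal is verifiable but not diagnostic: a 0.0 cannot say whether the
# model mis-read the problem, made an arithmetic slip, or formatted the answer badly.
print(verifiable_reward("The total is 42", "42"))   # 1.0
print(verifiable_reward("The total is 41", "42"))   # 0.0
```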
Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs · 2025-02-01 · measurement-piece
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability.
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Offline eval is only as good as the construct it ratifies.

Dashes mean the entry has no commentary on that dimension yet. Absence is not a negative rating.