Compare entries

Side-by-side view of every entry across prescription stance and five load-bearing capability commentary fields.

The library is designed to preserve disagreement. Look for entries that talk about reward richness but stay thin on repairability, or foreground institutional ratification while saying little about feedback and learning. Those contrasts are what the schema exists to keep visible.

Fields shown for each entry: Stance · Input legibility · Reward richness · Repairability · Observability · Offline evaluability
Deep Research Query: Work Registration and Collision Prevention
Daniel S. Griffin · 2026-05-05 · framework-piece
Stance: strongly-procedural
Input legibility: The query turns a local operational pain into a concrete research object with constraints, existing tools, and desired decision outputs.
Reward richness: The target protocol can observe collisions and stale claims, but success is partly social adoption rather than a single crisp reward.
Repairability: A claim registry would make collisions diagnosable and recoverable, but only if the protocol records enough state to distinguish active, blocked, stale, and abandoned work.
Observability: The whole prompt is about making otherwise invisible concurrent agent work visible before new work begins.
Offline evaluability: The proposed solution space is local-first: SQLite, filesystem manifests, git worktrees, and reproducible preflight checks.
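The repairability and offline-evaluability notes above name concrete ingredients: a SQLite-backed claim registry, the states active/blocked/stale/abandoned, and a preflight check run before new work begins. A minimal sketch of that shape, assuming a single-table registry; the `claims` schema, column names, and the one-hour staleness threshold are illustrative assumptions, not the query's actual protocol.

```python
import sqlite3
import time

# Hypothetical single-table claim registry; table and column names are
# illustrative. States follow the entry's commentary: active, blocked,
# stale, abandoned.
SCHEMA = """
CREATE TABLE IF NOT EXISTS claims (
    path     TEXT PRIMARY KEY,   -- file or work area being claimed
    agent    TEXT NOT NULL,      -- who claimed it
    state    TEXT NOT NULL,      -- active | blocked | stale | abandoned
    claimed  REAL NOT NULL       -- unix timestamp of the claim
);
"""

STALE_AFTER = 60 * 60  # assumed staleness threshold: one hour

def preflight(db: sqlite3.Connection, path: str, agent: str) -> str:
    """Check for collisions before starting work; return a verdict string."""
    row = db.execute(
        "SELECT agent, state, claimed FROM claims WHERE path = ?", (path,)
    ).fetchone()
    if row is None or row[1] == "abandoned":
        db.execute(
            "INSERT OR REPLACE INTO claims VALUES (?, ?, 'active', ?)",
            (path, agent, time.time()),
        )
        db.commit()
        return "claimed"
    holder, state, claimed = row
    if state == "active" and time.time() - claimed > STALE_AFTER:
        return f"stale claim held by {holder}; confirm before overriding"
    return f"collision: {holder} holds this path ({state})"

db = sqlite3.connect("claims.db")
db.executescript(SCHEMA)
print(preflight(db, "src/worker.py", "agent-a"))
```

The point of the sketch is the preflight step: the check is reproducible and local-first, but whether agents actually run it is the social-adoption gap the reward-richness note flags.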
Hermes Agent README
Nous Research · 2026-04-28 · case-study
Stance: strongly-procedural
Input legibility: Slash commands, personalities, skills, memory, and cross-session search make user intent and prior context available to the model.
Reward richness: Hermes emphasizes learning from experience and skill improvement, but the README does not define a single reward signal.
Repairability: The system can improve skills during use, create skills from experience, search past sessions, and persist knowledge.
Observability: Terminal UI, command history, streaming tool output, diagnostics, and session search make agent behavior inspectable.
Offline evaluability: Research tooling and batch trajectory generation suggest evaluability, but this is not the main README argument.
An open-source spec for Codex orchestration: Symphony
Alex Kotliarskyi, Victor Zhu, and Zach Brock · 2026-04-26 · framework-piece
Stance: strongly-procedural
Input legibility: Issues, WORKFLOW.md, project state, and review packets turn ambiguous work into agent-readable objectives.
Reward richness: CI, reviews, issue state transitions, PR landing, videos, and human review all become feedback signals.
Repairability: The system rebases, resolves conflicts, retries flaky checks, restarts stalled agents, and feeds failures back into guardrails and skills.
Observability: Symphony foregrounds logs, status surfaces, review packets, videos, Linear state, and operator visibility.
Offline evaluability: Software tasks have tests, CI, smoke tests, Chrome DevTools checks, and reproducible workspaces.
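The repairability note mentions retrying flaky checks before escalating a failure. A generic sketch of that behavior, not Symphony's implementation; the command, attempt count, and backoff are assumptions.

```python
import subprocess
import time

def run_check_with_retries(cmd: list[str], attempts: int = 3, backoff: float = 10.0) -> bool:
    """Re-run a possibly flaky check a few times before reporting failure."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed:\n{result.stdout}\n{result.stderr}")
        if attempt < attempts:
            time.sleep(backoff * attempt)  # simple linear backoff between retries
    # Persistent failure: in the entry's framing this feeds back into
    # guardrails and skills rather than more retries.
    return False

if __name__ == "__main__":
    ok = run_check_with_retries(["pytest", "-q"])
    print("check passed" if ok else "check failed after retries")
```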
What Is an Agent Harness
Aparna Dhinakaran · 2026-04-21 · framework-piece
Stance: strongly-procedural
Input legibility: Project instruction files, context injection, skills, and tool discovery make the task environment legible to the model before and during work.
Reward richness: The source emphasizes act-observe-adjust feedback, but not explicit reward-model training or scalar reward design.
Repairability: Repair is central to the definition: the model can observe consequences and continue until the task is actually solved.
Observability: Hooks, session logs, context compression, and tool results make harness behavior inspectable, though the post is more architectural than telemetry-specific.
Offline evaluability: Coding agents inherit strong offline checks through tests, shell commands, diffs, and build outputs.
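The act-observe-adjust loop the reward-richness and repairability notes describe has a simple schematic form. A sketch under assumptions: the `model` object and its `next_command` method are hypothetical stand-ins, not the post's own API.

```python
import subprocess

def harness_loop(model, max_steps: int = 20) -> None:
    """Schematic act-observe-adjust loop for a coding agent harness."""
    observation = "start"
    for _ in range(max_steps):
        command = model.next_command(observation)   # act
        if command is None:                         # model decides the task is solved
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True
        )
        # observe: exit code, stdout, stderr (tests, diffs, build output)
        observation = f"exit={result.returncode}\n{result.stdout}\n{result.stderr}"
        # adjust happens inside the model on the next turn, given this observation
```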
From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agentic AI-Mediated Software Engineering
Dana Feng, Bhada Yun, April Yi Wang · 2026-04-13 · field-observation
Stance: mixed
Input legibility: Seniors shape inputs well — scoping, constraining, arriving with a plan. Juniors struggle because they do not know what to ask. The familiar/unfamiliar split (Figure 1) is essentially about whether the human can form a good first-mile input.
Reward richness: —
Repairability: What happens when the agent goes wrong. Seniors iterate and refine; juniors spiral. J6: "it started spiraling ... I just stopped it. The fix was a three-line change."
Observability: Prompt history review is a major theme. S7: "accept, accept, accept is a very different thing versus generating all that content and then not actually reading it." Seniors read diffs, reject bloat, cross-check with other models.
Offline evaluability: —
Thin Harness, Fat Skills
Garry Tan · 2026-04-10 · practitioner-note
Stance: anti-prescriptive
Input legibility: Assumes inputs are legible enough that heavy shaping is unnecessary — a domain-specific bet.
Reward richness: Does not foreground reward signal as the key lever.
Repairability: Thin-harness framing tends to under-specify where repair loops live.
Observability: Commentary present
Offline evaluability: —
LLM Knowledge Bases
Andrej Karpathy · 2026-04-01 · practitioner-note
Stance: mixed
Input legibility: The raw/ to wiki compilation process is explicitly about making heterogeneous documents legible to future LLM turns.
Reward richness: The workflow has useful signals from links, consistency, and answer quality, but not an explicit reward signal.
Repairability: Health checks, missing-data imputation, and filing outputs back into the wiki make the knowledge base incrementally repairable.
Observability: The wiki is human-readable markdown and images viewed in Obsidian, so the agent's knowledge substrate stays inspectable.
Offline evaluability: Some checks can be run offline over the wiki, but factual gaps still require web search or source refresh.
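One health check that can run offline over a markdown wiki is link integrity. A minimal sketch, assuming an Obsidian-style vault with [[wikilink]] syntax and one note per .md file; the vault path and the link convention are assumptions, not Karpathy's tooling.

```python
import re
from pathlib import Path

# Capture the target of [[wikilinks]], ignoring aliases (|) and headings (#).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(vault: Path) -> list[tuple[Path, str]]:
    """Return (page, missing target) pairs for wikilinks with no matching note."""
    notes = {p.stem for p in vault.rglob("*.md")}
    missing = []
    for page in vault.rglob("*.md"):
        for target in WIKILINK.findall(page.read_text(encoding="utf-8")):
            if target.strip() not in notes:
                missing.append((page, target.strip()))
    return missing

for page, target in broken_links(Path("wiki")):
    print(f"{page}: missing note [[{target}]]")
```

This is the shape of check that stays fully offline; the factual-gap checks the offline-evaluability note mentions still need web search or source refresh.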
Standard Signal: AI-native hedge fund announcement
Michael Royzen · 2026-02-28 · domain-claim
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: P&L is an unusually clean, cardinal, self-consistent reward. This is the library's own framing; Royzen does not use the phrase 'verifiable reward.'
Repairability: Critical tension: trading P&L tells you that a model is wrong but not *where* or *why*. Verifiable outcome ≠ diagnostic feedback.
Observability: Commentary present
Offline evaluability: Backtesting is real but regime-shift biased.
Skill Issue: Harness Engineering for Coding Agents
HumanLayer · 2026-02-28 · case-study
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Commentary present
Repairability: Back-pressure mechanisms are repair harness by another name.
Observability: Commentary present
Offline evaluability: Commentary present
OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · 2025-10-22 · measurement-piece
Stance: strongly-procedural
Input legibility: Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information.
Reward richness: Ground-truth distributions from observational data support accuracy and calibration metrics.
Repairability: The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths.
Observability: Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty.
Offline evaluability: The benchmark is explicitly offline and reproducible against dataset-derived ground truth.
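The observability note lists calibration error over elicited quantiles. A generic sketch of what such a metric can look like, not OpenEstimate's own scoring code; the quantile levels, the simplification to scalar ground-truth values, and the example numbers are invented.

```python
import numpy as np

def calibration_error(pred_quantiles: np.ndarray, truths: np.ndarray,
                      levels: tuple[float, ...] = (0.1, 0.25, 0.5, 0.75, 0.9)) -> float:
    """Mean absolute gap between nominal and empirical quantile coverage.

    pred_quantiles: shape (n_questions, n_levels), model-reported quantiles
    truths:         shape (n_questions,), dataset-derived ground-truth values
    """
    gaps = []
    for j, level in enumerate(levels):
        empirical = float(np.mean(truths <= pred_quantiles[:, j]))
        gaps.append(abs(empirical - level))
    return float(np.mean(gaps))

# Example with made-up numbers: three questions, five quantile levels each.
preds = np.array([[1, 2, 3, 4, 5], [0, 1, 2, 3, 4], [2, 3, 4, 5, 6]], dtype=float)
truth = np.array([3.5, 1.0, 6.5])
print(round(calibration_error(preds, truth), 3))
```

A systematic gap between nominal and empirical coverage is the overconfidence signal the repairability note says the benchmark surfaces without itself explaining.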
Resurrecting deceased darlings: The Missing Foreword to AI and the Art of Being Human
Andrew Maynard · 2025-10-18 · case-study
Stance: mixed
Input legibility: The authors built a library of resources and deep prompts over months before drafting.
Reward richness: The feedback signal is editorial and human, not mechanical or scalar.
Repairability: The post emphasizes manual refinement, removal of hallucinations, reduction of AI tells, and killing beloved text for reader flow.
Observability: The foreword makes the collaboration visible, including worries, Claude's failures, and the retained AI tell.
Offline evaluability: Quality is judged through reading, editing, credibility, and reader engagement rather than offline tests.
Equipping agents for the real world with Agent Skills
Anthropic · 2025-10-15 · framework-piece
Stance: mixed
Input legibility: Commentary present
Reward richness: Commentary present
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Commentary present
Claude Skills are awesome, maybe a bigger deal than MCP
Simon Willison · 2025-10-15 · synthesis-node
Stance: mixed
Input legibility: Progressive disclosure — scan metadata, load full skill on demand — is a legibility pattern.
Reward richness: Commentary present
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Commentary present
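The progressive-disclosure pattern in the input-legibility note has a small concrete shape: scan lightweight metadata for every skill up front, and load a full skill body only when it becomes relevant. A minimal sketch assuming a directory of skills each containing a SKILL.md with simple `name:` and `description:` front-matter lines; the naive parsing and directory layout are illustrative, not Anthropic's or Willison's code.

```python
from pathlib import Path

def scan_skill_metadata(skills_dir: Path) -> dict[str, str]:
    """Return {skill name: one-line description} without loading full bodies."""
    catalog = {}
    for skill_md in skills_dir.glob("*/SKILL.md"):
        meta = {}
        # Deliberately naive front-matter scan of the first few lines.
        for line in skill_md.read_text(encoding="utf-8").splitlines()[:10]:
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip().lower()] = value.strip()
        catalog[meta.get("name", skill_md.parent.name)] = meta.get("description", "")
    return catalog

def load_skill(skills_dir: Path, name: str) -> str:
    """Load the full skill body on demand, once it is judged relevant."""
    return (skills_dir / name / "SKILL.md").read_text(encoding="utf-8")
```

The legibility win is that the model sees every skill's name and description cheaply, and pays the context cost of a full body only for the skills it actually uses.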
Good and Bad Harness Engineering
Daniel Miessler · 2025-08-31 · framework-piece
Stance: mixed
Input legibility: Treats input formation as part of the engineered system, not preprocessing.
Reward richness: Commentary present
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Commentary present
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Unknown (OpenReview: 21UFlJrmS2) · 2025-08-31 · measurement-piece
Stance: mixed
Input legibility: Commentary present
Reward richness: Rubric scores are richer than nothing, sparser than a verified pass/fail.
Repairability: Rubrics *name* failure modes — that is diagnostic by construction, not just verifiable.
Observability: Commentary present
Offline evaluability: Commentary present
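The repairability note hinges on the fact that a rubric scores named criteria separately, so a low reward also says which criterion failed. A schematic sketch of that structure, not the paper's implementation; the criteria, weights, and judge function are placeholders.

```python
from typing import Callable

# Placeholder rubric: (criterion name, weight) pairs.
RUBRIC = [
    ("answers the question asked", 0.4),
    ("cites evidence for key claims", 0.3),
    ("no fabricated sources", 0.3),
]

def rubric_reward(response: str, judge: Callable[[str, str], bool]) -> tuple[float, dict[str, bool]]:
    """Return (scalar reward, per-criterion verdicts); the verdicts are the diagnostic part."""
    verdicts = {name: judge(response, name) for name, _ in RUBRIC}
    reward = sum(weight for name, weight in RUBRIC if verdicts[name])
    return reward, verdicts

# Usage: the scalar goes to the RL update; the verdicts name the failure mode.
reward, verdicts = rubric_reward("...", judge=lambda resp, criterion: True)
```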
Building an AI-ready public workforce
OECD · 2025-06-30 · governance-piece
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Public-sector outcomes rarely collapse to a cardinal reward.
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: —
Bitter Lesson Engineering
Daniel Miessler · 2025-05-31 · framework-piece
Stance: anti-prescriptive
Input legibility: Being specific about intent *is* input legibility. The whole prescription.
Reward richness: Commentary present
Repairability: Anti-prescriptive stances tend to underweight the value of diagnostic repair loops.
Observability: Commentary present
Offline evaluability: —
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo · 2025-05-14 · measurement-piece
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Commentary present
Repairability: Commentary present
Observability: Observability in the measurement-theoretic sense: what does the evaluation actually let you see?
Offline evaluability: Commentary present
Expanding RL with Verifiable Rewards Across Diverse Domains
Ma et al. · 2025-03-30 · domain-claim
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it.
Repairability: Mark against reward-richness: a verifiable outcome signal can still be silent on the error mechanism.
Observability: Commentary present
Offline evaluability: Commentary present
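The repairability mark is easiest to see in the simplest form a verifiable reward takes: a programmatic check of the final answer. A sketch assuming a math-style task with a numeric reference answer; the extraction pattern and task format are illustrative, not from the paper.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference, else 0.0."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", model_output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == reference_answer.strip() else 0.0

# The signal is verifiable but not diagnostic: a 0.0 cannot say whether the
# model mis-read the problem, made an arithmetic slip, or formatted the answer badly.
print(verifiable_reward("The total is 42", "42"))   # 1.0
print(verifiable_reward("The total is 41", "42"))   # 0.0
```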
Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs · 2025-02-01 · measurement-piece
Stance: strongly-procedural
Input legibility: Commentary present
Reward richness: Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability.
Repairability: Commentary present
Observability: Commentary present
Offline evaluability: Offline eval is only as good as the construct it ratifies.

Dashes mean the entry has no commentary on that dimension yet. Absence is not a negative rating.