Compare entries
A side-by-side view of every entry's prescription stance and its commentary across five load-bearing capability dimensions.
The library is designed to preserve disagreement. Look for entries that talk about reward richness but stay thin on repairability, or foreground institutional ratification while saying little about feedback and learning. Those contrasts are what the schema exists to keep visible.
| Entry | Stance | Input legibility | Reward richness | Repairability | Observability | Offline evaluability |
|---|---|---|---|---|---|---|
| Deep Research Query: Work Registration and Collision Prevention · Daniel S. Griffin · 2026-05-05 · framework-piece | strongly-procedural | The query turns a local operational pain into a concrete research object with constraints, existing tools, and desired decision outputs. | The target protocol can observe collisions and stale claims, but success is partly social adoption rather than a single crisp reward. | A claim registry would make collisions diagnosable and recoverable, but only if the protocol records enough state to distinguish active, blocked, stale, and abandoned work. | The whole prompt is about making otherwise invisible concurrent agent work visible before new work begins. | The proposed solution space is local-first: SQLite, filesystem manifests, git worktrees, and reproducible preflight checks. |
| Hermes Agent README · Nous Research · 2026-04-28 · case-study | strongly-procedural | Slash commands, personalities, skills, memory, and cross-session search make user intent and prior context available to the model. | Hermes emphasizes learning from experience and skill improvement, but the README does not define a single reward signal. | The system can improve skills during use, create skills from experience, search past sessions, and persist knowledge. | Terminal UI, command history, streaming tool output, diagnostics, and session search make agent behavior inspectable. | Research tooling and batch trajectory generation suggest evaluability, but this is not the main README argument. |
| An open-source spec for Codex orchestration: Symphony · Alex Kotliarskyi, Victor Zhu, and Zach Brock · 2026-04-26 · framework-piece | strongly-procedural | Issues, WORKFLOW.md, project state, and review packets turn ambiguous work into agent-readable objectives. | CI, reviews, issue state transitions, PR landing, videos, and human review all become feedback signals. | The system rebases, resolves conflicts, retries flaky checks, restarts stalled agents, and feeds failures back into guardrails and skills. | Symphony foregrounds logs, status surfaces, review packets, videos, Linear state, and operator visibility. | Software tasks have tests, CI, smoke tests, Chrome DevTools checks, and reproducible workspaces. |
| What Is an Agent Harness · Aparna Dhinakaran · 2026-04-21 · framework-piece | strongly-procedural | Project instruction files, context injection, skills, and tool discovery make the task environment legible to the model before and during work. | The source emphasizes act-observe-adjust feedback, but not explicit reward-model training or scalar reward design. | Repair is central to the definition: the model can observe consequences and continue until the task is actually solved. | Hooks, session logs, context compression, and tool results make harness behavior inspectable, though the post is more architectural than telemetry-specific. | Coding agents inherit strong offline checks through tests, shell commands, diffs, and build outputs. |
| From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agentic AI-Mediated Software Engineering · Dana Feng, Bhada Yun, April Yi Wang · 2026-04-13 · field-observation | mixed | Seniors shape inputs well — scoping, constraining, arriving with a plan. Juniors struggle because they do not know what to ask. The familiar/unfamiliar split (Figure 1) is essentially about whether the human can form a good first-mile input. | — | What happens when the agent goes wrong. Seniors iterate and refine; juniors spiral. J6: "it started spiraling ... I just stopped it. The fix was a three-line change." | Prompt history review is a major theme. S7: "accept, accept, accept is a very different thing versus generating all that content and then not actually reading it." Seniors read diffs, reject bloat, cross-check with other models. | — |
| Thin Harness, Fat Skills · Garry Tan · 2026-04-10 · practitioner-note | anti-prescriptive | Assumes inputs are legible enough that heavy shaping is unnecessary — a domain-specific bet. | Does not foreground reward signal as the key lever. | Thin-harness framing tends to under-specify where repair loops live. | Commentary present | — |
| LLM Knowledge Bases · Andrej Karpathy · 2026-04-01 · practitioner-note | mixed | The raw/-to-wiki compilation process is explicitly about making heterogeneous documents legible to future LLM turns. | The workflow has useful signals from links, consistency, and answer quality, but not an explicit reward signal. | Health checks, missing-data imputation, and filing outputs back into the wiki make the knowledge base incrementally repairable. | The wiki is human-readable markdown and images viewed in Obsidian, so the agent's knowledge substrate stays inspectable. | Some checks can be run offline over the wiki, but factual gaps still require web search or source refresh. |
| Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · domain-claim | strongly-procedural | Commentary present | P&L is an unusually clean, cardinal, self-consistent reward, though that framing is the library's own; Royzen does not use the phrase 'verifiable reward.' | Critical tension: trading P&L tells you that a model is wrong but not *where* or *why*. Verifiable outcome ≠ diagnostic feedback. | Commentary present | Backtesting is real but regime-shift biased. |
| Skill Issue: Harness Engineering for Coding Agents · HumanLayer · 2026-02-28 · case-study | strongly-procedural | Commentary present | Commentary present | Back-pressure mechanisms are repair harness by another name. | Commentary present | Commentary present |
| OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data · Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · 2025-10-22 · measurement-piece | strongly-procedural | Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information. | Ground-truth distributions from observational data support accuracy and calibration metrics. | The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths. | Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty. | The benchmark is explicitly offline and reproducible against dataset-derived ground truth. |
| Resurrecting deceased darlings: The Missing Foreword to AI and the Art of Being Human · Andrew Maynard · 2025-10-18 · case-study | mixed | The authors built a library of resources and deep prompts over months before drafting. | The feedback signal is editorial and human, not mechanical or scalar. | The post emphasizes manual refinement, removal of hallucinations, reduction of AI tells, and killing beloved text for reader flow. | The foreword makes the collaboration visible, including worries, Claude's failures, and the retained AI tell. | Quality is judged through reading, editing, credibility, and reader engagement rather than offline tests. |
| Equipping agents for the real world with Agent Skills · Anthropic · 2025-10-15 · framework-piece | mixed | Commentary present | Commentary present | Commentary present | Commentary present | Commentary present |
| Claude Skills are awesome, maybe a bigger deal than MCP · Simon Willison · 2025-10-15 · synthesis-node | mixed | Progressive disclosure — scan metadata, load full skill on demand — is a legibility pattern. | Commentary present | Commentary present | Commentary present | Commentary present |
| Good and Bad Harness Engineering · Daniel Miessler · 2025-08-31 · framework-piece | mixed | Treats input formation as part of the engineered system, not preprocessing. | Commentary present | Commentary present | Commentary present | Commentary present |
| Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains · Unknown (OpenReview: 21UFlJrmS2) · 2025-08-31 · measurement-piece | mixed | Commentary present | Rubric scores are richer than nothing, sparser than a verified pass/fail. | Rubrics *name* failure modes — that is diagnostic by construction, not just verifiable. | Commentary present | Commentary present |
| Building an AI-ready public workforce · OECD · 2025-06-30 · governance-piece | strongly-procedural | Commentary present | Public-sector outcomes rarely collapse to a cardinal reward. | Commentary present | Commentary present | — |
| Bitter Lesson Engineering · Daniel Miessler · 2025-05-31 · framework-piece | anti-prescriptive | Being specific about intent *is* input legibility; that is the whole prescription. | Commentary present | Anti-prescriptive stances tend to underweight the value of diagnostic repair loops. | Commentary present | — |
| Measurement to Meaning: A Validity-Centered Framework for AI Evaluation · Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo · 2025-05-14 · measurement-piece | strongly-procedural | Commentary present | Commentary present | Commentary present | Observability in the measurement-theoretic sense: what does the evaluation actually let you see? | Commentary present |
| Expanding RL with Verifiable Rewards Across Diverse Domains · Ma et al. · 2025-03-30 · domain-claim | strongly-procedural | Commentary present | RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it. | A mark against reward richness: a verifiable outcome signal can still be silent on the error mechanism. | Commentary present | Commentary present |
| Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge · Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs · 2025-02-01 · measurement-piece | strongly-procedural | Commentary present | Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability. | Commentary present | Commentary present | Offline eval is only as good as the construct it ratifies. |
Dashes mean the entry has no commentary on that dimension yet. Absence is not a negative rating.