Shelves of disagreement

A reading of the current 19-entry corpus in the Extended Capability Library.

The schema is designed to preserve disagreement rather than flatten it. This note groups the current entries into seven conceptual shelves and asks, for each shelf: what does it claim, what does it miss, and how does it relate to the broader argument that AI capability in practice is extended by the scaffolding around the model rather than residing in the model alone?

Entries sit on more than one shelf — the shelving is a reading, not a classification.


1. Harness architecture

Entries: HumanLayer, "Skill Issue"; Anthropic, "Agent Skills"; Willison, "Claude Skills are awesome"; Miessler, "Good and Bad Harness Engineering"; Dhinakaran, "What Is an Agent Harness"; OpenAI, "Symphony"; Nous Research, "Hermes Agent README"; Karpathy, "LLM Knowledge Bases".

Claims. Specific architectural moves around the model — sub-agents used for context control, skills as markdown with progressive disclosure, hooks, back-pressure mechanisms, issue trackers as control planes, persistent memory, and LLM-maintained knowledge bases — do real, non-absorbable work. The gap between 2x and 100x practitioners is architectural, not cognitive.
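To make one of these moves concrete, here is a minimal sketch of a back-pressure gate around an agent's tool calls: low-risk actions execute freely, but new high-risk work is blocked once human review falls behind. The class name, threshold, and risk labels are illustrative assumptions, not HumanLayer's implementation or any vendor's API.

```python
from collections import deque

class BackPressureGate:
    """Pause new high-risk work when too many actions await human review."""

    def __init__(self, max_pending: int = 3):
        self.max_pending = max_pending
        self.pending_reviews: deque = deque()  # high-risk actions awaiting approval

    def submit(self, action: dict) -> str:
        if action.get("risk") == "high":
            if len(self.pending_reviews) >= self.max_pending:
                # Back-pressure: refuse new high-risk work until reviews clear.
                return "blocked: waiting for human review to catch up"
            self.pending_reviews.append(action)
            return "queued for human approval"
        return "executed"  # low-risk actions pass straight through

    def approve_next(self) -> dict | None:
        # A human clearing one review releases capacity for the next action.
        return self.pending_reviews.popleft() if self.pending_reviews else None


gate = BackPressureGate(max_pending=2)
print(gate.submit({"tool": "deploy", "risk": "high"}))         # queued for human approval
print(gate.submit({"tool": "migrate_db", "risk": "high"}))     # queued for human approval
print(gate.submit({"tool": "delete_bucket", "risk": "high"}))  # blocked: waiting for review
```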

Misses. Almost all of it is software-centric and assumes a skilled technical user. The shelf says little about first-mile input formation from non-technical users, or about domains where the harness spans legal, clinical, or administrative workflow rather than a terminal. Also: the shelf describes what harness moves exist, with less attention to how to tell which ones will be absorbed by the next model. Symphony and Hermes broaden the surface from coding sessions to orchestration and persistence, but still lean technical.

Relation. The clearest statement of "capability is extended." The harness is the extension; the shelf is the evidence it carries load. Its boundary condition is the next shelf.

2. Anti-prescriptive / intent-first

Entries: Tan, "Thin Harness, Fat Skills"; Miessler, "Bitter Lesson Engineering"; Miessler, "Good and Bad Harness Engineering" (partially — it prescribes a discipline, not procedures).

Claims. As models improve, prescriptive scaffolding is dead weight. State intent precisely; let the model handle execution. The scaffolding that survives is skills-as-knowledge (markdown describing what), not skills-as-rails (code describing how).
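A hedged illustration of the contrast, with the skill text, function names, and the `model.generate` interface all invented for the example: the knowledge version states what a good result looks like and leaves execution to the model; the rails version freezes the procedure, so a stronger model cannot improve it.

```python
# Skill-as-knowledge: markdown describing WHAT. It rides along as context and
# the model decides HOW, so a stronger model executes the same skill better.
RELEASE_NOTES_SKILL = """\
# Writing release notes
- Summarize user-visible changes only; link each item to its PR.
- Group items under: Features, Fixes, Breaking changes.
- Every breaking change must name its migration step.
"""

def run_with_skill(model, task: str) -> str:
    # `model.generate` is a placeholder interface, not a real library call.
    return model.generate(context=RELEASE_NOTES_SKILL, task=task)

# Skill-as-rails: code describing HOW. The steps are frozen in the procedure,
# so model improvements cannot change them, only fill in the slots provided.
def release_notes_rails(prs: list[dict]) -> str:
    features = [p for p in prs if p["label"] == "feature"]
    fixes = [p for p in prs if p["label"] == "fix"]
    lines = ["## Features"]
    lines += [f"- {p['title']} (#{p['number']})" for p in features]
    lines += ["## Fixes"]
    lines += [f"- {p['title']} (#{p['number']})" for p in fixes]
    return "\n".join(lines)
```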

Misses. The shelf tends to under-theorise repair, observability, and institutional ratification. "Anti-prescriptive in principle" can collapse into "don't worry about the harness" in practice — which does not explain why HumanLayer keeps shipping back-pressure, or why coding-agent teams converge on the same tactical solutions even when models improve.

Relation. The boundary condition for Shelf 1. Agrees that capability is extended, but locates the durable extension in intent and knowledge rather than in procedure. The sharpest contrast: same set of observations, opposite prescriptions.

3. Reward-rich / verifiable domains

Entries: Royzen, "Standard Signal"; Expanding RLVR Across Diverse Domains; Rubrics as Rewards (extending the framing past the easy cases); Renda et al., "OpenEstimate" (calibration under uncertainty as a reward/measurement problem).

Claims. Capability concentrates in domains that supply a crisp, verifiable reward signal. Markets, theorem proving, test-gated code. The mechanism is RLVR; the downstream claim is that such domains are structurally favoured regardless of model generation.

Misses. Consistently conflates verifiable outcome with diagnostic feedback. A verified P&L tells you a model was wrong but not where; a failed test suite tells you a patch broke something but not what. The shelf understates the repairability problem — and, as the next shelf argues, the construct-validity problem under the word "verifiable."

Relation. The domain-structural variant of "capability is extended." Locates the extension in the reward geometry of the domain rather than in the harness or the model. Read together with Shelf 1, it gets sharper: given a favourable domain, is the harness what compounds, or does the domain do most of the work?

4. Measurement / validity / standards

Entries: Wallach, Jacobs et al., "Social Science Measurement Challenge"; Salaudeen et al., "Measurement to Meaning"; Rubrics as Rewards (also a measurement piece — rubrics name failure modes); Renda et al., "OpenEstimate".

Claims. Evaluation is a social-science measurement problem. A benchmark score is a claim about a construct; the claim is only as strong as the validity of the instrument. OpenEstimate adds the uncertainty-calibration variant: a model can produce a plausible answer while holding a badly formed belief about its own uncertainty. Most GenAI evaluation skips the validation step, produces sloppy apples-to-oranges comparisons, and calls them capability claims.
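A toy illustration of the calibration point, not OpenEstimate's actual protocol: if a model states 90% intervals, the question is whether roughly 90% of them cover the truth. The numbers below are invented.

```python
# Coverage check for stated 90% intervals; all values are hypothetical.
estimates = [
    # (low, high, true_value)
    (40, 60, 55),
    (10, 30, 45),
    (100, 180, 120),
    (5, 9, 3),
]

covered = sum(low <= truth <= high for low, high, truth in estimates)
coverage = covered / len(estimates)
print(f"Stated 90% intervals covered the truth {coverage:.0%} of the time")
# A model can sound plausible on every answer and still land far from 90% here;
# that gap, not the point estimates, is the badly formed belief about uncertainty.
```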

Misses. Diagnostic rigor outruns operational tractability. Validity frameworks require social and institutional slack that practitioners shipping products rarely have. The shelf is strong on diagnosis, weaker on what a partial-validity claim looks like in flight — which is what most deployed systems actually have.

Relation. Pushback on "capability is extended" that aims at the word capability. Says: before arguing about what extends capability, be clear on which construct you are claiming to measure. Useful discipline across every other shelf, especially Shelf 3 (where the word "verifiable" is doing work that validity theory would challenge).

5. Institutional scaffolding / workforce transition

Entries: OECD, "Building an AI-ready public workforce"; Anthropic, "Agent Skills" (the vendor-ratification side); Royzen, "Standard Signal" (a hedge fund is the institutional wrapper around the model); OpenAI, "Symphony" (issue trackers as organizational ratification); Maynard, "Resurrecting deceased darlings" (publication and editorial judgment as ratification).

Claims. Whether an AI system is capable in practice depends on institutional fit — training, procurement, accountability, legal form, vendor ratification. A model-vendor blessing a pattern is a ratification event; a workforce-readiness report is another; a fund is a third. Deployments stall without the scaffolding regardless of model strength.

Misses. Feedback loops are slow (years, not cycles), so inference from a few entries is weak. The shelf currently leans on high-level documents and meta-claims. It needs sector-specific case studies (hospitals, courts, agencies) where institutional scaffolding demonstrably made or broke a deployment. Without them the shelf stays gestural.

Relation. The large-scale variant of "capability is extended." Keeps the library from collapsing into a coding-agent conversation. If Shelves 1–3 answer what extends capability at the unit of an application, this shelf answers what extends capability at the unit of an institution.

6. Knowledge work / authorship / cumulative artifacts

Entries: Karpathy, "LLM Knowledge Bases"; Maynard, "Resurrecting deceased darlings"; Dhinakaran, "What Is an Agent Harness" (the general harness frame); Willison, "Claude Skills are awesome" (portable markdown knowledge).

Claims. In knowledge work, capability often comes from making work cumulative: raw sources become markdown wikis, prompts become reusable resources, outputs get filed back into the corpus, drafts become objects for editorial judgment, and the human-facing interface remains inspectable. The agent's contribution is not just answer generation; it is artifact maintenance.

Misses. These entries have weaker mechanical validation than coding-agent examples. Their feedback loops are editorial, interpretive, and social, which makes them harder to score. A wiki can become coherent-looking while still being wrong; a book can become more eloquent while drifting from reality or the authors' voice.

Relation. This shelf prevents the library from treating "harness" as a coding-only concept. It shows the same extension logic in research and writing: legible artifacts, inspectable intermediate state, repairable outputs, and human judgment over what gets kept.

7. Field handoffs / applied AI evidence

Entries: Applied AI Handoff Atlas; Maynard, "Resurrecting deceased darlings" (knowledge-work handoff); OpenAI, "Symphony" (coordination handoff); Karpathy, "LLM Knowledge Bases" (artifact-maintenance handoff).

Claims. Small deployed or semi-deployed systems can function as evidence when they make the handoff explicit: which human function moved into the AI system, what scaffolding made that move acceptable, what broke, and what artifact remains. The unit of analysis is not "an AI app"; it is a transfer of judgment, memory, access, practice, explanation, or representation.

Misses. The evidence is still uneven. Screenshots, changelogs, public writeups, audits, and repository notes exist, but runtime traces, user quotes, fixtures, and before/after examples are incomplete. The shelf should not overclaim production maturity where the artifact is currently a field note.

Relation. This shelf is the field-evidence companion to the more theoretical shelves. It grounds the extended-capability argument in applied work: transparency pages, opt-in generation, local data boundaries, memory approval, pronunciation-measurement failure, and memorial non-impersonation constraints.


The strongest current gap

Across all seven shelves, the thinnest evidentiary spot is repair loops in action. The library has several entries that theorise about repair (Miessler, HumanLayer on back-pressure, Rubrics as Rewards on naming failure modes, OpenEstimate on uncertainty calibration, Dhinakaran on closed loops) and several that assert that verifiable reward is the lever (Royzen, Expanding RLVR). Symphony and Karpathy add stronger operating examples, and the Handoff Atlas brings the gap closer to applied practice, but the corpus still needs a concrete diagnostic-repair loop on a real failure: the step where a system noticed it was wrong, attributed the wrongness to a specific cause, and repaired itself or was repaired.

That is the sharpest unanswered empirical question in the current corpus: does reward richness compound into capability through a repair loop, or does it stall at pass/fail?
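A hypothetical sketch of that distinction, with every name invented for illustration and no claim that any observed system works this way: a pass/fail verifier only reports wrongness, while a repair loop attributes the failure to a cause and aims the fix at it.

```python
from dataclasses import dataclass, replace

@dataclass
class Run:
    tests_passed: bool
    failure_trace: str = ""

def pass_fail(run: Run) -> bool:
    # Reward-rich but not diagnostic: the verifier says "wrong", not where.
    return run.tests_passed

def repair_loop(run: Run, attribute, repair, max_attempts: int = 3) -> bool:
    # The missing evidence: notice the failure, attribute it to a specific
    # cause, apply a repair aimed at that cause, and re-verify.
    for _ in range(max_attempts):
        if run.tests_passed:
            return True
        cause = attribute(run.failure_trace)  # e.g. "stale fixture", "wrong schema"
        run = repair(run, cause)
    return run.tests_passed

# Toy usage: one attributed cause, one targeted repair, then the check passes.
failing = Run(tests_passed=False,
              failure_trace="KeyError: 'user_id' in fixtures/orders.json")
print(pass_fail(failing))  # False, with no indication of where it went wrong
print(repair_loop(
    failing,
    attribute=lambda trace: "stale fixture" if "fixtures/" in trace else "unknown",
    repair=lambda run, cause: replace(run, tests_passed=(cause == "stale fixture")),
))  # True: the loop compounded where pass/fail alone would have stalled
```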

Two secondary gaps follow from this:

  • Observability traces. Very few entries are grounded in runtime introspection of agent behavior — most are prescriptive or definitional.
  • Non-software domains beyond finance. No medicine, no law, no administration. The shelves as currently stocked risk reading as a coding-agent anthology.

Next five entries to add (priority order)

  1. A post-mortem of an agent run that traces a failure to a specific cause and the repair that followed. Strengthens Shelves 1 and 3. What counts: explicit error attribution, explicit repair step, reported as observed rather than proposed. Candidates to scout: OpenHands / All Hands writeups, Factory AI / Cognition evaluations, engineering post-mortems that name an agent by name.

  2. A practical piece on agent observability / trace tooling. Strengthens Shelves 1 and 4. Not a vendor pitch — something that names what a practitioner actually learns from traces they would not learn from metrics. Candidates to scout: writing from Braintrust, Laminar, LangSmith users; UK AISI "Inspect" documentation; academic work on agent interpretability.

  3. A context-discovery or retrieval-failure diagnosis piece. Strengthens Shelf 1 and the input-legibility dimension. Candidates to scout: essays on "context engineering," RAG failure taxonomies, writing about how retrieval silently mislocates information.

  4. A clinical-AI or legal-AI deployment writeup. Strengthens Shelf 5 and breaks the software monopoly on the corpus. Candidates to scout: NEJM AI, Health Affairs pieces on deployed clinical models; Stanford HAI legal-AI field studies; ADA-style case reports.

  5. An empirical validity failure — a capability claim that did not generalise. Strengthens Shelf 4 with teeth. Candidates to scout: the benchmark-contamination / data-leakage post-mortem genre; recent retractions or qualifications of capability claims; papers that show a construct measurement reversed under a modest distribution shift.

All five should still fit the 2025+ cutoff. If a candidate I surface is pre-2025, the right move is to flag a 2025+ follow-up or commentary that points back to it.