Extended Capability Library

A reading list on how AI systems gain capability in practice — not from models alone, but from the harnesses, validation loops, repair feedback, and task structures built around them.

Sources from 2025 and later.

Daniel S. Griffin · note
Sessions frequently recommend or start work that another session already owns.

A ChatGPT Deep Research query and result asking how task queues, multi-agent systems, advisory locks, CI/CD systems, collaborative editing tools, and lightweight session manifests handle claim, heartbeat, expiry, and duplicate-work prevention.

framework-piece · software · strongly-procedural · execution-harness · validation-harness · monitoring-harness · +2 · unverified
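
The pattern Griffin's query surveys can be made concrete in a few lines. A minimal sketch of a lightweight session manifest with claim, heartbeat, and expiry, assuming a shared JSON file; all names are illustrative, and a real deployment would add atomic writes or a proper advisory lock to close the read-modify-write race:

```python
import json
import time
from pathlib import Path

MANIFEST = Path("sessions.json")  # hypothetical shared manifest file
TTL = 120                         # seconds before a stale claim expires

def _load() -> dict:
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def claim(task_id: str, session_id: str) -> bool:
    """Claim a task unless a live session already owns it."""
    tasks = _load()
    entry = tasks.get(task_id)
    if entry and time.time() - entry["heartbeat"] < TTL:
        return False  # owner is still heartbeating: skip to avoid duplicate work
    tasks[task_id] = {"owner": session_id, "heartbeat": time.time()}
    MANIFEST.write_text(json.dumps(tasks))
    return True

def heartbeat(task_id: str, session_id: str) -> None:
    """Refresh the claim so other sessions keep seeing it as live."""
    tasks = _load()
    if tasks.get(task_id, {}).get("owner") == session_id:
        tasks[task_id]["heartbeat"] = time.time()
        MANIFEST.write_text(json.dumps(tasks))
```
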
Nous Research · doc · source
The self-improving AI agent built by Nous Research.

The Hermes Agent README presents an open agent harness with model-provider switching, terminal and messaging interfaces, scheduled automations, isolated subagents, toolsets, persistent memory, session search, and a closed learning loop around skills.

case-study · cross-domain · strongly-procedural · input-shaping · grounding-context-loading · execution-harness · +5 · unverified
Alex Kotliarskyi, Victor Zhu, and Zach Brock · blog · source
The agents were fast, but we had a system bottleneck: human attention.

The authors describe Symphony, a spec and reference implementation that turns issue trackers such as Linear into always-on control planes for coding agents, shifting humans from supervising sessions to managing work.

framework-piece · software · strongly-procedural · execution-harness · validation-harness · repair-harness · +4 · unverified
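
The control-plane idea reduces to a small loop: poll the tracker for agent-ready work, claim an issue, run an agent session, post the result back. A sketch under assumptions; `fetch_ready_issues`, `claim`, `post_result`, and `run_agent` are hypothetical stand-ins, not Symphony's or Linear's actual API:

```python
import time

def control_plane_loop(tracker, run_agent, poll_seconds: int = 30) -> None:
    """Treat the issue tracker, not the chat session, as the unit of work."""
    while True:
        # hypothetical tracker methods; the real interface may differ
        for issue in tracker.fetch_ready_issues():
            if not tracker.claim(issue["id"]):         # skip work another agent owns
                continue
            outcome = run_agent(issue["description"])  # one agent session per issue
            tracker.post_result(issue["id"], outcome)  # humans review in the tracker
        time.sleep(poll_seconds)
```
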
Aparna Dhinakaran · tweet · source
LangChain is not a harness. LangGraph is not a harness.

Defines the modern agent harness as an out-of-the-box architecture that emerged from coding agents: an iteration loop over tools, context management, skill/tool discovery, permissions, hooks, session persistence, sub-agents, and project-context injection.

framework-piece · cross-domain · strongly-procedural · input-shaping · grounding-context-loading · execution-harness · +6 · unverified
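
Stripped to its core, the loop Dhinakaran describes is small; everything else in her list (permissions, hooks, persistence, sub-agents) hangs off it. A minimal sketch, assuming a `model` callable that returns either a tool call or a final answer:

```python
def run_harness(model, tools: dict, task: str, max_steps: int = 20) -> str:
    """The iteration loop at the center of the harness definition."""
    context = [{"role": "user", "content": task}]  # project-context injection goes here
    for _ in range(max_steps):
        action = model(context)                    # assumed: dict with a "type" key
        if action["type"] == "final":
            return action["content"]
        tool = tools[action["tool"]]               # tool discovery and permission
        observation = tool(**action["args"])       # checks would live at this boundary
        context.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("step budget exhausted")
```
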
Dana Feng, Bhada Yun, April Yi Wang · paper · source
Agency in software engineering is preconfigured at the organizational layer (policies, tooling defaults, CI guardrails) before individual preferences matter.

Three-phase mixed-methods study with 20 software engineers (10 junior, 10 senior) examining how agency is allocated between humans and agentic AI. Finds that organizational policies and norms preconfigure agency before individual preferences matter, with seniors maintaining control through delegation and juniors oscillating between over-reliance and resistance.

field-observation · software · mixed · social-harness · ratification-harness · interface-harness
Garry Tan · doc · source
The 2x people and the 100x people are using the same models. The difference is five concepts that fit on an index card.

Short, practitioner-facing ethos doc arguing that the durable leverage in agent systems comes from model-resident skills (markdown) and deterministic code at the edges, with the harness kept as thin as possible so each model upgrade flows through.

practitioner-note · software · anti-prescriptive · execution-harness · interface-harness · learning-harness
Andrej Karpathy · tweet · source
You rarely ever write or edit the wiki manually, it's the domain of the LLM.

Describes a personal research workflow where raw source documents are compiled by an LLM into a markdown wiki, maintained through index files, health checks, generated outputs, and lightweight tools rather than a heavyweight RAG stack.

practitioner-note · research · mixed · grounding-context-loading · execution-harness · validation-harness · +4 · unverified
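
The health-check idea is the most transferable piece, and it needs no RAG machinery. A sketch assuming a flat `wiki/` directory of markdown files with an `index.md`; the layout and regex are illustrative, not Karpathy's actual tooling:

```python
import re
from pathlib import Path

WIKI = Path("wiki")  # assumed layout: wiki/index.md linking to wiki/*.md

def health_check() -> list[str]:
    """Flag links to missing pages and pages the index has forgotten."""
    pages = {p.name for p in WIKI.glob("*.md")}
    index = (WIKI / "index.md").read_text()
    linked = set(re.findall(r"\]\(([^)]+\.md)\)", index))
    problems = [f"dead link: {name}" for name in sorted(linked - pages)]
    problems += [f"orphan page: {name}" for name in sorted(pages - linked - {"index.md"})]
    return problems
```
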
Michael Royzen · tweet · source
Standard Signal is the first hedge fund that researches and executes trades purely with AI. We train models to discover and trade on new fundamental truths about the world before humans can.

Launch announcement for a YC-backed hedge fund where AI models both generate hypotheses and execute trades. Included here as a domain-claim entry: markets-with-P&L are a paradigmatically favorable domain, offering a clean outcome signal, fast feedback, offline backtestability, and an institutionally ratified wrapper (a fund).

domain-claim · finance · strongly-procedural · validation-harness · ratification-harness · learning-harness · +1 · unverified
HumanLayer · blog · source
Skills, MCP servers, sub-agents, hooks, and back-pressure mechanisms are tactical solutions HumanLayer has arrived at.

Case-study framing of harness engineering for coding agents, with specific claims about what does and does not work (notably: role-based sub-agents don't work; sub-agents for context control do).

case-study · software · strongly-procedural · execution-harness · repair-harness · monitoring-harness · +1 · unverified
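
The claim worth operationalizing is sub-agents-for-context-control. A minimal sketch of that pattern, assuming a `model` callable; the point is that the large artifact is consumed in a throwaway context and only the digest returns to the parent:

```python
def subagent_digest(model, artifact: str, question: str) -> str:
    """Context control: spend tokens in a disposable context, keep the summary."""
    throwaway = [{"role": "user",
                  "content": f"{question}\n\n---\n\n{artifact}"}]
    return model(throwaway)  # only this digest re-enters the parent's context
```
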
Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · paper · source
LM-elicited priors are often inaccurate and overconfident.

OpenEstimate is a multi-domain benchmark for testing whether language models can express calibrated Bayesian priors for numerical estimation tasks under uncertainty, using real-world datasets in healthcare, employment, and finance.

measurement-piece · cross-domain · strongly-procedural · validation-harness · grounding-context-loading
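
One way to see what a calibrated prior means operationally: elicit a distribution, then score it with a proper scoring rule against the realized value. A sketch assuming a Gaussian elicitation and a log score; OpenEstimate's actual elicitation formats and metrics may differ:

```python
import math

def gaussian_log_score(mu: float, sigma: float, truth: float) -> float:
    """Log density of the truth under the elicited N(mu, sigma^2) prior.
    Overconfident priors (sigma too small) are punished hard when wrong."""
    z = (truth - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
```
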
Andrew Maynard · essay · source
This book could not have been written without the learning and insights gained from working closely with one of the most powerful AI models available.

Maynard publishes the cut foreword to *AI and the Art of Being Human*, describing months of close collaboration with Claude while emphasizing human agency, manual refinement, AI tells, fictional allegories, and practical tools for staying human with AI.

case-study · education · mixed · input-shaping · validation-harness · repair-harness · +3 · unverified
Anthropic · blog · source
Agent Skills are organized folders of instructions, scripts, and resources that agents can discover and load dynamically to perform better at specific tasks.

Anthropic's engineering announcement of Agent Skills: a markdown-based pattern for extending Claude's capabilities by progressive disclosure. Important as an *institutional* ratification of the thin-harness / fat-skills framing.

framework-piece · software · mixed · grounding-context-loading · execution-harness · learning-harness · +1 · unverified
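
Progressive disclosure is the load-bearing phrase: the agent sees only skill names and one-line descriptions until it commits to one. A minimal sketch of that two-pass pattern, assuming a `skills/<name>/SKILL.md` layout; this illustrates the idea, not Anthropic's implementation:

```python
from pathlib import Path

SKILLS = Path("skills")  # assumed layout: skills/<name>/SKILL.md

def discover() -> dict[str, str]:
    """Pass 1: load only each skill's first line, cheap enough to always include."""
    return {d.name: (d / "SKILL.md").read_text().splitlines()[0]
            for d in SKILLS.iterdir() if (d / "SKILL.md").exists()}

def load(name: str) -> str:
    """Pass 2: pull full instructions and script pointers once a skill is chosen."""
    return (SKILLS / name / "SKILL.md").read_text()
```
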
Simon Willison · blog · source
A skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts.

Practitioner synthesis of Anthropic's Agent Skills feature, arguing the markdown-file pattern is conceptually simpler and more token-efficient than MCP, and that the ease of sharing a single file is the feature.

synthesis-node · software · mixed · grounding-context-loading · execution-harness · learning-harness · +1
Daniel Miessler · essay · source
In the early days of prompt engineering (2023-2024) it was helpful to tell AI exactly how to do things, but this inversion probably happened somewhere in 2025.

Argues that good harness engineering focuses on who the user is and what they're trying to accomplish — the 'what' — and lets the model handle the 'how'. Pairs with Miessler's 'Bitter Lesson Engineering' as a design discipline for scaffolding that extends capability rather than compensating for model weakness.

framework-piece · cross-domain · mixed · input-shaping · grounding-context-loading · execution-harness · +3 · unverified
Unknown (OpenReview: 21UFlJrmS2) · paper · source

Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods past the easy cases.

measurement-piece · research · mixed · validation-harness · learning-harness · unverified
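
The mechanism is simple to state even where the training details are not: replace the binary verifier with a rubric and average per-criterion judgments. A sketch with hypothetical criteria and an assumed `judge` callable (in practice usually a grader model) returning 0 or 1 per criterion:

```python
RUBRIC = [  # hypothetical criteria for a research-summary task
    "cites at least one primary source",
    "states its main uncertainty explicitly",
    "proposes a concrete next step",
]

def rubric_reward(response: str, judge) -> float:
    """Mean of per-criterion judgments stands in for a verifiable outcome."""
    return sum(judge(criterion, response) for criterion in RUBRIC) / len(RUBRIC)
```
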
OECD · doc · source

OECD full report on how public-sector workforces are (and are not) prepared to deploy AI. Brought into the library as a governance-piece anchor: the argument is that whether an AI system is capable *in practice* depends on the institutional scaffolding around its use, not only on the model or the harness.

governance-piece · operations · strongly-procedural · ratification-harness · social-harness · monitoring-harness · unverified
Daniel Miessler · essay · source
As AI gets better, Bitter Lesson Engineering becomes increasingly important.

Leans on Richard Sutton's 'The Bitter Lesson' to argue that prescriptive scaffolding around AI systems is a losing strategy in the limit: you should specify intent precisely and let the best available model figure out the path.

framework-piece · cross-domain · anti-prescriptive · input-shaping · unverified
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo · paper · source
The paper provides a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence.

Proposes a validity-centered framework for AI evaluation that reasons explicitly about which evaluative claims the evidence actually supports, with detailed vision and language case studies. The operational companion to the Wallach/Jacobs position paper.

measurement-piece · research · strongly-procedural · validation-harness · ratification-harness · unverified
Ma et al. · paper · source

arXiv paper investigating how reinforcement learning with verifiable rewards (RLVR) generalizes beyond the easy cases (math, code) to more diverse domains. The technical paper in whose conceptual shadow Royzen's domain-claim entry sits.

domain-claim · research · strongly-procedural · learning-harness · validation-harness · unverified
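
For contrast, the easy case RLVR starts from looks like this: the environment itself is the verifier and the reward is an exit code. A sketch assuming pytest as the checker; paths and setup are illustrative:

```python
import subprocess
from pathlib import Path

def verifiable_reward(candidate_code: str) -> float:
    """Binary reward straight from a verifier: 1.0 iff the test suite passes."""
    Path("solution.py").write_text(candidate_code)
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```
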
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs · paper · source
The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons.

ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts.

measurement-piece · research · strongly-procedural · validation-harness · ratification-harness
Curation note

This library is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

About the schema

The schema is designed to preserve disagreement. One source may emphasize reward signal, another validation, another anti-prescriptive harnesses. The library should make those differences visible rather than flatten them.

Entries preserve commentary on independent axes: reward richness, repairability, input legibility, and others. A verifiable outcome (pass/fail, P&L) is not the same as diagnostic feedback: why it failed, where to repair, and what work made the system usable.
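
That distinction can be made concrete in a few lines. A sketch of the two shapes of feedback the schema keeps separate; field names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    passed: bool                        # verifiable outcome: enough for a reward signal
    failing_step: Optional[str] = None  # diagnostic feedback: where to repair
    suggestion: Optional[str] = None    # ...and what a fix might look like

# A bare pass/fail bit feeds a validation harness; the optional fields
# are what a repair harness actually consumes.
```
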