Extended Capability Library

A reading list on how AI systems gain capability in practice — not from models alone, but from the harnesses, validation loops, repair feedback, and task structures built around them.

Sources from 2025 and later.

Daniel S. Griffin · note
Sessions frequently recommend or start work that another session already owns.

A ChatGPT Deep Research query and result asking how task queues, multi-agent systems, advisory locks, CI/CD systems, collaborative editing tools, and lightweight session manifests handle claim, heartbeat, expiry, and duplicate-work prevention.

framework-piece · software · strongly-procedural · execution-harness · validation-harness · monitoring-harness · +2 · unverified
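
The pattern Griffin's query surveys can be made concrete in a few lines. A minimal sketch of a lightweight session manifest with claim, heartbeat, and expiry, assuming a shared JSON file; all names are illustrative, and a real deployment would add atomic writes or a proper advisory lock to close the read-modify-write race:

```python
import json
import time
from pathlib import Path

MANIFEST = Path("sessions.json")  # hypothetical shared manifest file
TTL = 120                         # seconds before a stale claim expires

def _load() -> dict:
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def claim(task_id: str, session_id: str) -> bool:
    """Claim a task unless a live session already owns it."""
    tasks = _load()
    entry = tasks.get(task_id)
    if entry and time.time() - entry["heartbeat"] < TTL:
        return False  # owner is still heartbeating: skip to avoid duplicate work
    tasks[task_id] = {"owner": session_id, "heartbeat": time.time()}
    MANIFEST.write_text(json.dumps(tasks))
    return True

def heartbeat(task_id: str, session_id: str) -> None:
    """Refresh the claim so other sessions keep seeing it as live."""
    tasks = _load()
    if tasks.get(task_id, {}).get("owner") == session_id:
        tasks[task_id]["heartbeat"] = time.time()
        MANIFEST.write_text(json.dumps(tasks))
```
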
Nous Research · doc · source
The self-improving AI agent built by Nous Research.

The Hermes Agent README presents an open agent harness with model-provider switching, terminal and messaging interfaces, scheduled automations, isolated subagents, toolsets, persistent memory, session search, and a closed learning loop around skills.

case-study · cross-domain · strongly-procedural · input-shaping · grounding-context-loading · execution-harness · +5 · unverified
Alex Kotliarskyi, Victor Zhu, and Zach Brock · blog · source
The agents were fast, but we had a system bottleneck: human attention.

The authors describe Symphony, a spec and reference implementation that turns issue trackers such as Linear into always-on control planes for coding agents, shifting humans from supervising sessions to managing work.

framework-piece · software · strongly-procedural · execution-harness · validation-harness · repair-harness · +4 · unverified
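
The control-plane idea reduces to a small loop: poll the tracker for agent-ready work, claim an issue, run an agent session, post the result back. A sketch under assumptions; `fetch_ready_issues`, `claim`, `post_result`, and `run_agent` are hypothetical stand-ins, not Symphony's or Linear's actual API:

```python
import time

def control_plane_loop(tracker, run_agent, poll_seconds: int = 30) -> None:
    """Treat the issue tracker, not the chat session, as the unit of work."""
    while True:
        # hypothetical tracker methods; the real interface may differ
        for issue in tracker.fetch_ready_issues():
            if not tracker.claim(issue["id"]):         # skip work another agent owns
                continue
            outcome = run_agent(issue["description"])  # one agent session per issue
            tracker.post_result(issue["id"], outcome)  # humans review in the tracker
        time.sleep(poll_seconds)
```
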
Aparna Dhinakaran · tweet · source
LangChain is not a harness. LangGraph is not a harness.

Defines the modern agent harness as an out-of-the-box architecture that emerged from coding agents: an iteration loop over tools, context management, skill/tool discovery, permissions, hooks, session persistence, sub-agents, and project-context injection.

framework-piece · cross-domain · strongly-procedural · input-shaping · grounding-context-loading · execution-harness · +6 · unverified
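
Stripped to its core, the loop Dhinakaran describes is small; everything else in her list (permissions, hooks, persistence, sub-agents) hangs off it. A minimal sketch, assuming a `model` callable that returns either a tool call or a final answer:

```python
def run_harness(model, tools: dict, task: str, max_steps: int = 20) -> str:
    """The iteration loop at the center of the harness definition."""
    context = [{"role": "user", "content": task}]  # project-context injection goes here
    for _ in range(max_steps):
        action = model(context)                    # assumed: dict with a "type" key
        if action["type"] == "final":
            return action["content"]
        tool = tools[action["tool"]]               # tool discovery and permission
        observation = tool(**action["args"])       # checks would live at this boundary
        context.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("step budget exhausted")
```
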
Dana Feng, Bhada Yun, April Yi Wang · paper · source
Agency in software engineering is preconfigured at the organizational layer (policies, tooling defaults, CI guardrails) before individual preferences matter.

Three-phase mixed-methods study with 20 software engineers (10 junior, 10 senior) examining how agency is allocated between humans and agentic AI. Finds that organizational policies and norms preconfigure agency before individual preferences matter, with seniors maintaining control through delegation and juniors oscillating between over-reliance and resistance.

field-observation · software · mixed · social-harness · ratification-harness · interface-harness
Garry Tan · doc · source
The 2x people and the 100x people are using the same models. The difference is five concepts that fit on an index card.

Short, practitioner-facing ethos doc arguing that the durable leverage in agent systems comes from model-resident skills (markdown) and deterministic code at the edges, with the harness kept as thin as possible so each model upgrade flows through.

practitioner-note · software · anti-prescriptive · execution-harness · interface-harness · learning-harness
Andrej Karpathy · tweet · source
You rarely ever write or edit the wiki manually, it's the domain of the LLM.

Describes a personal research workflow where raw source documents are compiled by an LLM into a markdown wiki, maintained through index files, health checks, generated outputs, and lightweight tools rather than a heavyweight RAG stack.

practitioner-note · research · mixed · grounding-context-loading · execution-harness · validation-harness · +4 · unverified
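
The health-check idea is the most transferable piece, and it needs no RAG machinery. A sketch assuming a flat `wiki/` directory of markdown files with an `index.md`; the layout and regex are illustrative, not Karpathy's actual tooling:

```python
import re
from pathlib import Path

WIKI = Path("wiki")  # assumed layout: wiki/index.md linking to wiki/*.md

def health_check() -> list[str]:
    """Flag links to missing pages and pages the index has forgotten."""
    pages = {p.name for p in WIKI.glob("*.md")}
    index = (WIKI / "index.md").read_text()
    linked = set(re.findall(r"\]\(([^)]+\.md)\)", index))
    problems = [f"dead link: {name}" for name in sorted(linked - pages)]
    problems += [f"orphan page: {name}" for name in sorted(pages - linked - {"index.md"})]
    return problems
```
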
Michael Royzen · tweet · source
Standard Signal is the first hedge fund that researches and executes trades purely with AI. We train models to discover and trade on new fundamental truths about the world before humans can.

Launch announcement for a YC-backed hedge fund where AI models both generate hypotheses and execute trades. Included here as a domain-claim entry: markets-with-P&L are a paradigmatically favorable domain, offering a clean outcome signal, fast feedback, offline backtestability, and an institutionally ratified wrapper (a fund).

domain-claim · finance · strongly-procedural · validation-harness · ratification-harness · learning-harness · +1 · unverified
HumanLayer · blog · source
Skills, MCP servers, sub-agents, hooks, and back-pressure mechanisms are tactical solutions HumanLayer has arrived at.

Case-study framing of harness engineering for coding agents, with specific claims about what does and does not work (notably: role-based sub-agents don't work; sub-agents for context control do).

case-study · software · strongly-procedural · execution-harness · repair-harness · monitoring-harness · +1 · unverified
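
The claim worth operationalizing is sub-agents-for-context-control. A minimal sketch of that pattern, assuming a `model` callable; the point is that the large artifact is consumed in a throwaway context and only the digest returns to the parent:

```python
def subagent_digest(model, artifact: str, question: str) -> str:
    """Context control: spend tokens in a disposable context, keep the summary."""
    throwaway = [{"role": "user",
                  "content": f"{question}\n\n---\n\n{artifact}"}]
    return model(throwaway)  # only this digest re-enters the parent's context
```
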
Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas · paper · source
LM-elicited priors are often inaccurate and overconfident.

OpenEstimate is a multi-domain benchmark for testing whether language models can express calibrated Bayesian priors for numerical estimation tasks under uncertainty, using real-world datasets in healthcare, employment, and finance.

measurement-piece · cross-domain · strongly-procedural · validation-harness · grounding-context-loading
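
One way to see what a calibrated prior means operationally: elicit a distribution, then score it with a proper scoring rule against the realized value. A sketch assuming a Gaussian elicitation and a log score; OpenEstimate's actual elicitation formats and metrics may differ:

```python
import math

def gaussian_log_score(mu: float, sigma: float, truth: float) -> float:
    """Log density of the truth under the elicited N(mu, sigma^2) prior.
    Overconfident priors (sigma too small) are punished hard when wrong."""
    z = (truth - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
```
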
Andrew Maynard · essay · source
This book could not have been written without the learning and insights gained from working closely with one of the most powerful AI models available.

Maynard publishes the cut foreword to *AI and the Art of Being Human*, describing months of close collaboration with Claude while emphasizing human agency, manual refinement, AI tells, fictional allegories, and practical tools for staying human with AI.

case-study · education · mixed · input-shaping · validation-harness · repair-harness · +3 · unverified
Anthropic · blog · source
Agent Skills are organized folders of instructions, scripts, and resources that agents can discover and load dynamically to perform better at specific tasks.

Anthropic's engineering announcement of Agent Skills: a markdown-based pattern for extending Claude's capabilities by progressive disclosure. Important as an *institutional* ratification of the thin-harness / fat-skills framing.

framework-piece · software · mixed · grounding-context-loading · execution-harness · learning-harness · +1 · unverified
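
Progressive disclosure is the load-bearing phrase: the agent sees only skill names and one-line descriptions until it commits to one. A minimal sketch of that two-pass pattern, assuming a `skills/<name>/SKILL.md` layout; this illustrates the idea, not Anthropic's implementation:

```python
from pathlib import Path

SKILLS = Path("skills")  # assumed layout: skills/<name>/SKILL.md

def discover() -> dict[str, str]:
    """Pass 1: load only each skill's first line, cheap enough to always include."""
    return {d.name: (d / "SKILL.md").read_text().splitlines()[0]
            for d in SKILLS.iterdir() if (d / "SKILL.md").exists()}

def load(name: str) -> str:
    """Pass 2: pull full instructions and script pointers once a skill is chosen."""
    return (SKILLS / name / "SKILL.md").read_text()
```
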
Simon Willison · blog · source
A skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts.

Practitioner synthesis of Anthropic's Agent Skills feature, arguing the markdown-file pattern is conceptually simpler and more token-efficient than MCP, and that the ease of sharing a single file is the feature.

synthesis-node · software · mixed · grounding-context-loading · execution-harness · learning-harness · +1
Daniel Miessler · essay · source
In the early days of prompt engineering (2023-2024) it was helpful to tell AI exactly how to do things, but this inversion probably happened somewhere in 2025.

Argues that good harness engineering focuses on who the user is and what they're trying to accomplish — the 'what' — and lets the model handle the 'how'. Pairs with Miessler's 'Bitter Lesson Engineering' as a design discipline for scaffolding that extends capability rather than compensating for model weakness.

framework-piece · cross-domain · mixed · input-shaping · grounding-context-loading · execution-harness · +3 · unverified
Unknown (OpenReview: 21UFlJrmS2) · paper · source

Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods past the easy cases.

measurement-piece · research · mixed · validation-harness · learning-harness · unverified
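
The mechanism is simple to state even where the training details are not: replace the binary verifier with a rubric and average per-criterion judgments. A sketch with hypothetical criteria and an assumed `judge` callable (in practice usually a grader model) returning 0 or 1 per criterion:

```python
RUBRIC = [  # hypothetical criteria for a research-summary task
    "cites at least one primary source",
    "states its main uncertainty explicitly",
    "proposes a concrete next step",
]

def rubric_reward(response: str, judge) -> float:
    """Mean of per-criterion judgments stands in for a verifiable outcome."""
    return sum(judge(criterion, response) for criterion in RUBRIC) / len(RUBRIC)
```
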
OECD · doc · source

OECD full report on how public-sector workforces are (and are not) prepared to deploy AI. Brought into the library as a governance-piece anchor: the argument is that whether an AI system is capable *in practice* depends on the institutional scaffolding around its use, not only on the model or the harness.

governance-piece · operations · strongly-procedural · ratification-harness · social-harness · monitoring-harness · unverified
Daniel Miessler · essay · source
As AI gets better, Bitter Lesson Engineering becomes increasingly important.

Leans on Richard Sutton's 'The Bitter Lesson' to argue that prescriptive scaffolding around AI systems is a losing strategy in the limit: you should specify intent precisely and let the best available model figure out the path.

framework-piece · cross-domain · anti-prescriptive · input-shaping · unverified
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo · paper · source
The paper provides a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence.

Proposes a validity-centered framework for AI evaluation that reasons explicitly about which evaluative claims the evidence actually supports, with detailed vision and language case studies. The operational companion to the Wallach/Jacobs position paper.

measurement-piece · research · strongly-procedural · validation-harness · ratification-harness · unverified
Ma et al. · paper · source

arXiv paper investigating how reinforcement learning with verifiable rewards (RLVR) generalizes beyond the easy cases (math, code) to more diverse domains. The technical paper in whose conceptual shadow Royzen's domain-claim entry sits.

domain-claim · research · strongly-procedural · learning-harness · validation-harness · unverified
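
For contrast, the easy case RLVR starts from looks like this: the environment itself is the verifier and the reward is an exit code. A sketch assuming pytest as the checker; paths and setup are illustrative:

```python
import subprocess
from pathlib import Path

def verifiable_reward(candidate_code: str) -> float:
    """Binary reward straight from a verifier: 1.0 iff the test suite passes."""
    Path("solution.py").write_text(candidate_code)
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```
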
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs · paper · source
The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons.

ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts.

measurement-piece · research · strongly-procedural · validation-harness · ratification-harness
Curation note

This library is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

About the schema

The schema is designed to preserve disagreement. One source may emphasize reward signal, another validation, another anti-prescriptive harnesses. The library should make those differences visible rather than flatten them.

Entries preserve commentary on independent axes: reward richness, repairability, input legibility, and others. A verifiable outcome (pass/fail, P&L) is not the same as diagnostic feedback: why it failed, where to repair, and what work made the system usable.
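
That distinction can be made concrete in a few lines. A sketch of the two shapes of feedback the schema keeps separate; field names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    passed: bool                        # verifiable outcome: enough for a reward signal
    failing_step: Optional[str] = None  # diagnostic feedback: where to repair
    suggestion: Optional[str] = None    # ...and what a fix might look like

# A bare pass/fail bit feeds a validation harness; the optional fields
# are what a repair harness actually consumes.
```
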