# Extended Capability Library

_Markdown export. Canonical UI: https://danielsgriffin.com/library._
_Structured data: https://danielsgriffin.com/library.json._

---

## Schema

*Notes and sources on AI capability in practice.*

An annotated, human-curated library of recent material on where AI capability
in practice comes from. Sources are tweets, blog posts, essays, talks, papers,
and practitioner notes — the kinds of material where the field is actively
arguing.

The library is a thinking tool, not a bookmark manager or a research feed. The
route stays at `/library` for now; the directory and the loader are likewise
unchanged, so the name can evolve without URL churn.

## Purpose

The library exists to help analyze the **extensions around models** that shape
whether AI systems are actually capable in the field:

- harnesses (input-shaping, validation, repair, ratification, monitoring, …)
- validation structures and construct validity
- repair loops and diagnostic feedback
- reward signals and domain structure
- observability, reversibility, offline evaluability
- institutional ratification and diffusion

It should help answer questions like:

- What kinds of harnesses make AI systems more capable in practice?
- What kinds of validation are *constitutive* of capability rather than post-hoc checks?
- Which domains are especially favorable because they offer strong reward signals,
  repair loops, or offline evaluability?
- How do practitioner theories disagree? (thin-harness vs. harness-engineering,
  verifiable-reward framings vs. construct-validity framings, anti-prescriptive
  vs. strongly-procedural harness theories.)

## Scope: 2025 and later only

This library **only accepts sources dated 2025-01-01 or later**. The loader
throws on pre-cutoff entries; the CLI helper refuses to create them; the list
page reminds readers that the cutoff is deliberate.

Why: the "AI harness" conversation crystallized in 2025. A 2023 blog post about
prompt engineering lives in a different conceptual world and would muddle the
corpus.
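
The mechanism is a plain date comparison at load time. A minimal sketch,
assuming a standalone helper rather than the actual loader code in
`src/lib/library.ts`:

```ts
// Sketch only; the real loader enforces the same rule with its own wording.
const CUTOFF = '2025-01-01'

function assertPostCutoff(slug: string, date: string): void {
  // ISO YYYY-MM-DD strings compare correctly as plain strings.
  if (date < CUTOFF) {
    throw new Error(`${slug}: dated ${date}, before the ${CUTOFF} cutoff`)
  }
}
```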

## Data model

One Markdown file per entry at `content/library/<slug>.md`. The source of truth
is the file — diffs are useful, editing is local-first, no database required
for reading or writing entries.

### Frontmatter fields

**Required metadata**

| Field | Type | Notes |
| --- | --- | --- |
| `title` | string | |
| `author` | string | |
| `date` | `YYYY-MM-DD` | Must be 2025-01-01 or later. |
| `source_type` | enum | `tweet`, `blog`, `essay`, `paper`, `talk`, `doc`, `podcast`, `note` |
| `role` | enum | See *Role* below. |
| `domain` | enum | See *Domain* below. |

**Optional metadata**

| Field | Type | Notes |
| --- | --- | --- |
| `url` | string | Canonical source URL. |
| `excerpt` | string | Short direct quote. |
| `summary` | string | One- or two-sentence gloss. |
| `notes` | string | Editorial notes. |
| `why_it_matters` | string | Shown prominently on the entry page. |
| `tags` | string[] | Free-form. |
| `verification_needed` | bool | True if metadata is placeholder. |
| `verification_note` | string | What needs confirming. |

**Multi-select classification**

| Field | Allowed values |
| --- | --- |
| `harness_types` | `input-shaping`, `grounding-context-loading`, `execution-harness`, `validation-harness`, `repair-harness`, `ratification-harness`, `monitoring-harness`, `learning-harness`, `social-harness`, `interface-harness` |
| `validation_position` | `before-generation`, `during-generation`, `immediately-after-generation`, `before-action`, `post-deployment`, `continuous` |
| `validation_mode` | `mechanical`, `empirical`, `social`, `institutional`, `interpretive`, `adversarial` |
| `relation_to_argument` | `capability-is-extended`, `validation-is-constitutive`, `repairability-matters`, `first-mile-input-formation`, `institutions-shape-capability`, `reward-structure-matters`, `domain-structure-matters`, `observability-matters`, `breakdown-when-harness-absent`, `diffusion-adoption-bottleneck` |

**Interpretive**

| Field | Allowed values |
| --- | --- |
| `prescription_stance` | `anti-prescriptive`, `mixed`, `strongly-procedural` |

**Extended capability dimensions**

`dimensions` is a map from dimension key to either a number (`1..5`) or an
object `{ score: 1..5, note?: string }`. Valid keys:

- `input_legibility`
- `task_structure`
- `reward_richness`
- `feedback_latency`
- `repairability`
- `observability`
- `reversibility`
- `offline_evaluability`
- `institutional_ratification`

**Design rule: score these independently.** A verifiable outcome is not the
same as diagnostic feedback. An entry can (and should) score high on
`reward_richness` while scoring low on `repairability` or `input_legibility` —
the schema exists precisely so those tensions stay visible.
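
In TypeScript terms the shape is roughly the following (a sketch; the
canonical types live in `src/lib/library-schema.ts`):

```ts
// Sketch of the dimensions shape described above; canonical definitions
// live in src/lib/library-schema.ts.
type DimensionKey =
  | 'input_legibility' | 'task_structure' | 'reward_richness'
  | 'feedback_latency' | 'repairability' | 'observability'
  | 'reversibility' | 'offline_evaluability' | 'institutional_ratification'

// Either a bare 1..5 score or an object carrying the score plus a note.
type DimensionValue = number | { score: number; note?: string }

type Dimensions = Partial<Record<DimensionKey, DimensionValue>>
```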

### Body

Everything after the closing `---` is the **annotation**: prose commentary,
cross-references, open questions, disagreements with neighboring entries.
Markdown is supported; react-markdown renders it with `rehype-raw` so limited
HTML works.

## Field meanings

### Role

- **synthesis-node** — a piece that pulls multiple threads together.
- **practitioner-note** — a short, first-person observation from someone shipping.
- **field-observation** — reported-out pattern across practitioners.
- **framework-piece** — proposes vocabulary, taxonomy, or architecture.
- **case-study** — narrative of a specific deployment or build.
- **domain-claim** — asserts something about the structure of a domain.
- **measurement-piece** — about evaluation, benchmarks, or construct validity.
- **governance-piece** — about ratification, policy, adoption, or institutional fit.

### Harness types

Sketch of what each covers:

- **input-shaping** — forming the prompt/request/context before generation.
- **grounding-context-loading** — retrieving, routing, and packaging context.
- **execution-harness** — the scaffolding the model acts within (tools, plans).
- **validation-harness** — checks on candidate outputs before they ship.
- **repair-harness** — diagnostic feedback that routes errors back into generation.
- **ratification-harness** — who/what blesses the output as valid.
- **monitoring-harness** — continuous observation post-deployment.
- **learning-harness** — signals that feed model or system updates.
- **social-harness** — norms, review cultures, and human practices around the system.
- **interface-harness** — the surface through which humans interact.

### Domain

Single value: `software`, `law`, `medicine`, `finance`, `research`, `education`,
`security`, `operations`, `cross-domain`, `other`.

## Preserving disagreement

The library is designed to let sources disagree cleanly:

- Tan-style *thin harness / fat skills* vs. Miessler-style *harness engineering*.
- Royzen-style *verifiable reward is the lever* vs. Jacobs-style *construct
  validity is prior to any reward signal*.
- Anti-prescriptive stances vs. strongly-procedural stances.

Filters and dimension scales are the primary way this shows up — two entries
can be tagged with opposite `prescription_stance` values, or score at opposite
ends on `reward_richness` vs. `repairability`, and both remain legible side by
side.

## Golden path: add an entry end-to-end

The six-step loop. Nothing else is required and nothing is automated.

1. **Scaffold.** From the repo root:

   ```bash
   bun scripts/new-library-entry.ts \
     --slug=someone-short-title \
     --title="The Title" \
     --author="Someone" \
     --date=2025-11-03 \
     --url="https://..." \
     --source-type=essay
   ```

   `--slug`, `--title`, and `--author` are required. Run with `--help` to see
   every flag; pass `--draft` to prefix the filename with `_` so the loader
   skips the entry until you rename it.

2. **Fill in metadata.** Open `content/library/<slug>.md` in your editor. The
   template has every enum's allowed values as a comment next to the field, so
   you don't need to come back here. Focus first on: `role`, `domain`,
   `relation_to_argument`, one or two `harness_types`, and any scored
   `dimensions` that actually apply. Leave the rest empty.

3. **Run the site.**

   ```bash
   bun dev
   ```

   Then visit <http://localhost:3000/library/your-slug>. A broken entry does
   *not* break the rest of the library in dev — the loader logs the error to
   your terminal, skips the entry, and keeps going. Fix the logged issue and
   the page refreshes.

4. **Verify.** When title, author, date, URL, and excerpt are all confirmed,
   set `verification_needed: false` (or delete the line). The "unverified"
   chip and the amber banner disappear.

5. **Run the relationship pass.** Any time you identify a related entry,
   open that related entry and decide whether it needs a reciprocal update.
   Do this before shipping, not as a later curation chore.

   Minimum loop:

   ```bash
   rg -n "keyword|author|concept" content/library
   bun test tests/library/
   ```

   For each related entry, check:

   - Should its annotation mention the new source?
   - Should its `tags`, `harness_types`, `relation_to_argument`, or dimension
     notes change?
   - Should `_shelves.md` change because the conceptual grouping shifted?
   - Does the visible related-entry output make sense, or do metadata overlaps
     need tuning?

   If no reciprocal edit is needed, leave the entry alone. The point is to
   inspect every related entry, not to force backlinks everywhere.

6. **Ship.** Commit the new file. `bun build` runs in strict mode — a bad
   entry fails the build instead of being skipped, so CI will catch problems
   the dev server hid.

   ```bash
   git add content/library/your-slug.md
   git commit -m "library: add <slug> on <shelf>"
   ```

### Tests

```bash
bun test tests/
```

Covers loader validation, pre-2025 rejection, enum errors, and filter logic.
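
A sketch of what the pre-2025 rejection test might look like in `bun:test`
style (the loader import here is assumed; the real tests live in
`tests/library/`):

```ts
import { expect, test } from 'bun:test'

// Hypothetical single-entry loader; the actual export may differ.
import { loadEntry } from '../src/lib/library'

test('rejects entries dated before 2025-01-01', () => {
  // The _ prefix keeps this fixture out of normal loads.
  expect(() => loadEntry('content/library/_fixture-2024.md')).toThrow()
})
```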

### Without the CLI

Copy an existing entry (`content/library/tan-thin-harness-fat-skills.md` is a
good template) and edit. The loader validates enum values, so a typo in `role`
or `harness_types` fails loudly with a "Did you mean" suggestion and the file
path.
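
The suggestion logic can be sketched as a nearest match over the allowed
values (edit distance is an assumption; the actual loader may pick
suggestions differently):

```ts
// Sketch of a "Did you mean" check for enum-valued frontmatter fields.
function levenshtein(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  )
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      d[i][j] = Math.min(
        d[i - 1][j] + 1,      // deletion
        d[i][j - 1] + 1,      // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      )
  return d[a.length][b.length]
}

function checkEnum(field: string, value: string, allowed: string[], file: string): void {
  if (allowed.includes(value)) return
  const nearest = [...allowed].sort(
    (x, y) => levenshtein(value, x) - levenshtein(value, y),
  )[0]
  throw new Error(`${file}: invalid ${field} "${value}". Did you mean "${nearest}"?`)
}
```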

## Running the site

```bash
bun install
bun dev          # http://localhost:3000/library
bun build        # static build
```

The library is pure file-system data — no database, no API keys, no workflow
infrastructure. The code is deliberately split across three files:

- `src/lib/library-schema.ts` — types and controlled vocabularies (pure).
- `src/lib/library-filters.ts` — filter state, search, related-entry logic (pure).
- `src/lib/library.ts` — filesystem loader, caching, facet helpers (server-only).

A future static export, notebook analysis, or separate UI can consume
`library-schema` + `library-filters` without touching Next.js.

## Views

- **`/library`** — default list with search and facet filters.
- **`/library/<slug>`** — single entry with classification, dimension bars,
  annotation body, and related entries computed from overlapping tags,
  relation-to-argument, and harness types (the overlap idea is sketched after
  this list).
- **`/library/compare`** — table of every entry against prescription stance
  and five load-bearing dimensions (input legibility, reward richness,
  repairability, observability, offline evaluability). Designed to surface
  contrasts like "high reward richness, low repairability."
- **`/library.json`** — static JSON export of the full corpus, prerendered at
  build time. Use it from notebooks or external tooling; the shape mirrors
  `LibraryEntry`.
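
The related-entry overlap can be sketched as a count of shared facet values
(the real scoring lives in `src/lib/library-filters.ts` and may weight facets
differently):

```ts
// Sketch only; assumes entries expose these arrays as documented in the
// frontmatter tables above.
type FacetedEntry = {
  tags?: string[]
  harness_types?: string[]
  relation_to_argument?: string[]
}

function overlap(a: string[] = [], b: string[] = []): number {
  return a.filter((value) => b.includes(value)).length
}

function relatedness(a: FacetedEntry, b: FacetedEntry): number {
  return (
    overlap(a.tags, b.tags) +
    overlap(a.harness_types, b.harness_types) +
    overlap(a.relation_to_argument, b.relation_to_argument)
  )
}
```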

## Reusing the corpus outside the site

```ts
import { getAllEntries } from './src/lib/library'
const entries = getAllEntries()
// entries is LibraryEntry[] — no React, no Next.js dependencies.
```

Or, from any language:

```bash
curl https://danielsgriffin.com/library.json | jq '.entries[] | select(.domain == "finance")'
```

## Seed entries

Entries live in `content/library/`, spread across the main disagreement axes
(thin harness / anti-prescriptive / verifiable-reward / measurement-validity /
institutional-scaffolding); see that directory for the full list.

Several entries are still flagged `verification_needed: true` — usually just
the exact publish date or an author name that automated fetching could not
confirm. Each flagged entry names what specifically needs verifying in its
`verification_note`.

The normal local importer is intentionally simple:

```bash
bun scripts/seed-library.ts --dry-run
bun scripts/seed-library.ts
```

For the live Supabase database, use the production wrapper. It requires an
explicit remote `DATABASE_URL`, defaults to dry-run, refuses localhost unless
`--allow-local` is passed, stores a before/after snapshot in
`dsg_library_seed_runs`, and records per-entry versions in
`dsg_library_seed_entry_versions`.

```bash
DATABASE_URL=... bun run library:seed:prod --dry-run
DATABASE_URL=... bun run library:seed:prod --apply --only=renda-openestimate
DATABASE_URL=... bun run library:seed:prod --apply --soft-delete-stale
DATABASE_URL=... bun run library:seed:prod --list-runs
DATABASE_URL=... bun run library:seed:prod --apply --rollback=<run_id>
```

`--soft-delete-stale` never hard-deletes rows; it marks DB-only active library
rows as `content_type='library_deleted'`. Use it only when stale DB rows have
been reviewed.

Operational lessons from production seeding:

- Use `--dry-run` first and seed targeted entries with `--only=<slug>` when
  adding a small batch.
- Treat source dates as date-only. The seeding scripts write noon UTC so
  Pacific-time rendering does not show the previous day (see the sketch after
  this list).
- A live DB update may not be visible until the deployed app process refreshes
  its in-memory library cache. If the DB row is correct but the public page is
  stale, deploy/restart the site and verify the public URL again.
- Do not use `--soft-delete-stale` to clean up DB-only rows unless those rows
  have been explicitly reviewed.
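
The noon-UTC convention is small enough to sketch inline (the helper name is
assumed, not the seeding scripts' actual code):

```ts
// Date-only semantics: write noon UTC so Pacific-time rendering (UTC-7/-8)
// never slips to the previous calendar day.
function toSeedTimestamp(date: string): Date {
  return new Date(`${date}T12:00:00Z`)
}

// toSeedTimestamp('2025-11-03') renders as Nov 3 in both UTC and Pacific;
// midnight UTC would render as Nov 2 in Pacific.
```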

## Next curation priority

The current corpus is well-populated on practitioner harnesses,
measurement/validity, reward structure, and institutional scaffolding. The
next pass should prioritise sources that strengthen three underrepresented
dimensions:

- **Repairability** — diagnostic feedback, error attribution, repair loops
  as distinct from verifiable outcomes. Concrete traces of failure-mode
  triage rather than pass/fail.
- **Input legibility / context discovery** — first-mile input formation,
  retrieval that surfaces missing context, situational-awareness patterns.
- **Observability** — runtime introspection, agent-trace tooling,
  post-deployment monitoring beyond aggregate metrics.

## Follow-up improvements

Not implemented yet; tracked here so they don't block shipping:

- Tag pages (`/library/tags/<tag>`).
- Import helper that accepts a pasted URL and pulls title/author/date as a
  draft (currently out of scope — the CLI template is the import path).
- Optional full-text search through the existing Supabase index (the library
  is small enough that client-side string search is fine for now).
- A separate draft directory (`content/library/_drafts/`) for work in progress
  — the loader already skips files starting with `_`.

---

## Shelves

# Shelves of disagreement

*A reading of the current 19-entry corpus in the Extended Capability Library.*

The schema is designed to preserve disagreement rather than flatten it. This
note groups the current entries into **seven conceptual shelves** and asks,
for each shelf: what does it claim, what does it miss, and how does it
relate to the broader argument that AI capability in practice is *extended*
by the scaffolding around the model rather than residing in the model alone?

Entries sit on more than one shelf — the shelving is a reading, not a
classification.

---

## 1. Harness architecture

*Entries:* [HumanLayer, "Skill Issue"](/library/humanlayer-skill-issue);
[Anthropic, "Agent Skills"](/library/anthropic-agent-skills);
[Willison, "Claude Skills are awesome"](/library/willison-claude-skills-bigger-than-mcp);
[Miessler, "Good and Bad Harness Engineering"](/library/miessler-good-and-bad-harness-engineering);
[Dhinakaran, "What Is an Agent Harness"](/library/dhinakaran-agent-harness);
[OpenAI, "Symphony"](/library/openai-symphony-codex-orchestration);
[Nous Research, "Hermes Agent README"](/library/nous-hermes-agent-readme);
[Karpathy, "LLM Knowledge Bases"](/library/karpathy-llm-knowledge-bases).

**Claims.** Specific architectural moves around the model — sub-agents used
for context control, skills as markdown with progressive disclosure, hooks,
back-pressure mechanisms, issue trackers as control planes, persistent
memory, and LLM-maintained knowledge bases — do real, non-absorbable work.
The gap between 2x and 100x practitioners is architectural, not cognitive.

**Misses.** Almost all of it is software-centric and assumes a skilled
technical user. The shelf says little about first-mile input formation from
non-technical users, or about domains where the harness spans legal,
clinical, or administrative workflow rather than a terminal. Also: the shelf
describes what harness moves *exist*, with less attention to *how to tell
which ones will be absorbed by the next model*. Symphony and Hermes broaden
the surface from coding sessions to orchestration and persistence, but still
lean technical.

**Relation.** The clearest statement of "capability is extended." The harness
is the extension; the shelf is the evidence it carries load. Its boundary
condition is the next shelf.

## 2. Anti-prescriptive / intent-first

*Entries:* [Tan, "Thin Harness, Fat Skills"](/library/tan-thin-harness-fat-skills);
[Miessler, "Bitter Lesson Engineering"](/library/miessler-bitter-lesson-engineering);
[Miessler, "Good and Bad Harness Engineering"](/library/miessler-good-and-bad-harness-engineering)
(partially — it prescribes a *discipline*, not procedures).

**Claims.** As models improve, prescriptive scaffolding is dead weight. State
intent precisely; let the model handle execution. The scaffolding that
survives is skills-as-knowledge (markdown describing *what*), not
skills-as-rails (code describing *how*).

**Misses.** The shelf tends to under-theorise repair, observability, and
institutional ratification. "Anti-prescriptive in principle" can collapse
into "don't worry about the harness" in practice — which does not explain
why HumanLayer keeps shipping back-pressure, or why coding-agent teams
converge on the same tactical solutions even when models improve.

**Relation.** The boundary condition for Shelf 1. Agrees that capability is
extended, but locates the durable extension in *intent* and *knowledge*
rather than in *procedure*. The sharpest contrast: same set of observations,
opposite prescriptions.

## 3. Reward-rich / verifiable domains

*Entries:* [Royzen, "Standard Signal"](/library/royzen-standard-signal);
[Expanding RLVR Across Diverse Domains](/library/expanding-rlvr-across-domains);
[Rubrics as Rewards](/library/rubrics-as-rewards) (extending the framing
past the easy cases); [Renda et al., "OpenEstimate"](/library/renda-openestimate)
(calibration under uncertainty as a reward/measurement problem).

**Claims.** Capability concentrates in domains that supply a crisp,
verifiable reward signal. Markets, theorem proving, test-gated code. The
mechanism is RLVR; the downstream claim is that such domains are
structurally favoured regardless of model generation.

**Misses.** Consistently conflates *verifiable outcome* with *diagnostic
feedback*. A verified P&L tells you a model was wrong but not where; a
failed test suite tells you a patch broke something but not what. The shelf
understates the repairability problem — and, as the next shelf argues,
the construct-validity problem under the word "verifiable."

**Relation.** The domain-structural variant of "capability is extended."
Locates the extension in the reward geometry of the domain rather than in
the harness or the model. Read together with Shelf 1 it gets sharper:
*given* a favourable domain, is the harness what compounds, or does the
domain do most of the work?

## 4. Measurement / validity / standards

*Entries:* [Wallach, Jacobs et al., "Social Science Measurement
Challenge"](/library/wallach-measurement-challenge);
[Salaudeen et al., "Measurement to Meaning"](/library/salaudeen-measurement-to-meaning);
[Rubrics as Rewards](/library/rubrics-as-rewards) (also a measurement
piece — rubrics *name* failure modes); [Renda et al.,
"OpenEstimate"](/library/renda-openestimate).

**Claims.** Evaluation is a social-science measurement problem. A benchmark
score is a claim about a construct; the claim is only as strong as the
validity of the instrument. OpenEstimate adds the uncertainty-calibration
variant: a model can produce a plausible answer while having a badly formed
belief about its own uncertainty. Most GenAI evaluation skips this step,
produces sloppy apples-to-oranges comparisons, and calls them capability
claims.

**Misses.** Diagnostic rigor outruns operational tractability. Validity
frameworks require social and institutional slack that practitioners
shipping products rarely have. The shelf is strong on diagnosis, weaker on
what a *partial-validity* claim looks like in flight — which is what most
deployed systems actually have.

**Relation.** Pushback on "capability is extended" that aims at the word
*capability*. Says: before arguing about what extends capability, be clear
on which construct you are claiming to measure. Useful discipline across
every other shelf, especially Shelf 3 (where the word "verifiable" is doing
work that validity theory would challenge).

## 5. Institutional scaffolding / workforce transition

*Entries:* [OECD, "Building an AI-ready public workforce"](/library/oecd-ai-ready-workforce);
[Anthropic, "Agent Skills"](/library/anthropic-agent-skills) (the vendor-
ratification side); [Royzen, "Standard Signal"](/library/royzen-standard-signal)
(a hedge fund is the institutional wrapper around the model);
[OpenAI, "Symphony"](/library/openai-symphony-codex-orchestration) (issue
trackers as organizational ratification); [Maynard, "Resurrecting deceased
darlings"](/library/maynard-resurrecting-deceased-darlings) (publication and
editorial judgment as ratification).

**Claims.** Whether an AI system is capable in practice depends on
institutional fit — training, procurement, accountability, legal form,
vendor ratification. A model-vendor blessing a pattern is a ratification
event; a workforce-readiness report is another; a fund is a third.
Deployments stall without the scaffolding regardless of model strength.

**Misses.** Feedback loops are slow (years, not cycles) so inference from
few entries is weak. The shelf currently leans on high-level documents and
meta-claims. It needs sector-specific case studies — hospitals, courts,
agencies — where institutional scaffolding demonstrably made or broke a
deployment. Without them the shelf stays gestural.

**Relation.** The large-scale variant of "capability is extended." Keeps
the library from collapsing into a coding-agent conversation. If Shelves
1–3 answer *what extends capability at the unit of an application*, this
shelf answers *what extends capability at the unit of an institution*.

## 6. Knowledge work / authorship / cumulative artifacts

*Entries:* [Karpathy, "LLM Knowledge Bases"](/library/karpathy-llm-knowledge-bases);
[Maynard, "Resurrecting deceased darlings"](/library/maynard-resurrecting-deceased-darlings);
[Dhinakaran, "What Is an Agent Harness"](/library/dhinakaran-agent-harness)
(the general harness frame); [Willison, "Claude Skills are awesome"](/library/willison-claude-skills-bigger-than-mcp)
(portable markdown knowledge).

**Claims.** In knowledge work, capability often comes from making work
cumulative: raw sources become markdown wikis, prompts become reusable
resources, outputs get filed back into the corpus, drafts become objects for
editorial judgment, and the human-facing interface remains inspectable.
The agent's contribution is not just answer generation; it is artifact
maintenance.

**Misses.** These entries have weaker mechanical validation than coding-agent
examples. Their feedback loops are editorial, interpretive, and social, which
makes them harder to score. A wiki can become coherent-looking while still
being wrong; a book can become more eloquent while drifting from reality or
the authors' voice.

**Relation.** This shelf prevents the library from treating "harness" as a
coding-only concept. It shows the same extension logic in research and
writing: legible artifacts, inspectable intermediate state, repairable
outputs, and human judgment over what gets kept.

## 7. Field handoffs / applied AI evidence

*Entries:* [Applied AI Handoff Atlas](/library/handoff-atlas);
[Maynard, "Resurrecting deceased darlings"](/library/maynard-resurrecting-deceased-darlings)
(knowledge-work handoff); [OpenAI, "Symphony"](/library/openai-symphony-codex-orchestration)
(coordination handoff); [Karpathy, "LLM Knowledge Bases"](/library/karpathy-llm-knowledge-bases)
(artifact-maintenance handoff).

**Claims.** Small deployed or semi-deployed systems can function as evidence
when they make the handoff explicit: which human function moved into the AI
system, what scaffolding made that move acceptable, what broke, and what
artifact remains. The unit of analysis is not "an AI app"; it is a transfer of
judgment, memory, access, practice, explanation, or representation.

**Misses.** The evidence is still uneven. Screenshots, changelogs, public
writeups, audits, and repository notes exist, but runtime traces, user quotes,
fixtures, and before/after examples are incomplete. The shelf should not
overclaim production maturity where the artifact is currently a field note.

**Relation.** This shelf is the field-evidence companion to the more
theoretical shelves. It grounds the extended-capability argument in applied
work: transparency pages, opt-in generation, local data boundaries, memory
approval, pronunciation-measurement failure, and memorial non-impersonation
constraints.

---

## The strongest current gap

Across all seven shelves, the thinnest evidentiary spot is **repair loops in
action**. The library has several entries that *theorise* about repair
(Miessler, HumanLayer on back-pressure, Rubrics as Rewards on naming failure
modes, OpenEstimate on uncertainty calibration, Dhinakaran on closed loops)
and several that assert that verifiable reward is the lever (Royzen,
Expanding RLVR). Symphony and Karpathy add stronger operating examples, and
the Handoff Atlas brings the gap closer to applied practice, but the corpus
still needs a concrete diagnostic-repair loop on a real failure: the step
where a system noticed it was wrong, attributed the wrongness to a specific
cause, and repaired itself or was repaired.

That is the sharpest unanswered empirical question in the current corpus:
does reward richness *compound* into capability through a repair loop, or
does it stall at pass/fail?

Two secondary gaps follow from this:

- **Observability traces.** Very few entries are grounded in runtime
  introspection of agent behavior — most are prescriptive or definitional.
- **Non-software domains beyond finance.** No medicine, no law, no
  administration. The shelves as currently stocked risk reading as a
  coding-agent anthology.

## Next five entries to add (priority order)

1. **A post-mortem of an agent run that traces a failure to a specific cause
   and the repair that followed.** Strengthens Shelves 1 and 3. What counts:
   explicit error attribution, explicit repair step, reported as observed
   rather than proposed. Candidates to scout: OpenHands / All Hands writeups,
   Factory AI / Cognition evaluations, engineering post-mortems that name an
   agent by name.

2. **A practical piece on agent observability / trace tooling.** Strengthens
   Shelves 1 and 4. Not a vendor pitch — something that names what a
   practitioner actually learns from traces they would not learn from
   metrics. Candidates to scout: writing from Braintrust, Laminar, LangSmith
   users; UK AISI "Inspect" documentation; academic work on agent
   interpretability.

3. **A context-discovery or retrieval-failure diagnosis piece.** Strengthens
   Shelf 1 and the input-legibility dimension. Candidates to scout: essays
   on "context engineering," RAG failure taxonomies, writing about how
   retrieval silently mislocates information.

4. **A clinical-AI or legal-AI deployment writeup.** Strengthens Shelf 5 and
   breaks the software monopoly on the corpus. Candidates to scout: NEJM AI,
   Health Affairs pieces on deployed clinical models; Stanford HAI legal-AI
   field studies; ADA-style case reports.

5. **An empirical validity failure — a capability claim that did not
   generalise.** Strengthens Shelf 4 with teeth. Candidates to scout: the
   benchmark-contamination / data-leakage post-mortem genre; recent
   retractions or qualifications of capability claims; papers that show a
   construct measurement reversed under a modest distribution shift.

All five should still fit the 2025+ cutoff. If a candidate I surface is pre-2025,
the right move is to flag a 2025+ follow-up or commentary that points back
to it.

---

## Entries

<!-- source: anthropic-agent-skills.md -->

---
title: "Equipping agents for the real world with Agent Skills"
author: "Anthropic"
date: "2025-10-16"
source_type: "blog"
url: "https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills"
verification_needed: true
verification_note: "URL and title verified; date borrowed from Simon Willison's same-day commentary (2025-10-16) — confirm from the Anthropic post header before citing."
excerpt: "Agent Skills are organized folders of instructions, scripts, and resources that agents can discover and load dynamically to perform better at specific tasks."
summary: "Anthropic's engineering announcement of Agent Skills: a markdown-based pattern for extending Claude's capabilities by progressive disclosure. Important as an *institutional* ratification of the thin-harness / fat-skills framing."
tags:
  - agent-skills
  - anthropic
  - progressive-disclosure
  - institutional-launch
  - markdown-skills
role: "framework-piece"
harness_types:
  - grounding-context-loading
  - execution-harness
  - learning-harness
  - ratification-harness
validation_position:
  - before-generation
  - during-generation
validation_mode:
  - empirical
  - institutional
domain: "software"
prescription_stance: "mixed"
relation_to_argument:
  - capability-is-extended
  - first-mile-input-formation
  - institutions-shape-capability
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
  task_structure:
    score: 4
  reward_richness:
    score: 3
  repairability:
    score: 3
  observability:
    score: 4
  offline_evaluability:
    score: 3
  institutional_ratification:
    score: 5
    note: "The model vendor officially blessing the markdown-skill pattern is a ratification event, not only a technical one."
why_it_matters: "When the model vendor publishes an engineering post describing the pattern, the pattern becomes a point of reference that downstream tooling, hiring, and documentation can anchor on. The library should treat this as institutional ratification of the thin-harness-adjacent thesis."
---

An important entry for the `institutions-shape-capability` axis. Anthropic's engineering post is not just a feature announcement — it is the moment the markdown-skill pattern gets a canonical, vendor-endorsed framing. That changes what downstream practitioners cite, what conference talks reference, and which designs are considered "default."

Three things to note:

1. **The pattern has been used internally at Anthropic for some time** (search snippet: "they now have hundreds of them in production"). The public post is ratification, not invention.
2. **Progressive disclosure** — scanning only metadata until a skill is relevant — is a specific design move and deserves to be tracked. It is not just "markdown files"; it is a loading strategy (sketched after this list).
3. **Agent Skills and MCP are presented as complementary**, not competing. That framing matters for the library's `harness_types` taxonomy.
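
A minimal sketch of point 2, with every name illustrative rather than Anthropic's implementation:

```ts
// Progressive disclosure: only skill metadata enters the context up front;
// the full markdown body is loaded when a skill becomes relevant.
type SkillMeta = { name: string; description: string; path: string }

function relevantSkills(metas: SkillMeta[], task: string): SkillMeta[] {
  // Cheap pass over metadata only (the matching heuristic is illustrative).
  const needle = task.toLowerCase()
  return metas.filter((m) => needle.includes(m.name.toLowerCase()))
}

async function loadSkill(meta: SkillMeta): Promise<string> {
  // Expensive pass, taken only for the skills that survived the scan.
  return await Bun.file(meta.path).text()
}
```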

### Related entries

- [Willison, "Claude Skills are awesome, maybe a bigger deal than MCP"](/library/willison-claude-skills-bigger-than-mcp) — practitioner synthesis the same week.
- [Tan, "Thin Harness, Fat Skills"](/library/tan-thin-harness-fat-skills) — the ethos this operationalises.
- [HumanLayer, "Skill Issue"](/library/humanlayer-skill-issue) — working-through of where skills stop and harness-engineering continues.

---

<!-- source: dhinakaran-agent-harness.md -->

---
title: "What Is an Agent Harness"
author: "Aparna Dhinakaran"
date: "2026-04-22"
source_type: "tweet"
url: "https://x.com/aparnadhinak/status/2046980769747533830"
excerpt: "LangChain is not a harness. LangGraph is not a harness."
summary: "Defines the modern agent harness as an out-of-the-box architecture that emerged from coding agents: an iteration loop over tools, context management, skill/tool discovery, permissions, hooks, session persistence, sub-agents, and project-context injection."
tags:
  - agent-harness
  - coding-agents
  - harness-architecture
  - tool-loops
  - permissions
  - context-management
  - skills
role: "framework-piece"
harness_types:
  - input-shaping
  - grounding-context-loading
  - execution-harness
  - validation-harness
  - repair-harness
  - monitoring-harness
  - learning-harness
  - social-harness
  - interface-harness
validation_position:
  - before-generation
  - during-generation
  - immediately-after-generation
  - before-action
  - post-deployment
  - continuous
validation_mode:
  - mechanical
  - empirical
  - institutional
domain: "cross-domain"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - capability-is-extended
  - validation-is-constitutive
  - repairability-matters
  - observability-matters
  - breakdown-when-harness-absent
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
    note: "Project instruction files, context injection, skills, and tool discovery make the task environment legible to the model before and during work."
  task_structure:
    score: 5
    note: "The while loop, tool registry, permission layer, and lifecycle hooks are presented as fixed architecture, not human-assembled graph wiring."
  reward_richness:
    score: 3
    note: "The source emphasizes act-observe-adjust feedback, but not explicit reward-model training or scalar reward design."
  feedback_latency:
    score: 5
    note: "Coding-agent feedback is immediate: read, edit, run tests, observe failure, repair, and repeat."
  repairability:
    score: 5
    note: "Repair is central to the definition: the model can observe consequences and continue until the task is actually solved."
  observability:
    score: 4
    note: "Hooks, session logs, context compression, and tool results make harness behavior inspectable, though the post is more architectural than telemetry-specific."
  reversibility:
    score: 3
    note: "Permissions and approval gates reduce destructive risk, but rollback is not foregrounded as a first-class component."
  offline_evaluability:
    score: 4
    note: "Coding agents inherit strong offline checks through tests, shell commands, diffs, and build outputs."
  institutional_ratification:
    score: 4
    note: "Hooks and permission policies are explicitly framed as the enterprise adoption layer."
why_it_matters: "A strongly procedural counterweight to thin-harness framings. The post argues that harnesses are not generic frameworks for humans to assemble agents, but working closed-loop environments that let models act, observe, repair, persist, and extend themselves."
notes: "Source text supplied by Daniel from X. Date confirmed as Apr 22, 2026. This entry was prepared with Codex (OpenAI); the earlier library entries were prepared with Claude (Anthropic)."
verification_needed: true
verification_note: "Date confirmed by Daniel. Title, author, URL, and content came from user capture; confirm directly in X before formal citation."
---

Dhinakaran draws a bright line between *frameworks* and *harnesses*. Frameworks such as LangChain and LangGraph give human developers abstractions to wire together. A harness, in her account, ships as a working agent architecture: outer loop, context manager, tool and skill registry, permission system, lifecycle hooks, session persistence, sub-agent management, and dynamic project-context injection.

The post is useful because it treats harnesses as an empirical convergence, not a vendor category. Coding agents such as Cursor, Claude Code, Windsurf, and Codex started from the practical problem of changing real repositories, then converged on similar structures: tool loops, compressed context, approval layers, and built-in file/shell/code-navigation primitives. Arize's Alyx is positioned as the same pattern appearing outside pure coding.

For the Extended Frontier argument, this is direct evidence that capability is produced by the situated assembly. The model alone is a one-shot text generator; the model inside a harness becomes a feedback-seeking system that can act, observe consequences, and adjust. That closed loop is not incidental plumbing. It is what changes the unit of capability from *model output* to *model-in-environment performance*.

This entry should sit beside:

- [Tan, "Thin Harness, Fat Skills"](/library/tan-thin-harness-fat-skills) — disagrees on where durable leverage should live.
- [Miessler, "Good and Bad Harness Engineering"](/library/miessler-good-and-bad-harness-engineering) — adjacent harness-engineering vocabulary.
- [Anthropic, "Agent Skills"](/library/anthropic-agent-skills) — one of the built-in skill-layer mechanisms this post treats as part of harness architecture.

### Components To Reuse

Dhinakaran's harness 1.0 component list is a useful checklist for classifying future entries:

- Outer iteration loop (sketched below).
- Context management and compression.
- Skills and tools management.
- Sub-agent management.
- Built-in pre-packaged skills.
- Session persistence and recovery.
- System prompt assembly and project-context injection.
- Lifecycle hooks.
- Permission and safety layer.
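
The first component is the core of the definition and worth sketching; every name below is illustrative, not taken from any real harness:

```ts
// Minimal outer iteration loop: act, observe, adjust, repeat until done.
type ToolCall = { name: string; args: Record<string, unknown> }
type Step = { done: true; result: string } | { done: false; call: ToolCall }

interface Harness {
  next(context: string[]): Promise<Step>   // model turn
  allowed(call: ToolCall): boolean         // permission and safety layer
  execute(call: ToolCall): Promise<string> // tool and skill registry
  compress(context: string[]): string[]    // context management
}

async function run(task: string, h: Harness): Promise<string> {
  let context = [task] // system-prompt assembly and project-context injection
  for (;;) {
    const step = await h.next(context)
    if (step.done) return step.result
    const observation = h.allowed(step.call)
      ? await h.execute(step.call)   // act and observe consequences
      : `denied: ${step.call.name}`  // approval gate pushes back
    context = h.compress([...context, observation])
  }
}
```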

### Tension

The strongest claim is also the pressure point: if a harness is defined as an out-of-the-box working agent architecture, then LangGraph-style frameworks are excluded even when they can be used to build similar loops. That exclusion is analytically useful for the library because it keeps the focus on *deployed capability environments*, not just orchestration abstractions.

---

<!-- source: expanding-rlvr-across-domains.md -->

---
title: "Expanding RL with Verifiable Rewards Across Diverse Domains"
author: "Ma et al."
date: "2025-03-31"
source_type: "paper"
url: "https://arxiv.org/abs/2503.23829"
verification_needed: true
verification_note: "Title and URL verified. First-author and full author list not confirmed; arxiv date is best-estimate from the arxiv ID (2503 = March 2025). Confirm before citing."
summary: "Arxiv paper investigating how reinforcement learning with verifiable rewards (RLVR) generalises beyond the easy cases (math, code) to more diverse domains. The technical paper whose conceptual shadow Royzen's domain-claim entry sits in."
tags:
  - rlvr
  - verifiable-rewards
  - reinforcement-learning
  - domain-generalisation
role: "domain-claim"
harness_types:
  - learning-harness
  - validation-harness
validation_position:
  - immediately-after-generation
  - post-deployment
validation_mode:
  - mechanical
  - empirical
domain: "research"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - reward-structure-matters
  - domain-structure-matters
  - validation-is-constitutive
dimensions:
  input_legibility:
    score: 3
  task_structure:
    score: 4
  reward_richness:
    score: 5
    note: "RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it."
  feedback_latency:
    score: 4
  repairability:
    score: 3
    note: "Mark against reward-richness: a verifiable outcome signal can still be silent on the error mechanism."
  observability:
    score: 3
  offline_evaluability:
    score: 4
  institutional_ratification:
    score: 2
why_it_matters: "Grounds the 'verifiable-reward domain' framing in the ML research literature. Useful for readers who want the technical story behind practitioner claims that finance, code, and math are uniquely favourable."
---

Technical complement to the practitioner entries on verifiable-reward domains. The paper asks the question the library should keep asking: *which* diverse domains does RLVR actually generalise to, and what breaks when it doesn't?

Rhetorically, this entry is included to prevent the library from collapsing "verifiable reward" into a slogan. There is a research program behind it with real empirical findings — both supporting and complicating the practitioner framings.

### Read alongside

- [Royzen: Standard Signal](/library/royzen-standard-signal) — the finance-domain-favourability claim.
- [Rubrics as Rewards (RaR)](/library/rubrics-as-rewards) — extending the framing past crisp-outcome domains.
- [Wallach/Jacobs et al.](/library/wallach-measurement-challenge) — the measurement-validity pushback.

---

<!-- source: handoff-atlas.md -->

---
title: "Applied AI Handoff Atlas"
author: "Daniel S. Griffin"
date: "2026-05-02"
source_type: "note"
url: "https://danielsgriffin.com/library/handoff-atlas"
summary: "A field-evidence companion to the Extended Capability Library: Curiosity Builds read as small handoffs of judgment, memory, practice, access, and representation into AI-shaped systems."
tags:
  - applied-ai
  - handoffs
  - curiosity-builds
  - field-evidence
  - extended-capability
  - hypandra
role: "synthesis-node"
harness_types:
  - input-shaping
  - grounding-context-loading
  - validation-harness
  - repair-harness
  - ratification-harness
  - social-harness
  - interface-harness
validation_position:
  - before-generation
  - immediately-after-generation
  - before-action
  - continuous
validation_mode:
  - empirical
  - interpretive
  - social
domain: "cross-domain"
prescription_stance: "mixed"
relation_to_argument:
  - capability-is-extended
  - validation-is-constitutive
  - repairability-matters
  - first-mile-input-formation
  - reward-structure-matters
  - observability-matters
  - institutions-shape-capability
  - breakdown-when-harness-absent
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
    note: "Several builds expose first-mile input formation directly: student interviews, recipe intent, farm context, stop-motion plans, and reasoning explanations."
  task_structure:
    score: 4
    note: "The builds are small, bounded handoffs with concrete interfaces rather than general chat claims."
  reward_richness:
    score: 3
    note: "Some builds have clear local rewards, but most are not scalar or mechanically verifiable."
  feedback_latency:
    score: 3
    note: "Family, teacher, user, and product feedback arrived quickly; formal evaluation traces remain uneven."
  repairability:
    score: 4
    note: "The atlas foregrounds what broke or forced redesign, but several repairs still need preserved artifacts."
  observability:
    score: 3
    note: "Screenshots, handoff notes, changelogs, audits, and repository evidence exist; runtime traces are thinner."
  reversibility:
    score: 4
    note: "Several designs preserve human approval, local data, opt-in generation, and non-impersonation boundaries."
  offline_evaluability:
    score: 2
    note: "Scheduler Mark and tutoring builds point toward fixtures and rubrics, but the corpus is not yet a formal benchmark."
  institutional_ratification:
    score: 3
    note: "Teacher, family, client, and archive-steward contexts create real ratification pressure, though not yet formal institutional deployment."
why_it_matters: "This is Daniel's own field-evidence bridge between Hypandra Curiosity Builds and the Extended Capability Library. It turns small builds into claim-bearing notes about what extends applied AI capability in practice."
notes: "Prepared as a hosted artifact for danielsgriffin.com. It excludes Borderless Wire and The Conundrum Desk per Daniel's curation decision."
verification_needed: false
---

The Handoff Atlas reads Hypandra Curiosity Builds as evidence for the library's
central argument: AI capability in practice is extended by the surrounding
scaffolding, not contained inside the model alone.

The unit of analysis is the **handoff**. A handoff is a small transfer of work
into a system: judgment, memory, access control, practice, explanation, or
representation. The atlas asks what moved, what evidence supports that claim,
what broke, and which reusable lesson another applied AI team should carry
forward.

### Why it belongs in the library

Most entries in the library are external sources: papers, practitioner essays,
framework posts, and vendor announcements. This note supplies a different kind
of source: a field notebook from building small AI systems with actual users,
families, teachers, clients, and archive stewards.

That matters because the corpus has an explicit evidence gap around repair loops
in action and non-software operational settings. The builds do not close that
gap completely, but they make it more concrete:

- Pronouncle shows how the wrong measurement instrument can invalidate a product
  claim.
- Scheduler Mark shows why model-vs-model critique is not evaluation without
  ground truth or a rubric.
- HighSchoolResumes shows transparency as an interface users can inspect, not
  merely a disclosure page.
- Bunny Biscuits shows intentional friction and local data boundaries as part
  of the AI design.
- Zanna Smith Archive / MelTemp shows representation boundaries as product
  requirements in memory-sensitive domains.

### Read alongside

- [Rubrics as Rewards](/library/rubrics-as-rewards) - diagnostic feedback and
  named failure modes.
- [Wallach/Jacobs et al., "Social Science Measurement Challenge"](/library/wallach-measurement-challenge) -
  why evaluation is a construct-validity problem.
- [Good and Bad Harness Engineering](/library/miessler-good-and-bad-harness-engineering) -
  the distinction between scaffolding that extends capability and scaffolding
  that merely compensates for model weakness.
- [Resurrecting deceased darlings](/library/maynard-resurrecting-deceased-darlings) -
  human agency, disclosure, and editorial ratification in AI-assisted knowledge
  work.

---

<!-- source: humanlayer-skill-issue.md -->

---
title: "Skill Issue: Harness Engineering for Coding Agents"
author: "HumanLayer"
date: "2026-03-01"
source_type: "blog"
url: "https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents"
verification_needed: true
verification_note: "URL, title, and publisher verified. Date is best-estimate from search snippet ('published in March 2026'); confirm the exact date and specific author(s) from the post header."
excerpt: "Skills, MCP servers, sub-agents, hooks, and back-pressure mechanisms are tactical solutions HumanLayer has arrived at."
summary: "Case-study framing of harness engineering for coding agents, with specific claims about what does and does not work (notably: role-based sub-agents don't work; sub-agents for context control do)."
tags:
  - harness-engineering
  - coding-agents
  - sub-agents
  - context-control
  - back-pressure
role: "case-study"
harness_types:
  - execution-harness
  - repair-harness
  - monitoring-harness
  - interface-harness
validation_position:
  - during-generation
  - immediately-after-generation
  - post-deployment
validation_mode:
  - empirical
  - mechanical
domain: "software"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - capability-is-extended
  - repairability-matters
  - observability-matters
  - breakdown-when-harness-absent
dimensions:
  input_legibility:
    score: 4
  task_structure:
    score: 5
    note: "Breaking work into discrete delegated tasks is a first-class move here."
  reward_richness:
    score: 3
  feedback_latency:
    score: 3
  repairability:
    score: 4
    note: "Back-pressure mechanisms are repair harness by another name."
  observability:
    score: 4
  offline_evaluability:
    score: 2
  institutional_ratification:
    score: 2
why_it_matters: "A strong counter-example to thin-harness-in-the-limit. HumanLayer has shipped coding-agent product and reports that sub-agents, hooks, and back-pressure do real work. Sharpens the disagreement with Tan/Miessler and localises it."
---

HumanLayer's post is the library's best current counterweight to the thin-harness pole. The claim is not that more harness is always better — they explicitly report that *role-based* sub-agents ("frontend engineer," "backend engineer") don't work. The claim is that specific harness moves — sub-agents as context-control, hooks, back-pressure — carry real load and cannot be absorbed into a better model.

The piece is useful for the library because it:

- **Distinguishes harness types that work from those that don't**, empirically rather than in principle.
- **Names specific mechanisms** (sub-agents-for-context, back-pressure) that belong on the `harness_types` taxonomy.
- **Speaks from shipped product**, which raises its weight on the `practitioner-note` vs `framework-piece` axis.

### Read alongside

- [Tan, "Thin Harness, Fat Skills"](/library/tan-thin-harness-fat-skills) — the opposite pole.
- [Miessler, "Good and Bad Harness Engineering"](/library/miessler-good-and-bad-harness-engineering) — the pole this piece is most compatible with.
- [Anthropic, "Agent Skills"](/library/anthropic-agent-skills) — the vendor framing for one of the tactical solutions cited.

---

<!-- source: karpathy-llm-knowledge-bases.md -->

---
title: "LLM Knowledge Bases"
author: "Andrej Karpathy"
date: "2026-04-02"
source_type: "tweet"
url: "https://x.com/karpathy/status/2039805659525644595"
excerpt: "You rarely ever write or edit the wiki manually, it's the domain of the LLM."
summary: "Describes a personal research workflow where raw source documents are compiled by an LLM into a markdown wiki, maintained through index files, health checks, generated outputs, and lightweight tools rather than a heavyweight RAG stack."
tags:
  - knowledge-base
  - markdown-wiki
  - obsidian
  - agentic-research
  - llm-maintained-artifacts
  - personal-knowledge-management
role: "practitioner-note"
harness_types:
  - grounding-context-loading
  - execution-harness
  - validation-harness
  - repair-harness
  - monitoring-harness
  - learning-harness
  - interface-harness
validation_position:
  - before-generation
  - immediately-after-generation
  - continuous
validation_mode:
  - empirical
  - interpretive
domain: "research"
prescription_stance: "mixed"
relation_to_argument:
  - capability-is-extended
  - first-mile-input-formation
  - validation-is-constitutive
  - repairability-matters
  - observability-matters
  - domain-structure-matters
dimensions:
  input_legibility:
    score: 5
    note: "The raw/ to wiki compilation process is explicitly about making heterogeneous documents legible to future LLM turns."
  task_structure:
    score: 4
    note: "Markdown files, indexes, backlinks, summaries, and Obsidian views give the work a manipulable structure."
  reward_richness:
    score: 2
    note: "The workflow has useful signals from links, consistency, and answer quality, but not an explicit reward signal."
  feedback_latency:
    score: 3
    note: "Feedback arrives through Q&A, rendered outputs, and health checks, but not usually as immediate pass/fail tests."
  repairability:
    score: 4
    note: "Health checks, missing-data imputation, and filing outputs back into the wiki make the knowledge base incrementally repairable."
  observability:
    score: 5
    note: "The wiki is human-readable markdown and images viewed in Obsidian, so the agent's knowledge substrate stays inspectable."
  reversibility:
    score: 3
    note: "Markdown artifacts are versionable, though the post does not foreground git or rollback."
  offline_evaluability:
    score: 3
    note: "Some checks can be run offline over the wiki, but factual gaps still require web search or source refresh."
  institutional_ratification:
    score: 1
    note: "This is a personal research workflow rather than an organizational ratification system."
why_it_matters: "This is the Extended Frontier applied to knowledge work: the model's capability comes from a maintained corpus, indexes, summaries, visual outputs, and health checks that make research cumulative instead of ephemeral."
notes: "Source text supplied by Daniel from X. This entry was prepared with Codex (OpenAI)."
verification_needed: true
verification_note: "Author, URL, timestamp, and content came from user capture. Confirm directly in X before formal citation."
---

Karpathy describes a knowledge-work harness, not just a note-taking habit. Raw sources go into one directory; an LLM incrementally compiles them into a markdown wiki with summaries, backlinks, concept pages, index files, and derived visualizations. Obsidian becomes the human-facing IDE, while the LLM owns most direct edits to the wiki.

The important move is that research outputs are not terminal chat answers. They become files: markdown notes, Marp slides, matplotlib images, search indexes, and follow-up articles that can be filed back into the corpus. Each query can make the next query easier because the knowledge base itself accumulates structure.

For the library, this is a clean example of **capability as artifact maintenance**. Karpathy expected to need "fancy RAG," but at roughly 100 articles and 400K words, LLM-maintained summaries and index files were enough. The boundary condition matters: the system works because the scale is still small enough for source-aware traversal and because the artifacts are legible.

### Extended Frontier Read

The raw model is not the unit of analysis. The useful system is model plus:

- a raw source archive,
- a compiled markdown wiki,
- index and summary files,
- Obsidian as inspection surface,
- generated outputs that feed back into the wiki,
- health checks over consistency and missing data (one such check is sketched below),
- small custom tools such as a wiki search engine.
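
One of those health checks can be sketched concretely: scan the wiki for `[[wikilinks]]` that point at pages that do not exist (paths and names are illustrative, not Karpathy's tooling):

```ts
import { readdirSync, readFileSync } from 'node:fs'

// Sketch of a consistency health check over a markdown wiki directory.
function brokenLinks(wikiDir: string): string[] {
  const pages = new Set(readdirSync(wikiDir).filter((f) => f.endsWith('.md')))
  const broken: string[] = []
  for (const page of pages) {
    const text = readFileSync(`${wikiDir}/${page}`, 'utf8')
    for (const [, target] of text.matchAll(/\[\[([^\]|#]+)/g)) {
      if (!pages.has(`${target.trim()}.md`)) broken.push(`${page} -> ${target.trim()}`)
    }
  }
  return broken
}
```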

This belongs beside harness entries, but it broadens the frame from coding agents to research agents. The same pattern appears: make the environment legible, let the model act on files, inspect the result, repair the substrate, and let work accumulate.

### Open Questions

- At what corpus size does this stop working without stronger retrieval infrastructure?
- Which health checks are most predictive of useful future Q&A?
- Does finetuning on the wiki improve capability, or does it destroy the inspectability and repairability that make the workflow valuable?

---

<!-- source: maynard-resurrecting-deceased-darlings.md -->

---
title: "Resurrecting deceased darlings: The Missing Foreword to AI and the Art of Being Human"
author: "Andrew Maynard"
date: "2025-10-19"
source_type: "essay"
url: "https://www.futureofbeinghuman.com/p/ai-resurrecting-deceased-darlings?publication_id=1547141&post_id=176516949"
excerpt: "This book could not have been written without the learning and insights gained from working closely with one of the most powerful AI models available."
summary: "Maynard publishes the cut foreword to AI and the Art of Being Human, describing months of close collaboration with Claude while emphasizing human agency, manual refinement, AI tells, fictional allegories, and practical tools for staying human with AI."
tags:
  - writing
  - ai-assisted-book
  - claude
  - human-agency
  - editorial-process
  - inner-postures
  - storytelling
role: "case-study"
harness_types:
  - input-shaping
  - validation-harness
  - repair-harness
  - learning-harness
  - social-harness
  - interface-harness
validation_position:
  - before-generation
  - immediately-after-generation
  - before-action
  - continuous
validation_mode:
  - interpretive
  - social
  - empirical
domain: "education"
prescription_stance: "mixed"
relation_to_argument:
  - capability-is-extended
  - first-mile-input-formation
  - repairability-matters
  - institutions-shape-capability
  - breakdown-when-harness-absent
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
    note: "The authors built a library of resources and deep prompts over months before drafting."
  task_structure:
    score: 4
    note: "The collaboration was organized around chapters, frameworks, stories, tools, and explicit postures."
  reward_richness:
    score: 2
    note: "The feedback signal is editorial and human, not mechanical or scalar."
  feedback_latency:
    score: 3
    note: "Passages were iteratively rewritten, but book-scale editorial feedback is slower than code/test loops."
  repairability:
    score: 4
    note: "The post emphasizes manual refinement, removal of hallucinations, reduction of AI tells, and killing beloved text for reader flow."
  observability:
    score: 4
    note: "The foreword makes the collaboration visible, including worries, Claude's failures, and the retained AI tell."
  reversibility:
    score: 4
    note: "The authors cut the foreword from the book, moved some material to the preface, and later published it separately."
  offline_evaluability:
    score: 2
    note: "Quality is judged through reading, editing, credibility, and reader engagement rather than offline tests."
  institutional_ratification:
    score: 3
    note: "Professional advice, publication context, reader reception, and credibility concerns shape what counts as acceptable."
why_it_matters: "A grounded writing case study where AI assistance is neither hidden nor treated as autonomous authorship. Capability comes from months of prompt/resource preparation, human refinement, editorial judgment, and disclosure."
notes: "Source text supplied by Daniel from Maynard's Substack. This entry was prepared with Codex (OpenAI)."
verification_needed: true
verification_note: "Author, title, date, URL, and content came from user capture. Confirm directly from Substack before formal citation."
---

Maynard's post is a useful counterexample to simplistic claims about AI-assisted writing. The cut foreword says the book was written in close collaboration with Claude, but also insists the result was not a quick AI-generated artifact. The process took months of discussion, research, prompt and resource development, initial drafting, and extensive human refinement.

The most important detail for this library is that the authors treat AI collaboration as a **practice**. Claude contributed language, connections, tools, fictional forms, and moments that moved the authors. It also produced hallucinations, AI tells, and repeated failures to capture what they wanted. The final artifact depended on human judgment: rewriting, cutting, shaping reader flow, deciding what to disclose, and even preserving one minor AI tell as a trace of the collaboration.

### Extended Frontier Read

This is a writing-domain version of the harness argument:

- input preparation through a library of resources and deep prompts;
- iterative drafting with Claude;
- human editorial judgment over every chapter;
- professional advice shaping the final structure;
- disclosure as social ratification;
- fictional stories as a designed interface for making abstract AI questions felt.

The "extension" is not a test suite. It is the editorial and social apparatus around the model: judgment, taste, reader empathy, credibility concerns, disclosure, and revision.

### Tension

The foreword was cut because it slowed reader engagement, even though it contained valuable context. That editorial decision is itself part of the capability story. AI helped produce material the authors valued, but human-facing publication required deciding what not to include. Less output was better output.

---

<!-- source: miessler-bitter-lesson-engineering.md -->

---
title: "Bitter Lesson Engineering"
author: "Daniel Miessler"
date: "2025-06-01"
source_type: "essay"
url: "https://danielmiessler.com/blog/bitter-lesson-engineering"
verification_needed: true
verification_note: "URL and author verified; exact publish date is a best guess (site blocks automated fetch). Confirm from the post header before citing."
excerpt: "As AI gets better, Bitter Lesson Engineering becomes increasingly important."
summary: "Leans on Richard Sutton's 'The Bitter Lesson' to argue that prescriptive scaffolding around AI systems is a losing strategy in the limit: you should specify intent precisely and let the best available model figure out the path."
tags:
  - bitter-lesson
  - anti-prescriptive
  - sutton
  - design-discipline
role: "framework-piece"
harness_types:
  - input-shaping
validation_position:
  - before-generation
validation_mode:
  - empirical
domain: "cross-domain"
prescription_stance: "anti-prescriptive"
relation_to_argument:
  - capability-is-extended
  - diffusion-adoption-bottleneck
  - first-mile-input-formation
dimensions:
  input_legibility:
    score: 4
    note: "Being specific about intent *is* input legibility; it is the whole prescription."
  task_structure:
    score: 2
  reward_richness:
    score: 2
  feedback_latency:
    score: 2
  repairability:
    score: 2
    note: "Anti-prescriptive stances tend to underweight the value of diagnostic repair loops."
  observability:
    score: 2
  institutional_ratification:
    score: 1
why_it_matters: "Supplies the underlying argument for Miessler's harness-engineering taxonomy. Useful anchor for the anti-prescriptive pole of the library."
---

The conceptual base for [Good and Bad Harness Engineering](/library/miessler-good-and-bad-harness-engineering). The argument is a corollary of Sutton's "Bitter Lesson": methods that encode human prior knowledge get beaten in the long run by methods that scale general learning. Therefore: encode *what* you want (the construct, the outcome, the user intent) and let the model handle *how*.

In practice this produces a design stance close to [Tan's thin-harness](/library/tan-thin-harness-fat-skills), but arrived at from a different direction. Tan: "as models improve, scaffolding gets absorbed." Miessler-via-Sutton: "general methods beat prescriptive ones; prescriptive harness is prescriptive method."

### Disagreement preserved

This entry deliberately scores low on `repairability`, `observability`, and `institutional_ratification`. That is the anti-prescriptive pole: less scaffolding means less to diagnose, less to inspect, and fewer institutional seams. Pair this entry with measurement-focused entries to see the tension.

---

<!-- source: miessler-good-and-bad-harness-engineering.md -->

---
title: "Good and Bad Harness Engineering"
author: "Daniel Miessler"
date: "2025-09-01"
source_type: "essay"
url: "https://danielmiessler.com/blog/good-and-bad-harness-engineering"
verification_needed: true
verification_note: "URL and author verified; content summarised via search snippets (site blocks automated fetch). Exact publish date is a best guess — confirm from the post header before citing."
excerpt: "In the early days of prompt engineering (2023-2024) it was helpful to tell AI exactly how to do things, but this inversion probably happened somewhere in 2025."
summary: "Argues that good harness engineering focuses on who the user is and what they're trying to accomplish — the 'what' — and lets the model handle the 'how'. Pairs with Miessler's 'Bitter Lesson Engineering' as a design discipline for scaffolding that extends capability rather than compensating for model weakness."
tags:
  - harness-engineering
  - bitter-lesson
  - design-discipline
  - agent-design
  - what-not-how
role: "framework-piece"
harness_types:
  - input-shaping
  - grounding-context-loading
  - execution-harness
  - validation-harness
  - repair-harness
  - monitoring-harness
validation_position:
  - before-generation
  - immediately-after-generation
  - post-deployment
validation_mode:
  - empirical
  - mechanical
domain: "cross-domain"
prescription_stance: "mixed"
relation_to_argument:
  - capability-is-extended
  - repairability-matters
  - observability-matters
  - breakdown-when-harness-absent
dimensions:
  input_legibility:
    score: 4
    note: "Treats input formation as part of the engineered system, not preprocessing."
  task_structure:
    score: 4
  reward_richness:
    score: 3
  feedback_latency:
    score: 3
  repairability:
    score: 4
  observability:
    score: 4
  reversibility:
    score: 3
  offline_evaluability:
    score: 3
  institutional_ratification:
    score: 2
why_it_matters: "Supplies the vocabulary for distinguishing harnesses that *extend* capability from harnesses that merely *compensate* for it. A critical lens for reading practitioner writing."
---

Stakes out the middle ground between "thin harness, fat skills" and fully prescriptive agent frameworks. The core move is a *good/bad* distinction inside harness engineering itself: some scaffolding genuinely extends what the system can do (input shaping, repair loops, observability), while other scaffolding is brittle compensation for current model weakness and will not survive the next model.

Miessler's design rule is compressed into one line: **don't confuse the *what* with the *how*.** Tell the model who you are and what outcome you want; let the model figure out the path.

Read together with:

- [Bitter Lesson Engineering](https://danielmiessler.com/blog/bitter-lesson-engineering) — the underlying argument, leaning on Sutton's "The Bitter Lesson."
- [Tan's "Thin Harness, Fat Skills"](/library/tan-thin-harness-fat-skills) — adjacent, but less prescriptive about what counts as good design.

Miessler is **not** endorsing the thin-harness conclusion that scaffolding is always waste. He is endorsing a *discipline* of harness design. The disagreement with Tan is legible: both agree some scaffolding is waste; they disagree about how much of the harness is waste in the limit of model improvement.

### What the library should extract once the post is fully read

- The explicit taxonomy (if any) of good vs. bad harness work.
- Concrete examples cited as each type.
- Whether repairability and observability are treated as *constitutive* of capability or merely as hygiene.

---

<!-- source: nous-hermes-agent-readme.md -->

---
title: "Hermes Agent README"
author: "Nous Research"
date: "2026-04-29"
source_type: "doc"
url: "https://github.com/nousresearch/hermes-agent"
excerpt: "The self-improving AI agent built by Nous Research."
summary: "The Hermes Agent README presents an open agent harness with model-provider switching, terminal and messaging interfaces, scheduled automations, isolated subagents, toolsets, persistent memory, session search, and a closed learning loop around skills."
tags:
  - hermes-agent
  - nous-research
  - self-improving-agent
  - skills
  - memory
  - messaging-gateway
  - subagents
role: "case-study"
harness_types:
  - input-shaping
  - grounding-context-loading
  - execution-harness
  - repair-harness
  - monitoring-harness
  - learning-harness
  - social-harness
  - interface-harness
validation_position:
  - before-generation
  - during-generation
  - immediately-after-generation
  - continuous
validation_mode:
  - mechanical
  - empirical
  - social
domain: "cross-domain"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - capability-is-extended
  - first-mile-input-formation
  - repairability-matters
  - observability-matters
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
    note: "Slash commands, personalities, skills, memory, and cross-session search make user intent and prior context available to the model."
  task_structure:
    score: 5
    note: "The README describes a full harness surface: CLI/TUI, messaging gateway, scheduler, tools, backends, model providers, subagents, and skills."
  reward_richness:
    score: 3
    note: "Hermes emphasizes learning from experience and skill improvement, but the README does not define a single reward signal."
  feedback_latency:
    score: 4
    note: "Interactive CLI, messaging interrupts, tool output, scheduled jobs, and session search create frequent feedback opportunities."
  repairability:
    score: 4
    note: "The system can improve skills during use, create skills from experience, search past sessions, and persist knowledge."
  observability:
    score: 4
    note: "Terminal UI, command history, streaming tool output, diagnostics, and session search make agent behavior inspectable."
  reversibility:
    score: 3
    note: "Retry/undo commands are present, but the README does not foreground a broad rollback model."
  offline_evaluability:
    score: 3
    note: "Research tooling and batch trajectory generation suggest evaluability, but this is not the main README argument."
  institutional_ratification:
    score: 2
    note: "The README is user/harness-oriented rather than focused on organizational approval or governance."
why_it_matters: "Hermes is an example of the harness conversation moving beyond coding alone: a persistent, multi-surface, model-agnostic agent with memory, skills, automations, and self-improvement loops."
notes: "README inspected on GitHub by Codex on Apr 29, 2026. Date is the capture date for this dynamic README snapshot. This entry was prepared with Codex (OpenAI)."
verification_needed: true
verification_note: "README content verified from GitHub snapshot. Date is access/capture date, not a stable publication date."
---

The Hermes README is valuable as a productized harness inventory. It does not present a single new model capability. It presents the surrounding system: model-provider switching, a terminal UI, messaging gateways, scheduled automations, persistent memory, skills, subagents, session search, toolsets, terminal backends, and research tooling.

The distinctive claim is the closed learning loop. Hermes says it can create skills from experience, improve skills during use, nudge itself to persist knowledge, search past conversations, and build a user model across sessions. That is a direct capability-extension claim: the agent becomes more useful not only because the model changes, but because the harness accumulates procedural and contextual memory.
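
A minimal sketch of what "the harness accumulates" could mean mechanically; every name here is hypothetical, since the README describes behavior, not this interface. Skills and session transcripts persist as files, so each new run starts with more context than the last.

```python
# Hypothetical sketch of harness-side accumulation: skills and session
# transcripts persist as files, so each new run starts with more context.
import json
from pathlib import Path

STORE = Path("agent-state")

def save_skill(name: str, body_md: str) -> None:
    """Persist a skill distilled from a successful session."""
    (STORE / "skills").mkdir(parents=True, exist_ok=True)
    (STORE / "skills" / f"{name}.md").write_text(body_md)

def log_session(session_id: str, turns: list[dict]) -> None:
    """Keep the transcript so later sessions can search it."""
    (STORE / "sessions").mkdir(parents=True, exist_ok=True)
    (STORE / "sessions" / f"{session_id}.json").write_text(json.dumps(turns))

def search_sessions(query: str) -> list[str]:
    """Naive substring search; a real harness would index this."""
    sessions = STORE / "sessions"
    if not sessions.exists():
        return []
    return [f.stem for f in sessions.glob("*.json")
            if query.lower() in f.read_text().lower()]
```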

### Extended Frontier Read

Hermes makes the "agent harness" category concrete across several surfaces:

- **interface harness**: CLI/TUI plus gateways for Telegram, Discord, Slack, WhatsApp, Signal, and email;
- **learning harness**: skill creation, skill improvement, memory nudges, session search;
- **execution harness**: local, Docker, SSH, Daytona, Singularity, and Modal terminal backends;
- **social harness**: cross-platform continuity, user modeling, scheduled reports;
- **subagent harness**: isolated parallel workstreams and RPC-style tool scripts.

This is not just "a chatbot with tools." It is an attempt to make an agent live where the user lives, remember what matters, and turn repeated work into skills.

### Open Questions

- How much of the self-improvement loop is automatic versus user-confirmed?
- Which skills improve reliably during use, and which drift?
- What validation or audit trail exists when memory and user modeling become part of the harness?

---

<!-- source: oecd-ai-ready-workforce.md -->

---
title: "Building an AI-ready public workforce"
author: "OECD"
date: "2025-07-01"
source_type: "doc"
url: "https://www.oecd.org/en/publications/building-an-ai-ready-public-workforce_b89244c7-en/full-report.html"
verification_needed: true
verification_note: "Publisher and URL verified. Date is a best-estimate; confirm the publication date from the OECD page before citing."
summary: "OECD full report on how public-sector workforces are (and are not) prepared to deploy AI. Brought into the library as a governance-piece anchor: the argument is that whether an AI system is capable *in practice* depends on the institutional scaffolding around its use, not only on the model or the harness."
tags:
  - workforce
  - public-sector
  - oecd
  - institutional-scaffolding
  - governance
role: "governance-piece"
harness_types:
  - ratification-harness
  - social-harness
  - monitoring-harness
validation_position:
  - before-action
  - post-deployment
  - continuous
validation_mode:
  - institutional
  - social
domain: "operations"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - institutions-shape-capability
  - diffusion-adoption-bottleneck
  - breakdown-when-harness-absent
dimensions:
  input_legibility:
    score: 2
  task_structure:
    score: 3
  reward_richness:
    score: 1
    note: "Public-sector outcomes rarely collapse to a cardinal reward."
  feedback_latency:
    score: 1
    note: "Policy-level feedback is slow. Years, not cycles."
  repairability:
    score: 3
  observability:
    score: 3
  institutional_ratification:
    score: 5
    note: "The report is itself a ratification instrument."
why_it_matters: "Counterweight to the software-centric pole of the library. A large portion of real AI deployment lives inside institutions whose capability depends on workforce preparation, training, accountability, and procurement — none of which is captured by 'harness' in the coding-agent sense."
---

A governance entry. Places the question of AI capability-in-practice inside the frame of *public administration*: whether AI makes a public-sector system more capable depends on training, data integration, procurement norms, and public-private partnership structures, not only on the model or its harness.

The OECD framing forces the library to reckon with a kind of scaffolding that coding-agent practitioners rarely name:

- Workforce **training** as a first-mile input-formation mechanism.
- Accountability **procedures** as a ratification harness with legal and political standing.
- Cross-agency **data integration** as a grounding-and-context-loading substrate.

### Why it pairs with the software entries

- [Tan, thin harness / fat skills](/library/tan-thin-harness-fat-skills) — highlights the domain mismatch: thin-harness prescriptions assume a software-practitioner user. Here the "user" is a multi-layered public institution.
- [HumanLayer, "Skill Issue"](/library/humanlayer-skill-issue) — both pieces agree that the harness matters; they disagree about which harness.

---

<!-- source: openai-symphony-codex-orchestration.md -->

---
title: "An open-source spec for Codex orchestration: Symphony"
author: "Alex Kotliarskyi, Victor Zhu, and Zach Brock"
date: "2026-04-27"
source_type: "blog"
url: "https://openai.com/index/symphony-codex-orchestration/"
excerpt: "The agents were fast, but we had a system bottleneck: human attention."
summary: "OpenAI describes Symphony, a spec and reference implementation that turns issue trackers such as Linear into always-on control planes for coding agents, shifting humans from supervising sessions to managing work."
tags:
  - codex
  - symphony
  - orchestration
  - issue-tracker
  - linear
  - agent-management
  - app-server
role: "framework-piece"
harness_types:
  - execution-harness
  - validation-harness
  - repair-harness
  - monitoring-harness
  - learning-harness
  - social-harness
  - interface-harness
validation_position:
  - before-generation
  - during-generation
  - immediately-after-generation
  - before-action
  - post-deployment
  - continuous
validation_mode:
  - mechanical
  - empirical
  - social
  - institutional
domain: "software"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - capability-is-extended
  - validation-is-constitutive
  - repairability-matters
  - observability-matters
  - institutions-shape-capability
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
    note: "Issues, WORKFLOW.md, project state, and review packets turn ambiguous work into agent-readable objectives."
  task_structure:
    score: 5
    note: "The issue tracker becomes a state machine/control plane with per-issue workspaces, retries, statuses, and dependencies."
  reward_richness:
    score: 4
    note: "CI, reviews, issue state transitions, PR landing, videos, and human review all become feedback signals."
  feedback_latency:
    score: 4
    note: "Agents continuously observe issue state, CI, review feedback, and runtime failures, though some feedback waits on human review."
  repairability:
    score: 5
    note: "The system rebases, resolves conflicts, retries flaky checks, restarts stalled agents, and feeds failures back into guardrails and skills."
  observability:
    score: 5
    note: "Symphony foregrounds logs, status surfaces, review packets, videos, Linear state, and operator visibility."
  reversibility:
    score: 4
    note: "Per-issue workspaces and PR review preserve isolation and throwaway explorations, though rollback policy is implementation-specific."
  offline_evaluability:
    score: 5
    note: "Software tasks have tests, CI, smoke tests, Chrome DevTools checks, and reproducible workspaces."
  institutional_ratification:
    score: 5
    note: "The issue tracker, review statuses, PM/designer requests, and human review make acceptance institutional rather than merely technical."
why_it_matters: "Symphony is an explicit account of the next bottleneck after coding-agent capability: organizing agentic work. It treats orchestration, workflow documentation, issue state, CI, and review as capability infrastructure."
notes: "Source text supplied by Daniel from OpenAI's April 27, 2026 engineering post. This entry was prepared with Codex (OpenAI)."
verification_needed: true
verification_note: "Content came from user capture; URL is the likely OpenAI canonical URL. Confirm exact canonical URL and byline before formal citation."
---

Symphony is a control-plane argument. OpenAI's team found that interactive coding agents were already capable enough to create a new bottleneck: engineers could only supervise a few sessions before context switching overwhelmed them. Symphony responds by moving the unit of management from "agent session" to "project work."

In the described setup, Linear is not just a queue. It becomes the state machine for agent work. Every eligible issue gets an isolated workspace and a running agent. The orchestrator watches issue states, starts work, restarts stalled agents, handles retries, respects blockers, follows dependency DAGs, and lets agents file follow-up issues when they discover work outside the current scope.
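
A minimal sketch of that control loop, with hypothetical names (the Symphony spec itself is not reproduced here): treat issues as a state machine, start an agent per ready issue, and restart stalled ones instead of supervising them.

```python
# Sketch of an issue-tracker control plane (hypothetical names; not
# OpenAI's implementation). Issues form a state machine; the orchestrator
# starts an agent per ready issue and restarts stalled ones.
from dataclasses import dataclass, field

@dataclass
class Issue:
    state: str = "todo"            # todo -> in_progress -> in_review -> done
    deps: list[str] = field(default_factory=list)

def ready(iid: str, issues: dict[str, Issue]) -> bool:
    """Eligible to start: queued, with every dependency already landed."""
    issue = issues[iid]
    return issue.state == "todo" and all(issues[d].state == "done"
                                         for d in issue.deps)

def tick(issues: dict[str, Issue], start_agent, stalled) -> None:
    """One orchestrator pass: humans manage issues, not agent sessions."""
    for iid, issue in issues.items():
        if ready(iid, issues):
            start_agent(iid)       # fresh isolated workspace per issue
            issue.state = "in_progress"
        elif issue.state == "in_progress" and stalled(iid):
            start_agent(iid)       # restart instead of babysitting

# Dependencies gate starts, so agents follow the issue DAG.
issues = {"a": Issue(), "b": Issue(deps=["a"])}
tick(issues, start_agent=print, stalled=lambda iid: False)
assert issues["b"].state == "todo"  # blocked until "a" lands
```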

This is a strong example of **institutional scaffolding as capability**. The agents did not simply get better at coding. The work became more delegable because the surrounding system changed: issues became objectives, WORKFLOW.md captured implicit development norms, CI and QA became part of the run loop, and humans reviewed packets instead of steering terminals.

### Extended Frontier Read

The key sentence for this library is the attention bottleneck: the agents were fast, but humans were still micromanaging them. Symphony extends capability by changing the coordination layer:

- issue tracker as control plane,
- per-issue workspaces,
- agent sessions abstracted behind tickets,
- CI/rebase/conflict handling in the loop,
- review packets and videos for human ratification,
- WORKFLOW.md as versioned organizational knowledge,
- agent-created follow-up work.

That turns "can a model implement this task?" into "can the organization make useful agent work cheap to initiate, observe, review, and land?"

### Tension

The post is explicit that not every task belongs in Symphony. Some ambiguous work still needs direct interactive Codex sessions and strong human judgment. That caveat is important: orchestration smooths routine implementation and exploration, but it does not erase the frontier. It shifts which work humans spend attention on.

---

<!-- source: renda-openestimate.md -->

---
title: "OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data"
author: "Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas"
date: "2025-10-22"
source_type: "paper"
url: "https://arxiv.org/abs/2510.15096"
excerpt: "LM-elicited priors are often inaccurate and overconfident."
summary: "OpenEstimate is a multi-domain benchmark for testing whether language models can express calibrated Bayesian priors for numerical estimation tasks under uncertainty, using real-world datasets in healthcare, employment, and finance."
tags:
  - uncertainty
  - calibration
  - probabilistic-estimation
  - benchmark
  - bayesian-priors
  - openestimate
  - real-world-data
role: "measurement-piece"
harness_types:
  - validation-harness
  - grounding-context-loading
validation_position:
  - immediately-after-generation
validation_mode:
  - empirical
  - mechanical
domain: "cross-domain"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - validation-is-constitutive
  - reward-structure-matters
  - domain-structure-matters
  - breakdown-when-harness-absent
dimensions:
  input_legibility:
    score: 3
    note: "Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information."
  task_structure:
    score: 4
    note: "The benchmark asks models to represent beliefs as distributional priors, making uncertainty part of the output contract."
  reward_richness:
    score: 4
    note: "Ground-truth distributions from observational data support accuracy and calibration metrics."
  feedback_latency:
    score: 2
    note: "Feedback is benchmark-level evaluation after elicitation, not an interactive repair loop."
  repairability:
    score: 2
    note: "The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths."
  observability:
    score: 4
    note: "Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty."
  reversibility:
    score: 2
    note: "The source is an evaluation harness, not a workflow with rollback or undo semantics."
  offline_evaluability:
    score: 5
    note: "The benchmark is explicitly offline and reproducible against dataset-derived ground truth."
  institutional_ratification:
    score: 3
    note: "ICLR venue and open-source benchmark provide research-community ratification, but not deployment governance."
why_it_matters: "OpenEstimate targets a capability gap that ordinary right-answer benchmarks miss: knowing how uncertain you should be when neither the model nor the human has an obvious answer. It makes calibration and uncertainty representation first-class evaluation targets."
notes: "GitHub repository: https://github.com/alanarenda/openestimate. Announcement thread supplied by Daniel from X. arXiv page says v1 submitted Oct 16, 2025 and v2 revised Apr 22, 2026. This entry was prepared with Codex (OpenAI)."
verification_needed: false
verification_note: "Date uses the Oct 22, 2025 public announcement supplied by Daniel; title, authors, arXiv URL, and repository were verified against arXiv/GitHub on Apr 29, 2026."
---

OpenEstimate is a measurement entry for the part of the frontier where there is no simple "right answer" visible to the user. The task is numerical estimation under uncertainty: given partial information from real-world datasets, models must express beliefs as Bayesian priors, then those priors are evaluated against ground-truth distributions computed from data.

The benchmark covers domains such as health, employment, and finance using datasets including NHANES, Glassdoor, and PitchBook. It evaluates point accuracy, calibration, uncertainty-accuracy correlation, and the value of LM priors relative to statistical baselines based on samples from the true distribution.
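
One of those metrics is easy to make concrete. A generic quantile-calibration check (illustrative only; not the paper's exact metric): a calibrated prior should put roughly q of the ground-truth mass below its q-th quantile.

```python
# Generic quantile-calibration check (illustrative; not the paper's exact
# metric). A calibrated prior puts ~q of the ground-truth mass below its
# q-th quantile; an overconfident prior clusters its quantiles too tightly.
def calibration_error(elicited: dict[float, float],
                      truth: list[float]) -> float:
    """elicited maps nominal level q -> the model's q-th quantile."""
    err = 0.0
    for q, x in elicited.items():
        empirical = sum(t <= x for t in truth) / len(truth)
        err += abs(empirical - q)
    return err / len(elicited)

truth = list(range(0, 101))                        # flat ground truth on [0, 100]
overconfident = {0.1: 45.0, 0.5: 50.0, 0.9: 55.0}  # quantiles hug the median
calibrated = {0.1: 10.0, 0.5: 50.0, 0.9: 90.0}
assert calibration_error(overconfident, truth) > calibration_error(calibrated, truth)
```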

The headline result is sobering for deployment: across six frontier models, model-elicited priors are often inaccurate and overconfident. The announcement thread adds the sharper interpretation that model priors can be equivalent to fewer than five real data points and that higher model certainty does not reliably mean higher accuracy.

### Extended Frontier Read

OpenEstimate strengthens the measurement shelf because it asks for a richer construct than correctness. The relevant capability is not "can the model answer?" but:

- can it represent uncertainty as a usable prior,
- is that prior calibrated,
- does the model know when it does not know,
- does additional reasoning effort or prompting actually improve uncertainty quality?

That makes it a direct counterweight to benchmarks that reward confident point answers. The extension here is the evaluation harness itself: a structured output contract plus ground-truth distributions plus calibration metrics.

### Boundary

This is a validation harness, not a repair harness. It can show that models are overconfident, and it can compare elicitation protocols, but it does not by itself teach the model how to repair its uncertainty estimates. That makes it a useful neighbor to RLCR-style work that tries to train models to reason about what they do not know.

Source code: [alanarenda/openestimate](https://github.com/alanarenda/openestimate). Announcement thread: [@alanamarzoev on X](https://x.com/alanamarzoev/status/1981004837102793075).

---

<!-- source: royzen-standard-signal.md -->

---
title: "Standard Signal: AI-native hedge fund announcement"
author: "Michael Royzen"
date: "2026-03-01"
source_type: "tweet"
url: "https://x.com/MichaelRoyzen/status/2039801841253564837"
verification_needed: true
verification_note: "URL and excerpt verified via search. Exact posting date is a best guess (Standard Signal is YC Spring 2026 / P26); confirm from the tweet timestamp before citing."
excerpt: "Standard Signal is the first hedge fund that researches and executes trades purely with AI. We train models to discover and trade on new fundamental truths about the world before humans can."
summary: "Launch announcement for a YC-backed hedge fund where AI models both generate hypotheses and execute trades. Included here as a domain-claim entry: markets-with-P&L are a paradigmatically favorable domain — clean outcome signal, fast feedback, offline backtestability, and an institutionally ratified wrapper (a fund)."
tags:
  - standard-signal
  - finance
  - hedge-fund
  - outcome-signal
  - domain-favorability
role: "domain-claim"
harness_types:
  - validation-harness
  - ratification-harness
  - learning-harness
  - execution-harness
validation_position:
  - post-deployment
  - continuous
validation_mode:
  - mechanical
  - empirical
  - institutional
domain: "finance"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - reward-structure-matters
  - domain-structure-matters
  - institutions-shape-capability
  - validation-is-constitutive
dimensions:
  input_legibility:
    score: 3
  task_structure:
    score: 4
  reward_richness:
    score: 5
    note: "P&L is an unusually clean, cardinal, self-consistent reward. (The 'verifiable reward' framing is the library's own; Royzen does not use the phrase.)"
  feedback_latency:
    score: 3
    note: "Faster than science, slower than software. Mark-to-market is continuous; attribution to a specific hypothesis is not."
  repairability:
    score: 2
    note: "Critical tension: trading P&L tells you that a model is wrong but not *where* or *why*. Verifiable outcome ≠ diagnostic feedback."
  observability:
    score: 3
  reversibility:
    score: 2
    note: "Trades execute and settle; losses are not rollbackable."
  offline_evaluability:
    score: 4
    note: "Backtesting is real but regime-shift biased."
  institutional_ratification:
    score: 4
    note: "A hedge fund is the institution that ratifies 'this worked.' LPs, auditors, and regulators are ratification harness."
why_it_matters: "Finance is often named as a poster domain for AI deployment because outcomes are crisply priced. This entry anchors that claim with a concrete 2026 example and marks the critical asymmetry — high reward richness co-existing with low repairability — that the schema is designed to surface."
---

The announcement tweet is compact, but the conceptual payload is substantial. Standard Signal positions itself as the first hedge fund where *every* trade is researched and executed by AI. That packaging matters — not for the technology, but for the **ratification wrapper** around the technology. A fund is a legal and social form that converts opaque model outputs into legible claims about the world.

Why this belongs in the library even though Royzen does not use "harness" or "verifiable reward" vocabulary:

1. It stakes a **domain-favorability** claim: markets are unusually hospitable to AI because the reward signal is priced, real-time, and cardinal.
2. It stakes an **institutional** claim: a YC-backed fund is institutional ratification in a form academic benchmarks cannot supply.
3. It exposes the **asymmetry** the library wants to keep visible: high `reward_richness`, low `repairability`. P&L tells you whether you were right; it does not tell you *why*.

### Read alongside

- [Expanding RL with Verifiable Rewards Across Diverse Domains](/library/expanding-rlvr-across-domains) — technical framing of the same bet.
- [Measurement to Meaning](/library/salaudeen-measurement-to-meaning) — sharpest pushback: even a "verifiable" outcome doesn't measure the construct you claim.

### Verification needed

- Exact posting date of the tweet.
- Whether subsequent Standard Signal writing explicitly uses "verifiable reward" language or stays in P&L terms.

---

<!-- source: rubrics-as-rewards.md -->

---
title: "Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains"
author: "Unknown (OpenReview: 21UFlJrmS2)"
date: "2025-09-01"
source_type: "paper"
url: "https://openreview.net/forum?id=21UFlJrmS2"
verification_needed: true
verification_note: "OpenReview URL and title verified. Authors and exact date not confirmed; confirm from the forum page before citing."
summary: "Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods past the easy cases."
tags:
  - rubrics-as-rewards
  - rar
  - non-verifiable-domains
  - reward-shaping
  - rlvr-adjacent
role: "measurement-piece"
harness_types:
  - validation-harness
  - learning-harness
validation_position:
  - immediately-after-generation
validation_mode:
  - interpretive
  - social
  - empirical
domain: "research"
prescription_stance: "mixed"
relation_to_argument:
  - reward-structure-matters
  - domain-structure-matters
  - validation-is-constitutive
dimensions:
  input_legibility:
    score: 3
  task_structure:
    score: 3
  reward_richness:
    score: 3
    note: "Rubric scores are richer than nothing, sparser than a verified pass/fail."
  feedback_latency:
    score: 3
  repairability:
    score: 4
    note: "Rubrics *name* failure modes — that is diagnostic by construction, not just verifiable."
  observability:
    score: 4
  offline_evaluability:
    score: 3
  institutional_ratification:
    score: 3
why_it_matters: "The library's schema intentionally separates reward richness from repairability and input legibility. This entry is a technical illustration: a rubric can score lower than a verified outcome on richness while scoring *higher* on repairability — because the rubric names which dimension failed."
---

An important entry for preserving the library's most subtle disagreement: *verifiable outcome ≠ diagnostic feedback*. Rubrics as Rewards operationalises that distinction by trading some of the "cleanness" of a verified outcome (one bit: passed / failed) for the structured richness of a rubric (multi-dimensional, failure-mode-named, repairable).
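
A toy contrast in the library's own terms (both reward functions are hypothetical, not the paper's implementation): the verified outcome returns one bit, while the rubric returns named dimensions, so a failure arrives pre-diagnosed.

```python
# Toy contrast (hypothetical functions, not the paper's implementation).
def verified_reward(tests_passed: bool) -> int:
    """One bit: clean, but silent about what failed."""
    return 1 if tests_passed else 0

def rubric_reward(scores: dict[str, int]) -> float:
    """Rubric on 1-5 scales: the scalar is noisier than a verified pass,
    but each dimension is named, so a low score localizes the failure."""
    return sum(scores.values()) / (5 * len(scores))

r = rubric_reward({"factuality": 5, "coverage": 4, "tone": 1})
# r ~= 0.67 -- and the dict already says *which* dimension to repair: tone.
```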

### Read alongside

- [Expanding RLVR Across Diverse Domains](/library/expanding-rlvr-across-domains) — the verifiable-outcome pole.
- [Royzen: Standard Signal](/library/royzen-standard-signal) — the domain where the outcome is unusually verifiable.
- [Wallach/Jacobs et al.](/library/wallach-measurement-challenge) — the measurement critique that applies to both rubrics and verifiable rewards.

---

<!-- source: salaudeen-measurement-to-meaning.md -->

---
title: "Measurement to Meaning: A Validity-Centered Framework for AI Evaluation"
author: "Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo"
date: "2025-05-15"
source_type: "paper"
url: "https://arxiv.org/abs/2505.10573"
verification_needed: true
verification_note: "Title, authors, and URL verified; date is best-estimate from the arxiv ID (2505 = May 2025) — confirm exact first-submission date before citing."
excerpt: "The paper provides a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence."
summary: "Proposes a validity-centered framework for AI evaluation that reasons explicitly about which evaluative claims the evidence actually supports, with detailed vision and language case studies. The operational companion to the Wallach/Jacobs position paper."
tags:
  - measurement
  - construct-validity
  - evaluation
  - validity-framework
  - claims-vs-evidence
role: "measurement-piece"
harness_types:
  - validation-harness
  - ratification-harness
validation_position:
  - immediately-after-generation
  - post-deployment
validation_mode:
  - empirical
  - interpretive
  - social
domain: "research"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - validation-is-constitutive
  - observability-matters
  - institutions-shape-capability
dimensions:
  input_legibility:
    score: 3
  task_structure:
    score: 3
  reward_richness:
    score: 2
  repairability:
    score: 3
  observability:
    score: 5
    note: "Observability in the measurement-theoretic sense: what does the evaluation actually let you see?"
  offline_evaluability:
    score: 4
  institutional_ratification:
    score: 4
why_it_matters: "The applied-framework companion to the Wallach/Jacobs position paper. Where the position paper diagnoses the field, this paper hands you a tool: map the evidence you have to the claim you want to make, and refuse to make claims the evidence can't support."
---

Where the [Wallach/Jacobs position paper](/library/wallach-measurement-challenge) argues that generative-AI evaluation is a social-science measurement challenge, this paper supplies the operational framework. Two case studies (vision and language model evaluations) demonstrate how explicitly reasoning about validity strengthens or weakens the claims an evaluation can support.

The central move is refusing the shortcut from *benchmark score* to *capability claim*. A model that does well on a math benchmark may be good at that benchmark, not good at math. A model that does well on graduate-exam-style questions may be good at graduate-exam-style questions, not good at reasoning.

### Useful against

- "Reward richness is the lever" framings — this paper asks which construct the reward even measures.
- "Thin harness, fat skills" — a reminder that the skills you think you are pushing into the model are defined by the evaluations you use to check them.

### Useful for

- Anyone who wants to score a library entry on `institutional_ratification` or `observability` with conceptual grounding rather than intuition.

---

<!-- source: tan-thin-harness-fat-skills.md -->

---
title: "Thin Harness, Fat Skills"
author: "Garry Tan"
date: "2026-04-11"
source_type: "doc"
url: "https://github.com/garrytan/gbrain/blob/master/docs/ethos/THIN_HARNESS_FAT_SKILLS.md"
excerpt: "The 2x people and the 100x people are using the same models. The difference is five concepts that fit on an index card."
summary: "Short, practitioner-facing ethos doc arguing that the durable leverage in agent systems comes from model-resident skills (markdown) and deterministic code at the edges, with the harness kept as thin as possible so each model upgrade flows through."
tags:
  - thin-harness
  - skills
  - scaffolding-skepticism
  - agent-design
  - markdown-skills
role: "practitioner-note"
harness_types:
  - execution-harness
  - interface-harness
  - learning-harness
validation_position:
  - before-generation
validation_mode:
  - empirical
domain: "software"
prescription_stance: "anti-prescriptive"
relation_to_argument:
  - capability-is-extended
  - diffusion-adoption-bottleneck
  - first-mile-input-formation
dimensions:
  input_legibility:
    score: 3
    note: "Assumes inputs are legible enough that heavy shaping is unnecessary — a domain-specific bet."
  task_structure:
    score: 3
  reward_richness:
    score: 2
    note: "Does not foreground reward signal as the key lever."
  repairability:
    score: 2
    note: "Thin-harness framing tends to under-specify where repair loops live."
  observability:
    score: 2
  institutional_ratification:
    score: 1
why_it_matters: "A counterweight to harness-heavy framings. Tracks the prediction that as models get better, elaborate scaffolding becomes dead weight. Useful to read alongside Miessler (harness-engineering) and HumanLayer (sub-agents-as-context-control)."
---

A compact practitioner thesis from the gbrain repo: the productivity gap between 2x and 100x agentic-engineering users is not the model; it is the architectural pattern around the model. The prescription is architectural restraint — push fuzzy operations into markdown *skills*, push must-be-perfect operations into *code*, and keep the *harness* thin so every model improvement flows through automatically.

Companion tweet (same framing, compressed): [@garrytan, "Thin harness, fat skills"](https://x.com/garrytan/status/2043566215927328955).
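
A minimal sketch of the prescribed split (hypothetical names; the gbrain doc is prose, not code): fuzzy judgment lives in markdown, must-be-perfect steps live in plain code, and the harness itself stays small enough for the next model upgrade to pass straight through.

```python
# Sketch of the thin-harness split (hypothetical names). Fuzzy judgment
# lives in markdown skills; must-be-perfect steps live in plain code.
from pathlib import Path

def load_skills(skill_dir: str = "skills") -> str:
    """Fuzzy operations: markdown the model reads, not code we maintain."""
    return "\n\n".join(p.read_text() for p in sorted(Path(skill_dir).glob("*.md")))

def deterministic_edge(output: str) -> str:
    """Must-be-perfect operations stay in code, e.g. strict validation."""
    if not output.strip():
        raise ValueError("empty model output")
    return output.strip()

def run(model, task: str) -> str:
    # The entire harness: concatenate skills + task, call model, validate.
    return deterministic_edge(model(load_skills() + "\n\nTask: " + task))
```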

This is a **sharp disagreement** with framings that treat validation, repair, and context routing as *constitutive* of capability. In Tan's picture, most of that work is either absorbed by the next model or revealed as compensation for a weaker one. In the constitutive picture (Wallach/Jacobs et al.; Salaudeen et al.), those loops are where capability *lives* in practice — no matter how strong the base model.

Keep this entry visible when reading sources that argue the opposite. It marks the pole the library should preserve, not flatten.

### Open questions

- Under what domain conditions is "thin harness" actually enough? (Hypothesis: high offline evaluability, low institutional ratification cost.)
- Does "fat skills" degrade gracefully when inputs are illegible or reward signal is thin?
- What's the smallest counterexample — a task where a fat harness around a weaker model beats a thin harness around a stronger one and *continues* to beat it as models improve?

---

<!-- source: wallach-measurement-challenge.md -->

---
title: "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge"
author: "Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs"
date: "2025-02-02"
source_type: "paper"
url: "https://arxiv.org/abs/2502.00561"
excerpt: "The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons."
summary: "ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts."
tags:
  - measurement
  - construct-validity
  - evaluation
  - social-science
  - icml-2025
  - position-paper
role: "measurement-piece"
harness_types:
  - validation-harness
  - ratification-harness
validation_position:
  - before-generation
  - post-deployment
  - continuous
validation_mode:
  - empirical
  - social
  - institutional
  - interpretive
domain: "research"
prescription_stance: "strongly-procedural"
relation_to_argument:
  - validation-is-constitutive
  - institutions-shape-capability
  - observability-matters
dimensions:
  input_legibility:
    score: 3
  task_structure:
    score: 3
  reward_richness:
    score: 2
    note: "Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability."
  feedback_latency:
    score: 2
  repairability:
    score: 3
  observability:
    score: 4
  offline_evaluability:
    score: 3
    note: "Offline eval is only as good as the construct it ratifies."
  institutional_ratification:
    score: 5
    note: "The paper's four-level framework makes ratification a first-class object of inquiry."
why_it_matters: "The foundational pushback against treating any evaluation number as self-evidencing. If the measurement instrument doesn't validly pick out the construct (reasoning, helpfulness, safety, legal competence), a high score is not a capability claim."
---

Sets up measurement and construct validity as *prior* to evaluation. A benchmark score is a claim about a construct, and the validity of that claim depends on whether the instrument actually measures the construct. The paper argues that most GenAI evaluation skips this step, producing a tangle of sloppy tests and apples-to-oranges comparisons.

The authors import a four-level framework from social-science measurement theory and apply it to GenAI. The argument is explicitly *not* that better metrics solve the problem — it is that capability claims depend on validity work that is social, interpretive, and institutional.

Placed against verifiable-reward framings ([Royzen](/library/royzen-standard-signal); [Expanding RLVR](/library/expanding-rlvr-across-domains)), the tension is direct:

- **Verifiable-reward**: the reward is verifiable when the outcome is checkable.
- **Measurement-validity**: checkability of an outcome does not imply the outcome measures the construct you care about. The "verifiable" in verifiable reward is doing more work than it admits.

Both can be true at once. A narrow technical task (theorem proved, test suite passed) may have near-trivial validity. A broad capability claim (legal reasoning, medical judgment, general agentic competence) almost never does. The library preserves this disagreement structurally — entries can score high on `reward_richness` while scoring low on `input_legibility` and unknown on validity.

### Related entries

- [Measurement to Meaning (Salaudeen et al. 2025)](/library/salaudeen-measurement-to-meaning) — the validity-centered framework applied.
- [Royzen: Standard Signal](/library/royzen-standard-signal) — poster case for reward richness.

---

<!-- source: willison-claude-skills-bigger-than-mcp.md -->

---
title: "Claude Skills are awesome, maybe a bigger deal than MCP"
author: "Simon Willison"
date: "2025-10-16"
source_type: "blog"
url: "https://simonwillison.net/2025/Oct/16/claude-skills/"
excerpt: "A skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts."
summary: "Practitioner synthesis of Anthropic's Agent Skills feature, arguing the markdown-file pattern is conceptually simpler and more token-efficient than MCP, and that the ease of sharing a single file is the feature."
tags:
  - agent-skills
  - markdown-skills
  - mcp
  - progressive-disclosure
  - token-efficiency
role: "synthesis-node"
harness_types:
  - grounding-context-loading
  - execution-harness
  - learning-harness
  - social-harness
validation_position:
  - before-generation
validation_mode:
  - empirical
domain: "software"
prescription_stance: "mixed"
relation_to_argument:
  - capability-is-extended
  - first-mile-input-formation
  - diffusion-adoption-bottleneck
dimensions:
  input_legibility:
    score: 4
    note: "Progressive disclosure — scan metadata, load full skill on demand — is a legibility pattern."
  task_structure:
    score: 4
  reward_richness:
    score: 2
  repairability:
    score: 3
  observability:
    score: 3
  offline_evaluability:
    score: 2
  institutional_ratification:
    score: 3
    note: "Distribution is social: skills spread as shareable markdown files, not packaged tools."
why_it_matters: "Makes visible the argument that the markdown-skill pattern is a diffusion mechanism, not only a technical one. Pair with Tan (thin harness) and Anthropic's engineering post to triangulate what 'skills' actually refer to."
---

Willison argues two things at once:

1. **Conceptual simplicity beats MCP.** A skill is a markdown file; the model knows how to read markdown; a CLI tool with `--help` solves most of what an MCP server solves, at a fraction of the token budget. (The progressive-disclosure mechanics behind that token economy are sketched after this list.)
2. **Distribution is the feature.** Many skills are a single file. The shareability is the point — skills spread.
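
A sketch of the progressive-disclosure pattern (hypothetical layout; Anthropic's actual loader is not reproduced here): scan one description line per skill up front, and pay the token cost of a full body only when a task needs it.

```python
# Progressive disclosure for markdown skills (hypothetical layout):
# scan cheap metadata up front, load a full skill body only on demand.
from pathlib import Path

def scan_skills(skill_dir: str = "skills") -> dict[str, str]:
    """Cheap pass: first line of each file as a one-line description."""
    catalog = {}
    for p in sorted(Path(skill_dir).glob("*.md")):
        text = p.read_text()
        first = text.splitlines()[0] if text else ""
        catalog[p.stem] = first.lstrip("# ").strip()
    return catalog

def load_skill(name: str, skill_dir: str = "skills") -> str:
    """Expensive pass, on demand: the full markdown body."""
    return (Path(skill_dir) / f"{name}.md").read_text()

# The model sees the catalog (a few tokens per skill) and requests a full
# body only when relevant, instead of paying for every tool schema up front.
```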

Read as a synthesis node that connects:

- [Anthropic's Agent Skills announcement](/library/anthropic-agent-skills) — the institutional launch of the pattern.
- [Tan, "Thin Harness, Fat Skills"](/library/tan-thin-harness-fat-skills) — the practitioner ethos that the markdown-skills pattern operationalises.
- [HumanLayer, "Skill Issue"](/library/humanlayer-skill-issue) — what harness-engineering work remains around skills.
