{"meta":{"cutoff_year":2025,"generated_at":"2026-05-14T15:31:58.579Z","count":20,"note":"Extended Capability Library — notes and sources on AI capability in practice. Human-curated, 2025+ only. See /library for the UI and content/library/README.md for the schema."},"entries":[{"slug":"griffin-work-registration-collision-prevention-query","title":"Deep Research Query: Work Registration and Collision Prevention","author":"Daniel S. Griffin","date":"2026-05-05","year":2026,"source_type":"note","excerpt":"Sessions frequently recommend or start work that another session already owns.","summary":"A ChatGPT Deep Research query and result asking how task queues, multi-agent systems, advisory locks, CI/CD systems, collaborative editing tools, and lightweight session manifests handle claim, heartbeat, expiry, and duplicate-work prevention.","notes":"Source packet: Daniel's query to ChatGPT Deep Research, plus the resulting ChatGPT Deep Research report. The result's citation markers are session-local and need durable source-link cleanup before formal citation.","why_it_matters":"This is a local operating-system problem for agent work: the effective capability of several strong coding agents is bounded by whether the surrounding harness can prevent duplicate work, stale ownership, and invisible in-flight 
state.","tags":["deep-research-query","work-registration","collision-prevention","multi-agent","session-registry","sqlite","advisory-locks","heartbeat","orders","orchestration"],"role":"framework-piece","harness_types":["execution-harness","validation-harness","monitoring-harness","social-harness","interface-harness"],"validation_position":["before-action","continuous"],"validation_mode":["mechanical","social","institutional"],"domain":"software","prescription_stance":"strongly-procedural","relation_to_argument":["capability-is-extended","validation-is-constitutive","observability-matters","breakdown-when-harness-absent","institutions-shape-capability","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"The whole prompt is about making otherwise invisible concurrent agent work visible before new work begins.","score":5},"repairability":{"note":"A claim registry would make collisions diagnosable and recoverable, but only if the protocol records enough state to distinguish active, blocked, stale, and abandoned work.","score":4},"reversibility":{"note":"Crash-safe leases and expiring claims are designed to avoid permanent blockage while preserving evidence of abandoned work.","score":4},"task_structure":{"note":"It frames work ownership as a state-machine problem: claim, heartbeat, expiry, completion, abandonment, and migration.","score":5},"reward_richness":{"note":"The target protocol can observe collisions and stale claims, but success is partly social adoption rather than a single crisp reward.","score":3},"feedback_latency":{"note":"The stated preflight budget is under three seconds, and the heartbeat/expiry question asks how quickly abandoned work should be visible.","score":4},"input_legibility":{"note":"The query turns a local operational pain into a concrete research object with constraints, existing tools, and desired decision outputs.","score":4},"offline_evaluability":{"note":"The proposed solution space is local-first: SQLite, 
filesystem manifests, git worktrees, and reproducible preflight checks.","score":4},"institutional_ratification":{"note":"Orders, Linear issues, status fields, and human interruption rights make ownership a ratified workflow state, not merely a process-local lock.","score":4}},"annotation":"This entry is a new kind of library object: a **Deep Research packet**. The source is Daniel's query to ChatGPT Deep Research and the resulting ChatGPT Deep Research report. It is preserved because the prompt and result expose the operating theory behind the system being built.\n\nThe capability-library lens is straightforward: work collision is not a model-quality failure. It is a harness failure. Multiple sessions can each be locally reasonable and still globally collide if ownership is not legible, claimable, observable, and recoverable. A work registry is therefore not bureaucracy around capability; it is part of the capability surface.\n\n## Extended Capability Read\n\nThe query asks for a minimal offline protocol because the local constraint matters: no daemon, no cloud coordinator, no shared context window, and a human as the only interrupt channel. That rules out a lot of distributed-systems theater and points toward a small lease model:\n\n- atomic claim before action,\n- visible ownership before recommendation,\n- heartbeat while active,\n- expiry after silence,\n- explicit completion or abandonment,\n- migration that tolerates old sessions ignoring the protocol at first.\n\nThe interesting disagreement is not whether a registry is theoretically cleaner. It is whether the registry becomes a real gate or another voluntary signal beside the existing session feed. The prompt names adoption failure as a first-class risk, which is exactly the kind of social-harness issue this library should keep visible.\n\n## Capability Mapping\n\nThe result maps the coordination problem onto **leases** rather than permanent locks. 
That distinction is the load-bearing concept for this library entry: capability comes from a small lifecycle around the model's work, not from the model remembering more context. The lifecycle is claim, heartbeat, expiry, release, and human-visible override.\n\nMapped to the library's dimensions:\n\n- **Execution harness:** `work-on ORDER_ID` becomes the required start-work path. The harness does not merely advise the agent; it changes what actions are allowed.\n- **Monitoring harness:** `session_presence`, `order_claims`, heartbeat timestamps, stale thresholds, and worktree metadata turn invisible terminal work into observable system state.\n- **Validation harness:** preflight checks move from reading advisory status text to checking the authoritative lease table before recommending or beginning work.\n- **Repair harness:** expiry and reclaim convert crash recovery from manual cleanup into a recoverable state transition.\n- **Social/interface harness:** Order frontmatter, session feed, Linear, and worktrees remain mirrors for humans, while SQLite becomes the exclusion source of truth.\n- **Institutional ratification:** the result makes ownership a ratified workflow state: not \"the agent thinks it is working,\" but \"the local work system recognizes this claim.\"\n\nThe result also clarifies a shelf-level distinction: multi-agent frameworks and memory systems are not the same thing as a cross-session work registry. CrewAI, AutoGen, LangGraph, and Swarm may provide orchestration, persistence, checkpoints, or handoffs inside their runtimes, but the collision problem here lives between independent shells. 
That places this entry beside [OpenAI, \"Symphony\"](/library/openai-symphony-codex-orchestration), but at a smaller local scale: issue tracker as control plane there, SQLite lease registry here.\n\nThe new vocabulary this adds to the library:\n\n- **Lease, not lock:** ownership should expire unless refreshed.\n- **Presence is not ownership:** online/session liveness can inform UI, but it should not decide durable task ownership.\n- **Mirrors are not source of truth:** markdown, feed events, Linear, and worktrees are useful diagnostic surfaces, but exclusion needs an atomic shared substrate.\n- **Mandatory path beats advisory memory:** duplicate-work prevention fails if claim registration remains voluntary.\n\nThe main unmapped residue is citation hygiene. The pasted result includes ChatGPT Deep Research session-local citation markers. Those markers are useful provenance inside the originating session, but they are not durable public citations yet.\n\n## Appendix A: Deep Research Result\n\nCitation note: the `turn...` citation markers below came from the Deep Research session and still need replacement with durable public source links.\n\n### Bottom line\n\nThe pattern that keeps reappearing across task queues, durable workflow engines, CI/CD systems, and collaborative-presence systems is not a permanent lock; it is a **lease**: a worker atomically claims a unit of work, refreshes liveness while it is active, and another worker can reclaim only after expiry. That is explicit in BullMQ's job locks and renewals, Temporal's task queues plus activity heartbeats and retries, Sidekiq Pro's durable fetch plus orphan recovery, and Celery's late acknowledgements plus visibility-timeout redelivery. 
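\n\nThe expiry side of that lease pattern is cheap to check; a minimal sketch of a preflight triage query, assuming the `order_claims` columns proposed later in this report, `:`-prefixed bound parameters, and a hypothetical `:soft_stale_ms` threshold for the soft-stale window:\n\n```sql\n-- Classify every claim so preflight can distinguish live, soft-stale\n-- (missed a few heartbeats), and reclaimable (expired) work.\nSELECT order_id,\n       session_id,\n       CASE\n         WHEN lease_expires_at_ms < :now_ms THEN 'reclaimable'\n         WHEN heartbeat_at_ms < :now_ms - :soft_stale_ms THEN 'soft-stale'\n         ELSE 'live'\n       END AS claim_state\nFROM order_claims;\n```\n\nA read like this is the entire query side of the protocol, which is what keeps the preflight budget comfortably under three seconds.\n\n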
On a single machine with SQLite, you can implement the same idea with one `order_claims` table, one `session_presence` table, `BEGIN IMMEDIATE` for atomic claim attempts, and an expiring `lease_expires_at` column.\n\nFor your constraints, the lightest robust design is therefore **not** \"make sessions check the feed more carefully,\" and **not** \"put a `.lock` file next to each Order.\" It is a **SQLite-backed work registry** that is mandatory on session start, mandatory before work begins, and refreshed by a tiny per-session heartbeat helper. Order frontmatter, the session feed, Linear, and visible worktrees should become **mirrors for humans and diagnostics**; the SQLite lease row should become the **source of truth for exclusion**. SQLite's writer serialization and `BEGIN IMMEDIATE` are exactly what you want for an offline, single-machine, no-daemon claim path.\n\nThe other big finding is that the named multi-agent frameworks do **not** generally solve this exact problem for you. Their docs emphasize workflow orchestration, persisted state, memory, checkpoints, or stateless handoffs within a runtime; they do not expose a first-class cross-terminal \"claim board\" that independent coding sessions must consult before touching the same task. In other words, your coordination problem is closer to a local task-queue lease problem than to an agent-memory problem.\n\n### Claim semantics in task queues and durable runtimes\n\nIn Celery, the core tradeoff is visible in the ack policy. By default, Celery acknowledges a task **before** execution so an already-started task is not executed again; if you enable `acks_late`, the broker acknowledgement happens **after** execution instead. Celery also documents Redis/SQS-style `visibility_timeout`, where an unacked message is redelivered after the timeout, and `task_reject_on_worker_lost` if you want requeue on abrupt worker loss. 
This is the classic \"claim + ack/retry + expiry\" family, but with the usual at-least-once tradeoff: safer recovery means more duplicate-execution risk, which is why Celery tells you to make tasks idempotent.\n\nSidekiq's docs state the same lesson even more bluntly. Its default fetch path uses `BRPOP`, which removes the job from Redis immediately; if the process crashes mid-job, the job can be lost. Sidekiq Pro's `super_fetch` switches to `LMOVE` so the job stays in Redis until completion, then recovers orphaned jobs after process-heartbeat expiry. Sidekiq Enterprise's unique-jobs feature adds a second lesson that matters directly for your design: uniqueness is **best effort**, the lock always has a TTL, and that TTL is mandatory because otherwise crash-held locks would last forever.\n\nBullMQ is almost a textbook lease implementation. A worker places a lock on the job when processing begins, renews it at `lockRenewTime` (by default half of `lockDuration`), and if renewal stops the job is considered stalled and can be restarted. BullMQ explicitly warns that a worker that dies, or a CPU-bound processor that blocks the event loop long enough to miss renewal, can cause a job to be double-processed. That is the exact failure shape you are trying to avoid, and it is also why your scheme needs both heartbeats and idempotent completion paths.\n\nTemporal wraps the same ideas in a stronger durable-execution model. The service persists workflow history, workers poll task queues, activity tasks are picked up by one worker attempt at a time, and the API presents an \"effectively once\" experience at the workflow level even though multiple activity-task executions may happen underneath due to retry or timeout. Activity heartbeats can carry progress so a retry can resume from the last checkpoint; Temporal's docs recommend heartbeats for long-running activities and show typical heartbeat timeouts on the order of a minute. 
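\n\nIn the SQLite translation this report converges on, that heartbeat is a single guarded write; a minimal sketch, assuming the `order_claims` columns proposed later, `:`-prefixed bound parameters, and a hypothetical `:lease_duration_ms` lease length:\n\n```sql\n-- Refresh the lease only while this session still holds a live claim.\n-- Zero rows updated means the lease expired and may have been stolen,\n-- so the session should stop work instead of continuing blind.\nUPDATE order_claims\nSET heartbeat_at_ms = :now_ms,\n    lease_expires_at_ms = :now_ms + :lease_duration_ms\nWHERE order_id = :order_id\n  AND session_id = :session_id\n  AND lease_expires_at_ms >= :now_ms;\n```\n\nChecking the changed-row count turns a missed renewal into a detectable event rather than a silent divergence, mirroring BullMQ's stalled-job detection.\n\n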
The important design takeaway is that **claim state lives in durable shared storage, not in the worker's private context window**.\n\nFor a local SQLite-backed system, the minimum useful translation of all of that is: **one live row per claimable Order, acquired atomically, renewed periodically, released on completion, and stealable only after expiry**. Everything else (feeds, markdown status, worktrees, and issue trackers) should enrich observability, not decide ownership.\n\n### What agent frameworks, locks, CI/CD, and collaborative tools provide\n\nAgent frameworks such as CrewAI, AutoGen, LangGraph, and Swarm point in the same direction. CrewAI Flows own state and execution order, and its persistence can save Flow state to a database so execution can resume after a crash. AutoGen focuses on agent lifecycle, communication, and deterministic patterns inside a runtime. LangGraph saves checkpoints for fault-tolerant execution and human-in-the-loop resumes. Swarm is stateless between calls and stores no state between runs. These are useful primitives, but none is a built-in cross-process claim registry for independent terminal sessions on the same repo.\n\nThe practical implication is that \"multi-agent memory\" is not enough to prevent collisions when the agents are separate shells with separate context windows. Duplicate-work prevention needs to live in a shared substrate all sessions read before acting. In LangGraph that substrate could be a store/checkpointer; in CrewAI, persisted Flow state; in Swarm, something entirely external because the framework itself is stateless. In every case, the lock or lease must be **mandatory in the start-work path**, not an optional memory artifact.\n\nSQLite gives two decisive advantages for this use case: single-writer serialization and transactional state. 
SQLite transactions are serializable, there can only be one writer at a time, and `BEGIN IMMEDIATE` starts the write transaction up front so later writes inside that transaction do not fail with a surprise `SQLITE_BUSY`. WAL mode lets readers continue while a writer appends to the WAL, which is what you want when several sessions frequently read live claims and only occasionally write them. WAL is not the claim protocol; it is the concurrency mode that makes a claim table pleasant to use.\n\n`flock()` is good if all you need is same-host exclusion on one file. Its lock is released when the file descriptors close, so a crashed process drops its lock automatically on that host. The downside is that `flock()` is not a queryable registry. It does not naturally answer \"who currently owns this Order, since when, with what issue, and when does the claim expire?\" without a second metadata channel.\n\nPlain `.lock` files have the opposite tradeoff. They are inspectable and easy to create, but they do not auto-release on crash. Git's `index.lock` is the canonical cautionary example: stale lock files can remain after early exit and require manual cleanup once no process is active. That is tolerable for occasional repository maintenance; it is a poor default for autonomous sessions that need reclaimable, crash-safe work ownership.\n\nCI/CD systems serialize conflict-prone work by attaching it to a centralized concurrency key. GitHub Actions concurrency groups allow at most one running and one pending job, with optional cancellation of the running one. GitHub self-hosted runner jobs are re-queued if the runner does not pick them up quickly. GitLab `resource_group` forces jobs such as deployments to run one at a time and offers process modes like `oldest_first` and `newest_first`. Buildkite exposes branch-level dedupe controls and agent heartbeat health. 
The pattern is centralized scheduling authority plus explicit concurrency scope.\n\nCollaborative systems separate **durable ownership** from **ephemeral presence**. Yjs Awareness is for \"who is online?\" and cursor/presence state; peers can be marked offline after missed updates. That is good for \"someone is here\" UI but too ephemeral to be durable task ownership. Project-management tools similarly signal \"someone is working on this\" through explicit assignee and workflow fields rather than crash-detecting session leases.\n\n### Coordination mechanisms compared\n\n| mechanism | latency | crash safety | complexity | requires daemon | offline-capable | example system |\n|---|---|---|---|---|---|---|\n| SQLite lease row with expiry and heartbeat | very low; one local transaction | high, because claims are reclaimable after missed heartbeats | medium | no | yes | BullMQ/Temporal-style lease and renew semantics |\n| Session manifest directory with TTL | very low; local file reads/writes | medium; good for presence, weaker for atomic takeover | low | no | yes | Yjs-style awareness/presence pattern |\n| `flock()` lockfile plus separate metadata file | very low | high for same-host process death; kernel drops lock on close | low to medium | no | yes | Linux advisory file locking |\n| Plain `.lock` file with PID/timestamp | very low | low; stale files remain after early exit/crash | low | no | yes | Git `index.lock` behavior |\n| Event log or heartbeat feed only | low | low for exclusion; good for observability only | low | no | yes | Celery monitoring / worker-heartbeat events |\n| Git worktree / branch heuristic | low | low; worktrees show activity but do not grant or release ownership atomically | low | no | yes | Git worktree plus issue/branch conventions |\n| Central concurrency group / scheduler | low control-plane latency | high | high | yes | usually no | GitHub Actions, GitLab resource groups, Buildkite branch-build cancellation |\n\nThe best fit is a 
**hybrid**: SQLite for the authoritative claim lease, Order markdown and the session feed for visibility, and worktree/Linear metadata as supporting context shown to humans when a claim is stale or disputed.\n\n### Recommended minimal protocol\n\nThe smallest trustworthy design is fully offline, uses no central daemon, stays inside a three-second preflight budget, and degrades safely when a session crashes.\n\n1. **Register session existence at startup, unconditionally.** Replace voluntary session-feed posting with a required launcher step that upserts a `session_presence` row keyed by `session_id` and records host, PID, start time, current tab title, Linear issue, and `last_seen_at`.\n\n2. **Preflight every recommendation against live claims, not status text.** `pitch` and `suggest` should read SQLite first: which Orders are `ready`, which are claimed with an unexpired lease, which are stale, and which have matching worktrees or Linear issues.\n\n3. **Make start work an atomic claim transaction.** The only supported path to `in_progress` should be a wrapper such as `work-on ORDER-143`. Inside one SQLite transaction, it attempts to acquire or steal an expired claim. `BEGIN IMMEDIATE` reserves the writer slot before inspecting and updating the claim row.\n\n4. **Mirror after commit, never before commit.** Only after the claim transaction commits should the wrapper update Order status, write `claimed_by` / `claimed_at` / `lease_expires_at` into frontmatter, emit the session-feed event, create or record the worktree, and optionally update Linear.\n\n5. **Refresh heartbeats out-of-band from the model's reasoning loop.** Do not ask the agent to remember to heartbeat. Start a tiny per-session helper that updates both `session_presence.last_seen_at` and the relevant `order_claims.heartbeat_at / lease_expires_at` every 30 seconds.\n\n6. 
**Release on completion in the same place work completion is recorded.** A `complete-order` command should mark the claim released, record completion timestamps, and write the session-feed completion event. Keep a separate append-only `claim_history` table if audit history matters.\n\n7. **Treat crashes as missed-heartbeat expiry, not immortal locks.** If the session disappears, the helper stops heartbeating. After expiry, the claim becomes reclaimable automatically.\n\n8. **Require a human-visible override path, but only for expired or obviously stale claims.** The override UI should show incumbent session, last heartbeat, worktree path, Linear issue, and whether the claim is stale or expired.\n\nMinimal schema:\n\n```sql\nPRAGMA journal_mode = WAL;\nPRAGMA busy_timeout = 1500;\n\nCREATE TABLE IF NOT EXISTS session_presence (\n  session_id TEXT PRIMARY KEY,\n  host TEXT NOT NULL,\n  pid INTEGER,\n  started_at_ms INTEGER NOT NULL,\n  last_seen_at_ms INTEGER NOT NULL,\n  tab_title TEXT,\n  linear_issue_id TEXT,\n  current_order_id TEXT\n);\n\nCREATE TABLE IF NOT EXISTS order_claims (\n  order_id TEXT PRIMARY KEY,\n  session_id TEXT NOT NULL,\n  claimed_at_ms INTEGER NOT NULL,\n  heartbeat_at_ms INTEGER NOT NULL,\n  lease_expires_at_ms INTEGER NOT NULL,\n  worktree_path TEXT,\n  linear_issue_id TEXT,\n  FOREIGN KEY (session_id) REFERENCES session_presence(session_id)\n);\n\nCREATE INDEX IF NOT EXISTS idx_order_claims_expires\n  ON order_claims (lease_expires_at_ms);\n```\n\nAtomic claim attempt:\n\n```sql\nBEGIN IMMEDIATE;\n\nINSERT INTO order_claims (\n  order_id, session_id, claimed_at_ms, heartbeat_at_ms,\n  lease_expires_at_ms, worktree_path, linear_issue_id\n)\nVALUES (\n  :order_id, :session_id, :now_ms, :now_ms,\n  :lease_expires_at_ms, :worktree_path, :linear_issue_id\n)\nON CONFLICT(order_id) DO UPDATE SET\n  session_id = excluded.session_id,\n  claimed_at_ms = excluded.claimed_at_ms,\n  heartbeat_at_ms = excluded.heartbeat_at_ms,\n  lease_expires_at_ms = 
excluded.lease_expires_at_ms,\n  worktree_path = excluded.worktree_path,\n  linear_issue_id = excluded.linear_issue_id\nWHERE order_claims.lease_expires_at_ms < :now_ms\nRETURNING order_id, session_id, lease_expires_at_ms;\n\nCOMMIT;\n```\n\nThe lock granularity should be the Order ID, not the repository, not the worktree, and not the entire Linear issue set. The source of truth should be the SQLite lease row, not markdown frontmatter. The mandatory path should be launcher plus `work-on` / `complete-order` wrappers, not model instructions embedded in prompts.\n\n### Anti-patterns and failure modes\n\n- **Removing or acknowledging work before the worker is actually safe.** If you mark an Order `in_progress` in markdown before the SQLite claim commits, you rebuild the same race queue systems try to avoid.\n- **Using infinite or very long locks.** \"Claim until manually cleared\" produces zombie Orders that nobody trusts enough to touch.\n- **Making leases too short for the actual execution environment.** Heartbeats tied to the model's main loop can miss beats during long tool calls, blocked subprocesses, or terminal suspension.\n- **Using presence as ownership.** Presence should inform UI; the lease row should decide who owns work.\n- **Picking the wrong lock granularity.** Too broad cancels unrelated work; too narrow still permits collisions.\n- **Assuming queue order guarantees you do not actually have.** \"Once at a time\" and \"which one goes next\" are different policies.\n- **Treating plain `.lock` artifacts as crash-safe locks.** Stale files become operational toil.\n- **Leaving the registry voluntary.** A system that agents may ignore is not on the critical path and will lose to locally rational behavior.\n\n### Heartbeat and expiry defaults\n\n| parameter | with a tiny per-session heartbeat helper | without a helper |\n|---|---:|---:|\n| heartbeat interval | 30s | 60s |\n| soft-stale threshold | 90s | 180s |\n| hard-expiry / reclaimable | 180s | 300s |\n| 
startup grace before showing stale | 60s | 90s |\n| retain released claim history | 24h to 7d | 24h to 7d |\n\nWith 10-15 active sessions, a 30-second helper heartbeat imposes trivial write load on SQLite and gives fast enough \"someone is still on this\" feedback. A soft-stale state after roughly three missed beats lets `pitch` downgrade the Order without automatically stealing it. A hard expiry after roughly six missed beats is long enough to avoid false positives from transient pauses, but short enough that a dead session does not block the team for half an hour. If heartbeats depend on the session's main work loop, use looser leases.\n\nThe tuning rule during rollout is: **prefer false negatives over false positives**. A stale claim that is reclaimable a bit later is annoying. A false live claim that blocks legitimate work trains humans and agents to route around the system.\n\n### Migration path\n\nThe safest retrofit is to introduce the registry in **shadow mode**, then raise it to **soft enforcement**, then to **hard enforcement** once the launcher path is universal.\n\nStart by extending the existing startup path, currently `set-tab-title`, so every new session registers in `session_presence`. In this phase, `pitch` only warns: \"live claim by session X,\" \"stale claim,\" \"legacy in-progress without claim,\" and \"matching worktree exists.\" Nothing blocks yet.\n\nNext, add a mandatory `work-on ORDER_ID` wrapper. That wrapper performs the atomic claim, then mirrors the result into Order status/frontmatter, the session feed, and optional Linear updates. Legacy sessions that directly flip markdown to `in_progress` still exist, but they are classified as **legacy occupancy hints** rather than hard locks.\n\nAfter that, move to soft enforcement. `pitch` should stop recommending Orders with a live unexpired lease. `complete-order` should refuse to complete an Order unless the completing session owns the current claim or a human passes an override. 
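\n\nThat ownership check is one read against the lease table; a minimal sketch, assuming the `order_claims` schema above and `:`-prefixed bound parameters:\n\n```sql\n-- complete-order guard: one row back means the completing session\n-- holds a live claim; no rows means refuse unless a human overrides.\nSELECT order_id\nFROM order_claims\nWHERE order_id = :order_id\n  AND session_id = :session_id\n  AND lease_expires_at_ms >= :now_ms;\n```\n\n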
`in_progress` without a live claim should remain visible but suspicious rather than authoritative.\n\nFinally, move to hard enforcement when almost all sessions use the wrapper path. Direct edits that set `in_progress` should be linted or rejected, `pitch` should treat the SQLite lease as canonical, and the event log should become historical telemetry instead of arbitration.\n\nThree compatibility rules make the migration safer:\n\n1. Old sessions may continue to work, but they do not create authoritative claims.\n2. New sessions may see legacy state and warn on it, but only a live lease can block.\n3. Human-visible mirrors remain important, so claim metadata should still be written into Orders and the session feed even after SQLite becomes canonical.\n\n## Appendix B: Query\n\n# Deep Research: Work Registration & Collision Prevention in Multi-Agent/Multi-Session Systems\n\n## Context\n\nI run a solo developer operation where 3-8 Claude Code sessions run simultaneously on the same codebase, each in its own terminal tab (sometimes also Codex and Cursor agents). Each session has its own context window and makes locally rational decisions. Work is tracked via \"Orders\" (structured markdown specs with status fields: draft -> ready -> in_progress -> complete) and Linear issues.\n\nThe problem: sessions frequently recommend or start work that another session already owns. A \"pitch\" skill recommends the highest-leverage next action, but it has no reliable way to know what's in-flight. A session feed (SQLite-backed event log) exists but sessions only post to it voluntarily - there's no mandatory registration or heartbeat. 
Order status fields exist but aren't updated atomically when a session begins work.\n\n**Architecture constraints:**\n- Sessions are ephemeral processes (no persistent daemon)\n- No shared memory between sessions - coordination must happen through filesystem or SQLite\n- Sessions can read each other's state but can't signal each other\n- Human is the only entity that can interrupt a running session\n- Latency budget for preflight checks: <3 seconds\n- Must work offline (no cloud coordination service)\n- ~220 orders exist, ~10-15 are active at any time\n\n**Current tools:**\n- `session-feed` (SQLite): event log with session.started, status, completed, session.ended events - but posting is manual/voluntary\n- `set-tab-title`: registers session existence + issue ID at startup\n- Order files: have `status` field but no `claimed_by` or `claimed_at`\n- Git worktrees: each active implementation gets its own worktree (observable via `git worktree list`)\n- Pitch/suggest skills: read session feed and order status before recommending, but don't enforce exclusion\n\n## Questions\n\n1. How do distributed task-queue systems (Celery, Temporal, Sidekiq, BullMQ) handle \"claim\" semantics - specifically the pattern where a worker must atomically claim a task before executing it, and other workers must see that claim before picking up the same task? What's the minimal implementation of this for a SQLite-backed single-machine system?\n\n2. What patterns exist in multi-agent AI systems (CrewAI, AutoGen, LangGraph, OpenAI Swarm) for preventing duplicate work across concurrent agents? Do any use a \"work registry\" or \"task board\" that agents check before starting? How do they handle the case where an agent starts work but crashes before completing?\n\n3. In distributed systems literature, what's the lightest-weight protocol for \"advisory locks\" that doesn't require a persistent coordinator process? 
Specifically interested in file-based or SQLite-based approaches that survive process crashes (no zombie locks). How do systems like SQLite's WAL mode, flock(), or .lock files compare for this use case?\n\n4. What do CI/CD systems (GitHub Actions, GitLab CI, Buildkite) do to prevent duplicate pipeline runs for the same commit/branch? How do they handle the \"claim + heartbeat + expiry\" lifecycle - and what's the minimum heartbeat interval that balances staleness detection against overhead?\n\n5. How do collaborative editing systems (CRDTs, OT) and project management tools (Linear, Jira, Asana) signal \"someone is working on this\" to other users - specifically the UX patterns for showing claimed/in-progress state and the backend mechanisms for detecting abandoned claims (user closed their tab, session crashed)?\n\n6. What are common failure modes when advisory-lock systems are retrofitted onto existing workflows? Specifically: (a) false positives that block legitimate work, (b) stale locks from crashed sessions, (c) lock granularity mistakes (too broad = blocking, too narrow = collisions still happen), (d) adoption failure where agents/users ignore the system because it's not mandatory.\n\n7. Are there lightweight \"session manifest\" patterns where each active worker/agent writes a heartbeat file (e.g., JSON with PID, timestamp, task ID) and other workers read the manifest directory to see what's claimed? How do these compare to SQLite-based approaches for reliability and latency?\n\n## Desired Output\n\n1. **Comparison table** of coordination mechanisms with columns: mechanism, latency, crash safety, complexity, requires daemon, offline-capable, example system\n2. **Recommended minimal protocol** for my constraints (SQLite, no daemon, <3s preflight, crash-safe) - step-by-step lifecycle from \"session wants to start work\" through \"session completes or crashes\"\n3. **Anti-patterns list** - what NOT to do, with real examples of systems that got burned\n4. 
**Heartbeat/expiry parameters** - recommended intervals and timeout values for a system where sessions last 5-120 minutes\n5. **Migration path** - how to retrofit this onto an existing system where sessions already run without registration, without breaking current workflows or requiring all sessions to update simultaneously","verification_needed":true,"verification_note":"Replace ChatGPT Deep Research session-local citation markers with durable public source links before treating the result as formally cited."},{"slug":"nous-hermes-agent-readme","title":"Hermes Agent README","author":"Nous Research","date":"2026-04-28","year":2026,"source_type":"doc","url":"https://github.com/nousresearch/hermes-agent","excerpt":"The self-improving AI agent built by Nous Research.","summary":"The Hermes Agent README presents an open agent harness with model-provider switching, terminal and messaging interfaces, scheduled automations, isolated subagents, toolsets, persistent memory, session search, and a closed learning loop around skills.","notes":"README inspected on GitHub by Codex on Apr 29, 2026. Date is the capture date for this dynamic README snapshot. 
This entry was prepared with Codex (OpenAI).","why_it_matters":"Hermes is an example of the harness conversation moving beyond coding alone: a persistent, multi-surface, model-agnostic agent with memory, skills, automations, and self-improvement loops.","tags":["hermes-agent","nous-research","self-improving-agent","skills","memory","messaging-gateway","subagents"],"role":"case-study","harness_types":["input-shaping","grounding-context-loading","execution-harness","repair-harness","monitoring-harness","learning-harness","social-harness","interface-harness"],"validation_position":["before-generation","during-generation","immediately-after-generation","continuous"],"validation_mode":["mechanical","empirical","social"],"domain":"cross-domain","prescription_stance":"strongly-procedural","relation_to_argument":["capability-is-extended","first-mile-input-formation","repairability-matters","observability-matters","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"Terminal UI, command history, streaming tool output, diagnostics, and session search make agent behavior inspectable.","score":4},"repairability":{"note":"The system can improve skills during use, create skills from experience, search past sessions, and persist knowledge.","score":4},"reversibility":{"note":"Retry/undo commands are present, but the README does not foreground a broad rollback model.","score":3},"task_structure":{"note":"The README describes a full harness surface: CLI/TUI, messaging gateway, scheduler, tools, backends, model providers, subagents, and skills.","score":5},"reward_richness":{"note":"Hermes emphasizes learning from experience and skill improvement, but the README does not define a single reward signal.","score":3},"feedback_latency":{"note":"Interactive CLI, messaging interrupts, tool output, scheduled jobs, and session search create frequent feedback opportunities.","score":4},"input_legibility":{"note":"Slash commands, personalities, skills, memory, and 
cross-session search make user intent and prior context available to the model.","score":4},"offline_evaluability":{"note":"Research tooling and batch trajectory generation suggest evaluability, but this is not the main README argument.","score":3},"institutional_ratification":{"note":"The README is user/harness-oriented rather than focused on organizational approval or governance.","score":2}},"annotation":"The Hermes README is valuable as a productized harness inventory. It does not present a single new model capability. It presents the surrounding system: model-provider switching, a terminal UI, messaging gateways, scheduled automations, persistent memory, skills, subagents, session search, toolsets, terminal backends, and research tooling.\n\nThe distinctive claim is the closed learning loop. Hermes says it can create skills from experience, improve skills during use, nudge itself to persist knowledge, search past conversations, and build a user model across sessions. That is a direct capability-extension claim: the agent becomes more useful not only because the model changes, but because the harness accumulates procedural and contextual memory.\n\n### Extended Frontier Read\n\nHermes makes the \"agent harness\" category concrete across several surfaces:\n\n- **interface harness**: CLI/TUI plus Telegram, Discord, Slack, WhatsApp, Signal, and email gateway;\n- **learning harness**: skill creation, skill improvement, memory nudges, session search;\n- **execution harness**: local, Docker, SSH, Daytona, Singularity, and Modal terminal backends;\n- **social harness**: cross-platform continuity, user modeling, scheduled reports;\n- **subagent harness**: isolated parallel workstreams and RPC-style tool scripts.\n\nThis is not just \"a chatbot with tools.\" It is an attempt to make an agent live where the user lives, remember what matters, and turn repeated work into skills.\n\n### Open Questions\n\n- How much of the self-improvement loop is automatic versus 
user-confirmed?\n- Which skills improve reliably during use, and which drift?\n- What validation or audit trail exists when memory and user modeling become part of the harness?","verification_needed":true,"verification_note":"README content verified from GitHub snapshot. Date is access/capture date, not a stable publication date."},{"slug":"openai-symphony-codex-orchestration","title":"An open-source spec for Codex orchestration: Symphony","author":"Alex Kotliarskyi, Victor Zhu, and Zach Brock","date":"2026-04-26","year":2026,"source_type":"blog","url":"https://openai.com/index/symphony-codex-orchestration/","excerpt":"The agents were fast, but we had a system bottleneck: human attention.","summary":"OpenAI describes Symphony, a spec and reference implementation that turns issue trackers such as Linear into always-on control planes for coding agents, shifting humans from supervising sessions to managing work.","notes":"Source text supplied by Daniel from OpenAI's April 27, 2026 engineering post. This entry was prepared with Codex (OpenAI).","why_it_matters":"Symphony is an explicit account of the next bottleneck after coding-agent capability: organizing agentic work. 
It treats orchestration, workflow documentation, issue state, CI, and review as capability infrastructure.","tags":["codex","symphony","orchestration","issue-tracker","linear","agent-management","app-server"],"role":"framework-piece","harness_types":["execution-harness","validation-harness","repair-harness","monitoring-harness","learning-harness","social-harness","interface-harness"],"validation_position":["before-generation","during-generation","immediately-after-generation","before-action","post-deployment","continuous"],"validation_mode":["mechanical","empirical","social","institutional"],"domain":"software","prescription_stance":"strongly-procedural","relation_to_argument":["capability-is-extended","validation-is-constitutive","repairability-matters","observability-matters","institutions-shape-capability","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"Symphony foregrounds logs, status surfaces, review packets, videos, Linear state, and operator visibility.","score":5},"repairability":{"note":"The system rebases, resolves conflicts, retries flaky checks, restarts stalled agents, and feeds failures back into guardrails and skills.","score":5},"reversibility":{"note":"Per-issue workspaces and PR review preserve isolation and throwaway explorations, though rollback policy is implementation-specific.","score":4},"task_structure":{"note":"The issue tracker becomes a state machine/control plane with per-issue workspaces, retries, statuses, and dependencies.","score":5},"reward_richness":{"note":"CI, reviews, issue state transitions, PR landing, videos, and human review all become feedback signals.","score":4},"feedback_latency":{"note":"Agents continuously observe issue state, CI, review feedback, and runtime failures, though some feedback waits on human review.","score":4},"input_legibility":{"note":"Issues, WORKFLOW.md, project state, and review packets turn ambiguous work into agent-readable 
objectives.","score":4},"offline_evaluability":{"note":"Software tasks have tests, CI, smoke tests, Chrome DevTools checks, and reproducible workspaces.","score":5},"institutional_ratification":{"note":"The issue tracker, review statuses, PM/designer requests, and human review make acceptance institutional rather than merely technical.","score":5}},"annotation":"Symphony is a control-plane argument. OpenAI's team found that interactive coding agents were already capable enough to create a new bottleneck: engineers could only supervise a few sessions before context switching overwhelmed them. Symphony responds by moving the unit of management from \"agent session\" to \"project work.\"\n\nIn the described setup, Linear is not just a queue. It becomes the state machine for agent work. Every eligible issue gets an isolated workspace and a running agent. The orchestrator watches issue states, starts work, restarts stalled agents, handles retries, respects blockers, follows dependency DAGs, and lets agents file follow-up issues when they discover work outside the current scope.\n\nThis is a strong example of **institutional scaffolding as capability**. The agents did not simply get better at coding. The work became more delegable because the surrounding system changed: issues became objectives, WORKFLOW.md captured implicit development norms, CI and QA became part of the run loop, and humans reviewed packets instead of steering terminals.\n\n### Extended Frontier Read\n\nThe key sentence for this library is the attention bottleneck: the agents were fast, but humans were still micromanaging them. 
Symphony extends capability by changing the coordination layer:\n\n- issue tracker as control plane,\n- per-issue workspaces,\n- agent sessions abstracted behind tickets,\n- CI/rebase/conflict handling in the loop,\n- review packets and videos for human ratification,\n- WORKFLOW.md as versioned organizational knowledge,\n- agent-created follow-up work.\n\nThat turns \"can a model implement this task?\" into \"can the organization make useful agent work cheap to initiate, observe, review, and land?\"\n\n### Tension\n\nThe post is explicit that not every task belongs in Symphony. Some ambiguous work still needs direct interactive Codex sessions and strong human judgment. That caveat is important: orchestration smooths routine implementation and exploration, but it does not erase the frontier. It shifts which work humans spend attention on.","verification_needed":true,"verification_note":"Content came from user capture; URL is the likely OpenAI canonical URL. Confirm exact canonical URL and byline before formal citation."},{"slug":"dhinakaran-agent-harness","title":"What Is an Agent Harness","author":"Aparna Dhinakaran","date":"2026-04-21","year":2026,"source_type":"tweet","url":"https://x.com/aparnadhinak/status/2046980769747533830","excerpt":"LangChain is not a harness. LangGraph is not a harness.","summary":"Defines the modern agent harness as an out-of-the-box architecture that emerged from coding agents: an iteration loop over tools, context management, skill/tool discovery, permissions, hooks, session persistence, sub-agents, and project-context injection.","notes":"Source text supplied by Daniel from X. Date confirmed as Apr 22, 2026. This entry was prepared with Codex (OpenAI); the earlier library entries were prepared with Claude (Anthropic).","why_it_matters":"A strongly procedural counterweight to thin-harness framings. 
The post argues that harnesses are not generic frameworks for humans to assemble agents, but working closed-loop environments that let models act, observe, repair, persist, and extend themselves.","tags":["agent-harness","coding-agents","harness-architecture","tool-loops","permissions","context-management","skills"],"role":"framework-piece","harness_types":["input-shaping","grounding-context-loading","execution-harness","validation-harness","repair-harness","monitoring-harness","learning-harness","social-harness","interface-harness"],"validation_position":["before-generation","during-generation","immediately-after-generation","before-action","post-deployment","continuous"],"validation_mode":["mechanical","empirical","institutional"],"domain":"cross-domain","prescription_stance":"strongly-procedural","relation_to_argument":["capability-is-extended","validation-is-constitutive","repairability-matters","observability-matters","breakdown-when-harness-absent","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"Hooks, session logs, context compression, and tool results make harness behavior inspectable, though the post is more architectural than telemetry-specific.","score":4},"repairability":{"note":"Repair is central to the definition: the model can observe consequences and continue until the task is actually solved.","score":5},"reversibility":{"note":"Permissions and approval gates reduce destructive risk, but rollback is not foregrounded as a first-class component.","score":3},"task_structure":{"note":"The while loop, tool registry, permission layer, and lifecycle hooks are presented as fixed architecture, not human-assembled graph wiring.","score":5},"reward_richness":{"note":"The source emphasizes act-observe-adjust feedback, but not explicit reward-model training or scalar reward design.","score":3},"feedback_latency":{"note":"Coding-agent feedback is immediate: read, edit, run tests, observe failure, repair, and 
repeat.","score":5},"input_legibility":{"note":"Project instruction files, context injection, skills, and tool discovery make the task environment legible to the model before and during work.","score":4},"offline_evaluability":{"note":"Coding agents inherit strong offline checks through tests, shell commands, diffs, and build outputs.","score":4},"institutional_ratification":{"note":"Hooks and permission policies are explicitly framed as the enterprise adoption layer.","score":4}},"annotation":"Dhinakaran draws a bright line between *frameworks* and *harnesses*. Frameworks such as LangChain and LangGraph give human developers abstractions to wire together. A harness, in her account, ships as a working agent architecture: outer loop, context manager, tool and skill registry, permission system, lifecycle hooks, session persistence, sub-agent management, and dynamic project-context injection.\n\nThe post is useful because it treats harnesses as an empirical convergence, not a vendor category. Coding agents such as Cursor, Claude Code, Windsurf, and Codex started from the practical problem of changing real repositories, then converged on similar structures: tool loops, compressed context, approval layers, and built-in file/shell/code-navigation primitives. Arize's Alyx is positioned as the same pattern appearing outside pure coding.\n\nFor the Extended Frontier argument, this is direct evidence that capability is produced by the situated assembly. The model alone is a one-shot text generator; the model inside a harness becomes a feedback-seeking system that can act, observe consequences, and adjust. That closed loop is not incidental plumbing. 
It is what changes the unit of capability from *model output* to *model-in-environment performance*.\n\nThis entry should sit beside:\n\n- [Tan, \"Thin Harness, Fat Skills\"](/library/tan-thin-harness-fat-skills) — disagrees on where durable leverage should live.\n- [Miessler, \"Good and Bad Harness Engineering\"](/library/miessler-good-and-bad-harness-engineering) — adjacent harness-engineering vocabulary.\n- [Anthropic, \"Agent Skills\"](/library/anthropic-agent-skills) — one of the built-in skill-layer mechanisms this post treats as part of harness architecture.\n\n### Components To Reuse\n\nDhinakaran's harness 1.0 component list is a useful checklist for classifying future entries:\n\n- Outer iteration loop.\n- Context management and compression.\n- Skills and tools management.\n- Sub-agent management.\n- Built-in pre-packaged skills.\n- Session persistence and recovery.\n- System prompt assembly and project-context injection.\n- Lifecycle hooks.\n- Permission and safety layer.\n\n### Tension\n\nThe strongest claim is also the pressure point: if a harness is defined as an out-of-the-box working agent architecture, then LangGraph-style frameworks are excluded even when they can be used to build similar loops. That exclusion is analytically useful for the library because it keeps the focus on *deployed capability environments*, not just orchestration abstractions.","verification_needed":true,"verification_note":"Date confirmed by Daniel. 
Title, author, URL, and content came from user capture; confirm directly in X before formal citation."},{"slug":"feng-junior-to-senior-agency","title":"From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agentic AI-Mediated Software Engineering","author":"Dana Feng, Bhada Yun, April Yi Wang","date":"2026-04-13","year":2026,"source_type":"paper","url":"https://dl.acm.org/doi/10.1145/3772318.3791642","excerpt":"Agency in software engineering is preconfigured at the organizational layer (policies, tooling defaults, CI guardrails) before individual preferences matter.","summary":"Three-phase mixed-methods study with 20 software engineers (10 junior, 10 senior) examining how agency is allocated between humans and agentic AI. Finds that organizational policies and norms preconfigure agency before individual preferences, with seniors maintaining control through delegation and juniors oscillating between over-reliance and resistance.","why_it_matters":"Empirical evidence that agency in AI-mediated software engineering is configured at the organizational layer — policies and norms — before individual tool use begins. 
Directly supports the Extended Frontier thesis that institutions shape capability and that the social harness (mentorship, code review, norms) is constitutive of effective AI use, not a post-hoc check.","tags":["agency","software-engineering","mentorship","junior-senior","agentic-ai","code-review","tacit-knowledge","cursor","organizational-policy","prompt-review"],"role":"field-observation","harness_types":["social-harness","ratification-harness","interface-harness"],"validation_position":["before-action","post-deployment"],"validation_mode":["social","empirical"],"domain":"software","prescription_stance":"mixed","relation_to_argument":["institutions-shape-capability","first-mile-input-formation","observability-matters","breakdown-when-harness-absent","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"Prompt history review is a major theme. S7: \"accept, accept, accept is a very different thing versus generating all that content and then not actually reading it.\" Seniors read diffs, reject bloat, cross-check with other models."},"repairability":{"note":"What happens when the agent goes wrong. Seniors iterate and refine; juniors spiral. J6: \"it started spiraling ... I just stopped it. The fix was a three-line change.\""},"task_structure":{"note":"Figure 1 frames the entire paper around familiar (well-structured) vs unfamiliar (poorly-structured) tasks and how agency allocation diverges across them."},"input_legibility":{"note":"Seniors shape inputs well — scoping, constraining, arriving with a plan. Juniors struggle because they do not know what to ask. The familiar/unfamiliar split (Figure 1) is essentially about whether the human can form a good first-mile input."},"institutional_ratification":{"note":"The headline finding. 
Agency is configured before the first prompt is sent: \"company policies, security regimes, and mandates to use AI weekly set default loci of control.\" Three non-negotiables: interruptibility/override, legible provenance, small test-bounded diffs."}},"annotation":"CHI '26 paper. Three-phase qualitative study (ACTA interviews, Cursor debugging task, blind senior review of junior artifacts) with 20 software engineers examining how agency — the distribution of decision authority and accountability — is allocated between humans and agentic AI tools.\n\nThe central finding is that agency in software engineering is **preconfigured at the organizational layer** before individual preferences matter. Company policies, security regimes, tooling defaults, and norms set the boundaries. Within those bounds, seniors and juniors take divergent routes: seniors maintain control through detailed delegation and iterative refinement in familiar contexts, and strategic oversight in unfamiliar ones. Juniors oscillate between over-reliance ('spamming the agent button') and defensive resistance.\n\n### Policies and norms\n\nThe paper surfaces both formal policies and informal norms as shaping agency. On the policy side: approved tool lists, data-sharing restrictions, mandates to use AI. On the norms side: senior code review practices (S2: 'I try and give concise comments on code and try and also point to documentation'), social signals about AI-generated code (S6: 'building things we don't need right now, you can just tell, nowadays ... it's probably AI'), and the emerging norm of 'constant vigilance' when reviewing junior PRs that may be AI-assisted.\n\n### The mentorship reframe\n\nThe senior role transforms from answering questions to asking them — 'Socratic guides and organizational anchors.' Junior growth reframes as 'earning judgment through deliberate restraint' — knowing when not to delegate, when to trust instincts, when to seek human guidance. 
The traditional pipeline of gradual technical mastery is disrupted: juniors now engage with production systems immediately but with accountability mechanisms like the proposed Prompt & Code Reviews (PCRs).\n\n### Imposter syndrome and ownership\n\nJuniors report that AI undermines ownership: J8: 'It has my name on it, but I have no idea why it works.' J3, after being praised for speed: 'Cursor did the work ... I just tried to find the problem.' J9: felt 'like a fraud' after a hackathon. The paper frames this as authorship without understanding — code emerges from prompts, accountability becomes diffuse.\n\n### Open questions\n\n- How do PCRs scale? The paper acknowledges code review fatigue is already a problem.\n- The study is a snapshot from summer 2025 — how quickly do these dynamics shift as tools improve?\n- Does the junior experience of agency loss parallel deskilling concerns in other domains, or is software engineering structurally different?","verification_needed":false},{"slug":"tan-thin-harness-fat-skills","title":"Thin Harness, Fat Skills","author":"Garry Tan","date":"2026-04-10","year":2026,"source_type":"doc","url":"https://github.com/garrytan/gbrain/blob/master/docs/ethos/THIN_HARNESS_FAT_SKILLS.md","excerpt":"The 2x people and the 100x people are using the same models. The difference is five concepts that fit on an index card.","summary":"Short, practitioner-facing ethos doc arguing that the durable leverage in agent systems comes from model-resident skills (markdown) and deterministic code at the edges, with the harness kept as thin as possible so each model upgrade flows through.","why_it_matters":"A counterweight to harness-heavy framings. Tracks the prediction that as models get better, elaborate scaffolding becomes dead weight. 
Useful to read alongside Miessler (harness-engineering) and HumanLayer (sub-agents-as-context-control).","tags":["thin-harness","skills","scaffolding-skepticism","agent-design","markdown-skills"],"role":"practitioner-note","harness_types":["execution-harness","interface-harness","learning-harness"],"validation_position":["before-generation"],"validation_mode":["empirical"],"domain":"software","prescription_stance":"anti-prescriptive","relation_to_argument":["capability-is-extended","diffusion-adoption-bottleneck","first-mile-input-formation"],"dimensions":{"observability":{"note":"","score":2},"repairability":{"note":"Thin-harness framing tends to under-specify where repair loops live.","score":2},"task_structure":{"note":"","score":3},"reward_richness":{"note":"Does not foreground reward signal as the key lever.","score":2},"input_legibility":{"note":"Assumes inputs are legible enough that heavy shaping is unnecessary — a domain-specific bet.","score":3},"institutional_ratification":{"note":"","score":1}},"annotation":"A compact practitioner thesis from the gbrain repo: the productivity gap between 2x and 100x agentic-engineering users is not the model, it is the architectural pattern around the model. The prescription is architectural restraint — push fuzzy operations into markdown *skills*, push must-be-perfect operations into *code*, and keep the *harness* thin so every model improvement flows through automatically.\n\nCompanion tweet (same framing, compressed): [@garrytan, \"Thin harness, fat skills\"](https://x.com/garrytan/status/2043566215927328955).\n\nThis is a **sharp disagreement** with framings that treat validation, repair, and context routing as *constitutive* of capability. In Tan's picture, most of that work is either absorbed by the next model or revealed as compensation for a weaker one. 
In the constitutive picture (Wallach/Jacobs et al.; Salaudeen et al.), those loops are where capability *lives* in practice — no matter how strong the base model.\n\nKeep this entry visible when reading sources that argue the opposite. It marks the pole the library should preserve, not flatten.\n\n### Open questions\n\n- Under what domain conditions is \"thin harness\" actually enough? (Hypothesis: high offline evaluability, low institutional ratification cost.)\n- Does \"fat skills\" degrade gracefully when inputs are illegible or reward signal is thin?\n- What's the smallest counterexample — a task where a fat harness around a weaker model beats a thin harness around a stronger one and *continues* to beat it as models improve?","verification_needed":false},{"slug":"karpathy-llm-knowledge-bases","title":"LLM Knowledge Bases","author":"Andrej Karpathy","date":"2026-04-01","year":2026,"source_type":"tweet","url":"https://x.com/karpathy/status/2039805659525644595","excerpt":"You rarely ever write or edit the wiki manually, it's the domain of the LLM.","summary":"Describes a personal research workflow where raw source documents are compiled by an LLM into a markdown wiki, maintained through index files, health checks, generated outputs, and lightweight tools rather than a heavyweight RAG stack.","notes":"Source text supplied by Daniel from X. 
This entry was prepared with Codex (OpenAI).","why_it_matters":"This is the Extended Frontier applied to knowledge work: the model's capability comes from a maintained corpus, indexes, summaries, visual outputs, and health checks that make research cumulative instead of ephemeral.","tags":["knowledge-base","markdown-wiki","obsidian","agentic-research","llm-maintained-artifacts","personal-knowledge-management"],"role":"practitioner-note","harness_types":["grounding-context-loading","execution-harness","validation-harness","repair-harness","monitoring-harness","learning-harness","interface-harness"],"validation_position":["before-generation","immediately-after-generation","continuous"],"validation_mode":["empirical","interpretive"],"domain":"research","prescription_stance":"mixed","relation_to_argument":["capability-is-extended","first-mile-input-formation","validation-is-constitutive","repairability-matters","observability-matters","domain-structure-matters"],"dimensions":{"observability":{"note":"The wiki is human-readable markdown and images viewed in Obsidian, so the agent's knowledge substrate stays inspectable.","score":5},"repairability":{"note":"Health checks, missing-data imputation, and filing outputs back into the wiki make the knowledge base incrementally repairable.","score":4},"reversibility":{"note":"Markdown artifacts are versionable, though the post does not foreground git or rollback.","score":3},"task_structure":{"note":"Markdown files, indexes, backlinks, summaries, and Obsidian views give the work a manipulable structure.","score":4},"reward_richness":{"note":"The workflow has useful signals from links, consistency, and answer quality, but not an explicit reward signal.","score":2},"feedback_latency":{"note":"Feedback arrives through Q&A, rendered outputs, and health checks, but not usually as immediate pass/fail tests.","score":3},"input_legibility":{"note":"The raw/ to wiki compilation process is explicitly about making heterogeneous documents 
legible to future LLM turns.","score":5},"offline_evaluability":{"note":"Some checks can be run offline over the wiki, but factual gaps still require web search or source refresh.","score":3},"institutional_ratification":{"note":"This is a personal research workflow rather than an organizational ratification system.","score":1}},"annotation":"Karpathy describes a knowledge-work harness, not just a note-taking habit. Raw sources go into one directory; an LLM incrementally compiles them into a markdown wiki with summaries, backlinks, concept pages, index files, and derived visualizations. Obsidian becomes the human-facing IDE, while the LLM owns most direct edits to the wiki.\n\nThe important move is that research outputs are not terminal chat answers. They become files: markdown notes, Marp slides, matplotlib images, search indexes, and follow-up articles that can be filed back into the corpus. Each query can make the next query easier because the knowledge base itself accumulates structure.\n\nFor the library, this is a clean example of **capability as artifact maintenance**. Karpathy expected to need \"fancy RAG,\" but at roughly 100 articles and 400K words, LLM-maintained summaries and index files were enough. The boundary condition matters: the system works because the scale is still small enough for source-aware traversal and because the artifacts are legible.\n\n### Extended Frontier Read\n\nThe raw model is not the unit of analysis. The useful system is model plus:\n\n- a raw source archive,\n- a compiled markdown wiki,\n- index and summary files,\n- Obsidian as inspection surface,\n- generated outputs that feed back into the wiki,\n- health checks over consistency and missing data,\n- small custom tools such as a wiki search engine.\n\nThis belongs beside harness entries, but it broadens the frame from coding agents to research agents. 
The same pattern appears: make the environment legible, let the model act on files, inspect the result, repair the substrate, and let work accumulate.\n\n### Open Questions\n\n- At what corpus size does this stop working without stronger retrieval infrastructure?\n- Which health checks are most predictive of useful future Q&A?\n- Does finetuning on the wiki improve capability, or does it destroy the inspectability and repairability that make the workflow valuable?","verification_needed":true,"verification_note":"Author, URL, timestamp, and content came from user capture. Confirm directly in X before formal citation."},{"slug":"royzen-standard-signal","title":"Standard Signal: AI-native hedge fund announcement","author":"Michael Royzen","date":"2026-02-28","year":2026,"source_type":"tweet","url":"https://x.com/MichaelRoyzen/status/2039801841253564837","excerpt":"Standard Signal is the first hedge fund that researches and executes trades purely with AI. We train models to discover and trade on new fundamental truths about the world before humans can.","summary":"Launch announcement for a YC-backed hedge fund where AI models both generate hypotheses and execute trades. Included here as a domain-claim entry: markets-with-P&L are a paradigmatically favorable domain — clean outcome signal, fast feedback, offline backtestable, institutionally-ratified wrapper (a fund).","why_it_matters":"Finance is often named as a poster domain for AI deployment because outcomes are crisply priced. 
This entry anchors that claim with a concrete 2026 example and marks the critical asymmetry — high reward richness co-existing with low repairability — that the schema is designed to surface.","tags":["standard-signal","finance","hedge-fund","outcome-signal","domain-favorability"],"role":"domain-claim","harness_types":["validation-harness","ratification-harness","learning-harness","execution-harness"],"validation_position":["post-deployment","continuous"],"validation_mode":["mechanical","empirical","institutional"],"domain":"finance","prescription_stance":"strongly-procedural","relation_to_argument":["reward-structure-matters","domain-structure-matters","institutions-shape-capability","validation-is-constitutive"],"dimensions":{"observability":{"note":"","score":3},"repairability":{"note":"Critical tension: trading P&L tells you that a model is wrong but not *where* or *why*. Verifiable outcome ≠ diagnostic feedback.","score":2},"reversibility":{"note":"Trades execute and settle; losses are not rollbackable.","score":2},"task_structure":{"note":"","score":4},"reward_richness":{"note":"P&L is an unusually clean, cardinal, self-consistent reward. This framing is the library's own; Royzen does not use the phrase 'verifiable reward.'","score":5},"feedback_latency":{"note":"Faster than science, slower than software. Mark-to-market is continuous; attribution to a specific hypothesis is not.","score":3},"input_legibility":{"note":"","score":3},"offline_evaluability":{"note":"Backtesting is real but regime-shift biased.","score":4},"institutional_ratification":{"note":"A hedge fund is the institution that ratifies 'this worked.' LPs, auditors, and regulators are ratification harness.","score":4}},"annotation":"The announcement tweet is compact, but the conceptual payload is substantial. Standard Signal positions itself as the first hedge fund where *every* trade is researched and executed by AI. 
That packaging matters — not for the technology, but for the **ratification wrapper** around the technology. A fund is a legal and social form that converts opaque model outputs into legible claims about the world.\n\nWhy this belongs in the library even though Royzen does not use \"harness\" or \"verifiable reward\" vocabulary:\n\n1. It stakes a **domain-favorability** claim: markets are unusually hospitable to AI because the reward signal is priced, real-time, and cardinal.\n2. It stakes an **institutional** claim: a YC-backed fund is institutional ratification in a form academic benchmarks cannot supply.\n3. It exposes the **asymmetry** the library wants to keep visible: high `reward_richness`, low `repairability`. P&L tells you whether you were right; it does not tell you *why*.\n\n### Read alongside\n\n- [Expanding RL with Verifiable Rewards Across Diverse Domains](/library/expanding-rlvr-across-domains) — technical framing of the same bet.\n- [Measurement to Meaning](/library/salaudeen-measurement-to-meaning) — sharpest pushback: even a \"verifiable\" outcome doesn't measure the construct you claim.\n\n### Verification needed\n\n- Exact posting date of the tweet.\n- Whether subsequent Standard Signal writing explicitly uses \"verifiable reward\" language or stays in P&L terms.","verification_needed":true,"verification_note":"URL and excerpt verified via search. 
Exact posting date is a best guess (Standard Signal is YC Spring 2026 / P26); confirm from the tweet timestamp before citing."},{"slug":"humanlayer-skill-issue","title":"Skill Issue: Harness Engineering for Coding Agents","author":"HumanLayer","date":"2026-02-28","year":2026,"source_type":"blog","url":"https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents","excerpt":"Skills, MCP servers, sub-agents, hooks, and back-pressure mechanisms are tactical solutions HumanLayer has arrived at.","summary":"Case-study framing of harness engineering for coding agents, with specific claims about what does and does not work (notably: role-based sub-agents don't work; sub-agents for context control do).","why_it_matters":"A strong counter-example to thin-harness-in-the-limit. HumanLayer has shipped coding-agent product and reports that sub-agents, hooks, and back-pressure do real work. Sharpens the disagreement with Tan/Miessler and localises it.","tags":["harness-engineering","coding-agents","sub-agents","context-control","back-pressure"],"role":"case-study","harness_types":["execution-harness","repair-harness","monitoring-harness","interface-harness"],"validation_position":["during-generation","immediately-after-generation","post-deployment"],"validation_mode":["empirical","mechanical"],"domain":"software","prescription_stance":"strongly-procedural","relation_to_argument":["capability-is-extended","repairability-matters","observability-matters","breakdown-when-harness-absent"],"dimensions":{"observability":{"note":"","score":4},"repairability":{"note":"Back-pressure mechanisms are repair harness by another name.","score":4},"task_structure":{"note":"Breaking work into discrete delegated tasks is a first-class move 
here.","score":5},"reward_richness":{"note":"","score":3},"feedback_latency":{"note":"","score":3},"input_legibility":{"note":"","score":4},"offline_evaluability":{"note":"","score":2},"institutional_ratification":{"note":"","score":2}},"annotation":"HumanLayer's post is the library's best current counterweight to the thin-harness pole. The claim is not that more harness is always better — they explicitly report that *role-based* sub-agents (\"frontend engineer,\" \"backend engineer\") don't work. The claim is that specific harness moves — sub-agents as context-control, hooks, back-pressure — carry real load and cannot be absorbed into a better model.\n\nThe piece is useful for the library because it:\n\n- **Distinguishes harness types that work from those that don't**, empirically rather than in principle.\n- **Names specific mechanisms** (sub-agents-for-context, back-pressure) that belong on the `harness_types` taxonomy.\n- **Speaks from shipped product**, which raises its weight on the `practitioner-note` vs `framework-piece` axis.\n\n### Read alongside\n\n- [Tan, \"Thin Harness, Fat Skills\"](/library/tan-thin-harness-fat-skills) — the opposite pole.\n- [Miessler, \"Good and Bad Harness Engineering\"](/library/miessler-good-and-bad-harness-engineering) — the pole this piece is most compatible with.\n- [Anthropic, \"Agent Skills\"](/library/anthropic-agent-skills) — the vendor framing for one of the tactical solutions cited.","verification_needed":true,"verification_note":"URL, title, and publisher verified. 
Date is best-estimate from search snippet ('published in March 2026'); confirm the exact date and specific author(s) from the post header."},{"slug":"renda-openestimate","title":"OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data","author":"Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas","date":"2025-10-22","year":2025,"source_type":"paper","url":"https://arxiv.org/abs/2510.15096","excerpt":"LM-elicited priors are often inaccurate and overconfident.","summary":"OpenEstimate is a multi-domain benchmark for testing whether language models can express calibrated Bayesian priors for numerical estimation tasks under uncertainty, using real-world datasets in healthcare, employment, and finance.","notes":"GitHub repository: https://github.com/alanarenda/openestimate. Announcement thread supplied by Daniel from X. arXiv page says v1 submitted Oct 16, 2025 and v2 revised Apr 22, 2026. This entry was prepared with Codex (OpenAI).","why_it_matters":"OpenEstimate targets a capability gap that ordinary right-answer benchmarks miss: knowing how uncertain you should be when neither the model nor the human has an obvious answer. 
It makes calibration and uncertainty representation first-class evaluation targets.","tags":["uncertainty","calibration","probabilistic-estimation","benchmark","bayesian-priors","openestimate","real-world-data"],"role":"measurement-piece","harness_types":["validation-harness","grounding-context-loading"],"validation_position":["immediately-after-generation"],"validation_mode":["empirical","mechanical"],"domain":"cross-domain","prescription_stance":"strongly-procedural","relation_to_argument":["validation-is-constitutive","reward-structure-matters","domain-structure-matters","breakdown-when-harness-absent"],"dimensions":{"observability":{"note":"Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty.","score":4},"repairability":{"note":"The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths.","score":2},"reversibility":{"note":"The source is an evaluation harness, not a workflow with rollback or undo semantics.","score":2},"task_structure":{"note":"The benchmark asks models to represent beliefs as distributional priors, making uncertainty part of the output contract.","score":4},"reward_richness":{"note":"Ground-truth distributions from observational data support accuracy and calibration metrics.","score":4},"feedback_latency":{"note":"Feedback is benchmark-level evaluation after elicitation, not an interactive repair loop.","score":2},"input_legibility":{"note":"Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information.","score":3},"offline_evaluability":{"note":"The benchmark is explicitly offline and reproducible against dataset-derived ground truth.","score":5},"institutional_ratification":{"note":"ICLR venue and open-source benchmark provide research-community ratification, but not deployment governance.","score":3}},"annotation":"OpenEstimate is a 
measurement entry for the part of the frontier where there is no simple \"right answer\" visible to the user. The task is numerical estimation under uncertainty: given partial information from real-world datasets, models must express beliefs as Bayesian priors, which are then evaluated against ground-truth distributions computed from data.\n\nThe benchmark covers domains such as health, employment, and finance using datasets including NHANES, Glassdoor, and PitchBook. It evaluates point accuracy, calibration, uncertainty-accuracy correlation, and the value of LM priors relative to statistical baselines based on samples from the true distribution.\n\nThe headline result is sobering for deployment: across six frontier models, model-elicited priors are often inaccurate and overconfident. The announcement thread adds the sharper interpretation that model priors can be equivalent to fewer than five real data points and that higher model certainty does not reliably mean higher accuracy.\n\n### Extended Frontier Read\n\nOpenEstimate strengthens the measurement shelf because it asks for a richer construct than correctness. The relevant capability is not \"can the model answer?\" but:\n\n- can it represent uncertainty as a usable prior,\n- is that prior calibrated,\n- does the model know when it does not know,\n- does additional reasoning effort or prompting actually improve uncertainty quality?\n\nThat makes it a direct counterweight to benchmarks that reward confident point answers. The extension here is the evaluation harness itself: a structured output contract plus ground-truth distributions plus calibration metrics.\n\n### Boundary\n\nThis is a validation harness, not a repair harness. It can show that models are overconfident, and it can compare elicitation protocols, but it does not by itself teach the model how to repair its uncertainty estimates. 
That makes it a useful neighbor to RLCR-style work that tries to train models to reason about what they do not know.\n\nSource code: [alanarenda/openestimate](https://github.com/alanarenda/openestimate). Announcement thread: [@alanamarzoev on X](https://x.com/alanamarzoev/status/1981004837102793075).","verification_needed":false,"verification_note":"Date uses the Oct 22, 2025 public announcement supplied by Daniel; title, authors, arXiv URL, and repository were verified against arXiv/GitHub on Apr 29, 2026."},{"slug":"maynard-resurrecting-deceased-darlings","title":"Resurrecting deceased darlings: The Missing Foreword to AI and the Art of Being Human","author":"Andrew Maynard","date":"2025-10-18","year":2025,"source_type":"essay","url":"https://www.futureofbeinghuman.com/p/ai-resurrecting-deceased-darlings?publication_id=1547141&post_id=176516949","excerpt":"This book could not have been written without the learning and insights gained from working closely with one of the most powerful AI models available.","summary":"Maynard publishes the cut foreword to AI and the Art of Being Human, describing months of close collaboration with Claude while emphasizing human agency, manual refinement, AI tells, fictional allegories, and practical tools for staying human with AI.","notes":"Source text supplied by Daniel from Maynard's Substack. This entry was prepared with Codex (OpenAI).","why_it_matters":"A grounded writing case study where AI assistance is neither hidden nor treated as autonomous authorship. 
Capability comes from months of prompt/resource preparation, human refinement, editorial judgment, and disclosure.","tags":["writing","ai-assisted-book","claude","human-agency","editorial-process","inner-postures","storytelling"],"role":"case-study","harness_types":["input-shaping","validation-harness","repair-harness","learning-harness","social-harness","interface-harness"],"validation_position":["before-generation","immediately-after-generation","before-action","continuous"],"validation_mode":["interpretive","social","empirical"],"domain":"education","prescription_stance":"mixed","relation_to_argument":["capability-is-extended","first-mile-input-formation","repairability-matters","institutions-shape-capability","breakdown-when-harness-absent","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"The foreword makes the collaboration visible, including worries, Claude's failures, and the retained AI tell.","score":4},"repairability":{"note":"The post emphasizes manual refinement, removal of hallucinations, reduction of AI tells, and killing beloved text for reader flow.","score":4},"reversibility":{"note":"The authors cut the foreword from the book, moved some material to the preface, and later published it separately.","score":4},"task_structure":{"note":"The collaboration was organized around chapters, frameworks, stories, tools, and explicit postures.","score":4},"reward_richness":{"note":"The feedback signal is editorial and human, not mechanical or scalar.","score":2},"feedback_latency":{"note":"Passages were iteratively rewritten, but book-scale editorial feedback is slower than code/test loops.","score":3},"input_legibility":{"note":"The authors built a library of resources and deep prompts over months before drafting.","score":4},"offline_evaluability":{"note":"Quality is judged through reading, editing, credibility, and reader engagement rather than offline tests.","score":2},"institutional_ratification":{"note":"Professional advice, 
publication context, reader reception, and credibility concerns shape what counts as acceptable.","score":3}},"annotation":"Maynard's post is a useful counterexample to simplistic claims about AI-assisted writing. The cut foreword says the book was written in close collaboration with Claude, but also insists the result was not a quick AI-generated artifact. The process took months of discussion, research, prompt and resource development, initial drafting, and extensive human refinement.\n\nThe most important detail for this library is that the authors treat AI collaboration as a **practice**. Claude contributed language, connections, tools, fictional forms, and moments that moved the authors. It also produced hallucinations, AI tells, and repeated failures to capture what they wanted. The final artifact depended on human judgment: rewriting, cutting, shaping reader flow, deciding what to disclose, and even preserving one minor AI tell as a trace of the collaboration.\n\n### Extended Frontier Read\n\nThis is a writing-domain version of the harness argument:\n\n- input preparation through a library of resources and deep prompts;\n- iterative drafting with Claude;\n- human editorial judgment over every chapter;\n- professional advice shaping the final structure;\n- disclosure as social ratification;\n- fictional stories as a designed interface for making abstract AI questions felt.\n\nThe \"extension\" is not a test suite. It is the editorial and social apparatus around the model: judgment, taste, reader empathy, credibility concerns, disclosure, and revision.\n\n### Tension\n\nThe foreword was cut because it slowed reader engagement, even though it contained valuable context. That editorial decision is itself part of the capability story. AI helped produce material the authors valued, but human-facing publication required deciding what not to include. 
Less output was better output.","verification_needed":true,"verification_note":"Author, title, date, URL, and content came from user capture. Confirm directly from Substack before formal citation."},{"slug":"anthropic-agent-skills","title":"Equipping agents for the real world with Agent Skills","author":"Anthropic","date":"2025-10-16","year":2025,"source_type":"blog","url":"https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills","excerpt":"Agent Skills are organized folders of instructions, scripts, and resources that agents can discover and load dynamically to perform better at specific tasks.","summary":"Anthropic's engineering announcement of Agent Skills: a markdown-based pattern for extending Claude's capabilities by progressive disclosure. Important as an *institutional* ratification of the thin-harness / fat-skills framing.","why_it_matters":"When the model vendor publishes an engineering post describing the pattern, the pattern becomes a point of reference that downstream tooling, hiring, and documentation can anchor on. 
The library should treat this as institutional ratification of the thin-harness-adjacent thesis.","tags":["agent-skills","anthropic","progressive-disclosure","institutional-launch","markdown-skills"],"role":"framework-piece","harness_types":["grounding-context-loading","execution-harness","learning-harness","ratification-harness"],"validation_position":["before-generation","during-generation"],"validation_mode":["empirical","institutional"],"domain":"software","prescription_stance":"mixed","relation_to_argument":["capability-is-extended","first-mile-input-formation","institutions-shape-capability","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"","score":4},"repairability":{"note":"","score":3},"task_structure":{"note":"","score":4},"reward_richness":{"note":"","score":3},"input_legibility":{"note":"","score":4},"offline_evaluability":{"note":"","score":3},"institutional_ratification":{"note":"The model vendor officially blessing the markdown-skill pattern is a ratification event, not only a technical one.","score":5}},"annotation":"An important entry for the `institutions-shape-capability` axis. Anthropic's engineering post is not just a feature announcement — it is the moment the markdown-skill pattern gets a canonical, vendor-endorsed framing. That changes what downstream practitioners cite, what conference talks reference, and which designs are considered \"default.\"\n\nThree things to note:\n\n1. **The pattern has been used internally at Anthropic for some time** (search snippet: \"they now have hundreds of them in production\"). The public post is ratification, not invention.\n2. **Progressive disclosure** — scanning only metadata until a skill is relevant — is a specific design move and deserves to be tracked. It is not just \"markdown files\"; it is a loading strategy.\n3. **Agent Skills and MCP are presented as complementary**, not competing. 
That framing matters for the library's `harness_types` taxonomy.\n\n### Related entries\n\n- [Willison, \"Claude Skills are awesome, maybe a bigger deal than MCP\"](/library/willison-claude-skills-bigger-than-mcp) — practitioner synthesis the same week.\n- [Tan, \"Thin Harness, Fat Skills\"](/library/tan-thin-harness-fat-skills) — the ethos this operationalises.\n- [HumanLayer, \"Skill Issue\"](/library/humanlayer-skill-issue) — working-through of where skills stop and harness-engineering continues.","verification_needed":true,"verification_note":"URL and title verified; date borrowed from Simon Willison's same-day commentary (2025-10-16) — confirm from the Anthropic post header before citing."},{"slug":"willison-claude-skills-bigger-than-mcp","title":"Claude Skills are awesome, maybe a bigger deal than MCP","author":"Simon Willison","date":"2025-10-16","year":2025,"source_type":"blog","url":"https://simonwillison.net/2025/Oct/16/claude-skills/","excerpt":"A skill is a Markdown file telling the model how to do something, optionally accompanied by extra documents and pre-written scripts.","summary":"Practitioner synthesis of Anthropic's Agent Skills feature, arguing the markdown-file pattern is conceptually simpler and more token-efficient than MCP, and that the ease of sharing a single file is the feature.","why_it_matters":"Makes visible the argument that the markdown-skill pattern is a diffusion mechanism, not only a technical one. 
Pair with Tan (thin harness) and Anthropic's engineering post to triangulate what 'skills' actually refer to.","tags":["agent-skills","markdown-skills","mcp","progressive-disclosure","token-efficiency"],"role":"synthesis-node","harness_types":["grounding-context-loading","execution-harness","learning-harness","social-harness"],"validation_position":["before-generation"],"validation_mode":["empirical"],"domain":"software","prescription_stance":"mixed","relation_to_argument":["capability-is-extended","first-mile-input-formation","diffusion-adoption-bottleneck"],"dimensions":{"observability":{"note":"","score":3},"repairability":{"note":"","score":3},"task_structure":{"note":"","score":4},"reward_richness":{"note":"","score":2},"input_legibility":{"note":"Progressive disclosure — scan metadata, load full skill on demand — is a legibility pattern.","score":4},"offline_evaluability":{"note":"","score":2},"institutional_ratification":{"note":"Distribution is social: skills spread as shareable markdown files, not packaged tools.","score":3}},"annotation":"Willison argues two things at once:\n\n1. **Conceptual simplicity beats MCP.** A skill is a markdown file; the model knows how to read markdown; a CLI tool with `--help` solves most of what an MCP server solves, at a fraction of the token budget.\n2. **Distribution is the feature.** Many skills are a single file. 
The shareability is the point — skills spread.\n\nRead as a synthesis node that connects:\n\n- [Anthropic's Agent Skills announcement](/library/anthropic-agent-skills) — the institutional launch of the pattern.\n- [Tan, \"Thin Harness, Fat Skills\"](/library/tan-thin-harness-fat-skills) — the practitioner ethos that the markdown-skills pattern operationalises.\n- [HumanLayer, \"Skill Issue\"](/library/humanlayer-skill-issue) — what harness-engineering work remains around skills.","verification_needed":false},{"slug":"miessler-good-and-bad-harness-engineering","title":"Good and Bad Harness Engineering","author":"Daniel Miessler","date":"2025-08-31","year":2025,"source_type":"essay","url":"https://danielmiessler.com/blog/good-and-bad-harness-engineering","excerpt":"In the early days of prompt engineering (2023-2024) it was helpful to tell AI exactly how to do things, but this inversion probably happened somewhere in 2025.","summary":"Argues that good harness engineering focuses on who the user is and what they're trying to accomplish — the 'what' — and lets the model handle the 'how'. Pairs with Miessler's 'Bitter Lesson Engineering' as a design discipline for scaffolding that extends capability rather than compensating for model weakness.","why_it_matters":"Supplies the vocabulary for distinguishing harnesses that *extend* capability from harnesses that merely *compensate* for it. 
A critical lens for reading practitioner writing.","tags":["harness-engineering","bitter-lesson","design-discipline","agent-design","what-not-how"],"role":"framework-piece","harness_types":["input-shaping","grounding-context-loading","execution-harness","validation-harness","repair-harness","monitoring-harness"],"validation_position":["before-generation","immediately-after-generation","post-deployment"],"validation_mode":["empirical","mechanical"],"domain":"cross-domain","prescription_stance":"mixed","relation_to_argument":["capability-is-extended","repairability-matters","observability-matters","breakdown-when-harness-absent"],"dimensions":{"observability":{"note":"","score":4},"repairability":{"note":"","score":4},"reversibility":{"note":"","score":3},"task_structure":{"note":"","score":4},"reward_richness":{"note":"","score":3},"feedback_latency":{"note":"","score":3},"input_legibility":{"note":"Treats input formation as part of the engineered system, not preprocessing.","score":4},"offline_evaluability":{"note":"","score":3},"institutional_ratification":{"note":"","score":2}},"annotation":"Stakes out the middle ground between \"thin harness, fat skills\" and fully prescriptive agent frameworks. 
The core move is a *good/bad* distinction inside harness engineering itself: some scaffolding genuinely extends what the system can do (input shaping, repair loops, observability), while other scaffolding is brittle compensation for current model weakness and will not survive the next model.\n\nMiessler's design rule is compressed into one line: **don't confuse the *what* with the *how*.** Tell the model who you are and what outcome you want; let the model figure out the path.\n\nRead together with:\n\n- [Bitter Lesson Engineering](https://danielmiessler.com/blog/bitter-lesson-engineering) — the underlying argument, leaning on Sutton's \"The Bitter Lesson.\"\n- [Tan's \"Thin Harness, Fat Skills\"](/library/tan-thin-harness-fat-skills) — adjacent but less prescriptive-about-good-design.\n\nMiessler is **not** endorsing the thin-harness conclusion that scaffolding is always waste. He is endorsing a *discipline* of harness design. The disagreement with Tan is legible: both agree some scaffolding is waste; they disagree about how much of the harness is waste in the limit of model improvement.\n\n### What the library should extract once the post is fully read\n\n- The explicit taxonomy (if any) of good vs. bad harness work.\n- Concrete examples cited as each type.\n- Whether repairability and observability are treated as *constitutive* of capability or merely as hygiene.","verification_needed":true,"verification_note":"URL and author verified; content summarised via search snippets (site blocks automated fetch). 
Exact publish date is a best guess — confirm from the post header before citing."},{"slug":"rubrics-as-rewards","title":"Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains","author":"Unknown (OpenReview: 21UFlJrmS2)","date":"2025-08-31","year":2025,"source_type":"paper","url":"https://openreview.net/forum?id=21UFlJrmS2","summary":"Proposes rubrics as a reward source for reinforcement learning in domains where a crisp verifiable outcome does not exist. A deliberate extension of RLVR-style methods past the easy cases.","why_it_matters":"The library's schema intentionally separates reward richness from repairability and input legibility. This entry is a technical illustration: a rubric can score lower than a verified outcome on richness while scoring *higher* on repairability — because the rubric names which dimension failed.","tags":["rubrics-as-rewards","rar","non-verifiable-domains","reward-shaping","rlvr-adjacent"],"role":"measurement-piece","harness_types":["validation-harness","learning-harness"],"validation_position":["immediately-after-generation"],"validation_mode":["interpretive","social","empirical"],"domain":"research","prescription_stance":"mixed","relation_to_argument":["reward-structure-matters","domain-structure-matters","validation-is-constitutive"],"dimensions":{"observability":{"note":"","score":4},"repairability":{"note":"Rubrics *name* failure modes — that is diagnostic by construction, not just verifiable.","score":4},"task_structure":{"note":"","score":3},"reward_richness":{"note":"Rubric scores are richer than nothing, sparser than a verified pass/fail.","score":3},"feedback_latency":{"note":"","score":3},"input_legibility":{"note":"","score":3},"offline_evaluability":{"note":"","score":3},"institutional_ratification":{"note":"","score":3}},"annotation":"An important entry for preserving the library's most-subtle disagreement: *verifiable outcome ≠ diagnostic feedback*. 
Rubrics as Rewards operationalises that distinction by trading some of the \"cleanness\" of a verified outcome (one bit: passed / failed) for the structured richness of a rubric (multi-dimensional, failure-mode-named, repairable).\n\n### Read alongside\n\n- [Expanding RLVR Across Diverse Domains](/library/expanding-rlvr-across-domains) — the verifiable-outcome pole.\n- [Royzen: Standard Signal](/library/royzen-standard-signal) — the domain where the outcome is unusually verifiable.\n- [Wallach/Jacobs et al](/library/wallach-measurement-challenge) — the measurement critique that applies to both rubrics and verifiable rewards.","verification_needed":true,"verification_note":"OpenReview URL and title verified. Authors and exact date not confirmed; confirm from the forum page before citing."},{"slug":"oecd-ai-ready-workforce","title":"Building an AI-ready public workforce","author":"OECD","date":"2025-06-30","year":2025,"source_type":"doc","url":"https://www.oecd.org/en/publications/building-an-ai-ready-public-workforce_b89244c7-en/full-report.html","summary":"OECD full report on how public-sector workforces are (and are not) prepared to deploy AI. Brought into the library as a governance-piece anchor: the argument is that whether an AI system is capable *in practice* depends on the institutional scaffolding around its use, not only on the model or the harness.","why_it_matters":"Counterweight to the software-centric pole of the library. 
A large portion of real AI deployment lives inside institutions whose capability depends on workforce preparation, training, accountability, and procurement — none of which is captured by 'harness' in the coding-agent sense.","tags":["workforce","public-sector","oecd","institutional-scaffolding","governance"],"role":"governance-piece","harness_types":["ratification-harness","social-harness","monitoring-harness"],"validation_position":["before-action","post-deployment","continuous"],"validation_mode":["institutional","social"],"domain":"operations","prescription_stance":"strongly-procedural","relation_to_argument":["institutions-shape-capability","diffusion-adoption-bottleneck","breakdown-when-harness-absent"],"dimensions":{"observability":{"note":"","score":3},"repairability":{"note":"","score":3},"task_structure":{"note":"","score":3},"reward_richness":{"note":"Public-sector outcomes rarely collapse to a cardinal reward.","score":1},"feedback_latency":{"note":"Policy-level feedback is slow. Years, not cycles.","score":1},"input_legibility":{"note":"","score":2},"institutional_ratification":{"note":"The report is itself a ratification instrument.","score":5}},"annotation":"A governance entry. 
Places the question of AI capability-in-practice inside the frame of *public administration*: whether AI makes a public-sector system more capable depends on training, data integration, procurement norms, and public-private partnership structures, not only on the model or its harness.\n\nThe OECD framing forces the library to reckon with a kind of scaffolding that coding-agent practitioners rarely name:\n\n- Workforce **training** as a first-mile input-formation mechanism.\n- Accountability **procedures** as a ratification harness with legal and political standing.\n- Cross-agency **data integration** as a grounding-and-context-loading substrate.\n\n### Why it pairs with the software entries\n\n- [Tan, thin harness / fat skills](/library/tan-thin-harness-fat-skills) — highlights the domain mismatch: thin-harness prescriptions assume a software-practitioner user. Here the \"user\" is a multi-layered public institution.\n- [HumanLayer, \"Skill Issue\"](/library/humanlayer-skill-issue) — both pieces agree that the harness matters; they disagree about which harness.","verification_needed":true,"verification_note":"Publisher and URL verified. Date is a best-estimate; confirm the publication date from the OECD page before citing."},{"slug":"miessler-bitter-lesson-engineering","title":"Bitter Lesson Engineering","author":"Daniel Miessler","date":"2025-05-31","year":2025,"source_type":"essay","url":"https://danielmiessler.com/blog/bitter-lesson-engineering","excerpt":"As AI gets better, Bitter Lesson Engineering becomes increasingly important.","summary":"Leans on Richard Sutton's 'The Bitter Lesson' to argue that prescriptive scaffolding around AI systems is a losing strategy in the limit: you should specify intent precisely and let the best available model figure out the path.","why_it_matters":"Supplies the underlying argument for Miessler's harness-engineering taxonomy. 
Useful anchor for the anti-prescriptive pole of the library.","tags":["bitter-lesson","anti-prescriptive","sutton","design-discipline"],"role":"framework-piece","harness_types":["input-shaping"],"validation_position":["before-generation"],"validation_mode":["empirical"],"domain":"cross-domain","prescription_stance":"anti-prescriptive","relation_to_argument":["capability-is-extended","diffusion-adoption-bottleneck","first-mile-input-formation"],"dimensions":{"observability":{"note":"","score":2},"repairability":{"note":"Anti-prescriptive stances tend to underweight the value of diagnostic repair loops.","score":2},"task_structure":{"note":"","score":2},"reward_richness":{"note":"","score":2},"feedback_latency":{"note":"","score":2},"input_legibility":{"note":"Being specific about intent *is* input legibility. The whole prescription.","score":4},"institutional_ratification":{"note":"","score":1}},"annotation":"The conceptual base for [Good and Bad Harness Engineering](/library/miessler-good-and-bad-harness-engineering). The argument is a corollary of Sutton's \"Bitter Lesson\": methods that encode human prior knowledge get beaten in the long run by methods that scale general learning. Therefore: encode *what* you want (the construct, the outcome, the user intent) and let the model handle *how*.\n\nIn practice this produces a design stance close to [Tan's thin-harness](/library/tan-thin-harness-fat-skills), but arrived at from a different direction. Tan: \"as models improve, scaffolding gets absorbed.\" Miessler-via-Sutton: \"general methods beat prescriptive ones; prescriptive harness is prescriptive method.\"\n\n### Disagreement preserved\n\nThis entry deliberately scores low on `repairability`, `observability`, and `institutional_ratification`. That is the anti-prescriptive pole: less scaffolding means less to diagnose, less to inspect, and fewer institutional seams. 
Pair this entry with measurement-focused entries to see the tension.","verification_needed":true,"verification_note":"URL and author verified; exact publish date is a best guess (site blocks automated fetch). Confirm from the post header before citing."},{"slug":"salaudeen-measurement-to-meaning","title":"Measurement to Meaning: A Validity-Centered Framework for AI Evaluation","author":"Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo","date":"2025-05-14","year":2025,"source_type":"paper","url":"https://arxiv.org/abs/2505.10573","excerpt":"The paper provides a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence.","summary":"Proposes a validity-centered framework for AI evaluation that reasons explicitly about which evaluative claims the evidence actually supports, with detailed vision and language case studies. The operational companion to the Wallach/Jacobs position paper.","why_it_matters":"The applied-framework companion to the Wallach/Jacobs position paper. 
Where the position paper diagnoses the field, this paper hands you a tool: map the evidence you have to the claim you want to make, and refuse to make claims the evidence can't support.","tags":["measurement","construct-validity","evaluation","validity-framework","claims-vs-evidence"],"role":"measurement-piece","harness_types":["validation-harness","ratification-harness"],"validation_position":["immediately-after-generation","post-deployment"],"validation_mode":["empirical","interpretive","social"],"domain":"research","prescription_stance":"strongly-procedural","relation_to_argument":["validation-is-constitutive","observability-matters","institutions-shape-capability"],"dimensions":{"observability":{"note":"Observability in the measurement-theoretic sense: what does the evaluation actually let you see?","score":5},"repairability":{"note":"","score":3},"task_structure":{"note":"","score":3},"reward_richness":{"note":"","score":2},"input_legibility":{"note":"","score":3},"offline_evaluability":{"note":"","score":4},"institutional_ratification":{"note":"","score":4}},"annotation":"Where the [Wallach/Jacobs position paper](/library/wallach-measurement-challenge) argues that generative-AI evaluation is a social-science measurement challenge, this paper supplies the operational framework. Two case studies (vision and language model evaluations) demonstrate how explicitly reasoning about validity strengthens or weakens the claims an evaluation can support.\n\nThe central move is refusing the shortcut from *benchmark score* to *capability claim*. A model that does well on a math benchmark may be good at that benchmark, not good at math. 
A model that does well on graduate-exam-style questions may be good at graduate-exam-style questions, not good at reasoning.\n\n### Useful against\n\n- \"Reward richness is the lever\" framings — this paper asks which construct the reward even measures.\n- \"Thin harness, fat skills\" — a reminder that the skills you think you are pushing into the model are defined by the evaluations you check them with.\n\n### Useful for\n\n- Anyone who wants to score a library entry on `institutional_ratification` or `observability` with conceptual grounding rather than intuition.","verification_needed":true,"verification_note":"Title, authors, and URL verified; date is best-estimate from the arxiv ID (2505 = May 2025) — confirm exact first-submission date before citing."},{"slug":"expanding-rlvr-across-domains","title":"Expanding RL with Verifiable Rewards Across Diverse Domains","author":"Ma et al.","date":"2025-03-30","year":2025,"source_type":"paper","url":"https://arxiv.org/abs/2503.23829","summary":"arXiv paper investigating how reinforcement learning with verifiable rewards (RLVR) generalises beyond the easy cases (math, code) to more diverse domains. The technical paper whose conceptual shadow Royzen's domain-claim entry sits in.","why_it_matters":"Grounds the 'verifiable-reward domain' framing in the ML research literature. 
Useful for readers who want the technical story behind practitioner claims that finance, code, and math are uniquely favourable.","tags":["rlvr","verifiable-rewards","reinforcement-learning","domain-generalisation"],"role":"domain-claim","harness_types":["learning-harness","validation-harness"],"validation_position":["immediately-after-generation","post-deployment"],"validation_mode":["mechanical","empirical"],"domain":"research","prescription_stance":"strongly-procedural","relation_to_argument":["reward-structure-matters","domain-structure-matters","validation-is-constitutive"],"dimensions":{"observability":{"note":"","score":3},"repairability":{"note":"Mark against reward-richness: a verifiable outcome signal can still be silent on the error mechanism.","score":3},"task_structure":{"note":"","score":4},"reward_richness":{"note":"RLVR is the paradigmatic high-reward-richness method; the paper's concern is which domains actually admit it.","score":5},"feedback_latency":{"note":"","score":4},"input_legibility":{"note":"","score":3},"offline_evaluability":{"note":"","score":4},"institutional_ratification":{"note":"","score":2}},"annotation":"Technical complement to the practitioner entries on verifiable-reward domains. The paper asks the question the library should keep asking: *which* diverse domains does RLVR actually generalise to, and what breaks when it doesn't?\n\nRhetorically, this entry is included to prevent the library from collapsing \"verifiable reward\" into a slogan. 
There is a research program behind it with real empirical findings — both supporting and complicating the practitioner framings.\n\n### Read alongside\n\n- [Royzen: Standard Signal](/library/royzen-standard-signal) — the finance-domain-favourability claim.\n- [Rubrics as Rewards (RaR)](/library/rubrics-as-rewards) — extending the framing past crisp-outcome domains.\n- [Wallach/Jacobs et al.](/library/wallach-measurement-challenge) — the measurement-validity pushback.","verification_needed":true,"verification_note":"Title and URL verified. First-author and full author list not confirmed; arxiv date is best-estimate from the arxiv ID (2503 = March 2025). Confirm before citing."},{"slug":"wallach-measurement-challenge","title":"Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge","author":"Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs","date":"2025-02-01","year":2025,"source_type":"paper","url":"https://arxiv.org/abs/2502.00561","excerpt":"The measurement tasks involved in evaluating generative AI systems lack sufficient scientific rigor, leading to a tangle of sloppy tests and apples-to-oranges comparisons.","summary":"ICML 2025 position paper arguing that generative AI evaluation is fundamentally a social-science measurement problem, and presenting a four-level framework grounded in measurement theory for constructs related to GenAI capabilities, behaviors, and impacts.","why_it_matters":"The foundational pushback against treating any evaluation number as self-evidencing. 
If the measurement instrument doesn't validly pick out the construct (reasoning, helpfulness, safety, legal competence), a high score is not a capability claim.","tags":["measurement","construct-validity","evaluation","social-science","icml-2025","position-paper"],"role":"measurement-piece","harness_types":["validation-harness","ratification-harness"],"validation_position":["before-generation","post-deployment","continuous"],"validation_mode":["empirical","social","institutional","interpretive"],"domain":"research","prescription_stance":"strongly-procedural","relation_to_argument":["validation-is-constitutive","institutions-shape-capability","observability-matters"],"dimensions":{"observability":{"note":"","score":4},"repairability":{"note":"","score":3},"task_structure":{"note":"","score":3},"reward_richness":{"note":"Deliberate counterweight to reward-richness maximalism: a rich signal for the wrong construct is not evidence of capability.","score":2},"feedback_latency":{"note":"","score":2},"input_legibility":{"note":"","score":3},"offline_evaluability":{"note":"Offline eval is only as good as the construct it ratifies.","score":3},"institutional_ratification":{"note":"The paper's four-level framework makes ratification a first-class object of inquiry.","score":5}},"annotation":"Sets up measurement and construct validity as *prior* to evaluation. A benchmark score is a claim about a construct, and the validity of that claim depends on whether the instrument actually measures the construct. The paper argues that most GenAI evaluation skips this step, producing a tangle of sloppy tests and apples-to-oranges comparisons.\n\nThe authors import a four-level framework from social-science measurement theory and apply it to GenAI. 
The argument is explicitly *not* that better metrics solve the problem — it is that capability claims depend on validity work that is social, interpretive, and institutional.\n\nPlaced against verifiable-reward framings ([Royzen](/library/royzen-standard-signal); [Expanding RLVR](/library/expanding-rlvr-across-domains)), the tension is direct:\n\n- **Verifiable-reward**: the reward is verifiable when the outcome is checkable.\n- **Measurement-validity**: checkability of an outcome does not imply the outcome measures the construct you care about. The \"verifiable\" in verifiable reward is doing more work than it admits.\n\nBoth can be true at once. A narrow technical task (theorem proved, test suite passed) may have near-trivial validity. A broad capability claim (legal reasoning, medical judgment, general agentic competence) almost never does. The library preserves this disagreement structurally — entries can score high on `reward_richness` while scoring low on `input_legibility` and unknown on validity.\n\n### Related entries\n\n- [Measurement to Meaning (Salaudeen et al. 2025)](/library/salaudeen-measurement-to-meaning) — the validity-centered framework applied.\n- [Royzen: Standard Signal](/library/royzen-standard-signal) — poster case for reward richness.","verification_needed":false}]}