Deep Research Query: Work Registration and Collision Prevention
Sessions frequently recommend or start work that another session already owns.
A ChatGPT Deep Research query and result asking how task queues, multi-agent systems, advisory locks, CI/CD systems, collaborative editing tools, and lightweight session manifests handle claim, heartbeat, expiry, and duplicate-work prevention.
Classification
- Role
- framework-piece
- Domain
- software
- Source type
- note
- Harness types
- execution-harness, validation-harness, monitoring-harness, social-harness, interface-harness
- Validation position
- before-action, continuous
- Validation mode
- mechanical, social, institutional
- Prescription stance
- strongly-procedural
- Relation to argument
- capability-is-extended, validation-is-constitutive, observability-matters, breakdown-when-harness-absent, institutions-shape-capability, diffusion-adoption-bottleneck
- Tags
- deep-research-query, work-registration, collision-prevention, multi-agent, session-registry, sqlite, advisory-locks, heartbeat, orders, orchestration
Extended capability commentary
- Input legibility
- The query turns a local operational pain into a concrete research object with constraints, existing tools, and desired decision outputs.
- Task structure
- It frames work ownership as a state-machine problem: claim, heartbeat, expiry, completion, abandonment, and migration.
- Reward richness
- The target protocol can observe collisions and stale claims, but success is partly social adoption rather than a single crisp reward.
- Feedback latency
- The stated preflight budget is under three seconds, and the heartbeat/expiry question asks how quickly abandoned work should be visible.
- Repairability
- A claim registry would make collisions diagnosable and recoverable, but only if the protocol records enough state to distinguish active, blocked, stale, and abandoned work.
- Observability
- The whole prompt is about making otherwise invisible concurrent agent work visible before new work begins.
- Reversibility
- Crash-safe leases and expiring claims are designed to avoid permanent blockage while preserving evidence of abandoned work.
- Offline evaluability
- The proposed solution space is local-first: SQLite, filesystem manifests, git worktrees, and reproducible preflight checks.
- Institutional ratification
- Orders, Linear issues, status fields, and human interruption rights make ownership a ratified workflow state, not merely a process-local lock.
Why it matters
This is a local operating-system problem for agent work: the effective capability of several strong coding agents is bounded by whether the surrounding harness can prevent duplicate work, stale ownership, and invisible in-flight state.
Annotation
This entry is a new kind of library object: a Deep Research packet. The source is Daniel's query to ChatGPT Deep Research and the resulting ChatGPT Deep Research report. It is preserved because the prompt and result expose the operating theory behind the system being built.
The capability-library lens is straightforward: work collision is not a model-quality failure. It is a harness failure. Multiple sessions can each be locally reasonable and still globally collide if ownership is not legible, claimable, observable, and recoverable. A work registry is therefore not bureaucracy around capability; it is part of the capability surface.
Extended Capability Read
The query asks for a minimal offline protocol because the local constraint matters: no daemon, no cloud coordinator, no shared context window, and a human as the only interrupt channel. That rules out a lot of distributed-systems theater and points toward a small lease model:
- atomic claim before action,
- visible ownership before recommendation,
- heartbeat while active,
- expiry after silence,
- explicit completion or abandonment,
- migration that tolerates old sessions ignoring the protocol at first.
The interesting disagreement is not whether a registry is theoretically cleaner. It is whether the registry becomes a real gate or another voluntary signal beside the existing session feed. The prompt names adoption failure as a first-class risk, which is exactly the kind of social-harness issue this library should keep visible.
Capability Mapping
The result maps the coordination problem onto leases rather than permanent locks. That distinction is the load-bearing concept for this library entry: capability comes from a small lifecycle around the model's work, not from the model remembering more context. The lifecycle is claim, heartbeat, expiry, release, and human-visible override.
Mapped to the library's dimensions:
- Execution harness: `work-on ORDER_ID` becomes the required start-work path. The harness does not merely advise the agent; it changes what actions are allowed.
- Monitoring harness: `session_presence`, `order_claims`, heartbeat timestamps, stale thresholds, and worktree metadata turn invisible terminal work into observable system state.
- Validation harness: preflight checks move from reading advisory status text to checking the authoritative lease table before recommending or beginning work.
- Repair harness: expiry and reclaim convert crash recovery from manual cleanup into a recoverable state transition.
- Social/interface harness: Order frontmatter, session feed, Linear, and worktrees remain mirrors for humans, while SQLite becomes the exclusion source of truth.
- Institutional ratification: the result makes ownership a ratified workflow state: not "the agent thinks it is working," but "the local work system recognizes this claim."
The result also clarifies a shelf-level distinction: multi-agent frameworks and memory systems are not the same thing as a cross-session work registry. CrewAI, AutoGen, LangGraph, and Swarm may provide orchestration, persistence, checkpoints, or handoffs inside their runtimes, but the collision problem here lives between independent shells. That places this entry beside OpenAI's "Symphony," but at a smaller local scale: issue tracker as control plane there, SQLite lease registry here.
The new vocabulary this adds to the library:
- Lease, not lock: ownership should expire unless refreshed.
- Presence is not ownership: online/session liveness can inform UI, but it should not decide durable task ownership.
- Mirrors are not source of truth: markdown, feed events, Linear, and worktrees are useful diagnostic surfaces, but exclusion needs an atomic shared substrate.
- Mandatory path beats advisory memory: duplicate-work prevention fails if claim registration remains voluntary.
The main unmapped residue is citation hygiene. The pasted result includes ChatGPT Deep Research session-local citation markers. Those markers are useful provenance inside the originating session, but they are not durable public citations yet.
Appendix A: Deep Research Result
Citation note: the turn... citation markers below came from the Deep Research session and still need replacement with durable public source links.
Bottom line
The pattern that keeps reappearing across task queues, durable workflow engines, CI/CD systems, and collaborative-presence systems is not a permanent lock; it is a lease: a worker atomically claims a unit of work, refreshes liveness while it is active, and another worker can reclaim only after expiry. That is explicit in BullMQ's job locks and renewals, Temporal's task queues plus activity heartbeats and retries, Sidekiq Pro's durable fetch plus orphan recovery, and Celery's late acknowledgements plus visibility-timeout redelivery. On a single machine with SQLite, you can implement the same idea with one order_claims table, one session_presence table, BEGIN IMMEDIATE for atomic claim attempts, and an expiring lease_expires_at column.
For your constraints, the lightest robust design is therefore not "make sessions check the feed more carefully," and not "put a .lock file next to each Order." It is a SQLite-backed work registry that is mandatory on session start, mandatory before work begins, and refreshed by a tiny per-session heartbeat helper. Order frontmatter, the session feed, Linear, and visible worktrees should become mirrors for humans and diagnostics; the SQLite lease row should become the source of truth for exclusion. SQLite's writer serialization and BEGIN IMMEDIATE are exactly what you want for an offline, single-machine, no-daemon claim path.
The other big finding is that the named multi-agent frameworks do not generally solve this exact problem for you. Their docs emphasize workflow orchestration, persisted state, memory, checkpoints, or stateless handoffs within a runtime; they do not expose a first-class cross-terminal "claim board" that independent coding sessions must consult before touching the same task. In other words, your coordination problem is closer to a local task-queue lease problem than to an agent-memory problem.
Claim semantics in task queues and durable runtimes
In Celery, the core tradeoff is visible in the ack policy. By default, Celery acknowledges a task before execution so an already-started task is not executed again; if you enable acks_late, the broker acknowledgement happens after execution instead. Celery also documents Redis/SQS-style visibility_timeout, where an unacked message is redelivered after the timeout, and task_reject_on_worker_lost if you want requeue on abrupt worker loss. This is the classic "claim + ack/retry + expiry" family, but with the usual at-least-once tradeoff: safer recovery means more duplicate-execution risk, which is why Celery tells you to make tasks idempotent.
Sidekiq's docs make the same lesson even more blunt. Its default fetch path uses BRPOP, which removes the job from Redis immediately; if the process crashes mid-job, the job can be lost. Sidekiq Pro's super_fetch switches to LMOVE so the job stays in Redis until completion, then recovers orphaned jobs after process-heartbeat expiry. Sidekiq Enterprise's unique-jobs feature adds a second lesson that matters directly for your design: uniqueness is best effort, the lock always has a TTL, and that TTL is mandatory because otherwise crash-held locks would last forever.
BullMQ is almost a textbook lease implementation. A worker places a lock on the job when processing begins, renews it at lockRenewTime (by default half of lockDuration), and if renewal stops the job is considered stalled and can be restarted. BullMQ explicitly warns that a worker that dies, or a CPU-bound processor that blocks the event loop long enough to miss renewal, can cause a job to be double-processed. That is the exact failure shape you are trying to avoid, and it is also why your scheme needs both heartbeats and idempotent completion paths.
Temporal wraps the same ideas in a stronger durable-execution model. The service persists workflow history, workers poll task queues, activity tasks are picked up by one worker attempt at a time, and the API presents an "effectively once" experience at the workflow level even though multiple activity-task executions may happen underneath due to retry or timeout. Activity heartbeats can carry progress so a retry can resume from the last checkpoint; Temporal's docs recommend heartbeats for long-running activities and show typical heartbeat timeouts on the order of a minute. The important design takeaway is that claim state lives in durable shared storage, not in the worker's private context window.
For a local SQLite-backed system, the minimum useful translation of all of that is: one live row per claimable Order, acquired atomically, renewed periodically, released on completion, and stealable only after expiry. Everything else, feeds, markdown status, worktrees, and issue trackers, should enrich observability, not decide ownership.
What agent frameworks, locks, CI/CD, and collaborative tools provide
Agent frameworks such as CrewAI, AutoGen, LangGraph, and Swarm point in the same direction. CrewAI Flows own state and execution order, and its persistence can save Flow state to a database so execution can resume after a crash. AutoGen focuses on agent lifecycle, communication, and deterministic patterns inside a runtime. LangGraph saves checkpoints for fault-tolerant execution and human-in-the-loop resumes. Swarm is stateless between calls and stores no state between runs. These are useful primitives, but none is a built-in cross-process claim registry for independent terminal sessions on the same repo.
The practical implication is that "multi-agent memory" is not enough to prevent collisions when the agents are separate shells with separate context windows. Duplicate-work prevention needs to live in a shared substrate all sessions read before acting. In LangGraph that substrate could be a store/checkpointer; in CrewAI, persisted Flow state; in Swarm, something entirely external because the framework itself is stateless. In every case, the lock or lease must be mandatory in the start-work path, not an optional memory artifact.
SQLite gives two decisive advantages for this use case: single-writer serialization and transactional state. SQLite transactions are serializable, there can only be one writer at a time, and BEGIN IMMEDIATE starts the write transaction up front so later writes inside that transaction do not fail with a surprise SQLITE_BUSY. WAL mode lets readers continue while a writer appends to the WAL, which is what you want when several sessions frequently read live claims and only occasionally write them. WAL is not the claim protocol; it is the concurrency mode that makes a claim table pleasant to use.
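A minimal sketch of what that looks like from a host language, using Python's stdlib `sqlite3`. The function names (`open_registry`, `with_immediate`) and the database path are hypothetical; the pragmas match the schema section later in this entry. Note that `isolation_level=None` disables Python's implicit transaction management so that `BEGIN IMMEDIATE` can be issued explicitly:

```python
import sqlite3

def open_registry(path):
    # isolation_level=None disables the sqlite3 module's implicit transactions,
    # so the claim path controls BEGIN IMMEDIATE / COMMIT itself.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode = WAL")   # readers keep working while a writer appends
    conn.execute("PRAGMA busy_timeout = 1500")  # wait up to 1.5s for the writer slot
    return conn

def with_immediate(conn, fn):
    # Reserve the writer slot up front, so later writes inside the
    # transaction cannot fail with a surprise SQLITE_BUSY.
    conn.execute("BEGIN IMMEDIATE")
    try:
        result = fn(conn)
        conn.execute("COMMIT")
        return result
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Everything claim-related then runs through `with_immediate`, which keeps the "reserve writer slot, then inspect and update" ordering the paragraph above describes.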
flock() is good if all you need is same-host exclusion on one file. Its lock is released when file descriptors close, so process death has good crash behavior on one host. The downside is that flock() is not a queryable registry. It does not naturally answer "who currently owns this Order, since when, with what issue, and when does the claim expire?" without a second metadata channel.
Plain .lock files have the opposite tradeoff. They are inspectable and easy to create, but they do not auto-release on crash. Git's index.lock is the canonical cautionary example: stale lock files can remain after early exit and require manual cleanup once no process is active. That is tolerable for occasional repository maintenance; it is a poor default for autonomous sessions that need reclaimable, crash-safe work ownership.
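The crash-safety difference between the two file-based approaches is easy to demonstrate. A hedged sketch of the `flock()` side, Unix-only and using Python's stdlib `fcntl` (the helper name `try_flock` is invented for illustration): the kernel drops the lock when the descriptor closes, including on process death, which is exactly what a plain `.lock` file cannot do.

```python
import fcntl
import os

def try_flock(path):
    """Attempt a non-blocking exclusive flock.

    Returns the open fd on success (hold it for the lock's lifetime),
    or None if another open file description already holds the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # Kernel releases the lock automatically when fd closes,
        # including abrupt process death -- no zombie locks on one host.
        return fd
    except BlockingIOError:
        os.close(fd)
        return None
```

What this cannot do, as the paragraph above notes, is answer "who owns this, since when, until when?" -- that metadata would need a second channel, which is the argument for the queryable SQLite registry instead.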
CI/CD systems serialize conflict-prone work by attaching it to a centralized concurrency key. GitHub Actions concurrency groups allow at most one running and one pending job, with optional cancellation of the running one. GitHub self-hosted runner jobs are re-queued if the runner does not pick them up quickly. GitLab resource_group forces jobs such as deployments to run one at a time and offers process modes like oldest_first and newest_first. Buildkite exposes branch-level dedupe controls and agent heartbeat health. The pattern is centralized scheduling authority plus explicit concurrency scope.
Collaborative systems separate durable ownership from ephemeral presence. Yjs Awareness is for "who is online?" and cursor/presence state; peers can be marked offline after missed updates. That is good for "someone is here" UI but too ephemeral to be durable task ownership. Project-management tools similarly signal "someone is working on this" through explicit assignee and workflow fields rather than crash-detecting session leases.
Coordination mechanisms compared
| mechanism | latency | crash safety | complexity | requires daemon | offline-capable | example system |
|---|---|---|---|---|---|---|
| SQLite lease row with expiry and heartbeat | very low; one local transaction | high, because claims are reclaimable after missed heartbeats | medium | no | yes | BullMQ/Temporal-style lease and renew semantics |
| Session manifest directory with TTL | very low; local file reads/writes | medium; good for presence, weaker for atomic takeover | low | no | yes | Yjs-style awareness/presence pattern |
| flock() lockfile plus separate metadata file | very low | high for same-host process death; kernel drops lock on close | low to medium | no | yes | Linux advisory file locking |
| Plain .lock file with PID/timestamp | very low | low; stale files remain after early exit/crash | low | no | yes | Git index.lock behavior |
| Event log or heartbeat feed only | low | low for exclusion; good for observability only | low | no | yes | Celery monitoring / worker-heartbeat events |
| Git worktree / branch heuristic | low | low; worktrees show activity but do not grant or release ownership atomically | low | no | yes | Git worktree plus issue/branch conventions |
| Central concurrency group / scheduler | low control-plane latency | high | high | yes | usually no | GitHub Actions, GitLab resource groups, Buildkite branch-build cancellation |
The best fit is a hybrid: SQLite for the authoritative claim lease, Order markdown and the session feed for visibility, and worktree/Linear metadata as supporting context shown to humans when a claim is stale or disputed.
Recommended minimal protocol
The smallest trustworthy design is fully offline, uses no central daemon, stays inside a three-second preflight budget, and degrades safely when a session crashes.
1. Register session existence at startup, unconditionally. Replace voluntary session-feed posting with a required launcher step that upserts a `session_presence` row keyed by `session_id` and records host, PID, start time, current tab title, Linear issue, and `last_seen_at`.
2. Preflight every recommendation against live claims, not status text. `pitch` and `suggest` should read SQLite first: which Orders are `ready`, which are claimed with an unexpired lease, which are stale, and which have matching worktrees or Linear issues.
3. Make start work an atomic claim transaction. The only supported path to `in_progress` should be a wrapper such as `work-on ORDER-143`. Inside one SQLite transaction, it attempts to acquire or steal an expired claim. `BEGIN IMMEDIATE` reserves the writer slot before inspecting and updating the claim row.
4. Mirror after commit, never before commit. Only after the claim transaction commits should the wrapper update Order status, write `claimed_by` / `claimed_at` / `lease_expires_at` into frontmatter, emit the session-feed event, create or record the worktree, and optionally update Linear.
5. Refresh heartbeats out-of-band from the model's reasoning loop. Do not ask the agent to remember to heartbeat. Start a tiny per-session helper that updates both `session_presence.last_seen_at` and the relevant `order_claims.heartbeat_at` / `lease_expires_at` every 30 seconds.
6. Release on completion in the same place work completion is recorded. A `complete-order` command should mark the claim released, record completion timestamps, and write the session-feed completion event. Keep a separate append-only `claim_history` table if audit history matters.
7. Treat crashes as missed-heartbeat expiry, not immortal locks. If the session disappears, the helper stops heartbeating. After expiry, the claim becomes reclaimable automatically.
8. Require a human-visible override path, but only for expired or obviously stale claims. The override UI should show incumbent session, last heartbeat, worktree path, Linear issue, and whether the claim is stale or expired.
Minimal schema:
```sql
PRAGMA journal_mode = WAL;
PRAGMA busy_timeout = 1500;

CREATE TABLE IF NOT EXISTS session_presence (
  session_id        TEXT PRIMARY KEY,
  host              TEXT NOT NULL,
  pid               INTEGER,
  started_at_ms     INTEGER NOT NULL,
  last_seen_at_ms   INTEGER NOT NULL,
  tab_title         TEXT,
  linear_issue_id   TEXT,
  current_order_id  TEXT
);

CREATE TABLE IF NOT EXISTS order_claims (
  order_id            TEXT PRIMARY KEY,
  session_id          TEXT NOT NULL,
  claimed_at_ms       INTEGER NOT NULL,
  heartbeat_at_ms     INTEGER NOT NULL,
  lease_expires_at_ms INTEGER NOT NULL,
  worktree_path       TEXT,
  linear_issue_id     TEXT,
  FOREIGN KEY (session_id) REFERENCES session_presence(session_id)
);

CREATE INDEX IF NOT EXISTS idx_order_claims_expires
  ON order_claims (lease_expires_at_ms);
```
Atomic claim attempt:
```sql
BEGIN IMMEDIATE;

INSERT INTO order_claims (
  order_id, session_id, claimed_at_ms, heartbeat_at_ms,
  lease_expires_at_ms, worktree_path, linear_issue_id
)
VALUES (
  :order_id, :session_id, :now_ms, :now_ms,
  :lease_expires_at_ms, :worktree_path, :linear_issue_id
)
ON CONFLICT(order_id) DO UPDATE SET
  session_id          = excluded.session_id,
  claimed_at_ms       = excluded.claimed_at_ms,
  heartbeat_at_ms     = excluded.heartbeat_at_ms,
  lease_expires_at_ms = excluded.lease_expires_at_ms,
  worktree_path       = excluded.worktree_path,
  linear_issue_id     = excluded.linear_issue_id
WHERE order_claims.lease_expires_at_ms < :now_ms
RETURNING order_id, session_id, lease_expires_at_ms;

COMMIT;
```
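The upsert encodes all three outcomes in whether `RETURNING` yields a row: a fresh insert and a steal of an expired lease both return a row, while a conflict with a live lease makes the `DO UPDATE ... WHERE` clause fail and returns nothing. A hedged Python sketch of the `work-on` wrapper's claim step (the `try_claim` name and defaults are assumptions; `RETURNING` requires SQLite 3.35 or newer, and `isolation_level=None` keeps transaction control explicit):

```python
import sqlite3
import time

CLAIM_SQL = """
INSERT INTO order_claims (
  order_id, session_id, claimed_at_ms, heartbeat_at_ms,
  lease_expires_at_ms, worktree_path, linear_issue_id
)
VALUES (:order_id, :session_id, :now_ms, :now_ms,
        :lease_expires_at_ms, :worktree_path, :linear_issue_id)
ON CONFLICT(order_id) DO UPDATE SET
  session_id          = excluded.session_id,
  claimed_at_ms       = excluded.claimed_at_ms,
  heartbeat_at_ms     = excluded.heartbeat_at_ms,
  lease_expires_at_ms = excluded.lease_expires_at_ms,
  worktree_path       = excluded.worktree_path,
  linear_issue_id     = excluded.linear_issue_id
WHERE order_claims.lease_expires_at_ms < :now_ms
RETURNING session_id
"""

def try_claim(conn, order_id, session_id, lease_ms=180_000, now_ms=None):
    """Atomically claim an Order, stealing it only if the incumbent lease
    has expired. Returns True on success, False if a live lease blocks us."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    conn.execute("BEGIN IMMEDIATE")  # reserve the writer slot before inspecting
    try:
        row = conn.execute(CLAIM_SQL, {
            "order_id": order_id, "session_id": session_id,
            "now_ms": now_ms, "lease_expires_at_ms": now_ms + lease_ms,
            "worktree_path": None, "linear_issue_id": None,
        }).fetchone()
        conn.execute("COMMIT")
        # No RETURNING row means the conflict-update WHERE failed:
        # another session holds an unexpired lease.
        return row is not None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Mirroring into frontmatter, the session feed, and Linear would happen only after `try_claim` returns True, matching the mirror-after-commit rule above.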
The lock granularity should be the Order ID, not the repository, not the worktree, and not the entire Linear issue set. The source of truth should be the SQLite lease row, not markdown frontmatter. The mandatory path should be launcher plus work-on / complete-order wrappers, not model instructions embedded in prompts.
Anti-patterns and failure modes
- Removing or acknowledging work before the worker is actually safe. If you mark an Order `in_progress` in markdown before the SQLite claim commits, you rebuild the same race queue systems try to avoid.
- Using infinite or very long locks. "Claim until manually cleared" produces zombie Orders that nobody trusts enough to touch.
- Making leases too short for the actual execution environment. Heartbeats tied to the model's main loop can miss beats during long tool calls, blocked subprocesses, or terminal suspension.
- Using presence as ownership. Presence should inform UI; the lease row should decide who owns work.
- Picking the wrong lock granularity. Too broad cancels unrelated work; too narrow still permits collisions.
- Assuming queue order guarantees you do not actually have. "Once at a time" and "which one goes next" are different policies.
- Treating plain `.lock` artifacts as crash-safe locks. Stale files become operational toil.
- Leaving the registry voluntary. A system that agents may ignore is not on the critical path and will lose to locally rational behavior.
Heartbeat and expiry defaults
| parameter | with a tiny per-session heartbeat helper | without a helper |
|---|---|---|
| heartbeat interval | 30s | 60s |
| soft-stale threshold | 90s | 180s |
| hard-expiry / reclaimable | 180s | 300s |
| startup grace before showing stale | 60s | 90s |
| retain released claim history | 24h to 7d | 24h to 7d |
With 3-8 concurrent sessions and 10-15 active Orders, a 30-second helper heartbeat is trivial write load for SQLite and gives fast enough "someone is still on this" feedback. A soft-stale state after roughly three missed beats lets pitch downgrade the Order without automatically stealing it. A hard expiry after roughly six missed beats is long enough to avoid false positives from transient pauses, but short enough that a dead session does not block the team for half an hour. If heartbeats depend on the session's main work loop, use looser leases.
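The three states in the table reduce to a single threshold comparison. A hedged sketch (the `classify` name is invented; the thresholds are the with-helper defaults from the table above):

```python
def classify(heartbeat_at_ms, now_ms, soft_stale_ms=90_000, hard_expiry_ms=180_000):
    """Map a claim's last heartbeat to the three states pitch cares about."""
    silence = now_ms - heartbeat_at_ms
    if silence >= hard_expiry_ms:
        return "expired"  # reclaimable automatically, human override not required
    if silence >= soft_stale_ms:
        return "stale"    # pitch downgrades the Order but does not steal it
    return "live"         # an unexpired lease: do not recommend this Order
```

The asymmetry between "stale" (downgrade only) and "expired" (reclaimable) is what implements the prefer-false-negatives rollout rule described next.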
The rollout tuning rule is: prefer false negatives over false positives during rollout. A stale claim that is reclaimable a bit later is annoying. A false live claim that blocks legitimate work trains humans and agents to route around the system.
Migration path
The safest retrofit is to introduce the registry in shadow mode, then raise it to soft enforcement, then to hard enforcement once the launcher path is universal.
Start by extending the existing startup path, currently set-tab-title, so every new session registers in session_presence. In this phase, pitch only warns: "live claim by session X," "stale claim," "legacy in-progress without claim," and "matching worktree exists." Nothing blocks yet.
Next, add a mandatory work-on ORDER_ID wrapper. That wrapper performs the atomic claim, then mirrors the result into Order status/frontmatter, the session feed, and optional Linear updates. Legacy sessions that directly flip markdown to in_progress still exist, but they are classified as legacy occupancy hints rather than hard locks.
After that, move to soft enforcement. pitch should stop recommending Orders with a live unexpired lease. complete-order should refuse to complete an Order unless the completing session owns the current claim or a human passes an override. in_progress without a live claim should remain visible but suspicious rather than authoritative.
Finally, move to hard enforcement when almost all sessions use the wrapper path. Direct edits that set in_progress should be linted or rejected, pitch should treat the SQLite lease as canonical, and the event log should become historical telemetry instead of arbitration.
Three compatibility rules make the migration safer:
- Old sessions may continue to work, but they do not create authoritative claims.
- New sessions may see legacy state and warn on it, but only a live lease can block.
- Human-visible mirrors remain important, so claim metadata should still be written into Orders and the session feed even after SQLite becomes canonical.
Appendix B: Query
Deep Research: Work Registration & Collision Prevention in Multi-Agent/Multi-Session Systems
Context
I run a solo developer operation where 3-8 Claude Code sessions run simultaneously on the same codebase, each in its own terminal tab (sometimes also Codex and Cursor agents). Each session has its own context window and makes locally rational decisions. Work is tracked via "Orders" (structured markdown specs with status fields: draft -> ready -> in_progress -> complete) and Linear issues.
The problem: sessions frequently recommend or start work that another session already owns. A "pitch" skill recommends the highest-leverage next action, but it has no reliable way to know what's in-flight. A session feed (SQLite-backed event log) exists but sessions only post to it voluntarily - there's no mandatory registration or heartbeat. Order status fields exist but aren't updated atomically when a session begins work.
Architecture constraints:
- Sessions are ephemeral processes (no persistent daemon)
- No shared memory between sessions - coordination must happen through filesystem or SQLite
- Sessions can read each other's state but can't signal each other
- Human is the only entity that can interrupt a running session
- Latency budget for preflight checks: <3 seconds
- Must work offline (no cloud coordination service)
- ~220 orders exist, ~10-15 are active at any time
Current tools:
- `session-feed` (SQLite): event log with session.started, status, completed, session.ended events - but posting is manual/voluntary
- `set-tab-title`: registers session existence + issue ID at startup
- Order files: have `status` field but no `claimed_by` or `claimed_at`
- Git worktrees: each active implementation gets its own worktree (observable via `git worktree list`)
- Pitch/suggest skills: read session feed and order status before recommending, but don't enforce exclusion
Questions
1. How do distributed task-queue systems (Celery, Temporal, Sidekiq, BullMQ) handle "claim" semantics - specifically the pattern where a worker must atomically claim a task before executing it, and other workers must see that claim before picking up the same task? What's the minimal implementation of this for a SQLite-backed single-machine system?
2. What patterns exist in multi-agent AI systems (CrewAI, AutoGen, LangGraph, OpenAI Swarm) for preventing duplicate work across concurrent agents? Do any use a "work registry" or "task board" that agents check before starting? How do they handle the case where an agent starts work but crashes before completing?
3. In distributed systems literature, what's the lightest-weight protocol for "advisory locks" that doesn't require a persistent coordinator process? Specifically interested in file-based or SQLite-based approaches that survive process crashes (no zombie locks). How do systems like SQLite's WAL mode, flock(), or .lock files compare for this use case?
4. What do CI/CD systems (GitHub Actions, GitLab CI, Buildkite) do to prevent duplicate pipeline runs for the same commit/branch? How do they handle the "claim + heartbeat + expiry" lifecycle - and what's the minimum heartbeat interval that balances staleness detection against overhead?
5. How do collaborative editing systems (CRDTs, OT) and project management tools (Linear, Jira, Asana) signal "someone is working on this" to other users - specifically the UX patterns for showing claimed/in-progress state and the backend mechanisms for detecting abandoned claims (user closed their tab, session crashed)?
6. What are common failure modes when advisory-lock systems are retrofitted onto existing workflows? Specifically: (a) false positives that block legitimate work, (b) stale locks from crashed sessions, (c) lock granularity mistakes (too broad = blocking, too narrow = collisions still happen), (d) adoption failure where agents/users ignore the system because it's not mandatory.
7. Are there lightweight "session manifest" patterns where each active worker/agent writes a heartbeat file (e.g., JSON with PID, timestamp, task ID) and other workers read the manifest directory to see what's claimed? How do these compare to SQLite-based approaches for reliability and latency?
Desired Output
- Comparison table of coordination mechanisms with columns: mechanism, latency, crash safety, complexity, requires daemon, offline-capable, example system
- Recommended minimal protocol for my constraints (SQLite, no daemon, <3s preflight, crash-safe) - step-by-step lifecycle from "session wants to start work" through "session completes or crashes"
- Anti-patterns list - what NOT to do, with real examples of systems that got burned
- Heartbeat/expiry parameters - recommended intervals and timeout values for a system where sessions last 5-120 minutes
- Migration path - how to retrofit this onto an existing system where sessions already run without registration, without breaking current workflows or requiring all sessions to update simultaneously
Notes
Source packet: Daniel's query to ChatGPT Deep Research, plus the resulting ChatGPT Deep Research report. The result's citation markers are session-local and need durable source-link cleanup before formal citation.
Related entries
- An open-source spec for Codex orchestration: Symphony (Alex Kotliarskyi, Victor Zhu, and Zach Brock · 2026-04-26) · tags: orchestration, capability-is-extended, validation-is-constitutive, observability-matters, institutions-shape-capability, diffusion-adoption-bottleneck, execution-harness, validation-harness, monitoring-harness, social-harness, interface-harness
- What Is an Agent Harness (Aparna Dhinakaran · 2026-04-21) · tags: capability-is-extended, validation-is-constitutive, observability-matters, breakdown-when-harness-absent, diffusion-adoption-bottleneck, execution-harness, validation-harness, monitoring-harness, social-harness, interface-harness
- Resurrecting deceased darlings: The Missing Foreword to AI and the Art of Being Human (Andrew Maynard · 2025-10-18) · tags: capability-is-extended, institutions-shape-capability, breakdown-when-harness-absent, diffusion-adoption-bottleneck, validation-harness, social-harness, interface-harness
- LLM Knowledge Bases (Andrej Karpathy · 2026-04-01) · tags: capability-is-extended, validation-is-constitutive, observability-matters, execution-harness, validation-harness, monitoring-harness, interface-harness
Overlap is computed on tags, relation-to-argument, and harness types — not on role or domain, because contrasts are often the most useful neighbours.