The Extended Frontier: A Prediction
Working Draft
This post is a working draft, developed collaboratively with Claude Opus 4.6 (1M context) in Claude Code (Anthropic) across multiple sessions, using a variety of extensions: persona-based reviews, citation verification, AI detection analysis (Pangram Labs), deep research reports, and dissertation search (qmd). The argument, evidence curation, and editorial direction are Daniel’s; much of the prose was initially generated by Claude and is being iteratively rewritten. A changelog tracks the development. We will continue to write about how we’re exploring this idea—the process is part of the argument.
If extensions predict where the frontier is smooth, the prediction should be testable: within a single domain, subtasks with stronger feedback loops should show smoother frontiers than subtasks with weaker ones. Software engineering already demonstrates this—strong feedback for syntax, weak for security. Map the extensions, predict the frontier shape, and see if you're wrong.
Changelog
- 2026-03-24 — First draft.
The previous posts built a framework: extensions predict where the frontier is smooth, the handoff analytic reveals what else changes, and each extension has specific capacities that define its reach. That's explanation. It organizes evidence well after the fact. But if the framework only explains and never predicts, it's a lens, not a theory. This post tries to cross that line.
The falsifiability problem
Deirdre Mulligan's question cuts right to it: "What is the smallest version of this that is falsifiable? One domain, one set of extensions, one prediction derived from the theory before looking at the outcome data." The Codex review of the argument draft flagged the same gap from a different angle: the prediction needed to be stated so that it could fail.
Both are right. If the framework can't generate a prediction that could be wrong, it's useful but not testable.
So here's an attempt at the smallest falsifiable version.
The within-domain prediction
The tempting version of the prediction is cross-domain: "Software has rich extensions, law has weak ones, therefore the frontier is smoother for code than for legal reasoning." That's broadly true. It's also confounded by a hundred other variables—training data volume, task structure, the nature of language vs. formal logic, professional culture, how tasks get decomposed. You can't control for any of it. Cross-domain comparisons are suggestive but they don't test the extensions mechanism specifically.
The stronger version stays within one domain. Software engineering is bimodal in exactly the way the framework predicts, and the previous post laid out why: extensions have specific capacities, and the extensions that ground functional correctness (compiler, test suite, type system) don't reach security. Fischer et al. found that 15% of Android apps on Google Play contained vulnerable code snippets very likely copied from Stack Overflow. The code compiled. It passed tests. It was insecure.
This is a within-domain prediction, though strictly the evidence above is retrospective: the framework organizes it after the fact rather than anticipating it, which is why the test design below matters. The frontier is smooth for the dimensions where extensions are engaged (compilation, testing) and jagged for the dimensions where they're absent or weak (security, scalability, requirements correctness). Same domain. Same model. Different extension structure. Different performance.
Test design
What would a rigorous test look like? You'd want the same model, same domain, same task complexity—varying only the extension structure. Measure performance on the dimensions the extensions cover versus the dimensions they don't.
The SWE-agent ablation studies (Yang et al., NeurIPS 2024) are the closest existing controlled evidence. They held the model constant and varied the "agent-computer interface"—the commands available, the feedback format, the error handling. Interface design "significantly enhances an agent's ability" to resolve real GitHub issues, without changing weights. Same model, different scaffolding, different outcomes.
That's suggestive, but it's not quite the full test. They varied tool access—one layer of the extension ecology. The framework claims that extensions operate across four layers: training data, tool access, user practice, and work context. A complete test would vary not just whether the agent can run code, but whether the broader practice infrastructure is engaged. Can the agent access test suites? Is there a CI pipeline that catches regressions? Does the workflow include review by a human who knows the codebase?
There's also a design challenge with anything beyond tool access: user practice and work context are hard to experimentally manipulate without also changing the task. But partial tests are still informative. You could compare the same model on tasks where test suites exist vs. tasks where they don't (within the same repository, controlling for difficulty). Or compare performance on codebases with strong typing and linting vs. dynamically typed code with no static checks. The prediction: performance tracks the extension density, not just task difficulty.
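The partial test above can be sketched as an analysis plan. This is an illustrative sketch with entirely made-up data, not results: the question is whether extension density explains task success beyond what task difficulty alone explains. The variable names, scales, and effect sizes are all assumptions for illustration.

```python
# Hypothetical analysis: does extension density predict task success
# beyond task difficulty? Fit success on difficulty alone, then on
# difficulty + extension density, and compare explained variance.
# All numbers are simulated for illustration, not real benchmark data.
import numpy as np

rng = np.random.default_rng(0)
n = 200

difficulty = rng.uniform(0, 1, n)         # e.g. normalized issue complexity
extension_density = rng.uniform(0, 1, n)  # e.g. test coverage + typing + CI, rescaled

# Simulated outcome: success declines with difficulty, rises with
# extension density, plus noise. (The framework's prediction, baked in
# here only so the sketch has something to recover.)
success = 0.7 - 0.5 * difficulty + 0.4 * extension_density + rng.normal(0, 0.1, n)

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept column."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_difficulty = r_squared(difficulty.reshape(-1, 1), success)
r2_both = r_squared(np.column_stack([difficulty, extension_density]), success)

# The framework predicts the second fit explains meaningfully more variance.
print(f"R^2 (difficulty only): {r2_difficulty:.2f}")
print(f"R^2 (difficulty + extension density): {r2_both:.2f}")
```

On real data the interesting case is the null: if adding extension density moves the fit by nothing, that's the third falsifier below doing its work.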
What would falsify this
Three findings would be serious problems for the framework.
First: a domain with rich, engaged extensions showing a jagged frontier on the dimensions those extensions cover. If code that compiles, passes all tests, clears static analysis, and survives code review still fails unpredictably on functional correctness—the thing those extensions are supposed to ground—that's a problem. Not "fails on security," which is a dimension the extensions don't cover. Fails on the dimension they do.
Second: a domain with genuinely absent extensions showing smooth, reliable AI performance. If AI-generated strategic advice (no feedback loop, no verification mechanism, no ground truth) turned out to be consistently reliable across contexts—that would suggest extensions aren't the mechanism. Something else would be doing the work.
Third, and more subtle: if extension density didn't predict performance differences within a domain. If security-focused tasks with strong static analysis feedback loops performed no better than security tasks without them—same domain, different extension structure, same performance—that would undermine the mechanism.
Now, there's an honest complication. Aviation has rich extensions—checklists, cross-checks, crew resource management, redundant systems—and pilots still committed errors 55% of the time even when correct cross-check information was available. Does that falsify the framework?
I don't think so, but I want to be careful about why not. The aviation finding shows that extensions are necessary but not sufficient. Having the cross-check information available doesn't help if workload, time pressure, or interface design prevents the pilot from actually using it. Extensions must be engaged, not just available. That's a refinement of the claim, not an escape hatch. It means the prediction has to be more specific: performance tracks whether extensions are engaged under conditions that allow their use. If those conditions are absent—if the pilot is overloaded, if the developer is shipping at 2am without running tests—the extensions exist but aren't doing their work.
That refinement makes the prediction harder to test cleanly but also more honest. "Rich extensions predict smooth frontiers" is too simple. "Rich extensions, engaged under conditions that allow verification, predict smooth frontiers on the dimensions those extensions cover" is clunkier but closer to right. And it's still falsifiable: you'd just need to show that engaged extensions, under conditions that allow their use, still don't predict performance on their covered dimensions.
The prediction as a tool
If the prediction holds, it's not just a scientific claim. It's operationally useful. Before deploying AI in a practice, you can map the extensions: What feedback loops exist? Are they engaged by the proposed workflow? What dimensions do they cover, and what dimensions are left exposed? Where the extensions are rich and engaged, expect the frontier to be smooth. Where they're absent or bypassed, expect jaggedness—and decide whether that jaggedness is benign (visible, cheap to fix), productive (failures teach something), or dangerous (invisible, expensive).
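The extension map described above can be made concrete as a minimal data structure. This is a sketch under assumptions: the dimension names, the example extensions, and the engaged-vs-available flag are hypothetical illustrations, not a fixed taxonomy from the framework.

```python
# Illustrative pre-deployment extension map. Everything named here
# (dimensions, extensions, the example deployment) is hypothetical.
from dataclasses import dataclass

@dataclass
class Extension:
    name: str
    dimensions_covered: tuple  # what this feedback loop actually verifies
    engaged: bool              # wired into the proposed workflow, not merely available

def exposed_dimensions(required, extensions):
    """Dimensions of the work that no engaged extension covers --
    where the framework predicts a jagged frontier."""
    covered = set()
    for ext in extensions:
        if ext.engaged:
            covered.update(ext.dimensions_covered)
    return [d for d in required if d not in covered]

# Hypothetical mapping for an AI coding deployment:
extensions = [
    Extension("test suite", ("functional correctness",), engaged=True),
    Extension("type checker", ("syntax", "type safety"), engaged=True),
    Extension("security static analysis", ("security",), engaged=False),  # available but bypassed
]
required = ["syntax", "functional correctness", "security", "scalability"]

print(exposed_dimensions(required, extensions))
```

With the mapping above, security and scalability come back exposed: security because the extension exists but isn't engaged (the aviation refinement), scalability because no extension covers it at all. The remaining judgment call, whether each exposed dimension's jaggedness is benign, productive, or dangerous, stays with the human.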
That's a different kind of pre-deployment assessment than most AI evaluations offer. Most evaluations ask "how well does the model perform on this benchmark?" The extensions framework asks "does the deployment preserve, strengthen, or strip the practice infrastructure that makes the work reliable?" The first question tells you about the model. The second tells you about the situated assembly—which is what actually gets deployed.
Mulligan's question was the right one to ask. I'm not sure this version is tight enough yet. But it's the direction: within a domain, the same model, different extension structures, predictably different performance on the dimensions those extensions cover. That's the claim. If it's wrong, it should be possible to show it.
This is part of The Extended Frontier series.