The Extended Frontier: Why AI Capability Is Predictable When You Look at the Work
Working Draft
This post is a working draft, developed collaboratively with Claude Opus 4.6 (1M context) in Claude Code (Anthropic) across multiple sessions, using a variety of extensions: persona-based reviews, citation verification, AI detection analysis (Pangram Labs), deep research reports, and dissertation search (qmd). The argument, evidence curation, and editorial direction are Daniel’s; much of the prose was initially generated by Claude and is being iteratively rewritten. A changelog tracks the development. We will continue to write about how we’re exploring this idea—the process is part of the argument.
AI Detection Analysis: 87% AI-generated
Actively revising.
The jagged frontier of AI capability isn't random—it's predictable from the practices, artifacts, and feedback loops that already constitute the work. Mollick looked inside the model. The extensions framework looks outside it: the frontier is smooth where work is extended, jagged where it's been stripped to an isolated task.
In September 2023, Ethan Mollick and colleagues published a Harvard Business School working paper on a field experiment with 758 BCG consultants, showing that AI capability is uneven in ways that don't map to task difficulty—what they called "a jagged technological frontier." Mollick described the metaphor as an invisible fortress wall: you can't tell which side of it you're on until you try. His advice: use AI a lot, and learn where the walls are.
I think Mollick is right about the phenomenon and incomplete about the mechanism. He looked inside the model. I think we also need to look outside it.
The work is already extended
My dissertation, which I finished on 20 December 2022—weeks after ChatGPT's release—studied how data engineers use web search. The core finding: search works for data engineers not because they understand search mechanisms, but because their work practices and artifacts extend search in ways that compensate for its limitations. The knowledge that makes searching work is embedded in the work—error messages become queries, testing catches bad results, code review surfaces what an individual would miss. "The work practices around acceptance testing, feedback, and code review collectivize or distribute the evaluation of the search results."
This isn't a description of add-on validation steps. The extensions are the work. The meetings, the peer review, the testing, the naming conventions—these aren't bolted onto an otherwise isolated task. They're constitutive of the practice. If you don't test what you build, if you don't submit to code review, you're not doing the work.
The same applies to AI. The frontier is smooth where work is extended—permeated with feedback, distributed across people and time, embedded in artifacts and practices. It is jagged where work has been stripped of that extension.
Evidence
The SWE-agent ablation studies (Yang et al., NeurIPS 2024) make this concrete: changing what they call the "agent-computer interface"—the commands available, the feedback format, the error handling—"significantly enhances an agent's ability" to solve real software engineering tasks, without changing the model weights. Same model, different extensions, different performance.
The MASAI mammography trial (Lång et al., 2023) provides the clinical equivalent: 105,934 women randomized, population-based screening. Transpara, a risk-scoring AI by ScreenPoint Medical, triaged examinations so low-risk scans got single reading and high-risk scans got double reading. The result: higher sensitivity, same specificity, 44.3% reduction in reading workload. Transpara didn't do that alone. The screening practice already had triage, double reading, arbitration, follow-up—the frontier was smooth because the practice was extended before AI showed up.
In March 2026, Simon Willison responded to Clive Thompson's New York Times Magazine piece "Coding After Coders" by noting that programmers can "tether their AIs to reality, because they can demand the agents test the code to see if it runs correctly." His contrast: "programmers have it easy... If you're a lawyer, you're screwed." But that's the isolated-task framing. A lawyer isn't screwed. A lawyer using AI without their normal extensions is screwed—same as a developer shipping AI code without running the compiler. A lawyer can Shepardize an AI-generated citation in seconds. The extension already existed. The 700+ documented cases of AI hallucinations in legal filings (tracked by Damien Charlotin) happened not because law lacks extensions but because people bypassed the ones it has.
The variable isn't "does this domain have extensions?" It's "does this particular use of AI engage the extensions that already exist?"
Not just verification—the ground
The extensions aren't just validation checkpoints. They're the ground that makes the activity coherent. Think about what a compiler actually does. It's not checking your work—it's defining what "working code" even means. Without it, "runnable" isn't a category you can point to. Code review works the same way: the team's sign-off is what makes something "accepted code," not an abstract quality standard. Or look at law. Shepardizing makes "good law" into something verifiable rather than assumed. Pull out the extensions and the work isn't merely unchecked or ungrounded. It is undone, in both senses: taken apart, and never actually performed.
This is why "bolt-on validation"—guardrails, RLHF, human-in-the-loop review—misses at least two things.
It misses the input side. In extended practice, the work helps you formulate the input. Error messages become queries—but only because someone designed error messages to be informative, because the language has a stack trace convention, because the project has logging configured. The profession produces the seeds. More than that: the entire ecology of knowledge production upstream of the query is an extension. Stack Overflow answers exist because a community of practice writes, votes on, corrects, and maintains them. When we say "the model is good at code," we're partly saying the larger system connects the model to that community's knowledge-sharing practices—practices that existed and worked well before 2022. Bolt-on validation only checks the output. The person who doesn't know what to ask gets a validated answer to the wrong question.
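The "error messages become queries" move can be sketched in a few lines. This is an illustrative toy, not anyone's actual tooling: the function name is invented, and real practice also strips local paths, line numbers, and version-specific noise before searching.

```python
import traceback

def error_to_query(exc: BaseException) -> str:
    """Turn an exception into a search query, the way a developer
    pastes the last line of a stack trace into a search box.

    Hypothetical helper for illustration only. The "extension" here
    is upstream of the code: Python's convention of ending a trace
    with "ExceptionType: message" is what makes the query well-formed.
    """
    # format_exception_only yields the final formatted line(s),
    # e.g. 'TypeError: unsupported operand type(s) ...'
    last_line = traceback.format_exception_only(type(exc), exc)[-1]
    return last_line.strip()

try:
    "a" + 1  # deliberately raise a TypeError
except TypeError as e:
    query = error_to_query(e)

print(query)  # the "TypeError: ..." line a developer would search for
```

The point of the sketch is that the query is well-formed only because conventions outside the model—informative exception messages, a stack-trace format, a community that writes answers keyed to those messages—did the work first.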
It misses scope. The colleague in code review says "have you considered the security implications?" The stakeholder demo reveals requirements nobody wrote down. Extended practice surfaces the unknown-unknowns. Bolt-on validation handles the known-knowns.
The connecting theory
Mulligan and Nissenbaum (2020) argue that when a function transfers from one component to another, functional equivalence does not guarantee ethical equivalence. Things change in the reconfiguration. Their handoff analytic asks: what changes when the function moves?
The extensions framework answers: the extensions change. Or they don't transfer. Or they get stripped in the process. The assumption of smooth functional equivalence—that because a new component can perform a function, the function will perform equivalently in context—is the mistake at the heart of the jagged frontier.
The extensions are what make it work in context. Without them, the frontier is jagged. With them, it's smooth. The jaggedness is not a mystery. It's a design variable.
What's next
If this theory is right, it should be useful—not just as explanation but as a tool for intervention. Where extensions are missing, the theory should tell you specifically what to build or reconnect. Where extensions exist but are being bypassed, it should tell you what's going wrong.
The rest of this series explores where the argument leads: the input problem (verification is the last mile—most people are stuck on the first), repairability (whether mistakes can be recovered once caught), and what else changes beyond the output metrics that most AI evaluations measure.
This post was developed through extended practice—corrections, pushbacks, deep research feeding back, iterative revision across multiple sessions. A single prompt would have produced something plausible and thin. If the argument has substance, it's because the conversation was extended.