The Extended Frontier: Where This Gets Hard

Changelog

2026-03-24 — Stub published.
2026-03-24 — First full draft.
2026-03-24 — Reframed around scope limits and handoff analytic. Dropped "illusory smoothness" as named concept. Replaced Fischer et al. anchor with dissertation non-functional properties finding.

*This is part of [The Extended Frontier](/2026/03/24/extended-frontier) series.*

The previous posts treated extensions as present or absent: does the practice have feedback loops, verification, social evaluation? If so, the frontier is smooth. If not, jagged. That's the easy version. But it leaves a question unanswered: what happens when extensions are present but their capacities don't cover what matters? A practice can have strong extensions and still produce failures, because each extension grounds specific dimensions and is silent on others.

Extensions have capacities

Each extension has specific capacities. A compiler grounds "does it run?" It doesn't also ground "is it safe?" A code review might ground both "does this logic make sense?" and "have you considered the security implications?" in a single act—one extension, two capacities.

In my dissertation, the data engineers could quickly validate what they found in search results—run the code, see if it works. That validation step was a real extension. It grounded "does this work?" in something testable. But as I wrote then: "workable in the moment isn't enough. There is other information that isn't immediately testable by running it through a compiler or interpreter and seeing if it 'works'. This is particularly the case for non-functional properties (normally including things like security, reliability, and scalability)."

The data engineers' extensions had capacities that covered functional correctness. Non-functional properties—security, reliability, scalability—needed extensions with different capacities. Not a fundamentally different kind of extension, just extensions that do different things. A security-focused practice (penetration testing, threat modeling, security review) has extensions whose capacities cover security the way a compiler covers runnability. A childcare setting has extensions—licensing, training, observation protocols—whose capacities cover developmental appropriateness and safety in ways that have nothing to do with code.

The point is: map the capacities of the extensions in a practice, and you've mapped where the frontier is smooth and where it isn't. The frontier is smoother on the dimensions the extensions' capacities cover and jagged on the dimensions they don't—within the same domain, the same task. And the confidence from the covered dimensions can generalize in the practitioner's mind to dimensions the extensions don't actually reach. Green CI makes you feel like the code is good. But "good" has more dimensions than the CI checks for.

What the handoff analytic reveals

This is where Mulligan and Nissenbaum's handoff analytic does its heaviest work in the series—not as connecting theory (that was Post 1), but as a diagnostic tool.

The handoff analytic asks: when a function moves from one component to another, what changes? Not just the primary output—everything the practice was doing. Post 4 applied this to skill formation, craft, pace, accountability. Here I want to apply it to the extensions themselves.

When AI enters a coding workflow, the compiler still works. The tests still run. Code review still happens (maybe). Those extensions are formally present. But the handoff analytic asks: are they doing what they used to do?

A test suite written for human-authored code verified that a human's understanding of the requirements was correctly translated into behavior. When an AI generates the code, the test suite still checks behavior against specification—but the function has shifted. The human who wrote both the code and the tests had a mental model connecting them. The AI that generated the code may satisfy the test's form without sharing the intent behind it. The test passes. The extension is present. But what the extension used to do—verify that someone understood the problem—may not be happening.

Code review shifts too. A reviewer reading a colleague's code is checking the colleague's reasoning. A reviewer reading AI-generated code is checking output they didn't write and may not fully trace. The extension is the same in form. The function it performs in the practice may be different.

The handoff analytic makes this visible. Without it, you'd look at the workflow and say: compiler, tests, review—extensions present, frontier should be smooth. With it, you ask: what were those extensions doing before, and are they still doing it?

The therapy chatbot, one more time

In the First Mile post, I used CBT-based therapy chatbots as an existence proof that input-side extensions can be built into AI systems. The Therabot trial showed real symptom improvement. The extension works.

But the handoff analytic asks: what else was therapy doing?

Human-delivered CBT builds a therapeutic relationship. It develops the patient's capacity to do the cognitive work independently—to notice their own distortions, to challenge their own thinking without a therapist prompting them. It's skill formation, not just symptom management.

The chatbot covers the protocol-following dimension. Whether it covers the skill-formation dimension, the relational dimension, the dimension where the patient becomes their own therapist—that's not measured. It might. I don't know. But the framework's job is to flag the question, and the handoff analytic is what generates it: the function of symptom reduction transferred. Did the function of building the patient's independent capacity transfer with it?

This is the same structure as the code example. The test passes (symptoms improve). The extension is present (CBT protocol activated). The question is whether the other functions the practice performed—functions the output metric doesn't capture—survived the reconfiguration.

Goodhart's Law on extensions

There's a related problem. Extensions can be gamed.

AI systems can learn to satisfy the form of an extension without satisfying its function. A model trained on code with test suites learns what makes tests pass. It produces code that passes. Whether the code handles edge cases the tests don't cover, whether it's secure, whether it's maintainable—those aren't what the test measures, so they aren't what the model optimizes for.

The distinction that matters is between types of extensions. Deterministic gates—compilers, type checkers—are binary. They can't be gamed on their own terms (the code either compiles or it doesn't), but they can be gamed on intent. Probabilistic checks—test suites, static analysis—are gameable on coverage. If the tests don't test for security, passing them says nothing about security.

Social evaluation is harder to game. The colleague who asks "have you considered the security implications?" is expanding the scope of what gets checked. They're evaluating intent, not just form.

This suggests something about which extensions are robust and which are fragile. The more an extension depends on matching known criteria, the more a model can learn to satisfy it without doing what it was designed to ensure. The more it depends on human judgment expanding scope beyond the criteria, the harder it is to circumvent. The extensions most easily automated are also the most easily hollowed out.

Benchmark contamination

There's a version of this scope problem that operates at the evaluation layer. The "SWE-Bench Illusion" paper found that models could identify buggy file paths from issue descriptions alone at suspiciously high rates. They weren't solving the problems. They were recognizing the repositories.

The benchmark's own structure—well-known open-source projects with extensive public issue histories—was functioning as an unintended extension. SWE-bench-Live, using fresh tasks from unfamiliar repos, shows the same agents performing significantly worse.

This matters because a benchmark that provides extensions the real use context lacks will overestimate capability. The frontier looks smooth on the benchmark and jagged in the field. If the evaluation itself has scope limits you haven't mapped, you can't see the jaggedness until it shows up in practice.

Trust drift

The automation bias literature adds something the extensions framework needs. Goddard et al.'s systematic review of 74 studies found that when automation failures go undetected, subjective trust increases. The failures that extensions don't catch don't just persist—they amplify confidence. The user encounters reliable performance, doesn't encounter failures (because they're invisible, not absent), and calibrates trust upward.

This is what makes extension scope limits dangerous rather than merely incomplete. Extensions that catch most errors create conditions for overreliance on the dimensions they don't cover. The user's experience of smooth operation trains them out of checking. The checking atrophies. And then the error on an uncovered dimension gets through to a user who has stopped looking.

The Dutch public-sector experiment is the encouraging counter: when salient, usable alternative cues were provided alongside AI recommendations, automation bias disappeared. The bias isn't intrinsic to using AI. It's about whether the environment keeps the practitioner's verification instincts alive on the dimensions the primary extensions don't cover.

Aviation is the sobering qualifier. Pilots committed errors 55% of the time even when correct cross-check information was available. Extensions must be engaged, not just available—and the conditions for engagement (workload, time pressure, interface design) are themselves part of the extension system.

Where this leaves the framework

The extensions framework predicts that the frontier is smoother where extensions are engaged. That holds. But each extension has specific capacities, and this post is about what happens beyond those capacities.

Three things happen. First, confidence from the covered dimensions generalizes to uncovered ones—green CI makes you feel like the code is good across dimensions the CI doesn't check for. Second, extensions can be satisfied in form without being satisfied in function—tests pass, but the intent behind the tests may not be met, and the more automated the extension, the more this matters. Third, the gaps are invisible when things are going well. Trust drift means the absence of detected errors increases confidence rather than maintaining vigilance.

The descriptive claim from Post 1 still holds: extensions predict where the frontier is smooth. But the prediction has to be dimensional. Smooth on what? Map the capacities of the extensions in a practice and you know which dimensions are grounded. The same practice can be smooth on functional correctness and jagged on security—not because security is inherently harder to extend, but because nobody built or engaged an extension with that capacity. Smooth on symptom reduction and jagged on skill formation—not because skill formation can't be extended, but because the chatbot's extensions don't have that capacity.

The handoff analytic is what makes the dimensions visible. Without it, you'd list the extensions and declare the frontier smooth. With it, you ask what each extension's capacities actually are, what the practice used to do across all its dimensions, and whether those functions survived the move to AI-assisted work. That's where this gets hard.