The Extended Frontier: The First Mile
Working Draft
This post is a working draft, developed collaboratively with Claude Opus 4.6 (1M context) in Claude Code (Anthropic) across multiple sessions, using a variety of extensions: persona-based reviews, citation verification, AI detection analysis (Pangram Labs), deep research reports, and dissertation search (qmd). The argument, evidence curation, and editorial direction are Daniel’s; much of the prose was initially generated by Claude and is being iteratively rewritten. A changelog tracks the development. We will continue to write about how we’re exploring this idea—the process is part of the argument.
Verification is the last mile. Most people are stuck on the first. The AI discourse focuses on output quality—validation, guardrails, verification—but that only works in domains that already have the other extensions. For most of life, the harder problem is upstream: how do you turn "something feels wrong" into a question an LLM can usefully engage with?
Changelog
- 2026-03-24 — Initial draft published.
The AI conversation right now is almost entirely about outputs. Can you trust what the model produces? How do you verify it? What guardrails keep bad results from reaching the user? The whole discourse is oriented downstream.
This makes sense if you're building software. Software can afford to focus on output quality because the rest of the practice is already there—compilers, test suites, code review, the whole infrastructure. And the input side is rich: error messages give you search terms, stack traces tell you what went wrong, the codebase itself generates the artifacts that become your queries. When people say "AI is great for coding," they're partly describing a domain where the input problem was solved decades ago.
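To make the point concrete, here is a minimal sketch (the function name and formatting are illustrative, not from any real tool) of how an error artifact already does the input-side work: the exception type, message, and location become the query almost verbatim.

```python
# A hypothetical sketch: the stack trace hands you the query.
# The artifact generated by the failure carries the exception type,
# the message, and where it happened -- the raw material of a good question.
import traceback

def query_from_error(exc: BaseException) -> str:
    """Turn a caught exception into a search/LLM query string."""
    frames = traceback.extract_tb(exc.__traceback__)
    location = frames[-1].name if frames else "unknown"
    return f"{type(exc).__name__}: {exc} (raised in {location})"

try:
    {}["missing"]  # simulate a bug
except KeyError as e:
    print(query_from_error(e))
```

There is no equivalent for the lawn: nothing in the situation emits an artifact that names the failure mode for you.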
But for most of life, the harder problem is upstream. Not "is this answer right?" but "what is the question?"
The input problem
Think about what it takes to ask a good question about your lawn. "How do I improve my lawn?"—that's what you type into the chat box. But the actual question involves soil composition, climate zone, sun exposure patterns across seasons, drainage, what was planted before, your water situation, whether your neighbors use pesticides that drift, what "improve" even means to you. There's no stack trace. No error message. No compiler telling you what went wrong. The lawn is just there, being a lawn, and you know it could be better but you don't know along which axis.
Companies exist to provide these extensions. Soil testing kits. Satellite imagery services. Diagnostic apps that analyze photos. But they're expensive, slow, and fragmented. Nothing like running code. The gap between "something's wrong with my lawn" and a well-formed question that an LLM could usefully engage with—that gap is where most people live, for most of their problems, most of the time.
Or try temperament. "How do I get less reactive?" You know something's off. You snap at your kids, you stew over small things at work, you recognize the pattern afterward but can't interrupt it in the moment. What do you type into the chat box? The question assumes a level of self-knowledge that is itself the product of practices most people don't have access to. You'd need to know something about emotional regulation frameworks, or attachment theory, or stress physiology, or even just the vocabulary for distinguishing between anger and frustration and resentment. You know something's wrong but you don't have the tools to say what.
Sleep is another one. "How do I get better sleep?" is so massively underconstrained that the LLM will produce a generic list—consistent bedtime, no screens, cool room, limit caffeine. Probably correct advice, and completely useless for the person whose actual problem is an undiagnosed thyroid condition, or a partner who snores, or anxiety that only manifests at 3am, or a medication side effect their doctor didn't mention. The space of constraints the input would need to capture is orders of magnitude larger than for a factual or code-based question.
Why verification is a luxury
Here's what I think the output-focused discourse misses. Verification is a luxury of a richly extended practice. You can afford to focus on "does it run?" because the practice has already solved "what should I build?" and "how do I express what went wrong?" and "what does success look like?" Software has been accumulating those upstream extensions for fifty years. The AI-for-coding discourse lives on the output side because the other sides are already handled.
This creates a distortion. We project the software experience onto domains that don't share its extension structure, and then we're puzzled when results are poor. But the model isn't worse at lawn care or emotional regulation because of some architectural limitation. It's worse because the person can't produce the input the model needs, and the practice hasn't built the infrastructure to help them do it.
CBT as an existence proof
The therapy chatbot case is revealing. The Dartmouth/NEJM Therabot trial—210 adults, standalone chatbot, no therapist in the loop—produced a 51% reduction in depression symptoms. On paper, this should be deeply jagged territory—no compiler, no test suite, no peer review, and the person doesn't even know what's wrong with them. How does the input problem get solved?
It gets solved because CBT is a problematizing framework. That's what it does. It helps you figure out what the question is. The chatbot doesn't wait for you to arrive with a well-formed query about your cognitive distortions. It walks you through identifying thought patterns, labeling them, examining the evidence for and against them. The protocol itself is an input-side extension—it transforms "I feel terrible" into "I'm catastrophizing about a work email because I'm interpreting ambiguity as hostility."
But here's the gap the extensions framework reveals: you have to know to ask for CBT in the first place. If you type "I feel terrible" into a general-purpose chat, you might get CBT-adjacent advice, or you might get a list of coping strategies, or you might get referred to a crisis hotline. The protocol is in the training data. Whether it gets activated depends on the interaction pattern—what I called in Post 1 the difference between an extension being contained and being activated. The Therabot trial worked because the chatbot was designed to instantiate the CBT protocol turn by turn. It wasn't a general chatbot that happened to know about CBT. The extension was built into the interaction structure.
This is the input problem in miniature. The person who would benefit most from a CBT-based intervention is often the person least equipped to ask for one—they may not know the framework exists, let alone have the vocabulary to request it. The extensions that would help them formulate the question (therapy itself, psychoeducation, a friend who's been through it) are exactly the things they might not have access to.
The input problem goes deeper than the user
Post 1 described how community knowledge production—Stack Overflow's corrections, voting, canonical answers—constitutes an input-side extension that travels into the training data. But there's a further point worth making here about what this means for the first mile.
The input problem operates at multiple layers simultaneously. At the training-data layer, the model for code inherits not just answers but the correction structure—version-specific caveats, security warnings, deprecation notes, the back-and-forth that sharpened the answer over time. At the user-practice layer, you have a compiler and test suite generating the artifacts that become your queries. At the tool layer, IDE integrations inject context automatically. Each layer is doing input-side work.
For lawn care or temperament or sleep, those layers are thin all the way down. It's not that fewer people care about lawns—it's that the knowledge-production practices that would create dense, corrected, well-structured training data haven't been built. And at the user-practice layer, there's no equivalent of the error message that hands you a search term. And at the tool layer, there's no equivalent of the IDE injecting context into the query. The input problem compounds across layers. That's what makes the first mile so hard in these domains—it's not one missing piece but an absence at every level where input-side work could happen.
What "just ask a better question" misses
The standard advice is prompt engineering. Learn to ask better questions. Be specific. Provide context. Give the model constraints to work with. This is real advice and it works. But it puts the entire burden on the individual user, and it assumes the user has the raw material to be specific with.
The data engineers in my dissertation would say "I googled it and solved the problem" as if search were a solo act. They'd completely elide the extended practice that made it work—the error message that became the query, the Stack Overflow answer that someone else wrote and someone else voted up, the test suite that confirmed the answer worked, the colleague who said "actually, try this instead." Search felt effortless because the extensions were invisible. The work had already been done.
Prompt engineering is the same move. It feels like an individual skill. But the person who writes a good prompt is drawing on invisible infrastructure—domain vocabulary, mental models of what the system needs, prior experience with what works. Those are extensions of practice. Telling someone to "write better prompts" without addressing where that capacity comes from is like telling someone to "search better" without giving them the error messages that make search work. The capacity isn't in the person. It's in the practice the person is situated in.
What would first-mile infrastructure look like?
If the input side is the bottleneck—if AI capability looks jagged largely because people can't formulate the question rather than because the model can't answer it—then the intervention point shifts. Better models and better guardrails are both downstream fixes. What's missing is infrastructure for question formation.
The CBT chatbot works because someone built a conversational structure that transforms vague distress into specific, workable questions. That's not just a therapy design. It's a template for the general problem: how do you help someone who knows something is wrong but can't express the constraints? The whole arc from noticing something is off to arriving at a question worth asking—that's extension work, and it can be built.
This is what I'm trying to work on at Hypandra—tools that support the problematizing step, not just the answering step. I don't think the answer is more features in the chat box. It's more like what CBT does: frameworks that help people figure out what the question is. Not "here's what you should do about your lawn" but "here's how to figure out what's actually going on with your lawn."
The series continues
Post 1 established that the jagged frontier is predictable from the extension structure of the work. This post argues that the hardest extension gap isn't on the output side—it's the first mile, the input side, where the question gets formed.
The next post examines repairability—whether mistakes, once caught, can actually be fixed. That turns out to be a different property of the work than verification, and it's itself an extension that can be designed in or absent.
These are working drafts. The argument is still forming. But the direction seems clear enough to me: the discourse is focused on the last mile when most people haven't gotten through the first one yet.