What it takes for AI coding agents to be truly autonomous

The promise of autonomous AI coding agents is simple. You give them a task. They plan, write code, run tests, fix errors, and ship the result. No babysitting. No reviewing every line. A working system at the end.

That promise is what every serious tool is now selling. Devin. Codex Workspace. Claude Code in headless mode. Cursor's background agents. GitHub Copilot's cloud agent. The marketing says "autonomous." The category implies "no human in the loop."

The reality is different, and a Princeton research group put numbers on the gap earlier this year.

This piece is about what's structurally missing. It draws on recent research on agent reliability, Anthropic's Building Effective AI Agents, and the failure patterns appearing across every team shipping autonomous coding products. It assumes the reader knows what a coding agent is and is trying to make one work in production.

Contents

What "autonomous" actually means
The reliability gap, in numbers
The substrate problem
The autonomy gap is an identity gap
Why agent identity isn't already solved
Where this leaves you
FAQ

What "autonomous" actually means

Princeton's paper Towards a Science of AI Agent Reliability makes a distinction that everyone shipping coding agents should internalize. The authors split agent deployments into two categories.

In augmentation settings, a human reviews, edits, and approves the agent's output before it takes effect. The human serves as a reliability backstop. Coding assistants and copilots fall here. The agent is wrong constantly. The developer catches it. The system gets to call itself trustworthy because the human is quietly doing the work.

In automation settings, the agent's output is the final action with no human buffer. There is no reviewer. The agent's mistakes go straight to production.

The reliability bar between these two modes is fundamentally different. The authors put it bluntly: "An agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system."

That sentence is the headline finding. Most products labeled "autonomous coding agents" today are highly capable augmentation tools running in environments where the developer has stopped looking. Same model. Same reliability profile. The reviewer just left the room.
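A rough way to see why the same model clears one bar and not the other is to put numbers on the reviewer. The sketch below is illustrative arithmetic, not a result from the paper: the 90% success rate echoes the quote above, while the reviewer's 95% catch rate and the function name are assumptions made up for the example.

```python
def shipped_failure_rate(task_success: float, reviewer_catch_rate: float) -> float:
    """Fraction of tasks whose failure reaches production.

    Assumes a failure ships only when the agent fails AND the reviewer misses it.
    """
    return (1.0 - task_success) * (1.0 - reviewer_catch_rate)


# Assumed numbers: the agent succeeds on 90% of tasks in both modes.
augmentation = shipped_failure_rate(0.90, reviewer_catch_rate=0.95)  # human in the loop
automation = shipped_failure_rate(0.90, reviewer_catch_rate=0.0)     # reviewer left the room

print(f"augmentation: {augmentation:.3f}, automation: {automation:.3f}")
```

Under these assumed numbers, removing the reviewer multiplies the failures that reach production by twenty without changing the model at all.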

The reliability gap, in numbers

Across 14 models and 18 months of capability releases, the research found that reliability has barely improved while capability has climbed steadily.

The specific findings worth knowing:

Outcome consistency is low across frontier models. The same agent, given the same task, produces different outcomes on repeated runs. You cannot predict whether re-running a task will succeed.

The "what but not when" pattern is endemic. Agents reliably select similar action types across runs but vary the execution order, producing different failure modes depending on where they are when something goes wrong.

Resource consistency is poor. Identical tasks fluctuate by an order of magnitude across runs in token usage, latency, and cost. Budgeting and SLAs are effectively impossible.

Discrimination, the ability of a model to tell when its own answers are wrong, has actually worsened on the harder benchmark across the most recent frontier models. Calibration improved but discrimination did not. Newer models are more confidently incorrect than the ones they replaced.
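You can check the first and third findings against your own agent by re-running one task several times and comparing the runs. The sketch below is a minimal version of that. The two metrics are simplified stand-ins, not the paper's definitions, and the run data is invented for illustration.

```python
from collections import Counter


def outcome_consistency(outcomes: list[bool]) -> float:
    """Share of runs that agree with the majority outcome (1.0 means fully consistent)."""
    counts = Counter(outcomes)
    return max(counts.values()) / len(outcomes)


def resource_spread(costs: list[float]) -> float:
    """Ratio of the most expensive run to the cheapest run."""
    return max(costs) / min(costs)


# Hypothetical results from running the same task five times.
runs = [
    {"passed": True, "tokens": 12_000},
    {"passed": False, "tokens": 95_000},
    {"passed": True, "tokens": 15_000},
    {"passed": True, "tokens": 20_000},
    {"passed": False, "tokens": 120_000},
]

consistency = outcome_consistency([r["passed"] for r in runs])
spread = resource_spread([r["tokens"] for r in runs])
print(f"outcome consistency: {consistency:.2f}, token spread: {spread:.1f}x")
```

A consistency near 0.5 means a re-run is close to a coin flip. A spread near 10x is the order-of-magnitude fluctuation described above.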

The reflex from the field has been to push models harder. Bigger context windows. Better planning. More reasoning. None of those meaningfully move the dimensions Princeton measures. The reason is that the problem isn't a model problem. It's a substrate problem.

The substrate problem

In augmentation mode, the substrate is the developer. The developer holds the credentials. The developer's session has the permissions. The developer is the entity being audited. The agent runs as the developer because the developer is right there, reviewing every output before it ships.

In automation mode, the reviewer is gone but the substrate didn't change. The agent still runs as the developer. Same session token. Same IAM role. Same blast radius. The model's unreliability has nowhere to land except on the developer's own identity.

This is what makes the current generation of autonomous coding agents structurally fragile.

You cannot sandbox what runs as you. You cannot audit actions that are indistinguishable from your own. You cannot revoke an autonomous process that holds your session token. You cannot apply the principle of least privilege when the agent inherits all of your privileges.

When an autonomous agent deletes a production database despite explicit instructions not to, the post-mortems usually blame the model or the prompt. The structural failure is upstream. The agent had the credentials to delete production at all. It had them because the team built a copilot's permissions model and deployed it under automation conditions.

The fix isn't a smarter model. It's a substrate underneath the model.
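As a minimal sketch of what "scoped rather than inherited" can look like, the example below gives the agent its own credential with an explicit allow-list and denies everything else. The class, the function, and the action names are hypothetical and not tied to any real IAM system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCredential:
    """A credential owned by the agent, not borrowed from the developer."""
    agent_id: str
    allowed_actions: frozenset[str]


def authorize(cred: AgentCredential, action: str) -> bool:
    """Deny by default: only explicitly granted actions pass."""
    return action in cred.allowed_actions


# Hypothetical scope for a coding agent: it can read, test, and open PRs.
agent = AgentCredential(
    agent_id="coding-agent-01",
    allowed_actions=frozenset({"repo:read", "repo:open_pr", "ci:run_tests"}),
)

can_open_pr = authorize(agent, "repo:open_pr")
can_drop_db = authorize(agent, "db:drop_production")
print(can_open_pr, can_drop_db)
```

With this shape, a bad decision about the production database fails at the credential check. It does not depend on the model following instructions.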

The autonomy gap is an identity gap

An autonomous agent needs four things an augmentation tool doesn't. Its own credentials, scoped deliberately rather than inherited. Its own audit trail, separable from its operator's. Its own addressable presence, reachable by other agents, services, and humans. Its own reputation, one that can be revoked, throttled, or trusted incrementally as it demonstrates competence at specific tasks.

These four things together are what we call AI agent identity. It's the substrate that lets autonomy actually work. Without it, even a perfect model is still running as a session token on someone's laptop, and every mistake still attaches to a human who never made the decision.
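One way to picture the four parts together is as a single record. The sketch below is an illustration under assumptions, not a reference design: the field names, the integer trust levels, and the example address are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentIdentity:
    address: str                 # addressable presence
    scopes: dict[str, int]       # own credentials: action -> minimum trust level required
    trust_level: int = 0         # reputation, raised incrementally
    revoked: bool = False        # reputation can also be withdrawn entirely
    audit_trail: list[tuple[str, bool]] = field(default_factory=list)

    def attempt(self, action: str) -> bool:
        """Check scope, trust, and revocation, and record the attempt either way."""
        allowed = (
            not self.revoked
            and action in self.scopes
            and self.trust_level >= self.scopes[action]
        )
        self.audit_trail.append((action, allowed))
        return allowed


agent = AgentIdentity(
    address="build-agent@example.com",
    scopes={"repo:open_pr": 0, "repo:merge": 2},
)

before_trust = agent.attempt("repo:merge")    # denied: trust level too low
agent.trust_level = 2
after_trust = agent.attempt("repo:merge")     # allowed once trust is earned
agent.revoked = True
after_revoke = agent.attempt("repo:open_pr")  # denied: identity revoked
print(before_trust, after_trust, after_revoke)
```

Every attempt lands in the agent's own trail, including the denied ones, so the record of what the agent tried is separable from anything its operator did.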

Two clarifications are worth making up front, because both tend to draw pushback.

This is a technical attribution claim, not a legal accountability claim. The human or organization deploying the agent is still on the hook. Identity is what makes that accountability actionable: it lets the accountable party know what the agent did, prove it, distinguish it from their own actions, and improve the system afterward. Without per-agent identity, the human is on the hook for autonomous actions they did not personally take and cannot fully reconstruct.

Identity is not the policy engine. Policy still lives in IAM, in approval gates, in deploy systems, in audit pipelines. Identity is the addressable handle those policies bind to. The deeper case for which identity primitive is appropriate for agents is in Email as Identity for AI Agents. For the purposes of this piece, the point is that the identity layer has to exist underneath whatever policy engine you choose to put on top of it.

Why agent identity isn't already solved

A thoughtful reader will assume the problem is already addressed. It isn't, and the gap is specific.

GitHub Apps give an agent a bot identity inside GitHub. AWS service principals give it a scoped role inside AWS. OAuth client credentials cover specific APIs. SOC2-compliant audit logs capture certain action types in certain systems. Each of these is real, useful, and necessary. None of them compose across systems.

There is no portable, programmable, addressable handle that wraps these per-system identities under a single thing you can revoke, throttle, page, or audit across all of them. An agent with a GitHub App identity cannot be reached natively by a Linear bot or paged by a human or audited across services. The pieces do not add up to a coherent identity for the agent itself.

Agent-protocol identity covers a different layer. MCP standardizes how agents talk to tools. A2A standardizes how agents talk to other agents. A2UI standardizes how agents render to users. These are real and converging on production-grade maturity. They are also tool, peer, and presentation layers. They are not a cross-system identity for the agent itself.

The missing piece is a name. An addressable handle the agent owns, that lives across every system it touches, that policy and credentials and audit logs and external reachability can all bind to. The identity primitive that takes an autonomous coding agent from "a process running on someone's laptop" to "an entity that exists on the internet."
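To make "a handle that everything binds to" concrete, here is a small sketch of one name wrapping several per-system identities. It only models the bookkeeping. The class, the principal strings, and the systems listed are hypothetical, and a real revocation would call each system's own API.

```python
from dataclasses import dataclass, field


@dataclass
class AgentHandle:
    handle: str
    bindings: dict[str, str] = field(default_factory=dict)  # system -> per-system identity
    revoked: bool = False

    def bind(self, system: str, principal: str) -> None:
        """Attach a per-system identity to the single cross-system handle."""
        self.bindings[system] = principal

    def revoke(self) -> list[str]:
        """Mark the handle revoked and list every per-system identity to disable."""
        self.revoked = True
        return [f"{system}:{principal}" for system, principal in sorted(self.bindings.items())]


agent = AgentHandle(handle="release-agent@example.com")
agent.bind("github", "app/example-release-agent")   # placeholder principal
agent.bind("aws", "role/example-release-agent")     # placeholder principal

to_disable = agent.revoke()
print(to_disable)
```

The point of the sketch is the single entry point. One revoke reaches every system the agent touches, instead of twelve separate consoles.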

Where this leaves you

The data is one signal. The pattern is being repeated across every team shipping autonomous coding products. The next leap in agent reliability is not going to come from waiting for the next model release. It's going to come from the substrate underneath the model.

If you are building an autonomous coding agent, here are four things to do in the next sprint:

Audit what your agent actually runs as. If the answer is "the developer who deployed it" or "a shared service account with broad permissions," you have a copilot's permissions model under automation conditions. Fix that first.

Separate the agent's audit trail from the operator's. Until you can point at logs and say "this is what the agent did," you cannot run a meaningful post-mortem when something breaks.

Give the agent a distinct, addressable identity it owns. Not just a service account inside one system. A handle that other agents, humans, and external systems can reach without integration work. Permissions and Agent Onboarding cover the practical surfaces for this.

Plan for the agent to ask for help. Even a perfect substrate doesn't make a model less wrong. Build in the channels for the agent to escalate, defer, or request approval when it hits something it cannot handle alone. Building real-time AI agents with webhooks covers the asynchronous side of this.
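The second and fourth items can be sketched together: tag every log entry with the actor, and route risky actions to a human instead of executing them. The example below is a minimal sketch with hypothetical action names and addresses. It gates on the action itself rather than on the model's self-reported confidence, since the discrimination finding suggests that signal is unreliable.

```python
from dataclasses import dataclass

# Hypothetical list of actions that always require human approval.
NEEDS_APPROVAL = {"deploy:production", "db:migrate"}


@dataclass(frozen=True)
class LogEntry:
    actor: str
    action: str
    outcome: str  # "executed" or "escalated"


def handle(actor: str, action: str, log: list[LogEntry]) -> str:
    """Escalate risky actions instead of running them, and log under the actor's name."""
    outcome = "escalated" if action in NEEDS_APPROVAL else "executed"
    log.append(LogEntry(actor, action, outcome))
    return outcome


def actions_by(actor: str, log: list[LogEntry]) -> list[LogEntry]:
    """Reconstruct what one actor did, separately from everyone else."""
    return [entry for entry in log if entry.actor == actor]


log: list[LogEntry] = []
handle("agent@example.com", "ci:run_tests", log)
handle("dev@example.com", "repo:merge", log)
handle("agent@example.com", "deploy:production", log)

agent_log = actions_by("agent@example.com", log)
print([(entry.action, entry.outcome) for entry in agent_log])
```

Because the agent logs under its own name, a post-mortem can filter to exactly what the agent did, escalations included.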

This is the thesis we built AgentMail on, and the broader thesis matters more than any single product. Autonomous AI agents need first-class identity to operate as autonomous actors on the internet, the same way humans do. The 14 models in the study span 18 months. Across that span, raw capability rose steadily. The four reliability dimensions barely moved. The shape of the failure modes did not change. The unglamorous infrastructure work is what decides whether autonomy actually scales or just looks like it does in a demo.

Start with identity. The rest of the stack follows.

FAQ

What is an autonomous coding agent? An autonomous coding agent is an AI system that plans, writes, tests, and ships code with a defined goal but without direct line-by-line human supervision. The defining feature is the autonomous loop: plan, act, observe, adjust, repeat. Devin, GitHub Copilot's cloud coding agent, Cursor's background agents, Codex Workspace, and Claude Code in headless mode are all examples. The distinction that matters is automation vs augmentation: an autonomous coding agent's output is the final action with no human reviewer in the middle.

How is an autonomous coding agent different from an inline copilot? An inline copilot is reactive. It waits for a developer to write code and suggests what comes next. Every output passes through a human reviewer before it ships. An autonomous coding agent runs longer-horizon tasks without a human reviewing each step. Princeton's reliability research calls this the "augmentation vs automation" distinction. The infrastructure requirements between the two modes are fundamentally different: a copilot can safely run as the developer because the developer is the reliability backstop; an autonomous coding agent cannot.

Why aren't most "autonomous" coding agents truly autonomous yet? Because most of them are augmentation-grade products deployed in automation environments. The model is wrong as often as it always was. The reviewer is gone. The substrate underneath (credentials, audit trail, permissions, identity) was designed for the augmentation case and inherits the developer's session and blast radius. Reliability does not improve by removing the human; it requires building the per-agent substrate that the human used to provide.

What is AI agent identity? AI agent identity is the substrate that lets an autonomous agent operate as a first-class actor on the internet: its own credentials, its own audit trail, its own addressable presence, and its own reputation, all separable from the human or organization that deployed it. Without per-agent identity, agent actions cannot be cleanly sandboxed, audited, scoped, or revoked, because they are indistinguishable from the operator's own actions.

Is AI agent identity the same as accountability? No. Identity is the technical attribution layer. Accountability is the legal and organizational layer that sits on top of it. Humans and organizations remain legally accountable for what their agents do, and they should. Identity is what makes that accountability actionable. It lets the accountable party know what the agent did, prove it, distinguish it from their own actions, and improve the system afterward.

Do GitHub Apps, AWS service accounts, or OAuth tokens already solve agent identity? They solve part of it. GitHub Apps give an agent identity inside GitHub. AWS service principals give it identity inside AWS. OAuth client credentials cover specific APIs. Each per-system identity primitive is real and useful within its system. The gap is composition. There is no cross-system addressable handle that ties them together, that external agents and humans can reach, and that policy, audit, and escalation can all bind to. An agent with twelve separate per-system identities still does not exist as a single addressable entity on the internet.

Why is email a useful identity primitive for AI agents? Email is the most universal addressable identity on the internet. It is portable across every system, programmable through standard APIs, revocable, auditable by default, and reachable by any human or agent without coordinating integration with anyone else's IT. It is not the policy engine; IAM is. But email is the addressable handle that policy, credentials, audit logs, and external reachability can all bind to. The deeper case is in Email as Identity for AI Agents.