Context engineering for coding agents is the practice of controlling which evidence, tools, and memories enter the agent's work. It is not about putting more material into the prompt. It is deciding, before the loop begins, which evidence enters, which noise stays out, and which verification ends the task.

In 2025, Google Cloud reported in the DORA report that AI adoption among software professionals reached 90%, with a median of two hours of daily use (Google Cloud, "How are developers using AI? Inside our 2025 DORA report", 2025). Access is no longer the main problem. Useful memory, cost, trust, and review are.

TL;DR: key takeaways

  • DORA measured 90% AI adoption in software, while Stack Overflow measured 46% distrust in accuracy. The practical path is smaller context, stronger verification, and loops with clear stopping rules.
  • Subagents help when they return synthesis, not raw log dumps.
  • MCP should expose scoped tools and resources, not the whole company by default.

Why did context engineering become the bottleneck?

In 2025, Stack Overflow measured that 84% of respondents use or plan to use AI tools in development, but 46% distrust the accuracy of the results (Stack Overflow, "2025 Developer Survey: AI", 2025). Context engineering became the bottleneck because the agent needs enough evidence to act, but not so much noise that it drifts.

The common mistake is treating the context window like storage. The agent receives the README, long logs, old conversations, a full tree, the whole issue, a raw stack trace, and duplicated rules. That feels safe, but it makes review worse. The person reading the diff cannot see which premise mattered or which test closed the task.

A better flow separates context into layers. Stable context explains how the repository works. Retrieved context brings relevant files and decisions. Operational context defines commands, limits, and permissions. Evidence records what was tested. Each layer needs an owner, a validity window, and a way to expire.
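One way to picture "an owner, a validity window, and a way to expire" is a small record per layer. The sketch below is only an illustration: the `ContextLayer` class, its field names, and the day-based window are assumptions of this example, not part of any agent tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ContextLayer:
    """One layer of agent context with an owner and a validity window."""
    name: str            # e.g. "stable", "retrieved", "operational", "evidence"
    owner: str           # hypothetical team or task responsible for the layer
    content: str
    reviewed_on: date    # last time the owner confirmed the content
    valid_for_days: int  # validity window, counted from reviewed_on

    def is_expired(self, today: date) -> bool:
        # The layer expires the day after its validity window ends.
        return today > self.reviewed_on + timedelta(days=self.valid_for_days)


def active_layers(layers: list[ContextLayer], today: date) -> list[ContextLayer]:
    """Keep only the layers that are still inside their validity window."""
    return [layer for layer in layers if not layer.is_expired(today)]
```

With this shape, stable rules might carry a window of months while retrieved evidence expires with the task, so stale material drops out before the prompt is assembled.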

This connects directly to my article on a coding agent harness for reliable pull requests. The harness decides whether the work passes. Context engineering decides whether the agent received material good enough to get there without wasting iterations.

According to Stack Overflow, in 2025 only 3.1% of respondents said they highly trust AI tool accuracy, while 19.6% highly distrust it (Stack Overflow, "2025 Developer Survey: AI", 2025). The point is not to abandon agents. It is to make mistakes visible early, with a trace reviewers can inspect.

What belongs in a context budget?

In 2025, DORA reported that 65% of professionals relied on AI in software development at moderate, high, or very high levels (Google Cloud, "How are developers using AI? Inside our 2025 DORA report", 2025). A context budget turns that reliance into an operational contract: the agent receives objective, constraints, evidence, and stopping criteria.

[Figure: prioritized context layers before they reach a coding agent.]

Start with context that rarely changes. It should fit in a short file such as AGENTS.md, CLAUDE.md, or an equivalent document. Include repository layout, build commands, pull request standards, security rules, and the definition of done. Codex documentation recommends that AGENTS.md cover repo structure, commands, conventions, constraints, and work verification (OpenAI Developers, "Best practices - Codex", 2026).

Next comes retrieved context. This is where lexical search, vector search, dependency graphs, and decision history belong. For large codebases, codebase RAG only works when it returns small excerpts with a reason. A search that returns ten entire files is just another way to pollute the prompt.
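To make "small excerpts with a reason" concrete, here is a minimal sketch of a lexical search that returns a few lines around each match instead of whole files. It assumes the files are already loaded into a dictionary. `Excerpt`, `retrieve_excerpts`, `radius`, and `limit` are hypothetical names for this example, and a real retriever would also rank results and merge overlapping windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Excerpt:
    path: str
    start_line: int  # 1-based line number of the first line in the excerpt
    text: str
    reason: str      # why this excerpt was returned


def retrieve_excerpts(files: dict[str, str], term: str,
                      radius: int = 2, limit: int = 5) -> list[Excerpt]:
    """Naive case-insensitive search returning a small window around each match."""
    results: list[Excerpt] = []
    needle = term.lower()
    for path, source in files.items():
        lines = source.splitlines()
        for index, line in enumerate(lines):
            if needle not in line.lower():
                continue
            start = max(0, index - radius)
            end = min(len(lines), index + radius + 1)
            results.append(Excerpt(
                path=path,
                start_line=start + 1,
                text="\n".join(lines[start:end]),
                reason=f"line {index + 1} mentions '{term}'",
            ))
            if len(results) >= limit:  # hard cap keeps the prompt small
                return results
    return results
```

The `limit` cap is the important design choice: the search stops early rather than handing the agent every match it can find.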

Finally, define operational context. It answers which tools may be used, which commands are cheap, which commands need permission, and what output should return to the main conversation. In long loops with Claude Code or Codex, I use one rule: raw logs go to a file; the main agent receives a synthesis with paths, failures, and the next step.

| Layer | What it contains | When to update |
| --- | --- | --- |
| Stable rules | Conventions, commands, and definition of done. | When the team changes the real workflow. |
| Retrieved evidence | Files, decisions, and excerpts tied to the task. | For every task or subagent. |
| Operation | Permissions, tools, limits, and commands. | When the execution environment changes. |
| Verification | Tests, lint, typecheck, evals, and diff review. | On every loop attempt. |

A context budget for coding agents limits what enters the main conversation and moves noise into auditable artifacts. That discipline reduces rework because the agent stops arguing with old logs and starts operating on objective, proof, and constraint. It is the difference between "read everything" and "use these signals for this decision."
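The rule of sending raw logs to a file and a synthesis to the main agent can be sketched in a few lines. This is an illustration under assumptions: `synthesize_log` is a hypothetical helper, and treating any line containing "FAIL" or "Error" as a failure is a placeholder heuristic that would need adjusting per test runner.

```python
from pathlib import Path


def synthesize_log(raw_log: str, artifact_path: Path, max_failures: int = 5) -> dict:
    """Write the raw log to a file and return a short synthesis for the main agent."""
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(raw_log, encoding="utf-8")

    # Assumption: failing lines contain "FAIL" or "Error".
    failures = [line.strip() for line in raw_log.splitlines()
                if "FAIL" in line or "Error" in line]

    return {
        "artifact": str(artifact_path),  # where a reviewer can find the full log
        "total_lines": len(raw_log.splitlines()),
        "failures": failures[:max_failures],
        "omitted_failures": max(0, len(failures) - max_failures),
        "next_step": "fix the first listed failure" if failures
                     else "no failure markers found",
    }
```

The main conversation receives only the returned dictionary. The full log stays on disk, where it remains auditable without competing for the context budget.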

How do you build a self-correcting agentic loop?

In 2025, OpenAI described Codex as an agent that works on parallel tasks and usually takes 1 to 30 minutes per task (OpenAI, "Introducing Codex", 2025). A self-correcting loop must use that capability as a proof system, not as permission to accept any diff.

[Figure: context selection, execution, verification, and evidence recording in an agentic loop.]

The loop starts with a narrow task. It should state which behavior changes, where the result can be observed, and what must not change. Then the agent creates a short plan, queries the codebase, applies the change, runs verification, and records evidence. If it fails, it should not restart from scratch; it should shrink the hypothesis and try again.
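The apply, verify, record cycle can be sketched as a bounded loop. This is a minimal illustration, not how Claude Code or Codex work internally: `apply_change` is a hypothetical hook standing in for the agent's edit step, and the verification commands are whatever the task contract lists.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Attempt:
    number: int
    passed: bool
    evidence: str  # short record of what was run and how it ended


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run one verification command and return (passed, short evidence)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, f"{' '.join(command)} -> exit {result.returncode}"


def self_correcting_loop(apply_change: Callable[[int, list[Attempt]], None],
                         checks: list[list[str]],
                         max_attempts: int = 3) -> list[Attempt]:
    """Apply, verify, record; stop on the first passing attempt or at the limit."""
    history: list[Attempt] = []
    for number in range(1, max_attempts + 1):
        # The hook sees earlier attempts so it can shrink its hypothesis.
        apply_change(number, history)
        outcomes = [run_check(command) for command in checks]
        passed = all(ok for ok, _ in outcomes)
        evidence = "; ".join(text for _, text in outcomes)
        history.append(Attempt(number, passed, evidence))
        if passed:
            break
    return history
```

The returned history is the trace a reviewer inspects: one entry per attempt, each with the commands run and their exit codes.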

Use evals when the output does not fit a unit test. OpenAI recommends starting difficult problems by defining how success will be measured, combining deterministic checks and rubric-based evaluation when needed (OpenAI Developers, "Iterate on difficult problems", 2026). In software, this becomes a mix of automated tests, static analysis, semantic review, and inspection of the final artifact.

task:
  objective: "fix session expiration failure in the renewal endpoint"
  out_of_scope: "do not change the authentication provider"
context:
  required_files:
    - "src/auth/session.service.ts"
    - "src/auth/session.controller.ts"
  retrieve:
    - "tests mentioning session renewal"
    - "architecture decisions about tokens"
verification:
  commands:
    - "npm run test -- auth"
    - "npm run lint"
  stop: "small diff, passing tests, and residual risk explanation"

In practice, this contract works better when the agent must write its own hypothesis before editing. If the hypothesis does not mention file, behavior, and verification, the task is still vague. That forces a cheap pause before the expensive part: changing code.

In my experience reviewing agent-generated changes, the short hypothesis reduces circular discussion. When the agent writes "I will touch this file, for this behavior, and prove it with this command," the reviewer can separate context failure, implementation failure, and test failure.
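The "file, behavior, verification" test is easy to automate as a gate before any edit. The sketch below assumes the agent's hypothesis has already been parsed into three strings. `Hypothesis` and `missing_fields` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    file: str          # file the agent intends to touch
    behavior: str      # behavior expected to change
    verification: str  # command that should prove the change


def missing_fields(hypothesis: Hypothesis) -> list[str]:
    """Return the names of fields that are empty or whitespace only."""
    return [name for name, value in vars(hypothesis).items() if not value.strip()]
```

A non-empty result means the task is still vague, so the loop can send the agent back to planning before it touches code.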

For loops that span several sessions, RemoteCode, a tool from the author of this blog for stretching Claude Code and Codex in agentic workflows, keeps working context more economical and lets agents go further without loading the whole history into the main prompt.

A self-correcting agentic loop is not "autonomy" in the loose sense. It is a process with limited memory, explicit tools, and verifiable output. The agent may try more than once, but each attempt should produce smaller and better evidence than the previous one.

When should you use subagents instead of a larger chat?

In 2025, Stack Overflow found that 69% of agent users agreed agents increased productivity, but only 17% agreed they improved team collaboration (Stack Overflow, "2025 Developer Survey: AI", 2025). Subagents help when they reduce coordination, not when they create a parallel meeting among models.

Use subagents for reading, investigation, and independent critique. One subagent can look for security risks. Another can map broken tests. Another can summarize changes in a legacy module. The main agent does not need every command they ran. It needs findings with file, line, confidence, and recommendation.
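A finding with file, line, confidence, and recommendation maps naturally onto a small record, and the main agent can merge several subagent reports before deciding anything. The sketch below is an assumption-laden illustration: the `Finding` shape, the 0.5 confidence threshold, and the rule of keeping only the most confident finding per location are choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    confidence: float  # 0.0 to 1.0, as reported by the subagent
    recommendation: str


def merge_findings(reports: list[list[Finding]],
                   min_confidence: float = 0.5) -> list[Finding]:
    """Flatten reports, drop weak findings, keep the strongest one per location."""
    best: dict[tuple[str, int], Finding] = {}
    for report in reports:
        for finding in report:
            if finding.confidence < min_confidence:
                continue
            key = (finding.file, finding.line)
            if key not in best or finding.confidence > best[key].confidence:
                best[key] = finding
    return sorted(best.values(), key=lambda f: f.confidence, reverse=True)
```

Nothing in the merged list carries the commands a subagent ran. Those stay in the subagent's own artifacts.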

Codex subagent documentation warns that dumping exploration notes, logs, and stack traces into the main conversation pollutes and degrades context. The recommendation is to keep the main agent focused on requirements, decisions, and final output while subagents return syntheses (OpenAI Developers, "Subagents", 2026).

This split works well with codebase RAG and knowledge graphs. The retrieval subagent can explain why it chose certain files. The test subagent can explain which suite covers the behavior. The review subagent can point to likely regressions. The main conversation becomes the place for decisions.

Avoid subagents writing the same section of code at the same time. That raises conflict and makes decision ownership blurry. If writing must be parallelized, split along real boundaries: package, service, route, schema, or infrastructure layer. Even then, consolidate with a reviewer agent before opening a pull request.
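Before fanning out writes, it can help to check that no two subagents claim overlapping paths. The following is a rough sketch with hypothetical names (`conflicts`, and an `assignments` mapping from agent name to the directories it may write to). It compares path prefixes only and knows nothing about the real package graph.

```python
from itertools import combinations
from pathlib import PurePosixPath


def conflicts(assignments: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Return (agent_a, agent_b, path) for every pair whose write scopes overlap."""
    found: list[tuple[str, str, str]] = []
    for (agent_a, paths_a), (agent_b, paths_b) in combinations(assignments.items(), 2):
        for a in paths_a:
            for b in paths_b:
                pa, pb = PurePosixPath(a), PurePosixPath(b)
                # Scopes overlap when they are equal or one contains the other.
                if pa == pb or pa in pb.parents or pb in pa.parents:
                    narrower = max(pa, pb, key=lambda p: len(p.parts))
                    found.append((agent_a, agent_b, str(narrower)))
    return found
```

An empty result does not prove the split is safe, since two packages can still share a schema or a contract, but a non-empty one is a cheap signal to redraw the boundaries.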

A useful fan-out has three properties. Each subagent task is small. Each output is a synthesis. The final merge requires proof. If any of those is missing, one cleaner conversation is often better.

How does MCP change context discipline?

In 2025, the Model Context Protocol specification defined three central server surfaces: resources, prompts, and tools (Model Context Protocol, "Specification 2025-11-25", 2025). MCP changes context discipline because it turns access into an interface: the agent does not need everything when it can query the right thing at the right time.

For coding agents, MCP resources can expose files, database schemas, internal documentation, or architecture decisions. MCP tools can run search, SQL queries, queue inspection, feature flag reads, or controlled test execution. MCP prompts can standardize flows such as bug triage, security review, or migration analysis.

The risk is confusing capability with permission. An MCP server that can access every database, every secret, and every repository turns a prompt mistake into an operational mistake. The practical rule is to expose narrow tools with typed arguments, workspace scope, and summarized output. The agent should ask for more context, not receive everything by default.
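The shape of a narrow tool can be shown without any MCP SDK. The sketch below is a plain Python function, not an MCP server and not the API of any MCP library. `read_excerpt`, `ScopeError`, and the 40-line default are assumptions of this example. It illustrates the three properties named above: typed arguments, workspace scope, and bounded output.

```python
from pathlib import Path


class ScopeError(Exception):
    """Raised when a request points outside the allowed workspace."""


def read_excerpt(workspace: Path, relative_path: str,
                 start_line: int = 1, max_lines: int = 40) -> dict:
    """Read a bounded slice of one file, refusing paths outside the workspace."""
    if start_line < 1 or max_lines < 1:
        raise ValueError("start_line and max_lines must be positive")

    root = workspace.resolve()
    target = (root / relative_path).resolve()  # collapses ".." and symlinks
    if root not in target.parents:
        raise ScopeError(f"{relative_path} is outside the workspace")

    lines = target.read_text(encoding="utf-8").splitlines()
    first = start_line - 1
    return {
        "path": str(target.relative_to(root)),
        "start_line": start_line,
        "lines": lines[first:first + max_lines],
        "truncated": first + max_lines < len(lines),  # more content is available
    }
```

The `truncated` flag is what lets the agent ask for more context on purpose instead of receiving the whole file by default.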

OWASP lists prompt injection, sensitive information disclosure, supply chain, unbounded consumption, and excessive agency among 2025 risks for LLM applications (OWASP, "Top 10 for Large Language Model Applications", 2025). In an MCP environment, those risks stop being theoretical because a tool can have real effects.

The best MCP design for development does not imitate an unlimited terminal. It looks more like an internal API: small tools, explicit names, audit logs, and responses that fit a decision. If a tool returns an encyclopedia, it is competing with the context budget.

How do you measure whether context improved?

In 2025, GitHub reported that developers merged an average of 43.2 million pull requests per month and created more than 230 repositories per minute (GitHub, "Octoverse 2025", 2025). At real scale, better context is measured through review, rework, and time to proof, not generated lines.

Choose metrics the team already feels. How many attempts does the agent need before a test passes? How many irrelevant files does it touch? How many code review comments repeat the same failure? How often does the agent ask for context that should already be in AGENTS.md? Those answers show where the context budget leaks.
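Two of those questions, attempts before a pass and irrelevant files touched, can be computed from very little data. The sketch below assumes the team records one `TaskRecord` per finished task. The record shape and the idea of comparing touched files against the files named in the task contract are assumptions of this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskRecord:
    attempts_until_pass: int   # loop attempts before verification passed
    files_touched: list[str]
    files_expected: list[str]  # files named in the task's context contract


def irrelevant_files(record: TaskRecord) -> list[str]:
    """Files the agent changed that the contract did not mention."""
    expected = set(record.files_expected)
    return [path for path in record.files_touched if path not in expected]


def summarize(records: list[TaskRecord]) -> dict[str, float]:
    """Average the two signals; raises StatisticsError if records is empty."""
    return {
        "mean_attempts": mean(r.attempts_until_pass for r in records),
        "mean_irrelevant_files": mean(len(irrelevant_files(r)) for r in records),
    }
```

Tracked per week or per task type, a rising `mean_irrelevant_files` suggests the retrieved context is too broad, while rising attempts suggest the verification step or the stable rules are missing something.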

For TypeScript teams, one simple metric is typecheck failure rate after AI-generated changes. GitHub observed in 2025 that TypeScript reached first place in contributor growth and connected that rise to typed systems that help bring AI-assisted code to production (GitHub, "Octoverse 2025", 2025). Types become part of the context harness.

Use a short post-loop review. Ask the agent what context was missing, what was extra, and which rule should become stable instruction. If the answer is useful, update AGENTS.md or the equivalent document. If the answer is vague, the loop did not produce operational learning.

The goal is not to use fewer tokens at any cost. The goal is to make every token carry a decision. One right file is worth more than five almost-related files. A test that fails clearly is worth more than a long explanation. An honest subagent summary is worth more than a thousand log lines.

Adoption checklist for a real codebase

In 2025, Google Cloud announced that the DORA report combined more than 100 hours of qualitative data with responses from nearly 5,000 technology professionals (Google Cloud, "Announcing the 2025 DORA Report", 2025). The practical takeaway is that AI amplifies the existing system, so adopt context engineering as a platform improvement, not a prompt trick.

Before automating pull requests, write a short and real repository instruction file. It should explain how to run the project, how to test, how to review, and which areas are sensitive. Then create a retrieval map: where architecture decisions, schemas, API contracts, queues, jobs, and runbooks live. Without that map, RAG becomes generic search.

Next, standardize tools. If the agent needs to query a database, expose a scoped read tool. If it needs to test a queue, expose a controlled command. If it needs a security review, provide a checklist and test. Connect this to CI so proof does not depend on the model's goodwill.

Finally, close the cycle in the pull request. The pull request description should include a summary, commands run, evidence, residual risks, and points needing human review. That format connects to the article on types of software tests and to TypeScript service architecture, because good agents still need clear boundaries.
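The pull request fields listed above can be enforced with a small template function that refuses to render when a section is empty. This is a sketch only: `render_pr_description` and the section titles are hypothetical, and the output is plain text rather than the format of any particular hosting platform.

```python
def render_pr_description(summary: str, commands: list[str], evidence: list[str],
                          residual_risks: list[str], needs_review: list[str]) -> str:
    """Build a plain-text pull request description; every section is required."""
    sections = [
        ("Summary", [summary]),
        ("Commands run", commands),
        ("Evidence", evidence),
        ("Residual risks", residual_risks),
        ("Needs human review", needs_review),
    ]
    missing = [title for title, items in sections
               if not any(item.strip() for item in items)]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")

    blocks = []
    for title, items in sections:
        body = "\n".join(f"- {item}" for item in items)
        blocks.append(f"{title}\n{body}")
    return "\n\n".join(blocks)
```

Raising on an empty section forces the agent to state "none identified" explicitly rather than silently skipping residual risks.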

If the codebase has fragile CI, start small. Pick one module, one suite, one task type, and one agent. Measure rework. Only then add subagents, MCP, and automation. Real context engineering is incremental: each rule comes from an observed failure, not from a generic list of best practices.

Frequently asked questions (FAQ)

In 2025, Stack Overflow measured 84% use or intended use of AI among developers, while also recording 46% distrust in accuracy (Stack Overflow, "2025 Developer Survey: AI", 2025). The important questions are therefore not about adopting AI; they are about limiting operational risk.

Is context engineering just prompt engineering under another name?

No. In 2025, DORA measured 90% AI adoption in software development, which shows the problem now sits in the whole workflow, not only in the prompt (Google Cloud, "How are developers using AI? Inside our 2025 DORA report", 2025). Context engineering includes retrieval, tools, memory, permissions, evals, and proof.

When are subagents worth using?

Use subagents when the task can be split into independent investigation. In 2025, Stack Overflow found that 69% of agent users saw productivity gains, but only 17% saw better collaboration (Stack Overflow, "2025 Developer Survey: AI", 2025). That favors analysis subagents, not uncoordinated concurrent editing.

Does MCP make an agent safer?

MCP improves the interface, but it does not guarantee safety by itself. In 2025, the official specification separated resources, prompts, and tools as protocol primitives (Model Context Protocol, "Specification 2025-11-25", 2025). Safety comes from scope, authentication, audit, human review, and narrow tools.

What is the first artifact to create?

The first artifact should be a repository instruction file. As of 2026, Codex documentation recommends AGENTS.md with repo layout, commands, conventions, constraints, and definition of done (OpenAI Developers, "Best practices - Codex", 2026). Without it, every session relearns the basics.

Closing

In 2025, DORA observed that teams with higher trust in AI saw greater productivity gains, while warning that trust without foundations can create uncritical dependence (Google Cloud, "How are developers using AI? Inside our 2025 DORA report", 2025). Context engineering for coding agents is an architectural practice: it defines what the agent knows, how it finds what is missing, which tools it may call, and which proof it must deliver.

The next step is simple: choose one repetitive task type, write the context contract, connect verification, and run a short loop. If the agent fails, do not enlarge the prompt first. Find which evidence was missing, which noise was extra, and which gate should have blocked the diff.

Sources consulted

  • Google Cloud, "How are developers using AI? Inside our 2025 DORA report", retrieved 2026-06-30, https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025/
  • Google Cloud, "Announcing the 2025 DORA Report: State of AI-Assisted Software Development", retrieved 2026-06-30, https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report
  • Stack Overflow, "2025 Developer Survey: AI", retrieved 2026-06-30, https://survey.stackoverflow.co/2025/ai
  • GitHub, "Octoverse: A new developer joins GitHub every second as AI leads TypeScript to #1", retrieved 2026-06-30, https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/
  • OpenAI, "Introducing Codex", retrieved 2026-06-30, https://openai.com/index/introducing-codex/
  • OpenAI Developers, "Best practices - Codex", retrieved 2026-06-30, https://developers.openai.com/codex/learn/best-practices
  • OpenAI Developers, "Subagents", retrieved 2026-06-30, https://developers.openai.com/codex/concepts/subagents
  • OpenAI Developers, "Iterate on difficult problems", retrieved 2026-06-30, https://developers.openai.com/codex/use-cases/iterate-on-difficult-problems
  • Model Context Protocol, "Specification 2025-11-25", retrieved 2026-06-30, https://modelcontextprotocol.io/specification/2025-11-25
  • OWASP, "Top 10 for Large Language Model Applications", retrieved 2026-06-30, https://owasp.org/www-project-top-10-for-large-language-model-applications/