PR evals for code agents are checks designed to judge an AI-generated change before the human reviewer spends attention on it. They combine tests, rubrics, diff analysis, execution evidence, and blocking rules. The goal is not to prove the agent is smart. It is to prove this specific PR deserves human review.
In 2026, GitHub said Copilot code review had processed more than 60 million reviews and that more than one in five code reviews on GitHub involved an agent (GitHub, "Agent pull requests are everywhere. Here's how to review them", 2026). That volume changes the discipline: without evals, manual review turns into a queue of unverified diffs.
Practical summary
- Agent PRs should arrive with proof, not just confident prose.
- Good evals combine deterministic tests, rubrics, and execution evidence.
- Subagents help when they review separate signals and return short syntheses.
- CI should block repeatable risk before calling the reviewer.
Why did PR evals become urgent?
In 2026, GitLab reported that 85% of respondents agree AI has shifted the bottleneck from writing code to reviewing and validating it (GitLab, "AI Accountability Report", 2026). PR evals became urgent because agents accelerate diffs, while the trust cost still lands on people.
The mistake is treating every agent PR like a human PR whose author can defend each decision. An agent can write a convincing explanation and still touch the wrong file, ignore a regression, or remove a test without reason. The story is not enough.
The eval asks a different question: which minimum signals make this PR reviewable? For backend, that might mean contract tests, types, lint, reversible migration, and permission analysis. For DevOps, it may mean rollback plan, infrastructure scope, and secret checks. For frontend, it may mean visual tests, accessibility, and empty states.
This extends my article on a code-agent harness for reliable pull requests. The harness defines the flow. The eval defines the measurement. Without both, the team only moves the bottleneck into a larger review queue.
A PR eval for code agents turns review into evidence-based triage. If the PR does not show its tests, scope, and risk, it is not ready. If it does, the reviewer spends attention on what matters: architecture, product intent, security, and the trade-offs automation should not decide.
What should an eval measure before merge?
In 2026, GitLab also said 82% of respondents see AI-generated code as risking a new form of technical debt (GitLab, "AI Accountability Report", 2026). A PR eval should therefore measure reproduction, scope, regression, and traceability, not just whether the suite is green.

The first signal is reproduction. If the PR fixes a bug, the eval should require a test that failed before or an artifact that shows the original failure. Without that, the agent may have changed only a symptom. For a feature, the equivalent is a behavior contract with a happy path and a negative case.
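A minimal sketch of the reproduction signal, assuming CI can run the same named tests on the base commit and on the PR head. `RunResult` and `reproduces_bug` are hypothetical names for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one test on the base commit and on the PR head (assumed inputs)."""
    name: str
    passed_before: bool
    passed_after: bool


def reproduces_bug(results: list[RunResult]) -> bool:
    """True when at least one test failed before the patch and passes after it."""
    return any(not r.passed_before and r.passed_after for r in results)


# Hypothetical results for a session-expiration fix.
results = [
    RunResult("session expires after timeout", passed_before=False, passed_after=True),
    RunResult("login with valid password", passed_before=True, passed_after=True),
]
print(reproduces_bug(results))  # True
```

If every test was already green before the patch, the function returns False, which is the case where the agent may have changed only a symptom.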
The second signal is scope. Evals need to compare the diff with the task. An agent that changes authentication to fix layout should be blocked. An agent that changes a lockfile without a related dependency change owes an explanation. An agent that removes a test to pass CI deserves an automatic failure.
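The scope comparison can be sketched with glob patterns, assuming the task declares which paths it is allowed to touch. The function names, the example paths, and the test-file patterns are assumptions for illustration. Note that in Python's `fnmatch`, `*` also matches `/`, so `src/ui/**` covers nested files.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_globs: list[str]) -> list[str]:
    """Return changed files that match none of the globs declared for the task."""
    return [f for f in changed_files
            if not any(fnmatchcase(f, g) for g in allowed_globs)]


def deleted_tests(deleted_files: list[str],
                  test_globs: tuple[str, ...] = ("*.spec.ts", "*.test.ts")) -> list[str]:
    """Return deleted files that look like tests; any hit is an automatic failure."""
    return [f for f in deleted_files
            if any(fnmatchcase(f, g) for g in test_globs)]


# Hypothetical layout task that is only allowed to touch UI files.
changed = ["src/ui/Header.tsx", "src/auth/login.ts"]
print(out_of_scope(changed, ["src/ui/**"]))       # ['src/auth/login.ts']
print(deleted_tests(["src/ui/Header.spec.ts"]))   # ['src/ui/Header.spec.ts']
```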
The third signal is risk. Changes in authentication, payments, personal data, queues, jobs, migrations, and infrastructure should raise the gate level. You do not need to treat every PR as an incident. You do need to recognize that some directories carry more potential damage.
| PR signal | How to measure | Blocks when |
|---|---|---|
| Reproduction | New test, fixture, or minimal log. | Bug does not appear before the fix. |
| Scope | Changed files against the task. | Diff touches an unexplained boundary. |
| Regression | Tests, typecheck, lint, and contract. | Command fails or was skipped. |
| Risk | Sensitive directories and change type. | Security, data, or infra changes without review. |
| Trace | PR body with commands and evidence. | Reviewer gets only a generic summary. |
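One way to read the table is as a function from collected signals to blocking reasons. The sketch below assumes earlier CI steps have already produced each signal; `PrSignals` and its field names are illustrative, not an existing schema.

```python
from dataclasses import dataclass, field


@dataclass
class PrSignals:
    """Signals collected by earlier CI steps; all field names are illustrative."""
    reproduction_shown: bool
    unexplained_files: list[str] = field(default_factory=list)
    failed_or_skipped_commands: list[str] = field(default_factory=list)
    touches_sensitive_area: bool = False
    human_review_requested: bool = False
    body_has_commands_and_evidence: bool = True


def blocking_reasons(s: PrSignals) -> list[str]:
    """Map each table row to a blocking reason; an empty list means reviewable."""
    reasons = []
    if not s.reproduction_shown:
        reasons.append("reproduction: bug does not appear before the fix")
    if s.unexplained_files:
        reasons.append("scope: unexplained files " + ", ".join(s.unexplained_files))
    if s.failed_or_skipped_commands:
        reasons.append("regression: failed or skipped "
                       + ", ".join(s.failed_or_skipped_commands))
    if s.touches_sensitive_area and not s.human_review_requested:
        reasons.append("risk: sensitive change without review")
    if not s.body_has_commands_and_evidence:
        reasons.append("trace: PR body lacks commands and evidence")
    return reasons


print(blocking_reasons(PrSignals(reproduction_shown=True)))  # []
```

Returning reasons instead of a single boolean keeps the gate explainable: the agent, or the reviewer, sees which row failed.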
Practical experience: when I review agent-generated PRs, the best signal is not the length of the write-up. It is the relationship between hypothesis and proof. If the hypothesis says "fix session expiration", I want to see the session test, touched file, and command output. Everything else is secondary.
How do you turn this into a CI gate?
In 2026, GitLab said only 28% of respondents say their software development lifecycle tools are fully integrated with shared data and workflows (GitLab, "AI Accountability Report", 2026). A CI gate for agents must be simple enough to fit the existing flow.

Start with a policy file. It can live in AGENTS.md, CLAUDE.md, or .github/agent-evals.yml. The name matters less than the job: record required commands, sensitive areas, evidence format, and blocking conditions. In June 2026, GitHub made AGENTS.md available to shape Copilot code review feedback (GitHub Changelog, "Copilot code review: AGENTS.md support and UI improvements", 2026).
Then create a small checker. It does not need AI on day one. A Node, Python, or Bash script can validate that the PR body includes commands, sensitive files trigger a label, required tests ran, and no forbidden file changed. The gate starts deterministic.
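A deterministic checker of this kind can be very small. The sketch below assumes the PR body and the list of changed files have already been fetched by the CI job; `check_pr` and the example inputs are hypothetical.

```python
def check_pr(body: str,
             changed_files: list[str],
             required_commands: list[str],
             forbidden_files: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the gate passes."""
    problems = [f"PR body does not mention: {cmd}"
                for cmd in required_commands if cmd not in body]
    problems += [f"forbidden file changed: {path}"
                 for path in changed_files if path in forbidden_files]
    return problems


# Hypothetical inputs; in CI these would come from the PR and its diff.
body = "Ran `npm run lint` and `npm run typecheck`, both green."
problems = check_pr(
    body,
    changed_files=["src/ui/Header.tsx", ".github/agent-evals.yml"],
    required_commands=["npm run lint", "npm run typecheck", "npm run test -- --runInBand"],
    forbidden_files={".github/agent-evals.yml"},
)
for problem in problems:
    print(problem)
# PR body does not mention: npm run test -- --runInBand
# forbidden file changed: .github/agent-evals.yml
```

In a real job the script would exit with a non-zero status when `problems` is not empty, so CI marks the check as failed.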
Next, add a rubric for what tests do not cover. The rubric can score PR clarity, relationship to the requirement, residual risk, and architectural fit. Use AI as a judge only when the criterion is semantic. Even then, treat the result as a signal, not final truth.
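A rubric result can be kept advisory by design. The sketch below assumes a 1 to 5 scale and a threshold of 3, both arbitrary choices; the scores could come from a human or from an AI judge, and the function never produces a merge decision.

```python
# Criteria taken from the paragraph above; the identifiers are illustrative.
RUBRIC = ("pr_clarity", "requirement_fit", "residual_risk", "architectural_fit")


def rubric_signal(scores: dict[str, int], threshold: int = 3) -> dict:
    """Summarize rubric scores (assumed 1-5 scale) as an advisory signal."""
    missing = [c for c in RUBRIC if c not in scores]
    if missing:
        raise ValueError(f"missing rubric scores: {missing}")
    return {
        "average": sum(scores[c] for c in RUBRIC) / len(RUBRIC),
        # Criteria the reviewer should look at first.
        "needs_attention": [c for c in RUBRIC if scores[c] < threshold],
        # The rubric informs the reviewer; it does not block on its own.
        "blocking": False,
    }


signal = rubric_signal({"pr_clarity": 4, "requirement_fit": 2,
                        "residual_risk": 5, "architectural_fit": 3})
print(signal["average"], signal["needs_attention"])  # 3.5 ['requirement_fit']
```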
pr_eval:
  required_commands:
    - "npm run lint"
    - "npm run test -- --runInBand"
    - "npm run typecheck"
  sensitive_areas:
    - "src/auth/**"
    - "infra/**"
    - "migrations/**"
  requires_human_reviewer:
    - "auth"
    - "personal data"
    - "rollback"
  blocks_if:
    - "no new test for bug"
    - "diff outside scope"
    - "required command missing"
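To show how a checker might consume the `sensitive_areas` list, the sketch below mirrors that part of the policy as a plain Python dict, so no YAML parser is assumed. The `needs-human-review` label is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Mirrors the sensitive_areas entries of the policy as a plain dict.
POLICY = {"sensitive_areas": ["src/auth/**", "infra/**", "migrations/**"]}


def sensitive_hits(changed_files: list[str], policy: dict = POLICY) -> dict[str, list[str]]:
    """Group changed files by the sensitive-area glob they match."""
    hits: dict[str, list[str]] = {}
    for glob in policy["sensitive_areas"]:
        matched = [f for f in changed_files if fnmatchcase(f, glob)]
        if matched:
            hits[glob] = matched
    return hits


hits = sensitive_hits(["src/auth/session.service.ts", "src/ui/Header.tsx"])
print(hits)  # {'src/auth/**': ['src/auth/session.service.ts']}

# Hypothetical label name; the gate would apply it instead of auto-approving.
label = "needs-human-review" if hits else None
```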
For long loops with Claude Code and Codex, context cost also becomes part of the gate. I use RemoteCode, my own tool for taking Claude Code and Codex further in agentic flows, when work needs to cross sessions, subagents, and evidence without loading the full history into the main prompt.
Where do subagents help without adding noise?
In 2026, the Codex documentation says subagents help with highly parallel tasks, such as codebase exploration or implementing a multi-step feature plan (OpenAI Developers, "Subagents", 2026). In PR evals, subagents work best as specialized reviewers, not as several authors editing the same diff.
One subagent can read security. Another can inspect tests. Another can check TypeScript type impact. Another can compare the specification with the diff. The main agent does not need every log. It needs a synthesis with finding, file, severity, confidence, and next action.
Claude Code documentation describes subagents as specialized assistants that work in their own context and return only a summary, preserving the main conversation (Claude Code Docs, "Create custom subagents", 2026). That is the central rule for PRs: fan out reading, merge decisions.

Avoid subagents writing in the same module at the same time. That blurs authorship, increases conflict, and makes it harder to explain why one decision won. If the work requires several changes, split by real boundary: service, package, route, schema, or job. The final eval must consolidate.
A good subagent result fits in a few lines:
area: security
status: block
file: src/auth/session.service.ts
reason: renewal accepts expired token without checking revoked_at
proof: test auth/session-renewal.spec.ts fails in the negative case
action: require revocation test before merge
That shape avoids the worst of both worlds: many agents talking and nobody deciding. A good subagent reduces context. A bad subagent becomes an asynchronous meeting between models.
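The "fan out reading, merge decisions" rule can be sketched as a typed finding plus one consolidation step. The fields follow the example result above; the `pass` and `warn` statuses and the severity ordering are assumptions, since the example only shows `block`.

```python
from dataclasses import dataclass

# Assumed ordering; only "block" appears in the example result.
SEVERITY = {"pass": 0, "warn": 1, "block": 2}


@dataclass(frozen=True)
class Finding:
    area: str
    status: str
    file: str
    reason: str
    proof: str
    action: str


def consolidate(findings: list[Finding]) -> tuple[str, list[Finding]]:
    """Return the most severe status and the findings that carry it."""
    if not findings:
        raise ValueError("no findings: missing evidence is not a pass")
    worst = max(findings, key=lambda f: SEVERITY[f.status]).status
    return worst, [f for f in findings if f.status == worst]


findings = [
    Finding("tests", "pass", "auth/session-renewal.spec.ts",
            "suite covers the happy path", "test run attached", "none"),
    Finding("security", "block", "src/auth/session.service.ts",
            "renewal accepts expired token without checking revoked_at",
            "test auth/session-renewal.spec.ts fails in the negative case",
            "require revocation test before merge"),
]
status, decisive = consolidate(findings)
print(status, [f.area for f in decisive])  # block ['security']
```

Only the decisive findings need to reach the main agent, which keeps its context small.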
How do you design a loop that improves itself?
In 2026, OpenAI recommends giving Codex an evaluation system with scripts and reviewable artifacts so it can improve a task until the score is good enough (OpenAI Developers, "Iterate on difficult problems", 2026). The loop improves itself when failure becomes structured data, not when the agent blindly tries again.
The cycle starts with a hypothesis. The agent declares which behavior will change, which file it expects to touch, and how it will prove success. Then it applies the patch. CI runs the evals. If it fails, the next attempt receives a compact diagnostic: command, failure, likely file, and preserved constraint.
OpenAI described, in its tax-agent case study, a cycle where production issues become findings, tailored evals, and engineering tasks validated against targeted and regression evals (OpenAI, "Building self-improving tax agents with Codex", 2026). For software, the same pattern turns incidents, bugs, and reviews into reusable test cases.
Do not accept an infinite loop as autonomy. Define an attempt limit and a stop rule. If two attempts fail for the same reason, the agent should open a blocker. If the failure changes, it can iterate. If the eval is wrong, the agent can propose an adjustment, but it should not edit the gate without review.
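The stop rule can be written down explicitly. The sketch below assumes each failed attempt produces the compact diagnostic described earlier, that "same reason" means the same failure text on two consecutive attempts, and that the limit is three attempts; all three are assumptions to adjust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    """Compact record of one failed attempt."""
    command: str
    failure: str
    likely_file: str
    preserved_constraint: str


def next_step(history: list[Diagnostic], max_attempts: int = 3) -> str:
    """Decide whether the agent may try again or must open a blocker."""
    if len(history) >= max_attempts:
        return "open_blocker"  # attempt limit reached
    if len(history) >= 2 and history[-1].failure == history[-2].failure:
        return "open_blocker"  # two consecutive attempts failed for the same reason
    return "iterate"           # no failure yet, or the failure changed


first = Diagnostic("npm run typecheck", "TS2345 in renew()",
                   "src/auth/session.service.ts", "do not edit the gate")
second = Diagnostic("npm run test -- --runInBand", "negative case fails",
                    "src/auth/session.service.ts", "do not edit the gate")

print(next_step([first]))          # iterate
print(next_step([first, first]))   # open_blocker
print(next_step([first, second]))  # iterate
```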
The most important point: evals also need maintenance. When a human reviewer finds a bug the gate missed, record the pattern. If it is repeatable, turn it into a test, rubric, scope rule, or human checklist. The system improves when each expensive mistake becomes a cheap check.
What is the minimum viable setup this week?
In 2026, GitLab said 91% of organizations are likely to invest in AI code governance tools in the next 12 months (GitLab, "AI Accountability Report", 2026). The minimum viable setup does not need to wait for a new platform: start with a small policy, a checker, and a PR template.
First, write an AGENTS.md section with required commands and delivery format. Include "what changed", "why it changed", "how it was verified", and "residual risk". That already helps Claude Code, Codex, Copilot, and any agent that reads repository instructions.
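Checking the delivery format does not need more than a substring match. The sketch below assumes the four headings appear verbatim in the PR body, in any letter case; `missing_sections` is a hypothetical helper.

```python
REQUIRED_SECTIONS = ("what changed", "why it changed",
                     "how it was verified", "residual risk")


def missing_sections(pr_body: str) -> list[str]:
    """Return the required section titles that do not appear in the PR body."""
    lowered = pr_body.lower()
    return [section for section in REQUIRED_SECTIONS if section not in lowered]


# Hypothetical PR body that skips the last two sections.
pr_body = """What changed
Session renewal now checks revoked_at.

Why it changed
Expired tokens could be renewed.
"""
print(missing_sections(pr_body))  # ['how it was verified', 'residual risk']
```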
Second, create a CI job named agent-pr-eval. It runs existing commands and validates the PR body. If the codebase already has end-to-end tests with Playwright, add only the relevant subset. If it has several types of software tests, choose the smallest combination that proves the change.
Third, mark sensitive areas. Authentication, authorization, billing, personal data, migrations, and infrastructure should not pass with auto-approval. This gate pairs well with a TypeScript service architecture, because clear module boundaries make evals more objective.
Fourth, log false negatives. Every time a reviewer catches an error CI missed, ask one question: should this become a test, rubric, scope rule, or human checklist? If it becomes nothing, the organization will keep paying the same review cost.
FAQ about PR evals
In 2026, GitLab said 80% of respondents agree their organizations adopted AI tools faster than they developed policies to govern them (GitLab, "AI Accountability Report", 2026). These questions help turn policy into routine.
Does a PR eval replace human code review?
No. In 2026, GitLab reported that 85% of respondents see the bottleneck in review and validation, not just code writing (GitLab, "AI Accountability Report", 2026). The eval removes repeatable noise; the human decides architecture, product intent, and risks requiring judgment.
Does every agent PR need AI judging the rubric?
No. In 2026, OpenAI recommends scripts and reviewable artifacts as the base of the evaluation loop (OpenAI Developers, "Iterate on difficult problems", 2026). Start with deterministic checks. Use an AI judge only for semantic criteria, such as coherence between requirement and diff.
Is AGENTS.md required for evals?
No, but it became more useful. In 2026, GitHub announced that Copilot code review uses relevant instructions from repository AGENTS.md (GitHub Changelog, "Copilot code review: AGENTS.md support and UI improvements", 2026). The file becomes a shared contract between agents and reviewers.
Which metric shows the eval improved?
Use review rework. In 2026, GitHub reported more than 60 million reviews processed by Copilot code review (GitHub, "Agent pull requests are everywhere. Here's how to review them", 2026). Locally, measure fewer repeated comments, fewer out-of-scope PRs, and fewer post-merge failures.
Closing
In 2026, OpenAI wrote that as code throughput increased, the bottleneck became human QA capacity (OpenAI, "Harness engineering: leveraging Codex in an agent-first world", 2026). PR evals are the practical response: they do not let AI decide alone, but they force each change to arrive with proof.
Start small. A PR template, three required commands, a sensitive-area list, and one blocking rule already improve review quality. Then add subagents, semantic rubrics, and improvement loops. A good agent is not the one that writes more code. It is the one that delivers a PR the team can verify.
Sources consulted
- GitHub, "Agent pull requests are everywhere. Here's how to review them", retrieved 2026-07-01, https://github.blog/ai-and-ml/generative-ai/agent-pull-requests-are-everywhere-heres-how-to-review-them/
- GitLab, "AI Accountability Report", retrieved 2026-07-01, https://ir.gitlab.com/news/news-details/2026/GitLab-Research-Reveals-Organizations-Are-Generating-AI-Code-Faster-Than-They-Can-Control-It/default.aspx
- GitHub Changelog, "Copilot code review: AGENTS.md support and UI improvements", retrieved 2026-07-01, https://github.blog/changelog/2026-06-18-copilot-code-review-agents-md-support-and-ui-improvements/
- OpenAI Developers, "Subagents", retrieved 2026-07-01, https://developers.openai.com/codex/subagents
- Claude Code Docs, "Create custom subagents", retrieved 2026-07-01, https://code.claude.com/docs/en/sub-agents
- OpenAI Developers, "Iterate on difficult problems", retrieved 2026-07-01, https://developers.openai.com/codex/use-cases/iterate-on-difficult-problems
- OpenAI, "Building self-improving tax agents with Codex", retrieved 2026-07-01, https://openai.com/index/building-self-improving-tax-agents-with-codex/
- OpenAI, "Harness engineering: leveraging Codex in an agent-first world", retrieved 2026-07-01, https://openai.com/index/harness-engineering/