Four PR review modes when most code is AI-generated
How PR review changes when most code is AI-generated. The four review modes engineering teams settle on, when to pick each, and what each one quietly breaks.
- Reviewers cannot keep up with PRs that an AI wrote in 20 minutes. The review either becomes shallow or the queue stalls, and most teams pick shallow without admitting it.
- When most code is AI-generated, the bugs are not in the lines. They are in the seams between AI-suggested blocks where two reasonable choices fight each other.
- Four PR review modes emerge in AI-heavy teams: skim-trust, sample, public-API-only, and deep review of risky paths. Most teams pick one by accident, not by design.
- Treating PR review the same way as 2022 is the single biggest source of trust collapse on engineering teams in 2026.
A senior engineer opens her PR queue Monday morning. Fourteen PRs. The first one is 700 lines, a feature she vaguely remembers planning. The author opened it Friday at 4pm. The diff looks clean. The tests pass. She has a 1:1 in twelve minutes.
She approves it.
This is the modal PR review experience on engineering teams in 2026, and it is not how teams describe their process to candidates.
The hidden problem nobody admits
PR review was designed for a world where the author wrote the code, the reviewer read the code, and any disagreement was about taste or correctness in roughly equal measure. Most of the friction was social.
That world ended quietly somewhere between mid-2024 and late-2025. By the time most teams noticed, the code in their PRs was 50 to 70% AI-generated, the author often had not typed two-thirds of the diff, and the reviewer was being asked to give a meaningful sign-off in the same fifteen-minute window they had three years ago.
The math does not work. PR sizes have grown 2 to 4x in our experience. Reviewer cognitive load has grown more than that, because the code is unfamiliar to both the author and the reviewer. Yet time per review has gone down, not up, because nobody can keep up.
The result is a quiet trust collapse that does not show up in any metric until the bug rate spikes in production.
The four review modes that actually emerge
Most teams do not choose a review mode. They drift into one. In our experience working with R&D teams who use Cursor, Claude Code, Copilot, or some combination, four patterns cover almost every PR review process you will find in the wild.
1. Skim-trust
The reviewer reads the PR description, scrolls through the diff once, runs the tests, and approves if everything is green. Total time: 5 to 10 minutes for a 500-line PR.
This works for low-risk paths: docs, internal tooling, well-tested libraries. It fails silently on anything that touches auth, data, payments, or external APIs. The bugs that escape are the ones nobody saw because nobody looked.
We see this pattern most often in teams that adopted AI tools quickly and did not change their review SLA. The form survives, the function does not.
2. Sample review
The reviewer picks one or two files that look risky (the schema change, the auth helper) and deep-reviews those. The rest gets a skim. Total time: 20 to 30 minutes for a 500-line PR.
This is the practical compromise most senior engineers settle into. It catches the high-leverage bugs without forcing a 90-minute review of every PR. Failure mode: when the risky bit is hidden inside a routine-looking file. The agent quietly added a third retry loop in the metrics helper. Nobody flagged the metrics helper. Production sees 3x the API spend next month.
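One way to make the sample less dependent on which filenames look scary is to score changed files on their content as well as their path. The sketch below is illustrative only: `RISKY_PATH_HINTS`, `RISKY_LINE_PATTERN`, the weighting, and the `ChangedFile` shape are all assumptions for the example, not a tested heuristic.

```python
import re
from dataclasses import dataclass

# Hypothetical hints: tune both lists to what has actually broken in your codebase.
RISKY_PATH_HINTS = ("auth", "schema", "migration", "payment")
RISKY_LINE_PATTERN = re.compile(r"\b(retry|retries|backoff|sleep|timeout)\b", re.IGNORECASE)


@dataclass
class ChangedFile:
    path: str
    added_lines: list[str]


def risk_score(changed: ChangedFile) -> int:
    """Score a file by risky path hints plus risky-looking added lines."""
    path_hits = sum(hint in changed.path.lower() for hint in RISKY_PATH_HINTS)
    line_hits = sum(bool(RISKY_LINE_PATTERN.search(line)) for line in changed.added_lines)
    # Path hits are weighted double; the weight is an arbitrary starting point.
    return 2 * path_hits + line_hits


def pick_sample(files: list[ChangedFile], size: int = 2) -> list[str]:
    """Return up to `size` paths to deep-review, highest score first."""
    scored = [(risk_score(f), f.path) for f in files]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [path for _, path in ranked[:size]]


if __name__ == "__main__":
    pr = [
        ChangedFile("src/auth/session.py", ["token = issue()"]),
        ChangedFile("src/metrics/helper.py", ["for attempt in range(3):", "    retry(send)"]),
        ChangedFile("docs/readme.md", ["hello"]),
    ]
    print(pick_sample(pr))
```

With the made-up PR above, the metrics helper lands in the sample because of the added retry call, even though nothing in its path suggests risk. A keyword scan like this is crude; it only widens what gets looked at, and the reviewer's judgement still decides what matters.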
3. Public-API-only review
The reviewer only checks changes to exported interfaces, schemas, public types, and migration files. Implementation details are trusted to AI plus CI plus integration tests. Total time: 10 to 15 minutes for a 500-line PR.
Best for library code, SDK work, anything where the contract matters more than the implementation. Breaks down for product code, where most consequential bugs are in implementation, not interface.
A team we know runs this pattern on their public Python SDK. It works because the SDK has 96% test coverage and a contract test suite that runs against every PR. They also run sample-review on the same PRs. The two stack.
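For a Python codebase, the "what changed in the contract" question can be partly mechanised. The sketch below is not the tooling that team uses; it is a minimal illustration using the standard `ast` module, and it assumes that "public" means a top-level function whose name does not start with an underscore. It compares positional parameter names only and ignores classes, defaults, keyword-only arguments, and type annotations.

```python
import ast


def public_signatures(source: str) -> dict[str, list[str]]:
    """Map each public top-level function to its positional parameter names."""
    signatures = {}
    for node in ast.parse(source).body:
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and not node.name.startswith("_"):
            signatures[node.name] = [arg.arg for arg in node.args.args]
    return signatures


def api_changes(old_source: str, new_source: str) -> dict[str, list[str]]:
    """List public functions that were removed, added, or changed parameters."""
    old = public_signatures(old_source)
    new = public_signatures(new_source)
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


if __name__ == "__main__":
    before = "def connect(host, port):\n    pass\n\ndef _helper():\n    pass\n"
    after = "def connect(host, port, timeout):\n    pass\n\ndef close():\n    pass\n"
    print(api_changes(before, after))
```

A report like this tells the reviewer where the contract moved, so the 10 to 15 minutes go to those functions first. It does not replace the contract test suite, which checks behaviour rather than shape.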
4. Deep review of risky paths
The team uses heuristics, usually a labeller bot, to flag PRs that touch auth, payments, migrations, public APIs, or deployment scripts. Those get a full 60 to 90 minute review. The rest go through skim-trust.
This is the pattern we recommend most often. It scales because it concentrates expensive review where it matters. The setup cost is real: somebody has to build the labelling rules, maintain them, and trust them. Most teams underinvest in this and then complain that "the bot keeps missing things."
A minimal labeller in GitHub Actions:
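name: Risk labeller
on:
  pull_request:
    # Run only when a PR touches one of the risky paths.
    paths:
      - 'src/auth/**'
      - 'src/payments/**'
      - 'migrations/**'
      - 'src/api/public/**'
jobs:
  label:
    runs-on: ubuntu-latest
    # The labeler needs write access to pull requests to apply labels.
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/labeler@v5
        with:
          # Label-to-glob rules (for example risk:high) live in this file.
          configuration-path: .github/risk-labels.yml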
That workflow and its label config, plus a CODEOWNERS entry that routes the same paths to senior reviewers (CODEOWNERS matches on paths, not on labels), get you 80% of the value of an enterprise PR-routing tool.
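The routing decision itself is small enough to sketch outside of CI, which helps when you want to test the rules before trusting them. The snippet below reuses the four globs from the workflow's `paths` filter; the function name `review_mode` and the returned dictionary shape are assumptions made for illustration. Note that Python's `fnmatchcase` lets `*` match across `/`, so these patterns behave as prefix matches here, which is close to but not identical to GitHub's glob semantics.

```python
from fnmatch import fnmatchcase

# Same globs as the workflow's `paths` filter.
RISKY_GLOBS = ("src/auth/**", "src/payments/**", "migrations/**", "src/api/public/**")


def review_mode(changed_paths: list[str]) -> dict:
    """Route a PR to deep review if any changed path matches a risky glob."""
    risky = [p for p in changed_paths if any(fnmatchcase(p, g) for g in RISKY_GLOBS)]
    if risky:
        return {"label": "risk:high", "mode": "deep review", "risky_paths": risky}
    # Everything else falls through to the cheap mode.
    return {"label": None, "mode": "skim-trust", "risky_paths": []}


if __name__ == "__main__":
    print(review_mode(["docs/readme.md", "src/payments/charge.py"]))
    print(review_mode(["src/ui/button.tsx"]))
```

Keeping the rules in a function like this makes "the bot keeps missing things" a testable complaint: every miss becomes a new path in a test case and, if warranted, a new glob.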
Comparison: which mode fits your team
There is no universal right answer. Each pattern has a failure mode, and the right pattern depends on what your code does, who writes it, and what failure costs you.
Skim-trust. Best for: small teams shipping internal tools or prototype-stage products. Breaks when: you have customers who will notice the bugs.
Sample review. Best for: most product engineering teams up to roughly 30 engineers, where senior judgement is the bottleneck and you want to use it well. Breaks when: the team is large enough that risk is concentrated in non-obvious places.
Public-API-only. Best for: library teams, SDK teams, infrastructure teams where the contract is the product. Breaks when: applied to product code, where implementation drift quietly destroys reliability.
Deep review of risky paths. Best for: teams above 30 engineers with mixed-risk codebases (some payments, some marketing site). Breaks when: the labelling rules are not maintained, or when "risky" is hard to define in your domain.
How to start tomorrow
Three steps that work for any team.
Audit one week of PRs honestly. Pick last week's merged PRs. For each, ask: how long did review actually take, what did the reviewer actually look at, what did they catch versus what they trusted. Compare to what you would say in a candidate interview. If the gap is large, you have a process problem the team is not naming.
Pick a mode explicitly. Do not drift. Tell the team: "for this codebase, we are sample-reviewing, and the sample is the schema, the auth helpers, and any new API endpoint." Make the rule explicit, write it down in your engineering handbook, and refer to it the next time someone says a review felt thin.
Build the labelling step before you need it. Even a primitive GitHub Action that adds risk:high to PRs touching /migrations, /auth, or /payments will do most of the work of a deep-review-of-risky-paths system. The remaining work is iterating on what counts as risky, based on what actually broke last quarter.
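For the audit in the first step, a rough way to see which mode the team is really running is to normalise each review to minutes per 500 changed lines and compare it with the time ranges quoted for each mode earlier. This is a sketch under loose assumptions: the `MergedPR` record, the sample numbers, and the idea of using the upper end of each range as a bucket boundary are ours for illustration. Pace alone cannot tell you what the reviewer looked at, so treat the output as a prompt for the honest conversation, not as a measurement.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MergedPR:
    number: int
    changed_lines: int
    review_minutes: float


# Upper bounds in minutes per 500 changed lines, taken from the mode descriptions.
PACE_BUCKETS = [
    (10, "skim-trust pace"),
    (15, "public-API-only pace"),
    (30, "sample pace"),
]


def observed_pace(pr: MergedPR) -> str:
    """Bucket a review by how long it took relative to the size of the diff."""
    minutes_per_500 = pr.review_minutes * 500 / max(pr.changed_lines, 1)
    for upper_bound, name in PACE_BUCKETS:
        if minutes_per_500 <= upper_bound:
            return name
    return "deep-review pace"


def audit(prs: list[MergedPR]) -> Counter:
    """Count how many of last week's reviews fell into each pace bucket."""
    return Counter(observed_pace(pr) for pr in prs)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    last_week = [MergedPR(1, 700, 12), MergedPR(2, 500, 25), MergedPR(3, 200, 40)]
    print(audit(last_week))
```

If most of the week lands in the skim-trust bucket while the handbook says sample review, that is the gap the audit is meant to surface.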
Common mistakes we see repeatedly
Treating "tests pass" as "code is correct." Tests pass for AI-generated code roughly as often as they did for human-written code, partly because the AI also wrote the tests, often in the same conversation. CI passing is necessary, not sufficient. Review the seams between blocks instead of trusting the existence of green checks.
Reviewing for style instead of for risk. When you have 30 minutes and a 700-line PR, spending 15 of them on naming and formatting is a tax on the team's time and a way to feel productive without managing risk. Lint catches style. You should be catching the things lint cannot.
Letting the AI reviewer be the only reviewer. AI code review tools are useful as a first pass, but their failure mode is silent: they miss the architectural seams and the product context, and catching those is the actual job of a human reviewer. If you remove the human review, you have outsourced the only step that catches the most consequential bugs.
Sticking with 2022 process language. "All PRs need at least one review before merge" sounds tight. In practice it has degraded to "all PRs need someone to click approve." If the rule is not doing the work it used to, change the rule. Do not pretend it still works.
Where this goes next
Our take, stated honestly: the long-term answer is not a tool that reviews code for you. It is a synthesis layer that joins what an agent wrote, what the human author understands, what is risky in this part of the codebase, and what the team decided last quarter, then surfaces the right level of review for that specific PR. Sometimes that means full human review. Sometimes it means "ship it, and add this PR to next week's incident review if it bites."
That is the shape of the problem TKTIDE sits in, one agent per tool with cross-context synthesis. The honest tradeoff: this category is roughly twelve months old, the playbook is not settled, and there are fewer reference deployments to point at. If you have a production incident every month traced to a thinly-reviewed PR, this is worth exploring. If your current sample-review process is working, the most useful change is naming it explicitly so it does not quietly drift into skim-trust.
PR review did not get harder in 2026. The job of PR review changed, and most processes have not changed with it. The four modes above are the patterns we see teams settle on once they admit the problem out loud. Pick one on purpose.
Frequently asked
How long should PR review take when most code is AI-generated?
Same answer as before: as long as the risk surface deserves. What changed is that risk surface is no longer correlated with diff size. A 50-line PR generated by an agent can carry more risk than a 500-line refactor written by a senior. Time-box by what the PR touches, not by how much it touches.
Should we still ask the author to walk us through the PR?
Yes, more than before. When the author did not write half the code themselves, the walkthrough is the only way to surface what they actually understand. If the author cannot explain a block, that block needs deeper review or a rewrite.
Can AI review AI-generated code?
Partially. AI reviewers catch surface issues like style, obvious bugs, and lint-equivalent problems reliably. They miss the seams between AI-generated blocks, the architectural drift, and anything requiring product context. Use them as a first pass, not a replacement for human judgement.
Are big PRs okay if AI wrote them?
No, the inverse is true. Big PRs were already hard to review carefully. AI-generated big PRs are harder still because the reviewer has even less context for why each block exists. Force splits aggressively. The rule of no PR over 400 lines matters more in 2026, not less.
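A size limit is easy to enforce mechanically. The sketch below parses the output of `git diff --numstat`, which prints added lines, deleted lines, and path separated by tabs, and shows `-` for binary files. The function names are hypothetical, and whether to count deletions or exclude generated files is a choice each team has to make; this version counts additions plus deletions and skips binaries.

```python
MAX_CHANGED_LINES = 400  # the limit quoted in the answer above


def changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


def check_size(numstat: str, limit: int = MAX_CHANGED_LINES) -> tuple[bool, int]:
    """Return (within_limit, total_changed_lines)."""
    total = changed_lines(numstat)
    return total <= limit, total


if __name__ == "__main__":
    sample = "120\t30\tsrc/app.py\n-\t-\tassets/logo.png\n300\t10\tsrc/api.py\n"
    ok, total = check_size(sample)
    print(f"{total} changed lines, within limit: {ok}")
```

In CI you would feed it the real diff, for example the output of `git diff --numstat origin/main...HEAD` (the base branch name is an assumption), and fail the job when the first element of the result is false.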
Does CODEOWNERS still work in AI-heavy teams?
Mostly, but the assumption underneath is breaking. CODEOWNERS routes review to the human who knows the area best. When that human increasingly did not write the code in their area, ownership becomes about judgement of what should be there, not familiarity with what is. The routing still helps, but the bar for what owner-review actually catches has shifted.