Why AI-generated code passes review but breaks your codebase
Here is the uncomfortable thing about AI-generated code: most of the problems that hurt you never fail a check. The code compiles. The tests pass. The linter is green. A good engineer reads the pull request and approves it. And yet, commit by commit, the codebase gets harder to follow, less consistent, more fragile. Nothing broke loudly. It eroded quietly.
If you doubt that "passing code" can be a problem, good. This post is for you. The point isn't that AI writes bad code. It's that every check we trust looks at one change on its own, and the damage lives between files, not inside them.
Is AI code good? It depends what you measure
Ask "is AI code good?" and the honest answer is: yes, locally, almost always. A modern coding agent writes a function that is clean, readable, and correct for the task in front of it. The trouble is that the task in front of it is a narrow slice. Each session starts fresh, with no memory of the dozens of decisions already baked into the code around it.
So the agent makes reasonable choices. But a reasonable choice isn't the only choice. There are five fair ways to handle an error, three to fetch a user, two response shapes. Pick fresh each session and you don't get wrong code. You get a codebase that does the same thing five different ways. That spread is drift, and it compiles and passes review because each piece, on its own, is fine.
Why human review misses it
Reviewers are good at catching bugs in a diff. They are built to miss drift, and it's not a discipline problem. Three things work against them:
- The diff is the unit of review. A reviewer judges the fifteen lines that changed, not the fifty files that came before. They can answer "is this change sound?" The question that matters for drift is "does this match how we already do it?", and that answer isn't in the diff.
- Each clash looks fine on its own. When a new route validates input differently from the rest, nothing about it looks wrong. You'd have to hold every other route in your head at once to notice it's the odd one out. Nobody does.
- Speed hides it. AI lets you merge more, faster. The reviewer sees more diffs each week and has less time to weigh each one against the whole. The thing that makes AI fast is the thing that makes drift invisible.
And the linter won't save you, because a linter checks syntax and style, not behavior. It can force you to use single quotes. It can't know that your team always wraps data access in a repository layer, or that every signed-in route should carry the same middleware. Those are conventions, not rules, and conventions are exactly where AI drifts.
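To make the gap concrete, here is a minimal sketch of what a convention check has to do that a style rule does not: compare each route against its siblings. The route table, the middleware names (requireAuth, validateBody), and the findRoutesMissing helper are all hypothetical, invented for illustration; this is not how any particular tool implements it.

```javascript
// Hypothetical route table; in a real app this would be read from the router.
const routes = [
  { path: "/account", middleware: ["requireAuth", "validateBody"] },
  { path: "/orders", middleware: ["requireAuth"] },
  { path: "/invoices", middleware: ["requireAuth"] },
  { path: "/export", middleware: [] }, // added in a later, fresh session
];

// Return the paths that lack a middleware most of their siblings use.
function findRoutesMissing(routes, name, threshold = 0.5) {
  const withIt = routes.filter((r) => r.middleware.includes(name));
  // Only treat it as a convention if a clear majority already follows it.
  if (withIt.length / routes.length <= threshold) return [];
  return routes
    .filter((r) => !r.middleware.includes(name))
    .map((r) => r.path);
}

console.log(findRoutesMissing(routes, "requireAuth")); // [ '/export' ]
console.log(findRoutesMissing(routes, "validateBody")); // []
```

The check needs the whole route table as input, which is exactly the information a diff of one new route does not contain.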
What the failure actually looks like
Drift isn't abstract. It shows up as a few concrete, repeatable patterns:
- Clashing architecture. Half the code goes through a service layer; the new half hits the database straight from the handler. Both work. Together they mean there's no single place to add a cache, an audit log, or a permission check.
- Missing auth on some routes. The agent adds a new endpoint and, never having seen the convention, leaves out the middleware every sibling route uses. It compiles, returns data, passes a happy-path test, and quietly exposes something it shouldn't.
- Duplicate code. A fresh session can't find the existing formatCurrency helper, so it writes a new one that rounds a little differently. Now you have two, and they disagree on the third decimal place.
- Split conventions. Errors are thrown in some files, returned as result objects in others, swallowed and logged in a third. Callers can no longer assume how failure travels.
- Mismatched response shapes. One endpoint returns { data, error }; the next returns the bare object; the one after returns an array. Every caller needs its own unwrapping step, and the type system can't warn you because each shape is valid on its own.
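The duplicate-helper case is easy to reproduce. The sketch below assumes two hypothetical implementations: formatCurrency rounds to the nearest cent, while a later near-copy (called formatPrice here, an invented name) truncates. Both bodies are illustrative guesses at what such helpers might look like, not code from a real project.

```javascript
// Hypothetical helper from an early session: rounds to the nearest cent.
function formatCurrency(amount) {
  return `$${amount.toFixed(2)}`;
}

// Hypothetical near-copy from a later session: truncates instead of rounding.
function formatPrice(amount) {
  const cents = Math.floor(amount * 100);
  return `$${(cents / 100).toFixed(2)}`;
}

// The two agree on tidy inputs...
console.log(formatCurrency(19.5), formatPrice(19.5)); // $19.50 $19.50

// ...and disagree as soon as the third decimal matters.
console.log(formatCurrency(19.999), formatPrice(19.999)); // $20.00 $19.99
```

A happy-path test on either helper passes. The disagreement only shows up when the same amount flows through both.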
Concretely, two functions written in two sessions, each fine on its own:
// session 1: early in the project
async function getUser(id) {
  const user = await db.users.findById(id);
  // failure travels as an exception
  if (!user) throw new NotFoundError("user");
  return user;
}

// session 2: three weeks later, same repo, fresh context
async function getOrder(id) {
  const order = await db.orders.findById(id);
  // failure travels as a returned value
  if (!order) return { error: "not found", order: null };
  return { error: null, order };
}

Neither diff would be rejected in review. But a caller now has to know, function by function, whether failure throws or returns a sentinel value. Do that across a few hundred functions and "how does this codebase handle errors?" stops having an answer.
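Here is a sketch of what that costs the caller. The in-memory db object, the NotFoundError class, and the loadOrderSummary function are stand-ins invented so the example runs on its own; only getUser and getOrder mirror the two functions above.

```javascript
class NotFoundError extends Error {}

// In-memory stand-in for the data layer (hypothetical, for illustration only).
const db = {
  users: { findById: async (id) => (id === 1 ? { id: 1, name: "Ada" } : null) },
  orders: { findById: async (id) => (id === 7 ? { id: 7, total: 42 } : null) },
};

async function getUser(id) {
  const user = await db.users.findById(id);
  if (!user) throw new NotFoundError("user");
  return user;
}

async function getOrder(id) {
  const order = await db.orders.findById(id);
  if (!order) return { error: "not found", order: null };
  return { error: null, order };
}

// One caller, two lookups, two different failure checks.
async function loadOrderSummary(userId, orderId) {
  let user;
  try {
    user = await getUser(userId); // failure arrives as an exception
  } catch (err) {
    if (err instanceof NotFoundError) return { ok: false, reason: "user" };
    throw err;
  }
  const { error, order } = await getOrder(orderId); // failure arrives as a value
  if (error) return { ok: false, reason: "order" };
  return { ok: true, user, order };
}

loadOrderSummary(1, 99).then((result) => console.log(result));
// { ok: false, reason: 'order' }
```

Forget the try/catch and a missing user crashes the request. Forget the error check and a missing order flows onward as null. Neither mistake is visible from the call site.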
So how good is AI code, really?
The fair verdict on AI code quality is that it's strong on what we measure and blind on what decays. We gate on per-change correctness, and AI is great at per-change correctness. We don't gate on whether files agree with each other, and that's exactly where fresh-every-time generation pulls a codebase apart.
This is why "the tests pass" means less than it used to. Tests check the behavior you planned for. Drift is the behavior you didn't: the second way of doing a thing, sitting quietly next to the first. You can have full coverage and a messy, inconsistent codebase at the same time, and most teams shipping AI code at speed already do.
How to measure cross-file drift
You can't fix what you can't see, and the diff view will never show you drift, because drift is a property of the whole repo. So measure the whole repo. A local scan reads your code and scores how consistent it is:
npx @vibedrift/cli .
It runs in about two seconds, all on your machine, and your code never leaves it. The output is a Vibe Drift Score from 0 to 100 with a letter grade, plus findings that tell you where the trouble is, not just that it exists. Five detectors do the work:
- Architecture: are layers and boundaries kept, or does new code route around them?
- Security: do matching routes apply the same auth and validation, or did some get skipped?
- Redundancy: is there a near-copy of something that already exists?
- Conventions: does this file match the common pattern of its peers?
- Scaffolding: are stubs, dead branches, and half-finished code cluttering the tree?
Each finding names the common pattern, lists the files that break from it, and suggests the fix, so a number turns into a to-do list. For a deeper read, vibedrift . --deep runs a Claude-checked semantic analysis against your deep-scan budget. There's more on how the score works in the Vibe Drift Score deep dive, and a fuller treatment of why this matters in "your AI codebase is drifting".
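To show the general idea of "common pattern plus the files that break from it", here is a toy sketch. It is not VibeDrift's algorithm or scoring: the observations array, the style labels, the summarizeConvention function, and the agreement percentage are all assumptions made up for illustration.

```javascript
// Hypothetical input: the error-handling style observed in each file.
const observations = [
  { file: "users.js", style: "throw" },
  { file: "teams.js", style: "throw" },
  { file: "billing.js", style: "throw" },
  { file: "orders.js", style: "result-object" },
  { file: "reports.js", style: "log-and-swallow" },
];

function summarizeConvention(observations) {
  if (observations.length === 0) {
    return { dominant: null, deviating: [], agreement: 100 };
  }
  // Count how many files use each style.
  const counts = new Map();
  for (const { style } of observations) {
    counts.set(style, (counts.get(style) ?? 0) + 1);
  }
  // The dominant pattern is the style used by the most files.
  const [dominant] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  const deviating = observations
    .filter((o) => o.style !== dominant)
    .map((o) => o.file);
  // Share of files that follow the dominant pattern, as a percentage.
  const agreement = Math.round(
    (100 * (observations.length - deviating.length)) / observations.length
  );
  return { dominant, deviating, agreement };
}

console.log(summarizeConvention(observations));
// { dominant: 'throw', deviating: [ 'orders.js', 'reports.js' ], agreement: 60 }
```

The useful part of the output is the deviating list, since it names the files to fix.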
How to prevent it in the loop
Detection tells you what already drifted. The better fix is to stop the agent from drifting at all, which means giving it the context it lacks before it writes. That's what the VibeDrift MCP server does. It's free, open source, local, and needs no login:
claude mcp add vibedrift -- npx -y @vibedrift/cli mcp
Once it's wired in, the agent can check the codebase's own conventions while it works, through five tools: get_dominant_pattern to learn how the repo already handles something, find_similar_function so it reuses instead of reinventing, check_file_drift to see if a file already strays, validate_change to test whether a change would add drift, and get_intent_hints to read the conventions the team wrote in CLAUDE.md or .cursorrules. The agent stops guessing at a convention and starts matching the one you already have.
We wrote up the move from catching drift after the fact to stopping it as the agent works in "how to stop Claude Code drift". The short version: the best place to keep code consistent is the moment it's written, not the moment it's reviewed.
The takeaway
AI-generated code passing review doesn't mean everything is fine. It means your checks only look at the diff, while the cost piles up across the repo. The fix isn't to trust AI less or review harder. It's to add the one check your pipeline is missing (whether files agree with each other) and to give your agent the context to stay consistent on its own. Start by scanning what you have:
npx @vibedrift/cli .
It's free and unlimited, runs locally in seconds, and tells you exactly where your AI-generated code stopped agreeing with itself. See pricing for deep scans and team plans.
Frequently asked questions
Why does AI-generated code pass review but still break the codebase?
Review judges a diff in isolation, so it confirms the change is internally sound. It rarely checks whether the change is consistent with the fifty files written before it. AI-generated code tends to be locally reasonable but globally inconsistent: each session picks a slightly different convention, and those small divergences accumulate into drift the diff view can't show you.
Why doesn't a linter catch it?
A linter checks syntax and style rules: formatting, unused variables, banned constructs. It has no model of your project's behavioral conventions, so it can't tell that one route forgot the auth middleware everyone else uses, or that a new function reimplements an existing helper with different error handling. Those are exactly the AI code quality problems that pass lint cleanly.
How do I measure cross-file drift?
Run a local scan from your repo with this command: npx @vibedrift/cli . (the trailing dot is part of the command). The scan analyzes the repo across architectural consistency, security posture, redundancy, convention adherence, and scaffolding hygiene, then returns a Vibe Drift Score from 0 to 100 with a grade, the dominant pattern, the files that deviate, and the fix. It runs in about two seconds, your code never leaves your machine, and it's free.
Can I stop the agent from drifting in the first place?
Yes. VibeDrift ships a free, open-source MCP server you add to your agent. Before it writes, the agent can ask the codebase what the dominant pattern is, whether a similar function already exists, and whether a proposed change would introduce drift, so it conforms in the first place rather than being corrected after the fact.
What does it cost?
Local scans and the MCP tools are free and open source, forever. The free tier includes 1 deep scan a month; Pro is $15/mo for 12, and you can top up 5 more for $10 on any plan. Credits never expire.