AI Agents Will Break Any Rule You Don't Test
Most teams give their AI agents an AGENTS.md, a CLAUDE.md, or a Cursor rule full of polite architectural guidance. Layers, boundaries, where secrets live, what may import from what. Three weeks later the codebase is spaghetti reaching across every line they wrote down, and they wonder why the agent ignored them. Documents do not enforce architecture. Tests do. The fix is small, language-agnostic, and unforgiving: take your most important architecture rules, write them as a deterministic test, wire it into a pre-commit hook, and let CI run it again. Now the agent literally cannot finish the work if it breaks the architecture.

The Document Does Not Enforce
Read any popular agentic coding setup and you will see the same playbook. A markdown file at the root of the repo. Sections like “Architecture”, “Module Layout”, “Do Not Cross These Boundaries”. A rule that says “agents must not import from clients in this codebase, only from interfaces.”
The agent reads it once. Maybe twice. Then a hundred thousand tokens go by, the context window churns, and the rule slides out of attention. Or the agent technically remembers but optimizes for the local task. Or a different agent on a different branch never had your rules in scope to begin with. By the time you notice, you have cross-layer imports, secrets read from process.env in five places, and a domain layer that calls HTTP clients directly.
This is the bloat-and-mess pattern that scares people off agents. Blaming the model misses what is actually happening — the model did what models do, which is generate plausible code, and nothing in the pipeline was checking the architecture. A rule that lives in prose is a suggestion. A rule that fails the build is a wall.
Make the Rule Executable
The move is simple. Pick the architecture rules you actually care about. Express each one as a script that returns non-zero on violation. Wire that script into your test runner.
Two rules cover most of the damage I see in real codebases:
- Dependency direction. Layer X may not import from layer Y.
- Capability allowlists. Only these specific files may touch this dangerous primitive (`process.env`, `fs`, `fetch`, raw SQL, the credential broker, whatever).
Both are trivially expressible as a file walk plus a regex. No AST, no fancy linter plugin, no language buy-in. Here is the entire enforcement layer for a Bun project I run, in one test file. The code is unglamorous on purpose:
```ts
import { describe, expect, test } from "bun:test";
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join, relative } from "node:path";

const SRC = join(import.meta.dir, "../src");

function collectTsFiles(dir: string): string[] {
  const files: string[] = [];
  for (const entry of readdirSync(dir)) {
    const full = join(dir, entry);
    if (statSync(full).isDirectory()) {
      files.push(...collectTsFiles(full));
    } else if (entry.endsWith(".ts") && !entry.endsWith(".d.ts")) {
      files.push(full);
    }
  }
  return files;
}

function importsFrom(source: string, target: string): string[] {
  const lines = source.split("\n");
  return lines.filter((l) => /^\s*(import|export)\s/.test(l) && l.includes(`/${target}/`));
}

function processEnvUsages(source: string): string[] {
  return source
    .split("\n")
    .filter((l) => !l.trimStart().startsWith("//") && l.includes("process.env"));
}

describe("dependency boundaries", () => {
  test("agents/ must not import from clients/", () => {
    const agentFiles = collectTsFiles(join(SRC, "agents"));
    const violations: string[] = [];
    for (const file of agentFiles) {
      const rel = relative(SRC, file);
      const source = readFileSync(file, "utf-8");
      for (const line of importsFrom(source, "clients")) {
        violations.push(`${rel}: ${line.trim()}`);
      }
    }
    expect(violations).toEqual([]);
  });
});

describe("process.env access", () => {
  const ALLOWED = new Set(["config.ts", "instrument.ts"]);

  test("only allowlisted files may use process.env", () => {
    const allFiles = collectTsFiles(SRC);
    const violations: string[] = [];
    for (const file of allFiles) {
      const rel = relative(SRC, file);
      if (ALLOWED.has(rel)) continue;
      const source = readFileSync(file, "utf-8");
      for (const line of processEnvUsages(source)) {
        violations.push(`${rel}: ${line.trim()}`);
      }
    }
    expect(violations).toEqual([]);
  });
});
```
describe("process.env access", () => {
const ALLOWED = new Set([
"config.ts",
"instrument.ts",
]);
test("only allowlisted files may use process.env", () => {
const allFiles = collectTsFiles(SRC);
const violations: string[] = [];
for (const file of allFiles) {
const rel = relative(SRC, file);
if (ALLOWED.has(rel)) continue;
const source = readFileSync(file, "utf-8");
for (const line of processEnvUsages(source)) {
violations.push(`${rel}: ${line.trim()}`);
}
}
expect(violations).toEqual([]);
});
});
That is it. Two tests. About fifty lines. No frameworks. The violation list is the error message — the agent gets handed every file and line that broke the rule, in plain text it can act on.
The Loop Is the Point

Writing the test is half the move. The other half is wiring it into a place the agent cannot route around.
- Pre-commit hook. `bun test` (or `pytest`, or `go test`, or `dotnet test`) runs before any commit lands. Configure the hook so the architecture tests are not optional, not skippable, not separable from the rest of the suite.
- CI on every push. Same tests, same exit code. The agent cannot pretend the rule does not exist by committing past a broken local hook.
- A nice failure message. When the test fails, the output should read like a coach, not a stack trace: `agents/foo.ts: import { Bar } from "../clients/bar"` followed by `agents/ must not import from clients/`. The agent reads that, finds the offending line, and fixes it. Loop closed.
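A coach-style message is a few lines of string formatting. A minimal sketch — the helper name `formatViolations` and the fix hint are illustrative, not from the test file above:

```ts
// Hypothetical helper: turns a violation list into a failure message the
// agent can act on. Rule text and fix hint are illustrative.
function formatViolations(rule: string, violations: string[]): string {
  return [
    `Architecture rule broken: ${rule}`,
    ...violations.map((v) => `  ${v}`),
    "Fix: depend on src/interfaces/ instead and let it wrap the client.",
  ].join("\n");
}

const message = formatViolations("agents/ must not import from clients/", [
  'agents/foo.ts: import { Bar } from "../clients/bar"',
]);
console.log(message);
```

In the test itself, replace the bare `expect` with `if (violations.length > 0) throw new Error(formatViolations(...))` so this text is exactly what the agent reads on failure.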
This turns architecture from a property the agent has to remember into a property the system enforces. The agent runs the tests because the loop forces it to. The test fails. The agent reads the violation. The agent fixes the import. The agent commits. Discipline is now a property of the pipeline, not the model.
You stop trusting the agent. You trust the failure.
Primitive Now, Proper Later
Regex on import lines is a hack and I do not pretend otherwise. It will miss creative obfuscations. It does not understand re-exports. It cannot distinguish a comment from code in every dialect. None of that matters for the use case. Agents do not write creative obfuscations. They write the most obvious shape of the code they are asked for, every time. A regex that catches the obvious shape catches the agent.
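To make that concrete, here are the obvious shapes next to the creative ones, run through the same line check as `importsFrom` above (the sample lines are mine):

```ts
// The same line-level check the boundary test uses.
const rule = (l: string) =>
  /^\s*(import|export)\s/.test(l) && l.includes("/clients/");

// The obvious shapes — what agents actually emit.
const caught = [
  'import { Bar } from "../clients/bar";',
  'export { Baz } from "./clients/baz";',
];

// The creative shapes the regex misses: dynamic import, aliased re-export.
const missed = [
  'const { Bar } = await import("../clients/bar");',
  'import { Bar } from "./indirection";',
];

console.log(caught.every(rule)); // true — both obvious shapes flagged
console.log(missed.some(rule)); // false — both evasions slip through
```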
When the codebase grows up, swap the implementation, keep the rule:
- C# / .NET. Roslyn analyzers, NetArchTest, ArchUnitNET.
- Java / Kotlin. ArchUnit. The original. Still the gold standard.
- TypeScript / JavaScript. ts-morph, eslint-plugin-boundaries, dependency-cruiser, madge for cycle detection.
- Python. import-linter, pyflakes plugins, custom AST walks via `ast`.
- Go. `go vet` plus custom analyzers, depguard.
- Rust. cargo-deny, custom clippy lints.
These let you express richer rules: no cyclical dependencies, no public field access across module boundaries, this layer is sealed except through this interface, no calls into the database from anywhere except the repository module. Use them when the project deserves them. Until then, ship the regex test. Ugly enforcement beats elegant aspiration.
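One of those richer rules needs no framework either: cycle detection is a depth-first walk over the import graph. A sketch with a hard-coded graph and illustrative module names — in a real test you would build the graph from the same file walk as the boundary tests:

```ts
// Module-import graph: path → list of paths it imports.
type Graph = Record<string, string[]>;

// Returns the first cycle found as a path, or null if the graph is acyclic.
function findCycle(graph: Graph): string[] | null {
  const visiting = new Set<string>();
  const done = new Set<string>();
  const stack: string[] = [];

  function visit(node: string): string[] | null {
    if (done.has(node)) return null;
    if (visiting.has(node)) {
      // Back-edge: slice the current path from where the cycle starts.
      return [...stack.slice(stack.indexOf(node)), node];
    }
    visiting.add(node);
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

const graph: Graph = {
  "agents/planner.ts": ["interfaces/llm.ts"],
  "interfaces/llm.ts": ["clients/openai.ts"],
  "clients/openai.ts": ["agents/planner.ts"], // illegal back-edge
};
console.log(findCycle(graph)); // the cycle, ending where it started
```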
Why This Matters More With Agents Than With Humans
Architecture testing is not new. ArchUnit has been around for years, NDepend longer than that, and a small percentage of disciplined teams have always shipped this layer. With humans, it was a nice-to-have. Code review caught most violations. Senior engineers internalized the rules. Onboarding transmitted them.
Code review is dying. Onboarding does not apply to a model that loses its memory at end-of-session. Senior engineers are not in the loop on every line anymore. The mechanisms that used to hold architecture together quietly are gone. The new loop is agent generates → tests run → CI gates → human reviews product, not code. If your architecture is not in the test layer, it is nowhere.
This is the same insight as putting secrets behind a broker instead of in environment variables, or hiring Operators instead of trusting an autonomous agent’s judgment. Mature systems do not ask the model to be careful. They make carelessness expensive.
Set the Initial Architecture, Then Lock the Doors
Two practical takeaways for anyone shipping with agents.
First, set the initial architecture yourself. This is not optional. Agents are excellent at filling in scaffolding and terrible at choosing which scaffolding to use. Lay down the directories, name the layers, write a one-paragraph description of what each layer does and what it may depend on. That is your specification, and it is the cheapest insurance you will ever buy.
Second, lock the doors. Pick the three or four rules that, if violated, would create the most damage. Write a test for each. Run the suite in pre-commit and CI. Make the failure message readable. Now the agent cannot ship a violation, you cannot ship a violation, and you stop having to manually police the structure. The codebase stays clean because the build refuses to be dirty.
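Those three or four rules can share one driver, so adding the next rule is one line in a table rather than a new test. A sketch, with illustrative layer names and an in-memory file map standing in for the file walk:

```ts
// Hypothetical rule table: each entry bans imports from one layer into another.
type Rule = { from: string; mayNotImport: string };

const RULES: Rule[] = [
  { from: "agents", mayNotImport: "clients" },
  { from: "domain", mayNotImport: "http" },
];

// Same line check as the boundary test; `files` maps repo-relative path → source.
function checkBoundaries(rules: Rule[], files: Record<string, string>): string[] {
  const violations: string[] = [];
  for (const { from, mayNotImport } of rules) {
    for (const [path, source] of Object.entries(files)) {
      if (!path.startsWith(`${from}/`)) continue;
      for (const line of source.split("\n")) {
        if (/^\s*(import|export)\s/.test(line) && line.includes(`/${mayNotImport}/`)) {
          violations.push(`${path}: ${line.trim()} (${from}/ may not import ${mayNotImport}/)`);
        }
      }
    }
  }
  return violations;
}

const violations = checkBoundaries(RULES, {
  "agents/foo.ts": 'import { Bar } from "../clients/bar";',
  "domain/order.ts": 'import { get } from "../http/get";',
  "interfaces/llm.ts": 'import { OpenAI } from "../clients/openai";', // allowed
});
console.log(violations.length); // 2
```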
A rule that lives in prose is a suggestion. A rule that fails the build is a wall.
You do not need to trust the agent. You need to make it impossible to commit the wrong thing.