Amazon held an emergency engineering meeting after AI-assisted code changes triggered multiple outages — including one that took AWS down for 13 hours. Their fix? Require senior sign-off on AI-assisted changes. But adding humans to review AI output is the wrong answer. The companies getting this right are redesigning their entire engineering process around AI agents from the ground up.
On March 10, Amazon summoned its engineers to a mandatory "deep dive" meeting to address a string of outages caused by AI-assisted code changes. The Financial Times reported that the meeting invitation initially referenced "GenAI-assisted changes" and "GenAI tools" before those references were scrubbed from the document ahead of the session.
The outages themselves were not subtle. Amazon's coding agent Kiro was linked to multiple AWS outages, including one in December where it took down AWS for 13 hours after deleting and recreating part of its environment. Another outage was tied to Amazon's coding assistant Q, according to an internal document obtained by Business Insider. Amazon is now reportedly requiring senior engineers to sign off on AI-assisted changes from junior and mid-level developers.
Amazon disputes some of the reporting details. A spokesperson told American Banker that "only one incident was related to AI" and said the new senior sign-off requirement is inaccurate. But the pattern is clear regardless of how Amazon characterizes it publicly: AI-generated code is causing production failures at the largest cloud infrastructure company on the planet. And their reported fix — adding human review gates — reveals a fundamental misunderstanding of the problem.
This isn't an Amazon problem. It's an industry problem. Microsoft CEO Satya Nadella and Google CEO Sundar Pichai have both said AI writes around 30% of their new code. Some top engineers at Anthropic and OpenAI claim they no longer write code at all — they just review AI output. The volume of AI-generated code hitting production systems is accelerating faster than anyone's review infrastructure can keep up.
For small and mid-size businesses, the lesson is direct: if Amazon can't prevent AI code from taking down its own website with thousands of engineers on staff, a five-person team with a coding assistant and no structural safeguards doesn't stand a chance.
The reason AI-generated code causes outages isn't that the code is obviously bad. It's that the code is subtly wrong in ways that are hard to catch.
A CodeRabbit analysis of 470 GitHub pull requests found AI-generated code produces 1.7 times more logic issues than human-written code. Not syntax errors. Not formatting problems. The code compiles, it runs, it passes a cursory review — and then it does the wrong thing in production because the logic is flawed in ways that look correct on the surface.
Google's DORA report tracked a 9% climb in bug rates correlating with a 90% increase in AI adoption, alongside a 91% increase in code review time. The code ships faster but requires more effort to verify. The net effect for many teams is negative productivity — more code, more bugs, more review overhead, slower delivery of reliable software.
This is the vibe coding problem. The term caught on because it captures something real: AI-assisted programming plays faster and looser than traditional software development. Developers generate code at unprecedented speed, accept suggestions they haven't fully analyzed, and push changes that look right but haven't been stress-tested against edge cases.
As Forrester Principal Analyst Brent Ellis put it: "When a human operator acts, they do things with an understanding of the overall environment and knowledge of what they should and should not do. An AI however will use whatever resources it has access to in order to try to achieve the goal it is given."
That's the gap. Human engineers carry contextual knowledge about the system, its dependencies, its failure modes, and the organizational consequences of a bad deploy. AI coding tools carry none of that. They optimize for the immediate task — write this function, fix this bug, refactor this module — without understanding the larger system they're modifying.
Amazon's reported response — requiring senior engineer sign-off on AI-assisted changes — is a reasonable first step. But it's a band-aid on an architectural problem.
The bottleneck isn't that AI code doesn't get reviewed. It's that companies can now generate far more software than they can effectively review. Anthropic released a new AI code-review tool this week specifically to address this — the flood of AI-generated code is overwhelming human review capacity.
The math doesn't work. If AI coding tools generate code 10 times faster than humans, and review still happens at human speed, the review queue becomes the bottleneck. Senior engineers become gatekeepers who either slow everything down or rubber-stamp changes they can't fully evaluate. Both outcomes are bad.
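That arithmetic can be made concrete with a toy model. The 10-changes-per-day generation rate and one-review-per-day capacity below are illustrative assumptions, not measured figures:

```python
# Illustrative back-of-envelope model of the review bottleneck.
# Rates are assumptions for the sketch, not measurements.

def review_backlog(days, gen_rate=10, review_rate=1):
    """Changes generated at gen_rate/day, reviewed at review_rate/day.
    Returns the unreviewed backlog after `days` working days."""
    backlog = 0
    for _ in range(days):
        backlog += gen_rate                    # AI generates changes
        backlog -= min(backlog, review_rate)   # humans review at fixed speed
    return backlog

# At a 10:1 ratio, the queue grows by 9 changes every day:
print(review_backlog(20))  # 180 unreviewed changes after a month
```

When generation and review run at the same speed the backlog stays flat; at any ratio above 1:1 it grows without bound, which is why the gate either slows delivery or gets rubber-stamped.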
Adding more humans to review AI output is the industrial-era response to a post-industrial problem. It's like responding to the invention of the assembly line by hiring more quality inspectors. The companies that won the manufacturing revolution weren't the ones with the most inspectors — they were the ones that redesigned the entire production process around the new technology.
The same thing is happening in software engineering right now. And the companies getting it right aren't adding review gates. They're redesigning their entire development lifecycle around the reality that AI agents are writing the code.
A handful of companies are pioneering an entirely different approach. Instead of bolting AI onto existing development workflows and hoping humans catch the mistakes, they're building new engineering processes designed from the ground up for a world where AI writes most of the code.
The software industry is moving through a maturity curve that runs roughly from Level 0, where AI suggests code completions, up to Level 5, where a spec goes in, working software comes out, and no human writes or reviews the code in between.
Most companies — Amazon included — are stuck between Levels 1 and 2. They're using AI to generate code but still relying on human review to catch problems. That's the workflow that's breaking down.
The organizations making real progress are pushing toward Levels 4 and 5 — not by trusting AI more, but by building entirely new verification systems that don't depend on a human reading every line of code.
StrongDM pioneered what they call the "software factory" concept: external behavioral scenarios that test whether software does the right thing, stored separately from the code so the AI can't game its own tests.
This is a fundamental shift. Instead of reviewing code for correctness, you define the expected behavior upfront and let automated systems verify that the output matches. The human's job moves from "read this code and decide if it's right" to "define what right looks like and build the verification system."
It's the difference between a human inspector checking every widget on the assembly line versus an automated quality system that tests finished products against specifications. The latter scales. The former breaks.
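A minimal sketch of the idea in Python, with a hypothetical scenario format (StrongDM's actual implementation is not public): the scenarios live outside the codebase, so a code-writing agent cannot edit the tests it is judged against.

```python
# Behavioral scenarios stored externally to the code under test.
# Format and names are invented for illustration.

SCENARIOS = [
    # (description, input, expected behavior)
    ("rejects expired credentials", {"token": "abc", "expired": True}, "deny"),
    ("allows valid credentials",    {"token": "abc", "expired": False}, "allow"),
]

def verify(system_under_test):
    """Run every scenario against the system; return the failures."""
    failures = []
    for name, request, expected in SCENARIOS:
        actual = system_under_test(request)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

# Any implementation -- human- or AI-written -- passes or fails on
# behavior alone; nobody reads its code.
def candidate(request):
    return "deny" if request.get("expired") else "allow"

print(verify(candidate))  # [] -> no failures, the candidate ships
```

The point of the separation is that the verification system, not the reviewer's attention span, is the thing that scales.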
Anthropic — the company behind Claude — isn't just building AI tools. They're restructuring their own software development lifecycle around the assumption that AI agents will do most of the work. Their engineers increasingly focus on specification, architecture, and verification rather than implementation.
This is the direction the entire industry is heading, whether companies realize it or not. The METR study found that even experienced developers are 19% slower with AI tools while believing they're 24% faster. That gap — the J-curve of AI adoption — exists precisely because most teams are bolting AI onto workflows designed for humans writing code. The productivity gain only materializes when you redesign the workflow around AI's actual strengths and weaknesses.
The deeper pattern across all of these approaches is what's called intent engineering — encoding organizational purpose into the infrastructure that governs AI behavior.
Consider what went wrong at Amazon. Kiro had access to delete and recreate parts of its own environment. The tool was technically doing what it was asked to do. But nobody had encoded the organizational intent — "don't take down production infrastructure" — into the system's decision boundaries in a way the agent could act on.
This is the same failure pattern that cost Zillow $500 million when their iBuying algorithm optimized for price prediction accuracy instead of profitable transactions. The algorithm never "broke." It just optimized for the wrong thing because the actual business intent wasn't encoded into the system.
Intent engineering makes the real goal — not the proxy metric — explicit, structured, and machine-actionable. It's the difference between telling an AI agent "write good code" and building a system where the agent's actions are bounded by defined decision boundaries, escalation triggers, and verification checkpoints that reflect what the organization actually needs.
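A sketch of what "machine-actionable" might mean in practice. The action schema, service names, and thresholds here are invented for illustration; the point is that the boundary check runs as code before any agent action executes, rather than living as instructions in a prompt.

```python
# Hypothetical intent-as-infrastructure check: organizational
# constraints encoded as code that gates every proposed agent action.

PRODUCTION_SERVICES = {"api-gateway", "billing-db"}

def check_intent(action):
    """Return (allowed, reason); deny-and-escalate rather than guess."""
    if action["type"] == "delete" and action["target"] in PRODUCTION_SERVICES:
        return False, "deleting production infrastructure requires human escalation"
    if action.get("blast_radius", 0) > 100:  # e.g. number of customers affected
        return False, "blast radius exceeds autonomous authority limit"
    return True, "within decision boundary"

print(check_intent({"type": "delete", "target": "billing-db"}))
# (False, 'deleting production infrastructure requires human escalation')
```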
The Amazon story isn't just a cautionary tale for tech giants. It's a preview of what every business using AI tools will face if they don't adapt their processes.
The wrong move: adding more human review to your AI-assisted workflows. This doesn't scale, it burns out your best people, and it creates a false sense of security — rubber-stamped reviews are worse than no reviews because they create liability without catching problems.

The right move: redesigning your development and operational processes around three principles:
1. Structural boundaries, not behavioral instructions. Don't tell AI what not to do. Build systems where certain failure modes are architecturally impossible. Scoped permissions, sandboxed environments, defined interfaces. If an AI agent can't access production infrastructure, it can't take it down — regardless of what the code does.
2. Automated verification, not human review. Define expected behavior upfront. Build eval suites that test outcomes, not code. Store tests separately from the code so the AI can't game them. This is StrongDM's insight, and it's the only approach that scales.
3. Intent as infrastructure. Encode your organization's actual goals — not proxy metrics — into the systems that govern AI behavior. Decision boundaries, escalation triggers, authority limits, value hierarchies. When these are explicit and machine-actionable, the AI operates within your organization's intent rather than optimizing for whatever metric happens to be in front of it.
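The first principle can be sketched in a few lines. The class and environment names are illustrative; what matters is that the out-of-scope operation fails structurally, because the credentials simply don't exist, rather than behaviorally, because an instruction said no.

```python
# Sketch of a structural boundary: the agent holds no credentials for
# environments outside its scope, so "deploy to production" is not an
# instruction it can disobey -- it is an operation it cannot perform.

class ScopedAgent:
    def __init__(self, name, allowed_envs):
        self.name = name
        self.allowed_envs = frozenset(allowed_envs)  # fixed at construction

    def deploy(self, env, change):
        if env not in self.allowed_envs:
            raise PermissionError(f"{self.name} holds no credentials for {env!r}")
        return f"applied {change} to {env}"

agent = ScopedAgent("coding-agent", allowed_envs={"sandbox", "staging"})
print(agent.deploy("staging", "fix-123"))  # applied fix-123 to staging
# agent.deploy("production", "fix-123") raises PermissionError
```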
Be honest about where you sit on the Dark Factory spectrum. If you're at Level 1 (AI writes code, humans review everything), you're in the zone where Amazon just got burned. The path forward isn't staying at Level 1 with more reviewers. It's moving to Level 4, where humans define specs and verification systems, and the code itself is treated as an implementation detail.
That transition requires organizational change, not just tool adoption. It's why most companies experience the J-curve productivity dip — they're trying to get Level 4 results from a Level 1 process.
We're not writing about these practices from the outside. At Associates AI, we deploy and manage AI agent systems for small and mid-size businesses — and we're actively building these verification and feedback systems into our own engineering process. Here's what that looks like in practice.
No agent writes code until a structured spec exists. Our spec interview process produces a full PRD — acceptance criteria, constraints, known risks — before any implementation begins. The spec becomes the source of truth that every downstream verification step checks against. This is the Level 4 shift: the human's job is defining what to build, not writing or reviewing the code that builds it.
We maintain per-domain failure model files — structured YAML documents that catalog the specific ways our agents fail, with triggers, symptoms, and fixes for each pattern. Not generic "be careful" advice. Specific entries like: "when reviewing PR feedback, the agent checks review-level comments but misses inline diff comments because they're on a different API endpoint." These failure models are mined from real incidents and fed into every spec, every review, and every eval. When an agent writes code, the reviewer checks the diff against the relevant failure models for that domain — not just for correctness, but for known failure patterns.
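A sketch of how such a check might work. Our real failure models are YAML files; the structure is shown here as Python data to keep the example self-contained, and the single entry mirrors the inline-comment example above. The trigger predicates are illustrative stand-ins for the real pattern checks.

```python
# Per-domain failure model catalog (sketch). Each entry records a known
# failure pattern plus a cheap trigger check that flags diffs likely to
# reintroduce it.

FAILURE_MODELS = {
    "pr-review": [
        {
            "id": "missed-inline-comments",
            "symptom": "agent reads review-level comments but not inline diff comments",
            # Flags diffs that query issue comments without also touching
            # the pull-request endpoints.
            "trigger": lambda diff: "/comments" in diff and "/pulls/" not in diff,
            "fix": "also query the pull-request review-comments endpoint",
        },
    ],
}

def check_diff(domain, diff_text):
    """Return the known failure patterns this diff may reintroduce."""
    return [m["id"] for m in FAILURE_MODELS.get(domain, [])
            if m["trigger"](diff_text)]

print(check_diff("pr-review", '+ resp = get(f"{repo}/issues/{n}/comments")'))
# ['missed-inline-comments']
```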
Before any code gets pushed, a separate AI reviewer (running on a different model) checks the diff against the spec and the failure models. This isn't a human reading code. It's an automated verification gate that loops until the review passes clean or escalates after five iterations. The reviewing agent and the coding agent are separate — the same principle as StrongDM's external behavioral scenarios, applied to the review process itself.
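The gate's control flow can be sketched as a simple loop, with stand-in functions for the reviewer and fixer models (the function names and toy stubs here are illustrative, not our production code):

```python
# Sketch of a pre-push review gate: loop reviewer -> fixer until the
# review passes clean, or escalate to a human after a fixed budget.

MAX_ITERATIONS = 5

def review_gate(diff, spec, failure_models, ai_review, ai_fix):
    """ai_review returns a list of issues; ai_fix returns a revised diff."""
    for i in range(MAX_ITERATIONS):
        issues = ai_review(diff, spec, failure_models)
        if not issues:
            return {"status": "pass", "iterations": i + 1, "diff": diff}
        diff = ai_fix(diff, issues)  # coding agent addresses the findings
    return {"status": "escalate", "iterations": MAX_ITERATIONS, "diff": diff}

# Toy stubs: the reviewer keeps flagging the diff until "fixed"
# appears twice, standing in for two rounds of real fixes.
def toy_review(diff, spec, models):
    return [] if diff.count("fixed") >= 2 else ["unresolved issue"]

def toy_fix(diff, issues):
    return diff + " fixed"

print(review_gate("change", None, None, toy_review, toy_fix)["status"])  # pass
```

Keeping the reviewer and fixer as separate callables is what enforces the separation of the two agents at the interface level.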
Every agent skill ships with automated evaluations — behavioral test suites stored separately from the skill code. When a PR changes a skill, CI runs the eval suite automatically. Skills that fail evals don't merge. This is the software factory concept in action: define the expected behavior, test the output against it, and let the automated system be the gate. No human needs to read every line of agent code to know whether it works correctly.
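The merge gate can be sketched as a small CI check. The skill names, suite layout, and predicates below are invented for illustration; the structural point is that the evals live in their own registry, keyed by skill, and a changed skill merges only if its suite passes.

```python
# Sketch of a CI eval gate: behavioral suites stored separately from
# skill code, run automatically when a skill changes.

EVAL_SUITES = {
    "summarize-ticket": [
        # (input, predicate the skill's output must satisfy)
        ("Customer cannot log in after password reset.",
         lambda out: "log in" in out or "login" in out),
    ],
}

def run_evals(skill_name, skill_fn):
    """Return True only if every eval for the skill passes."""
    for text, predicate in EVAL_SUITES.get(skill_name, []):
        if not predicate(skill_fn(text)):
            return False
    return True

def gate_merge(changed_skills, registry):
    """CI entry point: block the merge if any changed skill fails evals."""
    return all(run_evals(name, registry[name]) for name in changed_skills)

registry = {"summarize-ticket": lambda t: "User login failure after reset."}
print(gate_merge(["summarize-ticket"], registry))  # True -> merge allowed
```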
Instead of one monolithic agent that does everything, we run specialized agents with defined responsibilities — operations, marketing, sales — each with scoped permissions and inter-agent communication protocols. A marketing agent can't modify infrastructure. A sales agent can't change deployment configs. The architecture enforces separation of concerns structurally, not through behavioral instructions the agent might ignore.
These systems layer on top of each other. A spec feeds into the failure model check, which feeds into the pre-push review, which feeds into the eval pipeline. Each layer catches a different category of problem. The result is that by the time a human engineer reviews a change, the obvious issues — spec drift, known failure patterns, behavioral regressions — have already been caught automatically. Human review time goes toward judgment calls and architectural decisions, not line-by-line code inspection.
We call this frontier operations — operating at the expanding boundary between what AI can handle reliably and what still needs human judgment. It requires constant recalibration as models improve quarterly. The boundary sensing, failure model maintenance, and seam design that make this work aren't one-time setup tasks. They're ongoing operational disciplines.
Most small businesses can't maintain this. They shouldn't have to. That's why managed agent services exist — the same way you hire an MSP for your IT infrastructure instead of building a security operations center in-house.
Amazon is adding review gates. Anthropic is releasing code review tools. StrongDM is building software factories. The entire industry is converging on the same conclusion: AI-generated output needs structural quality systems, not just human vigilance.
But most of these are retrofits — bolting safety onto systems designed for speed. The companies pulling ahead are the ones that started with safety as a structural property and built their workflows around it.
The question for every business isn't whether to use AI — that's already decided. It's whether you'll redesign your processes around AI's actual capabilities and failure modes, or keep bolting AI onto human workflows and wondering why things keep breaking.
Amazon just showed the world what happens when you choose the latter.
What are frontier engineering practices?

Frontier engineering practices are a set of emerging disciplines for building reliable AI systems. They include intent engineering (encoding organizational goals into agent infrastructure), structural safety (architectural boundaries that prevent failure regardless of AI behavior), automated behavioral verification (testing outcomes, not code), and continuous boundary sensing (recalibrating what AI can handle as models improve). Companies like StrongDM and Anthropic are pioneering these approaches.

What is the Dark Factory spectrum?

The Dark Factory spectrum describes the maturity curve of AI-assisted software development, from Level 0 (AI suggests code completions) through Level 5 (spec in, working software out, no human writes or reviews code). Most companies today are between Levels 1-2, where AI writes code but humans still review everything. The organizations seeing real productivity gains are those pushing toward Levels 4-5 by redesigning their verification systems rather than adding more human reviewers.

Why doesn't adding more human reviewers fix the problem?

Because AI generates code faster than humans can review it. If AI tools produce code 10x faster and review still happens at human speed, the review queue becomes an impossible bottleneck. Senior engineers either slow everything down or start rubber-stamping changes they can't fully evaluate. The fix is automated verification systems that test behavioral outcomes — an approach that scales with AI output volume.

What is intent engineering?

Intent engineering encodes organizational purpose into the infrastructure governing AI behavior — decision boundaries, escalation triggers, authority limits, and value hierarchies. It's the difference between telling an AI "write good code" and building a system where the AI's actions are structurally bounded by what the organization actually needs. Without it, AI optimizes for proxy metrics (like Zillow's price prediction) instead of real business goals (like profitable transactions).

How does Associates AI apply these practices?

We're building toward the upper end of the Dark Factory spectrum. Specs are written before code. Per-domain failure model libraries track specific agent failure patterns from real incidents. A separate AI model reviews every diff against the spec and failure models before code is pushed. Behavioral eval suites run in CI on every skill change — skills that fail evals don't deploy. Multi-agent architectures enforce separation of concerns structurally, not through behavioral instructions. Each of these systems catches problems before they reach the human reviewer — so when an engineer does review a change, they're focused on architecture and judgment, not hunting for bugs the automated systems should have caught. That's the managed service model: you get a team building and running these verification systems without having to develop the operational discipline in-house.
Associates AI deploys managed AI agent systems built on frontier engineering practices — structural safety, automated verification, and intent engineering. If you're ready to move beyond "AI with more reviewers," let's talk.
Written by Mike
Founder, Associates AI
Mike is a self-taught technologist who has spent his career proving that unconventional thinking produces the most powerful solutions. He built Associates AI on the belief that every business — regardless of size — deserves AI that actually works for them: custom-built, fully managed, and getting smarter over time. When he's not building agent systems, he's finding the outside-of-the-box answer to problems that have existed for generations.