A Harvard Business Review survey and an ECI report landed in the same week with the same finding: SMBs are deploying AI fast and seeing slow results. The missing piece isn't the technology. It's the operational discipline that starts the day after launch.
Two reports landed within days of each other this month, and they tell the same story from different angles.
On March 9, Harvard Business Review Analytic Services published a survey sponsored by TriNet of 230 SMB leaders. The headline: 76% expect to increase AI use in the next 12 months. The buried finding: only 19% feel their organization is highly prepared to recruit or develop the AI skills needed to make it work. And 56% expect difficulty even determining which AI skills they actually need.
Three days later, ECI Software Solutions released its AI Readiness Report based on a survey of 550+ SMB leaders across the U.S., Canada, and Australia. More than 70% hold a positive view of AI. Nearly 40% say they have not yet seen measurable results from their AI initiatives. The top barriers: lack of in-house expertise, data readiness, and "clarity on where to begin."
Read those numbers together. Three out of four SMBs are bullish on AI and planning to deploy more of it. Four out of ten can't point to a single measurable result from what they've already deployed. The gap between enthusiasm and outcomes is not closing. It's widening.
The standard explanation is that these businesses picked the wrong tools or started with the wrong use case. That's sometimes true, but it's not the interesting diagnosis. The more useful one: most businesses treat AI deployment as a project with a finish line. Install the agent, configure the integrations, write the prompts, flip it on, move to the next initiative. The work is "done."
It's never done. And that misunderstanding is where the 40% failure rate lives.
Setting up an AI agent takes hours. Getting it into production — with proper credentials, integrations, and a working prompt — takes days, maybe a week or two for something complex. Most platforms have made this genuinely straightforward.
The hard part starts on day two.
The model your agent runs on will update. Maybe the provider ships a new version with better reasoning but subtly different behavior on edge cases. Maybe they deprecate the version you tested against. The agent that worked perfectly last month now handles customer refund requests differently — not wrong exactly, but not the way your team verified and approved.
Your business processes change. A new product launches, a pricing structure shifts, a compliance requirement tightens. The agent doesn't know. It's still operating against last quarter's context, making decisions based on information that's no longer accurate.
The failure modes evolve. Every model generation fails differently. The previous version might have been overly cautious, declining to act when it should have. The new version might be overconfident, taking actions it shouldn't without flagging uncertainty. If your verification processes were calibrated for the old failure pattern, they'll miss the new one entirely.
None of this shows up in the deployment phase. It shows up in weeks three through fifty-two, when the agent is running in production and nobody is watching it with the same intensity they had during launch.
The TriNet survey found that 70% of SMB leaders value human capabilities like "creativity, intuition, and discernment" alongside AI tools. That's the right instinct. But it's vague enough to be useless without specifics. Here are three concrete operational skills that separate the 60% seeing results from the 40% that aren't.
Every AI agent has a reliability boundary — tasks it handles well and tasks where it starts producing subtle errors. That boundary is not static. It moves with every model update, every change to your business context, and every shift in the types of requests the agent encounters.
Most businesses calibrate this once, during deployment. They test the agent against a set of scenarios, confirm it works, and move on. Six months later, the model has been updated twice, the business has added three new products, and nobody has re-tested whether the agent still handles those scenarios correctly.
The operational discipline is recalibrating that boundary regularly. In practice, this means running evaluation suites after every model update — not just checking "does it still work?" but checking "does it work the same way, on the same edge cases, with the same judgment calls?" Tools like promptfoo exist specifically for this. They let you define expected behaviors as test cases and run them automatically whenever your agent's foundation changes.
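To make "expected behaviors as test cases" concrete, here is a minimal behavioral eval harness sketched in Python. Everything in it is hypothetical — the `stub_agent`, the case names, and the pass criteria stand in for your real agent call and your real scenarios — but the shape is the point: each case pairs an input with a predicate over the output, so the whole suite can be re-run mechanically after every model update.

```python
# Minimal behavioral eval harness (a sketch; all names are hypothetical).
# Each case pairs an input with a predicate over the agent's output, so
# the suite can be re-run unchanged after every model update.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output is within bounds


def run_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against the agent and report pass/fail per case."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


# Stub standing in for a real agent call -- replace with your deployment.
def stub_agent(prompt: str) -> str:
    if "refund" in prompt:
        return "I can start a refund, but a human will confirm the amount."
    return "I'm not sure; escalating to a human."


cases = [
    EvalCase(
        name="refund_requires_human_confirmation",
        prompt="Customer asks for a refund on order 1234.",
        check=lambda out: "human" in out.lower(),
    ),
    EvalCase(
        name="unknown_requests_escalate",
        prompt="Customer asks about something unprecedented.",
        check=lambda out: "escalat" in out.lower(),
    ),
]

results = run_suite(stub_agent, cases)
```

In practice the predicates would encode the judgment calls your team verified at launch; a dedicated tool like promptfoo adds richer scoring on top of the same idea.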
A quarterly boundary review — sitting down, looking at what the agent handled well and what it fumbled, and adjusting its scope accordingly — is worth more than any feature you'll add to the agent in that same quarter.
Maintaining a failure model is different from knowing where the boundary sits. The boundary tells you what the agent should and shouldn't do. The failure model tells you how it breaks when it operates within its boundary.
Current-generation language models fail in specific, textured ways that change with each model version. They produce analysis that sounds authoritative but rests on a misunderstood premise. They write code that compiles and passes basic tests but breaks on edge cases the model didn't anticipate. They generate customer-facing responses that are 98% accurate, with the remaining 2% stated just as confidently as the rest.
The operational discipline here is documenting and updating your failure model by task type. "For customer service inquiries about billing disputes, the agent tends to over-promise resolution timelines." "For code review, the agent catches formatting and style issues reliably but misses logic errors involving concurrent state." "For data analysis, the agent's summaries are accurate but it occasionally invents supporting statistics."
These aren't generic observations. They're specific, current, testable claims about how your particular agent fails on your particular workload. When a new model version ships, you update them. When you expand the agent's scope, you build new ones. This living document becomes the foundation for every verification check your team performs.
Without it, your team defaults to either reviewing everything at the same depth — which doesn't scale — or trusting everything equally, which is how a confidently wrong AI output ends up in a client deliverable.
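A failure model is most useful when it is structured enough to drive verification directly. Here is one way to encode it — the record fields and every entry below are illustrative assumptions, not a standard schema — so that each documented failure mode carries the review action it implies.

```python
# A machine-readable failure model (a sketch; field names and entries
# are illustrative, not a standard). Each record is a specific, testable
# claim about how the agent fails on one task type, plus the review
# action that claim implies for a human checker.
from dataclasses import dataclass


@dataclass
class FailureMode:
    task_type: str         # e.g. "billing_dispute", "data_analysis"
    observed_failure: str  # what actually goes wrong
    model_version: str     # which model version the observation applies to
    review_action: str     # what a reviewer should check because of it


FAILURE_MODEL = [
    FailureMode(
        task_type="billing_dispute",
        observed_failure="over-promises resolution timelines",
        model_version="2025-03",
        review_action="verify any quoted timeline against policy",
    ),
    FailureMode(
        task_type="data_analysis",
        observed_failure="occasionally invents supporting statistics",
        model_version="2025-03",
        review_action="trace every cited number to source data",
    ),
]


def review_checklist(task_type: str) -> list[str]:
    """Return the review actions a human should apply for a task type."""
    return [m.review_action for m in FAILURE_MODEL if m.task_type == task_type]
```

Because each entry names the model version it was observed under, a model update makes stale entries visible instead of silently wrong.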
When an agent produces 50 outputs a day, someone needs to decide which five get deep human review, which twenty get a quick scan, and which twenty-five pass through automated checks only. That allocation decision is itself a skill, and it's the one that determines whether your human team is adding value or just rubber-stamping AI output.
The wrong approach: review everything at the same depth. It's unsustainable, it burns out your team, and it trains people to skim rather than evaluate because the volume is too high for genuine engagement.
The right approach: triage by risk and consequence. Customer-facing outputs get deeper review than internal summaries. Financial calculations get verified against source data. Novel requests — ones the agent hasn't encountered patterns for — get flagged for human judgment before the agent responds.
This triage model needs to be explicit, documented, and updated as the agent's capabilities change. When the agent gets better at a task category, you can lower the review threshold for that category and redirect attention to areas where it's still developing. When a new failure mode emerges, you tighten review on the affected outputs.
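An explicit triage model can be as small as one function. The risk signals and thresholds below are assumptions for illustration — yours would come from your own failure model — but making the rule executable is what keeps it documented, reviewable, and easy to tighten or loosen as the agent's capabilities change.

```python
# An explicit review-triage rule (a sketch; the risk signals and the
# depth assignments are assumptions, not a standard). Mapping each
# output's risk profile to a review depth directs human attention
# where it matters instead of spreading it evenly.
from dataclasses import dataclass


@dataclass
class AgentOutput:
    customer_facing: bool
    involves_money: bool
    novel_request: bool  # no established pattern for this request type


def review_depth(out: AgentOutput) -> str:
    """Return 'deep', 'scan', or 'automated' for one output."""
    if out.novel_request or out.involves_money:
        return "deep"       # full human judgment before anything ships
    if out.customer_facing:
        return "scan"       # quick human pass
    return "automated"      # automated checks only


# A day's worth of outputs, triaged in one pass.
daily = [
    AgentOutput(customer_facing=True, involves_money=True, novel_request=False),
    AgentOutput(customer_facing=True, involves_money=False, novel_request=False),
    AgentOutput(customer_facing=False, involves_money=False, novel_request=False),
]
depths = [review_depth(o) for o in daily]
```

When the agent earns trust in a category, you relax one branch of this function in a reviewed change — rather than letting individual reviewers quietly drift.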
The ECI report identified "lack of in-house expertise" as the top barrier to AI results. This is the expertise they're missing. Not expertise in configuring AI tools — that's a one-time learning curve. Expertise in continuously operating AI systems so they stay reliable, safe, and aligned with what the business actually needs.
Abstract principles are useful. Concrete routines are better. Here's what a disciplined agent operations practice looks like week to week.
Every agent capability — every "skill" in OpenClaw terms — should have a corresponding set of evaluation tests. These aren't unit tests in the traditional software sense. They're behavioral tests: given this input and this context, the agent should produce output that meets these criteria.
When a model updates, the pipeline runs automatically. When a soul document changes (the configuration that defines how the agent behaves), the pipeline runs. When the agent's scope expands to cover a new task, new evals get written before the expansion ships. This is the same discipline software engineering learned decades ago — you don't ship code without tests, and you don't ship agent behavior changes without evals.
The difference from traditional testing: agent evaluations are probabilistic, not deterministic. The same input might produce slightly different output each time. The eval framework needs to assess whether the output falls within acceptable bounds, not whether it matches a single expected string. promptfoo's evaluation framework handles this natively, scoring outputs against criteria rather than exact matches.
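The shift from exact-match testing to bounds-based scoring can be sketched in a few lines. The criteria and the 0.66 threshold below are illustrative assumptions: the idea is that two differently worded outputs for the same prompt can both pass, as long as each clears enough of the criteria you actually care about.

```python
# Bounds-based scoring for nondeterministic output (a sketch; the
# criteria and threshold are illustrative assumptions). Instead of
# comparing against one expected string, score the output against
# several criteria and accept it if it clears a threshold.
from typing import Callable

Criterion = tuple[str, Callable[[str], bool]]


def score(output: str, criteria: list[Criterion]) -> float:
    """Fraction of criteria the output satisfies."""
    return sum(check(output) for _, check in criteria) / len(criteria)


criteria: list[Criterion] = [
    ("mentions refund policy", lambda s: "policy" in s.lower()),
    ("commits to no guarantees", lambda s: "guarantee" not in s.lower()),
    ("offers human escalation", lambda s: "human" in s.lower()),
]

# Two plausible outputs for the same prompt; the wording differs,
# but both fall within acceptable bounds.
out_a = "Per our policy, a human teammate will review your refund."
out_b = "Our refund policy applies here; I can connect you to a human."

THRESHOLD = 0.66
passes_a = score(out_a, criteria) >= THRESHOLD
passes_b = score(out_b, criteria) >= THRESHOLD
```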
In OpenClaw, a soul document defines the agent's purpose, boundaries, voice, and decision-making framework. It's the closest thing to intent engineering that exists in production agent deployments today — the organizational values and judgment calls that a human employee absorbs over months, encoded explicitly so an agent can act on them from day one.
But soul documents aren't write-once artifacts. They need versioned updates as the business evolves. A new product launch means the agent needs updated product knowledge and possibly new decision boundaries. A change in compliance requirements means updated guardrails. A shift in company strategy means re-evaluating which trade-offs the agent is authorized to make.
The operational practice: treat soul documents like production code. Version them. Review changes through pull requests. Run evals against the updated configuration before deploying it. Keep them in source control so you can trace when a behavior changed and why.
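A simple pre-deploy gate makes "treat soul documents like production code" enforceable. The required sections below are hypothetical — borrow the list from however your platform structures its configuration — but the discipline is the same as any CI check: a document that is missing a section or hasn't had its version bumped never reaches production.

```python
# A pre-deploy gate for a versioned soul document (a sketch; the field
# names and required sections are hypothetical). A document that fails
# validation never deploys, mirroring code-review discipline.
def validate_soul_doc(doc: dict, previous_version: str) -> list[str]:
    """Return a list of problems; an empty list means the doc may deploy."""
    problems = []
    for field in ("version", "purpose", "boundaries", "escalation_rules"):
        if field not in doc:
            problems.append(f"missing required section: {field}")
    if doc.get("version") == previous_version:
        problems.append("version not bumped since last deploy")
    return problems


candidate = {
    "version": "2.4.0",
    "purpose": "handle tier-1 billing questions",
    "boundaries": ["no refunds above $200 without human approval"],
    "escalation_rules": ["unknown request types go to a human"],
}
problems = validate_soul_doc(candidate, previous_version="2.3.1")
```

Run this check in the same pipeline that runs your evals, and a behavior change can never ship without a traceable version behind it.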
This is where the "clarity on where to begin" gap from the ECI report becomes concrete. The beginning isn't which AI tool to buy. It's defining — in writing, in specifics, in testable terms — what you want the agent to do, how you want it to make decisions, and where you want it to stop and ask a human.
Anthropic's research on agentic misalignment found that explicit safety instructions ("do not do X") reduced bad behavior from 96% to 37% — still failing more than a third of the time. Telling an agent to be safe doesn't make it safe. Structural controls make it safe.
In production, this means the agent can't modify its own instructions — soul documents are mounted read-only. Credentials are fetched from a secrets manager at boot, not stored in configuration files. The agent runs in a private network with outbound-only access. Every session is logged to an audit trail. Permissions follow least privilege: the agent gets access to what it needs and nothing else.
These aren't deployment decisions. They're operational infrastructure that needs monitoring, updating, and periodic review. When the agent's scope expands, its permissions need revisiting. When a new integration is added, the security model needs updating. When a new class of prompt injection attack emerges, the content guardrails need testing.
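Least privilege as a structural control means the permission check lives outside anything the model can influence. The tool names and allowlist below are hypothetical, but the pattern is general: every tool call passes through a gate the agent cannot rewrite, so expanding the agent's scope requires an explicit, reviewable change to the allowlist.

```python
# A least-privilege tool gate (a sketch; agent and tool names are
# hypothetical). The allowlist lives outside the model's control, so
# the agent can only gain a capability via an explicit config change.
ALLOWED_TOOLS = {
    # Note what's absent: the billing agent can draft a refund request
    # but cannot issue one -- that action stays with a human.
    "billing_agent": {"read_invoice", "draft_refund_request"},
}


def call_tool(agent_id: str, tool: str, execute) -> str:
    """Run a tool only if this agent is explicitly allowed to use it."""
    if tool not in ALLOWED_TOOLS.get(agent_id, set()):
        raise PermissionError(f"{agent_id} is not permitted to call {tool}")
    return execute()


receipt = call_tool("billing_agent", "read_invoice", lambda: "invoice #1001: $42.00")
```

The same shape applies to credentials and network access: the deny-by-default decision is made in infrastructure, not in the prompt.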
The TriNet survey found that 49% of SMBs anticipate difficulty training or upskilling existing employees on AI. Here's why: the training they're imagining — how to use ChatGPT, how to write prompts, how to configure an AI tool — is table-stakes knowledge that covers the deployment phase. The operational skills described above aren't covered in any vendor's onboarding tutorial.
No AI platform vendor has an incentive to tell you that their product requires ongoing operational discipline to deliver results. The sales pitch is deployment ease, not operational complexity. The demo shows the agent working on day one, not the work required to keep it working on day 200.
This is why the adoption gap is widening. Every quarter, more SMBs deploy AI agents. The deployment tools get better, the models get more capable, the barrier to getting started drops further. But the operational skills required after deployment haven't gotten easier. They've gotten harder, because the agents are more capable, handle more edge cases, fail in more subtle ways, and touch more consequential business processes.
The 40% of SMBs reporting no measurable results aren't failing at technology adoption. They're failing at technology operations. And until the industry starts talking honestly about what "operating AI agents" actually requires — the ongoing evaluation, the boundary recalibration, the failure model maintenance, the attention triage, the living documentation — that number isn't going to improve.
Q: How often should we re-evaluate our AI agent's performance after deployment? A: At minimum, run your evaluation pipeline after every model update from your provider and after every change to the agent's configuration or scope. Beyond automated checks, do a manual boundary review quarterly — examine what the agent handled well, what it fumbled, and whether its current scope still matches its actual reliability.
Q: What's the difference between testing an agent at deployment and ongoing evaluation? A: Deployment testing confirms the agent works against a known set of scenarios. Ongoing evaluation tracks whether it continues to work as its foundation changes — new model versions, updated business context, evolved failure modes. The agent that passed every test in January may fail tests in April that didn't exist in January because the business or the model shifted underneath it.
Q: We're a small team. How do we manage agent operations without dedicated staff? A: Automate what you can. Evaluation pipelines should run without human intervention after every model or config change. Build explicit triage rules so your team reviews high-risk outputs deeply and trusts automated checks for low-risk ones. The goal isn't zero human involvement — it's directing human attention where it has the most impact rather than spreading it thin across everything.
Q: Our AI vendor says their platform handles all of this. Should we trust that? A: Some platforms handle pieces well — automated monitoring, basic alerting, usage dashboards. None of them handle the parts that require your business context: defining what "correct" looks like for your specific workflows, deciding which failure modes matter most for your customers, recalibrating agent scope as your business changes. Those are organizational decisions, not platform features.
Q: What's the first thing we should do if we've deployed an agent but aren't seeing results? A: Start with the failure model. Spend a week reviewing the agent's actual outputs — not spot-checking, but systematically sampling across task types. Document where it's producing value and where it's producing plausible-sounding work that requires correction. That document becomes the foundation for every operational improvement that follows.
The companies seeing measurable results from AI aren't the ones who deployed the most advanced models or spent the most on tools. They're the ones who treated deployment as the starting line, not the finish line — and invested in the operational skills and infrastructure to keep their agents reliable, safe, and aligned with what the business actually needs, week after week, quarter after quarter.
Associates AI builds and maintains this operational infrastructure for our clients. Continuous evaluation pipelines, quarterly boundary reviews, failure model documentation, structural security, and the ongoing calibration work that turns an AI deployment into an AI capability. If you're seeing the readiness gap in your own organization, book a call and we'll walk through what closing it looks like for your specific workflows.
Written by Mike, Founder, Associates AI
Mike is a self-taught technologist who has spent his career proving that unconventional thinking produces the most powerful solutions. He built Associates AI on the belief that every business — regardless of size — deserves AI that actually works for them: custom-built, fully managed, and getting smarter over time. When he's not building agent systems, he's finding the outside-of-the-box answer to problems that have existed for generations.
Want to go deeper?
Book a free discovery call. We'll show you exactly what an AI agent can handle for your business.
Book a Discovery Call