Most businesses describing their 'AI agents' are actually running chatbots — and the confusion is costing them real money. Here is how to tell the difference, why it matters, and what it takes to actually deploy one.
Bernard Marr's analysis of enterprise AI deployments found one failure pattern above all others: companies confusing agents with chatbots, then building governance and oversight for the wrong thing. The chatbot gets deployed with agent-level trust. The agent gets constrained with chatbot-level safeguards. Neither works as expected.
The terminology problem is real. Vendors use "AI agent" and "chatbot" interchangeably in marketing materials. Executives read the slide deck and form a mental model. That mental model shapes decisions about budget, staffing, risk tolerance, and architecture. When the mental model is wrong, the decisions are wrong — and fixing them later is expensive.
Here is a precise definition of both, why the distinction matters operationally, and what it actually takes to build something that earns the label "agent."
A chatbot takes input and produces output. That is essentially the full description.
The input is a message from a user. The output is a response. The chatbot does not make decisions that affect anything outside the conversation. It does not take actions in external systems. It does not run code, modify records, send emails, or call APIs on your behalf. It responds.
Rule-based chatbots — the original kind — matched keywords against a decision tree. "Refund" triggered the refund script. "Hours" triggered the hours script. Everything else fell through to a human. These systems were predictable but brittle. Anything outside the decision tree produced nothing useful.
Modern AI-powered chatbots use large language models to understand natural language and generate contextually appropriate responses. They handle a far wider range of inputs. They can maintain context across a multi-turn conversation. They can appear to reason through a problem. But the fundamental structure is the same: input in, output out. The chatbot's world ends at the conversation window.
The Air Canada case illustrates where chatbots fail. Air Canada's chatbot told a grieving customer he could apply for a bereavement fare retroactively — an incorrect policy. Air Canada tried to disclaim responsibility; a Canadian tribunal ruled the company was liable for what its own bot said. The chatbot did one thing wrong: it generated an incorrect response. That is a chatbot failure mode. But notice what it did not do: it did not process the fare application. It did not modify a reservation. It did not touch any external system. It just said the wrong thing. That is the boundary of what a chatbot can do.
An AI agent can act.
The distinction is not about intelligence — modern chatbots are often quite capable of sophisticated reasoning. The distinction is about scope. An agent's world extends beyond the conversation window into external systems, ongoing processes, and real-world consequences.
An agent can call external APIs, run code, create or modify records, and send emails and other communications on your behalf. It can also chain those actions together in pursuit of a goal, without a new prompt for each step.
This is not a subtle difference. A chatbot tells you "I found three invoices that might be overdue." An agent identifies the overdue invoices, sends the follow-up emails, logs the outreach in your CRM, and flags the ones that need escalation to a human — all without waiting to be asked again.
The operational consequence is enormous. With a chatbot, the risk you're managing is bad information flowing to users. With an agent, you're managing consequential actions in systems where mistakes have real costs.
In practice, there is a spectrum from purely reactive (chatbot) to fully autonomous (agent). Most deployed systems today sit somewhere in the middle.
Reactive chatbot: takes input, generates response, no external actions, no persistence between sessions.
Augmented chatbot: takes input, queries external data sources to generate a more accurate response (e.g., looks up a customer's order history), but still does not act. RAG pipelines and MCP servers often power this tier.
Tool-using assistant: takes input, can call specific pre-approved tools (search the web, look up a knowledge base article, retrieve a document), but only takes actions explicitly approved per request.
Supervised agent: takes a goal, breaks it into steps, executes them autonomously, but surfaces each consequential action for human approval before proceeding.
Autonomous agent: takes a goal, executes it end to end, escalates only when it hits a defined boundary or encounters something genuinely unexpected.
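The difference between the supervised and autonomous tiers often comes down to a single approval hook in the execution loop. A minimal sketch of that idea; the `Step`, `execute`, and `approve` names are hypothetical stand-ins for real integrations, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    consequential: bool  # does this step change an external system?

def run_plan(steps, execute, approve, supervised=True):
    """Execute a plan; in supervised mode, hold consequential steps for approval."""
    results = []
    for step in steps:
        if supervised and step.consequential and not approve(step):
            results.append((step.description, "skipped: approval denied"))
            continue
        results.append((step.description, execute(step)))
    return results

# Usage: a supervised run where the human denies the risky step.
plan = [
    Step("look up overdue invoices", consequential=False),
    Step("email customer a payment reminder", consequential=True),
]
out = run_plan(plan, execute=lambda s: "done",
               approve=lambda s: False, supervised=True)
# The read-only lookup runs; the email is held back.
```

Flipping `supervised=False` is the entire structural difference between the two tiers, which is why the governance around that flag matters more than the flag itself.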
The line between "augmented chatbot" and "supervised agent" is where most organizations struggle. The system feels like it's doing something — it's calling tools, producing richer output, appearing to reason — but it still requires a human to actually do anything consequential. Calling that an "agent" is technically wrong, and it sets up incorrect expectations about what the next version of the system should be able to do.
The confusion is partly vendor-driven and partly perceptual.
On the vendor side, "AI agent" is a premium label in 2026. It implies sophistication, autonomy, and transformative potential. Marketing materials slap it on anything that uses a language model, regardless of whether the system can act on anything. This muddies the market's understanding of what "agentic" actually means operationally.
On the perceptual side, modern language models are very good at producing confident, fluent responses that sound like a reasoning entity making decisions. A chatbot that explains its "reasoning" in a clear, logical way creates a strong impression of an agent — even if it cannot touch anything outside the conversation. The surface experience is convincing.
The operational test cuts through both. Ask one question: can this system change something in an external system without human intervention? If no, it is a chatbot, regardless of how intelligent the responses sound. If yes — even one thing, even narrowly scoped — you are operating in agent territory, with the governance implications that come with it.
This is where the confusion becomes expensive.
Chatbot governance is relatively straightforward. You are managing information quality: what does the system say, is it accurate, does it reflect correct policy, where does it escalate to a human? The failure mode is bad information. The solution space is content review, response auditing, clear escalation paths, and legal disclaimers.
Agent governance is a different discipline entirely. When a system can act, you are managing consequences. The relevant questions shift: what can the agent access, what can it change, what triggers escalation to a human, how do you audit what it did, and what happens if it is manipulated?
IBM identified a case in early 2026 where an autonomous customer-service agent began approving refunds outside policy guidelines. The agent was doing exactly what it understood its job to be — resolving customer service inquiries quickly. The problem was that "resolving quickly" and "within policy" were not the same thing, and the agent had not been given a clear enough model of the difference. That is not a chatbot problem. That is an agent problem — specifically, a failure of intent engineering: the system was not given a structured, machine-actionable representation of what "correct resolution" actually meant.
Organizations that deploy agents with chatbot-level governance — essentially "just don't say anything harmful" — are taking on consequential risk with tools designed for informational risk. The gap between those two is where real operational failures happen.
These are the practical tests. A system that passes all five is operating as an agent. A system that fails any of them is operating closer to chatbot territory, regardless of what the vendor calls it.
1. Persistent external change. The agent can call external APIs, modify database records, send communications, or interact with external systems in ways that produce persistent change. Not read-only queries — writes. A system that can look up an order but cannot update it is closer to chatbot territory.
2. Goal decomposition. Given a goal rather than a request, the agent breaks it into steps, executes them in sequence or parallel, and adapts based on what it finds. A chatbot gets a question, returns an answer. An agent gets an objective, figures out what needs to happen to achieve it, and runs the process.
3. Cross-session memory. The agent remembers context across sessions — previous actions, decisions made, state of ongoing work — and uses that memory to make better decisions on subsequent runs. Each chatbot conversation starts fresh; agents carry context forward.
4. Bounded autonomy. The agent makes decisions without waiting for human approval at each step, within defined boundaries. Those boundaries are explicit: the agent knows what it is authorized to do, what requires escalation, and what is out of scope. Agents that require human approval for every action are supervised agents — still valuable, but different.
5. Error recovery. When a step fails or produces unexpected output, the agent can diagnose the failure, adapt its approach, and try an alternative path. A chatbot surfaces an error; an agent tries to route around it.
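The recovery behavior in the last test can be sketched as a try-alternatives loop: attempt each path in order, record what failed, and escalate only when every path is exhausted. A minimal illustration, not tied to any framework; the attempt functions are hypothetical:

```python
def run_with_fallbacks(attempts, escalate):
    """Try each approach in order; escalate only when all paths fail.

    attempts: list of (name, callable) pairs, tried in order.
    escalate: called with the collected errors if nothing succeeds.
    """
    errors = []
    for name, attempt in attempts:
        try:
            return attempt()
        except Exception as exc:  # diagnose, record, move to the next path
            errors.append(f"{name}: {exc}")
    return escalate(errors)

# A chatbot would surface the first error; an agent routes around it.
def primary():
    raise RuntimeError("API timeout")

result = run_with_fallbacks(
    [("primary API", primary), ("cached data", lambda: "stale-but-usable")],
    escalate=lambda errs: f"escalated: {errs}",
)
# result == "stale-but-usable"
```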
The reason agents are harder to deploy is not technical sophistication — the models are capable. It is the operational and governance work that most organizations underinvest in.
Permission architecture: Agents need access to external systems, which means managing credentials, scopes, and least-privilege access carefully. An agent with broad database write permissions and a misconfigured instruction set is a production incident waiting to happen. The right pattern is narrowly scoped permissions, reviewed regularly, with the agent getting exactly what it needs and nothing more.
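One way to make "exactly what it needs and nothing more" concrete is to treat each agent's toolset as an explicit allowlist of scopes, so anything not granted simply does not exist from the agent's point of view. A sketch with hypothetical agent and scope names:

```python
# Each agent gets an explicit allowlist of scopes; this table is what
# gets reviewed regularly, not the agent's prompt.
AGENT_PERMISSIONS = {
    "invoice-follow-up-agent": {
        "crm.read_contacts",    # read-only lookups
        "invoices.read",
        "email.send_template",  # narrow write: templated emails only
        # deliberately absent: invoices.write, orders.cancel, db.admin
    },
}

def authorize(agent_id: str, scope: str) -> bool:
    return scope in AGENT_PERMISSIONS.get(agent_id, set())

def call_tool(agent_id: str, scope: str, tool, *args):
    if not authorize(agent_id, scope):
        raise PermissionError(f"{agent_id} lacks scope {scope}")
    return tool(*args)

# The agent can read invoices but cannot cancel an order.
assert authorize("invoice-follow-up-agent", "invoices.read")
assert not authorize("invoice-follow-up-agent", "orders.cancel")
```

In production the same idea lives in your identity provider or cloud IAM policies rather than application code, but the principle is identical: deny by default, grant narrowly.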
Defined escalation thresholds: Before an agent goes into production, you need explicit definitions of what it escalates. Not vague guidance — specific boundaries. "Flag any refund over $200 for human review." "Do not cancel orders; route to the customer success team." These boundaries should be written before the first production run, not after the first incident.
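Boundaries like the refund example are easiest to review and audit when written as data rather than buried in a prompt. A minimal sketch; the rule values mirror the examples above:

```python
# Escalation rules written down before the first production run.
ESCALATION_RULES = [
    # (condition, reason routed to a human)
    (lambda a: a["type"] == "refund" and a["amount"] > 200,
     "refund over $200 requires human review"),
    (lambda a: a["type"] == "cancel_order",
     "order cancellation routes to customer success"),
]

def check_action(action: dict):
    """Return None if the agent may proceed, else the escalation reason."""
    for condition, reason in ESCALATION_RULES:
        if condition(action):
            return reason
    return None

assert check_action({"type": "refund", "amount": 50}) is None
assert check_action({"type": "refund", "amount": 350}) is not None
assert check_action({"type": "cancel_order"}) is not None
```

Keeping the rules in one reviewable structure also means the boundary can be tightened after an incident without retraining or re-prompting anything.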
Audit logging: Every action an agent takes needs to be logged — what it did, why it decided to do it, what the outcome was. This is not optional. You cannot govern what you cannot see. CloudWatch, or the equivalent in your stack, should capture a full audit trail of every agent session.
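A workable baseline is an append-only, structured record per action: what it did, why, and the outcome. A sketch using JSON lines with an in-memory stand-in for the log sink; in production the records would flow to CloudWatch or its equivalent:

```python
import io
import json
from datetime import datetime, timezone

def log_action(stream, agent_id, action, rationale, outcome):
    """Append one structured audit record per agent action."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_id,
        "action": action,        # what it did
        "rationale": rationale,  # why it decided to
        "outcome": outcome,      # what happened
    }
    stream.write(json.dumps(record) + "\n")
    return record

# In-memory stand-in for a real log sink.
sink = io.StringIO()
log_action(sink, "invoice-follow-up-agent",
           "email.send_template(invoice=1042)",
           "invoice 30 days overdue", "sent")
assert json.loads(sink.getvalue())["outcome"] == "sent"
```

One structured line per action is enough to answer the governance questions after the fact: what changed, on whose authority, and with what result.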
Behavioral testing before production: Automated agent testing tools like promptfoo allow you to write behavioral evaluations that test how an agent responds to edge cases, adversarial inputs, and policy-sensitive scenarios before the system is live. This is the equivalent of unit tests for agent behavior. Most teams skip this step. Most teams also have their first production incident within a month.
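promptfoo expresses these evaluations as declarative test cases; the same idea in plain Python is asserting on what the agent would do, not how its prose reads. A sketch in which `propose_action` is a hypothetical stand-in for running your agent against a scenario and capturing its planned action:

```python
# Behavioral evaluation: assert on the agent's proposed *action*.
def propose_action(scenario: dict) -> dict:
    # Stub for illustration; a real harness would invoke the agent here.
    if scenario["request"] == "refund" and scenario["amount"] > 200:
        return {"action": "escalate"}
    return {"action": "refund"}

EDGE_CASES = [
    ({"request": "refund", "amount": 500}, "escalate"),  # policy boundary
    ({"request": "refund", "amount": 20}, "refund"),     # normal path
]

for scenario, expected in EDGE_CASES:
    got = propose_action(scenario)["action"]
    assert got == expected, f"{scenario} -> {got}, expected {expected}"
```

The valuable cases are the adversarial and policy-boundary ones; the normal path will pass regardless, and it is the edges that show up as incidents.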
Ongoing failure model maintenance: Agents fail in specific, patterned ways that change as the underlying models improve. Current frontier models fail subtly — confident responses on misunderstood premises, correct-sounding analysis of the wrong situation, 98% accurate summaries where the remaining 2% is stated with equal confidence. That failure texture shifts with every model update. The team running a production agent needs an active failure model, not a one-time review.
The decision is not "agent good, chatbot bad." It is matching the tool to the task and the governance maturity of the organization.
Use a chatbot when: the job is delivering information, the consequences of a response live entirely inside the conversation, and a wrong answer can be caught by content review and a clear escalation path.
Use a supervised agent when: the workflow requires real actions in external systems, but you are still building a picture of its failure modes and want a human approving each consequential step before it executes.
Use an autonomous agent when: the workflow is well understood, escalation boundaries and audit logging are already in place, and you can answer with specificity what the agent can access, what it can change, and what happens if it is manipulated.
Most businesses deploying their first AI systems should start with chatbots or supervised agents. Not because the models are not capable, but because organizational readiness — governance infrastructure, failure model awareness, escalation design — takes time to build. An autonomous agent in an organization without that infrastructure is a risk exposure, not an advantage.
One clarification worth making: what counts as "agent territory" today is shifting.
Eighteen months ago, a system that could call a web search API was considered meaningfully agentic. Today that is table stakes for an AI assistant. The frontier of what requires careful agent governance has moved to longer-horizon autonomous work, multi-agent systems that hand off context to each other, and agents with access to consequential external systems.
This means boundary sensing — maintaining a current calibration of what agents can actually do and where governance gaps exist — is not a one-time exercise. Every significant model release changes what the system can do and how it can fail. The team managing production agents today needs to update their failure models when GPT-5.3 ships, when Claude Opus 4.6 becomes the standard underlying model, when a new tool integration goes live.
The distinction between chatbot and agent is not going to get simpler. As model capabilities expand, more systems will qualify as agents — and the governance gap between what those systems can do and what most organizations have built to manage them will widen if organizations treat the problem as solved once.
Q: If my chatbot uses a language model, does that make it an agent? No. The model is the brain; what matters is whether the system can act. A language model that generates responses without touching external systems is powering a chatbot. The same model with access to APIs, tools, and the ability to take multi-step autonomous action is powering an agent. The model choice does not determine the category.
Q: My vendor calls their product an "AI agent." How do I verify whether it is? Apply the operational test: can the system change something in an external system without human intervention at each step? If the answer is "yes, with human approval each time," you have a supervised agent. If the answer is "yes, autonomously within defined boundaries," you have an autonomous agent. If the answer is "no, it just generates responses," you have an AI-powered chatbot with a premium label.
Q: Is one better than the other? Neither is universally better. Chatbots are appropriate for information-focused use cases where consequences live in the conversation. Agents are appropriate for workflow automation where the goal is taking action in external systems. The right tool depends on the task and your governance readiness — not on which sounds more impressive.
Q: What is the biggest mistake organizations make when deploying agents for the first time? Deploying an autonomous agent before they have supervised the same workflow long enough to understand its failure modes. The correct sequence is: chatbot → supervised agent → autonomous agent, with a meaningful period at each stage to observe how the system behaves and where it makes mistakes. Organizations that skip stages end up running incident response instead.
Q: Do I need an agent if I only want to answer customer questions? No. A well-designed chatbot is the correct tool for information delivery. Adding agent capabilities — tool use, external system access, autonomous action — introduces complexity and governance requirements that a question-answering use case does not justify. Deploy the simpler tool for the simpler task.
Q: How do I know if my organization is ready to run an autonomous agent? You are ready when you can answer these questions with specificity: What can the agent access? What can it change? What triggers escalation to a human? How do you audit what it did? What happens if it is manipulated? If any of those answers is "we haven't defined that yet," you are not ready for autonomous deployment.
Associates AI helps businesses understand exactly where this boundary sits for their specific workflows — and build the governance infrastructure to run agents safely when the time is right. If you are trying to figure out whether what you are building qualifies as an agent and what that means for your operations, book a call.
Written by
Founder, Associates AI
Mike is a self-taught technologist who has spent his career proving that unconventional thinking produces the most powerful solutions. He built Associates AI on the belief that every business — regardless of size — deserves AI that actually works for them: custom-built, fully managed, and getting smarter over time. When he's not building agent systems, he's finding the outside-of-the-box answer to problems that have existed for generations.
More from the blog
On March 5, Amazon's AI coding agent Kiro pushed unreviewed code to production and caused a six-hour...
IBM says 2026 is the year multi-agent systems move into production. Gartner says more than 40% of ag...
Three companies deployed AI agents and got documented, measurable results. What they did — and what...
Want to go deeper?
Book a free discovery call. We'll show you exactly what an AI agent can handle for your business.
Book a Discovery Call