DBS Bank and Visa just tested AI agents making credit card transactions independently. That's not a chatbot writing emails — it's software spending money. Here's what businesses need to figure out before letting agents take consequential actions.
In February 2026, DBS Bank and Visa completed successful tests of AI-driven "agentic commerce" — software agents executing credit card transactions independently. No human clicking "confirm purchase." No human reviewing the cart. An agent identified a need, selected a vendor, and completed payment on its own.
This is a different category of AI deployment than most businesses are running today. The vast majority of agents in production right now generate text: drafting emails, summarizing documents, answering customer questions. Those are valuable tasks, but they share a common trait — if the agent gets it wrong, a human catches it before anything irreversible happens. A bad draft gets edited. A wrong summary gets corrected. The cost of failure is measured in minutes, not dollars.
When an agent spends money, places an order, sends a legal notice, or submits a regulatory filing, the failure model changes completely. The cost of being wrong isn't wasted time. It's wasted money, broken contracts, regulatory exposure, or damaged relationships. And the window for human correction shrinks from "before you hit send" to "after the transaction has already cleared."
Most businesses aren't ready for this. Not because the technology doesn't work — the DBS/Visa test proved it can — but because the operational maturity required to run agents that do things is fundamentally different from the maturity required to run agents that say things.
The first question any business needs to answer before deploying agents with real-world authority is: where exactly is the boundary between what an agent can handle reliably and where humans need to stay involved?
This isn't a philosophical question. It's an engineering one.
Agents are reliable at tasks with clear inputs, structured outputs, and low ambiguity. Looking up a price in a catalog. Formatting an invoice from structured data. Routing a support ticket based on keywords. Comparing three vendor quotes against a predefined rubric. These are bounded problems where "right" and "wrong" are well-defined, and the agent has enough context to make the correct call consistently.
Agents are unreliable at tasks requiring judgment under ambiguity, multi-step reasoning with real-world consequences, or situations where the context window doesn't contain all the information needed to make a good decision. Negotiating a contract. Deciding whether a $50,000 purchase order is actually a good deal given market conditions the agent hasn't been trained on. Determining whether a customer complaint warrants a refund, a discount, or a firm "no."
The DBS/Visa test worked because it was a controlled environment with well-defined parameters. The agent knew what to buy, from whom, at what price. That's closer to a scripted transaction than a judgment call. The operational challenge for every other business is figuring out which of their agent-eligible tasks look like "buy this specific item at this specific price" versus "figure out what we should buy and from whom."
A useful framework: categorize every potential agent action along two axes, reversibility and cost of failure. That gives a two-by-two matrix:

Reversible and low-cost: the agent acts autonomously, with logging.
Reversible but high-cost: the agent proposes; a human approves before execution.
Irreversible but low-cost: the agent acts with logging, and humans review the logs on a schedule.
Irreversible and high-cost: a human stays in the loop for every single action.
This matrix isn't static. As confidence in an agent's reliability grows — backed by data, not gut feeling — actions can migrate from "human approves" to "agent runs with logging." But the migration should be earned, not assumed.
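The matrix above can be expressed as a simple routing policy. This is a minimal sketch, not the article's implementation: the policy names, the `Action` shape, and the $1,000 cost threshold are all illustrative assumptions a real deployment would tune.

```python
# Sketch of the reversibility x cost-of-failure matrix as a routing policy.
# Threshold and category names are illustrative assumptions, not a standard.

from dataclasses import dataclass
from enum import Enum

class Policy(Enum):
    AUTONOMOUS = "agent acts, logs after the fact"
    LOGGED = "agent acts, humans review logs on a schedule"
    APPROVAL = "agent proposes, human approves first"
    HUMAN_ONLY = "humans handle this entirely"

@dataclass
class Action:
    name: str
    reversible: bool
    cost_of_failure: float  # estimated dollar impact of a bad call

def route(action: Action, high_cost_threshold: float = 1_000.0) -> Policy:
    high_cost = action.cost_of_failure >= high_cost_threshold
    if action.reversible and not high_cost:
        return Policy.AUTONOMOUS      # low stakes, easy to undo
    if action.reversible and high_cost:
        return Policy.APPROVAL        # undoable, but expensive to get wrong
    if not high_cost:
        return Policy.LOGGED          # permanent but cheap; audits catch drift
    return Policy.HUMAN_ONLY          # permanent and expensive: keep humans in

print(route(Action("reorder paper clips", reversible=True, cost_of_failure=40)))
```

Migrating an action from one quadrant to another then becomes an explicit, reviewable change to the policy rather than a quiet shift in habit.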
Once the boundary is drawn, the next problem is handoff design. How does work flow between agent and human at the points where authority transfers?
The most common mistake is treating human approval as a rubber stamp. An agent prepares a purchase order, drops it in someone's inbox, and the human clicks "approve" without reviewing it because the approval step adds friction to a process that was supposed to be faster. This is worse than no automation at all — it creates the illusion of oversight without the substance.
Another failure: agents that escalate everything. If an agent flags every action for human review, it hasn't automated anything. It's just added a layer of bureaucracy with an AI label on it.
Effective handoffs have three properties:
Context-rich. The agent doesn't just say "approve this?" It presents the action, the reasoning, the alternatives it considered, and the risk factors. A human should be able to make a decision in 30 seconds, not 30 minutes.
Exception-based. The agent handles the 90% of cases that are routine. Humans see only the exceptions: the edge cases, the high-value decisions, the situations where the agent's confidence is low. This is where an honest assessment of the autonomy level your business is actually operating at becomes critical.
Auditable. Every action the agent takes — approved or autonomous — is logged with full context. Not just "agent placed order #4521" but "agent placed order #4521 because inventory for SKU-889 dropped below threshold of 50 units, selected Vendor A over Vendor B based on price ($12.40 vs $14.10) and 3-day delivery window, total cost $1,240."
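The three properties can be combined in a single escalation payload. This is a hypothetical schema, not a standard format: the field names, the 0.8 confidence floor, and the helper functions are all assumptions for illustration.

```python
# Sketch of a context-rich, exception-based, auditable handoff.
# Schema and thresholds are hypothetical assumptions, not a standard.

import json
from datetime import datetime, timezone

def build_approval_request(action, reasoning, alternatives, risk_flags, confidence):
    """Package everything a human needs to decide in ~30 seconds."""
    return {
        "action": action,                # what the agent wants to do
        "reasoning": reasoning,          # why it chose this
        "alternatives": alternatives,    # what it rejected, and why
        "risk_flags": risk_flags,        # anything unusual, surfaced up front
        "agent_confidence": confidence,  # low confidence routes to a human
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def needs_human(request, confidence_floor=0.8):
    """Exception-based routing: only low-confidence or flagged cases escalate."""
    return request["agent_confidence"] < confidence_floor or bool(request["risk_flags"])

request = build_approval_request(
    action={"type": "purchase", "order": "#4521", "sku": "SKU-889",
            "vendor": "Vendor A", "total_usd": 1240.00},
    reasoning="Inventory for SKU-889 dropped below the 50-unit threshold.",
    alternatives=[{"vendor": "Vendor B", "unit_price": 14.10,
                   "rejected_because": "higher price, same delivery window"}],
    risk_flags=[],
    confidence=0.93,
)
print(needs_human(request))          # routine case: no escalation
print(json.dumps(request, indent=2)) # logged either way: the audit trail
```

The key design choice is that the same payload serves both purposes: escalated cases go to a human with full context, and routine cases land in the audit log with identical structure.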
The technical implementation matters here. Read-only soul documents ensure the agent can't modify its own decision-making rules. Least-privilege permissions mean the agent gets access to only the systems it needs — an agent managing inventory shouldn't have access to payroll. Integration platforms like Composio let agents interact with third-party services through managed API keys rather than direct credentials, so the blast radius of a compromised agent is contained.
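Least-privilege access can be sketched as a scope check in front of every tool call. The scope names and the in-memory registry below are hypothetical; a real deployment would back this with an IAM system or a managed integration layer rather than a dictionary.

```python
# Minimal sketch of least-privilege tool access for agents.
# Scope names and the registry are hypothetical assumptions.

AGENT_SCOPES = {
    "inventory-agent": {"inventory:read", "orders:create"},
    "support-agent":   {"tickets:read", "tickets:update"},
}

def call_tool(agent, required_scope, tool_fn, *args, **kwargs):
    """Gatekeeper: every tool call checks the agent's scopes first."""
    if required_scope not in AGENT_SCOPES.get(agent, set()):
        raise PermissionError(f"{agent} lacks scope {required_scope!r}")
    return tool_fn(*args, **kwargs)

# The inventory agent can place an order...
print(call_tool("inventory-agent", "orders:create",
                lambda sku: f"ordered {sku}", "SKU-889"))

# ...but has no path to payroll, so a compromised agent's blast radius is contained.
try:
    call_tool("inventory-agent", "payroll:read", lambda: None)
except PermissionError as e:
    print(e)
```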
Failure models for text-generating agents are well-understood: hallucinations, tone mistakes, factual errors. Annoying but manageable.
Failure models for action-taking agents are different in kind, not just degree.
An agent with purchasing authority can spend money on the wrong thing, spend the right amount with the wrong vendor, spend the wrong amount with the right vendor, or execute a legitimate purchase at the wrong time. Each failure mode requires a different detection mechanism. Spend limits catch overspending but not mis-spending. Vendor allowlists catch wrong-vendor errors but not wrong-timing errors.
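Because each failure mode needs its own detector, the guardrails have to be layered. A minimal sketch, assuming illustrative limits (the $5,000 cap, the vendor list, and the daily order cap are placeholders, not recommendations):

```python
# Layered guardrails for a purchasing agent. Each check targets a different
# failure mode; none is sufficient alone. All limits are illustrative.

SPEND_LIMIT_USD = 5_000.00
VENDOR_ALLOWLIST = {"Vendor A", "Vendor B"}
MAX_ORDERS_PER_DAY = 3  # crude wrong-timing guard: caps runaway reordering

def check_purchase(vendor: str, amount_usd: float, orders_today: int) -> list[str]:
    """Return violated guardrails (an empty list means the purchase may proceed)."""
    violations = []
    if amount_usd > SPEND_LIMIT_USD:
        violations.append("spend limit exceeded")     # catches overspending
    if vendor not in VENDOR_ALLOWLIST:
        violations.append("vendor not on allowlist")  # catches wrong-vendor errors
    if orders_today >= MAX_ORDERS_PER_DAY:
        violations.append("daily order cap reached")  # catches wrong-timing loops
    return violations

print(check_purchase("Vendor A", 1_240.00, orders_today=1))  # [] -> proceed
print(check_purchase("Vendor Z", 9_999.00, orders_today=1))
```

Note what this sketch cannot catch: the right amount spent with the right vendor at a quietly bad price. That failure mode needs periodic human audits of the logs, not a pre-transaction check.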
When agents chain actions — agent A's output triggers agent B's action, which triggers agent C's payment — a single bad decision can cascade before any human notices. The J-curve of AI adoption is steep enough with individual agents. Cascading agent systems multiply the failure surface.
The most dangerous failure mode isn't the agent that makes an obviously wrong purchase. It's the agent that makes subtly suboptimal decisions consistently — paying 5% more than necessary on every order, choosing the slightly slower vendor every time, categorizing expenses in ways that don't trigger alerts but compound over months. These failures don't set off alarms. They erode margins quietly.
For every consequential action an agent takes, the business needs to answer: What does failure look like? How will it be detected, and how quickly? Can it be reversed, and at what cost? Who gets notified, and what happens next?
Testing these failure models before deployment — using tools like Promptfoo for eval-driven skill development — is the difference between a controlled rollout and an expensive experiment.
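An eval in this style is just a table of failure scenarios and the safe behavior each one should produce. The sketch below is a generic harness in the spirit of such tools, not Promptfoo's actual API; the `decide` function and its $5,000 escalation policy are stand-in assumptions.

```python
# Generic eval harness for failure scenarios (NOT Promptfoo's API; a sketch).
# `decide` stands in for the agent's purchase policy, assumed for illustration.

def decide(order):
    """Assumed policy: escalate anything over $5,000 or from an unknown vendor."""
    if order["amount"] > 5_000 or order["vendor"] not in {"Vendor A", "Vendor B"}:
        return "escalate"
    return "approve"

# Each scenario pairs an input with the behavior the business requires.
SCENARIOS = [
    ({"vendor": "Vendor A", "amount": 1_240},  "approve"),   # routine reorder
    ({"vendor": "Vendor A", "amount": 50_000}, "escalate"),  # overspend attempt
    ({"vendor": "Mallory Inc", "amount": 100}, "escalate"),  # unapproved vendor
]

def run_evals():
    failures = [(o, want, decide(o)) for o, want in SCENARIOS if decide(o) != want]
    print(f"{len(SCENARIOS) - len(failures)}/{len(SCENARIOS)} scenarios passed")
    return failures

run_evals()  # prints "3/3 scenarios passed" when the policy holds
```

The discipline matters more than the tooling: every failure mode identified in the risk analysis should appear here as a scenario before the agent touches production.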
The DBS/Visa test isn't an isolated experiment. It's the leading edge of a shift that will hit most businesses within 12 months.
Expect agent-commerce APIs from major payment processors. Stripe, Square, and Adyen are all building agent-authentication layers. The infrastructure for agents to spend money is being commoditized. The question won't be "can my agent make purchases?" but "should it?"
Procurement agents will be the first mainstream use case outside banking. Automated reordering for consumable supplies, comparison shopping across approved vendors, and invoice reconciliation are all well-bounded enough for current agent capabilities.
Agent-to-agent commerce becomes real. Not just one business's agent buying from a vendor's website, but one business's agent negotiating with another business's agent. This is where the seam design challenge gets genuinely hard — when there's no human on either side of the transaction, the trust architecture has to be embedded in the system design, not the process design.
As agents take on more consequential actions, human roles shift from execution to three specific functions: defining the boundaries of agent authority, judging the exceptions the agent escalates, and auditing outcomes to catch the slow, quiet failures.
The businesses that figure this out early won't just be more efficient. They'll be the ones that can actually deploy agents for consequential work while their competitors are still stuck arguing about whether AI can be trusted to draft an email without supervision.
Should my business let an AI agent spend money?
Only if you've built the operational infrastructure first. That means defined spend limits, vendor allowlists, human approval gates for high-value transactions, comprehensive audit logging, and tested failure models. The technology works — the DBS/Visa tests proved that. The question is whether your processes, permissions, and monitoring are ready.

What's the biggest difference between text-generating and action-taking agents?
Reversibility. A bad email draft gets edited before sending. A bad purchase gets charged to your account immediately. Action-taking agents require stricter permission boundaries, real-time monitoring, and pre-defined escalation paths that text-generating agents don't need.

How do I safeguard an agent that takes real-world actions?
Layer your defenses. Start with least-privilege permissions — the agent can only access what it needs. Add spend limits and approved vendor lists. Implement human approval gates at dollar thresholds. Log every action with full reasoning context. Test failure scenarios with evals before deploying to production. No single safeguard is sufficient; the layers are the strategy.

What is agentic commerce, and why does it matter now?
Agentic commerce is AI agents conducting commercial transactions — purchasing, ordering, payments — without direct human involvement in each transaction. It matters because the infrastructure is being built now by major payment processors and banks. Within a year, the tools to let agents spend money will be widely available. Businesses that understand the operational requirements early will be positioned to adopt safely. Those that don't will either miss the opportunity or adopt recklessly.

How do I know if my business is ready?
Ask three questions: Do you have clear documentation of what the agent should and shouldn't do? Can you monitor every action it takes after the fact? Do you have a tested plan for what happens when it makes a mistake? If the answer to any of these is no, you're not ready for consequential agent actions — but you can start building toward it.
The gap between "AI that talks" and "AI that acts" is the most important operational challenge in business technology right now. The DBS/Visa test showed that the technology side is solved. What's not solved — and what most businesses haven't even started thinking about — is the operational maturity required to deploy it safely.
This isn't something you figure out after the agent has already spent money on the wrong thing. It's something you build before the first transaction, test exhaustively, and refine continuously.
Associates AI helps businesses build the operational frameworks — boundary definitions, seam design, failure models, monitoring, and eval infrastructure — that make consequential AI agent deployment safe and effective. If you're planning to move your agents from generating text to taking real-world actions, book a call to map out what that looks like for your specific use case.
Written by
Founder, Associates AI
Mike is a self-taught technologist who has spent his career proving that unconventional thinking produces the most powerful solutions. He built Associates AI on the belief that every business — regardless of size — deserves AI that actually works for them: custom-built, fully managed, and getting smarter over time. When he's not building agent systems, he's finding the outside-of-the-box answer to problems that have existed for generations.