AI Strategy

AI Agent Examples: Real Businesses, Real Results

Associates AI

Three companies deployed AI agents and got documented, measurable results. What they did — and what went wrong — tells you more about production AI than any vendor demo.


What Real AI Agent Deployments Look Like

Most AI agent content is aspirational. Demos, projections, and vendor case studies where everything works and the numbers are round. Real deployments are messier — and more instructive.

Three companies have documented their AI agent results publicly with enough specificity to be actually useful: Morgan Stanley, Klarna, and StrongDM. They span financial services, fintech, and software engineering. They range from cautionary failure to quiet success to radical transformation. And taken together, they illustrate the difference between deploying AI agents and operating them well.

What follows is what actually happened, what the numbers show, and why those outcomes weren't accidents.


Morgan Stanley: The Knowledge Retrieval Win

Morgan Stanley's wealth management division manages $4 trillion in client assets across 16,000 financial advisors. Each advisor is expected to have current command of the firm's research — market analysis, equity recommendations, sector outlooks, company deep dives. The research library contains over 100,000 documents and grows continuously.

Nobody reads it all. In practice, advisors rely on whatever they remember or whatever they happened to catch in the morning summary. Genuinely relevant research gets missed. Clients don't receive the benefit of analysis that exists but wasn't surfaced.

In early 2023, Morgan Stanley launched the AI @ Morgan Stanley Assistant in partnership with OpenAI. The assistant gives advisors a conversational interface to the full research library. Ask it "what's our current view on European utilities given the energy transition?" and it retrieves, synthesizes, and summarizes the relevant research in seconds. Ask it to prep a client meeting brief and it pulls together everything relevant to that client's holdings and current market context.

By the fall of 2023, the firm reported that 98% of financial advisors use it weekly. That adoption rate, for any enterprise software rollout, is extraordinary.

The reason this deployment worked comes down to seam design. The agent does one thing: retrieve and synthesize research. It doesn't make investment decisions. It doesn't advise clients. It doesn't make recommendations. Those tasks stay entirely with the advisor, because that's where the value actually is — in the human judgment, the client relationship, the read on what a specific person needs to hear and how.

The agent handles the effort work. The advisor handles the judgment work. The seam between them is clean, explicit, and not a source of confusion. The advisor knows exactly what the agent does and what it doesn't. The client still talks to the same advisor. Nothing about the trusted relationship changed.

This is what good boundary sensing looks like: understanding that the agent's value is in high-volume retrieval, not in the relationship work that makes retrieval useful. The seam was placed exactly at the boundary of what the agent could do reliably — and not one step further.

The practical lesson for your business: Before deploying an agent on any knowledge-intensive task, answer this: what's the actual value being delivered to the person on the other end? In Morgan Stanley's case, the value to the client is advisor judgment informed by comprehensive research. The agent enables that. It doesn't replace it. Map the value delivery before you map the task.


Klarna: Exceptional Metrics, Wrong Goal

In February 2024, Klarna announced that its AI customer service agent had handled 2.3 million customer conversations in its first month, across 23 markets in 35 languages. Average resolution time dropped from 11 minutes to 2 minutes. Customer satisfaction scores held. The company's CEO projected $40 million in annual savings and called it a landmark achievement.

By mid-2025, Klarna was rehiring the human agents it had let go. The CEO told Bloomberg that while "cost was a predominant evaluation factor," the result was lower quality. Customers were getting generic answers, robotic tone, and an agent that couldn't handle anything requiring judgment or relationship context.

The agent wasn't broken. It was spectacularly good at the task it was given: resolve customer tickets as fast as possible. That task was measurable, optimizable, and the wrong goal.

Klarna's actual objective — the one that makes a consumer fintech company viable — was building customer relationships that drive long-term value and retention. Those two goals (fast resolution vs. relationship quality) diverge the moment a conversation gets complicated. A customer who's been with the company for three years, calls with a billing dispute during a stressful personal situation, and gets a technically correct but tonally robotic two-minute resolution isn't a satisfied customer. They're a flight risk.

A human agent who's worked there for five years knows things that never appear in a prompt. She knows when to spend an extra three minutes. She knows when to bend a policy without being asked. She knows when efficiency is the right call and when generosity is. She absorbed this by watching how experienced managers handled hard situations over years. The AI agent had none of it — not because it couldn't have, but because nobody encoded it.

This is what happens when you skip intent engineering. The task was defined. The goal wasn't. And in the absence of explicit organizational intent — what trade-offs are we willing to make, what does "good" look like in this specific context, when should we slow down instead of speed up — the agent optimized for what it could measure. It measured resolution time, so it optimized for resolution time.

The Klarna case is now frequently cited as an AI failure. It wasn't. It was a deployment where the technology performed exactly as designed, and the design missed the actual goal by a significant margin.

What a Correct Deployment Would Have Required

Good intent engineering for a customer service deployment starts with an honest answer to: what outcome are we actually trying to achieve when a customer contacts us?

For most service businesses, the answer isn't "close the ticket." It's closer to: "leave this customer feeling that they were heard, that the company cares about getting it right, and that we earned another year of their business." That objective requires explicit trade-off rules baked into the agent's operating parameters: when to escalate vs. resolve, when to prioritize speed vs. quality, what situations automatically route to a human, what a good outcome looks like beyond the ticket status.

The tool for doing this isn't more prompting. It's encoding organizational purpose as machine-actionable parameters — the decision boundaries, escalation triggers, and value hierarchies that tell the agent not just what to do, but what matters and why.
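What "machine-actionable parameters" can look like in practice is easiest to see in code. The sketch below is purely illustrative: every name and threshold is invented for this example, not taken from Klarna's actual configuration.

```python
from dataclasses import dataclass

# Illustrative sketch only: the names and thresholds below are invented,
# not Klarna's real configuration. The point is that escalation triggers
# and trade-off rules live in explicit, testable parameters rather than
# in prose prompts.

@dataclass
class ServiceIntent:
    # Value hierarchy: ordered goals, highest priority first.
    value_hierarchy: tuple = ("customer_feels_heard", "correct_resolution", "speed")
    escalate_dispute_over: float = 100.0      # dollars; above this, a human decides
    escalate_tenure_years: int = 3            # long-tenure customers in distress get humans
    max_extra_minutes_for_retention: int = 5  # how far speed may be traded for quality

def should_escalate(ticket: dict, intent: ServiceIntent) -> bool:
    """Route to a human when the situation needs judgment, not speed."""
    if ticket.get("dispute_amount", 0) > intent.escalate_dispute_over:
        return True
    if (ticket.get("tenure_years", 0) >= intent.escalate_tenure_years
            and ticket.get("sentiment") == "distressed"):
        return True
    return bool(ticket.get("policy_exception_requested", False))
```

The specific rules matter less than the fact that they exist: someone decided, in advance and in writing, when the agent should slow down instead of speed up.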

The practical lesson for your business: If you're deploying an agent on anything customer-facing, write down your real objective before you write a single instruction to the agent. "Resolve tickets" is a task. "Leave customers feeling heard and confident in the relationship" is a goal. The gap between those two things is where Klarna's $40 million unraveled.


StrongDM: The Software Factory

StrongDM is a security infrastructure company. In early 2026, CTO Justin McCarthy publicly disclosed that their three-person engineering team targets $1,000 per engineer per day in token spend. No human on the team writes code. No human reviews code. The agents build, test, and ship it.

Three people. The output of what required ten people eighteen months prior.

The way StrongDM solved the quality problem is the most interesting part. The obvious concern with agents writing all the code is that the agent also writes the tests — which means it can write tests that its code is guaranteed to pass, not tests that actually verify the right behavior. StrongDM uses what they call "scenarios": external behavioral tests written before any code is produced, describing what the deployed system should do from the outside. The agents can't see the scenarios. They can't write tests that game them. The only way to pass is to produce software that actually behaves correctly.
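StrongDM has not published its scenario format, but the shape of the idea can be sketched. A scenario specifies correct behavior purely through the system's external interface (modeled here as a callable that hits the deployed service); the agents writing the implementation never see this file, so they cannot write code that games it. The refund example is invented for illustration.

```python
# Hypothetical sketch of a "scenario" in the StrongDM sense. StrongDM
# has not published its actual format; the refund flow here is invented.
# The key property: the test talks only to the running system's external
# interface, imports nothing from the implementation, and is hidden from
# the agents that write the code.

def scenario_refund_flow(call) -> bool:
    """Behavioral spec: a refund within 30 days succeeds exactly once;
    repeat attempts and late requests are rejected."""
    first = call("refund", {"order_id": "A-1", "days_since_purchase": 10})
    second = call("refund", {"order_id": "A-1", "days_since_purchase": 10})
    late = call("refund", {"order_id": "B-2", "days_since_purchase": 45})
    return (first["status"] == "approved"
            and second["status"] == "rejected"
            and late["status"] == "rejected")
```

Any implementation that passes ships; any that fails does not. The scenario knows nothing about how refunds are coded, only how the deployed system must behave from the outside.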

This is seam design applied to software engineering. The humans do two things: write the scenarios (what correct behavior looks like) and evaluate whether the deployed output passes them. Those are judgment tasks. The coding, testing, and implementation are effort tasks. The seam is placed at exactly the right point.

The $1,000/day figure is important context. That's not reckless spending — it's cheaper than the ten-person team it replaced. Per-token inference costs have fallen 10-200x annually by most measures. What cost $20 per million tokens in late 2022 runs at roughly $3 today, and continues falling. The teams that are spending $1,000 per engineer per day are getting more production output per dollar than at any point in software engineering history.
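The arithmetic behind that claim, using the article's round numbers, works out as a quick back-of-envelope calculation (these are illustrative figures, not StrongDM's actual accounting):

```python
# Back-of-envelope only; figures are the article's round numbers, not
# StrongDM's actual accounting.

price_per_million_tokens = 3.00       # USD, rough current pricing
daily_budget_per_engineer = 1_000     # USD
engineers = 3
working_days = 260                    # per year

tokens_per_day = daily_budget_per_engineer / price_per_million_tokens * 1_000_000
annual_token_spend = engineers * daily_budget_per_engineer * working_days

print(f"{tokens_per_day / 1e6:,.0f}M tokens per engineer per day")  # ~333M
print(f"${annual_token_spend:,} per year in tokens")                # $780,000
```

Even at that burn rate, three engineers spend well under the fully loaded cost of the ten-person team the article says they replaced.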

StrongDM's model isn't directly applicable to most small businesses. But the underlying pattern is. The organization that wins isn't necessarily the one with the most intelligence budget — it's the one that knows where to place the seams. Where does judgment end and pattern-following begin? At that boundary, agents can run all night at a price that continues to drop.

What "No Human Writes Code" Actually Requires

Reaching StrongDM's operating model didn't happen by switching on an agent and walking away. It required:

  • Scenario-driven development — external behavioral tests that can't be gamed, written with deep product understanding
  • Explicit failure models — current, differentiated knowledge of how agents fail on specific task types
  • Ongoing calibration — the seam location isn't permanent. As models improve, what requires human judgment shifts. StrongDM presumably revisits what scenarios need to exist as the underlying models change.

The psychological difficulty of getting there is real. Most teams top out when they stop reviewing every line of code — not because the technology fails, but because of the discomfort of trusting outputs they haven't verified line by line. The teams that push through build the evaluation discipline that replaces line-by-line review: clear scenarios, external tests, behavioral verification.

The practical lesson for your business: You don't need to match StrongDM's token spend or engineering model. But the principle applies to any repetitive process: identify the pattern work that runs every week on a schedule, define what correct behavior looks like from the outside, and build verification that doesn't rely on reviewing every step. The combination of clear scenarios plus behavioral testing is what separates deployments that compound in value from ones that require constant babysitting.


What These Three Cases Have in Common

Three different industries. Three different outcomes. But one consistent underlying dynamic.

The deployments that worked placed the seam at the right point: the agent handles retrieval (Morgan Stanley); the agents handle implementation, verified by external scenarios (StrongDM). The deployment that failed placed the seam wrong — or more precisely, never defined where the seam should be at all (Klarna).

In every case, the AI model wasn't the variable. The models deployed by Klarna in 2024 were state-of-the-art. Morgan Stanley's assistant ran on GPT-4. What varied was whether the people deploying these agents had done the thinking about:

  1. What does correct behavior actually look like? Not "resolve the ticket" but "earn another year of this customer's business." Not "write code" but "pass these scenarios we wrote in advance."

  2. What trade-offs is the agent allowed to make? Speed vs. quality. Efficiency vs. relationship. No agent makes these trade-offs well unless someone decided in advance how they should be made.

  3. Where does the agent stop and human judgment begin? The Morgan Stanley advisor is still the one calling the client. StrongDM engineers are still the ones writing the scenarios. Klarna's agents were handling situations that required judgment they didn't have.

Getting from a working demo to a production deployment is mostly the work of answering those three questions for your specific context — and then encoding the answers in a way the agent can actually use. That work doesn't happen in the AI configuration. It happens before you touch the configuration.


FAQ

Q: What's the best AI agent example for a small business to learn from? A: Morgan Stanley's knowledge retrieval deployment is the most instructive for most small businesses. It illustrates clean seam design: the agent handles high-volume retrieval; the human provides judgment and relationship. That pattern applies across industries — from a contractor who wants an agent that surfaces the right project history before a customer call, to a clinic where an agent handles intake and the clinician handles care.

Q: Why did Klarna's AI agent fail despite impressive metrics? A: The agent succeeded at the task it was given (resolving tickets quickly) but failed at the underlying goal (building customer relationships that drive retention). This is the intent gap — the distance between the metric you optimize for and the outcome your business actually needs. The lesson isn't that customer service agents don't work; it's that they require explicit encoding of what "good" looks like beyond task completion speed.

Q: How much does it cost to run AI agents like StrongDM? A: StrongDM's team runs approximately $1,000 per engineer per day in token costs — but that's cheaper than the ten-person team it replaced. For most small businesses, the relevant question is narrower: what's the cost to run an agent on your specific high-volume, repetitive work? For a follow-up sequence or scheduling agent, the cost is typically under $50/month for a business with normal volume. The economics have shifted dramatically in the last two years.

Q: Are there AI agent examples from smaller businesses, not just enterprises? A: The most documented cases tend to be large enough to have PR departments. But the patterns apply at any scale. A three-person HVAC operation automating follow-up sequences and review requests is running the same playbook as Shopify's merchant support operation — agents handle pattern work, humans handle judgment. The seam design principles don't change with company size.

Q: What's the difference between an AI agent that works and one that doesn't? A: In every case that's been documented well enough to analyze, the difference is whether the deploying team encoded the actual goal — not just the task. Resolution time is a task metric. Customer retention is the goal. Code that passes tests is a task metric. Software that behaves correctly in production is the goal. Agents optimize for what you measure. Measure the right things.

Q: How do I know what to automate first in my business? A: Start by separating effort work from judgment work. Effort work is high-volume, follows a pattern, and takes real hours without being intellectually demanding: follow-ups, reminders, scheduling confirmations, status updates, data entry. Judgment work requires context that's hard to encode: reading a complicated client situation, deciding whether to bend a policy, handling conflict. Agents handle effort work reliably right now. Mapping which tasks fall where is the most valuable hour you'll spend before your first deployment.


Associates AI helps businesses do the operational work these case studies require — defining actual goals (not just tasks), placing seams correctly, building verification that doesn't require reviewing every output. If you want to understand what that looks like for your specific workflows, book a call.


Written by

Mike Harrison

Founder, Associates AI

Mike is a self-taught technologist who has spent his career proving that unconventional thinking produces the most powerful solutions. He built Associates AI on the belief that every business — regardless of size — deserves AI that actually works for them: custom-built, fully managed, and getting smarter over time. When he's not building agent systems, he's finding the outside-of-the-box answer to problems that have existed for generations.


Ready to put AI to work for your business?

Book a free discovery call. We'll show you exactly what an AI agent can handle for your business.
