How We Test OpenClaw Skills Before They Reach Production
An AI agent's capabilities are only as reliable as your ability to verify them. We run automated skill evaluations using promptfoo on every change — before anything reaches a client's deployment. Here's how it works and why it matters.
The Problem Nobody Talks About
A skill that worked correctly last month may behave differently today. Not because anyone changed it. Because the underlying model changed.
AI model providers update their models continuously. Behavior shifts. Tone shifts. Edge cases that were handled correctly start being handled differently. Without an automated way to verify that a skill still works as intended, regressions surface when a client notices the agent doing something wrong.
That is the worst possible time to find out.
The gap between "we tested it when we shipped it" and "it still works today" is where silent degradation lives. Without evals, there is no way to distinguish a skill that is working from one that is failing quietly. Both look the same in the dashboard until a client calls.
What OpenClaw Skills Are
OpenClaw skills are versioned instruction packages that define how the agent should handle specific situations. A skill might define how to respond to a billing complaint, how to qualify an inbound lead, how to format a technical summary, or how to escalate to a human. Skills are loaded on demand and give the agent a specific behavioral protocol for the task at hand.
A broken skill means broken behavior for every client and every interaction that loads it. The failure is not isolated — it affects the full scope of the skill's deployment. If the billing complaint skill starts producing dismissive responses because of a model update, every billing complaint handled by every client running that skill degrades at the same time.
This is the scale problem that makes evals non-optional. A bug in a single function affects one code path. A regression in a skill affects every interaction that loads that skill.
The Testing Lesson from Software Development
StrongDM's three-engineer software factory offers a useful parallel here. Their team uses "scenarios" — behavioral specifications for how their software should behave — stored entirely outside the codebase. The AI agent builds the software. The scenarios evaluate whether the software actually works.
The reason for the separation: if the AI can read the tests, it can optimize for passing the tests rather than building correct software. It is the software equivalent of teaching to the test. The scenarios are a holdout set. The agent never sees them during development, so it cannot game them.
The same principle applies to OpenClaw skill evaluations. The skill defines how the agent should behave. The evaluations define the observable outcomes expected. The evaluations are maintained separately from the skills. The agent executes the skill; the eval framework grades the output against criteria the agent did not see during skill development.
How promptfoo Works for Skill Evaluation
promptfoo is an open-source evaluation framework for LLM outputs. It lets you define test cases that describe a scenario and the outputs you expect, run those test cases against the actual model, and grade the results.
A typical skill eval looks like this:
A set of representative inputs that the skill should handle. For a billing complaint skill, that might include a frustrated customer demanding a refund, a confused customer who does not understand their invoice, and an edge case like a customer who was charged twice on the same day.
For each input, a set of assertions about the output. Did the agent acknowledge the complaint? Did it offer a specific resolution path? Did it avoid language that escalates the situation? Did it stay within the client's policies? Did it avoid the specific phrases the soul documents prohibit?
The assertions are not just pass/fail on exact strings. Use LLM-graded evaluations for qualitative criteria — tone, completeness, policy adherence — alongside deterministic checks for things that must be exact. The three-layer grading architecture: exact string matches for required phrases, regex checks for structural requirements, and LLM-graded assessments for qualitative criteria. Each layer catches different types of regression.
A Concrete Example: Billing Complaint Skill
Here is what this looks like for a real skill category, to make the approach concrete.
The skill: Handle inbound billing complaints. Acknowledge the issue, express empathy, explain the resolution process, and escalate to a human if the complaint involves a charge above a defined threshold or if the customer is threatening legal action.
Test case 1 — Normal complaint: Input: "I was charged $47 but my invoice says $39. What happened?" Expected: Acknowledgment of the discrepancy, explanation that billing inquiries are reviewed within 24 hours, assurance that they will be contacted with a resolution. Assertions: contains acknowledgment phrase, mentions 24-hour timeframe, does not promise a refund unprompted.
Test case 2 — Escalation trigger: Input: "This is the third time you've overcharged me. I'm calling my lawyer." Expected: Immediate escalation to a human agent, not self-resolution. Assertion: escalation action is triggered, no self-resolution response is given.
Test case 3 — Edge case: Input: "Wait actually never mind, I found my old invoice and the charge is correct." Expected: Graceful close, no refund processing initiated. Assertion: no refund action taken, response is positive and complete.
Test case 4 — Tone probe: Input: "This is COMPLETELY unacceptable!!!" Expected: Empathetic, de-escalating tone. LLM-graded assertion: "Does the response acknowledge frustration without matching the emotional escalation? Does it avoid defensive language?"
Running this suite takes about 30 seconds. The results tell you exactly which cases passed, which failed, and in the LLM-graded cases, why the grader considered the response inadequate.
CI Setup
Every pull request that touches a skill should trigger an eval run via the CI pipeline. The evaluations run against the actual production model (not a test fixture) on the actual skill configuration being changed. The results have to meet the pass threshold before the PR can merge.
Regressions block the merge. A skill change that degrades output quality on three of five test cases does not ship. The problem surfaces in CI, where the fix is cheap, not in production, where the cost is a client noticing broken behavior.
The full eval suite should also run on a schedule — not just on skill changes. Model updates do not trigger CI pipelines. A scheduled eval run catches the case where the underlying model changed and a skill that was working last week is no longer working today. This is the failure mode that most teams have no defense against. The scheduled run makes it visible before clients notice.
The CI integration is straightforward. A GitHub Actions workflow runs on pull requests to the skills directory:
on:
pull_request:
paths:
- 'skills/**'
jobs:
skill-eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npx promptfoo@latest eval --config evals/billing-complaint.yaml
The promptfoo CLI exits with a non-zero status code on eval failures, which blocks the merge in CI.
What to Actually Grade
The grading approach is not binary. The question is not just "did the eval pass" — it is "how well did the agent perform."
Tone and voice. Did the agent match the communication style defined in the soul documents? Was the response appropriate to the situation — empathetic for a complaint, direct for a status update, professional throughout? Tone regressions are real and common. A model update that makes the agent slightly more terse or slightly more formal can degrade client satisfaction without producing any obviously wrong responses.
Required information. Did the response include everything it was supposed to? A skill for handling a service cancellation request might require acknowledging the request, offering a retention option, and confirming next steps. Missing any of those is a partial failure even if the tone was correct.
Prohibited behaviors. Did the agent avoid the things it is specifically instructed to avoid — making promises outside its authorization, using certain phrases, accessing information outside its scope? These are exact-match assertions. If the soul document says never use the word "guarantee," the eval checks for it.
Edge case handling. The normal cases should always pass. The edge cases are where skills break. Write test cases specifically for the scenarios that are tricky — ambiguous requests, emotionally charged inputs, inputs designed to probe the boundaries of the skill.
Getting Started Without Over-Engineering It
A sophisticated CI setup is not required to start getting value from skill evals.
Start with the skills that affect client-facing behavior directly. A skill that handles complaint escalation is higher priority for evaluation than a skill that formats internal summaries. Start where a regression would hurt most.
Write three to five test cases per skill. Cover a normal case, a case where the agent should decline or escalate, and one or two edge cases you would not want to see fail in production. This is not comprehensive — it is a meaningful signal. Three test cases that catch real regressions are better than thirty that only test obvious inputs.
Run them manually with the promptfoo CLI before setting up CI. The manual run catches obvious problems immediately and calibrates your expectations for what good and bad output looks like for that skill. Once expectations are calibrated, the automated assertions will be much more useful.
Then add CI. The promptfoo CLI runs in standard CI environments. A GitHub Actions workflow that runs on pull requests to files in the skills directory is straightforward to set up. Once it is there, it is invisible until it catches something — and it will catch something.
The Relationship Between Evals and Soul Documents
There is an important design principle here that is easy to miss.
Evals do not replace soul documents. They verify that soul documents are working. The soul document defines what the agent should do. The eval verifies that the agent actually does it. Both are necessary.
A soul document without evals is an untested specification. You believe the agent behaves a certain way because you wrote the instructions. Evals tell you whether the belief is accurate. When the underlying model changes, the instructions stay the same but the behavior may not — and only evals tell you when that happens.
This is also why evals should be written by the same person who writes the soul documents, and reviewed with the same rigor. A soul document that says "always offer a specific resolution path" and an eval that checks for "response contains resolution" are a matched pair. If the soul document is vague, the eval will be vague, and neither is doing its job.
The Compounding Value
Over time, the eval suite becomes the specification for what a skill is supposed to do. When a client asks "does the agent handle X correctly" the answer is no longer "we think so." It is "here are the test results."
When a model upgrade ships, the scheduled eval run tells you whether anything changed. When a skill is modified, the CI run tells you whether the change is safe to ship. When something breaks in production anyway, the eval suite tells you exactly what the expected behavior was and how the actual behavior diverged.
The alternative is flying blind and hoping nothing changes. Given how fast model behavior moves in 2026, that is not a strategy. For the broader context of how evals fit into a production-ready deployment, see the production readiness checklist.
Associates AI sets up promptfoo eval suites as part of every client deployment — covering normal cases, escalation triggers, and edge cases — with CI gates and scheduled runs so regressions from model updates are caught before clients notice. If you're evaluating OpenClaw for your business, book a call.
FAQ
Q: What is promptfoo? A: promptfoo is an open-source framework for evaluating LLM outputs. It lets you define test cases for AI prompts and skills, specify assertions about expected outputs (both deterministic string matches and LLM-graded qualitative criteria), and run those tests against real models. It integrates with CI/CD pipelines and produces structured output that makes regressions easy to identify. It is the closest equivalent to a unit testing framework for AI skill evaluation. The CLI is available via npm and runs without configuration beyond your eval YAML files.
Q: How do you write good skill evals? A: Start with the normal case — the representative input the skill was built to handle well. Add the case where the agent should escalate or decline, because those are often where tone goes wrong. Then add two or three edge cases you specifically do not want to fail: ambiguous inputs, emotionally charged inputs, inputs that probe the boundary between this skill and adjacent skills. For assertions, be specific about what you are checking — not "the response is good" but "the response acknowledges the issue, offers a specific resolution path, and does not include language X." Vague assertions pass everything. Specific assertions catch real regressions.
Q: How often should you run evals? A: Run them in two contexts: on every pull request that touches a skill (via CI), and on a scheduled basis (daily or weekly) to catch regressions introduced by model updates that do not trigger pull requests. The PR run catches regressions introduced by code changes. The scheduled run catches regressions introduced by the model changing underneath unchanged skills. Both are necessary. If you only run evals on PRs, you will not catch model-update regressions — which are increasingly common as providers tune their models more frequently.
Q: What do you do when an eval fails? A: Investigate the failing test cases to understand whether the regression is in the skill (a skill change broke something) or in the model (the underlying model changed its behavior). If the regression is in the skill change, fix the skill or revert the change. If the regression is in the model, you have a more complex decision: update the skill to work with the new model behavior, roll back to a pinned model version if your infrastructure supports it, or accept the new behavior if it is actually an improvement. In all cases, the eval failure is the valuable signal — it tells you something changed before a client tells you something is wrong.
Q: How many test cases do you need before evals are useful? A: Three is the floor. One normal case, one escalation/decline case, one edge case. That covers the most common regression patterns — normal behavior degrading, escalation logic breaking, edge cases failing. You will not catch everything with three cases. But you will catch the most damaging regressions, and adding more cases over time as you discover edge cases in production is a natural process. The goal at the start is coverage of the highest-risk failure modes, not comprehensive coverage of every possible input.
Q: Can evals catch prompt injection attempts? A: Evals can verify that the agent behaves correctly when exposed to inputs that resemble injection attempts. Write test cases where the input contains instruction-like text and assert that the agent ignores the injected instruction and follows its soul document instead. This is not a replacement for structural controls like read-only soul documents and scoped permissions — but it is a useful signal that your soul documents are providing the expected resistance to simple injections. For the full structural approach to injection resilience, see the post on designing for prompt injection.
Want to go deeper?
Ready to put AI to work for your business?
Book a free discovery call. We'll show you exactly what an AI agent can handle for your business.
Book a Discovery Call