AI Agent Testing & Evaluation Guide
A practical framework for testing AI agents from unit tests to production monitoring, with evaluation patterns that scale.
Testing AI agents is different from testing traditional code, but not mysterious. You still need unit tests, integration tests, and regression checks - you just have to handle probabilistic outputs.
This guide covers:
- Unit tests for tools
- Integration tests for workflows
- Evaluation frameworks
- Benchmarking approaches
- A/B testing behavior
- Regression testing
- Production monitoring
1) Unit Testing Agent Tools
Start with the deterministic parts. Every agent tool (search, fetch, file writes) should have a standard test suite.
describe("tool:parseInvoice", () => {
it("extracts totals", async () => {
const result = await parseInvoice(sampleDoc);
expect(result.total).toBe(129.90);
});
});
These tests catch 80% of issues before the LLM is involved.
Testing Tool Error Handling
Don't just test the happy path. Agents need to recover gracefully when tools fail:
describe("tool:parseInvoice - error cases", () => {
it("returns a structured error for malformed PDFs", async () => {
const result = await parseInvoice(corruptedDoc);
expect(result.error).toBe("PARSE_FAILED");
expect(result.total).toBeUndefined();
});
it("handles timeout gracefully", async () => {
jest.useFakeTimers();
const promise = parseInvoice(slowDoc);
jest.advanceTimersByTime(30_000);
await expect(promise).rejects.toThrow("TIMEOUT");
});
});
Schema Validation Tests
If your tools accept structured input, test the schema boundaries:
describe("tool:searchPeople - input validation", () => {
it("rejects empty queries", async () => {
await expect(searchPeople({ query: "" })).rejects.toThrow();
});
it("truncates queries over 500 chars", async () => {
const result = await searchPeople({ query: "a".repeat(600) });
expect(result.queryUsed.length).toBeLessThanOrEqual(500);
});
});
2) Integration Testing Agent Workflows
Integration tests simulate a full run. Use recorded tool responses so tests are repeatable.
const mockTools = {
  search: async () => mockSearchResult,
  fetch: async () => mockPage,
};

const result = await runAgentWorkflow(input, mockTools);
expect(result.status).toBe("success");
Focus on workflow correctness, not exact wording. If your agent system uses multi-agent orchestration, test each stage independently before testing the full pipeline.
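For stage-level tests, feed a checked-in fixture to the stage in place of live upstream output. A minimal sketch, assuming a hypothetical summarizeStage function and an illustrative fixture shape:

// Sketch: test one stage in isolation. `summarizeStage` and the fixture
// fields are assumptions — substitute your own stage function and types.
describe("stage: summarize", () => {
  it("produces a summary from a fixed research payload", async () => {
    // Upstream output is a committed fixture, not a live search/fetch run.
    const researchFixture = {
      sources: [{ url: "https://example.com", content: "Agents need repeatable tests." }],
    };
    const result = await summarizeStage(researchFixture);
    expect(result.summary.length).toBeGreaterThan(0);
  });
});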
Multi-Step Workflow Tests
Real agents chain multiple tools. Test the full sequence:
describe("workflow: research-and-summarize", () => {
it("completes a 3-step research workflow", async () => {
const mockTools = {
search: async () => ({ results: [{ url: "https://example.com", title: "Test" }] }),
fetch: async () => ({ content: "Article about AI agents in production..." }),
summarize: async (text: string) => ({ summary: "AI agents need testing." }),
};
const result = await runAgentWorkflow(
{ goal: "Research AI agent testing" },
mockTools
);
expect(result.status).toBe("success");
expect(result.steps).toHaveLength(3);
expect(result.output.summary).toBeDefined();
});
it("handles mid-workflow tool failures", async () => {
const mockTools = {
search: async () => ({ results: [{ url: "https://example.com" }] }),
fetch: async () => { throw new Error("Network timeout"); },
summarize: async () => ({ summary: "" }),
};
const result = await runAgentWorkflow(
{ goal: "Research AI agent testing" },
mockTools
);
expect(result.status).toBe("partial");
expect(result.failedSteps).toContain("fetch");
});
});
3) Evaluation Frameworks
Use structured evaluations to compare changes. This is where agent testing diverges most from traditional software - you need to measure quality, not just correctness.
Common approaches:
- Rubric scoring (quality, relevance, safety)
- Pairwise comparisons (A vs B outputs; see the sketch below)
- Task completion rate
- LLM-as-judge (use a separate model to grade outputs)
Avoid fake benchmarks. Use real samples from your domain.
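To make pairwise comparisons concrete, here is a minimal win-rate harness. It's a sketch, not a prescribed implementation: runVariant and the judge callback are placeholders for your own agent runner and grader (a human label or an LLM-as-judge call works equally well here).

type Judge = (a: string, b: string, input: string) => Promise<"A" | "B" | "tie">;

// Pairwise-comparison harness (sketch). `runVariant` is a hypothetical wrapper
// around your agent with a given prompt/model; `judge` decides the winner.
async function pairwiseWinRate(
  inputs: string[],
  runVariant: (variant: "A" | "B", input: string) => Promise<string>,
  judge: Judge
): Promise<number> {
  let winsForB = 0;
  let decided = 0;

  for (const input of inputs) {
    const [a, b] = await Promise.all([runVariant("A", input), runVariant("B", input)]);
    const verdict = await judge(a, b, input);
    if (verdict === "tie") continue;
    decided += 1;
    if (verdict === "B") winsForB += 1;
  }

  // Fraction of decided comparisons where variant B beat variant A.
  return decided === 0 ? 0.5 : winsForB / decided;
}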
Building a Rubric-Based Evaluator
Here's a practical rubric scorer you can adapt:
type RubricDimension = {
  name: string;
  weight: number;
  scorer: (output: string, expected: string) => number; // 0-1
};

const rubric: RubricDimension[] = [
  {
    name: "relevance",
    weight: 0.4,
    scorer: (output, expected) => {
      // Simple keyword overlap - replace with embedding similarity in production
      const keywords = expected.toLowerCase().split(" ");
      const matches = keywords.filter(k => output.toLowerCase().includes(k));
      return matches.length / keywords.length;
    },
  },
  {
    name: "completeness",
    weight: 0.3,
    scorer: (output) => {
      return output.length > 200 ? 1 : output.length / 200;
    },
  },
  {
    name: "safety",
    weight: 0.3,
    scorer: (output) => {
      const blocklist = ["password", "secret", "token"];
      // Lowercase the output so "Password" is caught as well as "password"
      return blocklist.some(w => output.toLowerCase().includes(w)) ? 0 : 1;
    },
  },
];

function evaluateOutput(output: string, expected: string): number {
  return rubric.reduce((score, dim) => {
    return score + dim.weight * dim.scorer(output, expected);
  }, 0);
}
LLM-as-Judge Pattern
For nuanced evaluation, use a second model (like Claude) to grade the output:
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function llmJudge(output: string, criteria: string): Promise<number> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 10,
    messages: [{
      role: "user",
      content: `Rate this output 1-10 on: ${criteria}\n\nOutput: ${output}\n\nScore (number only):`,
    }],
  });

  const block = response.content[0];
  return block.type === "text" ? parseInt(block.text, 10) : NaN;
}
4) Benchmarking Approaches
Benchmarks should answer one question: is this change better for users?
Good benchmarks:
- Real user tasks
- Fixed input sets
- Clear pass/fail criteria
Bad benchmarks:
- Overly synthetic prompts
- Single "hero" examples
- Metrics that don't map to user value
For ranking agent quality over time, an Elo-based rating system gives you a more nuanced signal than pass/fail benchmarks alone.
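As a reference point, here's a minimal sketch of a standard Elo update after a single head-to-head comparison. The K-factor of 32 is a conventional default, not a requirement.

// Elo update for one comparison between two agent versions.
// `scoreA` is 1 if A won, 0 if A lost, 0.5 for a tie.
function eloUpdate(
  ratingA: number,
  ratingB: number,
  scoreA: number,
  k = 32
): { ratingA: number; ratingB: number } {
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  return {
    ratingA: ratingA + k * (scoreA - expectedA),
    ratingB: ratingB + k * ((1 - scoreA) - (1 - expectedA)),
  };
}

// Example: a 1500-rated prompt beats a 1520-rated one.
// eloUpdate(1500, 1520, 1) -> { ratingA: ~1516.9, ratingB: ~1503.1 }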
5) A/B Testing Agent Behavior
In production, use A/B tests to compare:
- Prompt variants
- Model versions (Claude Opus vs GPT-5 Codex)
- Tool ordering
Example: route 10% of traffic to a new prompt, then measure task success and cost.
A/B Test Implementation
Here's a minimal A/B test router for agent behavior:
import * as crypto from "node:crypto";

function assignVariant(userId: string, experimentId: string): "control" | "treatment" {
  const hash = crypto.createHash("md5").update(`${userId}:${experimentId}`).digest("hex");
  return parseInt(hash.slice(0, 2), 16) < 26 ? "treatment" : "control"; // ~10% of traffic
}

async function runWithExperiment(userId: string, task: AgentTask) {
  const variant = assignVariant(userId, "prompt-v2-test");
  const prompt = variant === "treatment" ? newPromptV2 : currentPrompt;

  const result = await runAgent({ ...task, prompt });

  await logExperiment({
    userId,
    experiment: "prompt-v2-test",
    variant,
    success: result.status === "success",
    cost: result.tokenCost,
    latencyMs: result.durationMs,
  });

  return result;
}
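On the analysis side, start by comparing per-variant success rates and average cost before reaching for anything fancier. A sketch, assuming the records written by logExperiment can be read back as an array:

type ExperimentLog = {
  variant: "control" | "treatment";
  success: boolean;
  cost: number;
};

// Aggregate per-variant success rate and average cost from logged runs.
// How you fetch `logs` depends on your store; this only shows the comparison.
function summarizeExperiment(logs: ExperimentLog[]) {
  const byVariant = (v: "control" | "treatment") => logs.filter(l => l.variant === v);
  const stats = (group: ExperimentLog[]) => ({
    runs: group.length,
    successRate: group.length ? group.filter(l => l.success).length / group.length : 0,
    avgCost: group.length ? group.reduce((sum, l) => sum + l.cost, 0) / group.length : 0,
  });
  return { control: stats(byVariant("control")), treatment: stats(byVariant("treatment")) };
}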
6) Regression Testing
Every prompt or workflow change can break something. Keep a regression suite.
describe("regression suite", () => {
  const regressionSet = loadRegressionCases();

  regressionSet.forEach((testCase, index) => {
    it(`holds the line on case #${index}`, async () => {
      const output = await runAgent(testCase.input);
      expect(score(output)).toBeGreaterThan(testCase.threshold);
    });
  });
});
Store regression cases in version control. Don't rely on memory.
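One possible shape for a version-controlled regression case; the field names are illustrative, not prescribed:

// Hypothetical shape for a checked-in regression case. Match the fields to
// whatever runAgent and score actually expect in your codebase.
type RegressionCase = {
  id: string;
  input: { goal: string };
  threshold: number;  // minimum acceptable score (0-1)
  addedAt: string;    // when the case entered the suite
  notes?: string;     // why it exists
};

const exampleCase: RegressionCase = {
  id: "invoice-totals-regression-001",
  input: { goal: "Extract totals from the attached invoice batch" },
  threshold: 0.8,
  addedAt: "2025-01-15",
  notes: "Added after a prompt change dropped totals from multi-page PDFs.",
};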
7) Monitoring in Production
Testing doesn't stop after deployment. Monitor:
- Failure rates
- Latency
- Cost per run
- User feedback signals
- Output quality scores (sampled)
If metrics drift, roll back or adjust quickly.
Production Monitoring Setup
For Convex‑backed agent apps, I store run metrics as structured events:
// convex/metrics.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const logRunMetrics = mutation({
  args: {
    runId: v.string(),
    agentId: v.string(),
    status: v.string(),
    latencyMs: v.number(),
    tokenCount: v.number(),
    estimatedCost: v.number(),
    qualityScore: v.optional(v.number()),
  },
  handler: async (ctx, args) => {
    await ctx.db.insert("runMetrics", {
      ...args,
      createdAt: Date.now(),
    });
  },
});
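Calling it from a Node-side runner might look like the sketch below, assuming a ConvexHttpClient and a result object shaped like the earlier examples; the import path for the generated api depends on your project layout.

// Sketch: log metrics after a run from a Node process. The result shape and
// import path are assumptions — adapt to your app.
import { ConvexHttpClient } from "convex/browser";
import { api } from "./convex/_generated/api";

type AgentRunResult = {
  status: string;
  durationMs: number;
  tokenCount: number;
  tokenCost: number;
  qualityScore?: number; // only set for sampled runs
};

const convex = new ConvexHttpClient(process.env.CONVEX_URL!);

async function recordRun(runId: string, agentId: string, result: AgentRunResult) {
  await convex.mutation(api.metrics.logRunMetrics, {
    runId,
    agentId,
    status: result.status,
    latencyMs: result.durationMs,
    tokenCount: result.tokenCount,
    estimatedCost: result.tokenCost,
    qualityScore: result.qualityScore,
  });
}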
Alerting on Drift
Set thresholds and alert when metrics deviate:
async function checkDrift(agentId: string) {
  const recent = await getRecentMetrics(agentId, { hours: 1 });
  const baseline = await getBaselineMetrics(agentId);

  if (recent.avgLatency > baseline.avgLatency * 1.5) {
    await alert(`Agent ${agentId} latency spiked 50%+`);
  }

  if (recent.failureRate > 0.1) {
    await alert(`Agent ${agentId} failure rate above 10%`);
  }
}
Practical Testing Pipeline
Unit Tests -> Integration Tests -> Eval Suite -> Regression -> Production Monitoring
Treat evaluation as a pipeline, not a one-off event.
Final Advice
Agent testing is about confidence, not perfection. Build a system where changes are safe to ship. That means deterministic tool tests, realistic evaluation sets, and continuous monitoring.
Don't neglect security testing — a broken auth check in your agent pipeline is worse than a bad prompt. If you're building your first agent, the complete builder's guide covers the full architecture. And when you're ready to take agents from prototype to production, the concept‑to‑production guide covers reliability patterns and monitoring in depth.
The AI Agent Masterclass walks through this entire testing pipeline with real examples, including evaluation patterns I use across production agent systems.
If you do this well, you can move fast and sleep at night.