Building Production AI Agents: Lessons from 300+ Commits
Hard-won lessons from building and deploying 14+ AI agents in production — error handling, monitoring, cost management, and the patterns that actually work.
Over the past year I've built and shipped 14 AI agents across intake pipelines, content engines, proposal generators, client management systems, and tax automation workflows. North of 300 commits. Some of those agents are humming along beautifully. Others taught me expensive lessons at 3 AM.
This isn't a tutorial. It's a field report. The patterns that survived contact with real users, real API rate limits, and real invoices from OpenAI and Anthropic.
The Reality Check Nobody Gives You
Most AI agent content online shows you the happy path. Call the model, get a response, done. Production is nothing like that.
In production, the model returns malformed JSON 2% of the time. Your API key rotates and nobody updates the secret. A user sends a 47,000-token document and your carefully tuned prompt blows past the context window. The model hallucinates a function call that doesn't exist. Your monthly bill jumps from $180 to $2,400 because someone left a retry loop running over the weekend.
Every single one of those happened to me. Some of them happened twice.
The gap between "works in development" and "works in production" for AI agents is wider than any other software I've built. Traditional software fails predictably. AI agents fail creatively.
The 5 Things That Will Break First
After deploying 14 agents, I can tell you with confidence what breaks first. Every time.
1. Output Parsing
Your agent returns structured data. Except when it doesn't. Even with function calling and structured outputs, models occasionally return markdown-wrapped JSON, extra whitespace, or completely ignore your schema.
// Don't do this
const result = JSON.parse(response.content);

// Do this
function safeParseOutput<T>(raw: string, schema: z.ZodSchema<T>): T | null {
  // Strip markdown code fences if present
  const cleaned = raw
    .replace(/^```(?:json)?\n?/gm, '')
    .replace(/\n?```$/gm, '')
    .trim();

  try {
    const parsed = JSON.parse(cleaned);
    return schema.parse(parsed);
  } catch (e) {
    logger.warn('Output parse failed', {
      raw: raw.substring(0, 500),
      error: e instanceof Error ? e.message : 'Unknown',
    });
    return null;
  }
}
I use Zod for every single agent output. No exceptions. If the model returns garbage, I want to know exactly which field failed, not just "SyntaxError: Unexpected token."
2. Rate Limits and Timeouts
Every LLM provider has rate limits. They're not consistent, they're not always documented, and they change. My intake pipeline agent got throttled during a product launch because we hit TPM (tokens per minute) limits nobody planned for.
async function callWithBackoff<T>(
  fn: () => Promise<T>,
  opts: { maxRetries?: number; baseDelayMs?: number } = {}
): Promise<T> {
  const { maxRetries = 3, baseDelayMs = 1000 } = opts;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRetryable =
        err?.status === 429 ||
        err?.status === 503 ||
        err?.code === 'ECONNRESET';

      if (!isRetryable || attempt === maxRetries) throw err;

      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 500;
      logger.info(`Retry ${attempt + 1}/${maxRetries} after ${Math.round(delay)}ms`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw new Error('Unreachable');
}
Exponential backoff with jitter. Not optional. The jitter matters — without it, all your retries hit the API at the same time and you stay throttled.
3. Context Window Overflows
Users will send you things that don't fit. A 200-page PDF. A database dump. An email thread with 47 replies. Your agent needs to handle this before it hits the model.
function truncateToTokenBudget(
  text: string,
  maxTokens: number,
  encoding: TiktokenEncoding = 'cl100k_base'
): string {
  const enc = getEncoding(encoding);
  const tokens = enc.encode(text);

  if (tokens.length <= maxTokens) return text;

  // js-tiktoken's decode() already returns a string, so no TextDecoder is needed
  const truncated = enc.decode(tokens.slice(0, maxTokens));
  logger.warn(`Truncated input from ${tokens.length} to ${maxTokens} tokens`);
  return truncated + '\n\n[Content truncated]';
}
I learned this the hard way when my proposal generator hit a 128k context window with a client brief that was mostly copy-pasted legal boilerplate. The model charged me for the full input and returned nonsense. Now every agent has a token budget enforced before the API call.
4. Costs Spiraling Without Warning
This is the silent killer. You deploy, it works, you forget about it. Three weeks later you check your dashboard and you've burned through $800 more than expected.
The culprit is almost always one of three things:
- Retry loops that succeed on the third try, while the cost of attempts one and two goes uncounted.
- Verbose system prompts that get sent with every single message in a conversation.
- Agents calling other agents in loops.
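The third one is the scariest because it compounds. A cheap safeguard is a hard cap on how deep agent-to-agent calls can nest. Here's a minimal sketch; runSubAgent and MAX_AGENT_DEPTH are illustrative names, not code lifted from my repos:

// Illustrative guard: cap how deep agent-to-agent calls can nest.
const MAX_AGENT_DEPTH = 3;

async function runSubAgent<T>(call: () => Promise<T>, depth: number): Promise<T> {
  if (depth >= MAX_AGENT_DEPTH) {
    // Fail fast instead of letting agents recurse until the bill explodes.
    throw new Error(`Agent call depth ${depth} exceeds limit of ${MAX_AGENT_DEPTH}`);
  }
  return call();
}

Every nested agent call passes depth + 1, so a runaway loop dies after three levels instead of running all weekend.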
5. State Management Across Multi-Step Workflows
Single-turn agents are easy. Multi-step agents — where step 3 depends on step 1's output and step 2's side effects — are where things get genuinely hard. Race conditions, partial failures, and "what state are we in?" become real problems.
I'll cover the pattern that works for this below.
Error Handling Patterns That Actually Work
After trying several approaches, I settled on a layered error handling strategy. Every agent has three layers.
Layer 1: Input Validation
Before any LLM call, validate and sanitize inputs. This catches 60% of issues before they cost you money.
interface AgentInput {
  validate(): ValidationResult;
  sanitize(): this;
  estimateTokens(): number;
}

function runAgent(input: AgentInput) {
  const sanitized = input.sanitize();
  const validation = sanitized.validate();

  if (!validation.ok) {
    return { success: false, error: validation.errors, stage: 'input' };
  }

  const estimatedTokens = sanitized.estimateTokens();
  if (estimatedTokens > TOKEN_BUDGET) {
    return { success: false, error: 'Input exceeds token budget', stage: 'input' };
  }

  // Proceed to LLM call...
}
Layer 2: LLM Call Wrapper
Every LLM call goes through a single wrapper that handles retries, timeouts, cost tracking, and output parsing. Never call the API directly from business logic.
interface LLMCallResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  usage: { inputTokens: number; outputTokens: number; costUsd: number };
  latencyMs: number;
  attempts: number;
}

async function llmCall<T>(opts: {
  model: string;
  messages: Message[];
  schema: z.ZodSchema<T>;
  maxRetries?: number;
  timeoutMs?: number;
}): Promise<LLMCallResult<T>> {
  const start = Date.now();
  let totalUsage = { inputTokens: 0, outputTokens: 0, costUsd: 0 };
  let attempts = 0;

  try {
    const raw = await callWithBackoff(
      async () => {
        attempts++;
        const resp = await provider.chat({
          model: opts.model,
          messages: opts.messages,
          timeout: opts.timeoutMs ?? 30_000,
        });
        totalUsage.inputTokens += resp.usage.input;
        totalUsage.outputTokens += resp.usage.output;
        totalUsage.costUsd += calculateCost(opts.model, resp.usage);
        return resp;
      },
      { maxRetries: opts.maxRetries }
    );

    const parsed = safeParseOutput(raw.content, opts.schema);
    if (!parsed) {
      return {
        success: false,
        error: 'Output schema validation failed',
        usage: totalUsage,
        latencyMs: Date.now() - start,
        attempts,
      };
    }

    return {
      success: true,
      data: parsed,
      usage: totalUsage,
      latencyMs: Date.now() - start,
      attempts,
    };
  } catch (err: any) {
    return {
      success: false,
      error: err.message,
      usage: totalUsage,
      latencyMs: Date.now() - start,
      attempts,
    };
  }
}
Notice the usage tracking on every call, including failed ones. This is critical. Failed calls still cost money.
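The wrapper leans on a calculateCost helper that runs on every attempt. Mine is just a lookup table; a sketch along these lines works, with the caveat that the output prices below are placeholders you would replace with your provider's current rate sheet:

// Per-model pricing in USD per 1K tokens. Placeholder numbers; keep them in
// sync with your provider's published rates.
const MODEL_PRICES: Record<string, { inputPer1k: number; outputPer1k: number }> = {
  'claude-opus-4-6': { inputPer1k: 0.015, outputPer1k: 0.075 },
  'claude-sonnet-4-20250514': { inputPer1k: 0.003, outputPer1k: 0.015 },
  'claude-haiku-4-20250514': { inputPer1k: 0.0008, outputPer1k: 0.004 },
};

function calculateCost(model: string, usage: { input: number; output: number }): number {
  const price = MODEL_PRICES[model];
  if (!price) return 0; // Unknown model: log it elsewhere rather than guessing
  return (usage.input / 1000) * price.inputPer1k + (usage.output / 1000) * price.outputPer1k;
}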
Layer 3: Graceful Degradation
When the LLM fails, don't just throw an error at the user. Have a fallback.
async function generateProposal(brief: ClientBrief): Promise<Proposal> {
  // Try primary model
  const result = await llmCall({
    model: 'claude-opus-4-6',
    messages: buildProposalPrompt(brief),
    schema: ProposalSchema,
  });
  if (result.success) return result.data!;

  // Fallback to faster, cheaper model
  logger.warn('Primary model failed, falling back', { error: result.error });
  const fallback = await llmCall({
    model: 'claude-sonnet-4-20250514',
    messages: buildProposalPrompt(brief),
    schema: ProposalSchema,
  });
  if (fallback.success) return fallback.data!;

  // Final fallback: template-based generation (no LLM)
  logger.error('All models failed, using template fallback');
  return generateFromTemplate(brief);
}
My content engine uses this exact pattern. Opus for quality, Sonnet as fallback, and a static template as the last resort. In six months of production, the template fallback has fired exactly three times — all during an API outage. Each time, the client got their content on time anyway.
Cost Optimization: What Actually Moves the Needle
I track every dollar across all 14 agents. Here's what made the biggest difference.
Model Routing
Not every task needs your most expensive model. I built a simple router that classifies tasks by complexity and routes accordingly.
const MODEL_TIERS = {
  simple: { model: 'claude-haiku-4-20250514', costPer1kInput: 0.0008 },
  standard: { model: 'claude-sonnet-4-20250514', costPer1kInput: 0.003 },
  complex: { model: 'claude-opus-4-6', costPer1kInput: 0.015 },
} as const;

function selectModel(task: AgentTask): keyof typeof MODEL_TIERS {
  if (task.requiresReasoning || task.outputLength > 2000) return 'complex';
  if (task.isClassification || task.isExtraction) return 'simple';
  return 'standard';
}
This single change cut my monthly costs by 40%. Most tasks in a pipeline are classification, extraction, or formatting — Haiku handles those fine.
Semantic Caching
If you're calling the model with the same (or very similar) input repeatedly, cache it. I use a simple embedding-based cache with a similarity threshold.
interface CacheEntry {
  inputHash: string;
  embedding: number[];
  output: string;
  model: string;
  createdAt: Date;
  ttlMs: number;
}

async function cachedLLMCall<T>(
  opts: LLMCallOpts<T> & { cacheKey?: string; cacheTtlMs?: number }
): Promise<LLMCallResult<T>> {
  const inputHash = hashMessages(opts.messages);

  // Exact match check
  const cached = await cache.get(inputHash);
  if (cached && !isExpired(cached)) {
    metrics.increment('cache.hit');
    return {
      success: true,
      data: cached.output as T,
      usage: ZERO_USAGE,
      latencyMs: 2,
      attempts: 0,
    };
  }

  // Cache miss — call model
  metrics.increment('cache.miss');
  const result = await llmCall(opts);

  if (result.success) {
    await cache.set(inputHash, {
      output: result.data,
      model: opts.model,
      ttlMs: opts.cacheTtlMs ?? 3600_000,
    });
  }

  return result;
}
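The exact-match path covers most hits. For the "very similar" case, a second lookup compares an embedding of the new input against stored entries. A rough sketch of that check, assuming an in-memory list of entries (a vector store does the same job at scale); embed, messagesToText, and SIMILARITY_THRESHOLD are illustrative names:

// Illustrative similarity fallback for near-duplicate inputs.
const SIMILARITY_THRESHOLD = 0.95;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function findSimilarEntry(messages: Message[], entries: CacheEntry[]): Promise<CacheEntry | null> {
  const queryEmbedding = await embed(messagesToText(messages));
  let best: CacheEntry | null = null;
  let bestScore = SIMILARITY_THRESHOLD;

  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}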
My case study builder generates similar analyses for companies in the same industry. Caching brought repeat-industry costs down by roughly 60%.
Prompt Optimization
Your system prompt is sent with every request. Every token counts.
I went through all 14 agents and ruthlessly trimmed system prompts. Removed examples that weren't pulling their weight. Replaced verbose instructions with concise ones. Moved static context into cached prefixes where supported.
Before: average system prompt was 1,800 tokens. After: 650 tokens. Across thousands of daily calls, that adds up fast.
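On the cached-prefix point: where the provider supports prompt caching, the static chunk of the system prompt is billed at a reduced rate after the first request in a cache window. A sketch using the Anthropic TypeScript SDK's cache_control blocks (draftPost and STATIC_STYLE_GUIDE are illustrative names); treat it as approximate and check the current docs for pricing and cache lifetime:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Mark the large, static instructions as cacheable; only the short per-request
// user content changes between calls.
async function draftPost(userBrief: string) {
  return client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2048,
    system: [
      {
        type: 'text',
        text: STATIC_STYLE_GUIDE,
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: userBrief }],
  });
}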
The Monitoring Stack That Works
You can't manage what you can't measure. Here's what I monitor for every agent in production.
Essential Metrics
interface AgentMetrics {
  // Performance
  latencyP50Ms: number;
  latencyP99Ms: number;
  successRate: number;

  // Cost
  costPerCall: number;
  costPerDay: number;
  tokenEfficiency: number; // useful output tokens / total tokens

  // Quality
  parseFailureRate: number;
  fallbackRate: number;
  retryRate: number;
}
What I Use
- Logging: Structured JSON logs with Pino. Every LLM call logs model, tokens, cost, latency, and success/failure. Ship to your preferred aggregator.
- Metrics: Custom counters pushed to a time-series store. I use a lightweight setup with Prometheus-compatible metrics, but even a PostgreSQL table with timestamps works at small scale.
- Alerting: Simple threshold alerts. If success rate drops below 95% over a 15-minute window, or daily cost exceeds 150% of the 7-day average, I get notified.
- Cost Dashboard: A daily rollup showing cost per agent, cost per call, and trend over 30 days. This is the single most useful thing I built. Catches runaway costs within 24 hours.
// Structured logging for every LLM call. LLMCallResult doesn't carry the model
// name, so it's passed in alongside the result.
function logLLMCall(result: LLMCallResult<any>, agentName: string, model: string) {
  logger.info({
    agent: agentName,
    model,
    success: result.success,
    latencyMs: result.latencyMs,
    inputTokens: result.usage.inputTokens,
    outputTokens: result.usage.outputTokens,
    costUsd: result.usage.costUsd,
    attempts: result.attempts,
    error: result.error ?? undefined,
    timestamp: new Date().toISOString(),
  });
}
Every call. No exceptions. The day you skip logging "because it's just a simple agent" is the day that agent costs you $400 overnight.
The Alert That Saved Me $1,200
Two months in, my tax pipeline agent started retrying excessively. An upstream data format changed slightly — enough that the model's output failed validation every time, triggering three retries per request. Each retry burned tokens on the full prompt.
My cost alert fired within six hours. Without it, that bug would have run for days before anyone noticed. The fix was a two-line schema update. The alert paid for itself that week.
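The alert itself is nothing clever. Something in this shape, run every few minutes from a scheduler, is enough; getDailySpend and sendAlert are the same helpers used in the budget check later, while getTrailingAverageSpend and getSuccessRate are illustrative stand-ins over the metrics store:

// Runs on a schedule. Thresholds match the alerting rules above.
async function checkCostAlerts(agentName: string): Promise<void> {
  const todaySpend = await getDailySpend(agentName);
  const trailingAvg = await getTrailingAverageSpend(agentName, 7); // 7-day average

  if (trailingAvg > 0 && todaySpend > trailingAvg * 1.5) {
    await sendAlert(
      `${agentName} daily spend $${todaySpend.toFixed(2)} is over 150% of the 7-day average ($${trailingAvg.toFixed(2)})`
    );
  }

  const successRate = await getSuccessRate(agentName, { windowMinutes: 15 });
  if (successRate < 0.95) {
    await sendAlert(`${agentName} success rate dropped to ${(successRate * 100).toFixed(1)}% over the last 15 minutes`);
  }
}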
Multi-Step Agent Orchestration
For agents that chain multiple steps, I use a simple state machine pattern. Each step is a pure function that takes state in and returns new state plus the next action.
type StepResult =
  | { status: 'continue'; nextStep: string; state: AgentState }
  | { status: 'complete'; output: any }
  | { status: 'failed'; error: string; state: AgentState };

interface AgentStep {
  name: string;
  execute(state: AgentState): Promise<StepResult>;
  rollback?(state: AgentState): Promise<void>;
}

async function runPipeline(steps: AgentStep[], initial: AgentState) {
  let state = initial;
  const completed: AgentStep[] = [];

  for (const step of steps) {
    const result = await step.execute(state);

    if (result.status === 'failed') {
      // Rollback completed steps in reverse
      for (const done of completed.reverse()) {
        await done.rollback?.(state);
      }
      return { success: false, error: result.error, failedStep: step.name };
    }

    if (result.status === 'complete') {
      return { success: true, output: result.output };
    }

    state = result.state;
    completed.push(step);
  }

  // Surface pipelines that run out of steps without ever completing,
  // instead of silently returning undefined.
  return { success: false, error: 'Pipeline finished without a completing step' };
}
The rollback mechanism has saved me multiple times. When step 4 of a 5-step pipeline fails, you don't want to leave partial data scattered across your system.
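Wiring a pipeline together is just an ordered list of steps. The two steps below are illustrative stand-ins (the prompt builders, schemas, cleanup helper, and initialState are hypothetical), but they show the continue/complete/rollback shape:

// Illustrative steps; real ones wrap llmCall plus whatever side effects they need.
const extractStep: AgentStep = {
  name: 'extract-requirements',
  async execute(state) {
    const result = await llmCall({
      model: 'claude-haiku-4-20250514',
      messages: buildExtractionPrompt(state),
      schema: RequirementsSchema,
    });
    if (!result.success) return { status: 'failed', error: result.error ?? 'extraction failed', state };
    return { status: 'continue', nextStep: 'draft', state: { ...state, requirements: result.data } };
  },
  async rollback(state) {
    // Undo any records this step wrote, so a later failure leaves nothing behind.
    await removeExtractedRecords(state);
  },
};

const draftStep: AgentStep = {
  name: 'draft',
  async execute(state) {
    const result = await llmCall({
      model: 'claude-sonnet-4-20250514',
      messages: buildDraftPrompt(state),
      schema: DraftSchema,
    });
    if (!result.success) return { status: 'failed', error: result.error ?? 'draft failed', state };
    return { status: 'complete', output: result.data };
  },
};

const outcome = await runPipeline([extractStep, draftStep], initialState);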
What I'd Do Differently
Honest accounting time. If I started over tomorrow, here's what changes.
Start With Observability, Not Features
My first three agents had minimal logging. I was focused on making them work. Retrofitting production visibility later was painful, and I missed early bugs that better logging would have caught in hours.
Build the logging wrapper and cost tracking before you write a single prompt. It's maybe two hours of work and it pays for itself within the first week.
Use Structured Outputs From Day One
I started with free-text outputs and regex parsing. Don't. Use function calling or structured output modes from the start. The migration from "parse this markdown" to "here's a JSON schema" touched every agent and took a full weekend.
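If I were starting today, every output would be declared as a schema-backed tool from the first commit. One way to do it, sketched with the Anthropic SDK and the zod-to-json-schema package; structuredCall is an illustrative wrapper, and the exact SDK types may differ from this sketch:

import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const client = new Anthropic();

// Force the model to "call" a single tool whose input must match the Zod schema,
// then validate the tool input with that same schema on the way back out.
async function structuredCall<T>(schema: z.ZodSchema<T>, prompt: string): Promise<T | null> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2048,
    tools: [
      {
        name: 'emit_result',
        description: 'Return the final structured result',
        // zod-to-json-schema produces a plain JSON Schema object
        input_schema: zodToJsonSchema(schema) as any,
      },
    ],
    tool_choice: { type: 'tool', name: 'emit_result' },
    messages: [{ role: 'user', content: prompt }],
  });

  const toolUse = response.content.find((block) => block.type === 'tool_use');
  if (!toolUse || toolUse.type !== 'tool_use') return null;

  const parsed = schema.safeParse(toolUse.input);
  return parsed.success ? parsed.data : null;
}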
Set Cost Budgets Per Agent Per Day
Hard limits, not just alerts. My llmCall wrapper now checks a daily budget before making calls:
async function checkBudget(agentName: string, estimatedCost: number): Promise<boolean> {
  const spent = await getDailySpend(agentName);
  const budget = AGENT_BUDGETS[agentName] ?? DEFAULT_DAILY_BUDGET;

  if (spent + estimatedCost > budget) {
    logger.error(`Agent ${agentName} budget exceeded: $${spent.toFixed(2)}/$${budget}`);
    await sendAlert(`Budget exceeded for ${agentName}`);
    return false;
  }

  return true;
}
This would have prevented my most expensive mistake — a recursive agent loop that burned $600 in four hours on a Saturday.
Test With Adversarial Inputs Earlier
I tested with clean, well-formatted inputs. Users send garbage. Unicode edge cases, 50MB attachments, empty strings, inputs in languages the agent wasn't designed for. Build an adversarial test suite early. Include empty inputs, maximum-length inputs, malformed data, and multilingual content.
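A handful of table-driven tests covers most of it. A sketch with Vitest; buildInput is an illustrative helper that wraps raw text in whatever AgentInput implementation the agent uses:

import { describe, it, expect } from 'vitest';

// Adversarial inputs that production users will eventually send.
const adversarialInputs = [
  { name: 'empty string', input: '' },
  { name: 'whitespace only', input: '   \n\t  ' },
  { name: 'maximum length', input: 'a'.repeat(500_000) },
  { name: 'malformed JSON pasted as text', input: '{"unclosed": [1, 2,' },
  { name: 'mixed-language content', input: 'Veuillez résumer this document 请用中文回答' },
  { name: 'unicode edge cases', input: '👩‍👩‍👧‍👧 \u0000 \uFFFD' },
];

describe('intake agent handles adversarial input', () => {
  it.each(adversarialInputs)('does not throw on $name', async ({ input }) => {
    // The agent should return a structured failure, never throw or hang.
    const result = await runAgent(buildInput(input));
    expect(result).toHaveProperty('success');
  });
});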
Don't Over-Agent
Not everything needs an AI agent. I built two agents for tasks that would have been better served by a simple rule engine with an if/else tree. The LLM added latency, cost, and non-determinism to tasks that had clear, deterministic logic.
Before building an agent, ask: "Does this task require reasoning, or just routing?" If it's routing, write a function.
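The litmus test in code form: if the whole decision fits in a readable function like this, skip the LLM. The categories and keyword rules here are made up for illustration:

// A deterministic router. No tokens, no latency, no surprises.
type Route = 'billing' | 'support' | 'sales';

function routeTicket(subject: string, body: string): Route {
  const text = `${subject} ${body}`.toLowerCase();
  if (/invoice|refund|charge|payment/.test(text)) return 'billing';
  if (/quote|pricing|demo|upgrade/.test(text)) return 'sales';
  return 'support';
}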
The Patterns That Survived
After 300+ commits and 14 agents, these are the patterns I'd bet on:
- Single LLM call wrapper with built-in retry, timeout, cost tracking, and output validation. Every call goes through it.
- Model routing by task complexity. Use the cheapest model that works.
- Structured outputs with Zod validation. Parse failure is a first-class error, not an afterthought.
- Graceful degradation chains. Primary model → fallback model → template/rule-based fallback.
- Per-agent daily cost budgets with hard cutoffs and alerts.
- Structured logging on every call. Model, tokens, cost, latency, success. No exceptions.
- Semantic caching for repeated or similar inputs.
- State machine orchestration for multi-step workflows, with rollback support.
None of these are revolutionary. That's the point. Production AI agents don't need clever architecture. They need boring, reliable infrastructure wrapped around an inherently unpredictable component.
The model is the creative part. Everything around it should be as predictable and observable as possible.
Final Thought
Building AI agents is the easy part. Keeping them running, affordable, and reliable in production — that's the real work. Most of my 300+ commits aren't adding features. They're improving error handling, tightening validation, optimizing prompts, and fixing the monitoring that caught a problem at 2 AM.
The agents that work best aren't the smartest. They're the ones that fail gracefully, cost predictably, and tell me exactly what they're doing at all times.
Build boring infrastructure around interesting AI. That's the whole lesson.