Building Production AI Agents: Lessons from 300+ Commits
Hard-won lessons from building and deploying 14+ AI agents in production — error handling, monitoring, cost management, and the patterns that actually work.
Over the past year I've built and shipped 14 AI agents across intake pipelines, content engines, proposal generators, client management systems, and tax automation workflows. North of 300 commits. Some of those agents are humming along beautifully. Others taught me expensive lessons at 3 AM.
This isn't a tutorial. It's a field report. The patterns that survived contact with real users, real API rate limits, and real invoices from OpenAI and Anthropic.
The Reality Check Nobody Gives You
Most AI agent content online shows you the happy path. Call the model, get a response, done. Production is nothing like that.
In production, the model returns malformed JSON 2% of the time. Your API key rotates and nobody updates the secret. A user sends a 47,000-token document and your carefully tuned prompt blows past the context window. The model hallucinates a function call that doesn't exist. Your monthly bill jumps from $180 to $2,400 because someone left a retry loop running over the weekend.
Every single one of those happened to me. Some of them happened twice.
The gap between "works in development" and "works in production" for AI agents is wider than any other software I've built. Traditional software fails predictably. AI agents fail creatively.
The 5 Things That Will Break First
After deploying 14 agents, I can tell you with confidence what breaks first. Every time.
1. Output Parsing
Your agent returns structured data. Except when it doesn't. Even with function calling and structured outputs, models occasionally return markdown-wrapped JSON, extra whitespace, or completely ignore your schema.
// Don't do this
const result = JSON.parse(response.content);

// Do this
function safeParseOutput<T>(raw: string, schema: z.ZodSchema<T>): T | null {
  // Strip markdown code fences if present
  const cleaned = raw
    .replace(/^```(?:json)?\n?/gm, '')
    .replace(/\n?```$/gm, '')
    .trim();

  try {
    const parsed = JSON.parse(cleaned);
    return schema.parse(parsed);
  } catch (e) {
    logger.warn('Output parse failed', {
      raw: raw.substring(0, 500),
      error: e instanceof Error ? e.message : 'Unknown',
    });
    return null;
  }
}
I use Zod for every single agent output. No exceptions. If the model returns garbage, I want to know exactly which field failed, not just "SyntaxError: Unexpected token."
2. Rate Limits and Timeouts
Every LLM provider has rate limits. They're not consistent, they're not always documented, and they change. My intake pipeline agent got throttled during a product launch because we hit TPM (tokens per minute) limits nobody planned for.
async function callWithBackoff<T>(
  fn: () => Promise<T>,
  opts: { maxRetries?: number; baseDelayMs?: number } = {}
): Promise<T> {
  const { maxRetries = 3, baseDelayMs = 1000 } = opts;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRetryable =
        err?.status === 429 ||
        err?.status === 503 ||
        err?.code === 'ECONNRESET';

      if (!isRetryable || attempt === maxRetries) throw err;

      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 500;
      logger.info(`Retry ${attempt + 1}/${maxRetries} after ${Math.round(delay)}ms`);
      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw new Error('Unreachable');
}
Exponential backoff with jitter. Not optional. The jitter matters — without it, all your retries hit the API at the same time and you stay throttled.
3. Context Window Overflows
Users will send you things that don't fit. A 200-page PDF. A database dump. An email thread with 47 replies. Your agent needs to handle this before it hits the model.
function truncateToTokenBudget(
  text: string,
  maxTokens: number,
  encoding: TiktokenEncoding = 'cl100k_base'
): string {
  const enc = getEncoding(encoding);
  const tokens = enc.encode(text);

  if (tokens.length <= maxTokens) return text;

  // js-tiktoken's decode() already returns a string, so no TextDecoder is needed
  const truncated = enc.decode(tokens.slice(0, maxTokens));
  logger.warn(`Truncated input from ${tokens.length} to ${maxTokens} tokens`);
  return truncated + '\n\n[Content truncated]';
}
I learned this the hard way when my proposal generator hit a 128k context window with a client brief that was mostly copy-pasted legal boilerplate. The model charged me for the full input and returned nonsense. Now every agent has a token budget enforced before the API call.
4. Costs Spiraling Without Warning
This is the silent killer. You deploy, it works, you forget about it. Three weeks later you check your dashboard and you've burned through $800 more than expected.
The culprit is almost always one of three things:
- Retry loops that succeed on the third try, while the cost of attempts one and two goes uncounted.
- Verbose system prompts that get sent with every single message in a conversation.
- Agents calling other agents in loops.
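The third one is the scariest because it compounds. A cheap safeguard is a hard cap on how deep agent-to-agent calls can nest. Here's a minimal sketch; runSubAgent and MAX_AGENT_DEPTH are illustrative names, not code lifted from my repos:

// Illustrative guard: cap how deep agent-to-agent calls can nest.
const MAX_AGENT_DEPTH = 3;

async function runSubAgent<T>(call: () => Promise<T>, depth: number): Promise<T> {
  if (depth >= MAX_AGENT_DEPTH) {
    // Fail fast instead of letting agents recurse until the bill explodes.
    throw new Error(`Agent call depth ${depth} exceeds limit of ${MAX_AGENT_DEPTH}`);
  }
  return call();
}

Every nested agent call passes depth + 1, so a runaway loop dies after three levels instead of running all weekend.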
5. State Management Across Multi-Step Workflows
Single-turn agents are easy. Multi-step agents — where step 3 depends on step 1's output and step 2's side effects — are where things get genuinely hard. Race conditions, partial failures, and "what state are we in?" become real problems.
I'll cover the pattern that works for this below.
Error Handling Patterns That Actually Work
After trying several approaches, I settled on a layered error handling strategy. Every agent has three layers.
Layer 1: Input Validation
Before any LLM call, validate and sanitize inputs. This catches 60% of issues before they cost you money.
interface AgentInput {
  validate(): ValidationResult;
  sanitize(): this;
  estimateTokens(): number;
}

function runAgent(input: AgentInput) {
  const sanitized = input.sanitize();
  const validation = sanitized.validate();

  if (!validation.ok) {
    return { success: false, error: validation.errors, stage: 'input' };
  }

  const estimatedTokens = sanitized.estimateTokens();
  if (estimatedTokens > TOKEN_BUDGET) {
    return { success: false, error: 'Input exceeds token budget', stage: 'input' };
  }

  // Proceed to LLM call...
}
Layer 2: LLM Call Wrapper
Every LLM call goes through a single wrapper that handles retries, timeouts, cost tracking, and output parsing. Never call the API directly from business logic.
interface LLMCallResult<T> {
  success: boolean;
  data?: T;
  error?: string;
  usage: { inputTokens: number; outputTokens: number; costUsd: number };
  latencyMs: number;
  attempts: number;
}

async function llmCall<T>(opts: {
  model: string;
  messages: Message[];
  schema: z.ZodSchema<T>;
  maxRetries?: number;
  timeoutMs?: number;
}): Promise<LLMCallResult<T>> {
  const start = Date.now();
  let totalUsage = { inputTokens: 0, outputTokens: 0, costUsd: 0 };
  let attempts = 0;

  try {
    const raw = await callWithBackoff(
      async () => {
        attempts++;
        const resp = await provider.chat({
          model: opts.model,
          messages: opts.messages,
          timeout: opts.timeoutMs ?? 30_000,
        });
        totalUsage.inputTokens += resp.usage.input;
        totalUsage.outputTokens += resp.usage.output;
        totalUsage.costUsd += calculateCost(opts.model, resp.usage);
        return resp;
      },
      { maxRetries: opts.maxRetries }
    );

    const parsed = safeParseOutput(raw.content, opts.schema);
    if (!parsed) {
      return {
        success: false,
        error: 'Output schema validation failed',
        usage: totalUsage,
        latencyMs: Date.now() - start,
        attempts,
      };
    }

    return {
      success: true,
      data: parsed,
      usage: totalUsage,
      latencyMs: Date.now() - start,
      attempts,
    };
  } catch (err: any) {
    return {
      success: false,
      error: err.message,
      usage: totalUsage,
      latencyMs: Date.now() - start,
      attempts,
    };
  }
}
Notice the usage tracking on every call, including failed ones. This is critical. Failed calls still cost money.
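The wrapper leans on a calculateCost helper that runs on every attempt. Mine is just a lookup table; a sketch along these lines works, with the caveat that the output prices below are placeholders you would replace with your provider's current rate sheet:

// Per-model pricing in USD per 1K tokens. Placeholder numbers; keep them in
// sync with your provider's published rates.
const MODEL_PRICES: Record<string, { inputPer1k: number; outputPer1k: number }> = {
  'claude-opus-4-6': { inputPer1k: 0.015, outputPer1k: 0.075 },
  'claude-sonnet-4-20250514': { inputPer1k: 0.003, outputPer1k: 0.015 },
  'claude-haiku-4-20250514': { inputPer1k: 0.0008, outputPer1k: 0.004 },
};

function calculateCost(model: string, usage: { input: number; output: number }): number {
  const price = MODEL_PRICES[model];
  if (!price) return 0; // Unknown model: log it elsewhere rather than guessing
  return (usage.input / 1000) * price.inputPer1k + (usage.output / 1000) * price.outputPer1k;
}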
Layer 3: Graceful Degradation
When the LLM fails, don't just throw an error at the user. Have a fallback.
async function generateProposal(brief: ClientBrief): Promise<Proposal> {
  // Try primary model
  const result = await llmCall({
    model: 'claude-opus-4-6',
    messages: buildProposalPrompt(brief),
    schema: ProposalSchema,
  });
  if (result.success) return result.data!;

  // Fallback to faster, cheaper model
  logger.warn('Primary model failed, falling back', { error: result.error });
  const fallback = await llmCall({
    model: 'claude-sonnet-4-20250514',
    messages: buildProposalPrompt(brief),
    schema: ProposalSchema,
  });
  if (fallback.success) return fallback.data!;

  // Final fallback: template-based generation (no LLM)
  logger.error('All models failed, using template fallback');
  return generateFromTemplate(brief);
}
My content engine uses this exact pattern. Opus for quality, Sonnet as fallback, and a static template as the last resort. In six months of production, the template fallback has fired exactly three times — all during an API outage. Each time, the client got their content on time anyway.
Cost Optimization: What Actually Moves the Needle
I track every dollar across all 14 agents. Here's what made the biggest difference.
Model Routing
Not every task needs your most expensive model. I built a simple router that classifies tasks by complexity and routes accordingly.
const MODEL_TIERS = {
  simple: { model: 'claude-haiku-4-20250514', costPer1kInput: 0.0008 },
  standard: { model: 'claude-sonnet-4-20250514', costPer1kInput: 0.003 },
  complex: { model: 'claude-opus-4-6', costPer1kInput: 0.015 },
} as const;

function selectModel(task: AgentTask): keyof typeof MODEL_TIERS {
  if (task.requiresReasoning || task.outputLength > 2000) return 'complex';
  if (task.isClassification || task.isExtraction) return 'simple';
  return 'standard';
}
This single change cut my monthly costs by 40%. Most tasks in a pipeline are classification, extraction, or formatting — Haiku handles those fine.
Semantic Caching
If you're calling the model with the same (or very similar) input repeatedly, cache it. I use a simple embedding-based cache with a similarity threshold.
interface CacheEntry {
  inputHash: string;
  embedding: number[];
  output: string;
  model: string;
  createdAt: Date;
  ttlMs: number;
}

async function cachedLLMCall<T>(
  opts: LLMCallOpts<T> & { cacheKey?: string; cacheTtlMs?: number }
): Promise<LLMCallResult<T>> {
  const inputHash = hashMessages(opts.messages);

  // Exact match check
  const cached = await cache.get(inputHash);
  if (cached && !isExpired(cached)) {
    metrics.increment('cache.hit');
    return {
      success: true,
      data: cached.output as T,
      usage: ZERO_USAGE,
      latencyMs: 2,
      attempts: 0,
    };
  }

  // Cache miss — call model
  metrics.increment('cache.miss');
  const result = await llmCall(opts);

  if (result.success) {
    await cache.set(inputHash, {
      output: result.data,
      model: opts.model,
      ttlMs: opts.cacheTtlMs ?? 3600_000,
    });
  }

  return result;
}
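The exact-match path covers most hits. For the "very similar" case, a second lookup compares an embedding of the new input against stored entries. A rough sketch of that check, assuming an in-memory list of entries (a vector store does the same job at scale); embed, messagesToText, and SIMILARITY_THRESHOLD are illustrative names:

// Illustrative similarity fallback for near-duplicate inputs.
const SIMILARITY_THRESHOLD = 0.95;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function findSimilarEntry(messages: Message[], entries: CacheEntry[]): Promise<CacheEntry | null> {
  const queryEmbedding = await embed(messagesToText(messages));
  let best: CacheEntry | null = null;
  let bestScore = SIMILARITY_THRESHOLD;

  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}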
My case study builder generates similar analyses for companies in the same industry. Caching brought repeat-industry costs down by roughly 60%.
Prompt Optimization
Your system prompt is sent with every request. Every token counts.
I went through all 14 agents and ruthlessly trimmed system prompts. Removed examples that weren't pulling their weight. Replaced verbose instructions with concise ones. Moved static context into cached prefixes where supported.
Before: average system prompt was 1,800 tokens. After: 650 tokens. Across thousands of daily calls, that adds up fast.
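On the cached-prefix point: where the provider supports prompt caching, the static chunk of the system prompt is billed at a reduced rate after the first request in a cache window. A sketch using the Anthropic TypeScript SDK's cache_control blocks (draftPost and STATIC_STYLE_GUIDE are illustrative names); treat it as approximate and check the current docs for pricing and cache lifetime:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Mark the large, static instructions as cacheable; only the short per-request
// user content changes between calls.
async function draftPost(userBrief: string) {
  return client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2048,
    system: [
      {
        type: 'text',
        text: STATIC_STYLE_GUIDE,
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: userBrief }],
  });
}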
The Monitoring Stack That Works
You can't manage what you can't measure. Here's what I monitor for every agent in production.
Essential Metrics
interface AgentMetrics {
  // Performance
  latencyP50Ms: number;
  latencyP99Ms: number;
  successRate: number;

  // Cost
  costPerCall: number;
  costPerDay: number;
  tokenEfficiency: number; // useful output tokens / total tokens

  // Quality
  parseFailureRate: number;
  fallbackRate: number;
  retryRate: number;
}
What I Use
- Logging: Structured JSON logs with Pino. Every LLM call logs model, tokens, cost, latency, and success/failure. Ship to your preferred aggregator.
- Metrics: Custom counters pushed to a time-series store. I use a lightweight setup with Prometheus-compatible metrics, but even a PostgreSQL table with timestamps works at small scale.
- Alerting: Simple threshold alerts. If success rate drops below 95% over a 15-minute window, or daily cost exceeds 150% of the 7-day average, I get notified.
- Cost Dashboard: A daily rollup showing cost per agent, cost per call, and trend over 30 days. This is the single most useful thing I built. Catches runaway costs within 24 hours.
// Structured logging for every LLM call. LLMCallResult doesn't carry the model
// name, so it's passed in alongside the result.
function logLLMCall(result: LLMCallResult<any>, agentName: string, model: string) {
  logger.info({
    agent: agentName,
    model,
    success: result.success,
    latencyMs: result.latencyMs,
    inputTokens: result.usage.inputTokens,
    outputTokens: result.usage.outputTokens,
    costUsd: result.usage.costUsd,
    attempts: result.attempts,
    error: result.error ?? undefined,
    timestamp: new Date().toISOString(),
  });
}
Every call. No exceptions. The day you skip logging "because it's just a simple agent" is the day that agent costs you $400 overnight.
The Alert That Saved Me $1,200
Two months in, my tax pipeline agent started retrying excessively. An upstream data format changed slightly — enough that the model's output failed validation every time, triggering three retries per request. Each retry burned tokens on the full prompt.
My cost alert fired within six hours. Without it, that bug would have run for days before anyone noticed. The fix was a two-line schema update. The alert paid for itself that week.
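The alert itself is nothing clever. Something in this shape, run every few minutes from a scheduler, is enough; getDailySpend and sendAlert are the same helpers used in the budget check later, while getTrailingAverageSpend and getSuccessRate are illustrative stand-ins over the metrics store:

// Runs on a schedule. Thresholds match the alerting rules above.
async function checkCostAlerts(agentName: string): Promise<void> {
  const todaySpend = await getDailySpend(agentName);
  const trailingAvg = await getTrailingAverageSpend(agentName, 7); // 7-day average

  if (trailingAvg > 0 && todaySpend > trailingAvg * 1.5) {
    await sendAlert(
      `${agentName} daily spend $${todaySpend.toFixed(2)} is over 150% of the 7-day average ($${trailingAvg.toFixed(2)})`
    );
  }

  const successRate = await getSuccessRate(agentName, { windowMinutes: 15 });
  if (successRate < 0.95) {
    await sendAlert(`${agentName} success rate dropped to ${(successRate * 100).toFixed(1)}% over the last 15 minutes`);
  }
}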
Multi-Step Agent Orchestration
For agents that chain multiple steps, I use a simple state machine pattern. Each step is a pure function that takes state in and returns new state plus the next action.
type StepResult =
  | { status: 'continue'; nextStep: string; state: AgentState }
  | { status: 'complete'; output: any }
  | { status: 'failed'; error: string; state: AgentState };

interface AgentStep {
  name: string;
  execute(state: AgentState): Promise<StepResult>;
  rollback?(state: AgentState): Promise<void>;
}

async function runPipeline(steps: AgentStep[], initial: AgentState) {
  let state = initial;
  const completed: AgentStep[] = [];

  for (const step of steps) {
    const result = await step.execute(state);

    if (result.status === 'failed') {
      // Rollback completed steps in reverse
      for (const done of completed.reverse()) {
        await done.rollback?.(state);
      }
      return { success: false, error: result.error, failedStep: step.name };
    }

    if (result.status === 'complete') {
      return { success: true, output: result.output };
    }

    state = result.state;
    completed.push(step);
  }

  // Surface pipelines that run out of steps without ever completing,
  // instead of silently returning undefined.
  return { success: false, error: 'Pipeline finished without a completing step' };
}
The rollback mechanism has saved me multiple times. When step 4 of a 5-step pipeline fails, you don't want to leave partial data scattered across your system.
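Wiring a pipeline together is just an ordered list of steps. The two steps below are illustrative stand-ins (the prompt builders, schemas, cleanup helper, and initialState are hypothetical), but they show the continue/complete/rollback shape:

// Illustrative steps; real ones wrap llmCall plus whatever side effects they need.
const extractStep: AgentStep = {
  name: 'extract-requirements',
  async execute(state) {
    const result = await llmCall({
      model: 'claude-haiku-4-20250514',
      messages: buildExtractionPrompt(state),
      schema: RequirementsSchema,
    });
    if (!result.success) return { status: 'failed', error: result.error ?? 'extraction failed', state };
    return { status: 'continue', nextStep: 'draft', state: { ...state, requirements: result.data } };
  },
  async rollback(state) {
    // Undo any records this step wrote, so a later failure leaves nothing behind.
    await removeExtractedRecords(state);
  },
};

const draftStep: AgentStep = {
  name: 'draft',
  async execute(state) {
    const result = await llmCall({
      model: 'claude-sonnet-4-20250514',
      messages: buildDraftPrompt(state),
      schema: DraftSchema,
    });
    if (!result.success) return { status: 'failed', error: result.error ?? 'draft failed', state };
    return { status: 'complete', output: result.data };
  },
};

const outcome = await runPipeline([extractStep, draftStep], initialState);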
What I'd Do Differently
Honest accounting time. If I started over tomorrow, here's what changes.
Start With Observability, Not Features
My first three agents had minimal logging. I was focused on making them work. Retrofitting production visibility later was painful, and I missed early bugs that better logging would have caught in hours.
Build the logging wrapper and cost tracking before you write a single prompt. It's maybe two hours of work and it pays for itself within the first week.
Use Structured Outputs From Day One
I started with free-text outputs and regex parsing. Don't. Use function calling or structured output modes from the start. The migration from "parse this markdown" to "here's a JSON schema" touched every agent and took a full weekend.
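If I were starting today, every output would be declared as a schema-backed tool from the first commit. One way to do it, sketched with the Anthropic SDK and the zod-to-json-schema package; structuredCall is an illustrative wrapper, and the exact SDK types may differ from this sketch:

import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const client = new Anthropic();

// Force the model to "call" a single tool whose input must match the Zod schema,
// then validate the tool input with that same schema on the way back out.
async function structuredCall<T>(schema: z.ZodSchema<T>, prompt: string): Promise<T | null> {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 2048,
    tools: [
      {
        name: 'emit_result',
        description: 'Return the final structured result',
        // zod-to-json-schema produces a plain JSON Schema object
        input_schema: zodToJsonSchema(schema) as any,
      },
    ],
    tool_choice: { type: 'tool', name: 'emit_result' },
    messages: [{ role: 'user', content: prompt }],
  });

  const toolUse = response.content.find((block) => block.type === 'tool_use');
  if (!toolUse || toolUse.type !== 'tool_use') return null;

  const parsed = schema.safeParse(toolUse.input);
  return parsed.success ? parsed.data : null;
}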
Set Cost Budgets Per Agent Per Day
Hard limits, not just alerts. My llmCall wrapper now checks a daily budget before making calls:
async function checkBudget(agentName: string, estimatedCost: number): Promise<boolean> {
  const spent = await getDailySpend(agentName);
  const budget = AGENT_BUDGETS[agentName] ?? DEFAULT_DAILY_BUDGET;

  if (spent + estimatedCost > budget) {
    logger.error(`Agent ${agentName} budget exceeded: $${spent.toFixed(2)}/$${budget}`);
    await sendAlert(`Budget exceeded for ${agentName}`);
    return false;
  }

  return true;
}
This would have prevented my most expensive mistake — a recursive agent loop that burned $600 in four hours on a Saturday.
Test With Adversarial Inputs Earlier
I tested with clean, well-formatted inputs. Users send garbage. Unicode edge cases, 50MB attachments, empty strings, inputs in languages the agent wasn't designed for. Build an adversarial test suite early. Include empty inputs, maximum-length inputs, malformed data, and multilingual content.
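A handful of table-driven tests covers most of it. A sketch with Vitest; buildInput is an illustrative helper that wraps raw text in whatever AgentInput implementation the agent uses:

import { describe, it, expect } from 'vitest';

// Adversarial inputs that production users will eventually send.
const adversarialInputs = [
  { name: 'empty string', input: '' },
  { name: 'whitespace only', input: '   \n\t  ' },
  { name: 'maximum length', input: 'a'.repeat(500_000) },
  { name: 'malformed JSON pasted as text', input: '{"unclosed": [1, 2,' },
  { name: 'mixed-language content', input: 'Veuillez résumer this document 请用中文回答' },
  { name: 'unicode edge cases', input: '👩‍👩‍👧‍👧 \u0000 \uFFFD' },
];

describe('intake agent handles adversarial input', () => {
  it.each(adversarialInputs)('does not throw on $name', async ({ input }) => {
    // The agent should return a structured failure, never throw or hang.
    const result = await runAgent(buildInput(input));
    expect(result).toHaveProperty('success');
  });
});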
Don't Over-Agent
Not everything needs an AI agent. I built two agents for tasks that would have been better served by a simple rule engine with an if/else tree. The LLM added latency, cost, and non-determinism to tasks that had clear, deterministic logic.
Before building an agent, ask: "Does this task require reasoning, or just routing?" If it's routing, write a function.
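The litmus test in code form: if the whole decision fits in a readable function like this, skip the LLM. The categories and keyword rules here are made up for illustration:

// A deterministic router. No tokens, no latency, no surprises.
type Route = 'billing' | 'support' | 'sales';

function routeTicket(subject: string, body: string): Route {
  const text = `${subject} ${body}`.toLowerCase();
  if (/invoice|refund|charge|payment/.test(text)) return 'billing';
  if (/quote|pricing|demo|upgrade/.test(text)) return 'sales';
  return 'support';
}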
The Patterns That Survived
After 300+ commits and 14 agents, these are the patterns I'd bet on:
- Single LLM call wrapper with built-in retry, timeout, cost tracking, and output validation. Every call goes through it.
- Model routing by task complexity. Use the cheapest model that works.
- Structured outputs with Zod validation. Parse failure is a first-class error, not an afterthought.
- Graceful degradation chains. Primary model → fallback model → template/rule-based fallback.
- Per-agent daily cost budgets with hard cutoffs and alerts.
- Structured logging on every call. Model, tokens, cost, latency, success. No exceptions.
- Semantic caching for repeated or similar inputs.
- State machine orchestration for multi-step workflows, with rollback support.
None of these are revolutionary. That's the point. Production AI agents don't need clever architecture. They need boring, reliable infrastructure wrapped around an inherently unpredictable component.
The model is the creative part. Everything around it should be as predictable and observable as possible.
Final Thought
Building AI agents is the easy part. Keeping them running, affordable, and reliable in production — that's the real work. Most of my 300+ commits aren't adding features. They're improving error handling, tightening validation, optimizing prompts, and fixing the monitoring that caught a problem at 2 AM.
The agents that work best aren't the smartest. They're the ones that fail gracefully, cost predictably, and tell me exactly what they're doing at all times.
Build boring infrastructure around interesting AI. That's the whole lesson.