ELO Rankings for AI Agents: A Practical Implementation Guide
How to implement ELO rankings for AI agents in production: algorithm, Convex schema, edge cases, and a PromptDuels example.
If you want AI agents to improve over time, you need a feedback loop. One of the simplest and most effective loops is competitive ranking. ELO gives you a clean, explainable way to measure performance across duels, tasks, and head-to-head challenges.
This guide covers:
- Why ELO works for agents
- The algorithm in plain language
- A TypeScript + Convex implementation
- Edge cases (new agents, inactive agents, task types)
- A real-world "PromptDuels" concept
Why Competitive Ranking Drives Improvement
When agents compete, you get:
- Clear incentives (win, climb the leaderboard)
- A signal of quality even with noisy tasks
- User engagement (people love leaderboards)
ELO is perfect because it's lightweight and doesn't require massive datasets. It updates after every match. If you're pairing ELO with a points economy, rankings become even more meaningful — creators can earn points by climbing the leaderboard.
ELO Algorithm (In Plain Language)
Each agent has a rating. When two agents compete:
- Calculate the expected score based on rating difference.
- Compare expected to actual result.
- Update both ratings.
The formula:
ExpectedA = 1 / (1 + 10^((RatingB - RatingA) / 400))
NewA = RatingA + K * (ScoreA - ExpectedA)
Where ScoreA is 1 for a win, 0.5 for a draw, and 0 for a loss. K controls volatility, i.e., how much a single result can move a rating; this guide uses 32 by default.
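For example, with K = 32, an agent rated 1200 facing one rated 1400 has ExpectedA = 1 / (1 + 10^((1400 - 1200) / 400)) ≈ 0.24. If the underdog wins anyway, it gains about 32 * (1 - 0.24) ≈ 24 points (to roughly 1224), and the favorite loses the same 24 (to roughly 1376).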
Data Model (Convex)
Here's a minimal schema:
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  agents: defineTable({
    name: v.string(),
    rating: v.number(),
    matches: v.number(),
    lastActive: v.number(),
  }),
  matches: defineTable({
    agentA: v.id("agents"),
    agentB: v.id("agents"),
    scoreA: v.number(),
    scoreB: v.number(),
    createdAt: v.number(),
  }),
});
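To seed the agents table, a small mutation is enough. This is a minimal sketch: the file name, the createAgent name, and the 1200 starting rating (covered under cold start below) are assumptions rather than fixed choices.
// convex/agents.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const createAgent = mutation({
  args: { name: v.string() },
  handler: async (ctx, args) => {
    // New agents start at a provisional rating; see "New Agents (Cold Start)" below
    return await ctx.db.insert("agents", {
      name: args.name,
      rating: 1200,
      matches: 0,
      lastActive: Date.now(),
    });
  },
});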
ELO Update Function (TypeScript)
// convex/elo.ts

// Expected score for A (roughly, A's win probability) given both ratings.
export function expectedScore(rA: number, rB: number) {
  return 1 / (1 + Math.pow(10, (rB - rA) / 400));
}

// Returns both new ratings. scoreA is 1 (A wins), 0.5 (draw), or 0 (A loses).
export function updateRatings(rA: number, rB: number, scoreA: number, k = 32) {
  const expA = expectedScore(rA, rB);
  const expB = 1 - expA;
  const newA = rA + k * (scoreA - expA);
  const newB = rB + k * ((1 - scoreA) - expB);
  return { newA, newB };
}
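A quick sanity check with the same numbers as the worked example above:
import { updateRatings } from "./elo";

// Underdog (1200) beats the favorite (1400) with the default K of 32
const { newA, newB } = updateRatings(1200, 1400, 1);
console.log(Math.round(newA), Math.round(newB)); // 1224 1376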
Convex Mutation to Record a Match
// convex/matches.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";
import { updateRatings } from "./elo";

export const recordMatch = mutation({
  args: {
    agentA: v.id("agents"),
    agentB: v.id("agents"),
    scoreA: v.number(), // 1 win, 0.5 draw, 0 loss (from A's perspective)
  },
  handler: async (ctx, args) => {
    const a = await ctx.db.get(args.agentA);
    const b = await ctx.db.get(args.agentB);
    if (!a || !b) throw new Error("Agent not found");

    const { newA, newB } = updateRatings(a.rating, b.rating, args.scoreA, 32);

    // Update both ratings and activity timestamps in the same transaction
    await ctx.db.patch(args.agentA, {
      rating: newA,
      matches: a.matches + 1,
      lastActive: Date.now(),
    });
    await ctx.db.patch(args.agentB, {
      rating: newB,
      matches: b.matches + 1,
      lastActive: Date.now(),
    });

    // Keep full match history for transparency and audits
    await ctx.db.insert("matches", {
      agentA: args.agentA,
      agentB: args.agentB,
      scoreA: args.scoreA,
      scoreB: 1 - args.scoreA,
      createdAt: Date.now(),
    });
  },
});
Handling Edge Cases
New Agents (Cold Start)
New agents should start with a provisional rating (e.g., 1200) and a higher K value to adapt faster.
const k = agent.matches < 10 ? 48 : 32;
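Wired into recordMatch above, that could look like the sketch below. Applying one K to both players is a simplification; a stricter version would pass a separate K per agent.
// Inside recordMatch's handler, before the rating update:
const kA = a.matches < 10 ? 48 : 32;
const kB = b.matches < 10 ? 48 : 32;

// Simplest option: use the larger of the two so provisional agents settle quickly
const k = Math.max(kA, kB);
const { newA, newB } = updateRatings(a.rating, b.rating, args.scoreA, k);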
Inactive Agents
If agents sit idle, their rating can become stale. Options:
- Decay rating slowly
- Lower match weighting when inactive
A simple approach: apply a small weekly penalty to agents that haven't played recently.
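A sketch of that decay as a pure function. The 2% per idle week and the 1200 baseline are assumptions to tune; in Convex you would call something like this from a scheduled job.
// Pull an idle agent's rating gently back toward the baseline.
const BASELINE = 1200;
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function decayedRating(rating: number, lastActive: number, now = Date.now()) {
  const idleWeeks = Math.floor((now - lastActive) / WEEK_MS);
  if (idleWeeks <= 0) return rating;
  // Close 2% of the gap to the baseline per idle week; never overshoots it
  const factor = Math.pow(0.98, idleWeeks);
  return BASELINE + (rating - BASELINE) * factor;
}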
Different Task Types
If your agents compete across categories (coding vs. writing), you need separate rating pools or category multipliers. Otherwise, you rank apples vs. oranges.
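One way to model separate pools is an extra table keyed by agent and category, alongside the schema above. A sketch (the table name, index name, and category strings are assumptions):
// Inside defineSchema({ ... }) in convex/schema.ts:
ratings: defineTable({
  agentId: v.id("agents"),
  category: v.string(), // e.g. "coding" or "writing"
  rating: v.number(),
  matches: v.number(),
}).index("by_agent_category", ["agentId", "category"]),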
Draws / Partial Wins
Use 0.5 for ties. For partial wins, you can model fractional scores (e.g., 0.7).
PromptDuels: A Real-World Example
In PromptDuels, two agents solve the same prompt. A user votes for the better output, and that vote becomes the score; a vote-handling sketch follows the workflow below.
Workflow:
- Create duel
- Run both agents
- User votes
- Update ELO
- Update leaderboard
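Steps 3 and 4 can be a single mutation. A sketch, assuming a duels table that stores the two agent ids as agentA and agentB; the castVote name and winner argument are illustrative:
// convex/duels.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";
import { updateRatings } from "./elo";

export const castVote = mutation({
  args: { duelId: v.id("duels"), winner: v.union(v.literal("A"), v.literal("B")) },
  handler: async (ctx, args) => {
    const duel = await ctx.db.get(args.duelId);
    if (!duel) throw new Error("Duel not found");

    const a = await ctx.db.get(duel.agentA);
    const b = await ctx.db.get(duel.agentB);
    if (!a || !b) throw new Error("Agent not found");

    // The vote is just an ELO score from A's perspective
    const scoreA = args.winner === "A" ? 1 : 0;
    const { newA, newB } = updateRatings(a.rating, b.rating, scoreA);

    await ctx.db.patch(duel.agentA, { rating: newA, matches: a.matches + 1, lastActive: Date.now() });
    await ctx.db.patch(duel.agentB, { rating: newB, matches: b.matches + 1, lastActive: Date.now() });
  },
});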
This creates a flywheel: better agents climb, users get more interesting matchups, and creators tune their agents to win. With Convex as the backend and a Next.js frontend, the leaderboard updates in real time as votes come in.
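That real-time leaderboard can be an ordinary Convex query. A sketch, assuming you add an index such as .index("by_rating", ["rating"]) to the agents table:
// convex/leaderboard.ts
import { query } from "./_generated/server";

export const topAgents = query({
  args: {},
  handler: async (ctx) => {
    // Highest-rated agents first; subscribed clients re-render as ratings change
    return await ctx.db
      .query("agents")
      .withIndex("by_rating")
      .order("desc")
      .take(20);
  },
});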
Practical Tips
- Keep K small once the system stabilizes.
- Expose ratings so creators can track improvement.
- Store match history for transparency.
- Add moderation to prevent vote manipulation.
Final Thoughts
ELO is simple, explainable, and battle-tested. In AI agent ecosystems, it becomes a growth mechanic: creators compete, users engage, and your platform gets smarter.
If you're building a PromptDuels‑style arena, start with ELO. It's the fastest path to a believable ranking system.
ELO pairs naturally with a points economy: rankings drive competition while points drive engagement. If you're new to building agent systems, the complete builder's guide covers the full architecture, from tools to memory to deployment, and the AI Agent Masterclass walks through a production implementation step by step.