ELO Rankings for AI Agents: A Practical Implementation Guide
How to implement ELO rankings for AI agents in production: algorithm, Convex schema, edge cases, and a PromptDuels example.
If you want AI agents to improve over time, you need a feedback loop. One of the simplest and most effective loops is competitive ranking. ELO gives you a clean, explainable way to measure performance across duels, tasks, and head-to-head challenges.
This guide covers:
- Why ELO works for agents
- The algorithm in plain language
- A TypeScript + Convex implementation
- Edge cases (new agents, inactive agents, task types)
- A real-world "PromptDuels" concept
Why Competitive Ranking Drives Improvement
When agents compete, you get:
- Clear incentives (win, climb the leaderboard)
- A signal of quality even with noisy tasks
- User engagement (people love leaderboards)
ELO is perfect because it's lightweight and doesn't require massive datasets. It updates after every match. If you're pairing ELO with a points economy, rankings become even more meaningful — creators can earn points by climbing the leaderboard.
ELO Algorithm (In Plain Language)
Each agent has a rating. When two agents compete:
- Calculate the expected score based on rating difference.
- Compare expected to actual result.
- Update both ratings.
The formula:
ExpectedA = 1 / (1 + 10^((RatingB - RatingA) / 400))
NewA = RatingA + K * (ScoreA - ExpectedA)
Where ScoreA is 1 for a win, 0.5 for a draw, and 0 for a loss. K controls volatility, i.e., how much a single result can move a rating; this guide uses 32 by default.
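For example, with K = 32, an agent rated 1200 facing one rated 1400 has ExpectedA = 1 / (1 + 10^((1400 - 1200) / 400)) ≈ 0.24. If the underdog wins anyway, it gains about 32 * (1 - 0.24) ≈ 24 points (to roughly 1224), and the favorite loses the same 24 (to roughly 1376).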
Data Model (Convex)
Here's a minimal schema:
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  agents: defineTable({
    name: v.string(),
    rating: v.number(),
    matches: v.number(),
    lastActive: v.number(),
  }),
  matches: defineTable({
    agentA: v.id("agents"),
    agentB: v.id("agents"),
    scoreA: v.number(),
    scoreB: v.number(),
    createdAt: v.number(),
  }),
});
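To seed the agents table, a small mutation is enough. This is a minimal sketch: the file name, the createAgent name, and the 1200 starting rating (covered under cold start below) are assumptions rather than fixed choices.
// convex/agents.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const createAgent = mutation({
  args: { name: v.string() },
  handler: async (ctx, args) => {
    // New agents start at a provisional rating; see "New Agents (Cold Start)" below
    return await ctx.db.insert("agents", {
      name: args.name,
      rating: 1200,
      matches: 0,
      lastActive: Date.now(),
    });
  },
});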
ELO Update Function (TypeScript)
// convex/elo.ts

// Expected score for A (roughly, A's win probability) given both ratings.
export function expectedScore(rA: number, rB: number) {
  return 1 / (1 + Math.pow(10, (rB - rA) / 400));
}

// Returns both new ratings. scoreA is 1 (A wins), 0.5 (draw), or 0 (A loses).
export function updateRatings(rA: number, rB: number, scoreA: number, k = 32) {
  const expA = expectedScore(rA, rB);
  const expB = 1 - expA;
  const newA = rA + k * (scoreA - expA);
  const newB = rB + k * ((1 - scoreA) - expB);
  return { newA, newB };
}
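A quick sanity check with the same numbers as the worked example above:
import { updateRatings } from "./elo";

// Underdog (1200) beats the favorite (1400) with the default K of 32
const { newA, newB } = updateRatings(1200, 1400, 1);
console.log(Math.round(newA), Math.round(newB)); // 1224 1376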
Convex Mutation to Record a Match
// convex/matches.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";
import { updateRatings } from "./elo";

export const recordMatch = mutation({
  args: {
    agentA: v.id("agents"),
    agentB: v.id("agents"),
    scoreA: v.number(), // 1 win, 0.5 draw, 0 loss (from A's perspective)
  },
  handler: async (ctx, args) => {
    const a = await ctx.db.get(args.agentA);
    const b = await ctx.db.get(args.agentB);
    if (!a || !b) throw new Error("Agent not found");

    const { newA, newB } = updateRatings(a.rating, b.rating, args.scoreA, 32);

    // Update both ratings and activity timestamps in the same transaction
    await ctx.db.patch(args.agentA, {
      rating: newA,
      matches: a.matches + 1,
      lastActive: Date.now(),
    });
    await ctx.db.patch(args.agentB, {
      rating: newB,
      matches: b.matches + 1,
      lastActive: Date.now(),
    });

    // Keep full match history for transparency and audits
    await ctx.db.insert("matches", {
      agentA: args.agentA,
      agentB: args.agentB,
      scoreA: args.scoreA,
      scoreB: 1 - args.scoreA,
      createdAt: Date.now(),
    });
  },
});
Handling Edge Cases
New Agents (Cold Start)
New agents should start with a provisional rating (e.g., 1200) and a higher K value to adapt faster.
const k = agent.matches < 10 ? 48 : 32;
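Wired into recordMatch above, that could look like the sketch below. Applying one K to both players is a simplification; a stricter version would pass a separate K per agent.
// Inside recordMatch's handler, before the rating update:
const kA = a.matches < 10 ? 48 : 32;
const kB = b.matches < 10 ? 48 : 32;

// Simplest option: use the larger of the two so provisional agents settle quickly
const k = Math.max(kA, kB);
const { newA, newB } = updateRatings(a.rating, b.rating, args.scoreA, k);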
Inactive Agents
If agents sit idle, their rating can become stale. Options:
- Decay rating slowly
- Lower match weighting when inactive
A simple approach: apply a small weekly penalty to agents that haven't played recently.
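A sketch of that decay as a pure function. The 2% per idle week and the 1200 baseline are assumptions to tune; in Convex you would call something like this from a scheduled job.
// Pull an idle agent's rating gently back toward the baseline.
const BASELINE = 1200;
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function decayedRating(rating: number, lastActive: number, now = Date.now()) {
  const idleWeeks = Math.floor((now - lastActive) / WEEK_MS);
  if (idleWeeks <= 0) return rating;
  // Close 2% of the gap to the baseline per idle week; never overshoots it
  const factor = Math.pow(0.98, idleWeeks);
  return BASELINE + (rating - BASELINE) * factor;
}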
Different Task Types
If your agents compete across categories (coding vs. writing), you need separate rating pools or category multipliers. Otherwise, you rank apples vs. oranges.
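One way to model separate pools is an extra table keyed by agent and category, alongside the schema above. A sketch (the table name, index name, and category strings are assumptions):
// Inside defineSchema({ ... }) in convex/schema.ts:
ratings: defineTable({
  agentId: v.id("agents"),
  category: v.string(), // e.g. "coding" or "writing"
  rating: v.number(),
  matches: v.number(),
}).index("by_agent_category", ["agentId", "category"]),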
Draws / Partial Wins
Use 0.5 for ties. For partial wins, you can model fractional scores (e.g., 0.7).
PromptDuels: A Real-World Example
In PromptDuels, two agents solve the same prompt. A user votes for the better output, and that vote becomes the score; a vote-handling sketch follows the workflow below.
Workflow:
- Create duel
- Run both agents
- User votes
- Update ELO
- Update leaderboard
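Steps 3 and 4 can be a single mutation. A sketch, assuming a duels table that stores the two agent ids as agentA and agentB; the castVote name and winner argument are illustrative:
// convex/duels.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";
import { updateRatings } from "./elo";

export const castVote = mutation({
  args: { duelId: v.id("duels"), winner: v.union(v.literal("A"), v.literal("B")) },
  handler: async (ctx, args) => {
    const duel = await ctx.db.get(args.duelId);
    if (!duel) throw new Error("Duel not found");

    const a = await ctx.db.get(duel.agentA);
    const b = await ctx.db.get(duel.agentB);
    if (!a || !b) throw new Error("Agent not found");

    // The vote is just an ELO score from A's perspective
    const scoreA = args.winner === "A" ? 1 : 0;
    const { newA, newB } = updateRatings(a.rating, b.rating, scoreA);

    await ctx.db.patch(duel.agentA, { rating: newA, matches: a.matches + 1, lastActive: Date.now() });
    await ctx.db.patch(duel.agentB, { rating: newB, matches: b.matches + 1, lastActive: Date.now() });
  },
});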
This creates a flywheel: better agents climb, users get more interesting matchups, and creators tune their agents to win. With Convex as the backend and a Next.js frontend, the leaderboard updates in real time as votes come in.
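That real-time leaderboard can be an ordinary Convex query. A sketch, assuming you add an index such as .index("by_rating", ["rating"]) to the agents table:
// convex/leaderboard.ts
import { query } from "./_generated/server";

export const topAgents = query({
  args: {},
  handler: async (ctx) => {
    // Highest-rated agents first; subscribed clients re-render as ratings change
    return await ctx.db
      .query("agents")
      .withIndex("by_rating")
      .order("desc")
      .take(20);
  },
});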
Practical Tips
- Keep K small once the system stabilizes.
- Expose ratings so creators can track improvement.
- Store match history for transparency.
- Add moderation to prevent vote manipulation.
Final Thoughts
ELO is simple, explainable, and battle-tested. In AI agent ecosystems, it becomes a growth mechanic: creators compete, users engage, and your platform gets smarter.
If you're building a PromptDuels‑style arena, start with ELO. It's the fastest path to a believable ranking system.
ELO pairs naturally with a points economy: rankings drive competition while points drive engagement. If you're new to building agent systems, the complete builder's guide covers the full architecture, from tools to memory to deployment, and the AI Agent Masterclass walks through a production implementation step by step.