
A Practical Guide to Measuring AI Performance
Performance metrics measure whether AI outputs meet your business needs. They track accuracy, reliability, and quality against your own criteria - a different job from industry-standard tests, which assess general capabilities.
Industry tests make headlines, but experienced teams know they're not enough. There are rumours that test answers have leaked into training data. Even human-reviewed comparisons have limits - when a system can search the web mid-test, it's an open-book exam. As Greg Brockman puts it, performance metrics are surprisingly often all you need. Industry tests tell you which tools to try. Your own measurements tell you if they actually work.
Why Measuring Performance Is Hard
Most teams know "good" when they see it but struggle to define it in a way that can be checked automatically. This leads to slow progress and internal disagreements. Often the only metric is "the CEO liked it."
Three factors compound this. First, AI responses aren't consistent - the same question can produce different results each time. Second, vendor risk matters again after a decade of cheap, stable computing: services get discontinued, pricing changes, and capabilities shift. Third, AI outputs often have subtle quality differences that matter but resist quantification.
The Three Types of Quality Checks
There are three main ways to measure AI performance, each with different costs and benefits.
1. Automated Rules
Fast and cheap. Use these for objective criteria: Did the response stay under 200 words? Did it include required sections? Did it avoid forbidden topics?
Automated rules work well for:
- Format requirements (length, structure, required elements)
- Safety checks (blocked content, compliance requirements)
- Basic quality gates (coherence, completeness)
Limitations: Can't assess subjective quality. A response can pass every rule and still be unhelpful.
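To make this concrete, here is a minimal sketch of an automated rule gate in Python. The 200-word limit matches the example above; the section names and blocked topics are illustrative assumptions, not requirements from this guide.

```python
# A minimal sketch of automated rule checks. The blocklist and required
# sections are hypothetical examples, not rules from this guide.

FORBIDDEN_TOPICS = ["medical advice", "legal advice"]   # hypothetical blocklist
REQUIRED_SECTIONS = ["Summary", "Next steps"]           # hypothetical structure

def check_response(text: str, max_words: int = 200) -> dict:
    """Run cheap, objective checks and return a pass/fail report."""
    words = text.split()
    lowered = text.lower()
    return {
        "under_word_limit": len(words) <= max_words,
        "has_required_sections": all(s.lower() in lowered for s in REQUIRED_SECTIONS),
        "avoids_forbidden_topics": not any(t in lowered for t in FORBIDDEN_TOPICS),
    }

if __name__ == "__main__":
    report = check_response("Summary: ship it.\nNext steps: monitor error rates.")
    print(report)  # every check should come back True for this toy input
```

Checks like these run in milliseconds and cost nothing, which is exactly why they belong at the front of the pipeline.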
2. AI-Assisted Review
Use one AI to evaluate another. More nuanced than rules, much faster than humans.
Works well for:
- Comparing response quality across multiple options
- Checking for tone, style, and appropriateness
- Identifying potential issues for human review
Limitations: AI reviewers have their own biases and blind spots. They tend to prefer verbose responses. They can miss errors they would make themselves.
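A sketch of what AI-assisted review can look like: a second model grades each response against a short rubric and flags anything doubtful for a human. The prompt wording, the 1-5 scale, and the call_model function are assumptions standing in for whatever client and criteria you actually use.

```python
# A sketch of AI-assisted review ("use one AI to evaluate another").
# `call_model` is a placeholder for your own model client; its existence
# and behaviour are assumptions, not part of this guide.
import json
from typing import Callable

JUDGE_PROMPT = """You are reviewing an AI response for a customer-facing product.
Question: {question}
Response: {response}
Rate relevance and tone from 1-5 and flag anything a human should review.
Reply as JSON: {{"relevance": int, "tone": int, "flag_for_human": bool, "reason": str}}"""

def ai_review(question: str, response: str, call_model: Callable[[str], str]) -> dict:
    """Ask a second model to grade a response; route to a human if the output is unusable."""
    raw = call_model(JUDGE_PROMPT.format(question=question, response=response))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the judge doesn't return valid JSON, flag for human review instead.
        return {"relevance": None, "tone": None, "flag_for_human": True,
                "reason": "judge output was not parseable"}
```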
3. Human Review
The gold standard for quality, but slow and expensive. Use it strategically.
Reserve human review for:
- High-stakes outputs where errors are costly
- Calibrating automated systems (building training data)
- Catching issues AI reviewers miss
Limitations: Humans are inconsistent. Different reviewers apply different standards. Fatigue affects judgment. Scale is limited.
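One practical use of those human reviews is calibration: have people grade a small batch by hand, have the AI reviewer grade the same batch, and measure how often they agree. A minimal sketch, with made-up pass/fail labels for illustration:

```python
# A sketch of calibrating an AI reviewer against human labels by
# computing a simple agreement rate. Labels and values are illustrative.

def agreement_rate(human_labels: list[str], ai_labels: list[str]) -> float:
    """Fraction of items where the AI reviewer matched the human reviewer."""
    if len(human_labels) != len(ai_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(h == a for h, a in zip(human_labels, ai_labels))
    return matches / len(human_labels)

# Example: humans graded 5 outputs pass/fail; the AI reviewer agreed on 4 of them.
print(agreement_rate(["pass", "pass", "fail", "pass", "fail"],
                     ["pass", "fail", "fail", "pass", "fail"]))  # 0.8
```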
Building Your Performance System
Most teams need all three types working together:
- Automated rules as the first gate - cheap, fast, catches obvious problems
- AI review for the middle layer - more nuanced, scales well
- Human review for the most important outputs and for calibrating the other systems
The goal is pushing as much validation as possible to cheaper, faster methods while maintaining quality. Humans should spend their time on judgment calls, not checking word counts.
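As a sketch, the three layers can be wired together as a single routing function. The gate functions here are stand-ins for helpers like the ones sketched above, and the return labels are illustrative.

```python
# A sketch of the layered approach: cheap rules first, AI review next,
# humans only for what remains. `rule_check` and `ai_reviewer` stand in
# for helpers like the earlier sketches.

def evaluate(question: str, response: str, rule_check, ai_reviewer) -> str:
    """Route a response through the gates and report where it ends up."""
    rules = rule_check(response)
    if not all(rules.values()):
        return "rejected_by_rules"          # cheapest gate catches obvious problems

    review = ai_reviewer(question, response)
    if review.get("flag_for_human"):
        return "queued_for_human_review"    # humans see only the hard cases

    return "accepted"
```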
What to Measure
Start with what matters most to your business. Common metrics include:
- Accuracy: Is the information correct?
- Relevance: Does it address the actual question?
- Completeness: Are important points covered?
- Tone: Is it appropriate for the audience?
- Actionability: Can the user do something with this?
Don't try to measure everything. Pick 3-5 metrics that directly connect to business outcomes.
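One lightweight way to enforce that discipline is to write the rubric down in code. The sketch below picks three of the metrics listed above; the 1-5 scale and equal weighting are assumptions you would adjust.

```python
# A sketch of pinning down a small rubric instead of measuring everything.
# Metric names follow the list above; the scale and weighting are assumptions.

RUBRIC = {
    "accuracy":      "Is the information correct?",
    "relevance":     "Does it address the actual question?",
    "actionability": "Can the user do something with this?",
}

def overall_score(scores: dict[str, int]) -> float:
    """Average 1-5 scores across the rubric; fail loudly if a metric is missing."""
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[m] for m in RUBRIC) / len(RUBRIC)

print(overall_score({"accuracy": 5, "relevance": 4, "actionability": 3}))  # 4.0
```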
The Strategic Advantage
Teams that build strong performance systems gain compounding advantages:
- They can test new models quickly and confidently
- They catch regressions before users do
- They have data to justify AI investments
- They can fine-tune systems with validated examples
The investment in measurement infrastructure pays off every time you need to make a decision about your AI systems.
Getting Started
You don't need a perfect system to start. Begin with:
- Define what good looks like for your most common use case
- Create 20-30 test examples with known good outputs
- Run your current system against these examples (a minimal harness is sketched after this list)
- Identify the biggest gaps
- Add automated checks for the most common problems
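A minimal harness for those first steps might look like the sketch below. The example question, the generate function, and the score function are placeholders for your own system and your own definition of good.

```python
# A sketch of the starting point: a small set of examples with known good
# outputs, run the current system against them, and surface the biggest gaps.
# `generate` and `score` are placeholders for your own system and scoring.

EXAMPLES = [
    {"question": "How do I reset my password?",
     "known_good": "Go to Settings > Security and choose Reset password."},
    # ... 20-30 of these, drawn from your most common use case
]

def run_eval(generate, score) -> list[dict]:
    """Generate an answer per example and score it against the known good output."""
    results = []
    for ex in EXAMPLES:
        answer = generate(ex["question"])
        results.append({"question": ex["question"],
                        "answer": answer,
                        "score": score(answer, ex["known_good"])})
    # Lowest scores first: these are the gaps to look at.
    return sorted(results, key=lambda r: r["score"])
```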
Iterate from there. The goal isn't perfection - it's having enough visibility to improve systematically.
You can't improve what you can't measure. Start measuring.