Evals: how measurement drives AI in companies
More than a million companies already use AI to gain efficiency and create value. Why do so many fail to get the results they expect? The answer starts with measuring wisely: evals turn fuzzy goals into concrete, measurable objectives.
What are evals and why do they matter?
Think of an eval as the product requirements document, but for AI systems. Instead of saying "improve customer support," an eval forces you to be specific: what inputs arrive, what output do you expect, and which errors are unacceptable.
Why does that change the game? Because without that specificity you don't know whether the AI is failing due to technology, data, or a poorly defined goal. With evals you can reduce serious mistakes, defend against risks, and chart a clear path to better ROI.
How to start: a small team and a golden set
Start with a small, empowered team that can write the system's purpose in plain terms. Mix technical experts with domain people: if it's for sales, bring salespeople.
Practical steps:
Define the objective in one sentence, for example: "Convert qualified incoming emails into scheduled demos while keeping brand tone."
Map the full flow and every decision point.
Create the golden set: concrete examples that represent what experts consider "excellent".
That set will be your authoritative reference and it should live and change over time.
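As a minimal sketch, a golden-set entry can be a simple record of the input, the output experts consider excellent, and the errors that are never acceptable. The `GoldenExample` class, the example content, and the `golden_set.jsonl` file name below are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class GoldenExample:
    """One expert-approved example: input, expected output, and hard constraints."""
    input_text: str          # e.g. an incoming email
    expected_output: str     # what an expert considers "excellent"
    must_not: list[str]      # unacceptable errors for this case

golden_set = [
    GoldenExample(
        input_text="Hi, we have 40 reps and want a demo of your CRM next week.",
        expected_output="Qualified lead: propose two demo slots, keep brand tone, cc sales.",
        must_not=["quoting prices", "promising unavailable features"],
    ),
]

# Persist as JSONL so the set can live, grow, and be versioned over time.
with open("golden_set.jsonl", "w") as f:
    for ex in golden_set:
        f.write(json.dumps(asdict(ex)) + "\n")
```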
Prototype, review 50 to 100 outputs, and do error analysis
Don't try to solve everything at once. Do early prototypes and review real samples: 50 to 100 outputs are usually enough to spot failure patterns.
From that exercise you'll produce an error taxonomy, with the frequency of each error type, that you should keep tracking as you improve the system. That list tells you where to invest effort: prompts, data, or model changes.
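As a rough sketch, the taxonomy can start as a hand-labeled tally over the outputs you just reviewed; the category names below are illustrative, not a fixed list:

```python
from collections import Counter

# Labels assigned by hand while reviewing 50 to 100 real outputs.
reviewed = [
    {"id": 1, "errors": ["wrong_tone"]},
    {"id": 2, "errors": []},
    {"id": 3, "errors": ["hallucinated_fact", "wrong_tone"]},
    {"id": 4, "errors": ["missed_qualification"]},
]

taxonomy = Counter(err for case in reviewed for err in case["errors"])

# Frequencies point to where effort pays off: prompts, data, or model changes.
for error, count in taxonomy.most_common():
    print(f"{error}: {count}/{len(reviewed)} reviewed outputs")
```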
Measure under real conditions and use rubrics with care
Create a test environment that mimics the real world, not just a prompt playground. Evaluate against your golden set and expose the system to edge cases that, although rare, are costly if they fail.
Rubrics help make judgments concrete, but be careful: don't obsess over superficial metrics. Some qualities are hard to quantify and require expert judgment.
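A minimal sketch of a rubric as explicit, weighted criteria; the criteria, weights, and scores below are assumptions for illustration, and the per-criterion scores still come from expert judgment (or, later, an audited grader):

```python
# Criteria and weights are illustrative; adjust them to your use case.
rubric = {
    "follows_brand_tone": {"weight": 0.3},
    "proposes_next_step": {"weight": 0.4},
    "no_factual_errors":  {"weight": 0.3},
}

def rubric_score(scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each in [0, 1]."""
    return sum(rubric[name]["weight"] * value for name, value in scores.items())

# An expert fills in the per-criterion scores for one output.
print(rubric_score({"follows_brand_tone": 1.0, "proposes_next_step": 0.5, "no_factual_errors": 1.0}))
```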
Automate with human oversight: LLM graders and auditing
You can scale some evals with an LLM grader that scores outputs like an expert would. The catch? Never trust it blindly. Keep a human in the loop to audit the grader's accuracy and review logs when ambiguous or costly cases appear.
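A hedged sketch of how that split of work could look: `call_llm` stands in for whatever model client you actually use, and the audit thresholds and sampling rate are arbitrary assumptions:

```python
import json
import random

def call_llm(prompt: str) -> str:
    """Placeholder for your model client; swap in the API you actually use."""
    raise NotImplementedError

def grade_output(task: str, output: str) -> dict:
    """Ask an LLM grader to score an output the way an expert would."""
    prompt = (
        "You are grading an AI assistant's reply.\n"
        f"Task: {task}\nReply: {output}\n"
        'Return JSON like {"score": 0-1, "reason": "..."}.'
    )
    return json.loads(call_llm(prompt))

def needs_human_audit(grade: dict, sample_rate: float = 0.1) -> bool:
    """Route ambiguous grades, plus a random sample, to a human reviewer."""
    ambiguous = 0.4 <= grade["score"] <= 0.7
    return ambiguous or random.random() < sample_rate
```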
Close the loop: data flywheel and continuous improvement
Log inputs, outputs, and outcomes. Surface those logs periodically and send ambiguous cases for expert review. Add those judgments to the eval and the error taxonomy, and use them to refine prompts, data access, or models.
This way you build a contextual dataset that's hard to replicate: a real competitive advantage.
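A minimal sketch of that loop, assuming a simple JSONL log and a grade-based definition of "ambiguous"; the file name, outcome labels, and thresholds are illustrative:

```python
import json
from datetime import datetime, timezone

LOG_PATH = "interactions.jsonl"  # illustrative file name

def log_interaction(input_text: str, output_text: str, outcome: str, grade: float) -> None:
    """Append one input/output/outcome record; this log feeds the flywheel."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "input": input_text,
        "output": output_text,
        "outcome": outcome,   # e.g. "demo_booked", "no_reply", "escalated"
        "grade": grade,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def ambiguous_cases(low: float = 0.4, high: float = 0.7) -> list[dict]:
    """Surface mid-range grades for expert review; their judgments go back into the eval."""
    with open(LOG_PATH) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if low <= r["grade"] <= high]
```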
Risks, maintenance, and experimentation
Evals aren't a static recipe. As models, data, and objectives change, evals must be maintained, expanded, and stress-tested.
For external products, evals don't replace A/B tests; they complement them. A well-designed eval gives you visibility into how changes affect real performance.
What this means for leaders
Working with probabilistic systems requires new measurements and decisions about trade-offs: when you need precision and when you can favor speed. Ultimately, classic management skills (defining objectives, giving direct feedback, and exercising prudent judgment) become AI skills.
If you can't say what "excellent" means for your use case, it's unlikely you'll achieve it. Evals are therefore as much a management practice as a technical one.
In the end, the invitation is clear: don't wait for AI to work magic. Specify what you want, measure it, and improve it iteratively. Start small, involve experts, measure in real conditions, and build the data loop that grows your system with purpose.