Understanding Evaluation
How evaluation works in Future AGI: templates, judge models, results, and where evals run.
About
Evaluation in Future AGI is a systematic way to measure whether your AI is producing the right outputs. You define what “right” means once, using an eval template, and the platform scores every response automatically against that definition, returning a result and a reason for each one.
Every eval run has three components working together: a template that defines the criteria, a judge model that applies the criteria to each response, and a result that records the outcome. You supply the data; the platform handles the scoring.
How it works
- Choose a template: Select a built-in template (e.g. Toxicity, Groundedness, Tone) or create a custom one with your own rule prompt. Templates define what to measure and what output type to expect (pass/fail, score, or a category).
- Map your data: Tell the eval which columns or input keys contain the text to evaluate (e.g. which column is the model response and which is the reference context).
- Pick a judge model: Choose a Future AGI model (e.g. `turing_flash`) or bring your own via a custom model integration. The judge reads each row and applies the template criteria.
- Run: The platform processes every row in parallel. Each row gets a result (pass/fail or a score) and a reason explaining the judgment.
- Review: Results appear as new columns in your dataset, or as aggregated summaries and KPIs across runs.
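The steps above can be sketched in plain Python. This is a minimal stand-in for illustration only: the dict shapes, the `judge()` stub, and the field names are assumptions, not the Future AGI SDK.

```python
# Minimal sketch of the eval flow: template -> judge -> result + reason.
# All names here are illustrative assumptions, not the Future AGI SDK.

def judge(template, row):
    """Stub judge model: applies the template's rule to the mapped column."""
    text = row[template["input_key"]]
    passed = template["rule"](text)
    return {
        "result": "pass" if passed else "fail",
        "reason": ("meets" if passed else "violates")
                  + f" criterion '{template['name']}'",
    }

# Step 1: the template defines the criterion and the expected output type
template = {
    "name": "non_empty_response",
    "input_key": "response",                  # step 2: data mapping
    "rule": lambda text: bool(text.strip()),  # pass/fail rule
}

rows = [
    {"response": "Paris is the capital of France."},
    {"response": "   "},
]

# Step 4: score every row; step 5: results land as new columns on each row
for row in rows:
    row.update(judge(template, row))
```

After the loop, each row carries a `result` and a `reason` column alongside its original data, mirroring how results appear in a dataset on the platform.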
Where evals run
Evals are not limited to datasets. The same templates work across every surface in Future AGI:
| Surface | What you evaluate |
|---|---|
| Dataset | Score every row in a dataset against one or more templates |
| Simulation | Evaluate agent responses in a run test against your criteria |
| Experiments | Compare prompt or model versions using the same eval criteria |
| CI/CD pipeline | Run evals automatically on every code change and track results by version |
| SDK | Call `evaluator.evaluate()` from any script or application |
Using the same template across surfaces keeps results directly comparable without redefining criteria each time.
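As a sketch of that reuse, the same template object can score a dataset row and a live agent reply alike. The `Evaluator` class, its constructor, and the return shape below are assumptions for illustration; only the `evaluate()` call itself is named in the SDK surface above.

```python
# Hypothetical Evaluator wrapper illustrating template reuse across surfaces.
# The class, constructor, and return shape are assumptions, not the real API.

class Evaluator:
    def __init__(self, template_name):
        self.template_name = template_name

    def evaluate(self, inputs):
        # Stub criterion: the mapped 'response' must be non-empty.
        ok = bool(inputs.get("response", "").strip())
        return {"template": self.template_name,
                "result": "pass" if ok else "fail"}

toxicity = Evaluator("Toxicity")

# Same template, two surfaces: a dataset row and a live agent reply
dataset_row = {"response": "The weather is mild today."}
agent_reply = {"response": ""}

batch_score = toxicity.evaluate(dataset_row)  # e.g. scoring a dataset
live_score = toxicity.evaluate(agent_reply)   # e.g. an SDK call in an app
```

Because both calls share one criterion definition, the two scores are directly comparable.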
Key concepts
- Eval templates: The definition of what to measure. Built-in or custom.
- Judge models: The model that applies the template criteria and produces the result.
- Eval results: The output of a run: result value, reason, and aggregates.
- Eval groups: Named collections of templates you run together as a single unit.
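An eval group can be pictured as a named map of templates scored in one pass. The `run_group` helper and the rule shapes below are illustrative assumptions, not the platform API.

```python
# Illustrative eval group: several named templates run together as one unit.
# Function and field names are assumptions, not the platform API.

def run_group(group, row):
    """Apply every template in the group to one row; return results by name."""
    return {name: rule(row) for name, rule in group.items()}

group = {
    "non_empty": lambda row: "pass" if row["response"].strip() else "fail",
    "max_length": lambda row: "pass" if len(row["response"]) <= 200 else "fail",
}

scores = run_group(group, {"response": "Short and sweet."})
```

Each row then gets one result per template in the group, keyed by template name.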
Next steps
- Evaluate via Platform & SDK: Run your first eval.
- Built-in evals: 70+ templates across quality, safety, factuality, RAG, and more.
- Create custom evals: Define your own criteria and rule prompts.
- Eval groups: Bundle multiple evals and run them in one pass.