Eval Templates
What eval templates are, the difference between built-in and custom templates, and how output types work.
About
An eval template is the definition of what to measure. It contains the criteria the judge model will apply to each response and specifies what kind of result to return. You create a template once and reuse it across any dataset, simulation, experiment, or SDK call.
Templates are the reusable unit of evaluation logic. Whether you’re checking for toxicity, verifying that a response stays grounded in a source document, or enforcing a company-specific rule, the logic lives in the template.
Built-in vs custom templates
| | Built-in | Custom |
|---|---|---|
| Who writes the criteria | Future AGI | You |
| How to access | Select from the template list in the UI or pass the name to the SDK | Create via UI or API, then use by name |
| Covers | 70+ categories: quality, safety, factuality, RAG, bias, format, audio, image | Any domain-specific, business, or regulatory rule you define |
| Required inputs | Defined per template (e.g. input, output, context) | You define the required keys in the template config |
See Built-in evals for the full list of available templates.
See Create custom evals for how to write your own.
Output types
Every template returns one of three output types:
| Type | Description | Example |
|---|---|---|
| Pass/Fail | Binary result: 1.0 for pass, 0.0 for fail | Toxicity check: passed or failed |
| Score (percentage) | Numeric value between 0 and 100 | Groundedness: 87 out of 100 |
| Deterministic choices | Categorical result from a defined set of options | Tone classification: formal, informal, neutral |
Every result also includes a reason: a plain-language explanation of why the judge assigned that result. This makes it possible to understand failures without reviewing each response manually.
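To make this concrete, here is a sketch of what results of each output type might look like and how the reason field supports triage. The field names (`value`, `reason`) are illustrative assumptions, not the platform's exact schema; see Eval results for the real shape.

```python
# Hypothetical result payloads; the "value"/"reason" field names are
# illustrative, not the platform's exact schema.
results = {
    "toxicity":     {"value": 0.0,      "reason": "Response insults the user."},          # Pass/Fail
    "groundedness": {"value": 87,       "reason": "One claim lacks support in context."}, # Score
    "tone":         {"value": "formal", "reason": "Uses no contractions."},               # Choice
}

# Failures can be triaged from the reason field without rereading responses.
failures = {name: r["reason"] for name, r in results.items() if r["value"] == 0.0}
print(failures)  # -> {'toxicity': 'Response insults the user.'}
```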
Required keys and input mapping
Templates declare the input keys they expect. For example, a groundedness template might require output (the model response) and context (the source document). When you run an eval, you map your actual data to these keys.
In the UI: When you add a template to a dataset or simulation, the platform shows a mapping form. You select which column corresponds to each required key.
In the SDK: Pass a dict where the keys match what the template expects:
```python
result = evaluator.evaluate(
    template=Groundedness(),
    input={
        "output": "The Eiffel Tower is in Paris.",
        "context": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    },
)
```
Built-in templates have fixed required keys documented in the template reference. Custom templates let you define any keys using {{variable_name}} placeholders in the rule prompt: the key names you use in the prompt become the required keys you must supply at run time.
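The placeholder mechanism can be sketched as plain string substitution. The rule prompt and key names below (`refund_policy`, `output`) are hypothetical, chosen only to show how the placeholder names in a custom prompt become the keys you must supply at run time:

```python
import re

# Hypothetical custom rule prompt; {{refund_policy}} and {{output}} become
# the required keys the caller must supply at run time.
rule_prompt = (
    "Does the response comply with this refund policy?\n"
    "Policy: {{refund_policy}}\nResponse: {{output}}"
)

# The required keys are exactly the placeholder names found in the prompt.
required_keys = set(re.findall(r"\{\{(\w+)\}\}", rule_prompt))
print(sorted(required_keys))  # -> ['output', 'refund_policy']

# At run time, each key is mapped to your data before the judge sees it.
inputs = {
    "refund_policy": "Refunds are available within 30 days of purchase.",
    "output": "You can get a refund within a month.",
}
assert required_keys == set(inputs)
filled = rule_prompt
for key, value in inputs.items():
    filled = filled.replace("{{" + key + "}}", value)
```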
Next steps
- Built-in evals: Full list of available templates with required keys and output types.
- Create custom evals: Write your own criteria and rule prompts.
- Eval types: LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker.
- Judge models: How the model applies a template to produce a result.
- Eval results: What the output of an eval run looks like.