Eval Templates

What eval templates are, the difference between built-in and custom templates, and how output types work.

About

An eval template is the definition of what to measure. It contains the criteria the judge model will apply to each response and specifies what kind of result to return. You create a template once and reuse it across any dataset, simulation, experiment, or SDK call.

Templates are the reusable unit of evaluation logic. Whether you’re checking for toxicity, verifying that a response stays grounded in a source document, or enforcing a company-specific rule, the logic lives in the template.


Built-in vs custom templates

|                         | Built-in                                                      | Custom                                                        |
| ----------------------- | ------------------------------------------------------------- | ------------------------------------------------------------- |
| Who writes the criteria | Future AGI                                                    | You                                                           |
| How to access           | Select from the template list in the UI or pass the name to the SDK | Create via UI or API, then use by name                  |
| Covers                  | 70+ categories: quality, safety, factuality, RAG, bias, format, audio, image | Any domain-specific, business, or regulatory rule you define |
| Required inputs         | Defined per template (e.g. input, output, context)            | You define the required keys in the template config           |

See Built-in evals for the full list of available templates.

See Create custom evals for how to write your own.


Output types

Every template returns one of three output types:

| Type                  | Description                                       | Example                                        |
| --------------------- | ------------------------------------------------- | ---------------------------------------------- |
| Pass/Fail             | Binary result: 1.0 for pass, 0.0 for fail         | Toxicity check: passed or failed               |
| Score (percentage)    | Numeric value between 0 and 100                   | Groundedness: 87 out of 100                    |
| Deterministic choices | Categorical result from a defined set of options  | Tone classification: formal, informal, neutral |

Every result also includes a reason: a plain-language explanation of why the judge assigned that result. This makes it possible to understand failures without reviewing each response manually.
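To make the three output types concrete, here is a minimal sketch of what a result might carry and how its value could be validated against its declared type. The `EvalResult` shape and field names are illustrative assumptions, not the SDK's actual schema:

```python
from dataclasses import dataclass

# Hypothetical result shape for illustration only; field names are
# assumptions, not the SDK's actual schema.
@dataclass
class EvalResult:
    output_type: str  # "pass_fail", "score", or "choice"
    value: object     # 1.0/0.0, a number 0-100, or a category string
    reason: str       # the judge's plain-language explanation

def is_valid(result, choices=None):
    """Check that a result's value matches its declared output type."""
    if result.output_type == "pass_fail":
        return result.value in (0.0, 1.0)
    if result.output_type == "score":
        return isinstance(result.value, (int, float)) and 0 <= result.value <= 100
    if result.output_type == "choice":
        return choices is not None and result.value in choices
    return False

r = EvalResult("score", 87, "The response is mostly grounded in the context.")
print(is_valid(r))  # True
```

Whatever the concrete schema, the `reason` travels with the value, so a failing pass/fail result or an out-of-range score can be explained without rereading the response.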


Required keys and input mapping

Templates declare the input keys they expect. For example, a groundedness template might require output (the model response) and context (the source document). When you run an eval, you map your actual data to these keys.

In the UI: When you add a template to a dataset or simulation, the platform shows a mapping form. You select which column corresponds to each required key.

In the SDK: Pass a dict where the keys match what the template expects:

# `evaluator` and `Groundedness` come from the SDK's setup; the dict keys
# must match the template's required input keys.
result = evaluator.evaluate(
    template=Groundedness(),
    input={
        "output": "The Eiffel Tower is in Paris.",  # the model response
        "context": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",  # the source document
    },
)
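The UI's mapping form can be mirrored in code with a plain column-to-key lookup. The `map_row` helper below is a hypothetical sketch, not part of the SDK; it builds the input dict for a template from one dataset row:

```python
def map_row(row, mapping):
    """Build a template input dict from a dataset row.

    mapping: required template key -> dataset column name.
    Raises KeyError if the row is missing a mapped column.
    """
    missing = [col for col in mapping.values() if col not in row]
    if missing:
        raise KeyError(f"Dataset row is missing columns: {missing}")
    return {key: row[col] for key, col in mapping.items()}

row = {
    "response": "The Eiffel Tower is in Paris.",
    "source_doc": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
}
inputs = map_row(row, {"output": "response", "context": "source_doc"})
# `inputs` now has the "output" and "context" keys a groundedness
# template expects, regardless of the dataset's column names.
```

Failing fast on missing columns before the eval runs saves a round trip to the judge model.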

Built-in templates have fixed required keys documented in the template reference. Custom templates let you define any keys using {{variable_name}} placeholders in the rule prompt: the key names you use in the prompt become the required keys you must supply at run time.
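Because the placeholders in the rule prompt define the required keys, they can be extracted mechanically. A minimal sketch, assuming the `{{variable_name}}` syntax described above (the rule text itself is a made-up example):

```python
import re

def required_keys(rule_prompt):
    """Collect the {{variable_name}} placeholders a rule prompt declares."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", rule_prompt))

rule = (
    "Fail the response if {{output}} recommends a vendor "
    "that is not listed in {{approved_vendors}}."
)
print(sorted(required_keys(rule)))  # ['approved_vendors', 'output']
```

At run time, the input dict you pass must supply exactly these keys, just as with a built-in template's documented inputs.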


Next steps

  • Built-in evals: Full list of available templates with required keys and output types.
  • Create custom evals: Write your own criteria and rule prompts.
  • Eval types: LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker.
  • Judge models: How the model applies a template to produce a result.
  • Eval results: What the output of an eval run looks like.