Eval Templates

What eval templates are, the difference between built-in and custom templates, and how output types work.

About

An eval template is the definition of what to measure. It contains the criteria the judge model will apply to each response and specifies what kind of result to return. You create a template once and reuse it across any dataset, simulation, experiment, or SDK call.

Templates are the reusable unit of evaluation logic. Whether you’re checking for toxicity, verifying that a response stays grounded in a source document, or enforcing a company-specific rule, the logic lives in the template.


Built-in vs custom templates

|                         | Built-in                                                      | Custom                                                        |
| ----------------------- | ------------------------------------------------------------- | ------------------------------------------------------------- |
| Who writes the criteria | Future AGI                                                    | You                                                           |
| How to access           | Select from the template list in the UI or pass the name to the SDK | Create via UI or API, then use by name                  |
| Covers                  | 70+ categories: quality, safety, factuality, RAG, bias, format, audio, image | Any domain-specific, business, or regulatory rule you define |
| Required inputs         | Defined per template (e.g. input, output, context)            | You define the required keys in the template config           |

See Built-in evals for the full list of available templates.

See Create custom evals for how to write your own.


Output types

Every template returns one of three output types:

| Type                  | Description                                       | Example                                        |
| --------------------- | ------------------------------------------------- | ---------------------------------------------- |
| Pass/Fail             | Binary result: 1.0 for pass, 0.0 for fail         | Toxicity check: passed or failed               |
| Score (percentage)    | Numeric value between 0 and 100                   | Groundedness: 87 out of 100                    |
| Deterministic choices | Categorical result from a defined set of options  | Tone classification: formal, informal, neutral |

Every result also includes a reason: a plain-language explanation of why the judge assigned that result. This makes it possible to understand failures without reviewing each response manually.
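To make the three output types concrete, here is a minimal sketch of what a result might carry and how its value could be validated against its declared type. The `EvalResult` shape and field names are illustrative assumptions, not the SDK's actual schema:

```python
from dataclasses import dataclass

# Hypothetical result shape for illustration only; field names are
# assumptions, not the SDK's actual schema.
@dataclass
class EvalResult:
    output_type: str  # "pass_fail", "score", or "choice"
    value: object     # 1.0/0.0, a number 0-100, or a category string
    reason: str       # the judge's plain-language explanation

def is_valid(result, choices=None):
    """Check that a result's value matches its declared output type."""
    if result.output_type == "pass_fail":
        return result.value in (0.0, 1.0)
    if result.output_type == "score":
        return isinstance(result.value, (int, float)) and 0 <= result.value <= 100
    if result.output_type == "choice":
        return choices is not None and result.value in choices
    return False

r = EvalResult("score", 87, "The response is mostly grounded in the context.")
print(is_valid(r))  # True
```

Whatever the concrete schema, the `reason` travels with the value, so a failing pass/fail result or an out-of-range score can be explained without rereading the response.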


Required keys and input mapping

Templates declare the input keys they expect. For example, a groundedness template might require output (the model response) and context (the source document). When you run an eval, you map your actual data to these keys.

In the UI: When you add a template to a dataset or simulation, the platform shows a mapping form. You select which column corresponds to each required key.

In the SDK: Pass a dict where the keys match what the template expects:

# `evaluator` and `Groundedness` come from the SDK's setup; the dict keys
# must match the template's required input keys.
result = evaluator.evaluate(
    template=Groundedness(),
    input={
        "output": "The Eiffel Tower is in Paris.",  # the model response
        "context": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",  # the source document
    },
)
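The UI's mapping form can be mirrored in code with a plain column-to-key lookup. The `map_row` helper below is a hypothetical sketch, not part of the SDK; it builds the input dict for a template from one dataset row:

```python
def map_row(row, mapping):
    """Build a template input dict from a dataset row.

    mapping: required template key -> dataset column name.
    Raises KeyError if the row is missing a mapped column.
    """
    missing = [col for col in mapping.values() if col not in row]
    if missing:
        raise KeyError(f"Dataset row is missing columns: {missing}")
    return {key: row[col] for key, col in mapping.items()}

row = {
    "response": "The Eiffel Tower is in Paris.",
    "source_doc": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
}
inputs = map_row(row, {"output": "response", "context": "source_doc"})
# `inputs` now has the "output" and "context" keys a groundedness
# template expects, regardless of the dataset's column names.
```

Failing fast on missing columns before the eval runs saves a round trip to the judge model.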

Built-in templates have fixed required keys documented in the template reference. Custom templates let you define any keys using {{variable_name}} placeholders in the rule prompt: the key names you use in the prompt become the required keys you must supply at run time.
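Because the placeholders in the rule prompt define the required keys, they can be extracted mechanically. A minimal sketch, assuming the `{{variable_name}}` syntax described above (the rule text itself is a made-up example):

```python
import re

def required_keys(rule_prompt):
    """Collect the {{variable_name}} placeholders a rule prompt declares."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", rule_prompt))

rule = (
    "Fail the response if {{output}} recommends a vendor "
    "that is not listed in {{approved_vendors}}."
)
print(sorted(required_keys(rule)))  # ['approved_vendors', 'output']
```

At run time, the input dict you pass must supply exactly these keys, just as with a built-in template's documented inputs.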


Next steps

  • Built-in evals: Full list of available templates with required keys and output types.
  • Create custom evals: Write your own criteria and rule prompts.
  • Eval types: LLM as Judge, Deterministic, Statistical Metric, and LLM as Ranker.
  • Judge models: How the model applies a template to produce a result.
  • Eval results: What the output of an eval run looks like.