Understanding Evaluation
How evaluation works in Future AGI: templates, judge models, results, and where evals run.
About
Evaluation in Future AGI is a systematic way to measure whether your AI is producing the right outputs. You define what “right” means once, using an eval template, and the platform scores every response automatically against that definition, returning a result and a reason for each one.
Every eval run has three components working together: a template that defines the criteria, a judge model that applies the criteria to each response, and a result that records the outcome. You supply the data; the platform handles the scoring.
How it works
- Choose a template: Select a built-in template (e.g. Toxicity, Groundedness, Tone) or create a custom one with your own rule prompt. Templates define what to measure and what output type to expect (pass/fail, score, or a category).
- Map your data: Tell the eval which columns or input keys contain the text to evaluate (e.g. which column is the model response and which is the reference context).
- Pick a judge model: Choose a Future AGI model (e.g. `turing_flash`) or bring your own via a custom model integration. The judge reads each row and applies the template criteria.
- Run: The platform processes every row in parallel. Each row gets a result (pass/fail or a score) and a reason explaining the judgment.
- Review: Results appear as new columns in your dataset, or as aggregated summaries and KPIs across runs.
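The steps above can be sketched in plain Python. This is a minimal stand-in for illustration only: the dict shapes, the `judge()` stub, and the field names are assumptions, not the Future AGI SDK.

```python
# Minimal sketch of the eval flow: template -> judge -> result + reason.
# All names here are illustrative assumptions, not the Future AGI SDK.

def judge(template, row):
    """Stub judge model: applies the template's rule to the mapped column."""
    text = row[template["input_key"]]
    passed = template["rule"](text)
    return {
        "result": "pass" if passed else "fail",
        "reason": ("meets" if passed else "violates")
                  + f" criterion '{template['name']}'",
    }

# Step 1: the template defines the criterion and the expected output type
template = {
    "name": "non_empty_response",
    "input_key": "response",                  # step 2: data mapping
    "rule": lambda text: bool(text.strip()),  # pass/fail rule
}

rows = [
    {"response": "Paris is the capital of France."},
    {"response": "   "},
]

# Step 4: score every row; step 5: results land as new columns on each row
for row in rows:
    row.update(judge(template, row))
```

After the loop, each row carries a `result` and a `reason` column alongside its original data, mirroring how results appear in a dataset on the platform.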
Where evals run
Evals are not limited to datasets. The same templates work across every surface in Future AGI:
| Surface | What you evaluate |
|---|---|
| Dataset | Score every row in a dataset against one or more templates |
| Simulation | Evaluate agent responses in a run test against your criteria |
| Experiments | Compare prompt or model versions using the same eval criteria |
| CI/CD pipeline | Run evals automatically on every code change and track results by version |
| SDK | Call `evaluator.evaluate()` from any script or application |
Using the same template across surfaces keeps results directly comparable without redefining criteria each time.
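As a sketch of that reuse, the same template object can score a dataset row and a live agent reply alike. The `Evaluator` class, its constructor, and the return shape below are assumptions for illustration; only the `evaluate()` call itself is named in the SDK surface above.

```python
# Hypothetical Evaluator wrapper illustrating template reuse across surfaces.
# The class, constructor, and return shape are assumptions, not the real API.

class Evaluator:
    def __init__(self, template_name):
        self.template_name = template_name

    def evaluate(self, inputs):
        # Stub criterion: the mapped 'response' must be non-empty.
        ok = bool(inputs.get("response", "").strip())
        return {"template": self.template_name,
                "result": "pass" if ok else "fail"}

toxicity = Evaluator("Toxicity")

# Same template, two surfaces: a dataset row and a live agent reply
dataset_row = {"response": "The weather is mild today."}
agent_reply = {"response": ""}

batch_score = toxicity.evaluate(dataset_row)  # e.g. scoring a dataset
live_score = toxicity.evaluate(agent_reply)   # e.g. an SDK call in an app
```

Because both calls share one criterion definition, the two scores are directly comparable.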
Key concepts
- Eval templates: The definition of what to measure. Built-in or custom.
- Judge models: The model that applies the template criteria and produces the result.
- Eval results: The output of a run: result value, reason, and aggregates.
- Eval groups: Named collections of templates you run together as a single unit.
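An eval group can be pictured as a named map of templates scored in one pass. The `run_group` helper and the rule shapes below are illustrative assumptions, not the platform API.

```python
# Illustrative eval group: several named templates run together as one unit.
# Function and field names are assumptions, not the platform API.

def run_group(group, row):
    """Apply every template in the group to one row; return results by name."""
    return {name: rule(row) for name, rule in group.items()}

group = {
    "non_empty": lambda row: "pass" if row["response"].strip() else "fail",
    "max_length": lambda row: "pass" if len(row["response"]) <= 200 else "fail",
}

scores = run_group(group, {"response": "Short and sweet."})
```

Each row then gets one result per template in the group, keyed by template name.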
Next steps
- Evaluate via Platform & SDK: Run your first eval.
- Built-in evals: 70+ templates across quality, safety, factuality, RAG, and more.
- Create custom evals: Define your own criteria and rule prompts.
- Eval groups: Bundle multiple evals and run them in one pass.