Overview
Measure and compare quality of prompts and agents across datasets, simulations, and experiments.
What it is
Evaluation is the quality assessment layer across all Future AGI products. It measures whether a prompt or agent change improved output quality, producing consistent, comparable results across runs and experiments instead of relying on manual review.

Eval templates define what to measure: task completion, tone, safety, factual accuracy, or custom metrics. Eval configs specify the model, input mapping, and context in which a template runs. Evaluations are supported across datasets, simulations (run tests and call executions), experiments, playground runs, and replay sessions. Each run returns scores, pass/fail results, and optional explanations per row or per call, along with aggregated summaries and analytics for regression detection and optimization workflows.
Note
Using the same eval templates and configs across datasets, simulations, and experiments ensures results are comparable without redefining evals for each context.
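The template/config split described above can be pictured as two small data structures: the template names what to measure, and the config binds it to a model and an input mapping for one context. The sketch below is purely illustrative; the class names, fields, and model string are assumptions, not the actual Future AGI SDK.

```python
from dataclasses import dataclass, field

@dataclass
class EvalTemplate:
    """What to measure: a named check with a pass threshold (hypothetical shape)."""
    name: str
    criteria: str          # e.g. "response stays factually accurate"
    pass_threshold: float  # score in [0, 1] at or above which a row passes

@dataclass
class EvalConfig:
    """How to run a template: evaluation model plus an input mapping for one context."""
    template: EvalTemplate
    model: str                                          # evaluation model name (placeholder)
    input_mapping: dict = field(default_factory=dict)   # context fields -> template inputs

# One template, reused in two contexts with different input mappings,
# so dataset results and simulation results stay comparable.
factuality = EvalTemplate("factuality", "response stays factually accurate", 0.7)
dataset_cfg = EvalConfig(factuality, "eval-model-1", {"output": "response_column"})
simulation_cfg = EvalConfig(factuality, "eval-model-1", {"output": "call.final_message"})
```

Because both configs share the same template object, a score of 0.7 means the same thing whether it came from a dataset row or a simulation call.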
Purpose
- Measure quality consistently — Apply the same eval definitions (templates and configs) across datasets, simulations, and experiments so results are comparable.
- Compare prompts and agents — Run evals before and after changes to see whether quality improved or regressed.
- Support optimization — Feed eval results into prompt optimization, Fix My Agent, and experiment analysis so you can iterate on quality.
- Scale evaluation — Run evals over many rows (dataset) or many calls (simulation) and view aggregated summaries, KPIs, and per-row or per-call breakdowns.
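The last bullet, running an eval over many rows and aggregating the results, can be sketched as a simple loop: score each row, mark pass/fail against a threshold, then summarize. Everything here is a toy stand-in (the keyword scorer and field names are assumptions), not how the platform scores internally.

```python
def run_eval(rows, score_fn, threshold=0.7):
    """Score each row, mark pass/fail, and aggregate a summary (toy sketch)."""
    results = []
    for row in rows:
        score = score_fn(row)
        results.append({"row": row, "score": score, "passed": score >= threshold})
    summary = {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "avg_score": sum(r["score"] for r in results) / len(results),
    }
    return results, summary

def keyword_score(row):
    """Toy scorer: fraction of expected keywords present in the output."""
    hits = sum(k in row["output"] for k in row["expected_keywords"])
    return hits / len(row["expected_keywords"])

rows = [
    {"output": "Paris is the capital of France", "expected_keywords": ["Paris", "France"]},
    {"output": "London is in England", "expected_keywords": ["Paris", "France"]},
]
results, summary = run_eval(rows, keyword_score)
# summary -> {"pass_rate": 0.5, "avg_score": 0.5}
```

The per-row `results` list corresponds to the per-row or per-call breakdowns mentioned above, and `summary` to the aggregated KPIs.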
Getting started with evaluation
Evaluate via Platform & SDK
Run your first eval from the UI or SDK in minutes.
Built-in evals
70+ templates: quality, safety, factuality, RAG, and more.
Create custom evals
Define your own eval rules and output types.
Eval groups
Bundle multiple evals and run them together.
Future AGI models
Pick the right evaluation model for your task.
Evaluate in CI/CD pipelines
Run evals automatically on every pull request.
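Running evals on every pull request usually comes down to a gate step: compare the PR branch's eval pass rate against a baseline and fail the build on a regression. The helper below is a hypothetical sketch of that gate; the function name, baseline value, and where `pass_rate` comes from are all assumptions.

```python
import sys

def ci_gate(pass_rate: float, baseline: float = 0.9) -> int:
    """Return a CI exit code: 0 if the eval pass rate meets the baseline, 1 on regression."""
    if pass_rate < baseline:
        print(f"Eval regression: pass rate {pass_rate:.0%} below baseline {baseline:.0%}")
        return 1
    print(f"Evals passed: pass rate {pass_rate:.0%} meets baseline {baseline:.0%}")
    return 0

if __name__ == "__main__":
    # In a real pipeline, pass_rate would come from the eval run on the PR branch.
    sys.exit(ci_gate(pass_rate=0.95))
```

A nonzero exit code is what makes the CI job fail, so wiring this into a pull-request check blocks merges that regress quality.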