Overview
Measure and compare the quality of prompts and agents across datasets, simulations, and experiments.
About
Evaluation is Future AGI’s quality measurement layer. It gives you a consistent, repeatable way to measure whether your prompts and agents are behaving correctly, and whether your changes improve quality or introduce regressions.
There are two building blocks: eval templates define what to measure (task completion, tone, hallucination, safety, factual accuracy, or a custom rule you write yourself), and eval configs define how to measure (the judge model, input mapping, and run settings). Combine them with your data and you get a score, a pass/fail result, and an optional explanation per row or call, plus aggregated summaries, KPIs, and trend data across runs.
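As a conceptual sketch of these two building blocks (all names here are hypothetical illustrations, not the actual Future AGI SDK), a template plus a config applied to a data row yields a score, a pass/fail result, and an explanation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTemplate:
    """What to measure (hypothetical, for illustration only)."""
    name: str
    rule: Callable[[str], float]  # maps an input to a score in [0, 1]
    pass_threshold: float = 0.5

@dataclass
class EvalConfig:
    """How to measure: judge model and input mapping (also hypothetical)."""
    judge_model: str
    input_key: str

def run_eval(template: EvalTemplate, config: EvalConfig, row: dict) -> dict:
    """Score one row, returning the per-row result shape described above."""
    value = row[config.input_key]
    score = template.rule(value)
    return {
        "score": score,
        "passed": score >= template.pass_threshold,
        "explanation": f"{template.name} scored {score:.2f} via {config.judge_model}",
    }

# Toy rule standing in for an LLM judge: response must mention "refund".
template = EvalTemplate(name="task_completion",
                        rule=lambda text: 1.0 if "refund" in text else 0.0)
config = EvalConfig(judge_model="judge-model-v1", input_key="response")
result = run_eval(template, config, {"response": "Your refund is on its way."})
```

In the real platform the rule is typically an LLM judge rather than a lambda, but the separation is the same: the template carries the criterion, the config carries the judge and mapping, and each row gets a score, a verdict, and an explanation.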
Evaluations run across every surface in Future AGI: datasets, simulations, experiments, playground, replay sessions, and CI/CD pipelines. You can also run them programmatically via the SDK. Using the same templates and configs across contexts keeps results directly comparable without redefining your quality criteria each time.
Future AGI ships 70+ built-in templates covering quality, safety, factuality, RAG retrieval, format, bias, audio, and image evaluation. You can also create custom templates and bundle any combination into eval groups to apply multiple evals in a single run.
How Evaluation Connects to Other Features
- Datasets: Run evals across dataset rows and store scores as new columns. Learn more
- Simulation: Score simulated agent conversations for quality, context retention, and escalation. Learn more
- Optimization: Feed eval results into prompt optimization to improve quality automatically. Learn more
- CI/CD: Gate pull requests on eval scores to catch regressions before they ship. Learn more
- Error Feed: Eval-powered scoring for every traced agent execution. Learn more
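The CI/CD gating pattern in the list above usually reduces to a threshold check on aggregated results. A minimal sketch, assuming per-row results shaped like `{"passed": bool}` (the field name and threshold are assumptions, not the actual pipeline integration):

```python
def gate(results: list[dict], min_pass_rate: float = 0.9) -> tuple[bool, float]:
    """Return (ok, pass_rate); ok is False when the run should block the merge."""
    passed = sum(1 for r in results if r["passed"])
    rate = passed / len(results)
    return rate >= min_pass_rate, rate

# Example: 2 of 3 rows passed -> 67% pass rate, below the 90% gate.
ok, rate = gate([{"passed": True}, {"passed": True}, {"passed": False}])
print(f"pass rate: {rate:.0%}")
# In CI, exit with a nonzero status when ok is False to fail the check.
```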
Getting Started
Evaluate via Platform & SDK
Run your first eval from the UI or SDK in minutes.
Built-in evals
70+ templates: quality, safety, factuality, RAG, and more.
Create custom evals
Define your own eval rules and output types.
Eval groups
Bundle multiple evals and run them together.
Future AGI models
Pick the right evaluation model for your task.
Evaluate CI/CD pipeline
Run evals automatically on every pull request.