Eval Results
What eval results contain, how to read them, and how results are stored and aggregated across runs.
About
Every evaluation run produces a result for each row or call that was scored. A result tells you whether the response passed the criteria, how it scored, and why the judge made that decision. Results are stored alongside your data so you can review them, compare across runs, and track quality over time.
What a result contains
Each individual result has three parts:
| Field | Description |
|---|---|
| Output | The result value: 1.0 (pass), 0.0 (fail), a score between 0 and 100, or a category label depending on the template’s output type |
| Reason | A plain-language explanation from the judge describing why it assigned that result |
| Eval ID | A unique identifier for the eval run, used to retrieve async results |
The reason field is especially useful for diagnosing failures. Instead of reviewing each response manually, you can read the reason to understand exactly what caused a pass or fail judgment.
Output types
| Output type | What it looks like | When to use |
|---|---|---|
| Pass/Fail | 1.0 for pass, 0.0 for fail | Binary checks: toxicity, PII, format validation |
| Score (percentage) | A number between 0 and 100 | Graded quality: groundedness, relevance, completeness |
| Deterministic choices | A category label from a predefined set | Classification: tone, language, intent |
The output type is defined by the eval template. For custom templates, you choose the output type when you create the template.
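Because the meaning of `output` depends on the template's output type, downstream code usually branches on that type. The helper below is a hypothetical sketch (the function name, type labels, and threshold are illustrative, not SDK API):

```python
def passed(output, output_type, score_threshold=70):
    """Hypothetical helper: interpret a raw result per the template's output type."""
    if output_type == "pass_fail":
        return output == 1.0
    if output_type == "score":
        # Percentage templates: treat scores at or above the threshold as passing
        return output >= score_threshold
    if output_type == "choice":
        # Category labels have no inherent pass/fail meaning
        raise ValueError("Compare category labels directly instead.")
    raise ValueError(f"Unknown output type: {output_type}")

print(passed(1.0, "pass_fail"))  # binary check -> True
print(passed(82, "score"))       # graded quality, threshold 70 -> True
```

The threshold is a consumer-side choice: the judge returns the raw score, and you decide what counts as acceptable.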
Where results are stored
- In a dataset: Results appear as new columns, one per eval. Each row shows the result value and reason for that row. You can add multiple evals to the same dataset and see all results side by side.
- Via SDK: Results are returned directly from evaluator.evaluate(). Access them via result.eval_results[0].output and result.eval_results[0].reason.
- Async runs: For long-running or large-batch runs, the SDK returns an eval_id immediately. Use evaluator.get_eval_result(eval_id) to retrieve results when the run completes.
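The sync and async access patterns above can be sketched with a local stand-in. Everything here is a stub that only mirrors the call shapes named on this page (`evaluate()`, `eval_results[0].output` / `.reason`, `get_eval_result(eval_id)`); the stub's constructor, the `submit()` method name, and the hard-coded values are assumptions, not the real SDK.

```python
import uuid

class StubEvaluator:
    """Stand-in mirroring the access pattern described above; not the real SDK."""

    def __init__(self):
        self._pending = {}

    def evaluate(self, eval_templates, inputs):
        class Item:  # one scored result
            output, reason = 1.0, "Response is grounded in the provided context."
        class Result:
            eval_results = [Item()]
        return Result()

    def submit(self, eval_templates, inputs):
        # Async path: hand back an eval_id immediately; results arrive later.
        # ("submit" is a hypothetical name; the real method may differ.)
        eval_id = str(uuid.uuid4())
        self._pending[eval_id] = self.evaluate(eval_templates, inputs)
        return eval_id

    def get_eval_result(self, eval_id):
        return self._pending[eval_id]

evaluator = StubEvaluator()

# Sync: results come back directly
result = evaluator.evaluate(["groundedness"], {"input": "...", "output": "..."})
print(result.eval_results[0].output, result.eval_results[0].reason)

# Async: hold the eval_id, fetch results once the run completes
eval_id = evaluator.submit(["groundedness"], {"input": "...", "output": "..."})
later = evaluator.get_eval_result(eval_id)
print(later.eval_results[0].output)
```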
Aggregates and KPIs
When you run evals on a dataset, Future AGI aggregates results across all rows:
- Pass rate: percentage of rows that passed, for pass/fail templates
- Average score: mean score across all rows, for percentage templates
- Distribution: breakdown of results across categories, for deterministic templates
- Trend data: how results change across runs over time
These aggregates appear in the evaluation summary view and are tracked per eval template per dataset run, giving you a versioned history of quality changes.
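The aggregates above are straightforward to reproduce from raw per-row outputs. This is a local sketch of the arithmetic, assuming the same three output-type labels used earlier on this page; it is not how the platform computes them internally.

```python
from collections import Counter
from statistics import mean

def aggregate(outputs, output_type):
    """Sketch of the per-eval aggregates: pass rate, average score, distribution."""
    if output_type == "pass_fail":
        return {"pass_rate": 100 * sum(o == 1.0 for o in outputs) / len(outputs)}
    if output_type == "score":
        return {"average_score": mean(outputs)}
    if output_type == "choice":
        return {"distribution": dict(Counter(outputs))}
    raise ValueError(f"Unknown output type: {output_type}")

print(aggregate([1.0, 0.0, 1.0, 1.0], "pass_fail"))     # {'pass_rate': 75.0}
print(aggregate([80, 60, 100], "score"))                # {'average_score': 80}
print(aggregate(["formal", "casual", "formal"], "choice"))
```

Trend data is just these same numbers keyed by run, which is why tracking them per template per dataset run yields a versioned quality history.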
Next steps
- Evaluate via Platform & SDK: Run an eval and see results.
- Eval templates: How templates define what output type a result uses.
- Judge models: How the judge produces the result and reason.
- CI/CD pipeline: Track results by version across deploys.