Eval Results
What eval results contain, how to read them, and how results are stored and aggregated across runs.
About
Every evaluation run produces a result for each row or call that was scored. A result tells you whether the response passed the criteria, how it scored, and why the judge made that decision. Results are stored alongside your data so you can review them, compare across runs, and track quality over time.
What a result contains
Each individual result has three parts:
| Field | Description |
|---|---|
| Output | The result value: 1.0 (pass), 0.0 (fail), a score between 0 and 100, or a category label depending on the template’s output type |
| Reason | A plain-language explanation from the judge describing why it assigned that result |
| Eval ID | A unique identifier for the eval run, used to retrieve async results |
The reason field is especially useful for diagnosing failures. Instead of reviewing each response manually, you can read the reason to understand exactly what caused a pass or fail judgment.
Output types
| Output type | What it looks like | When to use |
|---|---|---|
| Pass/Fail | 1.0 for pass, 0.0 for fail | Binary checks: toxicity, PII, format validation |
| Score (percentage) | A number between 0 and 100 | Graded quality: groundedness, relevance, completeness |
| Deterministic choices | A category label from a predefined set | Classification: tone, language, intent |
The output type is defined by the eval template. For custom templates, you choose the output type when you create the template.
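Because the meaning of `output` depends on the template's output type, downstream code usually branches on that type. The helper below is a hypothetical sketch (the function name, type labels, and threshold are illustrative, not SDK API):

```python
def passed(output, output_type, score_threshold=70):
    """Hypothetical helper: interpret a raw result per the template's output type."""
    if output_type == "pass_fail":
        return output == 1.0
    if output_type == "score":
        # Percentage templates: treat scores at or above the threshold as passing
        return output >= score_threshold
    if output_type == "choice":
        # Category labels have no inherent pass/fail meaning
        raise ValueError("Compare category labels directly instead.")
    raise ValueError(f"Unknown output type: {output_type}")

print(passed(1.0, "pass_fail"))  # binary check -> True
print(passed(82, "score"))       # graded quality, threshold 70 -> True
```

The threshold is a consumer-side choice: the judge returns the raw score, and you decide what counts as acceptable.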
Where results are stored
- In a dataset: Results appear as new columns, one per eval. Each row shows the result value and reason for that row. You can add multiple evals to the same dataset and see all results side by side.
- Via SDK: Results are returned directly from evaluator.evaluate(). Access them via result.eval_results[0].output and result.eval_results[0].reason.
- Async runs: For long-running or large-batch runs, the SDK returns an eval_id immediately. Use evaluator.get_eval_result(eval_id) to retrieve results when the run completes.
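The sync and async access patterns above can be sketched with a local stand-in. Everything here is a stub that only mirrors the call shapes named on this page (`evaluate()`, `eval_results[0].output` / `.reason`, `get_eval_result(eval_id)`); the stub's constructor, the `submit()` method name, and the hard-coded values are assumptions, not the real SDK.

```python
import uuid

class StubEvaluator:
    """Stand-in mirroring the access pattern described above; not the real SDK."""

    def __init__(self):
        self._pending = {}

    def evaluate(self, eval_templates, inputs):
        class Item:  # one scored result
            output, reason = 1.0, "Response is grounded in the provided context."
        class Result:
            eval_results = [Item()]
        return Result()

    def submit(self, eval_templates, inputs):
        # Async path: hand back an eval_id immediately; results arrive later.
        # ("submit" is a hypothetical name; the real method may differ.)
        eval_id = str(uuid.uuid4())
        self._pending[eval_id] = self.evaluate(eval_templates, inputs)
        return eval_id

    def get_eval_result(self, eval_id):
        return self._pending[eval_id]

evaluator = StubEvaluator()

# Sync: results come back directly
result = evaluator.evaluate(["groundedness"], {"input": "...", "output": "..."})
print(result.eval_results[0].output, result.eval_results[0].reason)

# Async: hold the eval_id, fetch results once the run completes
eval_id = evaluator.submit(["groundedness"], {"input": "...", "output": "..."})
later = evaluator.get_eval_result(eval_id)
print(later.eval_results[0].output)
```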
Aggregates and KPIs
When you run evals on a dataset, Future AGI aggregates results across all rows:
- Pass rate: percentage of rows that passed, for pass/fail templates
- Average score: mean score across all rows, for percentage templates
- Distribution: breakdown of results across categories, for deterministic templates
- Trend data: how results change across runs over time
These aggregates appear in the evaluation summary view and are tracked per eval template per dataset run, giving you a versioned history of quality changes.
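The aggregates above are straightforward to reproduce from raw per-row outputs. This is a local sketch of the arithmetic, assuming the same three output-type labels used earlier on this page; it is not how the platform computes them internally.

```python
from collections import Counter
from statistics import mean

def aggregate(outputs, output_type):
    """Sketch of the per-eval aggregates: pass rate, average score, distribution."""
    if output_type == "pass_fail":
        return {"pass_rate": 100 * sum(o == 1.0 for o in outputs) / len(outputs)}
    if output_type == "score":
        return {"average_score": mean(outputs)}
    if output_type == "choice":
        return {"distribution": dict(Counter(outputs))}
    raise ValueError(f"Unknown output type: {output_type}")

print(aggregate([1.0, 0.0, 1.0, 1.0], "pass_fail"))     # {'pass_rate': 75.0}
print(aggregate([80, 60, 100], "score"))                # {'average_score': 80}
print(aggregate(["formal", "casual", "formal"], "choice"))
```

Trend data is just these same numbers keyed by run, which is why tracking them per template per dataset run yields a versioned quality history.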
Next steps
- Evaluate via Platform & SDK: Run an eval and see results.
- Eval templates: How templates define what output type a result uses.
- Judge models: How the judge produces the result and reason.
- CI/CD pipeline: Track results by version across deploys.