Eval Results

What eval results contain, how to read them, and how results are stored and aggregated across runs.

About

Every evaluation run produces a result for each row or call that was scored. A result tells you whether the response passed the criteria, how it scored, and why the judge made that decision. Results are stored alongside your data so you can review them, compare across runs, and track quality over time.


What a result contains

Each individual result has three parts:

  • Output: the result value: 1.0 (pass), 0.0 (fail), a score between 0 and 100, or a category label, depending on the template’s output type
  • Reason: a plain-language explanation from the judge describing why it assigned that result
  • Eval ID: a unique identifier for the eval run, used to retrieve async results

The reason field is especially useful for diagnosing failures. Instead of reviewing each response manually, you can read the reason to understand exactly what caused a pass or fail judgment.
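As a sketch of this triage workflow, the snippet below models a result with the three fields described above (the `EvalResult` class and sample data are illustrative, not the SDK's actual types) and pulls out the judge's reasons for every failing row:

```python
from dataclasses import dataclass

# Illustrative stand-in for a single eval result; the field names mirror
# the Output / Reason / Eval ID fields described above.
@dataclass
class EvalResult:
    output: float  # 1.0 pass / 0.0 fail for a pass/fail template
    reason: str    # judge's plain-language explanation
    eval_id: str   # identifier for the eval run

results = [
    EvalResult(1.0, "Response contains no PII.", "run-1"),
    EvalResult(0.0, "Response leaks an email address.", "run-1"),
]

# Triage failures by reading the judge's reason rather than each response
failures = [r.reason for r in results if r.output == 0.0]
print(failures)
```

Scanning `failures` tells you what went wrong across a run without opening any individual response.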


Output types

  • Pass/Fail: 1.0 for pass, 0.0 for fail. Use for binary checks such as toxicity, PII, and format validation.
  • Score (percentage): a number between 0 and 100. Use for graded quality such as groundedness, relevance, and completeness.
  • Deterministic choices: a category label from a predefined set. Use for classification such as tone, language, and intent.

The output type is defined by the eval template. When you create a custom template, you choose which of these output types it uses.
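A small, hypothetical helper illustrates how downstream code might interpret each of the three output types (the type names and formatting here are assumptions for the example, not SDK constants):

```python
# Illustrative interpretation of each template output type.
def summarize(output, output_type):
    if output_type == "pass_fail":
        # Binary check: 1.0 means pass, 0.0 means fail
        return "pass" if output == 1.0 else "fail"
    if output_type == "score":
        # Graded quality: a number between 0 and 100
        return f"{output:.1f}/100"
    if output_type == "choice":
        # Deterministic choice: a label from a predefined set
        return f"category: {output}"
    raise ValueError(f"unknown output type: {output_type}")

print(summarize(1.0, "pass_fail"))    # pass
print(summarize(87.5, "score"))       # 87.5/100
print(summarize("formal", "choice"))  # category: formal
```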


Where results are stored

In a dataset: Results appear as new columns, one per eval. Each row shows the result value and reason for that row. You can add multiple evals to the same dataset and see all results side by side.

Via SDK: Results are returned directly from evaluator.evaluate(). Access them via result.eval_results[0].output and result.eval_results[0].reason.
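The synchronous access pattern looks roughly like the sketch below. A tiny stub stands in for the real client so the shape is runnable here; the `Evaluator` constructor and the parameter names are assumptions, while the `eval_results[0].output` / `.reason` access path matches the description above.

```python
from types import SimpleNamespace

# Minimal stand-in for the SDK client so the access pattern is runnable;
# the real evaluate() call returns a result with the same shape.
class Evaluator:
    def evaluate(self, eval_templates, inputs):  # parameter names are illustrative
        return SimpleNamespace(eval_results=[
            SimpleNamespace(output=1.0, reason="No toxic language found.")
        ])

evaluator = Evaluator()
result = evaluator.evaluate(eval_templates="toxicity", inputs={"input": "Hello!"})

# Read the first eval's value and the judge's explanation
print(result.eval_results[0].output)  # 1.0
print(result.eval_results[0].reason)
```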

Async runs: For long-running or large-batch runs, the SDK returns an eval_id immediately. Use evaluator.get_eval_result(eval_id) to retrieve results when the run completes.
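A polling loop for async runs might look like the following sketch. The stub client again stands in for the SDK, and the "returns `None` while still running" convention is an assumption for illustration; only `eval_id` and `get_eval_result(eval_id)` come from the description above.

```python
import time
from types import SimpleNamespace

# Stand-in client: async runs return an eval_id immediately, and
# get_eval_result() fetches results once the run completes.
class Evaluator:
    def __init__(self):
        self._done_at = time.monotonic() + 0.1  # simulate a short-running eval
    def evaluate(self, **kwargs):
        return SimpleNamespace(eval_id="eval-123")
    def get_eval_result(self, eval_id):
        if time.monotonic() < self._done_at:
            return None  # still running (illustrative convention)
        return SimpleNamespace(eval_results=[
            SimpleNamespace(output=1.0, reason="ok")
        ])

evaluator = Evaluator()
eval_id = evaluator.evaluate().eval_id

# Poll until the run completes
result = None
while result is None:
    result = evaluator.get_eval_result(eval_id)
    time.sleep(0.05)
print(result.eval_results[0].output)
```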


Aggregates and KPIs

When you run evals on a dataset, Future AGI aggregates results across all rows:

  • Pass rate: percentage of rows that passed, for pass/fail templates
  • Average score: mean score across all rows, for percentage templates
  • Distribution: breakdown of results across categories, for deterministic templates
  • Trend data: how results change across runs over time

These aggregates appear in the evaluation summary view and are tracked per eval template per dataset run, giving you a versioned history of quality changes.
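The first three aggregates are straightforward to reproduce from per-row results; the sketch below shows the arithmetic with made-up sample data, one computation per template type:

```python
from collections import Counter
from statistics import mean

# Sample per-row results for each template type (illustrative data)
pass_fail = [1.0, 1.0, 0.0, 1.0]          # pass/fail template
scores = [82.0, 91.0, 67.0]               # percentage template
categories = ["formal", "casual", "formal"]  # deterministic template

pass_rate = 100 * sum(pass_fail) / len(pass_fail)  # % of rows that passed
avg_score = mean(scores)                            # mean score across rows
distribution = Counter(categories)                  # count per category

print(f"pass rate: {pass_rate:.0f}%")     # pass rate: 75%
print(f"average score: {avg_score:.1f}")  # average score: 80.0
print(dict(distribution))                 # {'formal': 2, 'casual': 1}
```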

