Eval Task Aggregations
Aggregate eval-task results as per-eval rollups, per-span pivots, or both.
https://api.futureagi.com/tracer/eval-task/get_usage/
Authentication
Query parameters
eval_task_id: The eval task whose runs should be aggregated.
eval_aggregation: When true, the response includes the eval_aggregation object: one rollup per CustomEvalConfig that ran in the task, keyed by eval name. Defaults to false. At least one of eval_aggregation or span_aggregation must be true.
span_aggregation: When true, the response includes the span_aggregation object: one entry per span the task evaluated, keyed by span_id, with the raw value of every eval that touched it. Defaults to false. At least one of eval_aggregation or span_aggregation must be true.
start_date: Inclusive lower bound on the span’s created_at. Only eval runs whose linked span was created at or after this instant are aggregated. When omitted, no lower bound is applied.
end_date: Inclusive upper bound on the span’s created_at. Only eval runs whose linked span was created at or before this instant are aggregated. When omitted, no upper bound is applied.
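The parameter rules above can be sketched as a small URL builder. This is only an illustration: build_usage_url is a hypothetical helper, and the serialization choices (lowercase "true" for booleans, ISO 8601 strings for the date bounds) are assumptions that this page does not confirm. The HTTP method and authentication headers are not shown here, so the sketch stops at constructing the URL.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.futureagi.com/tracer/eval-task/get_usage/"


def build_usage_url(
    eval_task_id: str,
    eval_aggregation: bool = False,
    span_aggregation: bool = False,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> str:
    """Build the request URL (hypothetical helper, not part of any SDK)."""
    if not eval_task_id:
        raise ValueError("eval_task_id is required")
    # The endpoint requires at least one aggregation flag to be true.
    if not (eval_aggregation or span_aggregation):
        raise ValueError("set eval_aggregation and/or span_aggregation to True")

    params = {"eval_task_id": eval_task_id}
    # Assumption: booleans are sent as lowercase "true"; unset flags are
    # omitted so the documented default (false) applies.
    if eval_aggregation:
        params["eval_aggregation"] = "true"
    if span_aggregation:
        params["span_aggregation"] = "true"
    # Assumption: the date bounds are accepted as ISO 8601 strings.
    if start_date is not None:
        params["start_date"] = start_date.isoformat()
    if end_date is not None:
        params["end_date"] = end_date.isoformat()
    return f"{BASE_URL}?{urlencode(params)}"


url = build_usage_url(
    "task-123",  # placeholder ID
    eval_aggregation=True,
    start_date=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Checking the "at least one flag" rule on the client side avoids a round trip for a request the server would reject anyway.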
Response
200 OK
eval_aggregation: Per-eval rollup. Present only when eval_aggregation=true. Keys are CustomEvalConfig names; values are one rollup object per eval.
output_type: percentage, pass_fail, or deterministic. Drives the shape of aggregated_score.
aggregated_score: The eval-level rollup. Shape depends on output_type:
• percentage — number (4-dp average across non-error runs, e.g. 0.7421).
• pass_fail — number (pass rate as 0–100 with 2 dp, e.g. 87.5).
• deterministic — object mapping each observed choice to its occurrence percentage 0–100 with 2 dp, e.g. {"positive": 62.5, "neutral": 25.0}. Only choices that actually appeared in the data are included.
aggregated_score is null when no aggregatable rows exist (every run errored, or there were no runs).
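Because aggregated_score changes shape with output_type, client code usually needs to branch on it. The following is a minimal sketch of one way to do that; describe_rollup and the eval names in the sample data are hypothetical. The sample object stands in for the eval_aggregation portion of a response, and where that object sits inside the full response envelope is not specified here.

```python
from typing import Any, Dict


def describe_rollup(rollup: Dict[str, Any]) -> str:
    """Render one per-eval rollup as a short human-readable string."""
    output_type = rollup["output_type"]
    score = rollup["aggregated_score"]
    # null (None) means there were no aggregatable rows for this eval.
    if score is None:
        return "no data"
    if output_type == "percentage":
        # Number: 4-dp average across non-error runs.
        return f"average {score:.4f}"
    if output_type == "pass_fail":
        # Number: pass rate on a 0-100 scale with 2 dp.
        return f"{score:.2f}% passed"
    if output_type == "deterministic":
        # Object: observed choice -> occurrence percentage (0-100).
        ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
        return ", ".join(f"{choice}: {pct:.2f}%" for choice, pct in ranked)
    raise ValueError(f"unknown output_type: {output_type!r}")


# Sample data only; the eval names are made up for illustration.
eval_aggregation = {
    "toxicity": {"output_type": "percentage", "aggregated_score": 0.7421},
    "is_grounded": {"output_type": "pass_fail", "aggregated_score": 87.5},
    "sentiment": {
        "output_type": "deterministic",
        "aggregated_score": {"positive": 62.5, "neutral": 25.0},
    },
    "empty_eval": {"output_type": "pass_fail", "aggregated_score": None},
}
summary = {name: describe_rollup(r) for name, r in eval_aggregation.items()}
```

Checking for null before branching on output_type keeps the three shape-specific branches free of None handling.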
span_aggregation: Per-span pivot. Present only when span_aggregation=true. Outer keys are span_id values (one per span the task evaluated); inner keys are eval names; inner values are one entry per eval that touched the span.
output_type: percentage, pass_fail, or deterministic. Drives the shape of value.
value: The raw per-row eval result, with no averaging. Shape depends on output_type:
• percentage — number (e.g. 0.82).
• pass_fail — boolean.
• deterministic — array of choice strings (e.g. ["positive"]).
When the same (span, eval) pair has multiple runs (re-runs), the latest by created_at wins.
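The pivot and its latest-wins rule can be illustrated with a short sketch. This is not the server's implementation: pivot_latest and the flat row layout (span_id, eval_name, output_type, value, created_at) are assumptions made purely to show how re-runs collapse into a single entry per (span, eval) pair.

```python
from datetime import datetime
from typing import Any, Dict, List


def pivot_latest(runs: List[Dict[str, Any]]) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Pivot flat eval-run rows into {span_id: {eval_name: entry}}."""
    pivot: Dict[str, Dict[str, Dict[str, Any]]] = {}
    # Iterating in ascending created_at order lets each later run
    # overwrite any earlier run for the same (span, eval) pair.
    for run in sorted(runs, key=lambda r: r["created_at"]):
        pivot.setdefault(run["span_id"], {})[run["eval_name"]] = {
            "output_type": run["output_type"],
            "value": run["value"],
        }
    return pivot


# Hypothetical rows: "span-a" was evaluated twice by the same eval.
runs = [
    {"span_id": "span-a", "eval_name": "is_grounded", "output_type": "pass_fail",
     "value": True, "created_at": datetime(2024, 1, 2)},
    {"span_id": "span-a", "eval_name": "is_grounded", "output_type": "pass_fail",
     "value": False, "created_at": datetime(2024, 1, 1)},
    {"span_id": "span-a", "eval_name": "sentiment", "output_type": "deterministic",
     "value": ["positive"], "created_at": datetime(2024, 1, 1)},
    {"span_id": "span-b", "eval_name": "toxicity", "output_type": "percentage",
     "value": 0.82, "created_at": datetime(2024, 1, 1)},
]
span_aggregation = pivot_latest(runs)
```

In this sample, the January 2 run for (span-a, is_grounded) replaces the January 1 run, so the pivot reports True for that pair.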
Note
Soft-deleted eval runs are skipped in both aggregations so the rollups reflect the user’s current view of the data.
Both eval_aggregation and span_aggregation include only span-linked eval runs. Session-target eval runs, which have no underlying span, are excluded whether or not a date range is supplied.
start_date and end_date filter on the span’s creation time (observation_span.created_at), not on when the eval ran. The aggregation results therefore reflect only those spans that were created in the supplied window — eval runs against spans created outside the window are dropped from both rollups. When neither parameter is supplied, every span linked to the eval task is included.
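The window semantics described above amount to an inclusive range check on the span's creation time. The sketch below shows that check under stated assumptions; span_in_window is a hypothetical name, and the actual filtering happens server-side.

```python
from datetime import datetime, timezone
from typing import Optional


def span_in_window(
    span_created_at: datetime,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> bool:
    """Return True if the span's creation time falls inside the window."""
    # Both bounds are inclusive; a missing bound places no limit on that side.
    if start_date is not None and span_created_at < start_date:
        return False
    if end_date is not None and span_created_at > end_date:
        return False
    return True


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 31, tzinfo=timezone.utc)
on_boundary = span_in_window(start, start, end)
too_late = span_in_window(datetime(2024, 2, 1, tzinfo=timezone.utc), start, end)
unbounded = span_in_window(datetime(2020, 1, 1, tzinfo=timezone.utc))
```

Note that the timestamp being tested is the span's created_at, not the time the eval ran, so an eval executed inside the window against an older span would still be dropped.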
Errors
eval_task_id is missing, or no eval task with that ID exists in the caller’s organization.
Invalid or missing API credentials.
Unexpected server error.