Evaluations

Create and run eval tasks on Observe project data: filter spans, choose historic or continuous runs, set sampling and limits, and attach preset or custom evaluations.

What it is

Evaluations in Observe are automated quality checks run on your traced spans—e.g. hallucination, bias, context adherence, toxicity. The feature scores LLM (or other) outputs so you can see pass/fail and numeric results per span in the dashboard, track quality over time, and trigger alerts when scores cross a threshold. Evals can run on existing data (historic) or on new spans as they arrive (continuous), and results are stored on the span and available for filtering, export, and monitors.

Use cases

  • Historic batch — Run evals on a time range of existing spans to score quality (e.g. hallucination, bias, context adherence) after the fact.
  • Continuous monitoring — Run evals automatically on new spans as they arrive so you catch regressions in production.
  • Cost and volume control — Use sampling rate (e.g. 10%) and max spans per run so you don’t evaluate every span and can control cost.
  • Targeted evaluation — Filter by observation type (e.g. LLM only), session, or span attributes so only relevant spans are evaluated.
  • Multiple evals per task — Attach several eval configs to one task so each span gets multiple scores in a single run.

How to

Set filters

Define filters so the task runs only on the spans you care about. Supported filter keys:

  • observation_type — Node/span type (e.g. llm, chain, agent). Pass a string or list of types.
  • date_range — Time range: a two-element list [start_date, end_date] (applied to created_at).
  • created_at — Minimum creation time (spans created at or after this value).
  • project_id — Restrict to a specific Observe project.
  • session_id — Restrict to traces in a given session.
  • span_attributes_filters — List of span-attribute conditions (same structure as in the Observe UI filters).

Filters are stored in the task’s filters JSON and applied when the task runs.
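As a sketch, a filters JSON combining several of the keys above might look like the following. The values and the exact shape of `span_attributes_filters` entries are illustrative assumptions, not a verbatim API schema:

```python
# Illustrative filters JSON for an eval task (values are placeholders).
filters = {
    "observation_type": ["llm"],                 # only LLM spans
    "date_range": ["2024-06-01", "2024-06-30"],  # applied to created_at
    "session_id": "session-123",                 # hypothetical session ID
    "span_attributes_filters": [
        # same structure as the Observe UI filters; this shape is assumed
        {"key": "llm.request.model", "operator": "eq", "value": "gpt-4o"},
    ],
}
```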

Choose run type

Set run_type:

  • Historical — Run on existing spans that match the filters (optionally within a time range). The task processes spans up to the sampling cap and span limit, then completes.
  • Continuous — Run on new spans as they arrive. Each run only considers spans created after the last run; the task stays active for ongoing evaluation.
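The difference between the two run types can be sketched with a toy span selector. The data model and field names here are illustrative, not the real API:

```python
from datetime import datetime

def select_spans(spans, run_type, date_range=None, last_run_at=None):
    """Toy illustration of which spans each run type considers."""
    if run_type == "historical":
        # existing spans, optionally within a time range
        start, end = date_range
        return [s for s in spans if start <= s["created_at"] <= end]
    if run_type == "continuous":
        # each run only considers spans created after the last run
        return [s for s in spans
                if last_run_at is None or s["created_at"] > last_run_at]
    raise ValueError(f"unknown run_type: {run_type}")

spans = [
    {"id": 1, "created_at": datetime(2024, 6, 1)},
    {"id": 2, "created_at": datetime(2024, 6, 10)},
    {"id": 3, "created_at": datetime(2024, 6, 20)},
]
historical = select_spans(
    spans, "historical",
    date_range=(datetime(2024, 6, 1), datetime(2024, 6, 15)))
new_only = select_spans(spans, "continuous",
                        last_run_at=datetime(2024, 6, 15))
```

Here `historical` picks up the two spans inside the range, while `new_only` contains only the span created after the last run.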

Set sampling rate and span limit

  • sampling_rate — Percentage of matching spans to evaluate (0–100). Example: 50 means 50% of filtered spans are sampled per run. Helps control cost and volume.
  • spans_limit — Maximum number of spans to process per run (default is 1000). For historical runs, the task stops when either the sampled count or this limit is reached.
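The interaction between the two settings can be sketched as follows. This is a minimal illustration of the "sample, then cap" logic described above, not the actual implementation:

```python
import random

def spans_to_process(matching_spans, sampling_rate, spans_limit=1000, seed=0):
    """Sample sampling_rate% of the matching spans, then cap at spans_limit."""
    rng = random.Random(seed)  # seeded here only to keep the sketch deterministic
    target = int(len(matching_spans) * sampling_rate / 100)
    sampled = rng.sample(matching_spans, target)
    return sampled[:spans_limit]

# 50% of 200 spans -> 100, under the default 1000 cap
small_batch = spans_to_process(list(range(200)), sampling_rate=50)

# 50% of 4000 spans -> 2000 sampled, capped at the 1000-span limit
capped_batch = spans_to_process(list(range(4000)), sampling_rate=50, spans_limit=1000)
```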

Select evals to run

Attach one or more CustomEvalConfig IDs to the task (the evals you’ve already created for the project). The task runs each selected eval on every span it processes. For evals that need an input (e.g. Bias Detection), configure the input key to a span attribute path (e.g. llm.output_messages.0.message.content) so the eval reads the right field from each span. See built-in evals for supported evaluations and their inputs.
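To make the input-key idea concrete, here is a small sketch of how a dotted attribute path such as llm.output_messages.0.message.content could resolve against a span's attributes. The resolution logic is an assumption for illustration; only the path format comes from the docs above:

```python
def get_span_attribute(attributes, path):
    """Resolve a dotted attribute path like 'llm.output_messages.0.message.content'.

    Numeric segments index into lists; other segments look up dict keys.
    """
    value = attributes
    for part in path.split("."):
        if isinstance(value, list):
            value = value[int(part)]
        else:
            value = value[part]
    return value

# Hypothetical span attributes holding an LLM response
span_attrs = {
    "llm": {"output_messages": [{"message": {"content": "Hello!"}}]}
}
text = get_span_attribute(span_attrs, "llm.output_messages.0.message.content")
# text == "Hello!" — this is what an eval like Bias Detection would receive
```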

Run the task

Create or update the eval task via the API or UI, then run it. You can test the configuration (filters and evals) before saving. Task status values: pending, running, completed, failed, paused, deleted. Results appear on the spans in the Observe dashboard and can be used for alerts.
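Putting the pieces together, a task configuration might look like the sketch below. The field names follow the options described in this page, but the exact payload shape of the API is assumed, and all IDs are placeholders:

```python
# Hypothetical eval-task configuration; field names mirror the options above,
# but the real API payload may differ.
eval_task = {
    "project_id": "proj_abc",                # placeholder project ID
    "run_type": "continuous",                # or "historical"
    "sampling_rate": 10,                     # evaluate 10% of matching spans
    "spans_limit": 1000,                     # max spans per run (the default)
    "filters": {"observation_type": ["llm"]},
    "eval_config_ids": [                     # previously created CustomEvalConfig IDs
        "custom-eval-config-1",
        "custom-eval-config-2",
    ],
}
```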

Note

Eval tasks are processed asynchronously (e.g. by a cron). Status and results update as runs complete. For continuous tasks, new spans are picked up on subsequent runs.

Key concepts

  • Span attributes — Spans store data in key-value form (e.g. llm.output_messages.0.message.content). When an eval needs an input, you point it to one of these attribute paths. See spans and span attributes for the schema.
  • Bias Detection example — Set the eval’s input key to a span attribute that holds the text to check (e.g. llm.output_messages.0.message.content). The eval returns Passed (neutral) or Failed (bias detected).
