Annotations

Add human feedback to your AI outputs with annotation labels, queues, and scores across traces, datasets, prototypes, and simulations.

What it is

Annotations are human labels applied to AI outputs — traces, spans, sessions, dataset rows, prototype runs, and simulation executions. They capture subjective judgments (sentiment, quality, helpfulness) and factual assessments (correctness, safety, relevance) that automated evals alone cannot provide.

Human-in-the-loop (HITL) feedback is essential for GenAI systems because:

  • Quality control — Catch hallucinations, off-topic responses, and policy violations before they reach users.
  • Feedback loops — Route human judgments back into prompt tuning, guardrail configuration, and model selection.
  • Fine-tuning data — Build high-quality labeled datasets from production traffic to improve your models.
  • Safety and compliance — Document human review for regulated or high-stakes use cases.

Architecture

Annotations are built on three primitives:

  • Labels: Reusable annotation templates (categorical, numeric, text, star rating, thumbs up/down) shared across your organization.
  • Queues: Managed annotation campaigns that assign items to annotators, track progress, and enforce review workflows.
  • Scores: The unified data record created each time an annotator (or an automation) applies a label to a source.

Labels define what you measure. Queues organize how the work gets done. Scores store every individual annotation.
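
To make the relationships concrete, here is a minimal Python sketch of the three primitives as dataclasses. The field names and shapes are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal, Optional, Union

# Illustrative shapes only -- these field names are assumptions, not the real schema.

LabelType = Literal["categorical", "numeric", "text", "star_rating", "thumbs"]

@dataclass
class Label:
    """Reusable annotation template shared across the organization."""
    name: str                                         # e.g. "Sentiment"
    type: LabelType                                   # one of the five label types
    options: list[str] = field(default_factory=list)  # categories, if categorical

@dataclass
class Queue:
    """Managed campaign: which labels, who applies them, how work is assigned."""
    name: str
    label_names: list[str]          # which Labels this queue applies
    annotators: list[str]           # annotator ids or emails
    annotators_per_item: int = 1    # 1-10, for inter-annotator agreement

@dataclass
class Score:
    """One annotation: a label value applied by an annotator to a source."""
    label_name: str
    source_type: str                    # "trace", "dataset_row", ...
    source_id: str
    value: Union[str, float, bool]      # shape depends on the label type
    annotator: str
    queue_name: Optional[str] = None    # None for inline (ad-hoc) scores
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Note that the queue reference on a score is optional: that is what separates managed campaign work from the inline, ad-hoc feedback described below.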

Supported source types

Annotations can target any of the following entities:

  • trace: An LLM trace from Observe
  • observation_span: A specific span within a trace
  • trace_session: A conversation session (a group of traces)
  • dataset_row: A row in a dataset
  • call_execution: A simulation call execution
  • prototype_run: A prototype or experiment run
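
If you reference these identifiers in code, pinning them down as an enum keeps exports and API calls consistent. A small sketch whose string values simply mirror the list above:

```python
from enum import Enum

class SourceType(str, Enum):
    """Entities an annotation can target (mirrors the list above)."""
    TRACE = "trace"
    OBSERVATION_SPAN = "observation_span"
    TRACE_SESSION = "trace_session"
    DATASET_ROW = "dataset_row"
    CALL_EXECUTION = "call_execution"
    PROTOTYPE_RUN = "prototype_run"

# Example: reject an unknown source type early instead of at submission time.
print(SourceType("dataset_row"))   # SourceType.DATASET_ROW
```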

How it works

The typical annotation workflow follows three steps:

  1. Define labels — Create the annotation templates your team will use (e.g. a “Sentiment” categorical label or a “Quality” star rating).
  2. Set up a queue — Build an annotation campaign by choosing labels, adding annotators, and configuring assignment rules.
  3. Annotate and review — Add items (traces, dataset rows, etc.) to the queue. Annotators score each item. Reviewers optionally approve results.
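
Sketched against a hypothetical Python SDK client, the three steps might look like the following. The `AnnotationClient` class, method names, and parameters here are assumptions for illustration; check the SDK reference for the actual API:

```python
# Hypothetical SDK usage -- the client, methods, and parameters are assumptions.
client = AnnotationClient(api_key="...")

# 1. Define labels: the templates your team will apply.
sentiment = client.labels.create(
    name="Sentiment",
    type="categorical",
    options=["Positive", "Negative", "Neutral"],
)
quality = client.labels.create(name="Quality", type="star_rating")

# 2. Set up a queue: choose labels, annotators, and assignment rules.
queue = client.queues.create(
    name="Support bot weekly review",
    label_ids=[sentiment.id, quality.id],
    annotators=["ana@example.com", "ben@example.com"],
    assignment="round_robin",      # or "load_balanced" / "manual"
    annotators_per_item=2,
)

# 3. Annotate and review: add items, then annotators score them.
client.queues.add_items(queue.id, source_type="trace", source_ids=["tr_123", "tr_456"])
client.scores.create(
    queue_id=queue.id,
    source_type="trace",
    source_id="tr_123",
    label_id=sentiment.id,
    value="Positive",
)
```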

Annotations can also be created inline — directly from any trace, session, or dataset view — without a queue, for ad-hoc feedback.
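
An inline annotation is simply a score with no queue attached; with the same hypothetical client, it would be a single call:

```python
# Ad-hoc feedback straight from a trace view -- no queue involved.
# (Same hypothetical client and method names as above.)
client.scores.create(
    source_type="trace",
    source_id="tr_789",
    label_id=sentiment.id,
    value="Negative",
    comment="Answer contradicted the retrieved document.",
)
```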

Key capabilities

  • 5 label types — Categorical, numeric, free-text, star rating, and thumbs up/down to cover any feedback need.
  • Managed queues — Round-robin, load-balanced, or manual assignment strategies with reservation timeouts.
  • Inline annotations — Annotate directly from trace detail, session grid, or dataset views without opening a queue.
  • Multi-annotator support — Require 1-10 annotators per item to measure inter-annotator agreement (see the agreement sketch after this list).
  • Review workflows — Route completed items through a reviewer before finalizing.
  • Export to dataset — Turn annotated data into training or eval datasets.
  • Python and JS SDKs — Create labels, manage queues, and submit scores programmatically.
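
When several annotators score the same items, it is worth checking how well they agree before trusting the labels. This is the agreement sketch referenced in the list above: it computes Cohen's kappa with scikit-learn over two annotators' categorical scores (the data here is made up; in practice you would read it from a score export):

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' "Sentiment" scores for the same five traces (made-up data).
ana = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
ben = ["Positive", "Negative", "Positive", "Positive", "Negative"]

kappa = cohen_kappa_score(ana, ben)
print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, ~0 = chance level
```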

Common use cases

Each entry below pairs a use case with a suggested label type and an example:

  • Sentiment classification (Categorical): Positive / Negative / Neutral
  • Factual accuracy (Thumbs up/down): Correct vs. hallucinated
  • Toxicity screening (Categorical): Safe / Borderline / Toxic
  • Response relevance (Numeric, 1-10): How relevant was the answer?
  • Grammar and style (Text): Free-form correction notes
  • Prompt A vs. B comparison (Star rating): Rate each variant 1-5 stars
