
What it is

Annotations are human labels applied to AI outputs — traces, spans, sessions, dataset rows, prototype runs, and simulation executions. They capture subjective judgments (sentiment, quality, helpfulness) and factual assessments (correctness, safety, relevance) that automated evals alone cannot provide. Human-in-the-loop (HITL) feedback is essential for GenAI systems because:
  • Quality control — Catch hallucinations, off-topic responses, and policy violations before they reach users.
  • Feedback loops — Route human judgments back into prompt tuning, guardrail configuration, and model selection.
  • Fine-tuning data — Build high-quality labeled datasets from production traffic to improve your models.
  • Safety and compliance — Document human review for regulated or high-stakes use cases.

Architecture

Annotations are built on three primitives:
  • Labels — Reusable annotation templates (categorical, numeric, text, star rating, thumbs up/down) shared across your organization.
  • Queues — Managed annotation campaigns that assign items to annotators, track progress, and enforce review workflows.
  • Scores — The unified data record created each time an annotator (or automation) applies a label to a source.
Labels define what you measure. Queues organize how the work gets done. Scores store every individual annotation.
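The relationship between the three primitives can be sketched as plain data types. This is an illustrative Python model only, not the platform's actual schema; every class and field name here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Label:
    """Reusable annotation template shared across the organization."""
    name: str
    kind: str          # "categorical", "numeric", "text", "star_rating", or "thumbs"
    options: list[str] = field(default_factory=list)  # choices for categorical labels

@dataclass
class Queue:
    """Managed campaign: which labels to apply, and who applies them."""
    name: str
    labels: list[Label]
    annotators: list[str]

@dataclass
class Score:
    """One annotation: an annotator applying a label value to a source."""
    label: Label
    source_type: str   # e.g. "trace" or "dataset_row"
    source_id: str
    annotator: str
    value: object      # category, number, text, star count, or thumb

# A label defines what is measured; a score records one application of it.
sentiment = Label(name="Sentiment", kind="categorical",
                  options=["Positive", "Negative", "Neutral"])
score = Score(label=sentiment, source_type="trace", source_id="tr_1",
              annotator="alice", value="Positive")
```

Note how the label is defined once and referenced by every score, which is what makes the templates reusable across queues.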

Supported source types

Annotations can target any of the following entities:
  • trace — An LLM trace from Observe
  • observation_span — A specific span within a trace
  • trace_session — A conversation session (a group of traces)
  • dataset_row — A row in a dataset
  • call_execution — A simulation call execution
  • prototype_run — A prototype/experiment run
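Code that branches on the target entity can capture these identifiers in an enum. A minimal sketch: the string values come from the table above, but the enum class itself is an assumption, not part of the SDK.

```python
from enum import Enum

class SourceType(str, Enum):
    """The entities an annotation can target."""
    TRACE = "trace"
    OBSERVATION_SPAN = "observation_span"
    TRACE_SESSION = "trace_session"
    DATASET_ROW = "dataset_row"
    CALL_EXECUTION = "call_execution"
    PROTOTYPE_RUN = "prototype_run"
```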

How it works

The typical annotation workflow follows three steps:
  1. Define labels — Create the annotation templates your team will use (e.g. a “Sentiment” categorical label or a “Quality” star rating).
  2. Set up a queue — Build an annotation campaign by choosing labels, adding annotators, and configuring assignment rules.
  3. Annotate and review — Add items (traces, dataset rows, etc.) to the queue. Annotators score each item. Reviewers optionally approve results.
Annotations can also be created inline — directly from any trace, session, or dataset view — without a queue, for ad-hoc feedback.
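The three steps can be mimicked end to end with an in-memory sketch. None of this is the real SDK; it only shows how labels, queue items, and scores relate as the workflow runs, with the optional reviewer pass at the end.

```python
def run_annotation_workflow():
    # 1. Define labels: a "Sentiment" categorical template.
    sentiment = {"name": "Sentiment", "kind": "categorical",
                 "options": ["Positive", "Negative", "Neutral"]}

    # 2. Set up a queue: chosen labels, annotators, and pending items.
    queue = {"labels": [sentiment], "annotators": ["alice"],
             "items": [{"source_type": "trace", "source_id": "tr_1",
                        "scores": [], "status": "pending"}]}

    # 3. Annotate: each annotator scores each item in the queue.
    for item in queue["items"]:
        for annotator in queue["annotators"]:
            item["scores"].append({"label": "Sentiment",
                                   "annotator": annotator,
                                   "value": "Positive"})
        item["status"] = "completed"

    # Reviewers optionally approve completed items before finalizing.
    for item in queue["items"]:
        if item["status"] == "completed":
            item["status"] = "approved"
    return queue

queue = run_annotation_workflow()
```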

Key capabilities

  • 5 label types — Categorical, numeric, free-text, star rating, and thumbs up/down to cover any feedback need.
  • Managed queues — Round-robin, load-balanced, or manual assignment strategies with reservation timeouts.
  • Inline annotations — Annotate directly from trace detail, session grid, or dataset views without opening a queue.
  • Multi-annotator support — Require between 1 and 10 annotators per item to measure inter-annotator agreement.
  • Review workflows — Route completed items through a reviewer before finalizing.
  • Export to dataset — Turn annotated data into training or eval datasets.
  • Python and JS SDKs — Create labels, manage queues, and submit scores programmatically.
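Of the assignment strategies listed above, round-robin is the simplest to picture. A hypothetical sketch follows; the real strategies also handle load balancing and reservation timeouts, which are omitted here, and the function name is an assumption.

```python
from itertools import cycle

def assign_round_robin(items, annotators, per_item=1):
    """Cycle through the annotator pool, giving each item `per_item`
    annotators (models the 1-10 annotators-per-item option)."""
    pool = cycle(annotators)
    return {item: [next(pool) for _ in range(per_item)] for item in items}

assignments = assign_round_robin(["tr_1", "tr_2", "tr_3"], ["alice", "bob"])
# tr_1 -> ["alice"], tr_2 -> ["bob"], tr_3 -> ["alice"]
```

Raising `per_item` to 2 or more is what enables inter-annotator agreement: the same item receives independent scores from several people.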

Common use cases

  • Sentiment classification (Categorical) — Positive / Negative / Neutral
  • Factual accuracy (Thumbs up/down) — Correct vs. hallucinated
  • Toxicity screening (Categorical) — Safe / Borderline / Toxic
  • Response relevance (Numeric, 1-10) — How relevant was the answer?
  • Grammar and style (Text) — Free-form correction notes
  • Prompt A vs. B comparison (Star rating) — Rate each variant 1-5 stars
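Each use case above pairs a label type with a particular value shape. A hedged sketch of the validation that pairing implies; the function name and type strings are assumptions, not the platform's API.

```python
def validate_score(kind, value, options=None):
    """Return True if `value` fits the given label type (illustrative only)."""
    if kind == "categorical":    # e.g. Positive / Negative / Neutral
        return value in (options or [])
    if kind == "thumbs":         # thumbs up/down
        return isinstance(value, bool)
    if kind == "numeric":        # e.g. relevance on a 1-10 scale
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "star_rating":    # 1-5 stars
        return (isinstance(value, int) and not isinstance(value, bool)
                and 1 <= value <= 5)
    if kind == "text":           # free-form correction notes
        return isinstance(value, str)
    return False
```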

Get started

Quickstart

Create a label, set up a queue, and annotate your first item in 5 minutes.

Annotation Labels

Understand the five label types and when to use each one.

Queues & Workflow

Learn how queues organize work with assignment strategies and review workflows.

Scores

Dive into the unified Score model that powers all annotation data.