Annotations
Add human feedback to your AI outputs with annotation labels, queues, and scores across traces, datasets, prototypes, and simulations.
What it is
Annotations are human labels applied to AI outputs — traces, spans, sessions, dataset rows, prototype runs, and simulation executions. They capture subjective judgments (sentiment, quality, helpfulness) and factual assessments (correctness, safety, relevance) that automated evals alone cannot provide.
Human-in-the-loop (HITL) feedback is essential for GenAI systems because:
- Quality control — Catch hallucinations, off-topic responses, and policy violations before they reach users.
- Feedback loops — Route human judgments back into prompt tuning, guardrail configuration, and model selection.
- Fine-tuning data — Build high-quality labeled datasets from production traffic to improve your models.
- Safety and compliance — Document human review for regulated or high-stakes use cases.
Architecture
Annotations are built on three primitives:
| Primitive | Purpose |
|---|---|
| Labels | Reusable annotation templates (categorical, numeric, text, star rating, thumbs up/down) shared across your organization. |
| Queues | Managed annotation campaigns that assign items to annotators, track progress, and enforce review workflows. |
| Scores | The unified data record created each time an annotator (or automation) applies a label to a source. |
Labels define what you measure. Queues organize how the work gets done. Scores store every individual annotation.
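The relationship between the three primitives can be sketched with plain data classes. The field names below are illustrative assumptions, not the platform's actual schema; they only show how a Score ties a Label to a source.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Label:
    """A reusable annotation template (field names are illustrative)."""
    name: str
    label_type: str                       # "categorical" | "numeric" | "text" | "star" | "thumbs"
    options: Optional[list] = None        # choices, for categorical labels

@dataclass
class Queue:
    """A managed annotation campaign built from labels."""
    name: str
    labels: list
    annotators: list
    assignment: str = "round_robin"

@dataclass
class Score:
    """One annotator's application of a label to a source."""
    label: str
    source_type: str                      # e.g. "trace", "dataset_row"
    source_id: str
    annotator: str
    value: Union[str, float, bool]

sentiment = Label("Sentiment", "categorical", ["Positive", "Negative", "Neutral"])
score = Score("Sentiment", "trace", "tr_123", "alice@example.com", "Positive")
```

Note that a Score always records who annotated what, with which label, on which source: that single record shape is what lets one model serve queues and inline annotations alike.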
Supported source types
Annotations can target any of the following entities:
| Source Type | Description |
|---|---|
| `trace` | An LLM trace from Observe |
| `observation_span` | A specific span within a trace |
| `trace_session` | A conversation session (a group of traces) |
| `dataset_row` | A row in a dataset |
| `call_execution` | A simulation call execution |
| `prototype_run` | A prototype/experiment run |
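Because every score carries a source type, it is worth validating that field before submitting. A minimal sketch of that check, assuming the set of identifiers from the table above (the `validate_source_type` helper itself is hypothetical, not an SDK call):

```python
# Source types an annotation can target, per the table above.
SOURCE_TYPES = {
    "trace",
    "observation_span",
    "trace_session",
    "dataset_row",
    "call_execution",
    "prototype_run",
}

def validate_source_type(source_type: str) -> str:
    """Reject unsupported source types before a score is recorded."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unsupported source type: {source_type!r}")
    return source_type
```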
How it works
The typical annotation workflow follows three steps:
1. Define labels — Create the annotation templates your team will use (e.g. a “Sentiment” categorical label or a “Quality” star rating).
2. Set up a queue — Build an annotation campaign by choosing labels, adding annotators, and configuring assignment rules.
3. Annotate and review — Add items (traces, dataset rows, etc.) to the queue. Annotators score each item; reviewers can optionally approve the results.
Annotations can also be created inline — directly from any trace, session, or dataset view — without a queue, for ad-hoc feedback.
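The three steps above can be modeled end to end with a toy in-memory store. This is an illustrative sketch of the workflow's shape only; every method and parameter name here is an assumption, so consult the SDK reference for the real API surface.

```python
class InMemoryAnnotationStore:
    """Toy in-memory model of the label -> queue -> score workflow.

    Illustrative only: the real SDK's names and data shapes will differ.
    """

    def __init__(self):
        self.labels = {}
        self.queues = {}
        self.scores = []

    def create_label(self, name, label_type, options=None):
        # Step 1: define a reusable annotation template.
        self.labels[name] = {"type": label_type, "options": options}

    def create_queue(self, name, label_names, annotators):
        # Step 2: set up a campaign from labels and annotators.
        self.queues[name] = {"labels": label_names, "annotators": annotators, "items": []}

    def add_items(self, queue_name, items):
        self.queues[queue_name]["items"].extend(items)

    def submit_score(self, queue_name, item, annotator, label_name, value):
        # Step 3: annotate, rejecting values outside a categorical label's options.
        options = self.labels[label_name]["options"]
        if options is not None and value not in options:
            raise ValueError(f"{value!r} is not a valid option for {label_name}")
        self.scores.append(
            {"queue": queue_name, "item": item, "annotator": annotator,
             "label": label_name, "value": value}
        )

store = InMemoryAnnotationStore()
store.create_label("Sentiment", "categorical", ["Positive", "Negative", "Neutral"])
store.create_queue("support-review", ["Sentiment"], ["alice", "bob"])
store.add_items("support-review", ["trace:tr_123"])
store.submit_score("support-review", "trace:tr_123", "alice", "Sentiment", "Positive")
```

The same `submit_score` shape also covers the inline case: an ad-hoc annotation is simply a score recorded without a queue behind it.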
Key capabilities
- Five label types — Categorical, numeric, free-text, star rating, and thumbs up/down to cover most feedback needs.
- Managed queues — Round-robin, load-balanced, or manual assignment strategies with reservation timeouts.
- Inline annotations — Annotate directly from trace detail, session grid, or dataset views without opening a queue.
- Multi-annotator support — Require 1–10 annotators per item to measure inter-annotator agreement.
- Review workflows — Route completed items through a reviewer before finalizing.
- Export to dataset — Turn annotated data into training or eval datasets.
- Python and JS SDKs — Create labels, manage queues, and submit scores programmatically.
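A round-robin assignment strategy, as named above, simply cycles through the annotator pool so work is spread evenly. A minimal sketch, assuming a hypothetical `assign_round_robin` helper (the real queue engine also handles reservations and timeouts, which this omits):

```python
from itertools import cycle

def assign_round_robin(items, annotators, annotators_per_item=1):
    """Distribute items across annotators in rotation.

    Illustrative sketch of a round-robin strategy; supports assigning
    multiple annotators per item for inter-annotator agreement.
    """
    pool = cycle(annotators)
    return {
        item: [next(pool) for _ in range(annotators_per_item)]
        for item in items
    }

assignments = assign_round_robin(["t1", "t2", "t3"], ["alice", "bob"])
```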
Common use cases
| Use Case | Label Type | Example |
|---|---|---|
| Sentiment classification | Categorical | Positive / Negative / Neutral |
| Factual accuracy | Thumbs up/down | Correct vs. hallucinated |
| Toxicity screening | Categorical | Safe / Borderline / Toxic |
| Response relevance | Numeric (1-10) | How relevant was the answer? |
| Grammar and style | Text | Free-form correction notes |
| Prompt A vs. B comparison | Star rating | Rate each variant 1-5 stars |
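For multi-annotator setups like the use cases above, a simple agreement metric is the fraction of annotators who match the majority score. This is a generic computation, not a documented platform feature:

```python
from collections import Counter

def majority_agreement(values):
    """Fraction of annotators whose score matches the most common score.

    E.g. three annotators scoring ["up", "up", "down"] agree at 2/3.
    """
    counts = Counter(values)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(values)
```

Low agreement on an item is often a signal that the label's definition, or the item itself, needs a closer look before the score is trusted downstream.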
Get started
Quickstart
Create a label, set up a queue, and annotate your first item in 5 minutes.
Annotation Labels
Understand the five label types and when to use each one.
Queues & Workflow
Learn how queues organize work with assignment strategies and review workflows.
Scores
Dive into the unified Score model that powers all annotation data.