What it is
Annotations are human labels applied to AI outputs — traces, spans, sessions, dataset rows, prototype runs, and simulation executions. They capture subjective judgments (sentiment, quality, helpfulness) and factual assessments (correctness, safety, relevance) that automated evals alone cannot provide. Human-in-the-loop (HITL) feedback is essential for GenAI systems because it enables:
- Quality control — Catch hallucinations, off-topic responses, and policy violations before they reach users.
- Feedback loops — Route human judgments back into prompt tuning, guardrail configuration, and model selection.
- Fine-tuning data — Build high-quality labeled datasets from production traffic to improve your models.
- Safety and compliance — Document human review for regulated or high-stakes use cases.
Architecture
Annotations are built on three primitives:

| Primitive | Purpose |
|---|---|
| Labels | Reusable annotation templates (categorical, numeric, text, star rating, thumbs up/down) shared across your organization. |
| Queues | Managed annotation campaigns that assign items to annotators, track progress, and enforce review workflows. |
| Scores | The unified data record created each time an annotator (or automation) applies a label to a source. |
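To make the relationship between the three primitives concrete, here is a minimal, hypothetical sketch in plain Python dataclasses (these class names and fields are illustrative, not the actual SDK types): a Label is a reusable template, and each applied annotation becomes a Score that ties a label, a source, a value, and an annotator together.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Label:
    """A reusable annotation template shared across the organization."""
    name: str
    kind: str  # "categorical" | "numeric" | "text" | "star_rating" | "thumbs"
    options: Optional[list] = None  # choices, for categorical labels

@dataclass
class Score:
    """The unified record created each time a label is applied to a source."""
    label: Label
    source_type: str  # e.g. "trace", "dataset_row"
    source_id: str
    value: object     # the annotator's judgment
    annotator: str

# Applying a categorical "Sentiment" label to a trace yields one Score.
sentiment = Label("Sentiment", "categorical", ["Positive", "Negative", "Neutral"])
score = Score(sentiment, "trace", "tr_123", "Positive", "alice@example.com")
```

Queues then organize many such scores: they decide which annotator sees which item, while every resulting judgment lands in the same Score shape regardless of where it was created.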
Supported source types
Annotations can target any of the following entities:

| Source Type | Description |
|---|---|
| trace | An LLM trace from Observe |
| observation_span | A specific span within a trace |
| trace_session | A conversation session (group of traces) |
| dataset_row | A row in a dataset |
| call_execution | A simulation call execution |
| prototype_run | A prototype/experiment run |
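Because every annotation carries a source type from this fixed set, client code can reject unsupported targets up front. A minimal sketch of such a guard (the function name and error message are illustrative, not part of any documented API):

```python
# The six source types an annotation may target, per the table above.
SUPPORTED_SOURCE_TYPES = {
    "trace", "observation_span", "trace_session",
    "dataset_row", "call_execution", "prototype_run",
}

def validate_source_type(source_type: str) -> str:
    """Raise early if an annotation targets an unsupported entity."""
    if source_type not in SUPPORTED_SOURCE_TYPES:
        raise ValueError(f"unsupported source type: {source_type!r}")
    return source_type
```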
How it works
The typical annotation workflow follows three steps:
- Define labels — Create the annotation templates your team will use (e.g. a “Sentiment” categorical label or a “Quality” star rating).
- Set up a queue — Build an annotation campaign by choosing labels, adding annotators, and configuring assignment rules.
- Annotate and review — Add items (traces, dataset rows, etc.) to the queue. Annotators score each item. Reviewers optionally approve results.
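The three steps above can be sketched as a small in-memory model. This is not the real SDK; the class, method, and field names here are assumptions chosen to mirror the workflow, nothing more.

```python
class AnnotationQueue:
    """Toy model of an annotation campaign: items in, scores out."""

    def __init__(self, name, labels, annotators):
        self.name = name
        self.labels = labels
        self.annotators = annotators
        self.items = []     # (source_type, source_id) pairs awaiting annotation
        self.scores = []    # completed annotations
        self.approved = []  # scores a reviewer has signed off on

    def add_item(self, source_type, source_id):
        self.items.append((source_type, source_id))

    def annotate(self, item, label, value, annotator):
        self.scores.append(
            {"item": item, "label": label, "value": value, "annotator": annotator}
        )

    def review(self, score, approved=True):
        if approved:
            self.approved.append(score)

# Step 1: define a label (a 5-point "Quality" star rating).
quality = {"name": "Quality", "kind": "star_rating", "scale": 5}

# Step 2: set up a queue with labels and annotators.
queue = AnnotationQueue("prod-review", labels=[quality], annotators=["alice", "bob"])

# Step 3: add an item, annotate it, and have a reviewer approve the result.
queue.add_item("trace", "tr_001")
queue.annotate(queue.items[0], quality, 4, "alice")
queue.review(queue.scores[0])
```

The real product handles assignment, progress tracking, and review routing for you; the point of the sketch is only the order of operations: labels exist before queues, queues exist before scores.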
Key capabilities
- 5 label types — Categorical, numeric, free-text, star rating, and thumbs up/down to cover any feedback need.
- Managed queues — Round-robin, load-balanced, or manual assignment strategies with reservation timeouts.
- Inline annotations — Annotate directly from trace detail, session grid, or dataset views without opening a queue.
- Multi-annotator support — Require 1-10 annotators per item for inter-annotator agreement.
- Review workflows — Route completed items through a reviewer before finalizing.
- Export to dataset — Turn annotated data into training or eval datasets.
- Python and JS SDK — Create labels, manage queues, and submit scores programmatically.
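Of the assignment strategies listed above, round-robin is the simplest to picture. Here is an illustrative sketch (not the actual implementation) that also shows how a multi-annotator requirement fits in, using a hypothetical `annotators_per_item` parameter:

```python
from itertools import cycle

def assign_round_robin(items, annotators, annotators_per_item=1):
    """Distribute queue items across annotators in turn.

    Each item receives `annotators_per_item` distinct consecutive
    annotators from the rotation, supporting inter-annotator agreement.
    """
    rotation = cycle(annotators)
    return {
        item: [next(rotation) for _ in range(annotators_per_item)]
        for item in items
    }

work = assign_round_robin(["tr_1", "tr_2", "tr_3"], ["alice", "bob"])
# work → {"tr_1": ["alice"], "tr_2": ["bob"], "tr_3": ["alice"]}
```

Load-balanced assignment would instead weight the rotation by each annotator's open workload, and manual assignment skips the rotation entirely.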
Common use cases
| Use Case | Label Type | Example |
|---|---|---|
| Sentiment classification | Categorical | Positive / Negative / Neutral |
| Factual accuracy | Thumbs up/down | Correct vs. hallucinated |
| Toxicity screening | Categorical | Safe / Borderline / Toxic |
| Response relevance | Numeric (1-10) | How relevant was the answer? |
| Grammar and style | Text | Free-form correction notes |
| Prompt A vs. B comparison | Star rating | Rate each variant 1-5 stars |
Get started
Quickstart
Create a label, set up a queue, and annotate your first item in 5 minutes.
Annotation Labels
Understand the five label types and when to use each one.
Queues & Workflow
Learn how queues organize work with assignment strategies and review workflows.
Scores
Dive into the unified Score model that powers all annotation data.