Annotations

Add human feedback to your AI outputs with annotation labels, queues, and scores across traces, datasets, prototypes, and simulations.

What it is

Annotations are human labels applied to AI outputs — traces, spans, sessions, dataset rows, prototype runs, and simulation executions. They capture subjective judgments (sentiment, quality, helpfulness) and factual assessments (correctness, safety, relevance) that automated evals alone cannot provide.

Human-in-the-loop (HITL) feedback is essential for GenAI systems because:

  • Quality control — Catch hallucinations, off-topic responses, and policy violations before they reach users.
  • Feedback loops — Route human judgments back into prompt tuning, guardrail configuration, and model selection.
  • Fine-tuning data — Build high-quality labeled datasets from production traffic to improve your models.
  • Safety and compliance — Document human review for regulated or high-stakes use cases.

Architecture

Annotations are built on three primitives:

  • Labels: Reusable annotation templates (categorical, numeric, text, star rating, thumbs up/down) shared across your organization.
  • Queues: Managed annotation campaigns that assign items to annotators, track progress, and enforce review workflows.
  • Scores: The unified data record created each time an annotator (or an automation) applies a label to a source.

Labels define what you measure. Queues organize how the work gets done. Scores store every individual annotation.
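
To make the relationships concrete, here is a minimal Python sketch of the three primitives as dataclasses. The field names and shapes are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal, Optional, Union

# Illustrative shapes only -- these field names are assumptions, not the real schema.

LabelType = Literal["categorical", "numeric", "text", "star_rating", "thumbs"]

@dataclass
class Label:
    """Reusable annotation template shared across the organization."""
    name: str                                         # e.g. "Sentiment"
    type: LabelType                                   # one of the five label types
    options: list[str] = field(default_factory=list)  # categories, if categorical

@dataclass
class Queue:
    """Managed campaign: which labels, who applies them, how work is assigned."""
    name: str
    label_names: list[str]          # which Labels this queue applies
    annotators: list[str]           # annotator ids or emails
    annotators_per_item: int = 1    # 1-10, for inter-annotator agreement

@dataclass
class Score:
    """One annotation: a label value applied by an annotator to a source."""
    label_name: str
    source_type: str                    # "trace", "dataset_row", ...
    source_id: str
    value: Union[str, float, bool]      # shape depends on the label type
    annotator: str
    queue_name: Optional[str] = None    # None for inline (ad-hoc) scores
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Note that the queue reference on a score is optional: that is what separates managed campaign work from the inline, ad-hoc feedback described below.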

Supported source types

Annotations can target any of the following entities:

  • trace: An LLM trace from Observe
  • observation_span: A specific span within a trace
  • trace_session: A conversation session (a group of traces)
  • dataset_row: A row in a dataset
  • call_execution: A simulation call execution
  • prototype_run: A prototype or experiment run
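
If you reference these identifiers in code, pinning them down as an enum keeps exports and API calls consistent. A small sketch whose string values simply mirror the list above:

```python
from enum import Enum

class SourceType(str, Enum):
    """Entities an annotation can target (mirrors the list above)."""
    TRACE = "trace"
    OBSERVATION_SPAN = "observation_span"
    TRACE_SESSION = "trace_session"
    DATASET_ROW = "dataset_row"
    CALL_EXECUTION = "call_execution"
    PROTOTYPE_RUN = "prototype_run"

# Example: reject an unknown source type early instead of at submission time.
print(SourceType("dataset_row"))   # SourceType.DATASET_ROW
```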

How it works

The typical annotation workflow follows three steps:

  1. Define labels — Create the annotation templates your team will use (e.g. a “Sentiment” categorical label or a “Quality” star rating).
  2. Set up a queue — Build an annotation campaign by choosing labels, adding annotators, and configuring assignment rules.
  3. Annotate and review — Add items (traces, dataset rows, etc.) to the queue. Annotators score each item. Reviewers optionally approve results.
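
Sketched against a hypothetical Python SDK client, the three steps might look like the following. The `AnnotationClient` class, method names, and parameters here are assumptions for illustration; check the SDK reference for the actual API:

```python
# Hypothetical SDK usage -- the client, methods, and parameters are assumptions.
client = AnnotationClient(api_key="...")

# 1. Define labels: the templates your team will apply.
sentiment = client.labels.create(
    name="Sentiment",
    type="categorical",
    options=["Positive", "Negative", "Neutral"],
)
quality = client.labels.create(name="Quality", type="star_rating")

# 2. Set up a queue: choose labels, annotators, and assignment rules.
queue = client.queues.create(
    name="Support bot weekly review",
    label_ids=[sentiment.id, quality.id],
    annotators=["ana@example.com", "ben@example.com"],
    assignment="round_robin",      # or "load_balanced" / "manual"
    annotators_per_item=2,
)

# 3. Annotate and review: add items, then annotators score them.
client.queues.add_items(queue.id, source_type="trace", source_ids=["tr_123", "tr_456"])
client.scores.create(
    queue_id=queue.id,
    source_type="trace",
    source_id="tr_123",
    label_id=sentiment.id,
    value="Positive",
)
```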

Annotations can also be created inline — directly from any trace, session, or dataset view — without a queue, for ad-hoc feedback.
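
An inline annotation is simply a score with no queue attached; with the same hypothetical client, it would be a single call:

```python
# Ad-hoc feedback straight from a trace view -- no queue involved.
# (Same hypothetical client and method names as above.)
client.scores.create(
    source_type="trace",
    source_id="tr_789",
    label_id=sentiment.id,
    value="Negative",
    comment="Answer contradicted the retrieved document.",
)
```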

Key capabilities

  • 5 label types — Categorical, numeric, free-text, star rating, and thumbs up/down to cover any feedback need.
  • Managed queues — Round-robin, load-balanced, or manual assignment strategies with reservation timeouts.
  • Inline annotations — Annotate directly from trace detail, session grid, or dataset views without opening a queue.
  • Multi-annotator support — Require 1-10 annotators per item to measure inter-annotator agreement (see the agreement sketch after this list).
  • Review workflows — Route completed items through a reviewer before finalizing.
  • Export to dataset — Turn annotated data into training or eval datasets.
  • Python and JS SDKs — Create labels, manage queues, and submit scores programmatically.
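
When several annotators score the same items, it is worth checking how well they agree before trusting the labels. This is the agreement sketch referenced in the list above: it computes Cohen's kappa with scikit-learn over two annotators' categorical scores (the data here is made up; in practice you would read it from a score export):

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' "Sentiment" scores for the same five traces (made-up data).
ana = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
ben = ["Positive", "Negative", "Positive", "Positive", "Negative"]

kappa = cohen_kappa_score(ana, ben)
print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, ~0 = chance level
```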

Common use cases

Each entry below pairs a use case with a suggested label type and an example:

  • Sentiment classification (Categorical): Positive / Negative / Neutral
  • Factual accuracy (Thumbs up/down): Correct vs. hallucinated
  • Toxicity screening (Categorical): Safe / Borderline / Toxic
  • Response relevance (Numeric, 1-10): How relevant was the answer?
  • Grammar and style (Text): Free-form correction notes
  • Prompt A vs. B comparison (Star rating): Rate each variant 1-5 stars
