# Configure Evals for Prototype
Define which evaluations run on your prototype outputs using EvalTags, mapping, and optional custom evals.
## About
When running multiple versions of your application in Prototype, cost and latency alone don’t tell you which version is better. Configuring evals adds quality scores to every run, so you can compare versions on what actually matters: does the output stay on topic, follow the right tone, avoid unsafe content, and answer accurately. Every generation is scored automatically, and the results appear in the Prototype dashboard alongside cost and latency so you can make a data-driven decision on which version to promote.
## When to use
- Pre-production quality checks: Score every run for hallucinations, tone, safety, or accuracy before promoting any version to production.
- Domain-specific criteria: Use different evals depending on what matters for your use case.
- Reproducible scoring: Same eval config across all versions so comparisons stay fair and consistent.
- Multi-version testing: Run the same evals across all versions so rankings in the dashboard stay objective.
## How to

### Define EvalTags in register()
In your `register()` call, pass an `eval_tags` list (Python) or `evalTags` (TypeScript). Each tag specifies the eval name, the span type and kind, a mapping from the eval's required keys to your span attributes, an optional custom display name, and the model to use.
```python
eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={"context": "input.value", "output": "output.value"},
        custom_eval_name="context_check",
        model=ModelChoices.TURING_SMALL
    ),
    EvalTag(
        eval_name=EvalName.TOXICITY,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={"input": "input.value"},
        custom_eval_name="toxicity_check",
        model=ModelChoices.TURING_SMALL
    )
]
```

```typescript
const evalTags = [
  new EvalTag({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CONTEXT_ADHERENCE,
    custom_eval_name: "context_check",
    mapping: { "context": "input.value", "output": "output.value" },
    model: ModelChoices.TURING_SMALL
  }),
  new EvalTag({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.TOXICITY,
    custom_eval_name: "toxicity_check",
    mapping: { "input": "input.value" },
    model: ModelChoices.TURING_SMALL
  })
];
```

| Field | Description |
|---|---|
| `eval_name` | The evaluation to run. Must be a valid `EvalName` enum value. |
| `type` | Where to apply the evaluation (e.g. `OBSERVATION_SPAN`). |
| `value` | Kind of span to evaluate (e.g. `LLM`). |
| `mapping` | Maps the eval's required keys to span attribute paths. See below. |
| `custom_eval_name` | Display name for this eval in the dashboard. |
| `model` | Model for Future AGI evals (e.g. `TURING_LARGE`, `TURING_SMALL`). |
### Understand the mapping attribute

The `mapping` attribute connects eval requirements with your trace data. How it works:

- **Each eval has required keys**: Different evals need different inputs (e.g. Context Adherence needs `context` and `output`).
- **Spans have attributes**: Your spans (LLM, retriever, etc.) store data as key-value span attributes.
- **Mapping connects them**: The `mapping` object specifies which span attribute to use for each required key.

Example:

```python
mapping={
    "context": "input.value",
    "output": "output.value"
}
```

- The eval's `context` key pulls from `input.value`: the raw input sent to the model.
- The eval's `output` key pulls from `output.value`: the raw response from the model.
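Conceptually, resolving a mapping is a simple key lookup: each required eval key is filled with the value stored under the span attribute it points to. The sketch below illustrates this in plain Python; `resolve_mapping` and the sample attribute values are illustrative, not part of the SDK.

```python
# Illustrative sketch (not SDK internals) of how a mapping turns
# span attributes into the inputs an eval needs.

def resolve_mapping(mapping, span_attributes):
    """For each eval key, pull the value from the mapped span attribute."""
    return {eval_key: span_attributes[attr_path]
            for eval_key, attr_path in mapping.items()}

# Hypothetical span attributes captured for one LLM call.
span_attributes = {
    "input.value": "What is the capital of France?",
    "output.value": "The capital of France is Paris.",
}

mapping = {"context": "input.value", "output": "output.value"}

eval_inputs = resolve_mapping(mapping, span_attributes)
# eval_inputs == {"context": "What is the capital of France?",
#                 "output": "The capital of France is Paris."}
```

If a mapped attribute path is missing from the span, this lookup would fail, which is why the mapping must match the attribute paths your instrumentor actually emits.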
### Use custom_eval_name for display (optional)

`custom_eval_name` sets the display name shown in the Prototype dashboard for this eval. `eval_name` must always be a valid `EvalName` enum value: it selects which evaluation logic runs. Use `custom_eval_name` to give the eval a meaningful label for your project.
```python
eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={"context": "input.value", "output": "output.value"},
        custom_eval_name="my_adherence_check",
        model=ModelChoices.TURING_SMALL
    ),
]
```

```typescript
const evalTags = [
  new EvalTag({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CONTEXT_ADHERENCE,
    custom_eval_name: "my_adherence_check",
    mapping: { "context": "input.value", "output": "output.value" },
    model: ModelChoices.TURING_SMALL
  })
];
```

> **Note**: For the full list of built-in evals and their required mapping keys, see Built-in evals.
### Span Attribute Paths Reference
For OpenAI (and most LLM instrumentors), the standard span attribute paths are:
| Data | Span attribute path |
|---|---|
| System message content | `gen_ai.input.messages.0.message.content` |
| User message content | `gen_ai.input.messages.1.message.content` |
| Model response | `gen_ai.output.messages.0.message.content` |

The index (`.0.`, `.1.`) corresponds to the position of the message in the `messages` array passed to the model.
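To make the path structure concrete, here is a small sketch of how a dotted path with numeric segments indexes into message data. The nested dict and the `get_by_path` helper are illustrative only; they stand in for however your instrumentor actually stores span attributes.

```python
# Illustrative sketch: walking a dotted attribute path where numeric
# segments index into lists. Not SDK code; the nested dict below is
# a stand-in for real span data.

def get_by_path(data, path):
    """Follow each dot-separated segment; digits index into lists."""
    node = data
    for part in path.split("."):
        node = node[int(part)] if part.isdigit() else node[part]
    return node

# Hypothetical span data for one chat completion.
span = {
    "gen_ai": {
        "input": {"messages": [
            {"message": {"content": "You are a helpful assistant."}},
            {"message": {"content": "Summarise this document."}},
        ]},
        "output": {"messages": [
            {"message": {"content": "Here is the summary: ..."}},
        ]},
    }
}

get_by_path(span, "gen_ai.input.messages.1.message.content")
# -> "Summarise this document."  (index 1 = the user message)
```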
### Required Keys by Eval
Each eval has its own required mapping keys. Common ones:
| Eval | Required keys |
|---|---|
| Context Adherence | context, output |
| Toxicity | input |
| Completeness | input, output |
| Detect Hallucination | input, output |
| Prompt Injection | input |
| Tone | input |
For the full list, see Built-in evals.
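A quick pre-flight check against the table above can catch a mapping that omits a required key before any runs are scored. The `REQUIRED_KEYS` table and `missing_keys` helper below are assumptions for illustration, not SDK functions.

```python
# Sketch of a hypothetical pre-flight check (not part of the SDK):
# verify a tag's mapping supplies every key its eval requires.

# Required keys per eval, taken from the table above.
REQUIRED_KEYS = {
    "CONTEXT_ADHERENCE": {"context", "output"},
    "TOXICITY": {"input"},
    "COMPLETENESS": {"input", "output"},
    "DETECT_HALLUCINATION": {"input", "output"},
    "PROMPT_INJECTION": {"input"},
    "TONE": {"input"},
}

def missing_keys(eval_name, mapping):
    """Return the required keys that the mapping does not provide."""
    return REQUIRED_KEYS[eval_name] - set(mapping)

missing_keys("CONTEXT_ADHERENCE", {"context": "input.value"})
# -> {"output"}: this tag would need an "output" entry before it can run.
```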
## Next Steps

- **Set up prototype**: Configure environment, register project, and instrument your app.
- **Choose winner**: Rank versions and promote the best to production.
- **Prototype overview**: How Prototype fits in.
- **Evaluation overview**: Running evals and built-in eval details.
- **Built-in evals**: Full list of 70+ built-in evals with mapping keys and output types.