Configure Evals for Prototype

Define which evaluations run on your prototype outputs using EvalTags, mapping, and optional custom evals.

About

When running multiple versions of your application in Prototype, cost and latency alone don’t tell you which version is better. Configuring evals adds quality scores to every run, so you can compare versions on what actually matters: does the output stay on topic, follow the right tone, avoid unsafe content, and answer accurately. Every generation is scored automatically, and the results appear in the Prototype dashboard alongside cost and latency so you can make a data-driven decision on which version to promote.


When to use

  • Pre-production quality checks: Score every run for hallucinations, tone, safety, or accuracy before promoting any version to production.
  • Domain-specific criteria: Use different evals depending on what matters for your use case.
  • Reproducible scoring: Same eval config across all versions so comparisons stay fair and consistent.
  • Multi-version testing: Run the same evals across all versions so rankings in the dashboard stay objective.

How to

Define EvalTags in register()

In your register() call, pass an eval_tags list (Python) or evalTags (TypeScript). Each tag specifies the eval to run (eval_name), where it applies (type and value), a mapping from the eval's required keys to your span attribute paths, an optional custom display name, and the model to use.

Python:

# Imports shown for context; exact import paths may vary by SDK version.
from fi_instrumentation.fi_types import (
    EvalName, EvalSpanKind, EvalTag, EvalTagType, ModelChoices
)

eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={"context": "input.value", "output": "output.value"},
        custom_eval_name="context_check",
        model=ModelChoices.TURING_SMALL
    ),
    EvalTag(
        eval_name=EvalName.TOXICITY,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={"input": "input.value"},
        custom_eval_name="toxicity_check",
        model=ModelChoices.TURING_SMALL
    )
]

TypeScript:

const evalTags = [
  new EvalTag({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CONTEXT_ADHERENCE,
    custom_eval_name: "context_check",
    mapping: { "context": "input.value", "output": "output.value" },
    model: ModelChoices.TURING_SMALL
  }),
  new EvalTag({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.TOXICITY,
    custom_eval_name: "toxicity_check",
    mapping: { "input": "input.value" },
    model: ModelChoices.TURING_SMALL
  })
];
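
Once defined, the tags are passed to register(). A minimal sketch of that call, assuming the tags above (the project_name value is illustrative, and the exact register() signature may differ by SDK version):

```python
# Sketch only: assumes the eval_tags list defined above is in scope,
# and that register() accepts these parameters in your SDK version.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

trace_provider = register(
    project_name="my-prototype",          # hypothetical project name
    project_type=ProjectType.EXPERIMENT,  # run as a Prototype experiment
    eval_tags=eval_tags,                  # the tags defined above
)
```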
Field | Description
eval_name | The evaluation to run. Must be a valid EvalName enum value.
type | Where to apply the evaluation (e.g. OBSERVATION_SPAN).
value | Kind of span to evaluate (e.g. LLM).
mapping | Maps the eval's required keys to span attribute paths. See below.
custom_eval_name | Optional display name for this eval in the dashboard.
model | Model used to run Future AGI evals (e.g. TURING_LARGE, TURING_SMALL).

Understand the mapping attribute

The mapping attribute connects eval requirements with your trace data. How it works:

  1. Each eval has required keys: Different evals need different inputs (e.g. Context Adherence needs context and output).
  2. Spans have attributes: Your spans (LLM, retriever, etc.) store data as key-value span attributes.
  3. Mapping connects them: The mapping object specifies which span attribute to use for each required key.

Example:

mapping={
    "context": "input.value",
    "output": "output.value"
}
  • The eval’s context key pulls from input.value: the raw input sent to the model.
  • The eval’s output key pulls from output.value: the raw response from the model.

Use custom_eval_name for display (optional)

custom_eval_name sets the display name shown in the Prototype dashboard for this eval. eval_name must always be a valid EvalName enum value: it selects which evaluation logic runs. Use custom_eval_name to give it a meaningful label for your project.

Python:

eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={"context": "input.value", "output": "output.value"},
        custom_eval_name="my_adherence_check",
        model=ModelChoices.TURING_SMALL
    ),
]

TypeScript:

const evalTags = [
  new EvalTag({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CONTEXT_ADHERENCE,
    custom_eval_name: "my_adherence_check",
    mapping: { "context": "input.value", "output": "output.value" },
    model: ModelChoices.TURING_SMALL
  })
];

Note

For the full list of built-in evals and their required mapping keys, see Built-in evals.


Span Attribute Paths Reference

For OpenAI (and most LLM instrumentors), the standard span attribute paths are:

Data | Span attribute path
System message content | gen_ai.input.messages.0.message.content
User message content | gen_ai.input.messages.1.message.content
Model response | gen_ai.output.messages.0.message.content

The index (.0., .1.) corresponds to the position of the message in the messages array passed to the model.
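
Combining these paths with the mapping attribute, a sketch of how Context Adherence might be wired to the OpenAI instrumentor's attributes (using the user message as context is an illustrative choice, not a requirement):

```python
# Hypothetical wiring: map Context Adherence's required keys to the
# OpenAI instrumentor's gen_ai.* span attribute paths listed above.
mapping = {
    "context": "gen_ai.input.messages.1.message.content",  # user message as context
    "output": "gen_ai.output.messages.0.message.content",  # model response
}

# The numeric index encodes position in the messages array:
# .0. -> first message (here, system), .1. -> second message (user).
```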


Required Keys by Eval

Each eval has its own required mapping keys. Common ones:

Eval | Required keys
Context Adherence | context, output
Toxicity | input
Completeness | input, output
Detect Hallucination | input, output
Prompt Injection | input
Tone | input

For the full list, see Built-in evals.
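
For instance, a Detect Hallucination tag would map both of its required keys, following the same EvalTag pattern shown earlier (the enum member name DETECT_HALLUCINATION and the display name are assumptions; check your SDK's EvalName values):

```python
# Sketch: assumes the same enums imported for the earlier examples,
# and that EvalName exposes a DETECT_HALLUCINATION member.
EvalTag(
    eval_name=EvalName.DETECT_HALLUCINATION,  # assumed enum member name
    type=EvalTagType.OBSERVATION_SPAN,
    value=EvalSpanKind.LLM,
    mapping={"input": "input.value", "output": "output.value"},
    custom_eval_name="hallucination_check",   # illustrative label
    model=ModelChoices.TURING_SMALL,
)
```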

