Configure Evals for Prototype
Define which evaluations run on your prototype outputs using EvalTags, mapping, and optional custom evals.
What it is
Configuring evals for prototype means defining a list of EvalTag objects that specify which evaluations run against your model outputs when you use Prototype. Each EvalTag ties an eval (e.g. context adherence, tone, toxicity) to a span kind and type, maps your trace data (e.g. LLM input/output) to the eval’s required inputs via the mapping attribute, and optionally sets a model for Future AGI evals. Once configured in register(), prototype runs are scored automatically so you can compare versions by eval results in the dashboard.
Use cases
- Quality and safety — Run built-in evals (context adherence, toxicity, PII, prompt injection) on every prototype run before promoting.
- Compare versions — Use eval scores alongside cost and latency to pick the best prompt or model.
- Custom evals — Attach your own evals by name and mapping so prototype outputs are evaluated against your criteria.
- Consistent scoring — Same mapping and model for all runs so comparisons are fair across versions.
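The version-comparison workflow above can be sketched in plain Python. The run records and field names below are illustrative only; real scores, cost, and latency come from the dashboard:

```python
# Pick the best prototype version by eval score, breaking ties on cost.
# These run dicts are hypothetical stand-ins for dashboard results.
runs = [
    {"version": "v1", "context_adherence": 0.81, "cost_usd": 0.012},
    {"version": "v2", "context_adherence": 0.92, "cost_usd": 0.015},
    {"version": "v3", "context_adherence": 0.92, "cost_usd": 0.009},
]

# Sort by score descending, then cost ascending.
best = sorted(runs, key=lambda r: (-r["context_adherence"], r["cost_usd"]))[0]
print(best["version"])  # v3: ties v2 on score, costs less
```

Because every version was scored with the same eval tags and mapping, a comparison like this is apples-to-apples.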
How to
Define EvalTags in register()
In your register() call, pass an eval_tags list (Python) or evalTags (TypeScript). Each tag specifies the eval name, span type and kind, mapping from your span attributes to the eval’s required keys, optional custom display name, and (for Future AGI evals) the model to use.
```python
eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.1.message.content"
        },
        custom_eval_name="context_check",
        model=ModelChoices.TURING_LARGE
    )
]
```

```typescript
const evalTags = [
  EvalTag.create({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CHUNK_ATTRIBUTION,
    custom_eval_name: "Chunk_Attribution",
    mapping: {
      "context": "raw.input",
      "output": "raw.output"
    },
    model: ModelChoices.TURING_SMALL
  }),
];
```

| Field | Description |
|---|---|
| eval_name | The evaluation to run on the spans (a built-in EvalName or a custom eval name string). |
| type | Where to apply the evaluation (e.g. OBSERVATION_SPAN). |
| value | Kind of span to evaluate (e.g. LLM). |
| mapping | Maps the eval's required keys to span attribute paths. See below. |
| custom_eval_name | Optional display name for this eval in the dashboard. |
| model | Model for Future AGI evals (e.g. TURING_LARGE, TURING_SMALL). |
Understand the mapping attribute
The mapping attribute connects eval requirements with your trace data. How it works:
- Each eval has required keys — Different evals need different inputs (e.g. Context Adherence needs context and output).
- Spans have attributes — Your spans (LLM, retriever, etc.) store data as key-value span attributes.
- Mapping connects them — The mapping object specifies which span attribute to use for each required key.
Example:

```python
mapping={
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content"
}
```

- The eval's output key uses data from the span attribute llm.output_messages.0.message.content.
- The eval's context key uses data from the span attribute llm.input_messages.1.message.content.
Use the required keys for your chosen eval (see Built-in evals reference below).
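Assuming span attributes are stored under flat dotted keys (as the paths above suggest), resolving a mapping amounts to a dictionary lookup per required key. A minimal sketch; the resolve_mapping helper is illustrative, not part of the SDK:

```python
def resolve_mapping(mapping: dict, span_attributes: dict) -> dict:
    """Build an eval's inputs by looking up each mapped span attribute."""
    missing = [path for path in mapping.values() if path not in span_attributes]
    if missing:
        raise KeyError(f"span is missing mapped attributes: {missing}")
    return {key: span_attributes[path] for key, path in mapping.items()}

# Hypothetical flattened attributes from an LLM span:
span_attributes = {
    "llm.input_messages.1.message.content": "Paris is the capital of France.",
    "llm.output_messages.0.message.content": "The capital of France is Paris.",
}

mapping = {
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content",
}

eval_inputs = resolve_mapping(mapping, span_attributes)
print(eval_inputs["context"])  # Paris is the capital of France.
```

If a mapped attribute path is misspelled or absent from the span, the eval has nothing to score, so failing fast on missing paths is the useful behavior to emulate.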
Add custom evals (optional)
For custom-built evals, pass the custom eval name as a string in eval_name. No model is needed; mapping still defines how to get the eval's required inputs from your spans.
```python
eval_tags = [
    EvalTag(
        eval_name='custom_eval_name_entered',
        value=EvalSpanKind.LLM,
        type=EvalTagType.OBSERVATION_SPAN,
        mapping={'input': 'input.value'},
        custom_eval_name="<custom_eval_name2>",
    ),
]
```

```typescript
const evalTags = [
  EvalTag.create({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: "Custom_eval_name_entered",
    custom_eval_name: "Chunk_Attribution",
    mapping: {
      "context": "raw.input",
      "output": "raw.output"
    }
  }),
];
```

Built-in evals reference
Use the table below to pick eval_name / EvalName and build your mapping. For full details on each eval, see the Evaluation built-in evals docs.
| Eval | Mapping | Output |
|---|---|---|
| Conversation Coherence | messages | Score |
| Conversation Resolution | messages | Score |
| Content Moderation | text | Score |
| Context Adherence | context, output | Score |
| Context Relevance | context, input | Score |
| Completeness | input, output | Score |
| PII | input | Passed / Failed |
| Toxicity | input | Passed / Failed |
| Tone | input | Tone labels |
| Sexist | input | Passed / Failed |
| Prompt Injection | input | Passed / Failed |
| Prompt Instruction Adherence | output | Score 0–1 |
| Data Privacy Compliance | input | Passed / Failed |
| Is Json | text | Passed / Failed |
| One Line | text | Passed / Failed |
| Contains Valid Link | text | — |
| No Valid Links | text | Passed / Failed |
| Is Email | text | Passed / Failed |
| Summary Quality | input, output, context | Score |
| Factual Accuracy | input, output, context | Score |
| Translation Accuracy | input, output | Score |
| Cultural Sensitivity | input | Passed / Failed |
| Bias Detection | input | Passed / Failed |
| LLM Function Calling | input, output | Passed / Failed |
| Groundedness | input, output | Passed / Failed |
| Audio Transcription | input_audio, input_transcription | Score |
| Audio Quality | input_audio | Score |
| Chunk Attribution | input, output, context | Passed / Failed |
| Chunk Utilization | input, output, context | Score |
| Eval Ranking | input, context | Score |
| No Racial Bias | input | Passed / Failed |
| No Gender Bias | input | Passed / Failed |
| No Age Bias | input | Passed / Failed |
| No OpenAI Reference | input | Passed / Failed |
| No Appologies | input | Passed / Failed |
| Is Polite | input | Passed / Failed |
| Is Concise | input | Passed / Failed |
| Is Helpful | input, output | Passed / Failed |
| Is Code | input | Passed / Failed |
| Fuzzy Match | input, output | Passed / Failed |
| Answer Refusal | input, output | Passed / Failed |
| Detect Hallucination | input, output | Passed / Failed |
| No Harmful Therapeutic Guidance | input | Passed / Failed |
| Clinically Inappropriate Tone | input | Passed / Failed |
| Is Harmful Advice | input | Passed / Failed |
| Content Safety Violation | input | Passed / Failed |
| Is Good Summary | input, output | Passed / Failed |
| Is Factually Consistent | input, output | Passed / Failed |
| Is Compliant | input | Passed / Failed |
| Is Informal Tone | input | Passed / Failed |
| Evaluate Function Calling | input, output | Passed / Failed |
| Task Completion | input, output | Passed / Failed |
| Caption Hallucination | input, output | Passed / Failed |
| Bleu Score | reference, hypothesis | Score 0–1 |
| Rouge Score | reference, hypothesis | Score 0–1 |
| Text to SQL | input, output | Passed / Failed |
| Recall Score | reference, hypothesis | Score 0–1 |
| Levenshtein Similarity | response, expected_text | Similarity 0–1 |
| Numeric Similarity | response, expected_text | Similarity 0–1 |
| Embedding Similarity | response, expected_text | Score 0–1 |
| Semantic List Contains | response, expected_text | Score 0–1 |
| Is AI Generated Image | input_image | Score |
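A tag only works when its mapping covers every required key from the table above, so it can be worth checking tags before calling register(). A minimal sketch in plain Python; the REQUIRED_KEYS dict hard-codes a small excerpt of the table and is illustrative, not an SDK export:

```python
# Required mapping keys for a few built-in evals (excerpt of the table above).
REQUIRED_KEYS = {
    "Context Adherence": {"context", "output"},
    "Completeness": {"input", "output"},
    "Bleu Score": {"reference", "hypothesis"},
}

def check_mapping(eval_name: str, mapping: dict) -> list:
    """Return the required keys this mapping is missing for the given eval."""
    return sorted(REQUIRED_KEYS[eval_name] - mapping.keys())

# A mapping that forgets the "context" key:
missing = check_mapping(
    "Context Adherence",
    {"output": "llm.output_messages.0.message.content"},
)
print(missing)  # ['context']
```

An empty result means the mapping satisfies the eval's requirements; anything else names the keys you still need to map to span attributes.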