Configure Evals for Prototype

Define which evaluations run on your prototype outputs using EvalTags, mapping, and optional custom evals.

What it is

Configuring evals for Prototype means defining a list of EvalTag objects that specify which evaluations run against your model outputs. Each EvalTag ties an eval (e.g. context adherence, tone, toxicity) to a span kind and type, maps your trace data (e.g. LLM input/output) to the eval’s required inputs via the mapping attribute, and, for Future AGI evals, optionally sets the model to use. Once configured in register(), Prototype runs are scored automatically, so you can compare versions by eval results in the dashboard.
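Conceptually, each tag bundles a handful of pieces. As a rough sketch in plain Python (illustrative only — the real EvalTag class and its enums come from the SDK, shown in the examples below):

```python
from dataclasses import dataclass

# Illustrative shape only -- this is NOT the SDK's EvalTag class,
# just a sketch of the information each tag carries.
@dataclass
class EvalTagSketch:
    eval_name: str              # which eval to run, e.g. "context_adherence"
    tag_type: str               # where to apply it, e.g. "OBSERVATION_SPAN"
    span_kind: str              # which spans to score, e.g. "LLM"
    mapping: dict               # eval required keys -> span attribute paths
    custom_eval_name: str = ""  # optional dashboard display name
    model: str = ""             # model choice, used by Future AGI evals

tag = EvalTagSketch(
    eval_name="context_adherence",
    tag_type="OBSERVATION_SPAN",
    span_kind="LLM",
    mapping={
        "output": "llm.output_messages.0.message.content",
        "context": "llm.input_messages.1.message.content",
    },
    custom_eval_name="context_check",
    model="TURING_LARGE",
)
```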


Use cases

  • Quality and safety — Run built-in evals (context adherence, toxicity, PII, prompt injection) on every prototype run before promoting.
  • Compare versions — Use eval scores alongside cost and latency to pick the best prompt or model.
  • Custom evals — Attach your own evals by name and mapping so prototype outputs are evaluated against your criteria.
  • Consistent scoring — Same mapping and model for all runs so comparisons are fair across versions.
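For the version-comparison use case, the idea is simply to aggregate scores per version. A minimal sketch with hypothetical run records (the field names and values here are invented for illustration, not an export format the platform defines):

```python
from statistics import mean

# Hypothetical run records: each run carries its prompt version plus the
# eval score, cost, and latency you would see in the dashboard.
runs = [
    {"version": "v1", "context_adherence": 0.72, "cost_usd": 0.012, "latency_s": 1.4},
    {"version": "v1", "context_adherence": 0.68, "cost_usd": 0.011, "latency_s": 1.3},
    {"version": "v2", "context_adherence": 0.91, "cost_usd": 0.014, "latency_s": 1.6},
    {"version": "v2", "context_adherence": 0.88, "cost_usd": 0.013, "latency_s": 1.5},
]

def summarize(runs, metric):
    """Average one metric per version so versions can be compared side by side."""
    per_version = {}
    for run in runs:
        per_version.setdefault(run["version"], []).append(run[metric])
    return {version: mean(values) for version, values in per_version.items()}

by_score = summarize(runs, "context_adherence")
best = max(by_score, key=by_score.get)  # highest mean eval score wins
```

The same summarize() call works for "cost_usd" or "latency_s", which is how eval scores sit alongside cost and latency when picking a version.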

How to

Define EvalTags in register()

In your register() call, pass an eval_tags list (Python) or evalTags (TypeScript). Each tag specifies the eval name, span type and kind, mapping from your span attributes to the eval’s required keys, optional custom display name, and (for Future AGI evals) the model to use.

Python:

eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.1.message.content"
        },
        custom_eval_name="context_check",
        model=ModelChoices.TURING_LARGE
    )
]

TypeScript:

const evalTags = [
  EvalTag.create({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CHUNK_ATTRIBUTION,
    custom_eval_name: "Chunk_Attribution",
    mapping: {
      "context": "raw.input",
      "output": "raw.output"
    },
    model: ModelChoices.TURING_SMALL
  }),
];
Field | Description
eval_name | The evaluation to run on the spans (e.g. built-in name or custom string).
type | Where to apply the evaluation (e.g. OBSERVATION_SPAN).
value | Kind of span to evaluate (e.g. LLM).
mapping | Maps eval required keys to span attribute paths. See below.
custom_eval_name | Optional display name for this eval in the dashboard.
model | Model for Future AGI evals (e.g. TURING_LARGE, TURING_SMALL).

Understand the mapping attribute

The mapping attribute connects eval requirements with your trace data. How it works:

  1. Each eval has required keys — Different evals need different inputs (e.g. Context Adherence needs context and output).
  2. Spans have attributes — Your spans (LLM, retriever, etc.) store data as key-value span attributes.
  3. Mapping connects them — The mapping object specifies which span attribute to use for each required key.

Example:

mapping={
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content"
}
  • The eval’s output key uses data from the span attribute llm.output_messages.0.message.content.
  • The eval’s context key uses data from the span attribute llm.input_messages.1.message.content.

Use the required keys for your chosen eval (see Built-in evals reference below).
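Conceptually, each mapping value is a dot-separated path into the span's attribute tree, with numeric segments indexing into lists. A minimal illustrative resolver (a hypothetical helper, not part of the SDK) makes the lookup concrete:

```python
def resolve_path(attributes, path):
    """Walk a dot-separated path through nested dicts/lists.

    Numeric segments (e.g. the "0" in "llm.output_messages.0.message.content")
    index into lists; everything else is a dict key.
    """
    node = attributes
    for segment in path.split("."):
        if isinstance(node, list) and segment.isdigit():
            node = node[int(segment)]
        else:
            node = node[segment]
    return node

# Hypothetical span attributes shaped like an LLM span's messages.
span_attributes = {
    "llm": {
        "input_messages": [
            {"message": {"content": "You are a helpful assistant."}},
            {"message": {"content": "France is in Europe. What is its capital?"}},
        ],
        "output_messages": [
            {"message": {"content": "Paris is the capital of France."}},
        ],
    }
}

mapping = {
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content",
}

# Build the eval's inputs by resolving each mapped path.
eval_inputs = {key: resolve_path(span_attributes, path) for key, path in mapping.items()}
```

Here the eval would receive the assistant's reply as "output" and the second input message as "context", matching the example above.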

Add custom evals (optional)

For custom-built evals, pass the custom eval’s name as a string in eval_name. No model is needed; mapping still defines how the eval’s required inputs are read from your spans.

Python:

eval_tags = [
    EvalTag(
        eval_name='custom_eval_name_entered',
        value=EvalSpanKind.LLM,
        type=EvalTagType.OBSERVATION_SPAN,
        mapping={'input': 'input.value'},
        custom_eval_name="<custom_eval_name2>",
    ),
]

TypeScript:

const evalTags = [
  EvalTag.create({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: "Custom_eval_name_entered",
    custom_eval_name: "Chunk_Attribution",
    mapping: {
      "context": "raw.input",
      "output": "raw.output"
    }
  }),
];

Built-in evals reference

Use the table below to pick eval_name / EvalName and build your mapping. For full details on each eval, see the Evaluation built-in evals docs.

Eval | Mapping | Output
Conversation Coherence | messages | Score
Conversation Resolution | messages | Score
Content Moderation | text | Score
Context Adherence | context, output | Score
Context Relevance | context, input | Score
Completeness | input, output | Score
PII | input | Passed / Failed
Toxicity | input | Passed / Failed
Tone | input | Tone labels
Sexist | input | Passed / Failed
Prompt Injection | input | Passed / Failed
Prompt Instruction Adherence | output | Score 0–1
Data Privacy Compliance | input | Passed / Failed
Is Json | text | Passed / Failed
One Line | text | Passed / Failed
Contains Valid Link | text |
No Valid Links | text | Passed / Failed
Is Email | text | Passed / Failed
Summary Quality | input, output, context | Score
Factual Accuracy | input, output, context | Score
Translation Accuracy | input, output | Score
Cultural Sensitivity | input | Passed / Failed
Bias Detection | input | Passed / Failed
LLM Function Calling | input, output | Passed / Failed
Groundedness | input, output | Passed / Failed
Audio Transcription | input_audio, input_transcription | Score
Audio Quality | input_audio | Score
Chunk Attribution | input, output, context | Passed / Failed
Chunk Utilization | input, output, context | Score
Eval Ranking | input, context | Score
No Racial Bias | input | Passed / Failed
No Gender Bias | input | Passed / Failed
No Age Bias | input | Passed / Failed
No OpenAI Reference | input | Passed / Failed
No Apologies | input | Passed / Failed
Is Polite | input | Passed / Failed
Is Concise | input | Passed / Failed
Is Helpful | input, output | Passed / Failed
Is Code | input | Passed / Failed
Fuzzy Match | input, output | Passed / Failed
Answer Refusal | input, output | Passed / Failed
Detect Hallucination | input, output | Passed / Failed
No Harmful Therapeutic Guidance | input | Passed / Failed
Clinically Inappropriate Tone | input | Passed / Failed
Is Harmful Advice | input | Passed / Failed
Content Safety Violation | input | Passed / Failed
Is Good Summary | input, output | Passed / Failed
Is Factually Consistent | input, output | Passed / Failed
Is Compliant | input | Passed / Failed
Is Informal Tone | input | Passed / Failed
Evaluate Function Calling | input, output | Passed / Failed
Task Completion | input, output | Passed / Failed
Caption Hallucination | input, output | Passed / Failed
Bleu Score | reference, hypothesis | Score 0–1
Rouge Score | reference, hypothesis | Score 0–1
Text to SQL | input, output | Passed / Failed
Recall Score | reference, hypothesis | Score 0–1
Levenshtein Similarity | response, expected_text | Similarity 0–1
Numeric Similarity | response, expected_text | Similarity 0–1
Embedding Similarity | response, expected_text | Score 0–1
Semantic List Contains | response, expected_text | Score 0–1
Is AI Generated Image | input_image | Score
