Configure Evals for Prototype
Define which evaluations run on your prototype outputs using EvalTags, mapping, and optional custom evals.
What it is
Configuring evals for prototype means defining a list of EvalTag objects that specify which evaluations run against your model outputs when you use Prototype. Each EvalTag ties an eval (e.g. context adherence, tone, toxicity) to a span kind and type, maps your trace data (e.g. LLM input/output) to the eval’s required inputs via the mapping attribute, and optionally sets a model for Future AGI evals. Once configured in register(), prototype runs are scored automatically so you can compare versions by eval results in the dashboard.
Use cases
- Quality and safety — Run built-in evals (context adherence, toxicity, PII, prompt injection) on every prototype run before promoting.
- Compare versions — Use eval scores alongside cost and latency to pick the best prompt or model.
- Custom evals — Attach your own evals by name and mapping so prototype outputs are evaluated against your criteria.
- Consistent scoring — Same mapping and model for all runs so comparisons are fair across versions.
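The version-comparison workflow above can be sketched in plain Python. The run records and field names below are illustrative only; real scores, cost, and latency come from the dashboard:

```python
# Pick the best prototype version by eval score, breaking ties on cost.
# These run dicts are hypothetical stand-ins for dashboard results.
runs = [
    {"version": "v1", "context_adherence": 0.81, "cost_usd": 0.012},
    {"version": "v2", "context_adherence": 0.92, "cost_usd": 0.015},
    {"version": "v3", "context_adherence": 0.92, "cost_usd": 0.009},
]

# Sort by score descending, then cost ascending.
best = sorted(runs, key=lambda r: (-r["context_adherence"], r["cost_usd"]))[0]
print(best["version"])  # v3: ties v2 on score, costs less
```

Because every version was scored with the same eval tags and mapping, a comparison like this is apples-to-apples.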
How to
Define EvalTags in register()
In your register() call, pass an eval_tags list (Python) or evalTags (TypeScript). Each tag specifies the eval name, span type and kind, mapping from your span attributes to the eval’s required keys, optional custom display name, and (for Future AGI evals) the model to use.
```python
eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.1.message.content"
        },
        custom_eval_name="context_check",
        model=ModelChoices.TURING_LARGE
    )
]
```

```typescript
const evalTags = [
  EvalTag.create({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: EvalName.CHUNK_ATTRIBUTION,
    custom_eval_name: "Chunk_Attribution",
    mapping: {
      "context": "raw.input",
      "output": "raw.output"
    },
    model: ModelChoices.TURING_SMALL
  }),
];
```

| Field | Description |
|---|---|
| eval_name | The evaluation to run on the spans (a built-in EvalName or a custom eval name string). |
| type | Where to apply the evaluation (e.g. OBSERVATION_SPAN). |
| value | Kind of span to evaluate (e.g. LLM). |
| mapping | Maps the eval's required keys to span attribute paths. See below. |
| custom_eval_name | Optional display name for this eval in the dashboard. |
| model | Model for Future AGI evals (e.g. TURING_LARGE, TURING_SMALL). |
Understand the mapping attribute
The mapping attribute connects eval requirements with your trace data. How it works:
- Each eval has required keys — Different evals need different inputs (e.g. Context Adherence needs context and output).
- Spans have attributes — Your spans (LLM, retriever, etc.) store data as key-value span attributes.
- Mapping connects them — The mapping object specifies which span attribute to use for each required key.
Example:

```python
mapping={
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content"
}
```

- The eval's output key uses data from the span attribute llm.output_messages.0.message.content.
- The eval's context key uses data from the span attribute llm.input_messages.1.message.content.
Use the required keys for your chosen eval (see Built-in evals reference below).
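Assuming span attributes are stored under flat dotted keys (as the paths above suggest), resolving a mapping amounts to a dictionary lookup per required key. A minimal sketch; the resolve_mapping helper is illustrative, not part of the SDK:

```python
def resolve_mapping(mapping: dict, span_attributes: dict) -> dict:
    """Build an eval's inputs by looking up each mapped span attribute."""
    missing = [path for path in mapping.values() if path not in span_attributes]
    if missing:
        raise KeyError(f"span is missing mapped attributes: {missing}")
    return {key: span_attributes[path] for key, path in mapping.items()}

# Hypothetical flattened attributes from an LLM span:
span_attributes = {
    "llm.input_messages.1.message.content": "Paris is the capital of France.",
    "llm.output_messages.0.message.content": "The capital of France is Paris.",
}

mapping = {
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content",
}

eval_inputs = resolve_mapping(mapping, span_attributes)
print(eval_inputs["context"])  # Paris is the capital of France.
```

If a mapped attribute path is misspelled or absent from the span, the eval has nothing to score, so failing fast on missing paths is the useful behavior to emulate.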
Add custom evals (optional)
For custom-built evals, pass the custom eval name as a string in eval_name. No model is needed; mapping still defines how to get the eval's required inputs from your spans.
```python
eval_tags = [
    EvalTag(
        eval_name='custom_eval_name_entered',
        value=EvalSpanKind.LLM,
        type=EvalTagType.OBSERVATION_SPAN,
        mapping={'input': 'input.value'},
        custom_eval_name="<custom_eval_name2>",
    ),
]
```

```typescript
const evalTags = [
  EvalTag.create({
    type: EvalTagType.OBSERVATION_SPAN,
    value: EvalSpanKind.LLM,
    eval_name: "Custom_eval_name_entered",
    custom_eval_name: "Chunk_Attribution",
    mapping: {
      "context": "raw.input",
      "output": "raw.output"
    }
  }),
];
```

Built-in evals reference
Use the table below to pick eval_name / EvalName and build your mapping. For full details on each eval, see the Evaluation built-in evals docs.
| Eval | Mapping | Output |
|---|---|---|
| Conversation Coherence | messages | Score |
| Conversation Resolution | messages | Score |
| Content Moderation | text | Score |
| Context Adherence | context, output | Score |
| Context Relevance | context, input | Score |
| Completeness | input, output | Score |
| PII | input | Passed / Failed |
| Toxicity | input | Passed / Failed |
| Tone | input | Tone labels |
| Sexist | input | Passed / Failed |
| Prompt Injection | input | Passed / Failed |
| Prompt Instruction Adherence | output | Score 0–1 |
| Data Privacy Compliance | input | Passed / Failed |
| Is Json | text | Passed / Failed |
| One Line | text | Passed / Failed |
| Contains Valid Link | text | — |
| No Valid Links | text | Passed / Failed |
| Is Email | text | Passed / Failed |
| Summary Quality | input, output, context | Score |
| Factual Accuracy | input, output, context | Score |
| Translation Accuracy | input, output | Score |
| Cultural Sensitivity | input | Passed / Failed |
| Bias Detection | input | Passed / Failed |
| LLM Function Calling | input, output | Passed / Failed |
| Groundedness | input, output | Passed / Failed |
| Audio Transcription | input_audio, input_transcription | Score |
| Audio Quality | input_audio | Score |
| Chunk Attribution | input, output, context | Passed / Failed |
| Chunk Utilization | input, output, context | Score |
| Eval Ranking | input, context | Score |
| No Racial Bias | input | Passed / Failed |
| No Gender Bias | input | Passed / Failed |
| No Age Bias | input | Passed / Failed |
| No OpenAI Reference | input | Passed / Failed |
| No Appologies | input | Passed / Failed |
| Is Polite | input | Passed / Failed |
| Is Concise | input | Passed / Failed |
| Is Helpful | input, output | Passed / Failed |
| Is Code | input | Passed / Failed |
| Fuzzy Match | input, output | Passed / Failed |
| Answer Refusal | input, output | Passed / Failed |
| Detect Hallucination | input, output | Passed / Failed |
| No Harmful Therapeutic Guidance | input | Passed / Failed |
| Clinically Inappropriate Tone | input | Passed / Failed |
| Is Harmful Advice | input | Passed / Failed |
| Content Safety Violation | input | Passed / Failed |
| Is Good Summary | input, output | Passed / Failed |
| Is Factually Consistent | input, output | Passed / Failed |
| Is Compliant | input | Passed / Failed |
| Is Informal Tone | input | Passed / Failed |
| Evaluate Function Calling | input, output | Passed / Failed |
| Task Completion | input, output | Passed / Failed |
| Caption Hallucination | input, output | Passed / Failed |
| Bleu Score | reference, hypothesis | Score 0–1 |
| Rouge Score | reference, hypothesis | Score 0–1 |
| Text to SQL | input, output | Passed / Failed |
| Recall Score | reference, hypothesis | Score 0–1 |
| Levenshtein Similarity | response, expected_text | Similarity 0–1 |
| Numeric Similarity | response, expected_text | Similarity 0–1 |
| Embedding Similarity | response, expected_text | Score 0–1 |
| Semantic List Contains | response, expected_text | Score 0–1 |
| Is AI Generated Image | input_image | Score |
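A tag only works when its mapping covers every required key from the table above, so it can be worth checking tags before calling register(). A minimal sketch in plain Python; the REQUIRED_KEYS dict hard-codes a small excerpt of the table and is illustrative, not an SDK export:

```python
# Required mapping keys for a few built-in evals (excerpt of the table above).
REQUIRED_KEYS = {
    "Context Adherence": {"context", "output"},
    "Completeness": {"input", "output"},
    "Bleu Score": {"reference", "hypothesis"},
}

def check_mapping(eval_name: str, mapping: dict) -> list:
    """Return the required keys this mapping is missing for the given eval."""
    return sorted(REQUIRED_KEYS[eval_name] - mapping.keys())

# A mapping that forgets the "context" key:
missing = check_mapping(
    "Context Adherence",
    {"output": "llm.output_messages.0.message.content"},
)
print(missing)  # ['context']
```

An empty result means the mapping satisfies the eval's requirements; anything else names the keys you still need to map to span attributes.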