Built-in Evals
All built-in evaluation templates available on the platform.
Built-in evals are pre-configured evaluation templates you can attach to dataset runs, prompt runs, and simulations. Pick the evals you need, add them to your run, and the platform scores results automatically.
| Eval | Description | Required Inputs | Use Cases | Evaluation Method |
|---|---|---|---|---|
| Conversation Coherence | Evaluates if a conversation flows logically and maintains context throughout. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Conversation Resolution | Checks if the conversation reaches a satisfactory conclusion. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Context Adherence | Measures how well responses stay within the provided context. | output, context | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Context Relevance | Evaluates the relevancy of the context to the user query. | input, context | Text, Audio, Image, Chat, RAG & Retrieval | LLM as Judge |
| Completeness | Evaluates if the response completely answers the query. | input, output | Text, Audio, Chat, RAG & Retrieval | LLM as Judge |
| Chunk Attribution | Tracks whether each context chunk was used in generating the response. | output, context | RAG & Retrieval | LLM as Judge |
| Chunk Utilization | Measures how effectively context chunks are used in responses. | output, context | RAG & Retrieval | LLM as Judge |
| PII Detection | Detects personally identifiable information (PII) in text. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Toxicity | Evaluates content for toxic or harmful language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Tone | Analyzes the tone and sentiment of content. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Sexist | Detects sexist content and gender bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Prompt Injection | Evaluates text for potential prompt injection attempts. | input, output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Instruction Adherence | Assesses how closely the output follows prompt instructions. | input, output | Text, Audio, Chat, Hallucination | LLM as Judge |
| Data Privacy Compliance | Checks output for GDPR, HIPAA, and other privacy regulation compliance. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Groundedness | Ensures response strictly adheres to the provided context without external information. | output, context | Text, Audio, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Summary Quality | Evaluates if a summary captures main points and achieves appropriate length. | input, output | Text, Audio, Image, RAG & Retrieval | LLM as Judge |
| Translation Accuracy | Evaluates translation quality, accuracy, and cultural appropriateness. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Cultural Sensitivity | Analyzes output for cultural appropriateness and inclusive language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Bias Detection | Identifies gender, racial, cultural, or ideological bias in output. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Audio Transcription (ASR/STT) | Checks accuracy of a speech-to-text transcription against the audio source. | audio, transcription | Audio | LLM as Judge |
| Audio Quality | Evaluates the quality of audio (clarity, noise, distortion). | audio | Audio | LLM as Judge |
| No Racial Bias | Ensures output does not contain or imply racial bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Gender Bias | Checks that the response does not reinforce gender stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Age Bias | Evaluates if content is free from age-based stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No LLM Reference | Ensures output does not reference being an LLM or OpenAI model. | output | Text, Audio, Chat, Safety | LLM as Judge |
| No Apologies | Checks if the model unnecessarily apologizes. | output | Text, Audio, Chat | LLM as Judge |
| Is Polite | Ensures output maintains a respectful and non-aggressive tone. | output | Text, Audio, Chat | LLM as Judge |
| Is Concise | Measures whether the answer is brief and avoids redundancy. | output | Text, Audio, Chat | LLM as Judge |
| Is Helpful | Evaluates whether the response answers the user’s question effectively. | input, output | Text, Audio, Chat | LLM as Judge |
| Contains Code | Checks whether the output is valid code or contains expected code snippets. | output | Text | LLM as Judge |
| Fuzzy Match | Compares output with expected answer using approximate matching. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Answer Refusal | Checks if the model correctly refuses harmful or restricted queries. | input, output | Text, Audio, Chat, Safety | LLM as Judge |
| Detect Hallucination | Identifies fabricated facts not present in the input or reference. | input, output | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| No Harmful Therapeutic Guidance | Ensures the model does not provide potentially harmful psychological advice. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Clinically Inappropriate Tone | Evaluates whether tone is unsuitable for clinical or mental health contexts. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Is Harmful Advice | Detects advice that could be physically, emotionally, legally, or financially harmful. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Is Good Summary | Evaluates if a summary is clear, well-structured, and captures key points. | input, output | Text, Audio, RAG & Retrieval | LLM as Judge |
| Is Informal Tone | Detects whether the tone is casual (slang, contractions, emoji). | output | Text, Audio, Chat | LLM as Judge |
| Evaluate Function Calling | Assesses accuracy and effectiveness of LLM function calls. | output | Text | LLM as Judge |
| Task Completion | Measures whether the model fulfilled the user’s request accurately. | input, output | Text, Audio, Chat | LLM as Judge |
| Caption Hallucination | Detects hallucinated or fabricated details in image captions. | instruction, output | Image, RAG & Retrieval, Hallucination | LLM as Judge |
| Text to SQL | Evaluates the quality and correctness of text-to-SQL generation. | input, output | Text | LLM as Judge |
| Synthetic Image Evaluator | Evaluates synthetic or AI-generated images against criteria. | image, instruction | Image | LLM as Judge |
| OCR Evaluation | Evaluates the accuracy of optical character recognition (OCR) output. | input_pdf, json_content | Text, PDF / Document | LLM as Judge |
| Eval Ranking | Provides a ranking score for each context based on specified criteria. | input, context | RAG & Retrieval, Custom | LLM as Ranker |
| Is JSON | Validates if content is proper JSON format. | output | Text | Deterministic / Rule-based |
| One Line | Checks if the text is a single line. | output | Text | Deterministic / Rule-based |
| Contains Valid Link | Checks for presence of valid URLs in the output. | output | Text | Deterministic / Rule-based |
| Is Email | Validates email address format. | output | Text | Deterministic / Rule-based |
| No Invalid Links | Checks if the text contains no invalid URLs. | output | Text | Deterministic / Rule-based |
| BLEU Score | Computes BLEU score between expected answer and model output. | output, expected_response | Text | Statistical Metric |
| ROUGE Score | Calculates ROUGE score between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Levenshtein Similarity | Scores similarity between generated and reference text based on edit distance. | output, expected_response | Text | Statistical Metric |
| Numeric Similarity | Scores similarity based on the numerical difference between generated and reference values. | output, expected_response | Text | Statistical Metric |
| Embedding Similarity | Calculates semantic similarity between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Semantic List Contains | Checks if text contains phrases semantically similar to reference phrases. | output, expected_response | Text | Statistical Metric |
| Recall@K | Evaluates recall at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| Precision@K | Evaluates precision at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| NDCG@K | Calculates normalized discounted cumulative gain at K. | output, context | RAG & Retrieval | Statistical Metric |
| MRR | Calculates mean reciprocal rank for retrieval results. | output, context | RAG & Retrieval | Statistical Metric |
| Hit Rate | Measures the fraction of queries where the correct item appears in top-K results. | output, context | RAG & Retrieval | Statistical Metric |
| Customer Agent: Loop Detection | Detects if a customer agent is stuck in a loop during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Context Retention | Evaluates if the agent correctly retains context across conversation turns. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Query Handling | Assesses how effectively the agent handles customer queries. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Termination Handling | Evaluates how the agent handles conversation termination. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Interruption Handling | Checks how the agent responds to interruptions during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Conversation Quality | Evaluates the overall quality of a customer agent conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Objection Handling | Assesses how the agent handles objections raised by the customer. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Language Handling | Evaluates language consistency and appropriateness in agent responses. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Human Escalation | Checks if the agent correctly identifies when to escalate to a human. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Clarification Seeking | Evaluates if the agent appropriately seeks clarification when needed. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Prompt Conformance | Checks if agent responses conform to the defined prompt and guidelines. | system_prompt, conversation | Conversation, Chat, Audio | LLM as Judge |
| TTS Accuracy | Evaluates the accuracy and naturalness of text-to-speech output. | text, generated_audio | Audio, Conversation | LLM as Judge |
| Ground Truth Match | Checks if the output matches a provided ground truth answer. | generated_value, expected_value | Text, Audio | LLM as Judge |
| FID Score | Computes the Fréchet Inception Distance between two sets of images; lower scores indicate more similar image distributions. | real_images, fake_images | Image | Statistical Metric |
| CLIP Score | Measures how well images match their text descriptions; higher scores indicate better image-text alignment (range: 0–100). | images, text | Image | Statistical Metric |
| Image Instruction Adherence | Measures how well generated images adhere to a given text instruction across subject, style, and composition. | instruction, images | Image | LLM as Judge |
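The Deterministic / Rule-based evals in the table (Is JSON, One Line, Is Email) are ordinary validation checks. A minimal Python sketch of how such checks might work, not the platform's actual implementation:

```python
import json
import re

def is_json(output: str) -> bool:
    # Is JSON: the content must parse as valid JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def is_one_line(output: str) -> bool:
    # One Line: no newline characters after trimming surrounding whitespace.
    return "\n" not in output.strip()

def is_email(output: str) -> bool:
    # Is Email: simple format check, not full RFC 5322 validation.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", output.strip()) is not None
```

Because these checks are deterministic, they return the same verdict on every run, unlike the LLM-as-Judge evals.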
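The Levenshtein Similarity metric is commonly defined by normalizing edit distance into a 0–1 score. A sketch of one standard formulation (an assumption, not necessarily the platform's exact definition):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(output: str, expected: str) -> float:
    # Normalize: 1.0 for identical strings, 0.0 for maximally different ones.
    if not output and not expected:
        return 1.0
    dist = levenshtein_distance(output, expected)
    return 1.0 - dist / max(len(output), len(expected))
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, giving a similarity of 1 − 3/7 ≈ 0.57.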
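The retrieval metrics (Recall@K, Precision@K, MRR, Hit Rate) follow standard information-retrieval definitions. A hedged sketch assuming the set of relevant items is known for each query:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    # Recall@K: fraction of relevant items appearing in the top-K results.
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    # Precision@K: fraction of the top-K results that are relevant.
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k

def reciprocal_rank(retrieved: list, relevant: set) -> float:
    # 1/rank of the first relevant item; 0 if none is retrieved.
    for rank, item in enumerate(retrieved, 1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs: list) -> float:
    # MRR: mean reciprocal rank over (retrieved, relevant) pairs per query.
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def hit_rate(runs: list, k: int) -> float:
    # Hit Rate: fraction of queries with a relevant item in the top-K.
    return sum(any(i in rel for i in r[:k]) for r, rel in runs) / len(runs)
```

NDCG@K additionally discounts each hit by its position (log2 of the rank), so a relevant item at rank 1 counts more than the same item at rank 5.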