Built-in Evals
All built-in evaluation templates available on the platform.
Built-in evals are pre-configured evaluation templates you can attach to dataset runs, prompt runs, and simulations. Pick the evals you need, add them to your run, and the platform scores results automatically. Evals fall into four method families: LLM as Judge, deterministic rule-based checks, custom code or external API calls, and statistical metrics.
| Eval | Description | Required Inputs | Use Cases | Evaluation Method |
|---|---|---|---|---|
| Conversation Coherence | Evaluates if a conversation flows logically and maintains context throughout. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Conversation Resolution | Checks if the conversation reaches a satisfactory conclusion. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Content Moderation | Detects harmful, unsafe, or inappropriate content using OpenAI’s moderation system. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Context Adherence | Measures how well responses stay within the provided context. | output, context | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Context Relevance | Evaluates the relevancy of the context to the user query. | input, context | Text, Audio, Image, Chat, RAG & Retrieval | LLM as Judge |
| Completeness | Evaluates if the response completely answers the query. | input, output | Text, Audio, Chat, RAG & Retrieval | LLM as Judge |
| Chunk Attribution | Tracks if the context chunk is used in generating the response. | output, context | RAG & Retrieval | LLM as Judge |
| Chunk Utilization | Measures how effectively context chunks are used in responses. | output, context | RAG & Retrieval | LLM as Judge |
| PII Detection | Detects personally identifiable information (PII) in text. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Toxicity | Evaluates content for toxic or harmful language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Tone | Analyzes the tone and sentiment of content. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Sexist | Detects sexist content and gender bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Prompt Injection | Evaluates text for potential prompt injection attempts. | input, output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Instruction Adherence | Assesses how closely the output follows prompt instructions. | input, output | Text, Audio, Chat, Hallucination | LLM as Judge |
| Data Privacy Compliance | Checks output for GDPR, HIPAA, and other privacy regulation compliance. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Groundedness | Ensures response strictly adheres to the provided context without external information. | output, context | Text, Audio, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Summary Quality | Evaluates if a summary captures main points and achieves appropriate length. | input, output | Text, Audio, Image, RAG & Retrieval | LLM as Judge |
| Prompt Adherence | Evaluates how well the response adheres to the given prompt. | input, output | Text, Audio | LLM as Judge |
| Factual Accuracy | Verifies if the output is factually correct based on input and context. | output, context | Text, Audio, RAG & Retrieval, Hallucination | LLM as Judge |
| Translation Accuracy | Evaluates translation quality, accuracy, and cultural appropriateness. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Cultural Sensitivity | Analyzes output for cultural appropriateness and inclusive language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Bias Detection | Identifies gender, racial, cultural, or ideological bias in output. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Audio Transcription (ASR/STT) | Checks accuracy of a speech-to-text transcription against the audio source. | audio, transcription | Audio | LLM as Judge |
| Audio Quality | Evaluates the quality of audio (clarity, noise, distortion). | audio | Audio | LLM as Judge |
| Protect Flash | FutureAGI’s proprietary evaluator to detect harmful content. | output | Text, Safety | LLM as Judge |
| No Racial Bias | Ensures output does not contain or imply racial bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Gender Bias | Checks the response does not reinforce gender stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Age Bias | Evaluates if content is free from age-based stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No LLM Reference | Ensures output does not reference being an LLM or OpenAI model. | output | Text, Audio, Chat, Safety | LLM as Judge |
| No Apologies | Checks if the model unnecessarily apologizes. | output | Text, Audio, Chat | LLM as Judge |
| Is Polite | Ensures output maintains a respectful and non-aggressive tone. | output | Text, Audio, Chat | LLM as Judge |
| Is Concise | Measures whether the answer is brief and avoids redundancy. | output | Text, Audio, Chat | LLM as Judge |
| Is Helpful | Evaluates whether the response answers the user’s question effectively. | input, output | Text, Audio, Chat | LLM as Judge |
| Is Code | Checks whether the output is valid code or contains expected code snippets. | output | Text | LLM as Judge |
| Fuzzy Match | Compares output with expected answer using approximate matching. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Answer Refusal | Checks if the model correctly refuses harmful or restricted queries. | input, output | Text, Audio, Chat, Safety | LLM as Judge |
| Detect Hallucination | Identifies fabricated facts not present in the input or reference. | input, output | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| No Harmful Therapeutic Guidance | Ensures the model does not provide potentially harmful psychological advice. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Clinically Inappropriate Tone | Evaluates whether tone is unsuitable for clinical or mental health contexts. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Is Harmful Advice | Detects advice that could be physically, emotionally, legally, or financially harmful. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Content Safety Violation | Broad check for toxicity, hate speech, explicit content, or violence. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Is Good Summary | Evaluates if a summary is clear, well-structured, and captures key points. | input, output | Text, Audio, RAG & Retrieval | LLM as Judge |
| Is Factually Consistent | Checks if output is factually consistent with the source or context. | output, context | Text, Audio, RAG & Retrieval, Hallucination | LLM as Judge |
| Is Compliant | Ensures output adheres to legal, regulatory, or organizational policies. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Is Informal Tone | Detects whether the tone is casual (slang, contractions, emoji). | output | Text, Audio, Chat | LLM as Judge |
| LLM Function Calling | Assesses accuracy and effectiveness of LLM function calls. | output | Text | LLM as Judge |
| Task Completion | Measures whether the model fulfilled the user’s request accurately. | input, output | Text, Audio, Chat | LLM as Judge |
| Caption Hallucination | Detects hallucinated or fabricated details in image captions. | instruction, output | Image, RAG & Retrieval, Hallucination | LLM as Judge |
| Text to SQL | Evaluates the quality and correctness of text-to-SQL generation. | input, output | Text | LLM as Judge |
| Synthetic Image Evaluator | Evaluates synthetic or AI-generated images against criteria. | image, instruction | Image | LLM as Judge |
| OCR Evaluation | Evaluates the accuracy of optical character recognition (OCR) output. | image, output | Text, PDF / Document | LLM as Judge |
| TTS Accuracy | Evaluates the accuracy and naturalness of text-to-speech output. | audio, input | Audio, Conversation | LLM as Judge |
| Ground Truth Match | Checks if the output matches a provided ground truth answer. | output, expected_response | Text, Audio | LLM as Judge |
| Eval Ranking | Provides a ranking score for each context based on specified criteria. | input, context | RAG & Retrieval, Custom | LLM as Ranker |
| Customer Agent: Loop Detection | Detects if a customer agent is stuck in a loop during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Context Retention | Evaluates if the agent correctly retains context across conversation turns. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Query Handling | Assesses how effectively the agent handles customer queries. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Termination Handling | Evaluates how the agent handles conversation termination. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Interruption Handling | Checks how the agent responds to interruptions during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Conversation Quality | Evaluates the overall quality of a customer agent conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Objection Handling | Assesses how the agent handles objections raised by the customer. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Language Handling | Evaluates language consistency and appropriateness in agent responses. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Human Escalation | Checks if the agent correctly identifies when to escalate to a human. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Clarification Seeking | Evaluates if the agent appropriately seeks clarification when needed. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Prompt Conformance | Checks if agent responses conform to the defined prompt and guidelines. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Is JSON | Validates if content is proper JSON format. | output | Text | Deterministic / Rule-based |
| Ends With | Checks if output ends with a specified string. | output | Text | Deterministic / Rule-based |
| Equals | Checks if output exactly matches an expected value. | output, expected_response | Text | Deterministic / Rule-based |
| Contains All | Checks if output contains all specified strings. | output | Text | Deterministic / Rule-based |
| Length Less Than | Checks if output length is below a specified limit. | output | Text | Deterministic / Rule-based |
| Contains None | Checks if output contains none of the specified strings. | output | Text | Deterministic / Rule-based |
| Regex | Validates output against a custom regular expression. | output | Text | Deterministic / Rule-based |
| Starts With | Checks if output starts with a specified string. | output | Text | Deterministic / Rule-based |
| Length Between | Checks if output length falls within a specified range. | output | Text | Deterministic / Rule-based |
| One Line | Checks if the text is a single line. | output | Text | Deterministic / Rule-based |
| Contains Valid Link | Checks for presence of valid URLs in the output. | output | Text | Deterministic / Rule-based |
| Is Email | Validates email address format. | output | Text | Deterministic / Rule-based |
| Length Greater Than | Checks if output length exceeds a specified minimum. | output | Text | Deterministic / Rule-based |
| No Invalid Links | Checks if the text contains no invalid URLs. | output | Text | Deterministic / Rule-based |
| Contains | Checks if output contains a specified string. | output | Text | Deterministic / Rule-based |
| Contains Any | Checks if output contains any of the specified strings. | output | Text | Deterministic / Rule-based |
| JSON Schema Validation | Validates output against a specified JSON schema. | output | Text | Deterministic / Rule-based |
| Custom Code | Runs a custom Python function to evaluate the output. | output | Text | Custom Code Function |
| API Call | Calls an external API endpoint to evaluate the output. | output | Text | External API Call |
| Answer Similarity | Measures semantic similarity between generated answer and reference. | output, expected_response | Text | Statistical Metric |
| BLEU Score | Computes BLEU score between expected answer and model output. | output, expected_response | Text | Statistical Metric |
| ROUGE Score | Calculates ROUGE score between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Recall Score | Evaluates recall of relevant information from retrieved context. | output, context | Text, RAG & Retrieval | Statistical Metric |
| Levenshtein Similarity | Calculates edit distance between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Numeric Similarity | Calculates numerical difference between generated and reference value. | output, expected_response | Text | Statistical Metric |
| Embedding Similarity | Calculates semantic similarity between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Semantic List Contains | Checks if text contains phrases semantically similar to reference phrases. | output, expected_response | Text | Statistical Metric |
| Recall@K | Evaluates recall at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| Precision@K | Evaluates precision at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| NDCG@K | Calculates normalized discounted cumulative gain at K. | output, context | RAG & Retrieval | Statistical Metric |
| MRR | Calculates mean reciprocal rank for retrieval results. | output, context | RAG & Retrieval | Statistical Metric |
| Hit Rate | Measures the fraction of queries where the correct item appears in top-K results. | output, context | RAG & Retrieval | Statistical Metric |
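The deterministic rule-based evals in the table above (Is JSON, Regex, Length Between, Contains All, and so on) are simple string checks. A minimal sketch of what each computes, assuming plain-string inputs; the function names are illustrative, not the platform's API:

```python
import json
import re

def is_json(output: str) -> bool:
    # Valid JSON parses without error.
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def matches_regex(output: str, pattern: str) -> bool:
    # Passes if the pattern occurs anywhere in the output.
    return re.search(pattern, output) is not None

def length_between(output: str, low: int, high: int) -> bool:
    # Character count within an inclusive range.
    return low <= len(output) <= high

def contains_all(output: str, required: list[str]) -> bool:
    # Every required substring must appear.
    return all(s in output for s in required)

print(is_json('{"score": 1}'))                        # True
print(matches_regex("Order #123", r"#\d+"))           # True
print(length_between("hello", 1, 10))                 # True
print(contains_all("red and blue", ["red", "blue"]))  # True
```

Because these checks are deterministic, they run fast and produce the same verdict on every invocation, which makes them a good first gate before more expensive LLM-as-judge evals.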
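The Custom Code eval runs a user-supplied Python function against the output. A hypothetical shape such a function might take; the exact signature and return format the platform expects may differ:

```python
def evaluate(output: str) -> dict:
    # Example policy: output must be at most 50 words and must not open
    # with a boilerplate "As an AI..." disclaimer. Both rules are
    # illustrative, not platform requirements.
    word_count = len(output.split())
    passed = word_count <= 50 and not output.lower().startswith("as an ai")
    return {"passed": passed, "reason": f"word_count={word_count}"}

result = evaluate("The capital of France is Paris.")
print(result["passed"])  # True
```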
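The retrieval metrics at the end of the table (Recall@K, Precision@K, MRR, Hit Rate) follow standard information-retrieval definitions. A sketch for a single query, where `retrieved` is a ranked list of document ids and `relevant` is the set of ids that should be found; MRR is the mean of `reciprocal_rank` across all queries:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    # Fraction of relevant documents that appear in the top K.
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    # Fraction of the top K that is relevant.
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def reciprocal_rank(retrieved: list, relevant: set) -> float:
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def hit_at_k(retrieved: list, relevant: set, k: int) -> float:
    # 1 if any relevant document appears in the top K, else 0.
    return 1.0 if set(retrieved[:k]) & relevant else 0.0

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3))    # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
print(hit_at_k(retrieved, relevant, 3))       # 1.0
```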