Built-in Evals

All built-in evaluation templates available on the platform.

Built-in evals are pre-configured evaluation templates you can attach to dataset runs, prompt runs, and simulations. Pick the evals you need, add them to your run, and the platform scores results automatically.
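Each eval expects specific fields, listed under Required Inputs in the table. Before attaching evals to a run, it can help to confirm that a dataset row supplies those fields. The following is a minimal local sketch, not the platform's SDK: `REQUIRED_INPUTS` and `missing_inputs` are hypothetical names, and the mapping copies only a few rows from the table for illustration.

```python
# Hypothetical helper, not part of any platform SDK.
# The mapping copies a few rows from the "Required Inputs" column of the table.
REQUIRED_INPUTS = {
    "Groundedness": {"output", "context"},
    "Completeness": {"input", "output"},
    "Fuzzy Match": {"output", "expected_response"},
    "Toxicity": {"output"},
}


def missing_inputs(row, evals):
    """Return {eval_name: sorted missing field names} for evals the row cannot satisfy."""
    gaps = {}
    for name in evals:
        missing = REQUIRED_INPUTS[name] - set(row)
        if missing:
            gaps[name] = sorted(missing)
    return gaps


# Example row with only "input" and "output" fields.
row = {"input": "What is the refund window?", "output": "30 days."}
print(missing_inputs(row, ["Completeness", "Groundedness", "Toxicity"]))
```

An empty result means every selected eval has the fields it needs for that row.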


| Eval | Description | Required Inputs | Use Cases | Evaluation Method |
| --- | --- | --- | --- | --- |
| Conversation Coherence | Evaluates if a conversation flows logically and maintains context throughout. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Conversation Resolution | Checks if the conversation reaches a satisfactory conclusion. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Content Moderation | Detects harmful, unsafe, or inappropriate content using OpenAI’s moderation system. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Context Adherence | Measures how well responses stay within the provided context. | output, context | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Context Relevance | Evaluates the relevancy of the context to the user query. | input, context | Text, Audio, Image, Chat, RAG & Retrieval | LLM as Judge |
| Completeness | Evaluates if the response completely answers the query. | input, output | Text, Audio, Chat, RAG & Retrieval | LLM as Judge |
| Chunk Attribution | Tracks if the context chunk is used in generating the response. | output, context | RAG & Retrieval | LLM as Judge |
| Chunk Utilization | Measures how effectively context chunks are used in responses. | output, context | RAG & Retrieval | LLM as Judge |
| PII Detection | Detects personally identifiable information (PII) in text. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Toxicity | Evaluates content for toxic or harmful language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Tone | Analyzes the tone and sentiment of content. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Sexist | Detects sexist content and gender bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Prompt Injection | Evaluates text for potential prompt injection attempts. | input, output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Instruction Adherence | Assesses how closely the output follows prompt instructions. | input, output | Text, Audio, Chat, Hallucination | LLM as Judge |
| Data Privacy Compliance | Checks output for GDPR, HIPAA, and other privacy regulation compliance. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Groundedness | Ensures response strictly adheres to the provided context without external information. | output, context | Text, Audio, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Summary Quality | Evaluates if a summary captures main points and achieves appropriate length. | input, output | Text, Audio, Image, RAG & Retrieval | LLM as Judge |
| Prompt Adherence | Evaluates how well the response adheres to the given prompt. | input, output | Text, Audio | LLM as Judge |
| Factual Accuracy | Verifies if the output is factually correct based on input and context. | output, context | Text, Audio, RAG & Retrieval, Hallucination | LLM as Judge |
| Translation Accuracy | Evaluates translation quality, accuracy, and cultural appropriateness. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Cultural Sensitivity | Analyzes output for cultural appropriateness and inclusive language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Bias Detection | Identifies gender, racial, cultural, or ideological bias in output. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Audio Transcription (ASR/STT) | Checks accuracy of a speech-to-text transcription against the audio source. | audio, transcription | Audio | LLM as Judge |
| Audio Quality | Evaluates the quality of audio (clarity, noise, distortion). | audio | Audio | LLM as Judge |
| Protect Flash | FutureAGI’s proprietary evaluator to detect harmful content. | output | Text, Safety | LLM as Judge |
| No Racial Bias | Ensures output does not contain or imply racial bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Gender Bias | Checks the response does not reinforce gender stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Age Bias | Evaluates if content is free from age-based stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No LLM Reference | Ensures output does not reference being an LLM or OpenAI model. | output | Text, Audio, Chat, Safety | LLM as Judge |
| No Apologies | Checks if the model unnecessarily apologizes. | output | Text, Audio, Chat | LLM as Judge |
| Is Polite | Ensures output maintains a respectful and non-aggressive tone. | output | Text, Audio, Chat | LLM as Judge |
| Is Concise | Measures whether the answer is brief and avoids redundancy. | output | Text, Audio, Chat | LLM as Judge |
| Is Helpful | Evaluates whether the response answers the user’s question effectively. | input, output | Text, Audio, Chat | LLM as Judge |
| Is Code | Checks whether the output is valid code or contains expected code snippets. | output | Text | LLM as Judge |
| Fuzzy Match | Compares output with expected answer using approximate matching. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Answer Refusal | Checks if the model correctly refuses harmful or restricted queries. | input, output | Text, Audio, Chat, Safety | LLM as Judge |
| Detect Hallucination | Identifies fabricated facts not present in the input or reference. | input, output | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| No Harmful Therapeutic Guidance | Ensures the model does not provide potentially harmful psychological advice. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Clinically Inappropriate Tone | Evaluates whether tone is unsuitable for clinical or mental health contexts. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Is Harmful Advice | Detects advice that could be physically, emotionally, legally, or financially harmful. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Content Safety Violation | Broad check for toxicity, hate speech, explicit content, or violence. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Is Good Summary | Evaluates if a summary is clear, well-structured, and captures key points. | input, output | Text, Audio, RAG & Retrieval | LLM as Judge |
| Is Factually Consistent | Checks if output is factually consistent with the source or context. | output, context | Text, Audio, RAG & Retrieval, Hallucination | LLM as Judge |
| Is Compliant | Ensures output adheres to legal, regulatory, or organizational policies. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Is Informal Tone | Detects whether the tone is casual (slang, contractions, emoji). | output | Text, Audio, Chat | LLM as Judge |
| LLM Function Calling | Assesses accuracy and effectiveness of LLM function calls. | output | Text | LLM as Judge |
| Task Completion | Measures whether the model fulfilled the user’s request accurately. | input, output | Text, Audio, Chat | LLM as Judge |
| Caption Hallucination | Detects hallucinated or fabricated details in image captions. | instruction, output | Image, RAG & Retrieval, Hallucination | LLM as Judge |
| Text to SQL | Evaluates the quality and correctness of text-to-SQL generation. | input, output | Text | LLM as Judge |
| Synthetic Image Evaluator | Evaluates synthetic or AI-generated images against criteria. | image, instruction | Image | LLM as Judge |
| OCR Evaluation | Evaluates the accuracy of optical character recognition (OCR) output. | image, output | Text, PDF / Document | LLM as Judge |
| TTS Accuracy | Evaluates the accuracy and naturalness of text-to-speech output. | audio, input | Audio, Conversation | LLM as Judge |
| Ground Truth Match | Checks if the output matches a provided ground truth answer. | output, expected_response | Text, Audio | LLM as Judge |
| Eval Ranking | Provides a ranking score for each context based on specified criteria. | input, context | RAG & Retrieval, Custom | LLM as Ranker |
| Customer Agent: Loop Detection | Detects if a customer agent is stuck in a loop during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Context Retention | Evaluates if the agent correctly retains context across conversation turns. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Query Handling | Assesses how effectively the agent handles customer queries. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Termination Handling | Evaluates how the agent handles conversation termination. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Interruption Handling | Checks how the agent responds to interruptions during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Conversation Quality | Evaluates the overall quality of a customer agent conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Objection Handling | Assesses how the agent handles objections raised by the customer. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Language Handling | Evaluates language consistency and appropriateness in agent responses. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Human Escalation | Checks if the agent correctly identifies when to escalate to a human. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Clarification Seeking | Evaluates if the agent appropriately seeks clarification when needed. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Prompt Conformance | Checks if agent responses conform to the defined prompt and guidelines. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Is JSON | Validates if content is proper JSON format. | output | Text | Deterministic / Rule-based |
| Ends With | Checks if output ends with a specified string. | output | Text | Deterministic / Rule-based |
| Equals | Checks if output exactly matches an expected value. | output, expected_response | Text | Deterministic / Rule-based |
| Contains All | Checks if output contains all specified strings. | output | Text | Deterministic / Rule-based |
| Length Less Than | Checks if output length is below a specified limit. | output | Text | Deterministic / Rule-based |
| Contains None | Checks if output contains none of the specified strings. | output | Text | Deterministic / Rule-based |
| Regex | Validates output against a custom regular expression. | output | Text | Deterministic / Rule-based |
| Starts With | Checks if output starts with a specified string. | output | Text | Deterministic / Rule-based |
| Length Between | Checks if output length falls within a specified range. | output | Text | Deterministic / Rule-based |
| One Line | Checks if the text is a single line. | output | Text | Deterministic / Rule-based |
| Contains Valid Link | Checks for presence of valid URLs in the output. | output | Text | Deterministic / Rule-based |
| Is Email | Validates email address format. | output | Text | Deterministic / Rule-based |
| Length Greater Than | Checks if output length exceeds a specified minimum. | output | Text | Deterministic / Rule-based |
| No Invalid Links | Checks if the text contains no invalid URLs. | output | Text | Deterministic / Rule-based |
| Contains | Checks if output contains a specified string. | output | Text | Deterministic / Rule-based |
| Contains Any | Checks if output contains any of the specified strings. | output | Text | Deterministic / Rule-based |
| JSON Schema Validation | Validates output against a specified JSON schema. | output | Text | Deterministic / Rule-based |
| Custom Code | Runs a custom Python function to evaluate the output. | output | Text | Custom Code Function |
| API Call | Calls an external API endpoint to evaluate the output. | output | Text | External API Call |
| Answer Similarity | Measures semantic similarity between generated answer and reference. | output, expected_response | Text | Statistical Metric |
| BLEU Score | Computes BLEU score between expected answer and model output. | output, expected_response | Text | Statistical Metric |
| ROUGE Score | Calculates ROUGE score between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Recall Score | Evaluates recall of relevant information from retrieved context. | output, context | Text, RAG & Retrieval | Statistical Metric |
| Levenshtein Similarity | Calculates edit distance between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Numeric Similarity | Calculates numerical difference between generated and reference value. | output, expected_response | Text | Statistical Metric |
| Embedding Similarity | Calculates semantic similarity between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Semantic List Contains | Checks if text contains phrases semantically similar to reference phrases. | output, expected_response | Text | Statistical Metric |
| Recall@K | Evaluates recall at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| Precision@K | Evaluates precision at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| NDCG@K | Calculates normalized discounted cumulative gain at K. | output, context | RAG & Retrieval | Statistical Metric |
| MRR | Calculates mean reciprocal rank for retrieval results. | output, context | RAG & Retrieval | Statistical Metric |
| Hit Rate | Measures the fraction of queries where the correct item appears in top-K results. | output, context | RAG & Retrieval | Statistical Metric |
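
The retrieval metrics in the last rows (Recall@K, Precision@K, NDCG@K, MRR, Hit Rate) have standard textbook definitions. The sketch below illustrates those definitions under two assumptions: relevance is binary, and each query is a ranked list of retrieved IDs plus a set of relevant IDs. The function names are hypothetical, and the platform's own implementation (input format, tie handling, normalization) may differ.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant (k must be > 0)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def hit_rate(queries, k):
    """Fraction of (ranked, relevant) pairs with a relevant item in the top k."""
    hits = sum(1 for r, rel in queries if any(item in rel for item in r[:k]))
    return hits / len(queries)


# Two example queries: the first has a relevant document at rank 2,
# the second retrieves nothing relevant.
queries = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["a", "b"], {"z"}),
]
print(precision_at_k(*queries[0], k=2), recall_at_k(*queries[0], k=2))
print(mean_reciprocal_rank(queries), hit_rate(queries, k=2))
```

Precision@K here divides by K even when fewer than K items are retrieved. That is one common convention; dividing by the number of items actually returned is another.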