Built-in Evals
All built-in evaluation templates available on the platform.
Built-in evals are pre-configured evaluation templates you can attach to dataset runs, prompt runs, and simulations. Pick the evals you need, add them to your run, and the platform scores results automatically.
| Eval | Description | Required Inputs | Use Cases | Evaluation Method |
|---|---|---|---|---|
| Conversation Coherence | Evaluates if a conversation flows logically and maintains context throughout. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Conversation Resolution | Checks if the conversation reaches a satisfactory conclusion. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Context Adherence | Measures how well responses stay within the provided context. | output, context | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Context Relevance | Evaluates the relevancy of the context to the user query. | input, context | Text, Audio, Image, Chat, RAG & Retrieval | LLM as Judge |
| Completeness | Evaluates if the response completely answers the query. | input, output | Text, Audio, Chat, RAG & Retrieval | LLM as Judge |
| Chunk Attribution | Tracks whether each context chunk was used in generating the response. | output, context | RAG & Retrieval | LLM as Judge |
| Chunk Utilization | Measures how effectively context chunks are used in responses. | output, context | RAG & Retrieval | LLM as Judge |
| PII Detection | Detects personally identifiable information (PII) in text. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Toxicity | Evaluates content for toxic or harmful language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Tone | Analyzes the tone and sentiment of content. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Sexist | Detects sexist content and gender bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Prompt Injection | Evaluates text for potential prompt injection attempts. | input, output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Instruction Adherence | Assesses how closely the output follows prompt instructions. | input, output | Text, Audio, Chat, Hallucination | LLM as Judge |
| Data Privacy Compliance | Checks output for GDPR, HIPAA, and other privacy regulation compliance. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Groundedness | Ensures response strictly adheres to the provided context without external information. | output, context | Text, Audio, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| Summary Quality | Evaluates if a summary captures main points and achieves appropriate length. | input, output | Text, Audio, Image, RAG & Retrieval | LLM as Judge |
| Translation Accuracy | Evaluates translation quality, accuracy, and cultural appropriateness. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Cultural Sensitivity | Analyzes output for cultural appropriateness and inclusive language. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Bias Detection | Identifies gender, racial, cultural, or ideological bias in output. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| Audio Transcription (ASR/STT) | Checks accuracy of a speech-to-text transcription against the audio source. | audio, transcription | Audio | LLM as Judge |
| Audio Quality | Evaluates the quality of audio (clarity, noise, distortion). | audio | Audio | LLM as Judge |
| No Racial Bias | Ensures output does not contain or imply racial bias. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Gender Bias | Checks that the response does not reinforce gender stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No Age Bias | Evaluates if content is free from age-based stereotypes. | output | Text, Audio, Image, Chat, Safety | LLM as Judge |
| No LLM Reference | Ensures output does not reference being an LLM or OpenAI model. | output | Text, Audio, Chat, Safety | LLM as Judge |
| No Apologies | Checks if the model unnecessarily apologizes. | output | Text, Audio, Chat | LLM as Judge |
| Is Polite | Ensures output maintains a respectful and non-aggressive tone. | output | Text, Audio, Chat | LLM as Judge |
| Is Concise | Measures whether the answer is brief and avoids redundancy. | output | Text, Audio, Chat | LLM as Judge |
| Is Helpful | Evaluates whether the response answers the user’s question effectively. | input, output | Text, Audio, Chat | LLM as Judge |
| Contains Code | Checks whether the output is valid code or contains expected code snippets. | output | Text | LLM as Judge |
| Fuzzy Match | Compares output with expected answer using approximate matching. | output, expected_response | Text, Audio, RAG & Retrieval | LLM as Judge |
| Answer Refusal | Checks if the model correctly refuses harmful or restricted queries. | input, output | Text, Audio, Chat, Safety | LLM as Judge |
| Detect Hallucination | Identifies fabricated facts not present in the input or reference. | input, output | Text, Audio, Image, Chat, RAG & Retrieval, Hallucination | LLM as Judge |
| No Harmful Therapeutic Guidance | Ensures the model does not provide potentially harmful psychological advice. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Clinically Inappropriate Tone | Evaluates whether tone is unsuitable for clinical or mental health contexts. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Is Harmful Advice | Detects advice that could be physically, emotionally, legally, or financially harmful. | output | Text, Audio, Chat, Safety | LLM as Judge |
| Is Good Summary | Evaluates if a summary is clear, well-structured, and captures key points. | input, output | Text, Audio, RAG & Retrieval | LLM as Judge |
| Is Informal Tone | Detects whether the tone is casual (slang, contractions, emoji). | output | Text, Audio, Chat | LLM as Judge |
| Evaluate Function Calling | Assesses accuracy and effectiveness of LLM function calls. | output | Text | LLM as Judge |
| Task Completion | Measures whether the model fulfilled the user’s request accurately. | input, output | Text, Audio, Chat | LLM as Judge |
| Caption Hallucination | Detects hallucinated or fabricated details in image captions. | instruction, output | Image, RAG & Retrieval, Hallucination | LLM as Judge |
| Text to SQL | Evaluates the quality and correctness of text-to-SQL generation. | input, output | Text | LLM as Judge |
| Synthetic Image Evaluator | Evaluates synthetic or AI-generated images against criteria. | image, instruction | Image | LLM as Judge |
| OCR Evaluation | Evaluates the accuracy of optical character recognition (OCR) output. | input_pdf, json_content | Text, PDF / Document | LLM as Judge |
| Eval Ranking | Provides a ranking score for each context based on specified criteria. | input, context | RAG & Retrieval, Custom | LLM as Ranker |
| Is JSON | Validates if content is proper JSON format. | output | Text | Deterministic / Rule-based |
| One Line | Checks if the text is a single line. | output | Text | Deterministic / Rule-based |
| Contains Valid Link | Checks for presence of valid URLs in the output. | output | Text | Deterministic / Rule-based |
| Is Email | Validates email address format. | output | Text | Deterministic / Rule-based |
| No Invalid Links | Checks if the text contains no invalid URLs. | output | Text | Deterministic / Rule-based |
| BLEU Score | Computes BLEU score between expected answer and model output. | output, expected_response | Text | Statistical Metric |
| ROUGE Score | Calculates ROUGE score between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Levenshtein Similarity | Scores similarity between generated and reference text based on edit distance. | output, expected_response | Text | Statistical Metric |
| Numeric Similarity | Scores similarity based on the numerical difference between generated and reference values. | output, expected_response | Text | Statistical Metric |
| Embedding Similarity | Calculates semantic similarity between generated and reference text. | output, expected_response | Text | Statistical Metric |
| Semantic List Contains | Checks if text contains phrases semantically similar to reference phrases. | output, expected_response | Text | Statistical Metric |
| Recall@K | Evaluates recall at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| Precision@K | Evaluates precision at K for retrieval-based systems. | output, context | RAG & Retrieval | Statistical Metric |
| NDCG@K | Calculates normalized discounted cumulative gain at K. | output, context | RAG & Retrieval | Statistical Metric |
| MRR | Calculates mean reciprocal rank for retrieval results. | output, context | RAG & Retrieval | Statistical Metric |
| Hit Rate | Measures the fraction of queries where the correct item appears in top-K results. | output, context | RAG & Retrieval | Statistical Metric |
| Customer Agent: Loop Detection | Detects if a customer agent is stuck in a loop during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Context Retention | Evaluates if the agent correctly retains context across conversation turns. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Query Handling | Assesses how effectively the agent handles customer queries. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Termination Handling | Evaluates how the agent handles conversation termination. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Interruption Handling | Checks how the agent responds to interruptions during a conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Conversation Quality | Evaluates the overall quality of a customer agent conversation. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Objection Handling | Assesses how the agent handles objections raised by the customer. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Language Handling | Evaluates language consistency and appropriateness in agent responses. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Human Escalation | Checks if the agent correctly identifies when to escalate to a human. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Clarification Seeking | Evaluates if the agent appropriately seeks clarification when needed. | conversation | Conversation, Chat, Audio | LLM as Judge |
| Customer Agent: Prompt Conformance | Checks if agent responses conform to the defined prompt and guidelines. | system_prompt, conversation | Conversation, Chat, Audio | LLM as Judge |
| TTS Accuracy | Evaluates the accuracy and naturalness of text-to-speech output. | text, generated_audio | Audio, Conversation | LLM as Judge |
| Ground Truth Match | Checks if the output matches a provided ground truth answer. | generated_value, expected_value | Text, Audio | LLM as Judge |
| FID Score | Computes the Fréchet Inception Distance between two sets of images; lower scores indicate more similar image distributions. | real_images, fake_images | Image | Statistical Metric |
| CLIP Score | Measures how well images match their text descriptions; higher scores indicate better image-text alignment (range: 0–100). | images, text | Image | Statistical Metric |
| Image Instruction Adherence | Measures how well generated images adhere to a given text instruction across subject, style, and composition. | instruction, images | Image | LLM as Judge |
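The Deterministic / Rule-based evals in the table (Is JSON, One Line, Is Email) are ordinary validation checks. A minimal Python sketch of how such checks might work, not the platform's actual implementation:

```python
import json
import re

def is_json(output: str) -> bool:
    # Is JSON: the content must parse as valid JSON.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def is_one_line(output: str) -> bool:
    # One Line: no newline characters after trimming surrounding whitespace.
    return "\n" not in output.strip()

def is_email(output: str) -> bool:
    # Is Email: simple format check, not full RFC 5322 validation.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", output.strip()) is not None
```

Because these checks are deterministic, they return the same verdict on every run, unlike the LLM-as-Judge evals.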
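The Levenshtein Similarity metric is commonly defined by normalizing edit distance into a 0–1 score. A sketch of one standard formulation (an assumption, not necessarily the platform's exact definition):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(output: str, expected: str) -> float:
    # Normalize: 1.0 for identical strings, 0.0 for maximally different ones.
    if not output and not expected:
        return 1.0
    dist = levenshtein_distance(output, expected)
    return 1.0 - dist / max(len(output), len(expected))
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, giving a similarity of 1 − 3/7 ≈ 0.57.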
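The retrieval metrics (Recall@K, Precision@K, MRR, Hit Rate) follow standard information-retrieval definitions. A hedged sketch assuming the set of relevant items is known for each query:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    # Recall@K: fraction of relevant items appearing in the top-K results.
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    # Precision@K: fraction of the top-K results that are relevant.
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k

def reciprocal_rank(retrieved: list, relevant: set) -> float:
    # 1/rank of the first relevant item; 0 if none is retrieved.
    for rank, item in enumerate(retrieved, 1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs: list) -> float:
    # MRR: mean reciprocal rank over (retrieved, relevant) pairs per query.
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def hit_rate(runs: list, k: int) -> float:
    # Hit Rate: fraction of queries with a relevant item in the top-K.
    return sum(any(i in rel for i in r[:k]) for r, rel in runs) / len(runs)
```

NDCG@K additionally discounts each hit by its position (log2 of the rank), so a relevant item at rank 1 counts more than the same item at rank 5.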