To configure evaluations for your prototype, define a list of EvalTag objects that specify which evals should be run against your model outputs.
eval_name: The evaluation to run on the spans
type: Specifies where to apply the evaluation
value: Identifies the kind of span to evaluate
config: Contains configuration parameters for the eval
mapping: Contains the mapping of the required inputs of the eval
custom_eval_name: Custom name to assign to the eval tag
The mapping attribute is a crucial component that connects eval requirements with your data. Here’s how it works:
Each eval has some required keys: Different evaluations require different inputs. For example, the Context Adherence eval requires both context and output keys.
Spans contain attributes: Your spans (like LLM spans, retriever spans, etc.) have attributes that store information as key-value pairs, also known as span attributes.
Mapping connects them: The mapping object specifies which span attribute should be used for each required key.
For example, consider a mapping in which:
The output key required by the eval uses data from the span attribute llm.output_messages.0.message.content
The context key required by the eval uses data from the span attribute llm.input_messages.1.message.content
This allows evaluations to be flexible and work with different data while maintaining consistent evaluation logic.
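To make the mapping idea concrete, here is a minimal sketch in Python. It is not the SDK's implementation: the `span_attributes` dictionary, its sample values, and the `resolve_mapping` helper are hypothetical, and only the two attribute paths are taken from the example above.

```python
# Hypothetical span attributes, flattened into dotted-path keys.
span_attributes = {
    "llm.input_messages.0.message.content": "You are a helpful assistant.",
    "llm.input_messages.1.message.content": "Paris is the capital of France.",
    "llm.output_messages.0.message.content": "The capital of France is Paris.",
}

# Mapping from the eval's required keys to span attribute paths.
mapping = {
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content",
}


def resolve_mapping(mapping, span_attributes):
    """Look up each mapped span attribute and return the eval inputs."""
    missing = [path for path in mapping.values() if path not in span_attributes]
    if missing:
        raise KeyError(f"span is missing attributes: {missing}")
    return {key: span_attributes[path] for key, path in mapping.items()}


eval_inputs = resolve_mapping(mapping, span_attributes)
print(eval_inputs["context"])
print(eval_inputs["output"])
```

The same eval logic can then run on any span, as long as the mapping points each required key at an attribute that exists on that span.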
Below is the list of evals Future AGI provides, with their corresponding mappings and configuration parameters.
Assesses whether a dialogue maintains logical flow and contextual consistency throughout all exchanges.
Mapping:
Output: Returns an output score. A higher score reflects a logically consistent and contextually relevant conversation. A lower score indicates issues like abrupt topic shifts, irrelevant responses, or loss of context.
Checks if a conversation reaches a satisfactory conclusion that addresses the user’s initial query or intent.
Mapping:
Output: Returns an output score. A higher score indicates that the conversation was resolved effectively. A lower score points to incomplete, unclear, or unresolved conversations.
Provides rule-based validation for multi-choice scenarios using predefined patterns and formats.
Config:
List[string] - A set of predefined options for multiple-choice questions
bool - Whether the input is a multiple-choice question or not
string - A rule or set of conditions that the output must meet (use {{<span_attribute>}} to access span attributes in the rule prompt)
Output: Returns one of the user-provided choices, reflecting the output’s adherence to the deterministic criteria.
Identifies and flags potentially harmful, unsafe, or prohibited content in text outputs.
Mapping:
string - The text to be evaluated for content moderation
Output: Returns a score where higher values indicate safer content and lower values indicate potentially inappropriate content.
Checks if a response stays strictly within the bounds of provided context without introducing external information.
Mapping:
string - The context provided to the AI system.
string - The output generated by the AI system.
Output: Returns a score where a higher score indicates stronger adherence to the context.
Measures how difficult or confusing a prompt might be for a language model to process.
Config:
string (default: "gpt-4o-mini")
Mapping:
string - The input to be evaluated for perplexity
Output: Returns a score where higher values indicate more difficult or confusing prompts and lower values indicate greater prompt clarity and interpretability.
Verifies that content is meaningfully related to the provided context and addresses the query appropriately.
Config:
bool (default: False) - Whether to verify against internet sources
Mapping:
string - The context provided to the AI system.
string - The input to the AI system.
Output: Returns a score where higher values indicate more relevant context.
Analyzes whether an output fully addresses all aspects of the input request or task.
Mapping:
string - The input to the AI system.
string - The output generated by the AI system.
Output: Returns a score where higher values indicate more complete content.
Compares the semantic similarity between a response and the original context using vector comparisons.
Config:
string (default: "CosineSimilarity") - The method to use for comparison
float (default: 0.5) - The threshold for the similarity score
Mapping:
string - The context provided to the AI system.
string - The output generated by the AI system.
Output:
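As a rough illustration of a cosine-similarity comparison with a threshold, the sketch below uses plain word-count vectors as a stand-in. How the actual eval builds its vectors is not stated here, and the function names and the `>=` comparison against the threshold are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def meets_threshold(context: str, output: str, threshold: float = 0.5) -> bool:
    # Assumption: a score at or above the threshold counts as similar.
    return cosine_similarity(context, output) >= threshold


score = cosine_similarity(
    "Paris is the capital of France",
    "The capital of France is Paris",
)
print(round(score, 3), meets_threshold("Paris is the capital", "Berlin hosts a film festival"))
```

Word counts ignore meaning, so paraphrases with different vocabulary score low here; an embedding-based comparison would behave differently.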
Detects Personally Identifiable Information (PII) in the response.
Mapping:
string - The input to be evaluated for PII
Output: Returns ‘Passed’ if the response does not contain any PII; otherwise returns ‘Failed’.
Detects the presence of toxic content, such as harmful or aggressive language.
Mapping:
string - The input to be evaluated for toxicity
Output: Returns either “Passed” or “Failed”, where “Passed” indicates non-toxic content and “Failed” indicates the presence of harmful or aggressive language.
Evaluates the emotional quality and overall sentiment expressed in the content.
Mapping:
string - The input to be evaluated for tone
Output: Returns a tone label, such as “neutral” or “joy”, that indicates the dominant emotional tone detected in the content.
Detects gender-biased or discriminatory language in the text.
Mapping:
string - The input to be evaluated for sexism
Output: Returns either “Passed” or “Failed”, where “Passed” indicates no sexist content detected and “Failed” indicates the presence of gender bias or discriminatory language.
Identifies attempts to manipulate the model through crafted inputs that try to override instructions.
Mapping:
string - The input to the AI system
Output: Returns ‘Passed’ if the input is not a prompt injection; otherwise returns ‘Failed’.
Ensures text is coherent and meaningful rather than random or nonsensical strings.
Mapping:
string - The output generated by the AI system.
Output: Returns a float between 0 and 1. Higher values indicate more coherent and meaningful content.
Verifies content is appropriate for professional environments, avoiding adult or offensive material.
Mapping:
string - The output generated by the AI system.
Output: Returns either “Passed” or “Failed”, where “Passed” indicates safe-for-work text and “Failed” indicates not-safe-for-work text.
Checks if outputs follow specific instructions provided in the prompt.
Mapping:
string - The output generated by the AI system.
Output: Returns a score between 0 and 1. A high score reflects strong adherence, where all prompt requirements are met, tasks are fully addressed, specified formats and constraints are followed, and both explicit and implicit instructions are properly handled. Conversely, a low score indicates significant deviations from the prompt instructions.
Ensures content adheres to data protection standards and privacy regulations.
Config:
bool (default: False) - Whether to verify against internet sources
Mapping:
string - The input to be evaluated
Output: Returns ‘Passed’ if the content adheres to data protection standards and privacy regulations; otherwise returns ‘Failed’.
Validates whether text output is properly formatted as valid JSON.
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text is valid JSON; otherwise returns ‘Failed’.
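A check of this kind can be sketched with Python's standard `json` module. The `is_valid_json` helper is hypothetical and only illustrates the pass/fail behaviour described above; whether the actual eval accepts top-level scalars such as `42` is not stated here.

```python
import json


def is_valid_json(text: str) -> str:
    """Return 'Passed' when the text parses as JSON, otherwise 'Failed'."""
    try:
        json.loads(text)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return "Failed"
    return "Passed"


print(is_valid_json('{"answer": 42}'))
print(is_valid_json("{answer: 42}"))
```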
Checks if text matches a specified regular expression pattern for format validation.
Config:
string - The regular expression pattern to check for in the text
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text matches the regular expression pattern; otherwise returns ‘Failed’.
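The following sketch shows one way such a pattern check could behave, using Python's `re` module. The `regex_eval` helper is hypothetical, and it assumes the pattern may match anywhere in the text (`re.search`); an eval that requires the whole text to match would use `re.fullmatch` instead.

```python
import re


def regex_eval(text: str, pattern: str) -> str:
    """Return 'Passed' if the pattern is found anywhere in the text."""
    return "Passed" if re.search(pattern, text) else "Failed"


print(regex_eval("Order #12345 has shipped", r"#\d{5}"))
print(regex_eval("Order has shipped", r"#\d{5}"))
```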
Invokes an external API for additional validation or data retrieval.
Config:
string (default: None, required: True) - The URL of the API to call.
dict (default: {}) - The payload to send to the API.
dict (default: {}) - The headers to send to the API.
Mapping:
string - The response from the API call. This is the data the AI system will process to construct its output.
Output: Returns ‘Passed’ if the API call succeeds and provides a valid response that aligns with the requirements; otherwise returns ‘Failed’.
Uses a specified model as the “judge” to evaluate the response.
Config:
string (default: "gpt-4o-mini") - The model to use for the evaluation
string (default: None, required: True) - The prompt to use for the evaluation (use {{<span_attribute>}} to access span attributes in the eval prompt)
string (default: "") - The system prompt to use for the evaluation (use {{<span_attribute>}} to access span attributes in the system prompt)
Output: A successful evaluation provides reasoned output based on the agent’s analysis using the configured prompts and chosen model, while a failed evaluation indicates issues with the agent’s execution or response generation.
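To illustrate the `{{<span_attribute>}}` placeholder syntax, the sketch below substitutes span attribute values into a prompt template. It is only an approximation of the idea: the `render_prompt` helper, the sample attributes, and the choice to raise an error on unknown placeholders are all assumptions, not documented platform behaviour.

```python
import re

# Matches {{ some.attribute.path }} with optional surrounding whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def render_prompt(template: str, span_attributes: dict) -> str:
    """Replace each {{attribute}} placeholder with its span attribute value."""

    def substitute(match):
        key = match.group(1)
        if key not in span_attributes:
            raise KeyError(f"unknown span attribute in prompt: {key}")
        return str(span_attributes[key])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical span attributes for the example.
attributes = {"llm.output_messages.0.message.content": "Sure, happy to help!"}
prompt = "Is this reply polite? Reply: {{llm.output_messages.0.message.content}}"
print(render_prompt(prompt, attributes))
```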
Checks that the entire response is contained in a single line.
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text is a single line; otherwise returns ‘Failed’.
Checks if the response length is under a specified limit.
Config:
int (default: 200)
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text length is less than max_length; otherwise returns ‘Failed’.
Ensures the response meets a minimum length.
Config:
int (default: 50)
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text length is greater than min_length; otherwise returns ‘Failed’.
Validates the response length is within a certain range.
Config:
int (default: 200)
int (default: 50)
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text length is between min_length and max_length; otherwise returns ‘Failed’.
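The three length checks above can be sketched as follows. The function names are hypothetical, and the sketch assumes length is counted in characters, that the defaults 200 and 50 correspond to max_length and min_length, and that the range check includes its endpoints; none of these details is confirmed by the text.

```python
def length_less_than(text: str, max_length: int = 200) -> str:
    """'Passed' when the text is shorter than max_length characters."""
    return "Passed" if len(text) < max_length else "Failed"


def length_greater_than(text: str, min_length: int = 50) -> str:
    """'Passed' when the text is longer than min_length characters."""
    return "Passed" if len(text) > min_length else "Failed"


def length_between(text: str, min_length: int = 50, max_length: int = 200) -> str:
    """'Passed' when the length falls within [min_length, max_length]."""
    return "Passed" if min_length <= len(text) <= max_length else "Failed"


reply = "x" * 120
print(length_less_than(reply), length_greater_than(reply), length_between(reply))
```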
Confirms if the response contains at least one valid URL.
Mapping:
Checks that no valid links are present in the response.
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text does not contain any valid hyperlinks; otherwise returns ‘Failed’.
Checks if the response is a properly formatted email address.
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text is a properly formatted email address; otherwise returns ‘Failed’.
Validates that the response contains a required keyword.
Config:
bool (default: True)
string - The keyword to check for in the text
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text contains the keyword; otherwise returns ‘Failed’.
Checks that the response contains at least one of the specified keywords.
Config:
bool (default: True)
list (default: []) - The list of keywords to check for in the text
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text contains any of the keywords; otherwise returns ‘Failed’.
Validates that the response contains all specified keywords.
Config:
bool (default: True)
list (default: []) - The list of keywords to check for in the text
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text contains all of the keywords; otherwise returns ‘Failed’.
Ensures the response does not contain any of the specified keywords.
Config:
bool (default: True)
list (default: []) - The list of forbidden keywords to check for in the text
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text does not contain any of the forbidden keywords; otherwise returns ‘Failed’.
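The keyword checks above differ only in how they combine per-keyword matches, which the sketch below makes explicit. All function names are hypothetical. The `case_sensitive` flag is a guess at what the unnamed bool option (default: True) controls and should be treated as an assumption, as should the use of plain substring matching.

```python
def _normalize(value: str, case_sensitive: bool) -> str:
    # Lower-case both sides when the comparison should ignore case.
    return value if case_sensitive else value.lower()


def _hits(text, keywords, case_sensitive):
    haystack = _normalize(text, case_sensitive)
    return [_normalize(k, case_sensitive) in haystack for k in keywords]


def contains_any(text, keywords, case_sensitive=True) -> str:
    return "Passed" if any(_hits(text, keywords, case_sensitive)) else "Failed"


def contains_all(text, keywords, case_sensitive=True) -> str:
    return "Passed" if all(_hits(text, keywords, case_sensitive)) else "Failed"


def contains_none(text, keywords, case_sensitive=True) -> str:
    return "Failed" if any(_hits(text, keywords, case_sensitive)) else "Passed"


reply = "The quick brown fox"
print(contains_any(reply, ["fox", "dog"]))
print(contains_all(reply, ["fox", "dog"]))
print(contains_none(reply, ["cat"]))
```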
Checks if the response starts with a specified substring.
Config:
string (default: None, required: True) - The substring to check for in the text
bool (default: True)
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text starts with the substring; otherwise returns ‘Failed’.
Validates whether the response ends with a specified substring.
Config:
bool (default: True)
string (default: None, required: True) - The substring to check for in the text
Mapping:
string - The text to be evaluated
Output: Returns ‘Passed’ if the text ends with the substring; otherwise returns ‘Failed’.
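A prefix or suffix check like the two above maps directly onto Python's `str.startswith` and `str.endswith`. The helper names are hypothetical, and `case_sensitive` is again an assumed meaning for the unnamed bool option.

```python
def starts_with(text: str, substring: str, case_sensitive: bool = True) -> str:
    """'Passed' when the text begins with the given substring."""
    if not case_sensitive:
        text, substring = text.lower(), substring.lower()
    return "Passed" if text.startswith(substring) else "Failed"


def ends_with(text: str, substring: str, case_sensitive: bool = True) -> str:
    """'Passed' when the text finishes with the given substring."""
    if not case_sensitive:
        text, substring = text.lower(), substring.lower()
    return "Passed" if text.endswith(substring) else "Failed"


print(starts_with("Dear customer, thank you.", "Dear"))
print(ends_with("Dear customer, thank you.", "Regards"))
```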
Checks if the response exactly matches the expected content.
Config:
bool (default: True)
Mapping:
string - The text to be evaluated
string - The expected text to be compared against
Output: Returns ‘Passed’ if the text exactly matches the expected text; otherwise returns ‘Failed’.
Measures similarity between a candidate answer and an expected answer.
Config:
string (default: "CosineSimilarity") - The method to use for comparison
float (default: 0.5) - The threshold for the similarity score
Mapping:
string - The expected answer
string - The output generated by the AI system
Output:
Generic evaluation of the final output for compliance with instructions.
Config:
bool (default: False)
string
Mapping:
Checks if the retrieved context is relevant and sufficient.
Config:
string - Description of the criteria for evaluation
Mapping:
string - The input to be evaluated
string - The output generated by the AI system
string - The context provided to the AI system
Output: Returns a score where higher values indicate better context retrieval quality.
Evaluates compliance with instructions for image-based tasks.
Config:
string - The evaluation standard that defines how the alignment is measured
Mapping:
string - The input to be evaluated
string - The URL of the image to be evaluated
Output: Returns a score based on the linkage analysis, which is compared against predefined criteria to determine whether the image meets the expected standards.
Checks if a summary captures the main points accurately and succinctly.
Config:
bool (default: False) - Whether to verify against internet sources
Mapping:
string - The input to the AI system
string - The output generated by the AI system
string - The context provided to the AI system
Output: Returns a score where a higher score indicates better summary quality.
Determines if the response is factually correct based on the context provided.
Config:
string
bool (default: False)
Mapping:
string - The input provided to the AI system
string - The output generated by the AI system
string - The context provided to the AI system
Output: Returns a score where a higher score indicates greater factual accuracy.
Evaluates the accuracy of translated content.
Config:
bool (default: False)
string
Mapping:
string - The input to the AI system
string - The translated output generated by the AI system
Output: Returns a score where a higher score indicates greater translation accuracy.
Assesses the given text for inclusivity and cultural awareness.
Mapping:
string - The input to be evaluated for cultural sensitivity
Output: Returns either “Passed” or “Failed”, where “Passed” indicates culturally appropriate content and “Failed” indicates potential cultural insensitivity.
Detects the presence of bias or unfairness in the text.
Mapping:
string - The input to be evaluated for bias
Output: Returns either “Passed” or “Failed”, where “Passed” indicates neutral content and “Failed” indicates the presence of bias.
Checks if the output properly uses a function/tool call with correct parameters.
Config:
string - The criteria for the evaluation
Mapping:
string - The input to the AI system
string - The output generated by the AI system
Output: Returns ‘Passed’ if the output properly uses a function/tool call with correct parameters; otherwise returns ‘Failed’.
Evaluates if the response is grounded in the provided context.
Mapping:
string - The input to the AI system
string - The output generated by the AI system
Output: Returns ‘Passed’ if the output is grounded according to the input provided; otherwise returns ‘Failed’.
Scores linkage between instruction, input images, and output images.
Config:
string - The criteria for the evaluation
Mapping:
string - The input to the AI system
string - The output generated by the AI system
list - The output generated by the AI system
Output: Returns a score based on the criteria provided.
Analyzes the transcription accuracy of the given audio and its transcription.
Config:
string - The criteria for the evaluation
Mapping:
string - The URL of the audio to be evaluated
string - The output generated by the AI system
Output: Returns a score based on the criteria provided.
Evaluates the audio based on the description of the given audio.
Config:
string - The criteria for the evaluation
string - The model to be used for the evaluation
Mapping:
string - The URL of the audio to be evaluated
Output: Returns Passed or Failed based on the criteria provided.
Evaluates the quality of the given audio.
Config:
string - The criteria for the evaluation
string - The model to be used for the evaluation
Mapping:
string - The URL of the audio to be evaluated
Output: Returns a score based on the criteria provided.
Validates JSON against specified criteria.
Config:
list - The criteria for the evaluation
Mapping:
string - The input to the AI system
string - The output generated by the AI system
Output: Returns Passed or Failed based on the validations provided.
Tracks if the context chunk is used in generating the response.
Mapping:
string - The input to the AI system
string - The output generated by the AI system
string - The context provided to the AI system
Output: Returns Passed or Failed based on the input, output, and context.
Measures how effectively context chunks are used in responses.
Mapping:
string - The input to the AI system
string - The output generated by the AI system
string - The context provided to the AI system
Output: Returns a score based on the criteria provided.
Provides a ranking score for each context based on specified criteria.
Config:
string - The criteria for the evaluation
Mapping:
string - The input to the AI system
string - The output generated by the AI system
Output: Returns a score based on the criteria provided.
Executes custom Python code for evaluation.
Config:
string - The criteria for the evaluation
Output: Returns Passed or Failed based on the code provided.