Evals for Prototype
To configure evaluations for your prototype, define a list of EvalTag objects that specify which evals should be run against your model outputs.
- `eval_name`: The evaluation to run on the spans
- `type`: Specifies where to apply the evaluation
- `value`: Identifies the kind of span to evaluate
- `config`: Contains configuration parameters for the eval
- `mapping`: Maps the eval's required inputs to span attributes. Learn more →
- `custom_eval_name`: Custom name to assign to the eval tag
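For instance, a tag that runs the Toxicity eval on LLM spans might look like the sketch below. The import path and enum members are assumptions based on the fields above; check your installed SDK for the exact names.

```python
# Minimal sketch of an eval tag list. The import path and enum members
# (EvalName, EvalTagType, EvalSpanKind) are assumptions; verify them
# against your installed SDK before use.
from fi_instrumentation.fi_types import EvalName, EvalSpanKind, EvalTag, EvalTagType

eval_tags = [
    EvalTag(
        eval_name=EvalName.TOXICITY,          # which eval to run
        type=EvalTagType.OBSERVATION_SPAN,    # where to apply the evaluation
        value=EvalSpanKind.LLM,               # the kind of span to evaluate
        config={},                            # eval-specific parameters (see below)
        mapping={                             # eval input -> span attribute
            "input": "llm.output_messages.0.message.content",
        },
        custom_eval_name="toxicity_on_llm_output",
    )
]
```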
Understanding the Mapping Attribute
The `mapping` attribute is a crucial component that connects eval requirements with your data. Here's how it works:
- Each eval has required keys: Different evaluations require different inputs. For example, the Context Adherence eval requires both the `context` and `output` keys.
- Spans contain attributes: Your spans (LLM spans, retriever spans, etc.) have attributes that store information as key-value pairs, also known as span attributes.
- Mapping connects them: The mapping object specifies which span attribute should be used for each required key.
For example, in this mapping:
- The `output` key required by the eval will use data from the span attribute `llm.output_messages.0.message.content`
- The `context` input will use data from the span attribute `llm.input_messages.1.message.content`
This allows evaluations to be flexible and work with different data while maintaining consistent evaluation logic.
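As a concrete sketch, the mapping above is passed as a plain dictionary, with eval keys on the left and span attributes on the right (substitute your own spans' attribute paths):

```python
# The mapping from the example above.
mapping = {
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content",
}
```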
Below is the list of evals Future AGI provides, along with their corresponding mappings and configuration parameters.
1. Conversation Coherence
Assesses whether a dialogue maintains logical flow and contextual consistency throughout all exchanges. Learn more →
Mapping:
- `messages`: Data that contains the complete conversation, represented as an array of user and assistant messages
Output: Returns an output score. A higher score reflects a logically consistent and contextually relevant conversation. A lower score indicates issues like abrupt topic shifts, irrelevant responses, or loss of context.
2. Conversation Resolution
Checks if a conversation reaches a satisfactory conclusion that addresses the user’s initial query or intent. Learn more →
Mapping:
- `messages`: Data that contains the complete conversation, represented as an array of user and assistant messages
Output: Returns an output score. A higher score indicates that the conversation was resolved effectively. A lower score points to incomplete, unclear, or unresolved conversations.
3. Deterministic Evals
Provides rule-based validation for multi-choice scenarios using predefined patterns and formats. Learn more →
Config:
- `choices`: `List[string]` - A set of predefined options for multiple-choice questions
- `multi_choice`: `bool` - Whether the input is a multiple-choice question
- `rule_prompt`: `string` - A rule or set of conditions that the output must meet (use `{{<span_attribute>}}` to access span attributes in the rule prompt)
Output: Returns one of the user-provided choices, reflecting the output's adherence to the deterministic criteria.
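A hypothetical `config` for this eval is sketched below; the rule prompt pulls the model output directly from a span attribute via the `{{<span_attribute>}}` syntax, and `choices` defines the values the eval may return. The attribute path is illustrative.

```python
# Hypothetical Deterministic Eval config; attribute path is illustrative.
config = {
    "choices": ["Yes", "No"],
    "multi_choice": False,
    "rule_prompt": (
        "Answer Yes if {{llm.output_messages.0.message.content}} "
        "is written in English, otherwise answer No."
    ),
}
```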
4. Content Moderation
Identifies and flags potentially harmful, unsafe, or prohibited content in text outputs. Learn more →
Mapping:
- `text`: `string` - The text to be evaluated for content moderation
Output: Returns a score where higher values indicate safer content and lower values indicate potentially inappropriate content.
5. Context Adherence
Checks if a response stays strictly within the bounds of provided context without introducing external information. Learn more →
Mapping:
- `context`: `string` - The context provided to the AI system
- `output`: `string` - The output generated by the AI system
Output: Returns a score where a higher score indicates stronger adherence to the context.
6. Prompt Perplexity
Measures how difficult or confusing a prompt might be for a language model to process. Learn more →
Config:
- `model`: `string` (default: `"gpt-4o-mini"`)
Mapping:
- `input`: `string` - The input to be evaluated for perplexity
Output: Returns a score where higher values indicate more difficult or confusing prompts and lower values indicate greater prompt clarity and interpretability.
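A minimal sketch of the corresponding `config` and `mapping` (the span attribute path is illustrative):

```python
config = {"model": "gpt-4o-mini"}  # override or keep the default model
mapping = {"input": "llm.input_messages.0.message.content"}  # the prompt to score
```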
7. Context Relevance
Verifies that content is meaningfully related to the provided context and addresses the query appropriately. Learn more →
Config:
- `check_internet`: `bool` (default: `False`) - Whether to verify against internet sources
Mapping:
- `context`: `string` - The context provided to the AI system
- `input`: `string` - The input to the AI system
Output: Returns a score where higher values indicate more relevant context.
8. Completeness
Analyzes whether an output fully addresses all aspects of the input request or task. Learn more →
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The output generated by the AI system
Output: Returns a score where higher values indicate more complete content.
9. Context Similarity
Compares the semantic similarity between a response and the original context using vector comparisons. Learn more →
Config:
- `comparator`: `string` (default: `"CosineSimilarity"`) - The method to use for comparison
- `failure_threshold`: `float` (default: `0.5`) - The threshold for the similarity score
Mapping:
- `context`: `string` - The context provided to the AI system
- `response`: `string` - The output generated by the AI system
Output: Returns a score where values greater than or equal to the failure threshold indicate sufficient context similarity.
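A sketch of a stricter-than-default setup, assuming the retrieved context is stored on a retriever span attribute (both paths are illustrative):

```python
config = {
    "comparator": "CosineSimilarity",  # default comparison method
    "failure_threshold": 0.7,          # stricter than the 0.5 default
}
mapping = {
    "context": "retrieval.documents.0.document.content",  # illustrative path
    "response": "llm.output_messages.0.message.content",
}
```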
10. PII
Detects Personally Identifiable Information in the response. Learn more →
Mapping:
- `input`: `string` - The input to be evaluated for PII
Output: Returns "Passed" if the response does not contain any PII, else returns "Failed".
11. Toxicity
Detects the presence of toxic, harmful, or aggressive language in the text. Learn more →
Mapping:
- `input`: `string` - The input to be evaluated for toxicity
Output: Returns either "Passed" or "Failed", where "Passed" indicates non-toxic content and "Failed" indicates the presence of harmful or aggressive language.
12. Tone
Evaluates the emotional quality and overall sentiment expressed in the content. Learn more →
Mapping:
- `input`: `string` - The input to be evaluated for tone
Output: Returns a tone label, such as "neutral" or "joy", indicating the dominant emotional tone detected in the content.
13. Sexist
Detects gender-biased or discriminatory language in the text. Learn more →
Mapping:
- `input`: `string` - The input to be evaluated for sexism
Output: Returns either "Passed" or "Failed", where "Passed" indicates no sexist content detected and "Failed" indicates the presence of gender bias or discriminatory language.
14. Prompt Injection
Identifies attempts to manipulate the model through crafted inputs that try to override instructions. Learn more →
Mapping:
- `input`: `string` - The input to the AI system
Output: Returns "Passed" if the input is not a prompt injection, else returns "Failed".
15. Not Gibberish Text
Ensures text is coherent and meaningful rather than random or nonsensical strings. Learn more →
Mapping:
- `response`: `string` - The output generated by the AI system
Output: Returns a float between 0 and 1, where higher values indicate more coherent and meaningful content.
16. Safe For Work Text
Verifies content is appropriate for professional environments, avoiding adult or offensive material. Learn more →
Mapping:
- `response`: `string` - The output generated by the AI system
Output: Returns either "Passed" or "Failed", where "Passed" indicates safe-for-work text and "Failed" indicates not-safe-for-work text.
17. Prompt Instruction Adherence
Checks if outputs follow specific instructions provided in the prompt. Learn more →
Mapping:
- `output`: `string` - The output generated by the AI system
Output: Returns a score between 0 and 1. A high score reflects strong adherence, where all prompt requirements are met, tasks are fully addressed, specified formats and constraints are followed, and both explicit and implicit instructions are properly handled. Conversely, a low score indicates significant deviations from the prompt instructions.
18. Data Privacy Compliance
Ensures content adheres to data protection standards and privacy regulations. Learn more →
Config:
- `check_internet`: `bool` (default: `False`) - Whether to verify against internet sources
Mapping:
- `input`: `string` - The input to be evaluated
Output: Returns "Passed" if the content adheres to data protection standards and privacy regulations, else returns "Failed".
19. Is Json
Validates whether text output is properly formatted as valid JSON. Learn more →
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text is valid JSON, else returns "Failed".
20. Regex
Checks if text matches a specified regular expression pattern for format validation. Learn more →
Config:
- `pattern`: `string` - The regular expression pattern to check for in the text
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text matches the regular expression pattern, else returns "Failed".
21. Api Call
Invokes an external API for additional validation or data retrieval. Learn more →
Config:
- `url`: `string` (default: `None`, required: `True`) - The URL of the API to call
- `payload`: `dict` (default: `{}`) - The payload to send to the API
- `headers`: `dict` (default: `{}`) - The headers to send to the API
Mapping:
- `response`: `string` - The response from the API call. This is the data the AI system will process to construct its output.
Output: Returns "Passed" if the API call succeeds with a valid response that meets the requirements, else returns "Failed".
22. Agent As Judge
Uses a specified model as the “judge” to evaluate the response. Learn more →
Config:
- `model`: `string` (default: `"gpt-4o-mini"`) - The model to use for the evaluation
- `eval_prompt`: `string` (default: `None`, required: `True`) - The prompt to use for the evaluation (use `{{<span_attribute>}}` to access span attributes in the eval prompt)
- `system_prompt`: `string` (default: `""`) - The system prompt to use for the evaluation (use `{{<span_attribute>}}` to access span attributes in the system prompt)
Output: A successful evaluation provides reasoned output based on the agent’s analysis using the configured prompts and chosen model, while a failed evaluation indicates issues with the agent’s execution or response generation.
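For example, a judge that rates politeness, with the model output pulled into the prompt via the templating syntax (prompts and attribute path are illustrative):

```python
config = {
    "model": "gpt-4o-mini",
    "eval_prompt": (
        "Rate from 1 to 5 how politely this response addresses the user: "
        "{{llm.output_messages.0.message.content}}"
    ),
    "system_prompt": "You are a strict evaluator. Reply with a single number.",
}
```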
23. One Line
Checks that the entire response is contained in a single line. Learn more →
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text is a single line, else returns "Failed".
24. Length Less Than
Checks if the response length is under a specified limit. Learn more →
Config:
- `max_length`: `int` (default: `200`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text length is less than `max_length`, else returns "Failed".
25. Length Greater Than
Ensures the response meets a minimum length. Learn more →
Config:
- `min_length`: `int` (default: `50`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text length is greater than `min_length`, else returns "Failed".
26. Length Between
Validates the response length is within a certain range. Learn more →
Config:
- `min_length`: `int` (default: `50`)
- `max_length`: `int` (default: `200`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text length is between `min_length` and `max_length`, else returns "Failed".
27. Contains Valid Link
Confirms if the response contains at least one valid URL. Learn more →
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text contains at least one valid hyperlink, else returns "Failed".
28. No Valid Links
Checks that no valid links are present in the response. Learn more →
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text does not contain any valid hyperlinks, else returns "Failed".
29. Is Email
Checks if the response is a properly formatted email address. Learn more →
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text is a properly formatted email address, else returns "Failed".
30. Contains
Validates that the response contains a required keyword. Learn more →
Config:
- `keyword`: `string` - The keyword to check for in the text
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text contains the keyword, else returns "Failed".
31. Contains Any
Checks that the response contains at least one of the specified keywords. Learn more →
Config:
- `keywords`: `list` (default: `[]`) - The list of keywords to check for in the text
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text contains any of the keywords, else returns "Failed".
32. Contains All
Validates that the response contains all specified keywords. Learn more →
Config:
- `keywords`: `list` (default: `[]`) - The list of keywords to check for in the text
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text contains all of the keywords, else returns "Failed".
33. Contains None
Ensures the response does not contain any of specified keywords. Learn more →
Config:
- `keywords`: `list` (default: `[]`) - The list of forbidden keywords to check for in the text
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text does not contain any of the forbidden keywords, else returns "Failed".
34. Starts With
Checks if the response starts with a specified substring. Learn more →
Config:
- `substring`: `string` (default: `None`, required: `True`) - The substring to check for in the text
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text starts with the substring, else returns "Failed".
35. Ends With
Validates whether the response ends with a specified substring. Learn more →
Config:
- `substring`: `string` (default: `None`, required: `True`) - The substring to check for in the text
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
Output: Returns "Passed" if the text ends with the substring, else returns "Failed".
36. Equals
Checks if the response exactly matches the expected content. Learn more →
Config:
- `case_sensitive`: `bool` (default: `True`)
Mapping:
- `text`: `string` - The text to be evaluated
- `expected_text`: `string` - The expected text to compare against
Output: Returns "Passed" if the text exactly matches the expected text, else returns "Failed".
37. Answer Similarity
Measures similarity between a candidate answer and an expected answer. Learn more →
Config:
- `comparator`: `string` (default: `"CosineSimilarity"`) - The method to use for comparison
- `failure_threshold`: `float` (default: `0.5`) - The threshold for the similarity score
Mapping:
- `expected_response`: `string` - The expected answer
- `response`: `string` - The output generated by the AI system
Output: Returns a score where values greater than or equal to the failure threshold indicate sufficient answer similarity.
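A sketch assuming the ground-truth answer is available as a span attribute (both paths are illustrative):

```python
config = {"comparator": "CosineSimilarity", "failure_threshold": 0.5}
mapping = {
    "expected_response": "llm.input_messages.2.message.content",  # illustrative: ground truth stored on the span
    "response": "llm.output_messages.0.message.content",
}
```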
38. Eval Output
Generic evaluation of the final output for compliance with instructions. Learn more →
Config:
- `check_internet`: `bool` (default: `False`) - Whether to verify against internet sources
- `criteria`: `string` - The criteria for the evaluation
Mapping:
- `context`: `string` - The context provided to the AI system
- `response`: `string` - The output generated by the AI system
39. Eval Context Retrieval Quality
Checks if the retrieved context is relevant and sufficient. Learn more →
Config:
- `criteria`: `string` - Description of the criteria for evaluation
Mapping:
- `input`: `string` - The input to be evaluated
- `output`: `string` - The output generated by the AI system
- `context`: `string` - The context provided to the AI system
Output: Returns a score where higher values indicate better context retrieval quality.
40. Eval Image Instruction
Evaluates compliance with instructions for image-based tasks. Learn more →
Config:
- `criteria`: `string` - The evaluation standard that defines how the alignment is measured
Mapping:
- `input`: `string` - The input to be evaluated
- `image_url`: `string` - The URL of the image to be evaluated
Output: Returns a score based on the linkage analysis, which is compared against predefined criteria to determine whether the image meets the expected standards.
41. Summary Quality
Checks if a summary captures the main points accurately and succinctly. Learn more →
Config:
- `check_internet`: `bool` (default: `False`) - Whether to verify against internet sources
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The output generated by the AI system
- `context`: `string` - The context provided to the AI system
Output: Returns a score where a higher score indicates better summary quality.
42. Factual Accuracy
Determines if the response is factually correct based on the context provided. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
- `check_internet`: `bool` (default: `False`)
Mapping:
- `input`: `string` - The input provided to the AI system
- `output`: `string` - The output generated by the AI system
- `context`: `string` - The context provided to the AI system
Output: Returns a score where a higher score indicates greater factual accuracy.
43. Translation Accuracy
Evaluates the accuracy of translated content. Learn more →
Config:
- `check_internet`: `bool` (default: `False`)
- `criteria`: `string` - The criteria for the evaluation
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The translated output generated by the AI system
Output: Returns a score where a higher score indicates greater translation accuracy.
44. Cultural Sensitivity
Assesses given text for inclusivity and cultural awareness. Learn more →
Mapping:
- `input`: `string` - The input to be evaluated for cultural sensitivity
Output: Returns either "Passed" or "Failed", where "Passed" indicates culturally appropriate content and "Failed" indicates potential cultural insensitivity.
45. Bias Detection
Detects presence of bias or unfairness in the text. Learn more →
Mapping:
- `input`: `string` - The input to be evaluated for bias
Output: Returns either "Passed" or "Failed", where "Passed" indicates neutral content and "Failed" indicates the presence of bias.
46. LLM Function Calling
Checks if the output properly uses a function/tool call with correct parameters. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The output generated by the AI system
Output: Returns "Passed" if the output properly uses a function/tool call with correct parameters, else returns "Failed".
47. Groundedness
Evaluates if the response is grounded in the provided context. Learn more →
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The output generated by the AI system
Output: Returns "Passed" if the output is grounded in the provided input, else returns "Failed".
48. Score Eval
Scores linkage between instruction, input images, and output images. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
Mapping:
- `rule_prompt`: `string` - The rule prompt for the evaluation
- `criteria`: `string` - The criteria for scoring
- `input`: `list` - The inputs to the AI system
Output: Returns a score based on the criteria provided.
49. Audio Transcription
Analyzes the transcription accuracy of the given audio and its transcription. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
Mapping:
- `input audio`: `string` - The URL of the audio to be evaluated
- `input transcription`: `string` - The transcription to be evaluated against the audio
Output: Returns a score based on the criteria provided.
50. Eval Audio Description
Evaluates the audio based on the description of the given audio. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
- `model`: `string` - The model to be used for the evaluation
Mapping:
- `input audio`: `string` - The URL of the audio to be evaluated
Output: Returns "Passed" or "Failed" based on the criteria provided.
51. Audio Quality
Evaluates the quality of the given audio. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
- `model`: `string` - The model to be used for the evaluation
Mapping:
- `input audio`: `string` - The URL of the audio to be evaluated
Output: Returns a score based on the criteria provided.
52. JSON Schema Validation
Validates JSON against specified criteria. Learn more →
Config:
- `validations`: `list` - The list of validation criteria to apply
Mapping:
- `actual_json`: `string` - The JSON generated by the AI system
- `expected_json`: `string` - The expected JSON to validate against
Output: Returns "Passed" or "Failed" based on the validations provided.
53. Chunk Attribution
Tracks if the context chunk is used in generating the response. Learn more →
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The output generated by the AI system
- `context`: `string` - The context provided to the AI system
Output: Returns "Passed" or "Failed" based on whether the context chunk was used in generating the response.
54. Chunk Utilization
Measures how effectively context chunks are used in responses. Learn more →
Mapping:
- `input`: `string` - The input to the AI system
- `output`: `string` - The output generated by the AI system
- `context`: `string` - The context provided to the AI system
Output: Returns a score where higher values indicate more effective use of the context chunks.
55. Eval Ranking
Provides ranking score for each context based on specified criteria. Learn more →
Config:
- `criteria`: `string` - The criteria for the evaluation
Mapping:
- `input`: `string` - The input to the AI system
- `context`: `string` - The context provided to the AI system
Output: Returns a ranking score for each context based on the criteria provided.
56. Custom Code Evaluation
Executes custom Python code for evaluation. Learn more →
Config:
- `code`: `string` - The custom Python code to execute for the evaluation
Output: Returns "Passed" or "Failed" based on the code provided.
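A heavily hedged sketch: the code is passed as a string, but the exact contract (entry point, arguments, return convention) is platform-defined, so treat the function below as hypothetical and follow the Learn more link for the real interface.

```python
config = {
    "code": '''
def evaluate(output: str) -> str:
    # Hypothetical entry point; the platform defines the real
    # function name, arguments, and return convention.
    return "Passed" if output.strip() else "Failed"
''',
}
```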