How to Configure Evals for Observability

To configure evals for observability, you can use the EvalTag class.

eval_tag = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        config={
            "criteria": "Check if response stays within provided context"
        },
        mapping={
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.1.message.content"
        },
        custom_eval_name="context_check"
    )
]
  • eval_name: The predefined evaluation to use
  • type: Specifies where to apply the evaluation
  • value: Identifies the kind of span to evaluate
  • config: Contains configuration parameters for the eval
  • mapping: Contains mapping of the required inputs of the eval
  • custom_eval_name: Custom name to assign to the eval tag
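
Once defined, the list of tags is attached to the tracing setup so that matching spans are evaluated automatically. The sketch below assumes a register()-style entry point and a ProjectType enum similar to fi_instrumentation; the exact import paths and signature may differ in your SDK version.

    # Assumed import paths - EvalTag, EvalName, EvalTagType and EvalSpanKind
    # typically come from the same types module; adjust to your SDK's layout
    from fi_instrumentation import register
    from fi_instrumentation.fi_types import ProjectType

    # Attach the eval tags defined above when registering the tracer provider
    # (function name, ProjectType enum, and signature are assumptions)
    trace_provider = register(
        project_type=ProjectType.OBSERVE,
        project_name="my-observability-project",
        eval_tags=eval_tag,
    )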

Below is the list of evals with their required mappings and configuration parameters.


1. Conversation Coherence

Assesses whether a dialogue maintains logical flow and contextual consistency throughout all exchanges. Learn more →

Mapping:

  • messages : Data that contains the complete conversation, represented as an array of user and assistant messages.

Output: Returns an output score. A higher score reflects a logically consistent and contextually relevant conversation. A lower score indicates issues like abrupt topic shifts, irrelevant responses, or loss of context.
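
For illustration, a tag for this eval might look like the sketch below. The enum member EvalName.CONVERSATION_COHERENCE and the llm.input_messages attribute path are assumptions; check the names your SDK and instrumentation actually expose.

    eval_tag = [
        EvalTag(
            eval_name=EvalName.CONVERSATION_COHERENCE,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={},
            mapping={
                # Map "messages" to the span attribute holding the full conversation
                "messages": "llm.input_messages"
            },
            custom_eval_name="coherence_check"
        )
    ]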


2. Conversation Resolution

Checks if a conversation reaches a satisfactory conclusion that addresses the user’s initial query or intent. Learn more →

Mapping:

  • messages : Data that contains the complete conversation, represented as an array of user and assistant messages.

Output: Returns an output score. A higher score indicates that the conversation was resolved effectively. A lower score points to incomplete, unclear, or unresolved conversations.


3. Deterministic Evals

Provides rule-based validation for multi-choice scenarios using predefined patterns and formats. Learn more →

Config:

  • input: string - The input to be evaluated
  • choices: List[string] - A set of predefined options for multiple-choice questions
  • multi_choice: bool - Whether the input is a multiple choice question or not
  • rule_prompt: string - A rule or set of conditions that the output must meet

Output: Returns one of the user-provided choices, indicating the output’s adherence to the deterministic criteria.
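
A sketch of a deterministic tag is shown below, assuming the eval is exposed as EvalName.DETERMINISTIC_EVALS (the member name may differ). Note that input, choices, multi_choice, and rule_prompt all live in config rather than mapping.

    eval_tag = [
        EvalTag(
            eval_name=EvalName.DETERMINISTIC_EVALS,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "input": "Is the assistant's reply written in English?",
                "choices": ["Yes", "No"],
                "multi_choice": False,
                "rule_prompt": "Answer Yes only if the entire reply is in English."
            },
            mapping={},
            custom_eval_name="english_only_check"
        )
    ]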


4. Content Moderation

Identifies and flags potentially harmful, unsafe, or prohibited content in text outputs. Learn more →

Mapping:

  • text: string - The text to be evaluated for content moderation

Output: Returns a score where higher values indicate safer content and lower values indicate potentially inappropriate content.


5. Context Adherence

Checks if a response stays strictly within the bounds of provided context without introducing external information. Learn more →

Mapping:

  • context: string - The context provided to the AI system.
  • output: string - The output generated by the AI system.

Output: Returns a score where a higher score indicates stronger adherence to the context.


6. Prompt Perplexity

Measures how difficult or confusing a prompt might be for a language model to process. Learn more →

Config:

  • model: string (default: "gpt-4o-mini")

Mapping:

  • input: string - The input to be evaluated for perplexity

Output: Returns a score where a higher score indicates more difficult or confusing prompts and lower scores indicate higher prompt clarity and interpretability.


7. Context Relevance

Verifies that content is meaningfully related to the provided context and addresses the query appropriately. Learn more →

Config:

  • check_internet: bool (default: False) - Whether to verify against internet sources

Mapping:

  • context: string - The context provided to the AI system.
  • input: string - The input to the AI system.

Output: Returns a score where higher values indicate more relevant context.


8. Completeness

Analyzes whether an output fully addresses all aspects of the input request or task. Learn more →

Mapping:

  • input: string - The input to the AI system.
  • output: string - The output generated by the AI system.

Output: Returns a score where higher values indicate more complete content.


9. Context Similarity

Compares the semantic similarity between a response and the original context using vector comparisons. Learn more →

Config:

  • comparator: string (default: "CosineSimilarity") - Contains method to use for comparison
  • failure_threshold: float (default: 0.5) - The threshold for the similarity score

Mapping:

  • context: string - The context provided to the AI system.
  • response: string - The output generated by the AI system.

Output:

  • Returns a score where values greater than or equal to the failure threshold indicate sufficient context similarity.
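
A sketch for this eval, assuming it is exposed as EvalName.CONTEXT_SIMILARITY (name assumed) and reusing the illustrative message-attribute paths from the opening example:

    eval_tag = [
        EvalTag(
            eval_name=EvalName.CONTEXT_SIMILARITY,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "comparator": "CosineSimilarity",
                "failure_threshold": 0.7
            },
            mapping={
                "context": "llm.input_messages.1.message.content",
                "response": "llm.output_messages.0.message.content"
            },
            custom_eval_name="context_similarity_check"
        )
    ]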

10. PII

Detects Personally Identifiable Information in the response. Learn more →

Mapping:

  • input: string - The input to be evaluated for PII

Output: Returns a ‘Passed’ if the response does not contain any PII, else returns ‘Failed’


11. Toxicity

Detects the presence of toxic, harmful, or aggressive language in the text. Learn more →

Mapping:

  • input: string - the input to be evaluated for toxicity

Output: Returns either “Passed” or “Failed”, where “Passed” indicates non-toxic content and “Failed” indicates the presence of harmful or aggressive language


12. Tone

Evaluates the emotional quality and overall sentiment expressed in the content. Learn more →

Mapping:

  • input: string - The input to be evaluated for tone

Output: Returns a tone label such as “neutral” or “joy”, indicating the dominant emotional tone detected in the content


13. Sexist

Detects gender-biased or discriminatory language in the text. Learn more →

Mapping:

  • input: string - The input to be evaluated for sexism

Output: Returns either “Passed” or “Failed”, where “Passed” indicates no sexist content detected, “Failed” indicates presence of gender bias or discriminatory language


14. Prompt Injection

Identifies attempts to manipulate the model through crafted inputs that try to override instructions. Learn more →

Mapping:

  • input: string - The input to the AI system

Output: Returns a ‘Passed’ if the input is not a prompt injection, else returns ‘Failed’


15. Not Gibberish Text

Ensures text is coherent and meaningful rather than random or nonsensical strings. Learn more →

Mapping:

  • response: string - The output generated by the AI system.

Output: Returns a float between 0 and 1. Higher values indicate more coherent and meaningful content.


16. Safe For Work Text

Verifies content is appropriate for professional environments, avoiding adult or offensive material. Learn more →

Mapping:

  • response: string - The output generated by the AI system.

Output: Returns either “Passed” or “Failed”, where “Passed” indicates safe for work text and “Failed” indicates not safe for work text.


17. Prompt Instruction Adherence

Checks if outputs follow specific instructions provided in the prompt. Learn more →

Mapping:

  • output: string - The output generated by the AI system.

Output: Returns a score between 0 and 1. A high score reflects strong adherence, where all prompt requirements are met, tasks are fully addressed, specified formats and constraints are followed, and both explicit and implicit instructions are properly handled. Conversely, a low score indicates significant deviations from the prompt instructions.


18. Data Privacy Compliance

Ensures content adheres to data protection standards and privacy regulations. Learn more →

Config:

  • check_internet: bool (default: False) - Whether to verify against internet sources

Mapping:

  • input: string - The input to be evaluated

Output: Returns a ‘Passed’ if the content adheres to data protection standards and privacy regulations, else returns ‘Failed’


19. Is Json

Validates whether text output is properly formatted as valid JSON. Learn more →

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text is valid JSON, else returns ‘Failed’


20. Regex

Checks if text matches a specified regular expression pattern for format validation. Learn more →

Config:

  • pattern: string - The regular expression pattern to check for in the text

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text matches the regular expression pattern, else returns ‘Failed’
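
As a sketch, a tag that enforces an ISO-8601 date in the response might look like this, assuming the eval is exposed as EvalName.REGEX (name assumed):

    eval_tag = [
        EvalTag(
            eval_name=EvalName.REGEX,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                # Passes only if the output contains a YYYY-MM-DD date
                "pattern": r"\d{4}-\d{2}-\d{2}"
            },
            mapping={
                "text": "llm.output_messages.0.message.content"
            },
            custom_eval_name="date_format_check"
        )
    ]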


21. Api Call

Invokes an external API for additional validation or data retrieval. Learn more →

Config:

  • url: string (default: None, required: True) - The URL of the API to call.
  • payload: dict (default: {}) - The payload to send to the API.
  • headers: dict (default: {}) - The headers to send to the API.

Mapping:

  • response: string - The response from the API call. This is the data the AI system will process to construct its output.

Output: Returns a ‘Passed’ if the API call is successful and provides a valid response that aligns with the requirements, else returns ‘Failed’
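
A sketch of an external validation hook, assuming the eval is exposed as EvalName.API_CALL (name assumed) and that the endpoint shown is your own validation service:

    eval_tag = [
        EvalTag(
            eval_name=EvalName.API_CALL,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "url": "https://validator.example.com/check",  # hypothetical endpoint
                "payload": {"strict": True},
                "headers": {"Authorization": "Bearer <token>"}
            },
            mapping={
                "response": "llm.output_messages.0.message.content"
            },
            custom_eval_name="external_validation"
        )
    ]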


22. Agent As Judge

Uses a specified model as the “judge” to evaluate the response. Learn more →

Config:

  • model: string (default: "gpt-4o-mini") - The model to use for the evaluation
  • eval_prompt: string (default: None, required: True) - The prompt to use for the evaluation
  • system_prompt: string (default: "") - The system prompt to use for the evaluation

Output: A successful evaluation provides reasoned output based on the agent’s analysis using the configured prompts and chosen model, while a failed evaluation indicates issues with the agent’s execution or response generation.
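
A sketch of a judge-style tag, assuming the eval is exposed as EvalName.AGENT_AS_JUDGE (name assumed); since no mapping is documented for this eval, only config is set here.

    eval_tag = [
        EvalTag(
            eval_name=EvalName.AGENT_AS_JUDGE,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "model": "gpt-4o-mini",
                "eval_prompt": "Rate whether the assistant's answer is polite and on-topic.",
                "system_prompt": "You are a strict but fair evaluator."
            },
            custom_eval_name="judge_politeness"
        )
    ]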


23. One Line

Checks that the entire response is contained in a single line. Learn more →

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text is a single line, else returns ‘Failed’


24. Length Less Than

Checks if the response length is under a specified limit. Learn more →

Config:

  • max_length: int (default: 200)

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text length is less than max_length, else returns ‘Failed’


25. Length Greater Than

Ensures the response meets a minimum length. Learn more →

Config:

  • min_length: int (default: 50)

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text length is greater than min_length, else returns ‘Failed’


26. Length Between

Validates the response length is within a certain range. Learn more →

Config:

  • max_length: int (default: 200)
  • min_length: int (default: 50)

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text length is between max_length and min_length, else returns ‘Failed’
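
A sketch of a length guard, assuming the eval is exposed as EvalName.LENGTH_BETWEEN (name assumed):

    eval_tag = [
        EvalTag(
            eval_name=EvalName.LENGTH_BETWEEN,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "min_length": 50,
                "max_length": 500
            },
            mapping={
                "text": "llm.output_messages.0.message.content"
            },
            custom_eval_name="length_guard"
        )
    ]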


27. Contains Valid Link

Confirms if the response contains at least one valid URL. Learn more →

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text contains at least one valid URL, else returns ‘Failed’


28. No Valid Links

Checks that no valid links are present in the response. Learn more →

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text does not contain any valid hyperlinks, else returns ‘Failed’


29. Is Email

Checks if the response is a properly formatted email address. Learn more →

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text is a properly formatted email address, else returns ‘Failed’


30. Contains

Validates that the response contains a required keyword. Learn more →

Config:

  • case_sensitive: bool (default: True)
  • keyword: string - The keyword to check for in the text

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text contains the keyword, else returns ‘Failed’


31. Contains Any

Checks that the response contains at least one of the specified keywords. Learn more →

Config:

  • case_sensitive: bool (default: True)
  • keywords: list (default: []) - The list of keywords to check for in the text

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text contains any of the keywords, else returns ‘Failed’
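
A sketch of a keyword check, assuming the eval is exposed as EvalName.CONTAINS_ANY (name assumed):

    eval_tag = [
        EvalTag(
            eval_name=EvalName.CONTAINS_ANY,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "case_sensitive": False,
                "keywords": ["refund", "return", "exchange"]
            },
            mapping={
                "text": "llm.output_messages.0.message.content"
            },
            custom_eval_name="support_keyword_check"
        )
    ]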


32. Contains All

Validates that the response contains all specified keywords. Learn more →

Config:

  • case_sensitive: bool (default: True)
  • keywords: list (default: []) - The list of keywords to check for in the text

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text contains all of the keywords, else returns ‘Failed’


33. Contains None

Ensures the response does not contain any of specified keywords. Learn more →

Config:

  • case_sensitive: bool (default: True)
  • keywords: list (default: []) - The list of forbidden keywords to check for in the text

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text does not contain any of the forbidden keywords, else returns ‘Failed’


34. Starts With

Checks if the response starts with a specified substring. Learn more →

Config:

  • substring: string (default: None, required: True) - The substring to check for in the text
  • case_sensitive: bool (default: True)

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text starts with the substring, else returns ‘Failed’


35. Ends With

Validates whether the response ends with a specified substring. Learn more →

Config:

  • case_sensitive: bool (default: True)
  • substring: string (default: None, required: True) - The substring to check for in the text

Mapping:

  • text: string - The text to be evaluated

Output: Returns a ‘Passed’ if the text ends with the substring, else returns ‘Failed’


36. Equals

Checks if the response exactly matches the expected content. Learn more →

Config:

  • case_sensitive: bool (default: True)

Mapping:

  • text: string - The text to be evaluated
  • expected_text: string - The expected text to be compared against

Output: Returns a ‘Passed’ if the text exactly matches the expected text, else returns ‘Failed’


37. Answer Similarity

Measures similarity between a candidate answer and an expected answer. Learn more →

Config:

  • comparator: string (default: "CosineSimilarity") - Contains method to use for comparison
  • failure_threshold: float (default: 0.5) - The threshold for the similarity score

Mapping:

  • expected_response: string - The expected answer
  • response: string - The output generated by the AI system

Output:

  • Returns a score where values greater than or equal to the failure threshold indicate sufficient answer similarity.
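
A sketch comparing the generated answer to a ground-truth answer carried on the span, assuming the eval is exposed as EvalName.ANSWER_SIMILARITY and that your instrumentation records the expected answer under a custom attribute (both names are assumptions):

    eval_tag = [
        EvalTag(
            eval_name=EvalName.ANSWER_SIMILARITY,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "comparator": "CosineSimilarity",
                "failure_threshold": 0.8
            },
            mapping={
                "expected_response": "metadata.expected_answer",  # custom attribute, illustrative
                "response": "llm.output_messages.0.message.content"
            },
            custom_eval_name="answer_similarity_check"
        )
    ]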

38. Eval Output

Generic evaluation of the final output for compliance with instructions. Learn more →

Config:

  • check_internet: bool (default: False) - Whether to verify against internet sources
  • criteria: string - The criteria for the evaluation

Mapping:

  • context: string - The context provided to the AI system
  • response: string - The output generated by the AI system

39. Eval Context Retrieval Quality

Checks if the retrieved context is relevant and sufficient. Learn more →

Config:

  • criteria: string - Description of the criteria for evaluation

Mapping:

  • input: string - The input to be evaluated
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system

Output: Returns a score where higher values indicate better context retrieval quality.


40. Eval Image Instruction

Evaluates compliance with instructions for image-based tasks. Learn more →

Config:

  • criteria: string - The evaluation standard that defines how the alignment is measured

Mapping:

  • input: string - The input to be evaluated
  • image_url: string - The URL of the image to be evaluated

Output: Returns a score based on the linkage analysis, which is compared against predefined criteria to determine whether the image meets the expected standards.


41. Summary Quality

Checks if a summary captures the main points accurately and succinctly. Learn more →

Config:

  • check_internet: bool (default: False) - Whether to verify against internet sources

Mapping:

  • input: string - The input to the AI system
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system

Output: Returns a score where a higher score indicates better summary quality.


42. Factual Accuracy

Determines if the response is factually correct based on the context provided. Learn more →

Config:

  • criteria: string
  • check_internet: bool (default: False)

Mapping:

  • input: string - The input provided to the AI system
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system

Output: Returns a score where a higher score indicates greater factual accuracy.
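
A sketch for this eval, assuming it is exposed as EvalName.FACTUAL_ACCURACY (name assumed) and reusing the illustrative message-attribute paths from the opening example:

    eval_tag = [
        EvalTag(
            eval_name=EvalName.FACTUAL_ACCURACY,  # enum member name assumed
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            config={
                "criteria": "Every claim in the answer must be supported by the provided context",
                "check_internet": False
            },
            mapping={
                "input": "llm.input_messages.0.message.content",
                "output": "llm.output_messages.0.message.content",
                "context": "llm.input_messages.1.message.content"
            },
            custom_eval_name="factual_accuracy_check"
        )
    ]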


43. Translation Accuracy

Evaluates the accuracy of translated content. Learn more →

Config:

  • check_internet: bool (default: False)
  • criteria: string

Mapping:

  • input: string - The input to the AI system
  • output: string - The translated output generated by the AI system

Output: Returns a score where a higher score indicates greater translation accuracy.


44. Cultural Sensitivity

Assesses given text for inclusivity and cultural awareness. Learn more →

Mapping:

  • input: string - The input to be evaluated for cultural sensitivity

Output: Returns either “Passed” or “Failed”, where “Passed” indicates culturally appropriate content, “Failed” indicates potential cultural insensitivity


45. Bias Detection

Detects presence of bias or unfairness in the text. Learn more →

Mapping:

  • input: string - The input to be evaluated for bias

Output: Returns either “Passed” or “Failed”, where “Passed” indicates neutral content, “Failed” indicates presence of bias.


46. LLM Function Calling

Checks if the output properly uses a function/tool call with correct parameters. Learn more →

Config:

  • criteria: string - The criteria for the evaluation

Mapping:

  • input: string - The input to the AI system
  • output: string - The output generated by the AI system

Output: Returns a ‘Passed’ if the output properly uses a function/tool call with correct parameters, else returns ‘Failed’