To configure evaluations for your prototype, define a list of EvalTag objects that specify which evals should be run against your model outputs.
# Import paths shown for the Future AGI tracing SDK; exact module names may vary by SDK version
from fi_instrumentation.fi_types import (
    EvalName,
    EvalSpanKind,
    EvalTag,
    EvalTagType,
    ModelChoices,
)

eval_tags = [
    EvalTag(
        eval_name=EvalName.CONTEXT_ADHERENCE,
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        mapping={
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.1.message.content"
        },
        custom_eval_name="context_check",
        model=ModelChoices.TURING_LARGE
    )
]
  • eval_name: The evaluation to run on the spans
  • type: Specifies where to apply the evaluation
  • value: Identifies the kind of span to evaluate
  • mapping: Maps the eval’s required inputs to span attributes. Learn more →
  • custom_eval_name: Custom name to assign to the eval tag
  • model: The model used to run the eval, particularly for Future AGI evals

Adding Custom Evals

For custom-built evals, pass the name of the custom eval as a string.
eval_tags = [
    EvalTag(
        eval_name='custom_eval_name_entered',
        value=EvalSpanKind.LLM,
        type=EvalTagType.OBSERVATION_SPAN,
        mapping={
            'input' : 'input.value'
        },
        custom_eval_name="<custom_eval_name2>",
    ),
]

Understanding the Mapping Attribute

The mapping attribute is a crucial component that connects eval requirements with your data. Here’s how it works:
  1. Each eval has required keys: Different evaluations require different inputs. For example, the Context Adherence eval requires both a context and an output key.
  2. Spans contain attributes: Your spans (LLM spans, retriever spans, etc.) store information as key-value pairs known as span attributes.
  3. Mapping connects them: The mapping object specifies which span attribute supplies each required key.
For example, in this mapping:
mapping={
    "output": "llm.output_messages.0.message.content",
    "context": "llm.input_messages.1.message.content"
}
  • The output key required by the eval will use data from the span attribute llm.output_messages.0.message.content
  • The context key will use data from the span attribute llm.input_messages.1.message.content
This allows evaluations to be flexible and work with different data while maintaining consistent evaluation logic.
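For instance, the Completeness eval (described below) requires input and output keys. A minimal sketch of a tag for it, assuming a COMPLETENESS enum member and using the raw input.value / output.value span attributes (illustrative paths):
EvalTag(
    eval_name=EvalName.COMPLETENESS,
    type=EvalTagType.OBSERVATION_SPAN,
    value=EvalSpanKind.LLM,
    mapping={
        "input": "input.value",
        "output": "output.value"
    },
    custom_eval_name="completeness_check"
)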
Below is the list of evals Future AGI provides, along with their corresponding mappings and configuration parameters.

1. Conversation Coherence

Assesses whether a dialogue maintains logical flow and contextual consistency throughout all exchanges. Learn more → Mapping:
  • messages: Data that contains the complete conversation, represented as an array of user and assistant messages.
Output: Returns an output score. A higher score reflects a logically consistent and contextually relevant conversation. A lower score indicates issues like abrupt topic shifts, irrelevant responses, or loss of context.
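For reference, a conversation in the shape this eval expects might look like the following (a minimal sketch; which span attribute holds it depends on your instrumentation):
messages = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Account and click Reset password."},
    {"role": "user", "content": "Thanks, that worked."}
]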

2. Conversation Resolution

Checks if a conversation reaches a satisfactory conclusion that addresses the user’s initial query or intent. Learn more → Mapping:
  • messages: Data that contains the complete conversation, represented as an array of user and assistant messages.
Output: Returns an output score. A higher score indicates that the conversation was resolved effectively. A lower score points to incomplete, unclear, or unresolved conversations.

3. Content Moderation

Identifies and flags potentially harmful, unsafe, or prohibited content in text outputs. Learn more → Mapping:
  • text: string - The text to be evaluated for content moderation
Output: Returns a score where higher values indicate safer content and lower values indicate potentially inappropriate content

4. Context Adherence

Checks if a response stays strictly within the bounds of provided context without introducing external information. Learn more → Mapping:
  • context: string - The context provided to the AI system.
  • output: string - The output generated by the AI system.
Output: Returns a score where a higher score indicates stronger adherence to the context.

5. Context Relevance

Verifies that content is meaningfully related to the provided context and addresses the query appropriately. Learn more → Mapping:
  • context: string - The context provided to the AI system.
  • input: string - The input to the AI system.
Output: Returns a score where higher values indicate more relevant context.

6. Completeness

Analyzes whether an output fully addresses all aspects of the input request or task. Learn more → Mapping:
  • input: string - The input to the AI system.
  • output: string - The output generated by the AI system.
Output: Returns a score where higher values indicate more complete content.

7. PII

Detects Personally Identifiable Information in the response. Learn more → Mapping:
  • input: string - The input to be evaluated for PII
Output: Returns a ‘Passed’ if the response does not contain any PII, else returns ‘Failed’

8. Toxicity

Detects the presence of toxic, harmful, or aggressive language in the text. Learn more → Mapping:
  • input: string - The input to be evaluated for toxicity
Output: Returns either “Passed” or “Failed”, where “Passed” indicates non-toxic content and “Failed” indicates the presence of harmful or aggressive language

9. Tone

Evaluates the emotional quality and overall sentiment expressed in the content. Learn more → Mapping:
  • input: string - The input to be evaluated for tone
Output: Returns a tone label, such as “neutral” or “joy”, indicating the dominant emotional tone detected in the content

10. Sexist

Detects gender-biased or discriminatory language in the text. Learn more → Mapping:
  • input: string - The input to be evaluated for sexism
Output: Returns either “Passed” or “Failed”, where “Passed” indicates no sexist content detected and “Failed” indicates the presence of gender bias or discriminatory language

11. Prompt Injection

Identifies attempts to manipulate the model through crafted inputs that try to override instructions. Learn more → Mapping:
  • input: string - The input to the AI system
Output: Returns a ‘Passed’ if the input is not a prompt injection, else returns ‘Failed’

12. Prompt Instruction Adherence

Checks if outputs follow specific instructions provided in the prompt. Learn more → Mapping:
  • output: string - The output generated by the AI system.
Output: Returns a score between 0 and 1. A high score reflects strong adherence, where all prompt requirements are met, tasks are fully addressed, specified formats and constraints are followed, and both explicit and implicit instructions are properly handled. Conversely, a low score indicates significant deviations from the prompt instructions.

13. Data Privacy Compliance

Ensures content adheres to data protection standards and privacy regulations. Learn more → Mapping:
  • input: string - The input to be evaluated
Output: Returns a ‘Passed’ if the content adheres to data protection standards and privacy regulations, else returns ‘Failed’

14. Is JSON

Validates whether text output is properly formatted as valid JSON. Learn more → Mapping:
  • text: string - The text to be evaluated
Output: Returns a ‘Passed’ if the text is valid JSON, else returns ‘Failed’

15. One Line

Checks that the entire response is contained in a single line. Learn more → Mapping:
  • text: string - The text to be evaluated
Output: Returns a ‘Passed’ if the text is a single line, else returns ‘Failed’

16. Contains Valid Link

Confirms if the response contains at least one valid URL. Learn more → Mapping:
  • text: string - The text to be evaluated
Output: Returns a ‘Passed’ if the text contains at least one valid URL, else returns ‘Failed’

17. No Valid Links

Checks that no valid links are present in the response. Learn more → Mapping:
  • text: string - The text to be evaluated
Output: Returns a ‘Passed’ if the text does not contain any valid hyperlinks, else returns ‘Failed’

18. Is Email

Checks if the response is a properly formatted email address. Learn more → Mapping:
  • text: string - The text to be evaluated
Output: Returns a ‘Passed’ if the text is a properly formatted email address, else returns ‘Failed’

19. Summary Quality

Checks if a summary captures the main points accurately and succinctly. Learn more → Mapping:
  • input: string - The input to the AI system
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system
Output: Returns a score where a higher score indicates better summary quality.
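A tag for this eval must map all three keys. A minimal sketch, assuming a SUMMARY_QUALITY enum member and illustrative attribute paths:
EvalTag(
    eval_name=EvalName.SUMMARY_QUALITY,
    type=EvalTagType.OBSERVATION_SPAN,
    value=EvalSpanKind.LLM,
    mapping={
        "input": "input.value",
        "context": "llm.input_messages.1.message.content",
        "output": "llm.output_messages.0.message.content"
    },
    custom_eval_name="summary_quality_check"
)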

20. Factual Accuracy

Determines if the response is factually correct based on the context provided. Learn more → Mapping:
  • input: string - The input provided to the AI system
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system
Output: Returns a score where a higher score indicates greater factual accuracy.

21. Translation Accuracy

Evaluates the accuracy of translated content. Learn more → Mapping:
  • input: string - The input to the AI system
  • output: string - The translated output generated by the AI system
Output: Returns a score where a higher score indicates greater translation accuracy.

22. Cultural Sensitivity

Assesses given text for inclusivity and cultural awareness. Learn more → Mapping:
  • input: string - The input to be evaluated for cultural sensitivity
Output: Returns either “Passed” or “Failed”, where “Passed” indicates culturally appropriate content and “Failed” indicates potential cultural insensitivity

23. Bias Detection

Detects presence of bias or unfairness in the text. Learn more → Mapping:
  • input: string - The input to be evaluated for bias
Output: Returns either “Passed” or “Failed”, where “Passed” indicates neutral content and “Failed” indicates the presence of bias.

24. LLM Function Calling

Checks if the output properly uses a function/tool call with correct parameters. Learn more → Mapping:
  • input: string - The input to the AI system
  • output: string - The output generated by the AI system
Output: Returns a ‘Passed’ if the output properly uses a function/tool call with correct parameters, else returns ‘Failed’

25. Groundedness

Evaluates if the response is grounded in the provided context. Learn more → Mapping:
  • input: string - The input to the AI system
  • output: string - The output generated by the AI system
Output: Returns a ‘Passed’ if the output is grounded in the provided input, else returns ‘Failed’

26. Audio Transcription

Analyzes the transcription accuracy of the given audio and its transcription. Learn more → Mapping:
  • input audio: string - The URL of the audio to be evaluated
  • input transcription: string - The transcription of the audio, as generated by the AI system
Output: Returns a score based on the criteria provided.

27. Audio Quality

Evaluates the quality of the given audio. Learn more → Mapping:
  • input audio: string - The URL of the audio to be evaluated
Output: Returns a score based on the criteria provided.

28. Chunk Attribution

Tracks if the context chunk is used in generating the response. Learn more → Mapping:
  • input: string - The input to the AI system
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system
Output: Returns a ‘Passed’ if the context chunk was used in generating the response, else returns ‘Failed’

29. Chunk Utilization

Measures how effectively context chunks are used in responses. Learn more → Mapping:
  • input: string - The input to the AI system
  • output: string - The output generated by the AI system
  • context: string - The context provided to the AI system
Output: Returns a score based on the criteria provided.

30. Eval Ranking

Provides ranking score for each context based on specified criteria. Learn more → Mapping:
  • input: string - The input to the AI system
  • context: string - The context provided to the AI system, which is ranked against the input
Output: Returns a score based on the criteria provided.

31. No Racial Bias

Ensures that the output does not contain or imply racial bias, stereotypes, or preferential treatment. Mapping:
  • input: string - The input to be evaluated for racial bias
Output: Returns a ‘Passed’ if no racial bias is detected, else returns ‘Failed’

32. No Gender Bias

Checks that the response does not reinforce gender stereotypes or exhibit bias based on gender identity. Mapping:
  • input: string - The input to be evaluated for gender bias
Output: Returns a ‘Passed’ if no gender bias is detected, else returns ‘Failed’

33. No Age Bias

Evaluates if the content is free from stereotypes, discrimination, or assumptions based on age. Mapping:
  • input: string - The input to be evaluated for age bias
Output: Returns a ‘Passed’ if no age bias is detected, else returns ‘Failed’

34. No OpenAI Reference

Ensures that the model response does not mention being an OpenAI model or reference its training data or providers. Mapping:
  • input: string - The input to be evaluated for OpenAI references
Output: Returns a ‘Passed’ if no OpenAI references are detected, else returns ‘Failed’

35. No Apologies

Checks if the model unnecessarily apologizes, e.g., ‘I’m sorry, but…’ Mapping:
  • input: string - The input to be evaluated for unnecessary apologies
Output: Returns a ‘Passed’ if no unnecessary apologies are detected, else returns ‘Failed’

36. Is Polite

Ensures that the output maintains a respectful, kind, and non-aggressive tone. Mapping:
  • input: string - The input to be evaluated for politeness
Output: Returns a ‘Passed’ if the tone is polite, else returns ‘Failed’

37. Is Concise

Measures whether the answer is brief and to the point, avoiding redundancy. Mapping:
  • input: string - The input to be evaluated for conciseness
Output: Returns ‘Passed’ if the text is appropriately concise, ‘Failed’ if it contains unnecessary verbosity.

38. Is Helpful

Evaluates whether the response answers the user’s question effectively. Mapping:
  • input: string - The user’s question
  • output: string - The response to be evaluated
Output: Returns ‘Passed’ if the response effectively answers the question, ‘Failed’ if it fails to address the query adequately.

39. Is Code

Checks whether the output is valid code or contains expected code snippets. Mapping:
  • input: string - The input to be evaluated for code validity
Output: Returns ‘Passed’ if the output contains valid, well-structured code, ‘Failed’ if it contains invalid code or syntax errors.

40. Fuzzy Match

Compares model output with an expected answer using approximate (not exact) matching. Mapping:
  • input: string - The output to be evaluated
  • output: string - The expected answer to compare against
Output: Returns ‘Passed’ if the output semantically matches the expected answer, ‘Failed’ if there are significant differences in meaning or intent.

41. Answer Refusal

Checks if the model correctly refuses to answer when prompted with harmful, sensitive, or restricted queries. Mapping:
  • input: string - The harmful/sensitive query
  • output: string - The model’s response
Output: Returns ‘Passed’ if the model correctly refuses inappropriate requests, ‘Failed’ if it provides harmful or sensitive information.

42. Detect Hallucination

Identifies if the model fabricated facts or added information that was not present in the input or reference. Mapping:
  • input: string - The original input/reference
  • output: string - The response to be evaluated for hallucinations
Output: Returns ‘Passed’ if the output contains only information present in or logically derived from the input, ‘Failed’ if it contains fabricated content.

43. No Harmful Therapeutic Guidance

Ensures that the model does not provide potentially harmful psychological or therapeutic advice. Mapping:
  • input: string - The input to be evaluated for harmful therapeutic guidance
Output: Returns ‘Passed’ if the output appropriately avoids harmful therapeutic guidance, ‘Failed’ if it contains such content.

44. Clinically Inappropriate Tone

Evaluates whether the model’s tone is unsuitable for clinical or mental health contexts. Mapping:
  • input: string - The input to be evaluated for clinical tone appropriateness
Output: Returns ‘Passed’ if the tone is clinically appropriate, ‘Failed’ if it contains inappropriate elements.

45. Is Harmful Advice

Detects whether the model gives advice that could be physically, emotionally, legally, or financially harmful. Mapping:
  • input: string - The input to be evaluated for harmful advice
Output: Returns ‘Passed’ if the output appropriately avoids harmful advice, ‘Failed’ if it contains such content.

46. Content Safety Violation

A broad check for content that violates safety or usage policies—this includes toxicity, hate speech, explicit content, violence, etc. Mapping:
  • input: string - The input to be evaluated for content moderation
Output: Returns ‘Passed’ if the content adheres to safety guidelines, ‘Failed’ if it contains safety violations.

47. Is Good Summary

Evaluates if a summary is clear, well-structured, and includes the most important points from the source material. Mapping:
  • input: string - The source material
  • output: string - The summary to be evaluated
Output: Returns ‘Passed’ if the summary effectively captures the main points and is well-structured, ‘Failed’ if it lacks clarity or misses important information.

48. Is Factually Consistent

Checks if the generated output is factually consistent with the source/context (e.g., input text or documents). Mapping:
  • input: string - The source/context material
  • output: string - The output to be evaluated for factual consistency
Output: Returns ‘Passed’ if the output is factually consistent with the source, ‘Failed’ if it contains factual inconsistencies.

49. Is Compliant

Ensures that the output adheres to legal, regulatory, or organizational policies (e.g., HIPAA, GDPR, company rules). Mapping:
  • input: string - The input to be evaluated for compliance
Output: Returns ‘Passed’ if the output adheres to all relevant policies, ‘Failed’ if it contains compliance violations.

50. Is Informal Tone

Detects whether the tone is informal or casual (e.g., use of slang, contractions, emoji). Mapping:
  • input: string - The input to be evaluated for tone formality
Output: Returns ‘Passed’ if the tone is informal, ‘Failed’ if it is formal, neutral, or lacks any informal indicators.

51. Evaluate Function Calling

Tests if the model correctly identifies when to trigger a tool/function and includes the right arguments in the function call. Mapping:
  • input: string - The user’s request
  • output: string - The function call to be evaluated
Output: Returns ‘Passed’ if the function calling is correct and appropriate, ‘Failed’ if there are errors in function selection or argument usage.

52. Task Completion

Measures whether the model fulfilled the user’s request accurately and completely. Mapping:
  • input: string - The user’s request
  • output: string - The model’s response to be evaluated
Output: Returns ‘Passed’ if the task is completed successfully and accurately, ‘Failed’ if the response is incomplete or inaccurate.

53. Caption Hallucination

Evaluates whether image captions or descriptions contain factual inaccuracies or hallucinated details that are not present in the instruction. Mapping:
  • input: string - The user’s request
  • output: string - The model’s response to be evaluated
Output: Returns ‘Passed’ if the description accurately reflects the instruction without adding unverified details, ‘Failed’ if it contains hallucinated elements.

54. BLEU Score

Computes a BLEU score between the expected gold answer and the model output. Mapping:
  • reference: string - The reference answer
  • hypothesis: string - The model output
Output: Returns a score between 0 and 1. Higher values indicate greater lexical overlap.
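Independently of the platform, you can sanity-check BLEU locally. A minimal sketch using NLTK (not part of the Future AGI SDK):
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

# bigram BLEU; sentence_bleu takes a list of tokenized references and one hypothesis
score = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5))
print(round(score, 3))  # ≈ 0.707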

55. ROUGE Score

Computes a ROUGE score between the expected gold answer and the model output. Mapping:
  • reference: string - The reference answer
  • hypothesis: string - The model output
Output: Returns a score between 0 and 1. Higher values indicate greater recall-oriented overlap.

56. Text to SQL

Evaluates if the model correctly converts natural language text into valid and accurate SQL queries. Mapping:
  • input: string - The input text to be evaluated
  • output: string - The output to be evaluated
Output: Returns ‘Passed’ if the SQL query is correct and valid, ‘Failed’ if it is incorrect, invalid, or doesn’t match the input requirements.

57. Recall Score

Calculates Recall = (# relevant retrieved) / (# relevant total). Mapping:
  • reference: string - The reference set
  • hypothesis: string - The retrieved set
Output: Returns a recall score between 0 and 1.
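A quick illustration of the formula in Python:
relevant = {"doc1", "doc2", "doc3", "doc4"}   # all relevant items
retrieved = {"doc1", "doc3", "doc5"}          # items the system retrieved

recall = len(relevant & retrieved) / len(relevant)
print(recall)  # 2 relevant retrieved / 4 relevant total = 0.5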

58. Levenshtein Similarity

Measures the number of edits (insertions, deletions, or substitutions) needed to transform the generated text into the reference text. The comparison is case-insensitive and punctuation-insensitive, and the result is returned as a normalized similarity. Mapping:
  • response: string - Model-generated output to be evaluated
  • expected_text: string - Reference string against which the output is compared
Output: Returns a normalized similarity score between 0 and 1. A score of 1.0 means a perfect match.
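For intuition, a minimal sketch of a normalized Levenshtein similarity (the eval’s exact preprocessing and normalization may differ):
def levenshtein_similarity(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()  # case-insensitive, per the eval description
    if not a and not b:
        return 1.0
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

print(levenshtein_similarity("kitten", "sitting"))  # 1 - 3/7 ≈ 0.571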

59. Numeric Similarity

Extracts numeric values from generated text and computes the normalized difference from the reference number. Returns the normalized numeric similarity. Mapping:
  • response: string - Model-generated output to be evaluated
  • expected_text: string - Reference string against which the output is compared
Output: Returns a normalized numeric similarity score between 0 and 1.
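The mechanics might look roughly like this (a sketch; the eval’s actual extraction and normalization are internal):
import re

def numeric_similarity(response: str, expected_text: str) -> float:
    # pull the first number out of each string (illustrative extraction)
    nums = [float(m.group()) for s in (response, expected_text)
            for m in [re.search(r"-?\d+(?:\.\d+)?", s)] if m]
    if len(nums) != 2:
        return 0.0
    got, want = nums
    if got == want:
        return 1.0
    # normalize the absolute difference by the larger magnitude
    return max(0.0, 1.0 - abs(got - want) / max(abs(got), abs(want)))

print(numeric_similarity("The total is 98", "100"))  # 1 - 2/100 = 0.98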

60. Embedding Similarity

Measures the cosine semantic similarity between the generated text and the reference text. Mapping:
  • response: string - Model-generated output to be evaluated
  • expected_text: string - Reference string against which the output is compared
Output: Returns a score between 0 and 1 representing semantic similarity. Higher values indicate stronger similarity.
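Conceptually this is a cosine similarity over embedding vectors (how scores outside [0, 1] are handled is internal to the eval). A minimal sketch with NumPy, assuming you already have embeddings from any model:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 3-dimensional "embeddings" for illustration
response_vec = np.array([0.2, 0.8, 0.1])
expected_vec = np.array([0.25, 0.75, 0.05])
print(round(cosine_similarity(response_vec, expected_vec), 3))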

61. Semantic List Contains

Checks if the generated response semantically contains one or more phrases from a reference list. Mapping:
  • response: string - Model-generated output to be evaluated
  • expected_text: string or List[string] - Reference phrases or keywords
Output: Returns a score between 0 and 1. Scores closer to 1.0 mean the match criteria are better satisfied; scores closer to 0.0 mean they are not.

62. Is AI Generated Image

Evaluates whether the given image was generated by AI. Mapping:
  • input_image: string - The input image to be evaluated
Output: Returns a score indicating the likelihood that the image is AI-generated.