- eval_name: The evaluation to run on the spans.
- type: Specifies where to apply the evaluation.
- value: Identifies the kind of span to evaluate.
- mapping: Contains the mapping of the required inputs of the eval. Learn more →
- custom_eval_name: Custom name to assign to the eval tag.
- model: Model name to be assigned, especially in the case of Future AGI evals.

Adding Custom Evals

For custom_built evals, the name of the custom eval should be entered as a string.

Understanding the Mapping Attribute
The mapping attribute is a crucial component that connects eval requirements with your data. Here’s how it works:

- Each eval has required keys: Different evaluations require different inputs. For example, the Context Adherence eval requires both context and output keys.
- Spans contain attributes: Your spans (LLM spans, retriever spans, etc.) have attributes that store information as key-value pairs, also known as span attributes.
- Mapping connects them: The mapping object specifies which span attribute should be used for each required key.

For example:
- The output key required by the eval will use data from the span attribute llm.output_messages.0.message.content
- The context input will use data from the span attribute llm.input_messages.1.message.content
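Putting the fields and the mapping together, an eval tag can be pictured as a plain dictionary. This is only an illustrative sketch: the span attribute paths come from the example above, but the concrete field values (eval name, span kind, model) are placeholders, and the real SDK uses its own classes rather than a raw dict.

```python
# Illustrative eval-tag configuration mirroring the fields listed above.
# A plain-dict sketch, NOT the SDK's actual API; all values are placeholders.
eval_tag = {
    "eval_name": "context_adherence",       # which evaluation to run (hypothetical value)
    "type": "observation_span",             # where to apply the evaluation (assumed value)
    "value": "llm",                         # kind of span to evaluate (assumed value)
    "mapping": {                            # eval's required keys -> span attributes
        "output": "llm.output_messages.0.message.content",
        "context": "llm.input_messages.1.message.content",
    },
    "custom_eval_name": "my_context_check", # custom tag name (hypothetical)
    "model": "gpt-4o-mini",                 # model to run the eval with (hypothetical)
}
```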
Below is the list of evals Future AGI provides, along with their corresponding mappings and configuration parameters.
1. Conversation Coherence

Assesses whether a dialogue maintains logical flow and contextual consistency throughout all exchanges. Learn more →

Mapping:
- messages: Data that contains the complete conversation, represented as an array of user and assistant messages.

2. Conversation Resolution

Checks if a conversation reaches a satisfactory conclusion that addresses the user’s initial query or intent. Learn more →

Mapping:
- messages: Data that contains the complete conversation, represented as an array of user and assistant messages.
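For the two conversation evals above, the messages mapping expects the full dialogue as an array of user and assistant messages. The shape below is an illustrative assumption based on the common role/content chat-message convention:

```python
# Illustrative shape of the `messages` payload for conversation evals.
# Role/content keys follow the common chat-message convention (an assumption).
messages = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."},
    {"role": "user", "content": "Thanks, that worked!"},
]

# The mapped span attribute would hold this full array of turns.
roles = [m["role"] for m in messages]
```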
3. Content Moderation

Identifies and flags potentially harmful, unsafe, or prohibited content in text outputs. Learn more →

Mapping:
- text: string - The text to be evaluated for content moderation
4. Context Adherence

Checks if a response stays strictly within the bounds of provided context without introducing external information. Learn more →

Mapping:
- context: string - The context provided to the AI system.
- output: string - The output generated by the AI system.
5. Context Relevance

Verifies that content is meaningfully related to the provided context and addresses the query appropriately. Learn more →

Mapping:
- context: string - The context provided to the AI system.
- input: string - The input to the AI system.

6. Completeness

Analyzes whether an output fully addresses all aspects of the input request or task. Learn more →

Mapping:
- input: string - The input to the AI system.
- output: string - The output generated by the AI system.
7. PII

Detects Personally Identifiable Information in the response, to protect user privacy and ensure compliance. Learn more →

Mapping:
- input: string - The input to be evaluated for PII

8. Toxicity

Detects the presence of toxic, harmful, or offensive language in the text. Learn more →

Mapping:
- input: string - The input to be evaluated for toxicity
9. Tone

Evaluates the emotional quality and overall sentiment expressed in the content. Learn more →

Mapping:
- input: string - The input to be evaluated for tone

10. Sexist

Detects gender-biased or discriminatory language in the text. Learn more →

Mapping:
- input: string - The input to be evaluated for sexism

11. Prompt Injection

Identifies attempts to manipulate the model through crafted inputs that try to override instructions. Learn more →

Mapping:
- input: string - The input to the AI system
12. Prompt Instruction Adherence

Checks if outputs follow specific instructions provided in the prompt. Learn more →

Mapping:
- output: string - The output generated by the AI system.

13. Data Privacy Compliance

Ensures content adheres to data protection standards and privacy regulations. Learn more →

Mapping:
- input: string - The input to be evaluated
14. Is Json

Validates whether text output is properly formatted as valid JSON. Learn more →

Mapping:
- text: string - The text to be evaluated
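A deterministic check like Is Json can be approximated locally with the standard library. This is a sketch of the idea, not the platform's implementation:

```python
import json

def is_json(text: str) -> bool:
    """Return True if `text` parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        # json.loads raises a ValueError subclass on malformed input
        return False
```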
15. One Line

Checks that the entire response is contained in a single line. Learn more →

Mapping:
- text: string - The text to be evaluated

16. Contains Valid Link

Confirms if the response contains at least one valid URL. Learn more →

Mapping:
- text: string - The text to be evaluated

17. No Valid Links

Checks that no valid links are present in the response. Learn more →

Mapping:
- text: string - The text to be evaluated

18. Is Email

Checks if the response is a properly formatted email address. Learn more →

Mapping:
- text: string - The text to be evaluated
19. Summary Quality

Checks if a summary captures the main points accurately and succinctly. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system

20. Factual Accuracy

Determines if the response is factually correct based on the context provided. Learn more →

Mapping:
- input: string - The input provided to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system

21. Translation Accuracy

Evaluates the accuracy of translated content. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The translated output generated by the AI system
22. Cultural Sensitivity

Assesses given text for inclusivity and cultural awareness. Learn more →

Mapping:
- input: string - The input to be evaluated for cultural sensitivity

23. Bias Detection

Detects the presence of bias or unfairness in the text. Learn more →

Mapping:
- input: string - The input to be evaluated for bias

24. LLM Function Calling

Checks if the output properly uses a function/tool call with correct parameters. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
25. Groundedness

Evaluates if the response is grounded in the provided context. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system

26. Audio Transcription

Analyzes the transcription accuracy of the given audio and its transcription. Learn more →

Mapping:
- input audio: string - The URL of the audio to be evaluated
- input transcription: string - The transcription to be evaluated against the audio

27. Audio Quality

Evaluates the quality of the given audio. Learn more →

Mapping:
- input audio: string - The URL of the audio to be evaluated
28. Chunk Attribution

Tracks if the context chunk is used in generating the response. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system

29. Chunk Utilization

Measures how effectively context chunks are used in responses. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system
30. Eval Ranking

Provides a ranking score for each context based on specified criteria. Learn more →

Mapping:
- input: string - The input to the AI system
- context: string - The context provided to the AI system
31. No Racial Bias

Ensures that the output does not contain or imply racial bias, stereotypes, or preferential treatment.

Mapping:
- input: string - The input provided

32. No Gender Bias

Checks that the response does not reinforce gender stereotypes or exhibit bias based on gender identity.

Mapping:
- input: string - The input provided

33. No Age Bias

Evaluates if the content is free from stereotypes, discrimination, or assumptions based on age.

Mapping:
- input: string - The input provided

34. No OpenAI Reference

Ensures that the model response does not mention being an OpenAI model or reference its training data or providers.

Mapping:
- input: string - The input provided

35. No Apologies

Checks if the model unnecessarily apologizes, e.g., ‘I’m sorry, but…’

Mapping:
- input: string - The input provided

36. Is Polite

Ensures that the output maintains a respectful, kind, and non-aggressive tone.

Mapping:
- input: string - The input provided
37. Is Concise

Measures whether the answer is brief and to the point, avoiding redundancy.

Mapping:
- input: string - The input to be evaluated for conciseness

38. Is Helpful

Evaluates whether the response answers the user’s question effectively.

Mapping:
- input: string - The user’s question
- output: string - The response to be evaluated

39. Is Code

Checks whether the output is valid code or contains expected code snippets.

Mapping:
- input: string - The input to be evaluated for code validity

40. Fuzzy Match

Compares model output with an expected answer using approximate (not exact) matching.

Mapping:
- input: string - The output to be evaluated
- output: string - The expected answer to compare against
41. Answer Refusal

Checks if the model correctly refuses to answer when prompted with harmful, sensitive, or restricted queries.

Mapping:
- input: string - The harmful/sensitive query
- output: string - The model’s response

42. Detect Hallucination

Identifies if the model fabricated facts or added information that was not present in the input or reference.

Mapping:
- input: string - The original input/reference
- output: string - The response to be evaluated for hallucinations
43. No Harmful Therapeutic Guidance

Ensures that the model does not provide potentially harmful psychological or therapeutic advice.

Mapping:
- input: string - The input to be evaluated for harmful therapeutic guidance

44. Clinically Inappropriate Tone

Evaluates whether the model’s tone is unsuitable for clinical or mental health contexts.

Mapping:
- input: string - The input to be evaluated for clinical tone appropriateness

45. Is Harmful Advice

Detects whether the model gives advice that could be physically, emotionally, legally, or financially harmful.

Mapping:
- input: string - The input to be evaluated for harmful advice

46. Content Safety Violation

A broad check for content that violates safety or usage policies, including toxicity, hate speech, explicit content, violence, etc.

Mapping:
- input: string - The input to be evaluated for content moderation
47. Is Good Summary

Evaluates if a summary is clear, well-structured, and includes the most important points from the source material.

Mapping:
- input: string - The source material
- output: string - The summary to be evaluated

48. Is Factually Consistent

Checks if the generated output is factually consistent with the source/context (e.g., input text or documents).

Mapping:
- input: string - The source/context material
- output: string - The output to be evaluated for factual consistency

49. Is Compliant

Ensures that the output adheres to legal, regulatory, or organizational policies (e.g., HIPAA, GDPR, company rules).

Mapping:
- input: string - The input to be evaluated for compliance

50. Is Informal Tone

Detects whether the tone is informal or casual (e.g., use of slang, contractions, emoji).

Mapping:
- input: string - The input to be evaluated for tone formality
51. Evaluate Function Calling

Tests if the model correctly identifies when to trigger a tool/function and includes the right arguments in the function call.

Mapping:
- input: string - The user’s request
- output: string - The function call to be evaluated

52. Task Completion

Measures whether the model fulfilled the user’s request accurately and completely.

Mapping:
- input: string - The user’s request
- output: string - The model’s response to be evaluated

53. Caption Hallucination

Evaluates whether image captions or descriptions contain factual inaccuracies or hallucinated details that are not present in the instruction.

Mapping:
- input: string - The user’s request
- output: string - The model’s response to be evaluated
54. Bleu Score

Computes a BLEU score between the expected gold answer and the model output.

Mapping:
- reference: string - The reference answer
- hypothesis: string - The model output

55. Rouge Score

Computes a ROUGE score between the expected gold answer and the model output.

Mapping:
- reference: string - The reference answer
- hypothesis: string - The model output
56. Text to SQL

Evaluates if the model correctly converts natural language text into valid and accurate SQL queries.

Mapping:
- input: string - The input text to be evaluated
- output: string - The output to be evaluated
57. Recall Score

Calculates Recall = (# relevant retrieved) / (# relevant total).

Mapping:
- reference: string - The reference set
- hypothesis: string - The retrieved set
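The recall formula above can be sketched over sets. How the platform tokenizes the reference and retrieved strings into items is an assumption here; the formula itself is as stated:

```python
def recall_score(reference: set, retrieved: set) -> float:
    """Recall = (# relevant items retrieved) / (# relevant items total)."""
    if not reference:
        return 0.0  # no relevant items to recover
    return len(reference & retrieved) / len(reference)
```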
58. Levenshtein Similarity

Measures the number of edits (insertions, deletions, or substitutions) needed to transform the generated text into the reference text. It is case-insensitive and punctuation-insensitive, and returns a normalized similarity.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string - Reference string against which the output is compared
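The metric described above can be sketched in pure Python. The case and punctuation normalization mirrors the eval's description; the exact normalization formula the platform uses is an assumption (here: 1 minus edit distance over the longer string's length):

```python
import string

def levenshtein_similarity(response: str, expected_text: str) -> float:
    """Normalized Levenshtein similarity: 1 - (edit distance / max length).

    Case- and punctuation-insensitive, per the eval's description; the
    exact normalization the platform applies is an assumption.
    """
    def normalize(s: str) -> str:
        return s.lower().translate(str.maketrans("", "", string.punctuation))

    a, b = normalize(response), normalize(expected_text)
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```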
59. Numeric Similarity

Extracts numeric values from generated text and computes the normalized difference from the reference number. Returns the normalized numeric similarity.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string - Reference string against which the output is compared
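A minimal sketch of this metric: pull the first number out of each string and normalize the difference by the larger magnitude. The extraction regex and normalization rule are assumptions, not the platform's exact implementation:

```python
import re

def numeric_similarity(response: str, expected_text: str) -> float:
    """1 - |a - b| / max(|a|, |b|), clamped to [0, 1].

    Extraction and normalization rules are illustrative assumptions.
    """
    def first_number(s):
        # Match an optionally signed integer or decimal.
        m = re.search(r"-?\d+(?:\.\d+)?", s)
        return float(m.group()) if m else None

    a, b = first_number(response), first_number(expected_text)
    if a is None or b is None:
        return 0.0  # nothing to compare
    if a == b:
        return 1.0  # also covers a == b == 0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))
```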
60. Embedding Similarity

Measures the cosine semantic similarity between the generated text and the reference text.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string - Reference string against which the output is compared
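The underlying metric is plain cosine similarity between the two texts' embedding vectors. The embedding model runs platform-side; this sketch only shows the metric itself:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)
```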
61. Semantic List Contains

Checks if the generated response semantically contains one or more phrases from a reference list.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string or List[string] - Reference phrases or keywords
62. Is AI Generated Image

Evaluates if the given image is generated by AI or not.

Mapping:
- input_image: string - The input image to be evaluated