- eval_name: The evaluation to run on the spans.
- type: Specifies where to apply the evaluation.
- value: Identifies the kind of span to evaluate.
- mapping: Contains the mapping of the required inputs of the eval. Learn more →
- custom_eval_name: Custom name to assign to the eval tag.
- model: Model name to be assigned, especially in the case of Future AGI evals.

Adding Custom Evals

For custom_built evals, the name of the custom eval should be entered as a string.

Understanding the Mapping Attribute
The mapping attribute is a crucial component that connects eval requirements with your data. Here’s how it works:

- Each eval has required keys: Different evaluations require different inputs. For example, the Context Adherence eval requires both context and output keys.
- Spans contain attributes: Your spans (LLM spans, retriever spans, etc.) have attributes that store information as key-value pairs, also known as span attributes.
- Mapping connects them: The mapping object specifies which span attribute should be used for each required key.

For example:
- The output key required by the eval will use data from the span attribute llm.output_messages.0.message.content
- The context input will use data from the span attribute llm.input_messages.1.message.content
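Putting the fields and the mapping together, an eval tag can be pictured as a plain dictionary. This is only an illustrative sketch: the span attribute paths come from the example above, but the concrete field values (eval name, span kind, model) are placeholders, and the real SDK uses its own classes rather than a raw dict.

```python
# Illustrative eval-tag configuration mirroring the fields listed above.
# A plain-dict sketch, NOT the SDK's actual API; all values are placeholders.
eval_tag = {
    "eval_name": "context_adherence",       # which evaluation to run (hypothetical value)
    "type": "observation_span",             # where to apply the evaluation (assumed value)
    "value": "llm",                         # kind of span to evaluate (assumed value)
    "mapping": {                            # eval's required keys -> span attributes
        "output": "llm.output_messages.0.message.content",
        "context": "llm.input_messages.1.message.content",
    },
    "custom_eval_name": "my_context_check", # custom tag name (hypothetical)
    "model": "gpt-4o-mini",                 # model to run the eval with (hypothetical)
}
```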
Below is the list of evals Future AGI provides, along with their corresponding mappings and configuration parameters.
1. Conversation Coherence

Assesses whether a dialogue maintains logical flow and contextual consistency throughout all exchanges. Learn more →

Mapping:
- messages: Data that contains the complete conversation, represented as an array of user and assistant messages.

2. Conversation Resolution

Checks if a conversation reaches a satisfactory conclusion that addresses the user’s initial query or intent. Learn more →

Mapping:
- messages: Data that contains the complete conversation, represented as an array of user and assistant messages.
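For the two conversation evals above, the messages mapping expects the full dialogue as an array of user and assistant messages. The shape below is an illustrative assumption based on the common role/content chat-message convention:

```python
# Illustrative shape of the `messages` payload for conversation evals.
# Role/content keys follow the common chat-message convention (an assumption).
messages = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."},
    {"role": "user", "content": "Thanks, that worked!"},
]

# The mapped span attribute would hold this full array of turns.
roles = [m["role"] for m in messages]
```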
3. Content Moderation

Identifies and flags potentially harmful, unsafe, or prohibited content in text outputs. Learn more →

Mapping:
- text: string - The text to be evaluated for content moderation
4. Context Adherence

Checks if a response stays strictly within the bounds of provided context without introducing external information. Learn more →

Mapping:
- context: string - The context provided to the AI system.
- output: string - The output generated by the AI system.
5. Context Relevance

Verifies that content is meaningfully related to the provided context and addresses the query appropriately. Learn more →

Mapping:
- context: string - The context provided to the AI system.
- input: string - The input to the AI system.

6. Completeness

Analyzes whether an output fully addresses all aspects of the input request or task. Learn more →

Mapping:
- input: string - The input to the AI system.
- output: string - The output generated by the AI system.
7. PII

Detects Personally Identifiable Information in the response, to protect user privacy and ensure compliance. Learn more →

Mapping:
- input: string - The input to be evaluated for PII

8. Toxicity

Detects the presence of toxic, harmful, or offensive language in the text. Learn more →

Mapping:
- input: string - The input to be evaluated for toxicity
9. Tone

Evaluates the emotional quality and overall sentiment expressed in the content. Learn more →

Mapping:
- input: string - The input to be evaluated for tone

10. Sexist

Detects gender-biased or discriminatory language in the text. Learn more →

Mapping:
- input: string - The input to be evaluated for sexism

11. Prompt Injection

Identifies attempts to manipulate the model through crafted inputs that try to override instructions. Learn more →

Mapping:
- input: string - The input to the AI system
12. Prompt Instruction Adherence

Checks if outputs follow specific instructions provided in the prompt. Learn more →

Mapping:
- output: string - The output generated by the AI system.

13. Data Privacy Compliance

Ensures content adheres to data protection standards and privacy regulations. Learn more →

Mapping:
- input: string - The input to be evaluated
14. Is Json

Validates whether text output is properly formatted as valid JSON. Learn more →

Mapping:
- text: string - The text to be evaluated
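A deterministic check like Is Json can be approximated locally with the standard library. This is a sketch of the idea, not the platform's implementation:

```python
import json

def is_json(text: str) -> bool:
    """Return True if `text` parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        # json.loads raises a ValueError subclass on malformed input
        return False
```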
15. One Line

Checks that the entire response is contained in a single line. Learn more →

Mapping:
- text: string - The text to be evaluated

16. Contains Valid Link

Confirms if the response contains at least one valid URL. Learn more →

Mapping:
- text: string - The text to be evaluated

17. No Valid Links

Checks that no valid links are present in the response. Learn more →

Mapping:
- text: string - The text to be evaluated

18. Is Email

Checks if the response is a properly formatted email address. Learn more →

Mapping:
- text: string - The text to be evaluated
19. Summary Quality

Checks if a summary captures the main points accurately and succinctly. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system

20. Factual Accuracy

Determines if the response is factually correct based on the context provided. Learn more →

Mapping:
- input: string - The input provided to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system

21. Translation Accuracy

Evaluates the accuracy of translated content. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The translated output generated by the AI system
22. Cultural Sensitivity

Assesses given text for inclusivity and cultural awareness. Learn more →

Mapping:
- input: string - The input to be evaluated for cultural sensitivity

23. Bias Detection

Detects the presence of bias or unfairness in the text. Learn more →

Mapping:
- input: string - The input to be evaluated for bias

24. LLM Function Calling

Checks if the output properly uses a function/tool call with correct parameters. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
25. Groundedness

Evaluates if the response is grounded in the provided context. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system

26. Audio Transcription

Analyzes the transcription accuracy of the given audio and its transcription. Learn more →

Mapping:
- input audio: string - The URL of the audio to be evaluated
- input transcription: string - The transcription to be evaluated against the audio

27. Audio Quality

Evaluates the quality of the given audio. Learn more →

Mapping:
- input audio: string - The URL of the audio to be evaluated
28. Chunk Attribution

Tracks if the context chunk is used in generating the response. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system

29. Chunk Utilization

Measures how effectively context chunks are used in responses. Learn more →

Mapping:
- input: string - The input to the AI system
- output: string - The output generated by the AI system
- context: string - The context provided to the AI system
30. Eval Ranking

Provides a ranking score for each context based on specified criteria. Learn more →

Mapping:
- input: string - The input to the AI system
- context: string - The context provided to the AI system
31. No Racial Bias

Ensures that the output does not contain or imply racial bias, stereotypes, or preferential treatment.

Mapping:
- input: string - The input provided

32. No Gender Bias

Checks that the response does not reinforce gender stereotypes or exhibit bias based on gender identity.

Mapping:
- input: string - The input provided

33. No Age Bias

Evaluates if the content is free from stereotypes, discrimination, or assumptions based on age.

Mapping:
- input: string - The input provided

34. No OpenAI Reference

Ensures that the model response does not mention being an OpenAI model or reference its training data or providers.

Mapping:
- input: string - The input provided

35. No Apologies

Checks if the model unnecessarily apologizes, e.g., ‘I’m sorry, but…’

Mapping:
- input: string - The input provided

36. Is Polite

Ensures that the output maintains a respectful, kind, and non-aggressive tone.

Mapping:
- input: string - The input provided
37. Is Concise

Measures whether the answer is brief and to the point, avoiding redundancy.

Mapping:
- input: string - The input to be evaluated for conciseness

38. Is Helpful

Evaluates whether the response answers the user’s question effectively.

Mapping:
- input: string - The user’s question
- output: string - The response to be evaluated

39. Is Code

Checks whether the output is valid code or contains expected code snippets.

Mapping:
- input: string - The input to be evaluated for code validity

40. Fuzzy Match

Compares model output with an expected answer using approximate (not exact) matching.

Mapping:
- input: string - The output to be evaluated
- output: string - The expected answer to compare against
41. Answer Refusal

Checks if the model correctly refuses to answer when prompted with harmful, sensitive, or restricted queries.

Mapping:
- input: string - The harmful/sensitive query
- output: string - The model’s response

42. Detect Hallucination

Identifies if the model fabricated facts or added information that was not present in the input or reference.

Mapping:
- input: string - The original input/reference
- output: string - The response to be evaluated for hallucinations
43. No Harmful Therapeutic Guidance

Ensures that the model does not provide potentially harmful psychological or therapeutic advice.

Mapping:
- input: string - The input to be evaluated for harmful therapeutic guidance

44. Clinically Inappropriate Tone

Evaluates whether the model’s tone is unsuitable for clinical or mental health contexts.

Mapping:
- input: string - The input to be evaluated for clinical tone appropriateness

45. Is Harmful Advice

Detects whether the model gives advice that could be physically, emotionally, legally, or financially harmful.

Mapping:
- input: string - The input to be evaluated for harmful advice

46. Content Safety Violation

A broad check for content that violates safety or usage policies, including toxicity, hate speech, explicit content, violence, etc.

Mapping:
- input: string - The input to be evaluated for content moderation
47. Is Good Summary

Evaluates if a summary is clear, well-structured, and includes the most important points from the source material.

Mapping:
- input: string - The source material
- output: string - The summary to be evaluated

48. Is Factually Consistent

Checks if the generated output is factually consistent with the source/context (e.g., input text or documents).

Mapping:
- input: string - The source/context material
- output: string - The output to be evaluated for factual consistency

49. Is Compliant

Ensures that the output adheres to legal, regulatory, or organizational policies (e.g., HIPAA, GDPR, company rules).

Mapping:
- input: string - The input to be evaluated for compliance

50. Is Informal Tone

Detects whether the tone is informal or casual (e.g., use of slang, contractions, emoji).

Mapping:
- input: string - The input to be evaluated for tone formality
51. Evaluate Function Calling

Tests if the model correctly identifies when to trigger a tool/function and includes the right arguments in the function call.

Mapping:
- input: string - The user’s request
- output: string - The function call to be evaluated

52. Task Completion

Measures whether the model fulfilled the user’s request accurately and completely.

Mapping:
- input: string - The user’s request
- output: string - The model’s response to be evaluated

53. Caption Hallucination

Evaluates whether image captions or descriptions contain factual inaccuracies or hallucinated details that are not present in the instruction.

Mapping:
- input: string - The user’s request
- output: string - The model’s response to be evaluated
54. Bleu Score

Computes a BLEU score between the expected gold answer and the model output.

Mapping:
- reference: string - The reference answer
- hypothesis: string - The model output

55. Rouge Score

Computes a ROUGE score between the expected gold answer and the model output.

Mapping:
- reference: string - The reference answer
- hypothesis: string - The model output
56. Text to SQL

Evaluates if the model correctly converts natural language text into valid and accurate SQL queries.

Mapping:
- input: string - The input text to be evaluated
- output: string - The output to be evaluated
57. Recall Score

Calculates Recall = (# relevant retrieved) / (# relevant total).

Mapping:
- reference: string - The reference set
- hypothesis: string - The retrieved set
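The recall formula above can be sketched over sets. How the platform tokenizes the reference and retrieved strings into items is an assumption here; the formula itself is as stated:

```python
def recall_score(reference: set, retrieved: set) -> float:
    """Recall = (# relevant items retrieved) / (# relevant items total)."""
    if not reference:
        return 0.0  # no relevant items to recover
    return len(reference & retrieved) / len(reference)
```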
58. Levenshtein Similarity

Measures the number of edits (insertions, deletions, or substitutions) needed to transform the generated text into the reference text. It is case-insensitive and punctuation-insensitive, and returns a normalized similarity.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string - Reference string against which the output is compared
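The metric described above can be sketched in pure Python. The case and punctuation normalization mirrors the eval's description; the exact normalization formula the platform uses is an assumption (here: 1 minus edit distance over the longer string's length):

```python
import string

def levenshtein_similarity(response: str, expected_text: str) -> float:
    """Normalized Levenshtein similarity: 1 - (edit distance / max length).

    Case- and punctuation-insensitive, per the eval's description; the
    exact normalization the platform applies is an assumption.
    """
    def normalize(s: str) -> str:
        return s.lower().translate(str.maketrans("", "", string.punctuation))

    a, b = normalize(response), normalize(expected_text)
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```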
59. Numeric Similarity

Extracts numeric values from generated text and computes the normalized difference from the reference number. Returns the normalized numeric similarity.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string - Reference string against which the output is compared
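A minimal sketch of this metric: pull the first number out of each string and normalize the difference by the larger magnitude. The extraction regex and normalization rule are assumptions, not the platform's exact implementation:

```python
import re

def numeric_similarity(response: str, expected_text: str) -> float:
    """1 - |a - b| / max(|a|, |b|), clamped to [0, 1].

    Extraction and normalization rules are illustrative assumptions.
    """
    def first_number(s):
        # Match an optionally signed integer or decimal.
        m = re.search(r"-?\d+(?:\.\d+)?", s)
        return float(m.group()) if m else None

    a, b = first_number(response), first_number(expected_text)
    if a is None or b is None:
        return 0.0  # nothing to compare
    if a == b:
        return 1.0  # also covers a == b == 0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))
```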
60. Embedding Similarity

Measures the cosine semantic similarity between the generated text and the reference text.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string - Reference string against which the output is compared
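The underlying metric is plain cosine similarity between the two texts' embedding vectors. The embedding model runs platform-side; this sketch only shows the metric itself:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)
```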
61. Semantic List Contains

Checks if the generated response semantically contains one or more phrases from a reference list.

Mapping:
- response: string - Model-generated output to be evaluated
- expected_text: string or List[string] - Reference phrases or keywords
62. Is AI Generated Image

Evaluates if the given image is generated by AI or not.

Mapping:
- input_image: string - The input image to be evaluated