Evaluations
Using the Future AGI Python SDK for running evaluations, listing available evaluators, and configuring the Evaluator client.
The Future AGI Python SDK provides an Evaluator class to programmatically run evaluations on your data and language model outputs. This page describes how to initialize the Evaluator client, run evaluations, and list the available evaluation templates.
Evaluator
An evaluator is an abstraction used for running evaluations on your data and model outputs.
Initialization
Initializes the Evaluator client. API keys and the base URL can be provided directly or will be read from the environment variables (FI_API_KEY, FI_SECRET_KEY, FI_BASE_URL) if not specified.
Arguments:
fi_api_key (Optional[str], optional): API key. Defaults to None.
fi_secret_key (Optional[str], optional): Secret key. Defaults to None.
fi_base_url (Optional[str], optional): Base URL. Defaults to None.
**kwargs:
timeout (Optional[int]): Timeout value in seconds. Default: 200.
max_queue_bound (Optional[int]): Maximum queue size. Default: 5000.
max_workers (Optional[int]): Maximum number of workers. Default: 8.
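For reference, a minimal initialization sketch is shown below. The import path (fi.evals) is an assumption; adjust it to match the package installed in your environment.

```python
# Minimal initialization sketch. The import path is an assumption;
# adjust it to match your installed Future AGI package.
import os

from fi.evals import Evaluator  # assumed import path

# Keys may be passed explicitly, or omitted to fall back to the
# FI_API_KEY / FI_SECRET_KEY / FI_BASE_URL environment variables.
evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
    timeout=200,           # request timeout in seconds (client default)
    max_queue_bound=5000,  # maximum queue size (client default)
    max_workers=8,         # maximum number of workers (client default)
)
```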
evaluate
Runs a single evaluation or a batch of evaluations independently.
Arguments:
eval_templates (Union[str, EvalTemplate, List[EvalTemplate]]): A single evaluation template or a list of evaluation templates.
inputs (Union[TestCase, List[TestCase], Dict[str, Any], List[Dict[str, Any]]]): A single test case or a list of test cases. Supports various TestCase types.
timeout (Optional[int], optional): Timeout value in seconds for the evaluation. Defaults to None (uses the client’s default timeout).
model_name (Optional[str], optional): Model name to use for the evaluation when using Future AGI Built Evals. Defaults to None.
When running Future AGI Built Evals, you must specify the model name; otherwise, the SDK raises an error.
Returns:
BatchRunResult: An object containing the results of the evaluation(s).
Raises:
ValidationError: If the inputs do not match the evaluation templates.
Exception: If the API request fails or other errors occur during evaluation.
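A hedged sketch of a single evaluation run is shown below, reusing the evaluator client created above. The template name "Toxicity" is taken from the catalogue later on this page, and the dictionary key used in inputs is an assumption; confirm the expected fields for your chosen template, for example via list_evaluations().

```python
# Hedged sketch of a single evaluation run, reusing the `evaluator`
# client from the initialization example above.
result = evaluator.evaluate(
    eval_templates="Toxicity",                      # template from the catalogue below
    inputs={"input": "Text to be scored goes here."},  # key name is an assumption
    model_name="gpt-4o-mini",                       # required for Future AGI Built Evals
)

# `result` is a BatchRunResult holding the outcome of each evaluation.
print(result)
```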
list_evaluations
Fetches information about all available evaluation templates.
Returns:
List[Dict[str, Any]]: A list of dictionaries, where each dictionary contains information about an available evaluation template. This typically includes details such as the template’s id, name, description, and expected parameters.
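For example, the snippet below lists the available templates using the client created earlier and prints the name and description fields described above; treat the exact key names as subject to the returned payload.

```python
# Fetch metadata for all available evaluation templates.
templates = evaluator.list_evaluations()

for template in templates:
    # Each entry is a plain dict; "name" and "description" are documented
    # above, but verify the keys against the actual payload.
    print(template.get("name"), "-", template.get("description"))
```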
eval_templates
The list of templates that can be used to evaluate your data.
Conversation Coherence
Evaluates if a conversation flows logically and maintains context throughout
Conversation Resolution
Checks if the conversation reaches a satisfactory conclusion or resolution. The conversation must have at least two users.
Content Moderation
Uses OpenAI’s content moderation to evaluate text safety
Context Adherence
Measures how well responses stay within the provided context
Context Relevance
Evaluates the relevancy of the context to the query
Completeness
Evaluates if the response completely answers the query
Chunk Attribution
Tracks if the context chunk is used in generating the response.
Chunk Utilization
Measures how effectively context chunks are used in responses
PII
Detects personally identifiable information (PII) in text.
Toxicity
Evaluates content for toxic or harmful language
Tone
Analyzes the tone and sentiment of content
Sexist
Detects sexist content and gender bias
Prompt Injection
Evaluates text for potential prompt injection attempts
Not Gibberish Text
Checks if the text is not gibberish
Safe for Work text
Evaluates if the text is safe for work.
Prompt Instruction Adherence
Assesses how closely the output follows the given prompt instructions, checking for completion of all requested tasks and adherence to specified constraints or formats. Evaluates both explicit and implicit requirements in the prompt.
Data Privacy Compliance
Checks output for compliance with data privacy regulations (GDPR, HIPAA, etc.). Identifies potential privacy violations, sensitive data exposure, and adherence to privacy principles.
Is Json
Validates if content is proper JSON format
One Line
Checks if the text is a single line
Contains Valid Link
Checks for presence of valid URLs
Is Email
Validates email address format
No Valid Links
Checks if the text contains no invalid URLs
Eval Ranking
Provides ranking score for each context based on specified criteria.
Summary Quality
Evaluates if a summary effectively captures the main points, maintains factual accuracy, and achieves appropriate length while preserving the original meaning. Checks for both inclusion of key information and exclusion of unnecessary details.
Factual Accuracy
Verifies if the provided output is factually correct or not.
Translation Accuracy
Evaluates the quality of translation by checking semantic accuracy, cultural appropriateness, and preservation of original meaning. Considers both literal accuracy and natural expression in the target language.
Cultural Sensitivity
Analyzes output for cultural appropriateness, inclusive language, and awareness of cultural nuances. Identifies potential cultural biases or insensitive content.
Bias Detection
Identifies various forms of bias including gender, racial, cultural, or ideological bias in the output. Evaluates for balanced perspective and neutral language use.
Evaluate LLM Function calling
Assesses accuracy and effectiveness of LLM function calls.
Audio Transcription
Analyzes the transcription accuracy of the given audio and its transcription.
Audio Quality
Evaluates the quality of the given audio.
Protect Flash
FutureAGI’s proprietary evaluator to check if the content is harmful
No Racial Bias
Ensures that the output does not contain or imply racial bias, stereotypes, or preferential treatment.
No Gender Bias
Checks that the response does not reinforce gender stereotypes or exhibit bias based on gender identity.
No Age Bias
Evaluates if the content is free from stereotypes, discrimination, or assumptions based on age.
No Openai Reference
Ensures that the model response does not mention being an OpenAI model or reference its training data or providers.
No Apologies
Checks if the model unnecessarily apologizes, e.g., ‘I’m sorry, but…’
Is Polite
Ensures that the output maintains a respectful, kind, and non-aggressive tone.
Is Concise
Measures whether the answer is brief and to the point, avoiding redundancy.
Is Helpful
Evaluates whether the response answers the user’s question effectively.
Is Code
Checks whether the output is valid code or contains expected code snippets.
Fuzzy Match
Compares model output with an expected answer using approximate (not exact) matching.
Answer Refusal
Checks if the model correctly refuses to answer when prompted with harmful, sensitive, or restricted queries.
Detect Hallucination
Identifies if the model fabricated facts or added information that was not present in the input or reference.
No Harmful Therapeutic Guidance
Ensures that the model does not provide potentially harmful psychological or therapeutic advice.
Clinically Inappropriate Tone
Evaluates whether the model’s tone is unsuitable for clinical or mental health contexts.
Is Harmful Advice
Detects whether the model gives advice that could be physically, emotionally, legally, or financially harmful.
Content Safety Violation
A broad check for content that violates safety or usage policies—this includes toxicity, hate speech, explicit content, violence, etc.
Is Good Summary
Evaluates if a summary is clear, well-structured, and includes the most important points from the source material.
Is Factually Consistent
Checks if the generated output is factually consistent with the source/context (e.g., input text or documents).
Is Compliant
Ensures that the output adheres to legal, regulatory, or organizational policies (e.g., HIPAA, GDPR, company rules).
Is Informal Tone
Detects whether the tone is informal or casual (e.g., use of slang, contractions, emoji).
Evaluate Function Calling
Tests if the model correctly identifies when to trigger a tool/function and includes the right arguments in the function call.
Task Completion
Measures whether the model fulfilled the user’s request accurately and completely.
Caption Hallucination
Evaluates whether image captions or descriptions contain factual inaccuracies or hallucinated details that are not present in the instruction. This metric helps ensure that AI-generated image descriptions remain faithful to the instruction content.
Bleu Score
Computes a BLEU score between the expected gold answer and the model output.
Aggregated Metric
Combines multiple evaluation metrics into a single normalised score.
ROUGE Score
Calculates the ROUGE score between generated text and reference text.
Numerical Difference
Calculates the numerical difference between a generated value and a reference value.
Levenshtein Distance
Calculates the edit distance between generated text and reference text.
Embedding Similarity
Calculates the semantic similarity between generated text and reference text.
Semantic List Contains
Checks if the text contains phrases semantically similar to the reference phrases.
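As an illustration of combining the catalogue above with evaluate, the sketch below runs the Completeness template over a small batch of dictionary inputs. The field names (query, response) are assumptions; check the template’s expected parameters via list_evaluations() before relying on them.

```python
# Hedged batch sketch: score several outputs with one catalogue template,
# reusing the `evaluator` client from the initialization example.
batch_result = evaluator.evaluate(
    eval_templates="Completeness",  # single template, applied to every input
    inputs=[
        {"query": "What is the capital of France?", "response": "Paris."},
        {"query": "Who wrote Hamlet?", "response": "William Shakespeare."},
    ],
    model_name="gpt-4o-mini",  # required when running Future AGI Built Evals
)

# The BatchRunResult contains one result per input test case.
print(batch_result)
```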