Evaluations
Using the Future AGI Python SDK for running evaluations, listing available evaluators, and configuring the Evaluator client.
The Future AGI Python SDK provides an Evaluator class to programmatically run evaluations on your data and language model outputs. This document details its usage.
Evaluator
An evaluator is an abstraction used for running evaluations on your data and model outputs.
Initialization
Initializes the Evaluator client. API keys and base URL can be provided directly or will be read from environment variables (FI_API_KEY, FI_SECRET_KEY, FI_BASE_URL) if not specified.
Arguments:
- fi_api_key (Optional[str], optional): API key. Defaults to None.
- fi_secret_key (Optional[str], optional): Secret key. Defaults to None.
- fi_base_url (Optional[str], optional): Base URL. Defaults to None.
- **kwargs:
  - timeout (Optional[int]): Timeout value in seconds. Default: 200.
  - max_queue_bound (Optional[int]): Maximum queue size. Default: 5000.
  - max_workers (Optional[int]): Maximum number of workers. Default: 8.
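For illustration, a minimal construction sketch, assuming the client is importable as fi.evals.Evaluator (verify the import path against your installed SDK version):

```python
import os

from fi.evals import Evaluator  # assumed import path; check your SDK version

# Keys can be passed explicitly or picked up automatically from the
# FI_API_KEY / FI_SECRET_KEY / FI_BASE_URL environment variables.
evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
    timeout=200,           # seconds (client default)
    max_queue_bound=5000,  # maximum queue size (client default)
    max_workers=8,         # worker threads (client default)
)
```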
evaluate
Runs a single evaluation or a batch of evaluations independently.
Arguments:
- eval_templates (Union[EvalTemplate, List[EvalTemplate]]): A single evaluation template or a list of evaluation templates.
- inputs (Union[TestCase, List[TestCase], LLMTestCase, List[LLMTestCase], MLLMTestCase, List[MLLMTestCase], ConversationalTestCase, List[ConversationalTestCase]]): A single test case or a list of test cases. Supports various TestCase types.
- timeout (Optional[int], optional): Timeout value in seconds for the evaluation. Defaults to None (uses the client's default timeout).
Returns:
BatchRunResult: An object containing the results of the evaluation(s).
Raises:
- ValidationError: If the inputs do not match the evaluation templates.
- Exception: If the API request fails or other errors occur during evaluation.
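As a rough example, a single evaluation run might look like the sketch below. The Toxicity template name comes from the catalog later on this page; the import paths and the TestCase field name (input) are assumptions to check against the installed SDK:

```python
from fi.evals import Evaluator             # assumed import paths;
from fi.evals.templates import Toxicity    # check your SDK version
from fi.testcases import TestCase

evaluator = Evaluator()  # reads FI_API_KEY / FI_SECRET_KEY from the environment

# `input` is an assumed field name; use whatever fields the chosen template expects.
test_case = TestCase(input="Our support team is available 24/7 to help you.")

# A single template/test case or lists of them are both accepted.
result = evaluator.evaluate(
    eval_templates=Toxicity(),
    inputs=test_case,
    timeout=60,  # per-call override of the client's default timeout
)

# BatchRunResult aggregates the outcome of each (template, test case) pair.
print(result)
```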
list_evaluations
Fetches information about all available evaluation templates.
Returns:
List[Dict[str, Any]]: A list of dictionaries, where each dictionary contains information about an available evaluation template. This typically includes details like the template's id, name, description, and expected parameters.
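A short usage sketch, with the same assumed import path as above; the dictionary keys follow the "typically includes" note and may vary:

```python
from fi.evals import Evaluator  # assumed import path; check your SDK version

evaluator = Evaluator()

# Each entry is a dict that typically carries the template's id, name,
# description, and expected parameters (key names may vary).
for template_info in evaluator.list_evaluations():
    print(template_info.get("name"), "-", template_info.get("description"))
```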
eval_templates
The list of templates that can be used to evaluate your data.
Conversation Coherence
Evaluates if a conversation flows logically and maintains context throughout.
Conversation Resolution
Checks if the conversation reaches a satisfactory conclusion or resolution. The conversation must have at least two users.
Deterministic Evals
Evaluates if the output is deterministic or not.
Content Moderation
Uses OpenAI’s content moderation to evaluate text safety.
Context Adherence
Measures how well responses stay within the provided context.
Prompt Perplexity
Measures how well the model understands and processes the input prompt by calculating the perplexity of its output, reflecting the model's confidence in its response and its ability to stay coherent and consistent with the prompt. A higher value indicates higher confidence.
Context Relevance
Evaluates the relevancy of the context to the query.
Completeness
Evaluates if the response completely answers the query.
Chunk Attribution
Tracks if the context chunk is used in generating the response.
Chunk Utilization
Measures how effectively context chunks are used in responses.
Context Similarity
Compares similarity between provided and expected context.
PII
Detects personally identifiable information (PII) in text.
Toxicity
Evaluates content for toxic or harmful language.
Tone
Analyzes the tone and sentiment of content.
Sexist
Detects sexist content and gender bias.
Prompt Injection
Evaluates text for potential prompt injection attempts.
Not Gibberish text
Checks if the text is not gibberish.
Safe for Work text
Evaluates if the text is safe for work.
Prompt/Instruction Adherence
Assesses how closely the output follows the given prompt instructions, checking for completion of all requested tasks and adherence to specified constraints or formats. Evaluates both explicit and implicit requirements in the prompt.
Data Privacy Compliance
Checks output for compliance with data privacy regulations (GDPR, HIPAA, etc.). Identifies potential privacy violations, sensitive data exposure, and adherence to privacy principles.
Is Json
Validates if the content is in proper JSON format.
Ends With
Checks if the text ends with a specific substring.
Equals
Checks if two texts are exactly equal.
Contains All
Verifies that the text contains all specified keywords.
Length Less Than
Checks if the text length is below a specified threshold.
Contains None
Verifies that the text contains none of the specified terms.
Regex
Checks if the text matches a specified regex pattern.
Starts With
Checks if the text begins with a specific substring.
API Call
Makes an API call and evaluates the response.
Length Between
Checks if the text length is between specified min and max values.
Custom Code Evaluation
Executes custom Python code for evaluation.
Agent as a Judge
Uses AI agents for content evaluation.
Json Schema Validation
Validates JSON against specified criteria.
One Line
Checks if the text is a single line.
Contains Valid Link
Checks for the presence of valid URLs.
Is Email
Validates email address format.
Length Greater Than
Checks if the text length is greater than a specified value.
No Invalid Links
Checks if the text contains no invalid URLs.
Contains
Checks if the text contains a specific keyword.
Contains Any
Checks if the text contains any of the specified keywords.
Answer Similarity
Evaluates the similarity between the expected and actual responses.
Eval Output
Scores linkage between input and output based on specified criteria.
Eval Context Retrieval Quality
Assesses the quality of the retrieved context.
Eval Ranking
Provides ranking score for each context based on specified criteria.
Eval Image Instruction (text to image)
Scores image-instruction linkage based on specified criteria.
Score Eval
Scores linkage between instruction, input images, and output images.
Summary Quality
Evaluates if a summary effectively captures the main points, maintains factual accuracy, and achieves appropriate length while preserving the original meaning. Checks for both inclusion of key information and exclusion of unnecessary details.
Factual Accuracy
Verifies if the provided output is factually correct or not.
Translation Accuracy
Evaluates the quality of translation by checking semantic accuracy, cultural appropriateness, and preservation of original meaning. Considers both literal accuracy and natural expression in the target language.
Cultural Sensitivity
Analyzes output for cultural appropriateness, inclusive language, and awareness of cultural nuances. Identifies potential cultural biases or insensitive content.
Bias Detection
Identifies various forms of bias including gender, racial, cultural, or ideological bias in the output. Evaluates for balanced perspective and neutral language use.
Evaluate LLM Function calling
Assesses the accuracy and effectiveness of LLM function calls.
Audio Transcription
Analyzes the accuracy of the given transcription against its source audio.
Eval Audio Description
Evaluates the given audio against its provided description.
Audio Quality
Evaluates the quality of the given audio.