The Future AGI Python SDK provides an Evaluator class to programmatically run evaluations on your data and language model outputs. This document details its usage.
Evaluator
class Evaluator(APIKeyAuth):
An evaluator is an abstraction used for running evaluations on your data and model outputs.
Initialization
Initializes the Evaluator client. API keys and base URL can be provided directly or will be read from environment variables (FI_API_KEY, FI_SECRET_KEY, FI_BASE_URL) if not specified.
def __init__(
self,
fi_api_key: Optional[str] = None,
fi_secret_key: Optional[str] = None,
fi_base_url: Optional[str] = None,
**kwargs,
) -> None:
Arguments:
fi_api_key (Optional[str], optional): API key. Defaults to None.
fi_secret_key (Optional[str], optional): Secret key. Defaults to None.
fi_base_url (Optional[str], optional): Base URL. Defaults to None.
**kwargs:
timeout (Optional[int]): Timeout value in seconds. Default: 200.
max_queue_bound (Optional[int]): Maximum queue size. Default: 5000.
max_workers (Optional[int]): Maximum number of workers. Default: 8.
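For example, the client can be constructed from environment variables or with explicit credentials (the key values below are placeholders and the base URL is illustrative, not the real endpoint):
import os
from fi.evals import Evaluator

# Option 1: rely on environment variables
os.environ["FI_API_KEY"] = "your-api-key"        # placeholder
os.environ["FI_SECRET_KEY"] = "your-secret-key"  # placeholder
evaluator = Evaluator()

# Option 2: pass credentials and client settings explicitly
evaluator = Evaluator(
    fi_api_key="your-api-key",              # placeholder
    fi_secret_key="your-secret-key",        # placeholder
    fi_base_url="https://api.example.com",  # illustrative base URL
    timeout=120,       # per-request timeout in seconds (default: 200)
    max_workers=4,     # worker pool size (default: 8)
)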
evaluate
Runs a single evaluation or a batch of evaluations independently.
def evaluate(
self,
eval_templates: Union[str, type[EvalTemplate]],
inputs: Union[
TestCase,
List[TestCase],
Dict[str, Any],
List[Dict[str, Any]],
],
timeout: Optional[int] = None,
model_name: Optional[str] = None
) -> BatchRunResult:
Arguments:
eval_templates (Union[str, EvalTemplate, List[EvalTemplate]]): A single evaluation template or a list of evaluation templates.
inputs (Union[TestCase, List[TestCase], Dict[str, Any], List[Dict[str, Any]]]): A single test case or a list of test cases. Supports the various TestCase types.
timeout (Optional[int], optional): Timeout value in seconds for the evaluation. Defaults to None (uses the client's default timeout).
model_name (Optional[str], optional): Model name to use for the evaluation when running Future AGI Built Evals. Defaults to None. When running Future AGI Built Evals, you must specify the model name; otherwise the SDK raises an error.
Returns:
BatchRunResult: An object containing the results of the evaluation(s).
Raises:
ValidationError: If the inputs do not match the evaluation templates.
Exception: If the API request fails or other errors occur during evaluation.
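As a sketch, a batch evaluation over dictionary inputs might look like the following. The dictionary keys and the template classes imported here are assumptions based on the catalogue below (they are assumed to be importable from fi.evals in the same way as Tone in the final example); each template documents its own required fields.
from fi.evals import Evaluator, ContextAdherence, Completeness

evaluator = Evaluator()

# One dict per test case; the keys below are illustrative, not a fixed schema.
inputs = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "context": "Paris is the capital and largest city of France.",
    },
]

result = evaluator.evaluate(
    eval_templates=[ContextAdherence(), Completeness()],
    inputs=inputs,
    timeout=60,                 # overrides the client's default timeout for this call
    model_name="turing_flash",  # required because these are Future AGI Built Evals
)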
list_evaluations
Fetches information about all available evaluation templates.
def list_evaluations(self) -> List[Dict[str, Any]]:
Returns:
List[Dict[str, Any]]: A list of dictionaries, where each dictionary contains information about an available evaluation template. This typically includes details like the template's id, name, description, and expected parameters.
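For example, to print the available templates:
from fi.evals import Evaluator

evaluator = Evaluator()

# Each entry is a plain dict; id, name, and description are among the documented fields.
for template_info in evaluator.list_evaluations():
    print(template_info.get("name"), "-", template_info.get("description"))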
eval_templates
The list of templates that can be used to evaluate your data.
Conversation Coherence
Evaluates if a conversation flows logically and maintains context throughout
class ConversationCoherence():
Conversation Resolution
Checks if the conversation reaches a satisfactory conclusion or resolution. The conversation must have at least two users.
class ConversationResolution():
Content Moderation
Uses OpenAI’s content moderation to evaluate text safety
class ContentModeration():
Context Adherence
Measures how well responses stay within the provided context
class ContextAdherence():
Context Relevance
Evaluates the relevancy of the context to the query
class ContextRelevance():
Completeness
Evaluates if the response completely answers the query
Chunk Attribution
Tracks if the context chunk is used in generating the response.
class ChunkAttribution():
Chunk Utilization
Measures how effectively context chunks are used in responses
class ChunkUtilization():
PII
Detects personally identifiable information (PII) in text.
Toxicity
Evaluates content for toxic or harmful language
Tone
Analyzes the tone and sentiment of content
Sexist
Detects sexist content and gender bias
Prompt Injection
Evaluates text for potential prompt injection attempts
Not Gibberish Text
Checks if the text is not gibberish
Safe for Work Text
Evaluates if the text is safe for work.
Prompt Instruction Adherence
Assesses how closely the output follows the given prompt instructions, checking for completion of all requested tasks and adherence to specified constraints or formats. Evaluates both explicit and implicit requirements in the prompt.
Data Privacy Compliance
Checks output for compliance with data privacy regulations (GDPR, HIPAA, etc.). Identifies potential privacy violations, sensitive data exposure, and adherence to privacy principles.
class DataPrivacyCompliance():
Is Json
Validates if the content is in proper JSON format
One Line
Checks if the text is a single line
Contains Valid Link
Checks for presence of valid URLs
class ContainsValidLink():
Is Email
Validates email address format
No Valid Links
Checks if the text contains no invalid URLs
Eval Ranking
Provides ranking score for each context based on specified criteria.
Summary Quality
Evaluates if a summary effectively captures the main points, maintains factual accuracy, and achieves appropriate length while preserving the original meaning. Checks for both inclusion of key information and exclusion of unnecessary details.
class SummaryQuality(config={
"check_internet": {"type": "boolean", "default": False}
}):
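A minimal sketch of using this template, assuming the options shown above are passed through a config keyword argument and that SummaryQuality is importable from fi.evals like Tone in the final example:
from fi.evals import Evaluator, SummaryQuality
from fi.testcases import TestCase

evaluator = Evaluator()

# Disable internet checking when scoring the summary.
template = SummaryQuality(config={"check_internet": False})

test_case = TestCase(
    input="Full source article text goes here...",
    output="A short summary of the article.",
)

result = evaluator.evaluate(eval_templates=[template], inputs=[test_case], model_name="turing_flash")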
Factual Accuracy
Verifies if the provided output is factually correct or not.
class FactualAccuracy(config={
"check_internet": {"type": "boolean", "default": False}
}):
Translation Accuracy
Evaluates the quality of translation by checking semantic accuracy, cultural appropriateness, and preservation of original meaning. Considers both literal accuracy and natural expression in the target language.
class TranslationAccuracy():
Cultural Sensitivity
Analyzes output for cultural appropriateness, inclusive language, and awareness of cultural nuances. Identifies potential cultural biases or insensitive content.
class CulturalSensitivity():
Bias Detection
Identifies various forms of bias including gender, racial, cultural, or ideological bias in the output. Evaluates for balanced perspective and neutral language use.
Evaluate LLM Function Calling
Assesses accuracy and effectiveness of LLM function calls.
class EvaluateLLMFunctionCalling():
Audio Transcription
Analyzes how accurately the given transcription matches the provided audio.
class AudioTranscription():
Audio Quality
Evaluates the quality of the given audio.
Protect Flash
FutureAGI’s proprietary evaluator to check if the content is harmful
No Racial Bias
Ensures that the output does not contain or imply racial bias, stereotypes, or preferential treatment.
No Gender Bias
Checks that the response does not reinforce gender stereotypes or exhibit bias based on gender identity.
No Age Bias
Evaluates if the content is free from stereotypes, discrimination, or assumptions based on age.
No Openai Reference
Ensures that the model response does not mention being an OpenAI model or reference its training data or providers.
class NoOpenaiReference():
No Apologies
Checks if the model unnecessarily apologizes, e.g., ‘I’m sorry, but…’
Is Polite
Ensures that the output maintains a respectful, kind, and non-aggressive tone.
Is Concise
Measures whether the answer is brief and to the point, avoiding redundancy.
Is Helpful
Evaluates whether the response answers the user’s question effectively.
Is Code
Checks whether the output is valid code or contains expected code snippets.
Fuzzy Match
Compares model output with an expected answer using approximate (not exact) matching.
Answer Refusal
Checks if the model correctly refuses to answer when prompted with harmful, sensitive, or restricted queries.
Detect Hallucination
Identifies if the model fabricated facts or added information that was not present in the input or reference.
class DetectHallucination():
No Harmful Therapeutic Guidance
Ensures that the model does not provide potentially harmful psychological or therapeutic advice.
class NoHarmfulTherapeuticGuidance():
Clinically Inappropriate Tone
Evaluates whether the model’s tone is unsuitable for clinical or mental health contexts.
class ClinicallyInappropriateTone():
Is Harmful Advice
Detects whether the model gives advice that could be physically, emotionally, legally, or financially harmful.
Content Safety Violation
A broad check for content that violates safety or usage policies—this includes toxicity, hate speech, explicit content, violence, etc.
class ContentSafetyViolation():
Is Good Summary
Evaluates if a summary is clear, well-structured, and includes the most important points from the source material.
Is Factually Consistent
Checks if the generated output is factually consistent with the source/context (e.g., input text or documents).
class IsFactuallyConsistent():
Is Compliant
Ensures that the output adheres to legal, regulatory, or organizational policies (e.g., HIPAA, GDPR, company rules).
Is Informal Tone
Detects whether the tone is informal or casual (e.g., use of slang, contractions, emoji).
Evaluate Function Calling
Tests if the model correctly identifies when to trigger a tool/function and includes the right arguments in the function call.
class EvaluateFunctionCalling():
Task Completion
Measures whether the model fulfilled the user’s request accurately and completely.
Caption Hallucination
Evaluates whether image captions or descriptions contain factual inaccuracies or hallucinated details that are not present in the instruction. This metric helps ensure that AI-generated image descriptions remain faithful to the instruction content.
class CaptionHallucination():
BLEU Score
Computes a BLEU score between the expected gold answer and the model output.
Aggregated Metric
Combines multiple evaluation metrics into a single normalized score.
class AggregatedMetric(config={
"metrics": {"type": "list", "default": []},
"metric_names": {"type": "list", "default": []},
"aggregator": {"type": "option", "default": "average"},
"weights": {"type": "list", "default": None},
}):
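A sketch of combining metrics, assuming template instances are passed through the metrics key of the config shown above and that these classes are importable from fi.evals:
from fi.evals import AggregatedMetric, ROUGEScore, LevenshteinDistance

# Average two deterministic metrics into a single normalized score.
aggregated = AggregatedMetric(config={
    "metrics": [ROUGEScore(), LevenshteinDistance()],
    "aggregator": "average",  # documented default
})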
ROUGE Score
Calculates the ROUGE score between generated text and reference text
class ROUGEScore(config={
"rouge_type": {"type": "option", "default": "rouge1", "options": ["rouge1", "rouge2", "rougeL"]},
"use_stemmer": {"type": "boolean", "default": True}
}):
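For example, a sketch that scores bigram overlap instead of the default ROUGE-1, assuming the options are passed via config:
from fi.evals import ROUGEScore

rouge = ROUGEScore(config={
    "rouge_type": "rouge2",   # one of rouge1, rouge2, rougeL
    "use_stemmer": False,
})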
Numerical Difference
Calculates the numerical difference between a generated value and a reference value
class NumericDiff(config={
"extract_numeric": {"type": "boolean", "default": True},
"normalized_result": {"type": "boolean", "default": True}
}):
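For example, a sketch that compares values that are already numeric, without extracting numbers from surrounding text (assuming the options are passed via config):
from fi.evals import NumericDiff

numeric_diff = NumericDiff(config={
    "extract_numeric": False,
    "normalized_result": True,
})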
Levenshtein Distance
Calculates the edit distance between generated text and reference text
class LevenshteinDistance(config={
"case_insensitive": {"type": "boolean", "default": False},
"remove_punctuation": {"type": "boolean", "default": False}
}):
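For example, a sketch that ignores case and punctuation when computing the edit distance (assuming the options are passed via config):
from fi.evals import LevenshteinDistance

edit_distance = LevenshteinDistance(config={
    "case_insensitive": True,
    "remove_punctuation": True,
})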
Embedding Similarity
Calculates the semantic similarity between generated text and reference text
class EmbeddingSimilarity(config={
"similarity_method": {"type": "option", "default": "cosine", "options": ["cosine", "euclidean", "manhattan"]},
"normalize": {"type": "boolean", "default": True}
}):
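For example, a sketch that switches from the default cosine similarity to Euclidean distance (assuming the options are passed via config):
from fi.evals import EmbeddingSimilarity

similarity = EmbeddingSimilarity(config={
    "similarity_method": "euclidean",  # one of cosine, euclidean, manhattan
    "normalize": True,
})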
Semantic List Contains
Checks if the text contains phrases semantically similar to reference phrases
class SemanticListContains(config={
"case_insensitive": {"type": "boolean", "default": True},
"remove_punctuation": {"type": "boolean", "default": True},
"match_all": {"type": "boolean", "default": False},
"similarity_threshold": {"type": "float", "default": 0.7}
}):
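For example, a sketch that requires every reference phrase to be semantically present, with a stricter threshold (how the reference phrases themselves are supplied is not shown here; the options are assumed to be passed via config):
from fi.evals import SemanticListContains

contains = SemanticListContains(config={
    "match_all": True,
    "similarity_threshold": 0.85,
})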
Example Usage
from fi.evals import Evaluator, Tone
from fi.testcases import TestCase

# Create the client (credentials are read from FI_API_KEY / FI_SECRET_KEY if not passed).
evaluator = Evaluator()

# A single test case describing the prompt, the model output, and the context.
test_case = TestCase(
    input="Write a professional email",
    output="Dear Sir/Madam, I hope this email finds you well. I am writing to inquire about...",
    context="Maintain formal business communication tone",
)

# Tone is a Future AGI Built Eval, so model_name must be supplied.
template = Tone()
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case], model_name="turing_flash")
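The returned BatchRunResult can then be inspected. The attribute names used below (eval_results, metrics, and their fields) are assumptions for illustration and may differ between SDK versions:
# Attribute names here are assumptions, not a guaranteed API.
for eval_result in response.eval_results:
    for metric in eval_result.metrics:
        print(metric.id, metric.value)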