Evaluations
Using the Future AGI Python SDK for running evaluations, listing available evaluators, and configuring the Evaluator client.
The Future AGI Python SDK provides an Evaluator class to programmatically run evaluations on your data and language model outputs. This page describes how to initialize the Evaluator client, run evaluations, and list the available evaluation templates.
Evaluator
An evaluator is an abstraction used for running evaluations on your data and model outputs.
Initialization
Initializes the Evaluator client. API keys and the base URL can be provided directly or will be read from the environment variables (FI_API_KEY, FI_SECRET_KEY, FI_BASE_URL) if not specified.
Arguments:
fi_api_key (Optional[str], optional): API key. Defaults to None.
fi_secret_key (Optional[str], optional): Secret key. Defaults to None.
fi_base_url (Optional[str], optional): Base URL. Defaults to None.
**kwargs:
timeout (Optional[int]): Timeout value in seconds. Default: 200.
max_queue_bound (Optional[int]): Maximum queue size. Default: 5000.
max_workers (Optional[int]): Maximum number of workers. Default: 8.
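For reference, a minimal initialization sketch is shown below. The import path (fi.evals) is an assumption; adjust it to match the package installed in your environment.

```python
# Minimal initialization sketch. The import path is an assumption;
# adjust it to match your installed Future AGI package.
import os

from fi.evals import Evaluator  # assumed import path

# Keys may be passed explicitly, or omitted to fall back to the
# FI_API_KEY / FI_SECRET_KEY / FI_BASE_URL environment variables.
evaluator = Evaluator(
    fi_api_key=os.getenv("FI_API_KEY"),
    fi_secret_key=os.getenv("FI_SECRET_KEY"),
    timeout=200,           # request timeout in seconds (client default)
    max_queue_bound=5000,  # maximum queue size (client default)
    max_workers=8,         # maximum number of workers (client default)
)
```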
evaluate
Runs a single evaluation or a batch of evaluations independently.
Arguments:
eval_templates (Union[str, EvalTemplate, List[EvalTemplate]]): A single evaluation template or a list of evaluation templates.
inputs (Union[TestCase, List[TestCase], Dict[str, Any], List[Dict[str, Any]]]): A single test case or a list of test cases. Supports various TestCase types.
timeout (Optional[int], optional): Timeout value in seconds for the evaluation. Defaults to None (uses the client’s default timeout).
model_name (Optional[str], optional): Model name to use for the evaluation when using Future AGI Built Evals. Defaults to None.
When running Future AGI Built Evals, you must specify the model name; otherwise, the SDK raises an error.
Returns:
BatchRunResult: An object containing the results of the evaluation(s).
Raises:
ValidationError: If the inputs do not match the evaluation templates.
Exception: If the API request fails or other errors occur during evaluation.
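A hedged sketch of a single evaluation run is shown below, reusing the evaluator client created above. The template name "Toxicity" is taken from the catalogue later on this page, and the dictionary key used in inputs is an assumption; confirm the expected fields for your chosen template, for example via list_evaluations().

```python
# Hedged sketch of a single evaluation run, reusing the `evaluator`
# client from the initialization example above.
result = evaluator.evaluate(
    eval_templates="Toxicity",                      # template from the catalogue below
    inputs={"input": "Text to be scored goes here."},  # key name is an assumption
    model_name="gpt-4o-mini",                       # required for Future AGI Built Evals
)

# `result` is a BatchRunResult holding the outcome of each evaluation.
print(result)
```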
list_evaluations
Fetches information about all available evaluation templates.
Returns:
List[Dict[str, Any]]: A list of dictionaries, where each dictionary contains information about an available evaluation template. This typically includes details such as the template’s id, name, description, and expected parameters.
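For example, the snippet below lists the available templates using the client created earlier and prints the name and description fields described above; treat the exact key names as subject to the returned payload.

```python
# Fetch metadata for all available evaluation templates.
templates = evaluator.list_evaluations()

for template in templates:
    # Each entry is a plain dict; "name" and "description" are documented
    # above, but verify the keys against the actual payload.
    print(template.get("name"), "-", template.get("description"))
```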
eval_templates
The list of templates that can be used to evaluate your data.
Conversation Coherence
Evaluates if a conversation flows logically and maintains context throughout
Conversation Resolution
Checks if the conversation reaches a satisfactory conclusion or resolution. The conversation must have at least two users.
Content Moderation
Uses OpenAI’s content moderation to evaluate text safety
Context Adherence
Measures how well responses stay within the provided context
Context Relevance
Evaluates the relevancy of the context to the query
Completeness
Evaluates if the response completely answers the query
Chunk Attribution
Tracks if the context chunk is used in generating the response.
Chunk Utilization
Measures how effectively context chunks are used in responses
PII
Detects personally identifiable information (PII) in text.
Toxicity
Evaluates content for toxic or harmful language
Tone
Analyzes the tone and sentiment of content
Sexist
Detects sexist content and gender bias
Prompt Injection
Evaluates text for potential prompt injection attempts
Not Gibberish Text
Checks if the text is not gibberish
Safe for Work text
Evaluates if the text is safe for work.
Prompt Instruction Adherence
Assesses how closely the output follows the given prompt instructions, checking for completion of all requested tasks and adherence to specified constraints or formats. Evaluates both explicit and implicit requirements in the prompt.
Data Privacy Compliance
Checks output for compliance with data privacy regulations (GDPR, HIPAA, etc.). Identifies potential privacy violations, sensitive data exposure, and adherence to privacy principles.
Is Json
Validates if content is proper JSON format
One Line
Checks if the text is a single line
Contains Valid Link
Checks for presence of valid URLs
Is Email
Validates email address format
No Valid Links
Checks if the text contains no invalid URLs
Eval Ranking
Provides ranking score for each context based on specified criteria.
Summary Quality
Evaluates if a summary effectively captures the main points, maintains factual accuracy, and achieves appropriate length while preserving the original meaning. Checks for both inclusion of key information and exclusion of unnecessary details.
Factual Accuracy
Verifies if the provided output is factually correct or not.
Translation Accuracy
Evaluates the quality of translation by checking semantic accuracy, cultural appropriateness, and preservation of original meaning. Considers both literal accuracy and natural expression in the target language.
Cultural Sensitivity
Analyzes output for cultural appropriateness, inclusive language, and awareness of cultural nuances. Identifies potential cultural biases or insensitive content.
Bias Detection
Identifies various forms of bias including gender, racial, cultural, or ideological bias in the output. Evaluates for balanced perspective and neutral language use.
Evaluate LLM Function calling
Assesses accuracy and effectiveness of LLM function calls.
Audio Transcription
Analyzes the transcription accuracy of the given audio and its transcription.
Audio Quality
Evaluates the quality of the given audio.
Protect Flash
FutureAGI’s proprietary evaluator to check if the content is harmful
No Racial Bias
Ensures that the output does not contain or imply racial bias, stereotypes, or preferential treatment.
No Gender Bias
Checks that the response does not reinforce gender stereotypes or exhibit bias based on gender identity.
No Age Bias
Evaluates if the content is free from stereotypes, discrimination, or assumptions based on age.
No Openai Reference
Ensures that the model response does not mention being an OpenAI model or reference its training data or providers.
No Apologies
Checks if the model unnecessarily apologizes, e.g., ‘I’m sorry, but…’
Is Polite
Ensures that the output maintains a respectful, kind, and non-aggressive tone.
Is Concise
Measures whether the answer is brief and to the point, avoiding redundancy.
Is Helpful
Evaluates whether the response answers the user’s question effectively.
Is Code
Checks whether the output is valid code or contains expected code snippets.
Fuzzy Match
Compares model output with an expected answer using approximate (not exact) matching.
Answer Refusal
Checks if the model correctly refuses to answer when prompted with harmful, sensitive, or restricted queries.
Detect Hallucination
Identifies if the model fabricated facts or added information that was not present in the input or reference.
No Harmful Therapeutic Guidance
Ensures that the model does not provide potentially harmful psychological or therapeutic advice.
Clinically Inappropriate Tone
Evaluates whether the model’s tone is unsuitable for clinical or mental health contexts.
Is Harmful Advice
Detects whether the model gives advice that could be physically, emotionally, legally, or financially harmful.
Content Safety Violation
A broad check for content that violates safety or usage policies—this includes toxicity, hate speech, explicit content, violence, etc.
Is Good Summary
Evaluates if a summary is clear, well-structured, and includes the most important points from the source material.
Is Factually Consistent
Checks if the generated output is factually consistent with the source/context (e.g., input text or documents).
Is Compliant
Ensures that the output adheres to legal, regulatory, or organizational policies (e.g., HIPAA, GDPR, company rules).
Is Informal Tone
Detects whether the tone is informal or casual (e.g., use of slang, contractions, emoji).
Evaluate Function Calling
Tests if the model correctly identifies when to trigger a tool/function and includes the right arguments in the function call.
Task Completion
Measures whether the model fulfilled the user’s request accurately and completely.
Caption Hallucination
Evaluates whether image captions or descriptions contain factual inaccuracies or hallucinated details that are not present in the instruction. This metric helps ensure that AI-generated image descriptions remain faithful to the instruction content.
Bleu Score
Computes a BLEU score between the expected gold answer and the model output.
Aggregated Metric
Combines multiple evaluation metrics into a single normalised score.
ROUGE Score
Calculates the ROUGE score between generated text and reference text.
Numerical Difference
Calculates the numerical difference between a generated value and a reference value.
Levenshtein Distance
Calculates the edit distance between generated text and reference text.
Embedding Similarity
Calculates the semantic similarity between generated text and reference text.
Semantic List Contains
Checks if the text contains phrases semantically similar to the reference phrases.
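As an illustration of combining the catalogue above with evaluate, the sketch below runs the Completeness template over a small batch of dictionary inputs. The field names (query, response) are assumptions; check the template’s expected parameters via list_evaluations() before relying on them.

```python
# Hedged batch sketch: score several outputs with one catalogue template,
# reusing the `evaluator` client from the initialization example.
batch_result = evaluator.evaluate(
    eval_templates="Completeness",  # single template, applied to every input
    inputs=[
        {"query": "What is the capital of France?", "response": "Paris."},
        {"query": "Who wrote Hamlet?", "response": "William Shakespeare."},
    ],
    model_name="gpt-4o-mini",  # required when running Future AGI Built Evals
)

# The BatchRunResult contains one result per input test case.
print(batch_result)
```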