The Future AGI Python SDK provides an Evaluator class for programmatically running evaluations on your data and language model outputs. This document covers the client's initialization, its evaluate and list_evaluations methods, the available evaluation templates, and an end-to-end usage example.

Evaluator

class Evaluator(APIKeyAuth):

An evaluator is an abstraction used for running evaluations on your data and model outputs.

Initialization

Initializes the Evaluator client. API keys and base URL can be provided directly or will be read from environment variables (FI_API_KEY, FI_SECRET_KEY, FI_BASE_URL) if not specified.

def __init__(
        self,
        fi_api_key: Optional[str] = None,
        fi_secret_key: Optional[str] = None,
        fi_base_url: Optional[str] = None,
        **kwargs,
    ) -> None:

Arguments:

  • fi_api_key (Optional[str], optional): API key. Defaults to None.
  • fi_secret_key (Optional[str], optional): Secret key. Defaults to None.
  • fi_base_url (Optional[str], optional): Base URL. Defaults to None.
  • **kwargs:
    • timeout (Optional[int]): Timeout value in seconds. Default: 200.
    • max_queue_bound (Optional[int]): Maximum queue size. Default: 5000.
    • max_workers (Optional[int]): Maximum number of workers. Default: 8.
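
For example, a minimal construction sketch (the key values and base URL below are placeholders; in practice they can be omitted and read from the environment variables listed above):

from fi.evals import Evaluator

# Credentials read from FI_API_KEY, FI_SECRET_KEY and FI_BASE_URL
evaluator = Evaluator()

# Or pass them explicitly, along with the optional client settings
evaluator = Evaluator(
    fi_api_key="your-api-key",            # placeholder
    fi_secret_key="your-secret-key",      # placeholder
    fi_base_url="https://your-base-url",  # placeholder
    timeout=120,           # seconds; client default is 200
    max_queue_bound=5000,  # default 5000
    max_workers=8,         # default 8
)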

evaluate

Runs a single evaluation or a batch of evaluations; each evaluation is run independently.

def evaluate(
        self,
        eval_templates: Union[EvalTemplate, List[EvalTemplate]],
        inputs: Union[
            TestCase,
            List[TestCase],
            LLMTestCase,
            List[LLMTestCase],
            MLLMTestCase,
            List[MLLMTestCase],
            ConversationalTestCase,
            List[ConversationalTestCase],
        ],
        timeout: Optional[int] = None,
    ) -> BatchRunResult:

Arguments:

  • eval_templates (Union[EvalTemplate, List[EvalTemplate]]): A single evaluation template or a list of evaluation templates.
  • inputs (Union[TestCase, List[TestCase], LLMTestCase, List[LLMTestCase], MLLMTestCase, List[MLLMTestCase], ConversationalTestCase, List[ConversationalTestCase]]): A single test case or a list of test cases. Supports various TestCase types.
  • timeout (Optional[int], optional): Timeout value in seconds for the evaluation. Defaults to None (uses the client’s default timeout).

Returns:

  • BatchRunResult: An object containing the results of the evaluation(s).

Raises:

  • ValidationError: If the inputs do not match the evaluation templates.
  • Exception: If the API request fails or other errors occur during evaluation.
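
A hedged batch sketch: it assumes ContentModeration and Toxicity are importable from fi.evals (as Tone is in the usage example at the end of this document), that they need no config, and that each entry in eval_results carries a data list as shown in that example:

from fi.evals import Evaluator, ContentModeration, Toxicity
from fi.testcases import TestCase

evaluator = Evaluator()

test_cases = [
    TestCase(input="Summarize the report", output="The report covers Q3 revenue growth..."),
    TestCase(input="Reply to the customer", output="Thanks for reaching out! Happy to help..."),
]

# Run two templates over two test cases in a single call
result = evaluator.evaluate(
    eval_templates=[ContentModeration(), Toxicity()],
    inputs=test_cases,
    timeout=120,  # per-call override of the client default
)

for eval_result in result.eval_results:
    print(eval_result.data)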

list_evaluations

Fetches information about all available evaluation templates.

def list_evaluations(self) -> List[Dict[str, Any]]:

Returns:

  • List[Dict[str, Any]]: A list of dictionaries, where each dictionary contains information about an available evaluation template. This typically includes details like the template’s id, name, description, and expected parameters.
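
A short sketch that prints what is available; the name and description keys are assumptions based on the typical fields noted above:

from fi.evals import Evaluator

evaluator = Evaluator()

for template_info in evaluator.list_evaluations():
    # "name" and "description" keys are assumed from the fields described above
    print(template_info.get("name"), "-", template_info.get("description"))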

eval_templates

The list of templates that can be used to evaluate your data.

Conversation Coherence

Evaluates if a conversation flows logically and maintains context throughout.

class ConversationCoherence():

Conversation Resolution

Checks if the conversation reaches a satisfactory conclusion or resolution. The conversation must have at least two users.

class ConversationResolution():

Deterministic Evals

Evaluates if the output is deterministic or not.

class DeterministicEvals(config={
    "multi_choice": False,
    "choices": ["Yes", "No"],
    "rule_prompt": "Evaluate if {{input_key1}} and {{input_key2}} are deterministic",
    "input": {"input_key1": "example_response", "input_key2": "expected_response"},
}):
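
A hedged usage sketch: it assumes this template is importable from fi.evals and that the values in config["input"] name attributes carried by the test case, which is how the {{input_key1}} and {{input_key2}} placeholders in rule_prompt are filled:

from fi.evals import Evaluator, DeterministicEvals
from fi.testcases import TestCase

evaluator = Evaluator()

template = DeterministicEvals(config={
    "multi_choice": False,
    "choices": ["Yes", "No"],
    "rule_prompt": "Evaluate if {{input_key1}} and {{input_key2}} are deterministic",
    "input": {"input_key1": "example_response", "input_key2": "expected_response"},
})

# The field names below mirror config["input"] above; that mapping is an assumption
test_case = TestCase(
    example_response="Paris is the capital of France.",
    expected_response="The capital of France is Paris.",
)

result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
print(result.eval_results[0].data)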

Content Moderation

Uses OpenAI’s content moderation to evaluate text safety.

class ContentModeration():

Context Adherence

Measures how well responses stay within the provided context.

class ContextAdherence():

Prompt Perplexity

Measures how well the model understands and processes the input prompt by calculating perplexity over the output. This reflects the model's confidence in its response; a higher value indicates higher confidence.

class PromptPerplexity(config={
    "model": {"type": "option", "default": None},
}):

Context Relevance

Evaluates the relevancy of the context to the query.

class ContextRelevance(config={
    "check_internet": {"type": "boolean", "default": False}
}):

Completeness

Evaluates if the response completely answers the query.

class Completeness():

Chunk Attribution

Tracks if the context chunk is used in generating the response.

class ChunkAttribution():

Chunk Utilization

Measures how effectively context chunks are used in responses.

class ChunkUtilization():

Context Similarity

Compares similarity between provided and expected context.

class ContextSimilarity(config={
    "comparator": {"type": "option", "default": None},
    "failure_threshold": {"type": "float", "default": None},
}):

PII

Detects personally identifiable information (PII) in text.

class PII():

Toxicity

Evaluates content for toxic or harmful language.

class Toxicity():

Tone

Analyzes the tone and sentiment of content.

class Tone():

Sexist

Detects sexist content and gender bias.

class Sexist():

Prompt Injection

Evaluates text for potential prompt injection attempts.

class PromptInjection():

Not Gibberish text

Checks if the text is not gibberish.

class NotGibberish():

Safe for Work text

Evaluates if the text is safe for work.

class SafeForWork():

Prompt/Instruction Adherence

Assesses how closely the output follows the given prompt instructions, checking for completion of all requested tasks and adherence to specified constraints or formats. Evaluates both explicit and implicit requirements in the prompt.

class PromptInstructionAdherence():

Data Privacy Compliance

Checks output for compliance with data privacy regulations (GDPR, HIPAA, etc.). Identifies potential privacy violations, sensitive data exposure, and adherence to privacy principles.

class DataPrivacyCompliance(config={
    "check_internet": {"type": "boolean", "default": True},
}):

Is Json

Validates if content is proper JSON format.

class IsJson():

Ends With

Checks if the text ends with a specific substring.

class EndsWith(config={
    "case_sensitive": {"type": "boolean", "default": True},
    "substring": {"type": "string", "default": None},
}):

Equals

Compares if two texts are exactly equal.

class Equals(config={
    "case_sensitive": {"type": "boolean", "default": True},
}):

Contains All

Verifies the text contains all of the specified keywords.

class ContainsAll(config={
    "case_sensitive": {"type": "boolean", "default": True},
    "keywords": {"type": "list", "default": []},
}):

Length Less Than

Checks if the text length is below a threshold.

class LengthLessThan(config={
    "max_length": {"type": "integer", "default": None},
}):

Contains None

Verifies the text contains none of the specified terms.

class ContainsNone(config={
    "case_sensitive": {"type": "boolean", "default": True},
    "keywords": {"type": "list", "default": []},
}):

Regex

Checks if the text matches a specified regex pattern.

class Regex(config={
    "pattern": {"type": "string", "default": None},
}):
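
A hedged sketch of running one of these deterministic string checks; the import path and the test case field the pattern is matched against (output here, mirroring the Tone example at the end of this document) are assumptions:

from fi.evals import Evaluator, Regex
from fi.testcases import TestCase

evaluator = Evaluator()

# Check that the model output contains an ISO-style date (YYYY-MM-DD)
template = Regex(config={"pattern": r"\d{4}-\d{2}-\d{2}"})

test_case = TestCase(output="The invoice is due on 2024-07-01.")

result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
print(result.eval_results[0].data)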

Starts With

Checks if the text begins with a specific substring.

class StartsWith(config={
    "substring": {"type": "string", "default": None},
    "case_sensitive": {"type": "boolean", "default": True},
}):

API Call

Makes an API call and evaluates the response.

class APICall(config={
    "url": {"type": "string", "default": None},
    "payload": {"type": "dict", "default": {}},
    "headers": {"type": "dict", "default": {}},
}):

Length Between

Checks if the text length is between specified min and max values.

class LengthBetween(config={
    "min_length": {"type": "integer", "default": None},
    "max_length": {"type": "integer", "default": None},
}):

Custom Code Evaluation

Executes custom Python code for evaluation.

class CustomCodeEvaluation(config={
    "code": {"type": "code", "default": None},
}):

Agent as a Judge

Uses AI agents for content evaluation.

class AgentAsJudge(config={
    "model": {"type": "option", "default": None},
    "eval_prompt": {"type": "prompt", "default": None},
    "system_prompt": {"type": "prompt", "default": None},
}):
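
A hedged sketch; the model identifier is a placeholder, and the assumption that the judging agent sees the test case's input and output fields is not confirmed by the snippets above:

from fi.evals import Evaluator, AgentAsJudge
from fi.testcases import TestCase

evaluator = Evaluator()

template = AgentAsJudge(config={
    "model": "gpt-4o-mini",  # placeholder model name
    "eval_prompt": "Rate how helpful the output is as an answer to the input.",
    "system_prompt": "You are a strict but fair evaluator.",
})

test_case = TestCase(
    input="How do I reset my password?",
    output="Go to Settings > Security and click 'Reset password'.",
)

result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
print(result.eval_results[0].data)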

Json Scheme Validation

Validates JSON against specified criteria.

class JsonSchemeValidation(config={
    "validations": {"type": "list", "default": []},
}):

One Line

Checks if the text is a single line.

class OneLine():

Contains Valid Link

Checks for the presence of valid URLs.

class ContainsValidLink():

Is Email

Validates email address format.

class IsEmail():

Length Greater Than

Checks if the text length is greater than a specified value.

class LengthGreaterThan(config={
    "min_length": {"type": "integer", "default": None},
}):

No Valid Links

Checks if the text contains no invalid URLs.

class NoValidLinks():

Contains

Checks if the text contains a specific keyword.

class Contains(config={
    "keyword": {"type": "string", "default": None},
    "case_sensitive": {"type": "boolean", "default": True},
}):

Contains Any

Checks if the text contains any of the specified keywords.

class ContainsAny(config={
    "keywords": {"type": "list", "default": []},
    "case_sensitive": {"type": "boolean", "default": True},
}):

Answer Similarity

Evaluates the similarity between the expected and actual responses.

class AnswerSimilarity(config={
    "comparator": {"type": "option", "default": "CosineSimilarity"},
    "failure_threshold": {"type": "float", "default": 0.5},
}):
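
A hedged sketch; the response and expected_response field names on the test case are assumptions:

from fi.evals import Evaluator, AnswerSimilarity
from fi.testcases import TestCase

evaluator = Evaluator()

template = AnswerSimilarity(config={
    "comparator": "CosineSimilarity",  # the documented default
    "failure_threshold": 0.7,          # stricter than the 0.5 default
})

# Field names are assumed; they are not confirmed by the snippets above
test_case = TestCase(
    response="Photosynthesis converts sunlight into chemical energy.",
    expected_response="Plants turn light energy into chemical energy via photosynthesis.",
)

result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
print(result.eval_results[0].data)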

Eval Output

Scores linkage between input and output based on specified criteria.

class EvalOutput(config={
    "criteria": {"type": "string", "default": None},
    "check_internet": {"type": "boolean", "default": False},
}):

Eval Context Retrieval Quality

Assesses quality of retrieved context.

class EvalContextRetrievalQuality(config={
    "criteria": {"type": "string", "default": None},
}):

Eval Ranking

Provides ranking score for each context based on specified criteria.

class EvalRanking(config={
    "criteria": {"type": "string", "default": None}
}):

Eval Image Instruction (text to image)

Scores image-instruction linkage based on specified criteria.

class EvalImageInstruction(config={
    "criteria": {"type": "string", "default": None}
}):

Score Eval

Scores linkage between instruction, input images, and output images.

class ScoreEval(config={
    "rule_prompt": {"type": "rule_prompt", "default": ""},
    "input": {"type": "rule_string", "default": []},
}):

Summary Quality

Evaluates if a summary effectively captures the main points, maintains factual accuracy, and achieves appropriate length while preserving the original meaning. Checks for both inclusion of key information and exclusion of unnecessary details.

class SummaryQuality(config={
    "check_internet": {"type": "boolean", "default": False}
}):

Factual Accuracy

Verifies if the provided output is factually correct or not.

class FactualAccuracy(config={
    "check_internet": {"type": "boolean", "default": False}
}):
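
Several templates above (ContextRelevance, SummaryQuality, EvalOutput, DataPrivacyCompliance) expose the same check_internet flag; a minimal hedged sketch using FactualAccuracy, assuming it is importable from fi.evals like Tone:

from fi.evals import Evaluator, FactualAccuracy
from fi.testcases import TestCase

evaluator = Evaluator()

# Enable internet-backed fact checking (defaults to False for this template)
template = FactualAccuracy(config={"check_internet": True})

test_case = TestCase(
    input="When did the Apollo 11 mission land on the Moon?",
    output="Apollo 11 landed on the Moon on July 20, 1969.",
)

result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
print(result.eval_results[0].data)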

Translation Accuracy

Evaluates the quality of translation by checking semantic accuracy, cultural appropriateness, and preservation of original meaning. Considers both literal accuracy and natural expression in the target language.

class TranslationAccuracy():

Cultural Sensitivity

Analyzes output for cultural appropriateness, inclusive language, and awareness of cultural nuances. Identifies potential cultural biases or insensitive content.

class CulturalSensitivity():

Bias Detection

Identifies various forms of bias including gender, racial, cultural, or ideological bias in the output. Evaluates for balanced perspective and neutral language use.

class BiasDetection():

Evaluate LLM Function calling

Assesses accuracy and effectiveness of LLM function calls.

class EvaluateLLMFunctionCalling():

Audio Transcription

Evaluates the accuracy of the given transcription against its source audio.

class AudioTranscription():

Eval Audio Description

Evaluates the given audio against its provided description.

class EvalAudioDescription(config={
    "criteria": {"type": "string", "default": None},
}):

Audio Quality

Evaluates the quality of the given audio.

class AudioQuality():

Example Usage

from fi.evals import Evaluator, Tone
from fi.testcases import TestCase

evaluator = Evaluator()

def tone_evaluation(evaluator):
    # Build a test case with the task, the model output, and the context
    test_case = TestCase(
        input="Write a professional email",
        output="Dear Sir/Madam, I hope this email finds you well. I am writing to inquire about...",
        context="Maintain formal business communication tone",
    )

    # Instantiate the Tone template and run the evaluation
    template = Tone(config={"check_internet": False})
    response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

    # The first result's data holds the detected tone label
    assert response is not None
    assert len(response.eval_results) > 0
    assert response.eval_results[0].data[0] in [
        "neutral",
        "joy",
        "love",
        "fear",
        "surprise",
        "sadness",
        "anger",
        "annoyance",
        "confusion",
    ]

tone_evaluation(evaluator)