Evaluation Using Interface

Input:

  • Configuration Parameters:
    • Input: The content generated by the model/system that needs to be evaluated against the rules.
    • Rule Prompt: A string defining the specific rules, patterns, or criteria the Input must adhere to. Double-curly braces such as {{column_name}} are substituted with the actual input data from column column_name during evaluation (see the sketch after this list).
    • Choices: A list of predefined options or categories that the evaluation selects from, guided by the Rule Prompt.
    • Multi Choice: A boolean (true/false) indicating whether the evaluation may return multiple matching options from Choices (true) or exactly one (false).
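
To make the {{column_name}} substitution concrete, here is a minimal Python sketch of the templating behaviour. The render_rule_prompt helper is purely illustrative and not part of the product; the platform performs this substitution for you during evaluation.

import re

def render_rule_prompt(rule_prompt: str, row: dict) -> str:
    # Replace each {{column_name}} placeholder with the value from `row`.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), rule_prompt)

rule_prompt = "context : {{context}}, question : {{question}}. Choose Pass or Fail."
row = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
}

print(render_rule_prompt(rule_prompt, row))
# context : Paris is the capital of France., question : What is the capital of France?. Choose Pass or Fail.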

Output:

  • The result is the choice (or set of choices, when Multi Choice is enabled) from the user-provided Choices that reflects the Input’s adherence to the deterministic criteria.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.


| Input Type | Parameter | Type | Description |
| --- | --- | --- | --- |
| Configuration Parameters | input | string | The actual output or content generated by the model/system that needs to be evaluated against the rules. |
| | rule_prompt | string | A string defining the specific rules, patterns, or criteria the input must adhere to. Double-curly braces like {{column_name}} are substituted with the actual input data from column column_name during evaluation. |
| | choices | list[string] | A list of predefined options or categories the evaluation selects from. |
| | multi_choice | bool | If true, the evaluation may return multiple matching choices; if false, it returns a single choice based on the rule_prompt. |

| Output | Type | Description |
| --- | --- | --- |
| Result | string / list[string] | The matching choice(s) selected according to the rule_prompt. |

from fi.testcases import MLLMTestCase
from fi.evals import Deterministic

class DeterministicTestCase(MLLMTestCase):
    context: str
    question: str

deterministic_eval = Deterministic(config={
    "multi_choice": False,
    "choices": ["Pass", "Fail"],
    "rule_prompt": "context : {{input_key1}}, question : {{input_key2}}. Given the context and question, choose Pass if the question is grammatically correct, well-structured, and free of errors; choose Fail otherwise",
        "input": {
        "input_key1": "context",
        "input_key2": "question",
    }
})

# `dataset` is assumed to be a pandas DataFrame with `context` and `question`
# columns, and `evaluator` an initialized evaluation client (see the sketch
# after this block).
for index, row in dataset.iterrows():
    test_case = DeterministicTestCase(
        context=row["context"],
        question=row["question"],
    )
    result = evaluator.evaluate([deterministic_eval], [test_case])
    option = result.eval_results[0].metrics[0].value  # the selected choice, e.g. "Pass"
    reason = result.eval_results[0].reason            # explanation for the selection
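
The loop above assumes dataset and evaluator already exist. A minimal setup sketch follows; the Evaluator class name and its no-argument constructor are assumptions, so adjust the client initialization to match the SDK setup guide linked above.

import pandas as pd
from fi.evals import Evaluator  # assumed client class; name may differ by SDK version

# A toy dataset with the two columns referenced by the test case.
dataset = pd.DataFrame({
    "context": ["Paris is the capital of France."],
    "question": ["What is the capital of France?"],
})

# Credentials are typically supplied via environment variables or keyword
# arguments (assumed); consult the SDK setup guide for the exact signature.
evaluator = Evaluator()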


What To Do When Deterministic Eval Does Not Return the Expected Option

  • Rule Refinement (see the sketch after this list):
    • Review and clarify rule prompt definitions
    • Adjust pattern matching criteria
    • Update choice options if too restrictive
  • Input Validation:
    • Check input formatting
    • Verify rule string compatibility
    • Ensure choice options are comprehensive
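
As an illustration of rule refinement, the variant below restates the config from the earlier example with a more explicit rule prompt and an additional escape-hatch choice. The values are examples to adapt, not prescriptions.

from fi.evals import Deterministic

refined_eval = Deterministic(config={
    "multi_choice": False,
    # An "Unclear" option keeps ambiguous inputs from being forced into Pass/Fail.
    "choices": ["Pass", "Fail", "Unclear"],
    # Spell out the criteria for every choice so the selection is predictable.
    "rule_prompt": (
        "context : {{input_key1}}, question : {{input_key2}}. "
        "Choose Pass only if the question is grammatically correct and well-structured. "
        "Choose Fail if it contains grammatical or structural errors. "
        "Choose Unclear if the question cannot be judged from the context."
    ),
    "input": {
        "input_key1": "context",
        "input_key2": "question",
    },
})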

Comparing Deterministic Eval with Similar Evals

  1. Content Moderation: While Content Moderation focuses on safety and appropriateness, Deterministic Evals verify pattern compliance and rule adherence.
  2. Prompt Perplexity: Prompt Perplexity measures a model’s understanding and confidence through perplexity calculations, making it useful for assessing comprehension and response certainty, whereas Deterministic Eval follows a structured classification framework with explicit rules and criteria, ensuring strict adherence to predefined standards.