Uses AI agents to conduct structured evaluations of content, guided by customisable evaluation prompts and system instructions. Unlike rule-based compliance checks, this approach lets the agent interpret content contextually against predefined criteria, enabling more nuanced assessment.

Click here to read the eval definition of Agent as a Judge

a. Using Interface

Configuration Parameters

  • Model: The LLM used to perform the evaluation (for example, gpt-4o-mini).
  • Eval Prompt: The main evaluation prompt that defines how the AI should judge the response.
  • System Prompt: A higher-level instruction that guides the AI’s evaluation behaviour.
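
For example, the prompts can reference test case fields such as {{query}} and {{response}} through template placeholders; the same prompt text is used in the SDK example below:

Eval Prompt: Evaluate if the {{response}} accurately answers the {{query}}. Return a score between 0.0 and 1.0.
System Prompt: You are an expert agent evaluating responses for accuracy and completeness.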

b. Using SDK

from fi.evals import EvalClient, AgentJudge
from fi.testcases import LLMTestCase

# Initialise the evaluation client with your Future AGI credentials
evaluator = EvalClient(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key",
    fi_base_url="https://api.futureagi.com"
)

# Define a test case with the query, model response, retrieved context, and expected answer
test_case = LLMTestCase(
    query="What is the capital of France?",
    response="Paris is the capital of France and is known for the Eiffel Tower.",
    context="Paris has been France's capital since 987 CE.",
    expected_response="Paris is the capital of France."
)

# Configure the Agent as a Judge template with the judge model and prompts
template = AgentJudge(config={
    "model": "gpt-4o-mini",
    "evalPrompt": "Evaluate if the {{response}} accurately answers the {{query}}. Return a score between 0.0 and 1.0.",
    "systemPrompt": "You are an expert agent evaluating responses for accuracy and completeness."
})

# Run the evaluation of the test case against the template
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

print(f"Evaluation Result: {response.eval_results[0].reason}")
print(f"Score: {response.eval_results[0].metrics[0].value}")