Agent as a Judge
Definition
Agent as a Judge uses AI agents to evaluate content through a structured evaluation process. It leverages agent-based approaches with customisable prompts and system instructions to perform comprehensive content assessment.
A successful evaluation produces reasoned output from the agent's analysis, driven by the configured prompts and chosen model; a failed evaluation indicates a problem with the agent's execution or response generation.
Calculation
The evaluation begins with configuration setup: an appropriate LLM is selected, an evaluation prompt is defined to guide the agent's assessment, and a system prompt is set to establish the agent's role and capabilities.
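As a rough illustration, this configuration might be captured in a structure like the sketch below. The `AgentJudgeConfig` class and its field names are hypothetical, not the API of any particular platform.

```python
from dataclasses import dataclass, field

@dataclass
class AgentJudgeConfig:
    # Hypothetical shape for an agent-judge configuration; names are illustrative.
    model: str                     # LLM backing the agent, e.g. "gpt-4o"
    system_prompt: str             # establishes the agent's role and capabilities
    evaluation_prompt: str         # directs the agent's assessment criteria
    tools: list[str] = field(default_factory=list)  # tools the agent may use

config = AgentJudgeConfig(
    model="gpt-4o",
    system_prompt="You are an impartial evaluator of customer-support answers.",
    evaluation_prompt=(
        "Given the input question and candidate answer, judge whether the answer "
        "is factually correct and complete, and explain your reasoning."
    ),
    tools=["web_search"],
)
```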
During agent execution, the system prompt provides context for the agent, while the evaluation prompt directs its assessment criteria. The agent then processes inputs using its available tools and reasoning abilities.
The evaluation returns an output that includes the agent's detailed reasoning, making the assessment transparent and structured.
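Continuing the sketch above, the execution step might look like the following. `run_agent` is a stand-in for whatever agent runtime actually drives the reasoning and tool-use loop; here it returns a canned JSON verdict so the example runs end to end.

```python
import json

def run_agent(model: str, system_prompt: str, user_prompt: str, tools: list[str]) -> str:
    """Stand-in for a real agent runtime (hypothetical).

    A real implementation would loop: call the model, execute any tool calls
    it requests, feed the results back, and stop once the agent produces a
    final answer. Here we return a canned verdict so the sketch is runnable.
    """
    return json.dumps({"score": 1.0, "reasoning": "Canned verdict for the sketch."})

def evaluate(config: AgentJudgeConfig, inputs: dict) -> dict:
    """Run the agent judge and return its structured verdict with reasoning."""
    raw = run_agent(
        model=config.model,
        system_prompt=config.system_prompt,
        user_prompt=config.evaluation_prompt + "\n\nInputs:\n" + json.dumps(inputs),
        tools=config.tools,
    )
    return json.loads(raw)  # e.g. {"score": ..., "reasoning": "..."}

result = evaluate(config, {"question": "What is the refund window?",
                           "answer": "30 days from delivery."})
print(result["reasoning"])
```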
What to Do When Agent Judge Evaluation Fails
When an evaluation fails, reviewing the agent configuration is crucial. This includes checking the system prompt to ensure the agent's role is correctly defined, verifying that the evaluation prompt is clear and comprehensive, and confirming that the agent has access to the tools it needs.
Additionally, assess the model selection: confirm that the chosen model is compatible with the agent's operations, and consider switching to an alternative model from the available options if needed.
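These checks are mechanical enough to automate. A minimal sketch, assuming the hypothetical `AgentJudgeConfig` from earlier:

```python
def review_config(config: AgentJudgeConfig, available_models: set[str]) -> list[str]:
    """Return a list of likely problems with an agent-judge configuration.

    Purely illustrative checks mirroring the review steps described above.
    """
    problems = []
    if not config.system_prompt.strip():
        problems.append("system prompt is empty: the agent's role is undefined")
    if not config.evaluation_prompt.strip():
        problems.append("evaluation prompt is empty: no assessment criteria")
    if config.model not in available_models:
        problems.append(f"model {config.model!r} is not among the available options")
    return problems

for issue in review_config(config, available_models={"gpt-4o", "claude-3-5-sonnet"}):
    print("check:", issue)
```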
Differentiating Agent as a Judge from LLM as a Judge
While both evaluations utilise language models for assessment, they differ in capabilities, processing, and configuration. Agent Judge leverages sophisticated agents with tool access and multi-step reasoning, allowing for more complex evaluations, while LLM Judge relies solely on direct language model responses.
In terms of processing, Agent Judge can incorporate external tools and structured reasoning, whereas LLM Judge is limited to single-step reasoning within the model.
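The difference in control flow can be summarised in a few lines of Python. `model_call` and `tools` here are placeholders, not a real API; the point is that the LLM judge makes exactly one call, while the agent judge loops, interleaving tool use with reasoning steps.

```python
def llm_judge(model_call, evaluation_prompt, inputs):
    """LLM as a Judge: one model call, no tools, single-step reasoning."""
    return model_call(evaluation_prompt + "\n" + str(inputs))

def agent_judge(model_call, evaluation_prompt, inputs, tools, max_steps=5):
    """Agent as a Judge: an iterative loop that may invoke tools between steps."""
    transcript = evaluation_prompt + "\n" + str(inputs)
    for _ in range(max_steps):
        step = model_call(transcript)
        if step.get("tool") in tools:                      # agent requested a tool
            observation = tools[step["tool"]](step.get("args"))
            transcript += f"\nObservation: {observation}"  # feed the result back
        else:
            return step                                    # final verdict with reasoning
    return {"score": None, "reasoning": "step budget exhausted"}

# Tiny demonstration with a fake model that asks for one tool, then concludes.
calls = iter([{"tool": "lookup", "args": "refund policy"},
              {"score": 1.0, "reasoning": "Answer matches the policy."}])
verdict = agent_judge(lambda _: next(calls),
                      "Judge the answer.", {"answer": "30 days"},
                      tools={"lookup": lambda a: "Refund window is 30 days."})
print(verdict["reasoning"])
```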