The LLM as a Judge evaluation uses a language model to automatically assess AI-generated responses. This enables structured, scalable evaluations against predefined criteria such as accuracy, conciseness, and relevance. By leveraging an LLM as the evaluator, you can verify that AI outputs align with expected responses and maintain quality across a variety of use cases.

See the eval definition of LLM as a Judge for a formal description of this evaluation.


Using Python SDK

Step 1: Setting Up the Evaluation

To evaluate AI responses, the LLM as a Judge evaluation requires:

  • Query: The user’s original question or prompt.
  • Response: The actual output generated by the AI model.
  • Expected Response (Optional): The ideal or reference answer.
  • Context (Optional): Background information to help guide evaluation.

Additionally, you must configure:

  • Model: The LLM used for evaluation (e.g., gpt-4o-mini).
  • Evaluation Criteria: Defines what aspects of the response should be judged (e.g., accuracy, conciseness, relevance).
  • Choices: Possible ratings such as Excellent, Good, Fair, Poor.
  • Multi-choice: Boolean value indicating whether multiple ratings can be selected.
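
The criteria, choices, and multi-choice settings map onto the evaluation template you build in Step 4. As a rough illustration only, they could be collected into a configuration dictionary along the following lines; the key names here are assumptions made for this sketch, not confirmed SDK parameters:

# Illustrative sketch only: these key names are assumptions, not confirmed SDK parameters.
eval_config = {
    "model": "gpt-4o-mini",                            # LLM used as the judge
    "criteria": "Judge the response for accuracy, conciseness and relevance.",
    "choices": ["Excellent", "Good", "Fair", "Poor"],  # allowed ratings
    "multi_choice": False                              # whether several ratings may be selected
}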

Step 2: Initialise the Evaluation Client

from fi.evals import EvalClient

evaluator = EvalClient(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key",
    fi_base_url="https://api.futureagi.com"
)
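
To avoid hard-coding credentials, you can read them from environment variables and pass them to the same constructor arguments shown above. The variable names FI_API_KEY and FI_SECRET_KEY are just a naming convention used in this sketch:

import os

from fi.evals import EvalClient

evaluator = EvalClient(
    fi_api_key=os.getenv("FI_API_KEY"),        # API key pulled from the environment
    fi_secret_key=os.getenv("FI_SECRET_KEY"),  # secret key pulled from the environment
    fi_base_url="https://api.futureagi.com"
)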

Step 3: Define the Test Case

from fi.testcases import LLMTestCase

test_case = LLMTestCase(
    query="What is the capital of France?",
    response="Paris is the capital of France and is known for the Eiffel Tower.",
    context="Paris has been France's capital since 987 CE."
)
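
Because Step 1 lists an expected response as an optional input, a test case can also carry a reference answer for the judge to compare against. The expected_response field name below is assumed from that list; check your SDK version for the exact attribute name:

test_case_with_reference = LLMTestCase(
    query="What is the capital of France?",
    response="Paris is the capital of France and is known for the Eiffel Tower.",
    expected_response="Paris.",  # assumed field name for the optional reference answer
    context="Paris has been France's capital since 987 CE."
)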

Step 4: Configure the Evaluation Criteria

from fi.evals.templates import LLMJudge

template = LLMJudge(config={
    "model": "gpt-4o-mini",
    "eval_prompt": "return summary of {{response}}",
    "system_prompt": "you are an expert in summarizing"
})
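
The prompt above simply asks the judge to summarise the response. To grade the response against the criteria from Step 1 instead, you can supply a judging-style prompt through the same config keys. The prompt text below is only one possible sketch, and the {{query}} and {{context}} placeholders are assumed to work the same way as {{response}}:

template = LLMJudge(config={
    "model": "gpt-4o-mini",
    "eval_prompt": (
        "Using the context {{context}}, rate the accuracy, conciseness and "
        "relevance of the answer {{response}} to the question {{query}}. "
        "Choose one of: Excellent, Good, Fair, Poor, and explain your reasoning."
    ),
    "system_prompt": "You are a strict, impartial evaluator of AI-generated answers."
})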

Step 5: Run the Evaluation

response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

print(f"Evaluation Result: {response.eval_results[0].reason}")
print(f"Score: {response.eval_results[0].metrics[0].value}")

The evaluation returns a float score between 0 and 1.

  • Higher values (closer to 1) indicate better performance.
  • Lower values (closer to 0) suggest a poor-quality response.
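
You can gate downstream behaviour on the returned score using the same result fields printed in Step 5. The 0.7 threshold below is an arbitrary value chosen for illustration:

score = response.eval_results[0].metrics[0].value

if score >= 0.7:  # arbitrary illustrative threshold
    print("Response meets the quality bar.")
else:
    print(f"Response flagged for review (score={score:.2f}).")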

Using the Interface

Configuration Parameters

  • Model: Specifies the LLM model used for evaluation.
  • Eval Prompt: The main evaluation prompt that defines how the AI should judge the response.
  • System Prompt: A higher-level instruction that guides the AI’s evaluation behaviour.