Evaluate Using LLM as a Judge
This evaluation provides an automated way to assess AI-generated responses using language models. This method enables structured and scalable evaluations based on predefined criteria such as accuracy, conciseness, and relevance. By leveraging LLMs for evaluation, you can ensure that AI outputs align with expected responses and maintain quality across various use cases.
Click here to read the eval definition of LLM as a Judge
Using Python SDK
Step 1: Setting Up the Evaluation
The LLM as a Judge evaluation takes the following inputs:
- Query: The user’s original question or prompt.
- Response: The actual output generated by the AI model.
- Expected Response (Optional): The ideal or reference answer.
- Context (Optional): Background information to help guide evaluation.
Additionally, you must configure:
- Model: The LLM used for evaluation (e.g., `gpt-4o-mini`).
- Evaluation Criteria: Defines what aspects of the response should be judged (e.g., accuracy, conciseness, relevance).
- Choices: Possible ratings such as Excellent, Good, Fair, Poor.
- Multi-choice: Boolean value indicating whether multiple ratings can be selected.
Step 2: Initialise the Evaluation Client
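The snippet below is a minimal sketch of client initialisation. The import path `fi.evals`, the `EvalClient` class, and the `FI_API_KEY` environment variable are assumptions for illustration; substitute the actual names from the SDK reference.

```python
import os

from fi.evals import EvalClient  # assumed import path and class name

# Initialise the evaluation client with an API key read from the environment.
evaluator = EvalClient(api_key=os.environ["FI_API_KEY"])
```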
Step 3: Define the Test Case
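A test case bundles the inputs listed in Step 1. The `TestCase` container below is an assumption; the SDK may instead accept a plain dictionary with the same field names.

```python
from fi.evals import TestCase  # assumed import path and class name

# Field names mirror the inputs described in Step 1.
test_case = TestCase(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    expected_response="Paris",                         # optional reference answer
    context="France is a country in Western Europe.",  # optional background
)
```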
Step 4: Configure the Evaluation Criteria
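The sketch below shows one way the judge could be configured with the parameters described in Step 1; `LLMJudge` and the keyword names are assumptions, so check the SDK reference for the exact template class and arguments.

```python
from fi.evals import LLMJudge  # assumed import path and class name

judge = LLMJudge(
    model="gpt-4o-mini",  # LLM used to perform the judgement
    criteria=(
        "Rate the response for accuracy, conciseness, and relevance "
        "against the expected response."
    ),
    choices=["Excellent", "Good", "Fair", "Poor"],  # possible ratings
    multi_choice=False,  # only one rating may be selected
)
```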
Step 5: Run the Evaluation
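Assuming the client exposes an `evaluate` method and the result carries a numeric `score` attribute (both names are illustrative), running the evaluation might look like this:

```python
# Run the judge against the test case; method and attribute names are assumptions.
result = evaluator.evaluate(judge, test_case)

# The rating is normalised to a float between 0 and 1, as described below.
print(result.score)
```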
The evaluation returns a float rating between `0` and `1`.
- Higher values (closer to `1`) indicate better performance.
- Lower values (closer to `0`) suggest a poor-quality response.
Using Interface
Configuration Parameters
- Model: Specifies the LLM model used for evaluation.
- Eval Prompt: The main evaluation prompt that defines how the AI should judge the response.
- System Prompt: A higher-level instruction that guides the AI’s evaluation behaviour.
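For illustration, these fields might be filled in as follows; the prompt wording is an example rather than the product's default, and the model name is only one possible choice.

```python
# Illustrative values for the interface fields; actual defaults may differ.
model = "gpt-4o-mini"

eval_prompt = (
    "Given the query, the response, and the expected response, rate the "
    "response as Excellent, Good, Fair, or Poor based on accuracy, "
    "conciseness, and relevance. Return only the rating."
)

system_prompt = (
    "You are an impartial evaluator. Judge only the content provided and "
    "do not reward verbosity."
)
```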