This evaluation provides an automated way to assess AI-generated responses using language models. It enables structured, scalable evaluation against predefined criteria such as accuracy, conciseness, and relevance. By using an LLM as the judge, you can verify that AI outputs align with expected responses and maintain quality across use cases. Click here to read the eval definition of LLM as a Judge.
Step 1: Setting Up the Evaluation
The LLM as a Judge evaluation requires the following inputs:
Query: The user’s original question or prompt.
Response: The actual output generated by the AI model.
Expected Response (Optional): The ideal or reference answer.
Context (Optional): Background information to help guide evaluation.
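Conceptually, these inputs form a simple record. A minimal sketch in plain Python (field names here are descriptive only; the SDK's actual `LLMTestCase` class is shown in Step 3):

```python
# Illustrative only: the four inputs an LLM-as-a-Judge evaluation can take.
# The SDK's real test-case class, LLMTestCase, appears in Step 3.
judge_inputs = {
    "query": "What is the capital of France?",      # required
    "response": "Paris is the capital of France.",  # required
    "expected_response": "Paris",                   # optional
    "context": "Paris has been France's capital since 987 CE.",  # optional
}

# Only query and response are strictly required.
missing = [k for k in ("query", "response") if k not in judge_inputs]
```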
Additionally, you must configure:
Model: The LLM used for evaluation (e.g., gpt-4o-mini).
Evaluation Criteria: The aspects of the response to judge (e.g., accuracy, conciseness, relevance).
Choices: Possible ratings such as Excellent, Good, Fair, Poor.
Multi-choice: Boolean value indicating whether multiple ratings can be selected.
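Taken together, these settings might be collected into a configuration like the one below. This is an illustrative dictionary only; the key names mirror the bullet points above, while the exact config keys the SDK accepts are shown in Step 4:

```python
# Illustrative judge configuration; key names mirror the settings
# described above, not necessarily the SDK's exact schema.
judge_config = {
    "model": "gpt-4o-mini",
    "criteria": "Rate the response for accuracy, conciseness, and relevance.",
    "choices": ["Excellent", "Good", "Fair", "Poor"],
    "multi_choice": False,  # only one rating may be selected
}
```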
Step 2: Initialise the Evaluation Client
```python
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key",
    fi_base_url="https://api.futureagi.com",
)
```
Step 3: Define the Test Case
```python
from fi.testcases import LLMTestCase

test_case = LLMTestCase(
    query="What is the capital of France?",
    response="Paris is the capital of France and is known for the Eiffel Tower.",
    context="Paris has been France's capital since 987 CE.",
)
```
Step 4: Configure the Evaluation Criteria
```python
from fi.evals.templates import LLMJudge

template = LLMJudge(config={
    "model": "gpt-4o-mini",
    "eval_prompt": "return summary of {{response}}",
    "system_prompt": "you are an expert in summarizing",
})
```
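The `{{response}}` placeholder in `eval_prompt` is filled from the test case at evaluation time. A rough sketch of this mustache-style substitution (illustrative only; the SDK performs this step internally):

```python
import re

def render_prompt(template: str, fields: dict) -> str:
    """Replace {{name}} placeholders with values from `fields`.

    Unknown placeholders are left untouched.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(fields.get(m.group(1), m.group(0))),
        template,
    )

prompt = render_prompt(
    "return summary of {{response}}",
    {"response": "Paris is the capital of France."},
)
# prompt == "return summary of Paris is the capital of France."
```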