LLM as a Judge
Definition
Uses a large language model (LLM) to evaluate content based on custom prompts. This evaluation leverages the LLM's capacity for sophisticated judgments about text content, providing flexible and customisable evaluation criteria through carefully crafted prompts.
A successful evaluation provides a reasoned output based on the configured prompts and chosen model, while a failed evaluation indicates issues with the prompt execution or model response.
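For instance, a minimal sketch of such an evaluation, assuming the OpenAI Python client; the model name, prompt wording, and PASS/FAIL convention here are illustrative, not a fixed API:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are an impartial evaluator. Judge the text strictly against the given criteria."
EVAL_PROMPT = (
    "Criteria: the answer must be factually accurate and directly address the question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Respond with PASS or FAIL on the first line, followed by a one-paragraph explanation."
)

def judge(question: str, answer: str) -> str:
    """Ask the judge model for a reasoned verdict on a question/answer pair."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; substitute any supported one
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": EVAL_PROMPT.format(question=question, answer=answer)},
        ],
    )
    return response.choices[0].message.content

print(judge("What is the capital of France?", "Paris."))
```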
Calculation
The evaluation begins with Configuration Setup, where an appropriate LLM is selected and the evaluation prompt is defined to establish the assessment criteria. A system prompt can also be included to provide additional context.
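One way to represent this configuration in code, as a sketch with assumed field names (model, evaluation_prompt, system_prompt):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JudgeConfig:
    model: str                            # the LLM that performs the evaluation
    evaluation_prompt: str                # the assessment criteria
    system_prompt: Optional[str] = None   # optional additional context

config = JudgeConfig(
    model="gpt-4o",
    evaluation_prompt="Rate the response for factual accuracy on a 1-5 scale and explain your rating.",
    system_prompt="You are a strict, impartial evaluator.",
)
```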
During Prompt Execution, the system prompt sets the overall evaluation framework, while the evaluation prompt specifies the exact criteria to be assessed.
The model processes these prompts and generates a reasoned response. Finally, the output includes a detailed explanation of the evaluation, formatted according to the structure defined in the generic text template, ensuring consistency and interpretability.
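Continuing the JudgeConfig sketch above, Prompt Execution and output handling might look like the following; the parsing convention (a verdict line followed by an explanation) is an assumption standing in for the generic text template:

```python
from openai import OpenAI

client = OpenAI()

def run_judge(config: JudgeConfig, content: str) -> dict:
    """Execute the prompts and split the reply into a verdict and an explanation."""
    messages = []
    if config.system_prompt:   # the system prompt frames the overall evaluation
        messages.append({"role": "system", "content": config.system_prompt})
    messages.append({          # the evaluation prompt carries the exact criteria
        "role": "user",
        "content": f"{config.evaluation_prompt}\n\nContent to evaluate:\n{content}",
    })
    response = client.chat.completions.create(model=config.model, messages=messages)
    text = response.choices[0].message.content
    # Assumed output convention: first line holds the verdict/score,
    # the remainder holds the reasoned explanation.
    verdict, _, explanation = text.partition("\n")
    return {"verdict": verdict.strip(), "explanation": explanation.strip()}
```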
What to do when an LLM as a Judge Eval Fails
First, ensure that the evaluation prompt is clear, well-structured, and aligned with the intended assessment goals. Check the system prompt as well, making sure it provides the necessary context for an accurate evaluation.
If the prompts are correctly structured but issues persist, the next step is Model Selection: verify that the chosen model is suitable for the evaluation task, and if needed, consider alternative models from the available list to improve evaluation accuracy and reliability.
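Building on the run_judge sketch above, a simple Model Selection fallback might retry the evaluation with alternative judge models; the candidate list is illustrative:

```python
CANDIDATE_MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]  # illustrative list

def judge_with_fallback(config: JudgeConfig, content: str) -> dict:
    """Retry the evaluation with alternative judge models until one succeeds."""
    last_error = None
    for model in CANDIDATE_MODELS:
        try:
            attempt = JudgeConfig(model, config.evaluation_prompt, config.system_prompt)
            result = run_judge(attempt, content)
            if result["verdict"]:  # reject empty or malformed verdicts
                return result
        except Exception as exc:   # e.g. rate limits or an unavailable model
            last_error = exc
    raise RuntimeError(f"All judge models failed; last error: {last_error}")
```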
Differentiating LLM as a Judge from Agent as a Judge
LLM as a Judge relies solely on language models to assess content, providing a direct and structured evaluation based on predefined criteria. In contrast, Agent as a Judge incorporates AI agents, which may leverage additional tools, reasoning capabilities, or external data sources to enhance the evaluation process.
While LLM as a Judge follows a more straightforward model-driven approach, Agent as a Judge enables more complex assessments through agent-based reasoning and decision-making.
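To make the contrast concrete, the sketch below (reusing run_judge and JudgeConfig from above) pairs a direct model-only judge with a hypothetical agent-based judge that gathers external evidence before judging; lookup_reference is an invented placeholder tool, not a real API:

```python
def llm_judge(content: str) -> dict:
    # LLM as a Judge: a direct, model-only evaluation against fixed criteria.
    return run_judge(config, content)

def lookup_reference(claim: str) -> str:
    # Hypothetical tool: fetch supporting evidence for a claim, e.g. from a
    # knowledge base or search index. A real agent would call an actual tool here.
    return "Reference material relevant to: " + claim

def agent_judge(content: str) -> dict:
    # Agent as a Judge: gather external evidence first, then ask the model to
    # judge the content against that evidence.
    evidence = lookup_reference(content)
    grounded = JudgeConfig(
        model=config.model,
        evaluation_prompt=f"{config.evaluation_prompt}\n\nEvidence:\n{evidence}",
        system_prompt=config.system_prompt,
    )
    return run_judge(grounded, content)
```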