LLM-as-Judge
Define custom grading criteria and run them with any LLM — GPT-4o, Gemini, Claude, Ollama, or any LiteLLM-supported model.
- Write custom grading criteria in plain English and score with any LLM
- Any LiteLLM model string works: `gemini/gemini-2.5-flash`, `gpt-4o`, `claude-sonnet-4-20250514`, `ollama/llama3.2:3b`
- Auto-generate detailed rubrics from short descriptions with `generate_prompt=True`
Use LLM-as-Judge when none of the 76+ local metrics or 100+ cloud templates cover your use case. Write grading criteria in plain English, pick a model, and the SDK sends the criteria and inputs to the LLM, then parses the response back into an `EvalResult`.
> **Note:** Requires `pip install ai-evaluation` and an API key for your chosen model provider (e.g. `GOOGLE_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
Quick Example
```python
from fi.evals import evaluate

result = evaluate(
    prompt="Rate how helpful this response is from 0 to 1. A helpful response directly answers the question with actionable steps.",
    output="Here are 3 steps to fix the issue: 1. Check your config file...",
    query="How do I fix the login error?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
print(result.score)   # 0.9
print(result.reason)  # JSON with score and explanation
```
How It Works
When you pass `engine="llm"` (or a non-Turing model string), the SDK:
- Takes your `prompt` as the grading criteria
- Substitutes any `{field_name}` placeholders with your input values
- Sends the criteria + inputs to the LLM
- Parses the response into an `EvalResult` with `score`, `passed`, and `reason`
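The substitute-and-parse steps above can be sketched in plain Python. This is a minimal illustration of the idea, not the SDK's actual internals; `render_criteria`, `parse_judgment`, and the 0.5 pass threshold are all hypothetical:

```python
import json
import re

def render_criteria(prompt: str, **fields) -> str:
    """Replace {field_name} placeholders with the supplied input values."""
    def substitute(match: re.Match) -> str:
        # Leave unknown placeholders untouched rather than erroring
        return str(fields.get(match.group(1), match.group(0)))
    return re.sub(r"\{(\w+)\}", substitute, prompt)

def parse_judgment(raw: str) -> dict:
    """Parse a JSON judge response into score/passed/reason fields."""
    data = json.loads(raw)
    score = float(data["score"])
    return {"score": score, "passed": score >= 0.5, "reason": data.get("reason", "")}

criteria = render_criteria(
    "Does the response answer the question '{query}'?",
    query="What is the capital of France?",
)
print(criteria)
# Does the response answer the question 'What is the capital of France?'
```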
Writing Criteria
The `prompt` parameter is your grading rubric. Write it as a clear instruction telling the LLM how to score the output.
Simple criteria
```python
result = evaluate(
    prompt="Rate the professionalism of this email from 0 to 1.",
    output="Hey dude, we need the report ASAP. Thx.",
    engine="llm",
    model="gpt-4o",
)
# score → 0.2
```
Criteria with input references
Use `{field_name}` placeholders to reference any input field in your criteria.
```python
result = evaluate(
    prompt="Does the response answer the question '{query}'? Score 0 if it ignores the question, 1 if it fully answers it.",
    output="The capital of France is Paris.",
    query="What is the capital of France?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
# score → 1.0
```
Multi-dimensional criteria
```python
result = evaluate(
    prompt="""Score this customer support response from 0 to 1 based on:
- Empathy (does it acknowledge the customer's frustration?)
- Accuracy (is the information correct?)
- Actionability (does it give clear next steps?)
Weight all three equally.""",
    output="I understand this is frustrating. The issue is caused by a known bug in v2.3. We've released a fix in v2.4 — please update and let me know if it persists.",
    query="Your app keeps crashing and I've lost my data!",
    engine="llm",
    model="gpt-4o",
)
```
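For reference, "weight all three equally" is just the arithmetic mean of the dimension scores. The sub-scores below are made-up values for illustration, not real judge output:

```python
# Hypothetical per-dimension scores a judge might assign to the response above
sub_scores = {"empathy": 0.9, "accuracy": 1.0, "actionability": 0.8}

# Equal weighting = arithmetic mean of the three dimension scores
overall = sum(sub_scores.values()) / len(sub_scores)
print(round(overall, 2))  # 0.9
```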
Auto-Generated Rubrics
Short criteria can be ambiguous. Set `generate_prompt=True` to have the SDK expand your description into a detailed rubric automatically. The generated rubric is cached for the session.
```python
# Without generate_prompt — the LLM interprets "empathetic" loosely
result = evaluate(
    prompt="Check if the response is empathetic",
    output="I understand. Let me help fix that.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

# With generate_prompt — expands into a detailed rubric first
result = evaluate(
    prompt="Check if the response is empathetic",
    output="I understand. Let me help fix that.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
    generate_prompt=True,
)
```
You can also generate rubrics separately:
```python
from fi.evals.core.prompt_generator import generate_grading_criteria

rubric = generate_grading_criteria(
    "Check if the response is empathetic and acknowledges the customer's frustration",
    model="gemini/gemini-2.5-flash",
)
print(rubric)  # detailed multi-point rubric
```
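The per-session caching mentioned above amounts to simple memoization. This is a sketch of the idea, not the SDK's implementation; `_rubric_cache`, `cached_rubric`, and `fake_generator` are hypothetical names:

```python
_rubric_cache: dict = {}

def cached_rubric(description: str, generate) -> str:
    """Generate a rubric once per session, then reuse the cached copy."""
    if description not in _rubric_cache:
        _rubric_cache[description] = generate(description)
    return _rubric_cache[description]

# Stand-in generator so the caching behavior is observable without an API call
calls = []
def fake_generator(desc: str) -> str:
    calls.append(desc)
    return f"Detailed rubric for: {desc}"

cached_rubric("Check if the response is empathetic", fake_generator)
cached_rubric("Check if the response is empathetic", fake_generator)
print(len(calls))  # 1 — the second call hit the cache
```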
Supported Models
Any model string supported by LiteLLM works. Common examples:
| Model | Model String | API Key Env Var |
|---|---|---|
| Gemini 2.5 Flash | `gemini/gemini-2.5-flash` | `GOOGLE_API_KEY` |
| Gemini 2.5 Pro | `gemini/gemini-2.5-pro` | `GOOGLE_API_KEY` |
| GPT-4o | `gpt-4o` | `OPENAI_API_KEY` |
| GPT-4o Mini | `gpt-4o-mini` | `OPENAI_API_KEY` |
| Claude Sonnet 4 | `claude-sonnet-4-20250514` | `ANTHROPIC_API_KEY` |
| Ollama (local) | `ollama/llama3.2:3b` | None (local) |
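One pattern the table suggests is picking a model at runtime based on which API key is set, falling back to a local Ollama model when none is. `pick_model` and `CANDIDATES` are hypothetical helpers for illustration, not part of the SDK:

```python
import os

# Preference-ordered (model string, required env var) pairs; Ollama needs no key
CANDIDATES = [
    ("gemini/gemini-2.5-flash", "GOOGLE_API_KEY"),
    ("gpt-4o", "OPENAI_API_KEY"),
    ("claude-sonnet-4-20250514", "ANTHROPIC_API_KEY"),
    ("ollama/llama3.2:3b", None),  # local fallback, no key required
]

def pick_model() -> str:
    """Return the first model whose API key is set, else the local fallback."""
    for model, env_var in CANDIDATES:
        if env_var is None or os.environ.get(env_var):
            return model
    raise RuntimeError("no usable model found")

print(pick_model())
```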