LLM-as-Judge

Define custom grading criteria and run them with any LLM — GPT-4o, Gemini, Claude, Ollama, or any LiteLLM-supported model.

📝 TL;DR
  • Write custom grading criteria in plain English, score with any LLM
  • Any LiteLLM model string works: gemini/gemini-2.5-flash, gpt-4o, claude-sonnet-4-20250514, ollama/llama3.2:3b
  • Auto-generate detailed rubrics from short descriptions with generate_prompt=True

Use LLM-as-Judge when none of the 76+ local metrics or 100+ cloud templates covers your use case. Write grading criteria in plain English and pick a model; the SDK sends your criteria and inputs to the LLM and parses the score back into an EvalResult.

Note

Requires pip install ai-evaluation and an API key for your chosen model provider (e.g. GOOGLE_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY).
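If you prefer to configure the key in code rather than in your shell, it can be set on the process environment before the first evaluate() call. The literal key below is a placeholder, not a real credential:

```python
import os

# Placeholder value -- in practice, load the key from a secrets
# manager or .env file. setdefault avoids clobbering a key that
# is already configured in the shell.
os.environ.setdefault("GOOGLE_API_KEY", "your-api-key-here")

print("GOOGLE_API_KEY" in os.environ)  # → True
```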

Quick Example

from fi.evals import evaluate

result = evaluate(
    prompt="Rate how helpful this response is from 0 to 1. A helpful response directly answers the question with actionable steps.",
    output="Here are 3 steps to fix the issue: 1. Check your config file...",
    query="How do I fix the login error?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

print(result.score)   # 0.9
print(result.reason)  # JSON with score and explanation

How It Works

When you pass engine="llm" (or a non-Turing model string), the SDK:

  1. Takes your prompt as the grading criteria
  2. Substitutes any {field_name} placeholders with your input values
  3. Sends the criteria + inputs to the LLM
  4. Parses the response into an EvalResult with score, passed, and reason
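The placeholder substitution in step 2 can be pictured as simple pattern replacement. This is an illustrative sketch, not the SDK's actual implementation:

```python
import re

def substitute(criteria: str, inputs: dict) -> str:
    """Replace {field_name} placeholders with input values.

    Illustrative only -- the SDK does the equivalent internally.
    """
    def repl(match):
        key = match.group(1)
        if key not in inputs:
            raise KeyError(f"criteria references unknown field '{key}'")
        return str(inputs[key])
    return re.sub(r"\{(\w+)\}", repl, criteria)

criteria = "Does the response answer the question '{query}'?"
print(substitute(criteria, {"query": "What is the capital of France?"}))
# → Does the response answer the question 'What is the capital of France?'
```

Raising on an unknown field (rather than leaving the placeholder in place) makes typos in criteria fail loudly instead of silently confusing the judge model.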

Writing Criteria

The prompt parameter is your grading rubric. Write it as a clear instruction telling the LLM how to score the output.

Simple criteria

result = evaluate(
    prompt="Rate the professionalism of this email from 0 to 1.",
    output="Hey dude, we need the report ASAP. Thx.",
    engine="llm",
    model="gpt-4o",
)
# score → 0.2

Criteria with input references

Use {field_name} placeholders to reference any input field in your criteria.

result = evaluate(
    prompt="Does the response answer the question '{query}'? Score 0 if it ignores the question, 1 if it fully answers it.",
    output="The capital of France is Paris.",
    query="What is the capital of France?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
# score → 1.0

Multi-dimensional criteria

result = evaluate(
    prompt="""Score this customer support response from 0 to 1 based on:
- Empathy (does it acknowledge the customer's frustration?)
- Accuracy (is the information correct?)
- Actionability (does it give clear next steps?)
Weight all three equally.""",
    output="I understand this is frustrating. The issue is caused by a known bug in v2.3. We've released a fix in v2.4 — please update and let me know if it persists.",
    query="Your app keeps crashing and I've lost my data!",
    engine="llm",
    model="gpt-4o",
)
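Since result.reason arrives as a JSON string, pulling the explanation out is a plain json.loads away. The payload below is a hypothetical example in the score/explanation shape described above; the exact keys in your results may differ:

```python
import json

# Hypothetical reason payload, shaped like the JSON string the
# judge returns (score plus explanation).
reason = '{"score": 0.85, "explanation": "Empathetic and actionable; minor detail missing."}'

parsed = json.loads(reason)
print(parsed["score"])        # → 0.85
print(parsed["explanation"])
```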

Auto-Generated Rubrics

Short criteria can be ambiguous. Set generate_prompt=True to have the SDK expand your description into a detailed rubric automatically. The generated rubric is cached for the session.
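If you generate rubrics yourself, the per-session caching behavior can be mimicked with functools.lru_cache. This is a sketch with a stand-in generator; the SDK's own cache may work differently:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_rubric(description: str, model: str) -> str:
    # Stand-in for a real rubric-generation call. The cache ensures
    # one LLM call per unique (description, model) pair per session.
    return f"Detailed rubric for: {description}"

cached_rubric("Check if the response is empathetic", "gemini/gemini-2.5-flash")
cached_rubric("Check if the response is empathetic", "gemini/gemini-2.5-flash")
print(cached_rubric.cache_info().hits)  # → 1 (second call served from cache)
```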

# Without generate_prompt — the LLM interprets "empathetic" loosely
result = evaluate(
    prompt="Check if the response is empathetic",
    output="I understand. Let me help fix that.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

# With generate_prompt — expands into a detailed rubric first
result = evaluate(
    prompt="Check if the response is empathetic",
    output="I understand. Let me help fix that.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
    generate_prompt=True,
)

You can also generate rubrics separately:

from fi.evals.core.prompt_generator import generate_grading_criteria

rubric = generate_grading_criteria(
    "Check if the response is empathetic and acknowledges the customer's frustration",
    model="gemini/gemini-2.5-flash",
)
print(rubric)  # detailed multi-point rubric

Supported Models

Any model string supported by LiteLLM works. Common examples:

| Model            | String                   | API Key Env Var   |
|------------------|--------------------------|-------------------|
| Gemini 2.5 Flash | gemini/gemini-2.5-flash  | GOOGLE_API_KEY    |
| Gemini 2.5 Pro   | gemini/gemini-2.5-pro    | GOOGLE_API_KEY    |
| GPT-4o           | gpt-4o                   | OPENAI_API_KEY    |
| GPT-4o Mini      | gpt-4o-mini              | OPENAI_API_KEY    |
| Claude Sonnet 4  | claude-sonnet-4-20250514 | ANTHROPIC_API_KEY |
| Ollama (local)   | ollama/llama3.2:3b       | None (local)      |
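In a script that may run in different environments, you can pick the first model whose API key is actually configured, falling back to a local Ollama model that needs no key. A minimal sketch using the pairs from the table above:

```python
import os

# (model string, required env var) pairs; Ollama runs locally,
# so it needs no key and serves as the fallback.
CANDIDATES = [
    ("gemini/gemini-2.5-flash", "GOOGLE_API_KEY"),
    ("gpt-4o", "OPENAI_API_KEY"),
    ("claude-sonnet-4-20250514", "ANTHROPIC_API_KEY"),
    ("ollama/llama3.2:3b", None),
]

def pick_model() -> str:
    for model, env_var in CANDIDATES:
        if env_var is None or os.environ.get(env_var):
            return model
    raise RuntimeError("no usable model found")

print(pick_model())
```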