LLM-as-Judge

Define custom grading criteria and run them with any LLM — GPT-4o, Gemini, Claude, Ollama, or any LiteLLM-supported model.

📝 TL;DR
  • Write custom grading criteria in plain English, score with any LLM
  • Any LiteLLM model string works: gemini/gemini-2.5-flash, gpt-4o, claude-sonnet-4-20250514, ollama/llama3.2:3b
  • Auto-generate detailed rubrics from short descriptions with generate_prompt=True

Use LLM-as-Judge when none of the 76+ local metrics or 100+ cloud templates covers your use case. Write grading criteria in plain English and pick a model; the SDK sends your criteria and inputs to the LLM and parses the score back into an EvalResult.

Note

Requires pip install ai-evaluation and an API key for your chosen model provider (e.g. GOOGLE_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY).
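If you prefer to configure the key in code rather than in your shell, it can be set on the process environment before the first evaluate() call. The literal key below is a placeholder, not a real credential:

```python
import os

# Placeholder value -- in practice, load the key from a secrets
# manager or .env file. setdefault avoids clobbering a key that
# is already configured in the shell.
os.environ.setdefault("GOOGLE_API_KEY", "your-api-key-here")

print("GOOGLE_API_KEY" in os.environ)  # → True
```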

Quick Example

from fi.evals import evaluate

result = evaluate(
    prompt="Rate how helpful this response is from 0 to 1. A helpful response directly answers the question with actionable steps.",
    output="Here are 3 steps to fix the issue: 1. Check your config file...",
    query="How do I fix the login error?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

print(result.score)   # 0.9
print(result.reason)  # JSON with score and explanation

How It Works

When you pass engine="llm" (or a non-Turing model string), the SDK:

  1. Takes your prompt as the grading criteria
  2. Substitutes any {field_name} placeholders with your input values
  3. Sends the criteria + inputs to the LLM
  4. Parses the response into an EvalResult with score, passed, and reason
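The placeholder substitution in step 2 can be pictured as simple pattern replacement. This is an illustrative sketch, not the SDK's actual implementation:

```python
import re

def substitute(criteria: str, inputs: dict) -> str:
    """Replace {field_name} placeholders with input values.

    Illustrative only -- the SDK does the equivalent internally.
    """
    def repl(match):
        key = match.group(1)
        if key not in inputs:
            raise KeyError(f"criteria references unknown field '{key}'")
        return str(inputs[key])
    return re.sub(r"\{(\w+)\}", repl, criteria)

criteria = "Does the response answer the question '{query}'?"
print(substitute(criteria, {"query": "What is the capital of France?"}))
# → Does the response answer the question 'What is the capital of France?'
```

Raising on an unknown field (rather than leaving the placeholder in place) makes typos in criteria fail loudly instead of silently confusing the judge model.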

Writing Criteria

The prompt parameter is your grading rubric. Write it as a clear instruction telling the LLM how to score the output.

Simple criteria

result = evaluate(
    prompt="Rate the professionalism of this email from 0 to 1.",
    output="Hey dude, we need the report ASAP. Thx.",
    engine="llm",
    model="gpt-4o",
)
# score → 0.2

Criteria with input references

Use {field_name} placeholders to reference any input field in your criteria.

result = evaluate(
    prompt="Does the response answer the question '{query}'? Score 0 if it ignores the question, 1 if it fully answers it.",
    output="The capital of France is Paris.",
    query="What is the capital of France?",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)
# score → 1.0

Multi-dimensional criteria

result = evaluate(
    prompt="""Score this customer support response from 0 to 1 based on:
- Empathy (does it acknowledge the customer's frustration?)
- Accuracy (is the information correct?)
- Actionability (does it give clear next steps?)
Weight all three equally.""",
    output="I understand this is frustrating. The issue is caused by a known bug in v2.3. We've released a fix in v2.4 — please update and let me know if it persists.",
    query="Your app keeps crashing and I've lost my data!",
    engine="llm",
    model="gpt-4o",
)
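Since result.reason arrives as a JSON string, pulling the explanation out is a plain json.loads away. The payload below is a hypothetical example in the score/explanation shape described above; the exact keys in your results may differ:

```python
import json

# Hypothetical reason payload, shaped like the JSON string the
# judge returns (score plus explanation).
reason = '{"score": 0.85, "explanation": "Empathetic and actionable; minor detail missing."}'

parsed = json.loads(reason)
print(parsed["score"])        # → 0.85
print(parsed["explanation"])
```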

Auto-Generated Rubrics

Short criteria can be ambiguous. Set generate_prompt=True to have the SDK expand your description into a detailed rubric automatically. The generated rubric is cached for the session.
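If you generate rubrics yourself, the per-session caching behavior can be mimicked with functools.lru_cache. This is a sketch with a stand-in generator; the SDK's own cache may work differently:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_rubric(description: str, model: str) -> str:
    # Stand-in for a real rubric-generation call. The cache ensures
    # one LLM call per unique (description, model) pair per session.
    return f"Detailed rubric for: {description}"

cached_rubric("Check if the response is empathetic", "gemini/gemini-2.5-flash")
cached_rubric("Check if the response is empathetic", "gemini/gemini-2.5-flash")
print(cached_rubric.cache_info().hits)  # → 1 (second call served from cache)
```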

# Without generate_prompt — the LLM interprets "empathetic" loosely
result = evaluate(
    prompt="Check if the response is empathetic",
    output="I understand. Let me help fix that.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

# With generate_prompt — expands into a detailed rubric first
result = evaluate(
    prompt="Check if the response is empathetic",
    output="I understand. Let me help fix that.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
    generate_prompt=True,
)

You can also generate rubrics separately:

from fi.evals.core.prompt_generator import generate_grading_criteria

rubric = generate_grading_criteria(
    "Check if the response is empathetic and acknowledges the customer's frustration",
    model="gemini/gemini-2.5-flash",
)
print(rubric)  # detailed multi-point rubric

Supported Models

Any model string supported by LiteLLM works. Common examples:

| Model            | String                   | API Key Env Var   |
|------------------|--------------------------|-------------------|
| Gemini 2.5 Flash | gemini/gemini-2.5-flash  | GOOGLE_API_KEY    |
| Gemini 2.5 Pro   | gemini/gemini-2.5-pro    | GOOGLE_API_KEY    |
| GPT-4o           | gpt-4o                   | OPENAI_API_KEY    |
| GPT-4o Mini      | gpt-4o-mini              | OPENAI_API_KEY    |
| Claude Sonnet 4  | claude-sonnet-4-20250514 | ANTHROPIC_API_KEY |
| Ollama (local)   | ollama/llama3.2:3b       | None (local)      |
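In a script that may run in different environments, you can pick the first model whose API key is actually configured, falling back to a local Ollama model that needs no key. A minimal sketch using the pairs from the table above:

```python
import os

# (model string, required env var) pairs; Ollama runs locally,
# so it needs no key and serves as the fallback.
CANDIDATES = [
    ("gemini/gemini-2.5-flash", "GOOGLE_API_KEY"),
    ("gpt-4o", "OPENAI_API_KEY"),
    ("claude-sonnet-4-20250514", "ANTHROPIC_API_KEY"),
    ("ollama/llama3.2:3b", None),
]

def pick_model() -> str:
    for model, env_var in CANDIDATES:
        if env_var is None or os.environ.get(env_var):
            return model
    raise RuntimeError("no usable model found")

print(pick_model())
```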