Custom Eval Metrics: Build Your Own

Define LLM quality criteria in plain English, register reusable eval metrics in the FutureAGI dashboard, and run them via SDK with a single evaluate() call.

TL;DR

Define quality criteria in plain English, register them as reusable eval metrics in the FutureAGI dashboard, and run them via SDK with a single evaluate() call.

By the end of this guide you will have created two custom eval metrics: one for a customer support quality rubric and one for a code review assistant, then run both from Python.

Time	Difficulty	Package
10 min	Beginner	`ai-evaluation`

Prerequisites

FutureAGI account → app.futureagi.com
API keys: FI_API_KEY and FI_SECRET_KEY (see Get your API keys)
Python 3.9+

Install

pip install futureagi ai-evaluation

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
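
Before moving on, it can help to confirm that both variables are visible to Python. The following is a minimal sketch and not part of the SDK; the helper name missing_credentials is hypothetical.

```python
import os

# The two variable names come from the Install step above.
REQUIRED_VARS = ("FI_API_KEY", "FI_SECRET_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
print("Missing:", ", ".join(missing) if missing else "none")
```

If the list is not empty, re-run the export commands in the same shell session that launches Python.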

Tutorial

Create a custom eval from the dashboard

Custom evals are created in the platform and are then available by name in SDK calls.

Go to app.futureagi.com → Evals (left sidebar under BUILD)
Click Create Evaluation
Fill in:
- Name: support_quality (lowercase, underscores only)
- Template type: Use Future AGI Agents (or Use other LLMs / Function based)
- Model: turing_small (for Future AGI Agents)
- Output Type: Pass/Fail
- Optional fields: add tags and description if needed
Write the Rule Prompt using {{variable_name}} for dynamic inputs:

You are evaluating a customer support response.

The customer asked: {{user_query}}
The agent responded: {{agent_response}}

Mark PASS only if all of these are true:
- It acknowledges the customer's specific issue
- It gives a concrete next step or resolution
- It maintains a professional and empathetic tone

Mark FAIL if any required condition is missing, or if the response is dismissive, vague, or off-topic.

Return a clear PASS/FAIL decision with a short reason.

Click Create Evaluation

Your eval is now registered and can be selected in Dataset/Simulation evaluation flows.
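
Because the SDK call must pass the same variable names that appear in the Rule Prompt, a small local check can catch typos before an eval is run. This is a sketch under stated assumptions: the helper names prompt_variables and check_inputs are hypothetical, and the pattern assumes placeholders look like {{variable_name}} with optional surrounding spaces, which may differ from how the platform parses them.

```python
import re

# Assumed placeholder syntax: {{name}}, where name is an identifier.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def prompt_variables(rule_prompt):
    """Return the set of placeholder names found in a Rule Prompt."""
    return set(PLACEHOLDER.findall(rule_prompt))


def check_inputs(rule_prompt, inputs):
    """Return (missing, extra) key names, each sorted, for an inputs dict."""
    expected = prompt_variables(rule_prompt)
    provided = set(inputs)
    return sorted(expected - provided), sorted(provided - expected)


RULE_PROMPT = """The customer asked: {{user_query}}
The agent responded: {{agent_response}}"""

# "response" is a deliberate typo for "agent_response".
missing, extra = check_inputs(
    RULE_PROMPT,
    {"user_query": "Where is my order?", "response": "It ships tomorrow."},
)
print("missing:", missing)
print("extra:", extra)
```

Here the check reports agent_response as missing and response as extra, which points at the misnamed key.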

Run your custom eval via SDK

Use Evaluator from the ai-evaluation SDK and call your custom eval by name. Pass the same variable names used in your Rule Prompt.

The model for a custom eval is configured in the dashboard when you create or edit that eval.

import os
from fi.evals import Evaluator

# Read the credentials exported in the Install step.
evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

# eval_templates is the name registered in the dashboard; the inputs keys
# must match the {{variable_name}} placeholders in the Rule Prompt.
result = evaluator.evaluate(
    eval_templates="support_quality",
    inputs={
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "I'm sorry to hear that. I've filed a replacement request and you'll receive a shipping confirmation within 24 hours.",
    },
)

# Print the eval output, then the evaluator's reason.
eval_result = result.eval_results[0]
print(eval_result.output)
print(eval_result.reason)

Sample output shape (the first line is the verdict, the second is the evaluator's explanation):

0.0/1.0 or pass/fail-style output
<reason text>

Try a failing response:

result = evaluator.evaluate(
    eval_templates="support_quality",
    inputs={
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "Please contact our returns department.",
    },
)

eval_result = result.eval_results[0]
print(eval_result.output)
print(eval_result.reason)
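
To compare several responses in one pass, the two calls above can be wrapped in a small loop. The sketch below is illustrative only: run_cases and to_verdict are hypothetical helpers, the accepted verdict spellings are assumptions (the output may arrive as 0.0/1.0 or as a pass/fail-style value), and fake_evaluate is an offline stand-in so the example runs without credentials. In real use, pass evaluator.evaluate instead.

```python
from types import SimpleNamespace

# Assumed spellings of a Pass/Fail output; adjust to what your eval returns.
PASS_VALUES = {"pass", "passed", "true", "1", "1.0"}
FAIL_VALUES = {"fail", "failed", "false", "0", "0.0"}


def to_verdict(output):
    """Map an eval output to True (pass), False (fail), or None (unrecognized)."""
    if isinstance(output, bool):
        return output
    if isinstance(output, (int, float)):
        return True if output == 1 else False if output == 0 else None
    if isinstance(output, str):
        text = output.strip().lower()
        if text in PASS_VALUES:
            return True
        if text in FAIL_VALUES:
            return False
    return None


def run_cases(evaluate, template, cases):
    """Call evaluate once per inputs dict and collect (verdict, reason) pairs."""
    rows = []
    for inputs in cases:
        result = evaluate(eval_templates=template, inputs=inputs)
        first = result.eval_results[0]
        rows.append((to_verdict(first.output), first.reason))
    return rows


def fake_evaluate(eval_templates, inputs):
    # Offline stand-in for evaluator.evaluate; NOT the real SDK behavior.
    passed = "replacement" in inputs["agent_response"]
    item = SimpleNamespace(output="Passed" if passed else "Failed", reason="stub reason")
    return SimpleNamespace(eval_results=[item])


cases = [
    {"user_query": "My order arrived damaged.", "agent_response": "I've filed a replacement request."},
    {"user_query": "My order arrived damaged.", "agent_response": "Please contact our returns department."},
]
rows = run_cases(fake_evaluate, "support_quality", cases)
for verdict, reason in rows:
    print(verdict, reason)
```

Returning None for unrecognized outputs keeps an unexpected format from being silently counted as a pass or a fail.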

Create a second custom eval (numerical scoring)

Use the Percentage output type when you need a continuous quality score rather than a binary pass/fail. In SDK results, the score is typically returned normalized to the range 0.0 to 1.0.

Repeat the steps from "Create a custom eval from the dashboard", but set:
- Name: code_review_quality
- Output Type: Percentage (displayed in SDK as 0.0-1.0)
- Rule Prompt:

You are evaluating a code review comment.

The code change: {{code_diff}}
The review comment: {{review_comment}}

Score using these weights:
- 40 points: Does it clearly explain what's wrong?
- 30 points: Does it suggest a concrete fix or improvement?
- 30 points: Is it constructive and respectful?

Return a normalized score from 0.0 to 1.0 (for example, 0.91 for 91/100).
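
The weights in this rubric add up to 100 points, so the normalized score is the earned points divided by 100. The arithmetic can be sketched as follows. The evaluator model does the scoring itself; the criterion keys and the per-criterion credit values below are illustrative assumptions, not something the platform exposes.

```python
# Point weights copied from the rubric; the key names are hypothetical labels.
WEIGHTS = {
    "explains_problem": 40,
    "suggests_fix": 30,
    "constructive_tone": 30,
}


def normalized_score(credit):
    """Combine per-criterion credit (each in [0, 1]) into a 0.0-1.0 score."""
    total = sum(WEIGHTS.values())
    points = sum(WEIGHTS[name] * credit[name] for name in WEIGHTS)
    return round(points / total, 2)


# Full credit on two criteria and 70% on the suggested fix: 40 + 21 + 30 = 91.
score = normalized_score(
    {"explains_problem": 1.0, "suggests_fix": 0.7, "constructive_tone": 1.0}
)
print(score)  # 0.91
```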

Run it via SDK:

result = evaluator.evaluate(
    eval_templates="code_review_quality",
    inputs={
        "code_diff": "- return user.name\n+ return user.name.strip()",
        "review_comment": "Good catch: whitespace in names can cause login failures. Consider adding a test case for this.",
    },
)

eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
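
A continuous score is often turned into a gate, for example in a CI check. The sketch below assumes the output is a number (or numeric string) in the 0.0 to 1.0 range described above; the helper name meets_threshold and the 0.7 default are hypothetical choices, not SDK defaults.

```python
def meets_threshold(output, threshold=0.7):
    """Return True if a Percentage-style output is at or above the threshold."""
    score = float(output)
    if not 0.0 <= score <= 1.0:
        # Guards against a raw 0-100 value being compared to a 0-1 threshold.
        raise ValueError(f"expected a score in [0.0, 1.0], got {score}")
    return score >= threshold


print(meets_threshold(0.91))
print(meets_threshold("0.5"))
```

In real use, pass eval_result.output from the call above as the first argument.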

What you built

You can now create custom eval metrics in the FutureAGI dashboard and run them programmatically via the SDK.

Created a support_quality custom eval in the dashboard with a plain-English Pass/Fail rubric
Created a code_review_quality custom eval with a weighted scoring rubric (returned as 0.0-1.0)
Ran both evals via Evaluator.evaluate() using their registered names

Next steps

All Built-in Metrics

Running Your First Eval

Hallucination Detection

Eval Groups