# Custom Eval Metrics: Write Your Own Evaluation Criteria

Define quality criteria in plain English, register them as reusable eval metrics in the FutureAGI dashboard, and run them via SDK with a single `evaluate()` call on any dataset or production trace.

By the end of this guide you will have created two custom eval metrics: one for a customer support quality rubric and one for a code review assistant, and run both from Python.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | ai-evaluation |
- FutureAGI account → app.futureagi.com
- API keys: `FI_API_KEY` and `FI_SECRET_KEY` (see Get your API keys)
- Python 3.9+
## Install

```bash
pip install futureagi ai-evaluation

export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```
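Since both keys must be present before any eval call, a quick self-check can catch a missing export early. This is a generic Python snippet, not part of the SDK:

```python
import os

def missing_keys(env=None, required=("FI_API_KEY", "FI_SECRET_KEY")):
    """Return the names of any required credentials that are not set."""
    env = os.environ if env is None else env
    return [k for k in required if not env.get(k)]

missing = missing_keys()
if missing:
    print(f"Set these before running evals: {missing}")
```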
## Tutorial

### Step 1: Create a custom eval from the dashboard
Custom evals are created in the platform and then available by name in SDK calls.
1. Go to app.futureagi.com → Evals (left sidebar under BUILD)

2. Click Create Evaluation

3. Fill in:
   - Name: `support_quality` (lowercase, underscores only)
   - Template type: Use Future AGI Agents (or Use other LLMs / Function based)
   - Model: `turing_small` (for Future AGI Agents)
   - Output Type: `Pass/Fail`
   - Optional fields: add tags and description if needed

4. Write the Rule Prompt, using `{{variable_name}}` for dynamic inputs:

   ```
   You are evaluating a customer support response.

   The customer asked: {{user_query}}
   The agent responded: {{agent_response}}

   Mark PASS only if all of these are true:
   - It acknowledges the customer's specific issue
   - It gives a concrete next step or resolution
   - It maintains a professional and empathetic tone

   Mark FAIL if any required condition is missing, or if the response is dismissive, vague, or off-topic.
   Return a clear PASS/FAIL decision with a short reason.
   ```

5. Click Create Evaluation
Your eval is now registered and can be selected in Dataset/Simulation evaluation flows.
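Conceptually, the platform substitutes each `{{variable_name}}` in the Rule Prompt with the matching key from your `inputs`. A rough local sketch of that substitution (the real rendering happens server-side and may differ):

```python
import re

def render_rule_prompt(template, inputs):
    """Replace each {{name}} placeholder with the matching inputs value."""
    def substitute(match):
        return str(inputs[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

prompt = "The customer asked: {{user_query}}"
print(render_rule_prompt(prompt, {"user_query": "Where is my order?"}))
# -> The customer asked: Where is my order?
```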
### Step 2: Run your custom eval via SDK

Use `Evaluator` from the ai-evaluation SDK and call your custom eval by name, passing the same variable names used in your Rule Prompt.

The model for a custom eval is configured in the dashboard when you create or edit that eval.
```python
import os
from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

result = evaluator.evaluate(
    eval_templates="support_quality",
    inputs={
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "I'm sorry to hear that. I've filed a replacement request and you'll receive a shipping confirmation within 24 hours.",
    },
)

eval_result = result.eval_results[0]
print(eval_result.output)
print(eval_result.reason)
```

Sample output shape:

```
0.0/1.0 or pass/fail-style output
<reason text>
```

Try a failing response:
```python
result = evaluator.evaluate(
    eval_templates="support_quality",
    inputs={
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "Please contact our returns department.",
    },
)

eval_result = result.eval_results[0]
print(eval_result.output)
print(eval_result.reason)
```

### Step 3: Create a second custom eval (numerical scoring)
Use the `Percentage` output type when you need a continuous quality score rather than a binary pass/fail. In SDK results, this is typically returned as a normalized score (0.0 to 1.0).
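To see how rubric points map onto that normalized scale, here is the arithmetic the weighted rubric below implies. Plain Python, illustrative only, not an SDK call:

```python
def normalized_score(points_awarded, points_possible=100):
    """Convert rubric points to the 0.0-1.0 scale reported by the SDK."""
    return round(points_awarded / points_possible, 2)

# A review comment that explains the problem (40) and is respectful (30)
# but offers no concrete fix (0 of 30) scores 70/100:
print(normalized_score(40 + 0 + 30))  # -> 0.7
```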
Repeat Step 1, but set:

- Name: `code_review_quality`
- Output Type: `Percentage` (displayed in SDK as 0.0-1.0)
- Rule Prompt:

```
You are evaluating a code review comment.

The code change: {{code_diff}}
The review comment: {{review_comment}}

Score using these weights:
- 40 points: Does it clearly explain what's wrong?
- 30 points: Does it suggest a concrete fix or improvement?
- 30 points: Is it constructive and respectful?

Return a normalized score from 0.0 to 1.0 (for example, 0.91 for 91/100).
```

Run it via SDK:
```python
result = evaluator.evaluate(
    eval_templates="code_review_quality",
    inputs={
        "code_diff": "- return user.name\n+ return user.name.strip()",
        "review_comment": "Good catch: whitespace in names can cause login failures. Consider adding a test case for this.",
    },
)

eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")
```

## What you built
You can now create custom eval metrics in the FutureAGI dashboard and run them programmatically via the SDK.

- Created a `support_quality` custom eval in the dashboard with a plain-English Pass/Fail rubric
- Created a `code_review_quality` custom eval with a weighted scoring rubric (returned as 0.0-1.0)
- Ran both evals via `Evaluator.evaluate()` using their registered names
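A natural next step is to run one of these evals over a batch of test cases and aggregate the results. The helper below is hypothetical (not part of the SDK) and assumes outputs arrive either as "Pass"/"Fail" strings or as 0.0-1.0 scores, matching the output shapes shown above; check your eval's actual output format.

```python
def pass_rate(outputs, threshold=0.5):
    """Fraction of eval outputs that count as passing.

    Handles "Pass"/"Fail" strings and numeric 0.0-1.0 scores
    (score >= threshold counts as a pass). Both formats are
    assumptions based on the output shapes shown earlier.
    """
    if not outputs:
        return 0.0
    passes = 0
    for out in outputs:
        if isinstance(out, str):
            passes += out.strip().lower() == "pass"
        else:
            passes += float(out) >= threshold
    return passes / len(outputs)

# e.g. values collected from result.eval_results[0].output across
# a loop of evaluator.evaluate(...) calls:
print(pass_rate(["Pass", "Fail", "Pass", "Pass"]))  # -> 0.75
```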