Custom Eval Metrics: Build Your Own
Define LLM quality criteria in plain English, register reusable eval metrics in the FutureAGI dashboard, and run them via SDK with a single evaluate() call.
By the end of this guide you will have created two custom eval metrics: one for a customer support quality rubric and one for a code review assistant, then run both from Python.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | ai-evaluation |
- FutureAGI account → app.futureagi.com
- API keys: FI_API_KEY and FI_SECRET_KEY (see Get your API keys)
- Python 3.9+
Install
pip install futureagi ai-evaluation
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
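Before going further, it can help to confirm that both variables are visible to Python. The following is a minimal sketch using only the standard library; the helper name missing_credentials is hypothetical and not part of the SDK.

```python
import os

# The two variables exported above.
REQUIRED_VARS = ("FI_API_KEY", "FI_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or blank."""
    # Hypothetical helper: defaults to the real process environment.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment:
print(missing_credentials({"FI_API_KEY": "abc"}))  # ['FI_SECRET_KEY']
```

Calling missing_credentials() with no arguments checks the current shell environment; an empty list means both keys are set.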
Tutorial
Create a custom eval from the dashboard
Custom evals are created in the platform and then available by name in SDK calls.
- Go to app.futureagi.com → Evals (left sidebar under BUILD)
- Click Create Evaluation
- Fill in:
  - Name: support_quality (lowercase, underscores only)
  - Template type: Use Future AGI Agents (or Use other LLMs / Function based)
  - Model: turing_small (for Future AGI Agents)
  - Output Type: Pass/Fail
  - Optional fields: add tags and description if needed
- Write the Rule Prompt using {{variable_name}} for dynamic inputs:
You are evaluating a customer support response.
The customer asked: {{user_query}}
The agent responded: {{agent_response}}
Mark PASS only if all of these are true:
- It acknowledges the customer's specific issue
- It gives a concrete next step or resolution
- It maintains a professional and empathetic tone
Mark FAIL if any required condition is missing, or if the response is dismissive, vague, or off-topic.
Return a clear PASS/FAIL decision with a short reason.
- Click Create Evaluation
Your eval is now registered and can be selected in Dataset/Simulation evaluation flows.
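Because the SDK inputs must use the same variable names as the Rule Prompt, a small local pre-flight check can catch typos before any call is made. This is a sketch under stated assumptions: the helper names are hypothetical, the regex tolerates whitespace inside the braces (the platform may not), and the name check reads "lowercase, underscores only" literally.

```python
import re

# Matches {{name}}; tolerance for inner whitespace is an assumption.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")
# "lowercase, underscores only", read literally (digits are not covered by the guide).
EVAL_NAME = re.compile(r"[a-z_]+")


def prompt_variables(rule_prompt):
    """Return placeholder names in order of first appearance, without duplicates."""
    seen = []
    for name in PLACEHOLDER.findall(rule_prompt):
        if name not in seen:
            seen.append(name)
    return seen


def check_inputs(rule_prompt, inputs):
    """Return (missing, unexpected) key lists for an inputs dict."""
    expected = prompt_variables(rule_prompt)
    missing = [name for name in expected if name not in inputs]
    unexpected = [key for key in inputs if key not in expected]
    return missing, unexpected


def is_valid_eval_name(name):
    """True if the name uses only lowercase letters and underscores."""
    return EVAL_NAME.fullmatch(name) is not None


rule_prompt = (
    "The customer asked: {{user_query}}\n"
    "The agent responded: {{agent_response}}\n"
)
print(prompt_variables(rule_prompt))                    # ['user_query', 'agent_response']
print(check_inputs(rule_prompt, {"user_query": "Hi"}))  # (['agent_response'], [])
print(is_valid_eval_name("support_quality"))            # True
```

The check runs entirely on your machine; it does not confirm what the dashboard accepts, only that your prompt text and inputs dict agree with each other.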
Run your custom eval via SDK
Use Evaluator from the ai-evaluation SDK and call your custom eval by name. Pass the same variable names used in your Rule Prompt.
The model for a custom eval is configured in the dashboard when you create or edit that eval.
import os

from fi.evals import Evaluator

# Credentials are read from the environment variables exported during install.
evaluator = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

# The keys in inputs must match the {{variable_name}} placeholders in the Rule Prompt.
result = evaluator.evaluate(
    eval_templates="support_quality",
    inputs={
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "I'm sorry to hear that. I've filed a replacement request and you'll receive a shipping confirmation within 24 hours.",
    },
)

eval_result = result.eval_results[0]
print(eval_result.output)
print(eval_result.reason)

Sample output shape:

0.0/1.0 or pass/fail-style output
<reason text>

Try a failing response:
result = evaluator.evaluate(
    eval_templates="support_quality",
    inputs={
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "Please contact our returns department.",
    },
)

eval_result = result.eval_results[0]
print(eval_result.output)
print(eval_result.reason)

Create a second custom eval (numerical scoring)
Use Percentage output type when you need a continuous quality score rather than binary pass/fail. In SDK results, this is typically returned as a normalized score (0.0 to 1.0).
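Since an eval output may arrive as a pass/fail-style label or as a normalized score, downstream code often wants a single numeric form. The sketch below is one way to coerce either shape to a float between 0.0 and 1.0. The helper names, the accepted label strings, and the 0.7 threshold are all assumptions for illustration, not SDK behavior.

```python
def to_score(output):
    """Coerce an eval output to a float in [0.0, 1.0]."""
    if isinstance(output, bool):
        return 1.0 if output else 0.0
    if isinstance(output, (int, float)):
        score = float(output)
    else:
        text = str(output).strip().lower()
        # Assumed label spellings; adjust to whatever your results actually contain.
        labels = {"pass": 1.0, "passed": 1.0, "fail": 0.0, "failed": 0.0}
        if text in labels:
            return labels[text]
        score = float(text)  # raises ValueError for unrecognized text
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def meets_threshold(output, threshold=0.7):
    """True if the coerced score reaches the (hypothetical) threshold."""
    return to_score(output) >= threshold


print(round(to_score(0.91) * 100))  # 91, i.e. 91/100
print(meets_threshold("Failed"))    # False
```

A threshold turns a continuous score back into a gate, which is useful when a Percentage eval feeds a CI check or an alert.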
- Repeat Step 1 (Create a custom eval from the dashboard), but set:
  - Name: code_review_quality
  - Output Type: Percentage (displayed in SDK as 0.0-1.0)
  - Rule Prompt:
You are evaluating a code review comment.
The code change: {{code_diff}}
The review comment: {{review_comment}}
Score using these weights:
- 40 points: Does it clearly explain what's wrong?
- 30 points: Does it suggest a concrete fix or improvement?
- 30 points: Is it constructive and respectful?
Return a normalized score from 0.0 to 1.0 (for example, 0.91 for 91/100).

Run it via SDK:
result = evaluator.evaluate(
    eval_templates="code_review_quality",
    inputs={
        "code_diff": "- return user.name\n+ return user.name.strip()",
        "review_comment": "Good catch: whitespace in names can cause login failures. Consider adding a test case for this.",
    },
)

eval_result = result.eval_results[0]
print(f"Score: {eval_result.output}")
print(f"Reason: {eval_result.reason}")

What you built
You can now create custom eval metrics in the FutureAGI dashboard and run them programmatically via the SDK.
- Created a support_quality custom eval in the dashboard with a plain-English Pass/Fail rubric
- Created a code_review_quality custom eval with a weighted scoring rubric (returned as 0.0-1.0)
- Ran both evals via Evaluator.evaluate() using their registered names
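To run a registered eval over several cases, the single evaluate() call can be wrapped in a loop. The sketch below assumes only the call and result shape used in this guide (eval_templates, inputs, eval_results[0].output and .reason). The run_cases helper is hypothetical, and fake_evaluate is an offline stand-in so the example runs without credentials; in real use you would pass evaluator.evaluate instead.

```python
from types import SimpleNamespace


def run_cases(evaluate, template, cases):
    """Run one registered eval over several input dicts; return (output, reason) pairs."""
    rows = []
    for inputs in cases:
        result = evaluate(eval_templates=template, inputs=inputs)
        first = result.eval_results[0]
        rows.append((first.output, first.reason))
    return rows


def fake_evaluate(eval_templates, inputs):
    """Offline stand-in (not the SDK): passes if the response mentions a replacement."""
    passed = "replacement" in inputs["agent_response"]
    item = SimpleNamespace(
        output="Passed" if passed else "Failed",
        reason="stub for " + eval_templates,
    )
    return SimpleNamespace(eval_results=[item])


cases = [
    {
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "I've filed a replacement request.",
    },
    {
        "user_query": "My order arrived damaged. What do I do?",
        "agent_response": "Please contact our returns department.",
    },
]

rows = run_cases(fake_evaluate, "support_quality", cases)
for output, reason in rows:
    print(output, "-", reason)
```

Passing the evaluate function as an argument keeps the loop testable offline and lets the same helper serve both support_quality and code_review_quality.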