Evaluate Function Calling
Evaluates the accuracy and effectiveness of function calls made by an LLM. It checks whether the output correctly identifies the need for a tool call and whether it includes the correct tool with the appropriate parameters extracted from the input.
```python
from fi.evals import Evaluator  # Future AGI Python SDK

evaluator = Evaluator()
result = evaluator.evaluate(
    eval_templates="evaluate_function_calling",
    inputs={
        "input": "Get the weather for London",
        "output": '{"function": "get_weather", "parameters": {"city": "London", "country": "UK"}}'
    },
    model_name="turing_flash"
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```

```typescript
import { Evaluator, Templates } from "@future-agi/ai-evaluation";

const evaluator = new Evaluator();
const result = await evaluator.evaluate(
    "evaluate_function_calling",
    {
        input: "Get the weather for London",
        output: '{"function": "get_weather", "parameters": {"city": "London", "country": "UK"}}'
    },
    {
        modelName: "turing_flash",
    }
);
console.log(result);
```

Input

| Required Input | Type | Description |
|---|---|---|
| input | string | Input provided to the LLM that triggers the function call. |
| output | string | LLM’s output containing the resulting function call or response. |
Output

| Field | Description |
|---|---|
| Result | Returns Passed if the LLM correctly identified that a function/tool call was necessary, or Failed if the LLM did not correctly handle the function call requirement. |
| Reason | Provides a detailed explanation of the function calling evaluation. |
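To build intuition for what this evaluation checks, the sketch below performs a similar pass/fail check locally: does the output parse as a function call, name the expected function, and supply the required parameters? The helper name and its arguments are illustrative assumptions, not part of the SDK:

```python
import json

def check_function_call(output, expected_fn, required_params):
    """Naive pass/fail check: does the raw output parse as a call to
    expected_fn with every parameter in required_params present?"""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON, so no function call was identified"
    if not isinstance(call, dict) or call.get("function") != expected_fn:
        return False, f"expected a call to {expected_fn!r}"
    missing = [p for p in required_params if p not in call.get("parameters", {})]
    if missing:
        return False, "missing parameters: " + ", ".join(missing)
    return True, "function and parameters identified correctly"

passed, reason = check_function_call(
    '{"function": "get_weather", "parameters": {"city": "London", "country": "UK"}}',
    "get_weather",
    ["city"],
)
print(passed, reason)
```

The hosted evaluation goes further than this structural check — it uses a model to judge whether a call was *needed* and whether the parameter values were correctly extracted from the input — but the pass/fail contract is the same.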
What to Do When Function Calling Evaluation Fails
Examine the output to identify whether the failure was due to missing function call identification or incorrect parameter extraction. If the output did not recognise the need for a function call, review the input to ensure that the function’s necessity was clearly communicated. If the parameters were incorrect or incomplete, check that the input actually contains the details needed to populate them. Refining the model’s output format or adjusting the function call handling process can help improve accuracy in future evaluations.
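One common cause of failure is a model wrapping an otherwise correct call in prose or markdown fences, so the output is not clean JSON. A pre-processing step along these lines (an illustrative sketch, not part of the SDK) can isolate the JSON object before evaluation:

```python
import json
import re

def extract_json_call(raw):
    """Pull the first JSON object out of a raw model response,
    tolerating surrounding prose or markdown code fences.
    Returns the JSON string, or None if no valid object is found."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    candidate = match.group(0)
    try:
        json.loads(candidate)  # validate before returning
    except json.JSONDecodeError:
        return None
    return candidate

raw = 'Sure! Here is the call:\n```json\n{"function": "get_weather", "parameters": {"city": "London"}}\n```'
print(extract_json_call(raw))
```

Normalising the output this way ensures the evaluation judges the function call itself rather than incidental formatting.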
Comparing Evaluate Function Calling with Similar Evals
- Task Completion: Evaluate Function Calling assesses whether the LLM correctly identifies and formats a function/tool call, while Task Completion measures whether the model fulfilled the user’s overall request accurately.
- Instruction Adherence: Evaluate Function Calling focuses on whether the correct function and parameters were identified, while Instruction Adherence evaluates whether the output follows the prompt instructions more broadly.