Evaluation Using Interface

Input:

  • input: The column containing the input provided to the LLM that triggers the function call.
  • output: The column containing the resulting function call or response generated by the LLM.

Output:

  • Result: Passed / Failed

Interpretation:

  • Passed: The LLM correctly identified that a function/tool call was necessary based on the input and accurately extracted the required parameters into the expected format.
  • Failed: The LLM did not correctly handle the function call requirement. This could mean it either failed to recognize the need for a function call altogether, or it recognized the need but extracted incorrect or incomplete parameters from the input.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.


Input:

  • input (string, required): Input text provided to the LLM that triggers the function call.
  • output (string, required): Output text which has the resulting function call or response generated by the LLM.

Output:

  • Result (bool): Returns 0 or 1.
      0: The LLM did not correctly handle the function call requirement.
      1: The LLM correctly identified that a function/tool call was necessary.

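from fi.testcases import TestCase
from fi.evals.templates import LLMFunctionCalling

# The user request and the function call the LLM produced for it.
test_case = TestCase(
    input="Get the weather for London",
    output='{"function": "get_weather", "parameters": {"city": "London", "country": "UK"}}'
)

template = LLMFunctionCalling()

# `evaluator` is not defined in this snippet; it is assumed to be the evaluation
# client created during SDK setup (see the setup guide referenced above).
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

# 1 if the function call was handled correctly, 0 otherwise.
score = response.eval_results[0].metrics[0].value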
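
The SDK reports the result as 0 or 1, while the interface reports Passed / Failed. A small helper can translate between the two. The function name `result_label` is hypothetical and not part of the SDK; this is only a sketch of the mapping described above.

```python
def result_label(score) -> str:
    """Map the SDK's 0/1 result to the interface's Passed / Failed wording.

    Hypothetical helper, not part of the SDK.
    """
    return "Passed" if int(score) == 1 else "Failed"
```

For example, `result_label(score)` could be called on the `score` value from the snippet above before logging it.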


What to Do When Function Calling Evaluation Fails

Examine the output to identify whether the failure was due to a missed function call or incorrect parameter extraction. If the output did not recognise the need for a function call, review the input to ensure that the need for the function was clearly communicated. If the parameters were incorrect or incomplete, check that the input states every required value unambiguously.

Refining the model’s output or adjusting the function call handling process can help improve accuracy in future evaluations.
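
The two failure modes described above can be told apart with a quick local check before changing the prompt or the model. The sketch below is not the evaluator's own logic. It assumes the output uses the same JSON shape as the SDK example (a `function` name and a `parameters` object), and the helper name `diagnose` and its category labels are hypothetical.

```python
import json


def diagnose(output, expected_function, required_params):
    """Classify a function-calling output into a rough failure category.

    Hypothetical helper: assumes the output is JSON shaped like
    {"function": <name>, "parameters": {...}}.
    Returns a (category, detail) tuple.
    """
    try:
        call = json.loads(output)
    except (TypeError, ValueError):
        # Plain-text replies are not valid JSON: no function call was made.
        return "missing_call", "output is not a structured function call"

    if not isinstance(call, dict) or "function" not in call:
        return "missing_call", "no 'function' field in output"

    if call["function"] != expected_function:
        return "wrong_function", f"got {call['function']!r}"

    params = call.get("parameters")
    if not isinstance(params, dict):
        return "bad_parameters", "no 'parameters' object in output"

    # Report required parameters that were not extracted from the input.
    missing = sorted(set(required_params) - set(params))
    if missing:
        return "bad_parameters", "missing: " + ", ".join(missing)

    return "ok", ""
```

A `missing_call` result points to reviewing how clearly the input signals the need for a function, while `bad_parameters` points to the parameter values stated in the input.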


Differentiating Function Calling Eval from API Call Eval

The API Call evaluation focuses on making network requests to external services and validating the responses, while Evaluate LLM Function Calling examines whether LLMs correctly identify and execute function calls.

API calls are used for external interactions like retrieving data or triggering actions, while function call evaluation ensures that LLMs correctly interpret and execute function calls based on input prompts.

They also differ in validation criteria: API calls are assessed on response content, status codes, and data integrity, whereas function call evaluation focuses on the accuracy of function call identification and parameter extraction.
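
The difference in validation criteria can be illustrated with two small checks. This is a sketch only: neither function reflects how the two evaluations are actually implemented, both names are hypothetical, and the function call is assumed to be JSON with `function` and `parameters` fields.

```python
import json


def check_api_response(status_code, body, required_fields):
    """API Call style check: looks at the status code and response content."""
    return 200 <= status_code < 300 and all(f in body for f in required_fields)


def check_function_call(output, expected_function, expected_params):
    """Function Calling style check: looks at the chosen function and its parameters."""
    try:
        call = json.loads(output)
    except (TypeError, ValueError):
        return False  # no structured function call was produced
    return (
        isinstance(call, dict)
        and call.get("function") == expected_function
        and call.get("parameters") == expected_params
    )
```

The first check never inspects what the LLM decided to call, and the second never contacts an external service, which is the distinction drawn above.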