Evaluating how well a language model utilizes provided context chunks is essential for ensuring accurate, relevant, and contextually appropriate responses. A model that effectively leverages context can generate more reliable and factually grounded outputs, reducing hallucinations and improving consistency.

Failure to properly utilize context can lead to:

  • Incomplete responses – Missing relevant details present in the provided context.
  • Context misalignment – The model generating responses that contradict or ignore the context.
  • Inefficient information usage – Not fully leveraging the available context to improve accuracy.

To assess this, two key evaluations are used:

  • Chunk Attribution – Checks whether the response references the provided context chunks at all.
  • Chunk Utilization – Scores how effectively the referenced context contributes to the response.

Together, these evaluations help ensure that AI-generated responses are grounded in the provided information, improving reliability and accuracy.


1. Chunk Attribution

Chunk Attribution evaluates whether a language model references the provided context chunks in its generated response, i.e., whether the model acknowledges and utilizes the context at all.

a. Using Interface

Required Inputs

  • input: The original query or instruction.
  • output: The generated response.
  • context: The provided contextual information.

Output

Returns a Pass/Fail result:

  • Pass – The response references the provided context.
  • Fail – The response does not utilize the context.

b. Using SDK

from fi.testcases import TestCase
from fi.evals.templates import ChunkAttribution

# Test case pairing the query and generated response with the context chunks to check against
test_case = TestCase(
    input="What is the capital of France?",
    output="According to the provided information, Paris is the capital city of France. It is a major European city and a global center for art, fashion, and culture.",
    context=[
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
        "Paris is known for its art museums and fashion districts."
    ]
)

template = ChunkAttribution()

# `evaluator` is assumed to be an initialized fi evals client created with your API credentials
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

print(f"Score: {response.eval_results[0].metrics[0].value}")
print(f"Reason: {response.eval_results[0].reason}")

2. Chunk Utilization

Chunk Utilization measures how effectively the model integrates the retrieved context chunks into its response. Unlike Chunk Attribution, which only checks whether the context is referenced, Chunk Utilization assigns a score based on how much of the context actually contributes to a meaningful response.

  • Higher Score – The model extensively incorporates relevant chunks.
  • Lower Score – The model includes minimal or no context.

a. Using Interface

Required Inputs

  • input: The prompt given to the model.
  • output: The model’s generated response.
  • context: The provided context chunks.

Config

  • Scoring Criteria – Defines how the depth of context integration is scored.

Output

  • Score between 0 and 1, where higher values indicate better utilization of context.

b. Using SDK

from fi.testcases import TestCase
from fi.evals.templates import ChunkUtilization

# Same test case as above: query, generated response, and the retrieved context chunks
test_case = TestCase(
    input="What is the capital of France?",
    output="According to the provided information, Paris is the capital city of France. It is a major European city and a global center for art, fashion, and culture.",
    context=[
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
        "Paris is known for its art museums and fashion districts."
    ]
)

template = ChunkUtilization()

# `evaluator` is assumed to be an initialized fi evals client created with your API credentials
response = evaluator.evaluate(eval_templates=[template], inputs=[test_case])

print(f"Score: {response.eval_results[0].metrics[0].value}")
print(f"Reason: {response.eval_results[0].reason}")