Evaluation Using Interface

Input:

  • Required Inputs:
    • reference: The reference text containing the information to be captured.
    • hypothesis: The text to be evaluated for recall against the reference.

Output:

  • Score: A numeric score between 0 and 1, where 1 represents perfect recall.
  • Reason: A detailed explanation of the recall evaluation.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • reference: string - The reference text containing the information to be captured.
    • hypothesis: string - The text to be evaluated for recall against the reference.

Output:

  • Score: Returns a float value between 0 and 1, where higher values indicate better recall.
  • Reason: Provides a detailed explanation of the recall evaluation.

# `evaluator` is assumed to be an evaluator instance configured as described
# in the Python SDK setup guide linked above.
result = evaluator.evaluate(
    eval_templates="recall_score", 
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)

print(result.eval_results[0].metrics[0].value)
print(result.eval_results[0].reason)

Example Output:

0.8571428571428571
The Recall Score is 0.8571428571428571 because the hypothesis captures most, but not all, of the key information from the reference.

- The hypothesis correctly includes information about the Eiffel Tower being in Paris, built in 1889, and being 324 meters tall/high.
- The hypothesis doesn't include that the Eiffel Tower is a "famous landmark" or that it was built "for the World's Fair".
- Recall measures how much of the reference information is present in the hypothesis, and 6 out of 7 key information points are covered, resulting in the score of approximately 0.857.

Overview

Recall Score measures how completely a hypothesis text captures the information present in a reference text. Unlike metrics that focus on exact wording, Recall Score evaluates whether the essential information is preserved, regardless of how it’s phrased.

A high recall score indicates that the hypothesis contains most or all of the information from the reference, while a low score suggests significant information has been omitted.
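To build intuition for what "information coverage" means, here is a deliberately naive token-level sketch. This is illustrative only and is not how Recall Score is computed under the hood: the actual metric judges units of information rather than surface tokens, so paraphrases can still score well. The function name `token_recall` is a hypothetical helper, not part of the SDK.

```python
import string

def token_recall(reference: str, hypothesis: str) -> float:
    """Fraction of unique reference tokens that also appear in the hypothesis."""
    def tokens(text: str) -> set:
        # Lowercase and strip punctuation so "1889." and "1889" match.
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        return set(cleaned.split())

    ref, hyp = tokens(reference), tokens(hypothesis)
    if not ref:
        return 1.0  # an empty reference is trivially covered
    return len(ref & hyp) / len(ref)

reference = "The Eiffel Tower was built in 1889."
hypothesis = "The Eiffel Tower, built in 1889, is in Paris."
print(token_recall(reference, hypothesis))  # → 0.8571428571428571 (6 of 7 tokens covered)
```

Only "was" from the reference is missing from the hypothesis, so 6 of 7 unique reference tokens are covered; the extra "Paris" in the hypothesis does not change the score, because recall only looks at how much of the *reference* is captured.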

What to Do If You Get Undesired Results

If the recall score is lower than expected:

  • Ensure that all key facts, entities, and relationships from the reference are included in the hypothesis
  • Check for missing details, numbers, dates, or proper nouns that might be important
  • Verify that important contextual information isn’t omitted
  • Consider that paraphrasing may preserve recall as long as the core information is included
  • For summaries, focus on including the most critical information from the reference
  • Be aware that recall doesn’t penalize for additional information in the hypothesis (that’s measured by precision)
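The last point, that recall ignores extra content while precision penalizes it, can be sketched at the token level. As before, this is an illustrative toy, not the SDK's metric; `token_recall` and `token_precision` are hypothetical helpers.

```python
import string

def _tokens(text: str) -> set:
    # Lowercase and strip punctuation before splitting into unique tokens.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def token_recall(reference: str, hypothesis: str) -> float:
    """How much of the reference appears in the hypothesis."""
    ref, hyp = _tokens(reference), _tokens(hypothesis)
    return len(ref & hyp) / len(ref) if ref else 1.0

def token_precision(reference: str, hypothesis: str) -> float:
    """How much of the hypothesis is supported by the reference."""
    ref, hyp = _tokens(reference), _tokens(hypothesis)
    return len(ref & hyp) / len(hyp) if hyp else 1.0

reference = "Paris hosted the fair in 1889."
padded = "Paris hosted the fair in 1889. The weather was lovely that year."

print(token_recall(reference, padded))     # → 1.0 (all reference tokens covered)
print(token_precision(reference, padded))  # drops below 1.0 due to the extra sentence
```

Padding the hypothesis with unrelated detail leaves recall at 1.0 but lowers precision, which is exactly the asymmetry the bullet above describes.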

Comparing Recall Score with Similar Evals

  • ROUGE Score: While Recall Score focuses on information coverage, ROUGE Score uses n-gram overlap to evaluate text similarity.
  • BLEU Score: Recall Score measures how much reference information is captured, while BLEU Score emphasizes precision by measuring how much of the hypothesis matches the reference.
  • Completeness: Recall Score measures information coverage from a reference text, whereas Completeness evaluates whether a response fully answers a given query.