# `evaluator` is an initialized evaluation client from the SDK
# (API-key setup omitted here)
result = evaluator.evaluate(
    eval_templates="recall_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)  # the recall score
print(result.eval_results[0].reason)  # the explanation behind the score
Input

| Required Input | Type   | Description                                                   |
|----------------|--------|---------------------------------------------------------------|
| reference      | string | The reference containing the information to be captured.      |
| hypothesis     | string | The content to be evaluated for recall against the reference. |
Output

| Field  | Description                                                                                                             |
|--------|-------------------------------------------------------------------------------------------------------------------------|
| Result | Returns a score representing the recall of the hypothesis against the reference, where higher values indicate better recall. |
| Reason | Provides a detailed explanation of the recall evaluation.                                                                 |

Overview

Recall Score measures how completely a hypothesis text captures the information present in a reference text. Unlike metrics that focus on exact wording, Recall Score evaluates whether the essential information is preserved, regardless of how it’s phrased. A high recall score indicates that the hypothesis contains most or all of the information from the reference, while a low score suggests significant information has been omitted.
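
Because the score tracks information coverage rather than wording, it can help to think of the reference as a set of atomic facts and ask how many of them the hypothesis expresses in any phrasing. Below is a minimal, hand-labeled sketch for intuition only; this is an assumption for illustration, not how the template works internally (the template judges coverage with a model, not a manual fact list):

# Illustrative only: hand-labeled fact-level recall.
reference_facts = [
    "famous landmark in Paris",
    "built in 1889",
    "built for the World's Fair",
    "324 meters tall",
]
# Whether the hypothesis expresses each fact, in any wording:
covered = [
    True,   # "located in Paris"
    True,   # "was built in 1889"
    False,  # the World's Fair is never mentioned
    True,   # "is 324 meters high"
]
recall = sum(covered) / len(reference_facts)
print(f"recall = {recall:.2f}")  # 0.75: one reference fact was omitted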

What to Do If You Get Undesired Results

If the recall score is lower than expected (a worked example follows this list):
  • Ensure that all key facts, entities, and relationships from the reference are included in the hypothesis
  • Check for missing details, numbers, dates, or proper nouns that might be important
  • Verify that important contextual information isn’t omitted
  • Consider that paraphrasing may preserve recall as long as the core information is included
  • For summaries, focus on including the most critical information from the reference
  • Be aware that recall doesn’t penalize for additional information in the hypothesis (that’s measured by precision)
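
For instance, a hypothesis that drops the year, the World's Fair, and the height from the example reference should score noticeably lower, with the reason pointing at the omissions. A minimal sketch, assuming the same evaluator setup as the example at the top of this page (the inputs are illustrative):

# Same assumed evaluator client as in the example above.
result = evaluator.evaluate(
    eval_templates="recall_score",
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        # Omits the year, the World's Fair, and the height.
        "hypothesis": "The Eiffel Tower is a famous landmark in Paris."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)  # expect a lower recall score
print(result.eval_results[0].reason)  # expect the omissions to be cited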

Comparing Recall Score with Similar Evals

  • ROUGE Score: While Recall Score focuses on information coverage, ROUGE Score uses n-gram overlap to evaluate text similarity.
  • BLEU Score: Recall Score measures how much reference information is captured, while BLEU Score emphasizes precision by measuring how much of the hypothesis matches the reference; the sketch after this list illustrates this directional difference.
  • Completeness: Recall Score measures information coverage from a reference text, whereas Completeness evaluates whether a response fully answers a given query.
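
To make the recall/precision contrast concrete, here is an illustrative word-overlap sketch. Real ROUGE and BLEU operate on n-grams and include extra machinery (such as BLEU's brevity penalty), so treat this only as a demonstration of the directional difference:

def overlap_recall(reference: str, hypothesis: str) -> float:
    # Recall-oriented: share of reference words that the hypothesis covers.
    ref, hyp = set(reference.lower().split()), set(hypothesis.lower().split())
    return len(ref & hyp) / len(ref) if ref else 0.0

def overlap_precision(reference: str, hypothesis: str) -> float:
    # Precision-oriented: share of hypothesis words found in the reference.
    ref, hyp = set(reference.lower().split()), set(hypothesis.lower().split())
    return len(ref & hyp) / len(hyp) if hyp else 0.0

reference = "built in 1889 and stands 324 meters tall"
hypothesis = "built in 1889"  # correct but incomplete

print(overlap_recall(reference, hypothesis))     # 0.375: most reference info is missing
print(overlap_precision(reference, hypothesis))  # 1.0: everything stated is supported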