Evaluation Using Interface

Input:

  • Required Inputs:
    • reference: The reference text containing the information to be captured.
    • hypothesis: The text to be evaluated for recall against the reference.

Output:

  • Score: A numeric score between 0 and 1, where 1 represents perfect recall.
  • Reason: A detailed explanation of the recall evaluation.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • reference: string - The reference text containing the information to be captured.
    • hypothesis: string - The text to be evaluated for recall against the reference.

Output:

  • Score: Returns a float value between 0 and 1, where higher values indicate better recall.
  • Reason: Provides a detailed explanation of the recall evaluation.

# `evaluator` is assumed to be an evaluator instance configured as described
# in the Python SDK setup guide linked above.
result = evaluator.evaluate(
    eval_templates="recall_score", 
    inputs={
        "reference": "The Eiffel Tower is a famous landmark in Paris, built in 1889 for the World's Fair. It stands 324 meters tall.",
        "hypothesis": "The Eiffel Tower, located in Paris, was built in 1889 and is 324 meters high."
    }
)

print(result.eval_results[0].metrics[0].value)
print(result.eval_results[0].reason)

Example Output:

0.8571428571428571
The Recall Score is 0.8571428571428571 because the hypothesis captures most, but not all, of the key information from the reference.

- The hypothesis correctly includes information about the Eiffel Tower being in Paris, built in 1889, and being 324 meters tall/high.
- The hypothesis doesn't include that the Eiffel Tower is a "famous landmark" or that it was built "for the World's Fair".
- Recall measures how much of the reference information is present in the hypothesis, and 6 out of 7 key information points are covered, resulting in the score of approximately 0.857.

Overview

Recall Score measures how completely a hypothesis text captures the information present in a reference text. Unlike metrics that focus on exact wording, Recall Score evaluates whether the essential information is preserved, regardless of how it’s phrased.

A high recall score indicates that the hypothesis contains most or all of the information from the reference, while a low score suggests significant information has been omitted.
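To build intuition for what "information coverage" means, here is a deliberately naive token-level sketch. This is illustrative only and is not how Recall Score is computed under the hood: the actual metric judges units of information rather than surface tokens, so paraphrases can still score well. The function name `token_recall` is a hypothetical helper, not part of the SDK.

```python
import string

def token_recall(reference: str, hypothesis: str) -> float:
    """Fraction of unique reference tokens that also appear in the hypothesis."""
    def tokens(text: str) -> set:
        # Lowercase and strip punctuation so "1889." and "1889" match.
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        return set(cleaned.split())

    ref, hyp = tokens(reference), tokens(hypothesis)
    if not ref:
        return 1.0  # an empty reference is trivially covered
    return len(ref & hyp) / len(ref)

reference = "The Eiffel Tower was built in 1889."
hypothesis = "The Eiffel Tower, built in 1889, is in Paris."
print(token_recall(reference, hypothesis))  # → 0.8571428571428571 (6 of 7 tokens covered)
```

Only "was" from the reference is missing from the hypothesis, so 6 of 7 unique reference tokens are covered; the extra "Paris" in the hypothesis does not change the score, because recall only looks at how much of the *reference* is captured.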

What to Do If You Get Undesired Results

If the recall score is lower than expected:

  • Ensure that all key facts, entities, and relationships from the reference are included in the hypothesis
  • Check for missing details, numbers, dates, or proper nouns that might be important
  • Verify that important contextual information isn’t omitted
  • Consider that paraphrasing may preserve recall as long as the core information is included
  • For summaries, focus on including the most critical information from the reference
  • Be aware that recall doesn’t penalize for additional information in the hypothesis (that’s measured by precision)
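The last point, that recall ignores extra content while precision penalizes it, can be sketched at the token level. As before, this is an illustrative toy, not the SDK's metric; `token_recall` and `token_precision` are hypothetical helpers.

```python
import string

def _tokens(text: str) -> set:
    # Lowercase and strip punctuation before splitting into unique tokens.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def token_recall(reference: str, hypothesis: str) -> float:
    """How much of the reference appears in the hypothesis."""
    ref, hyp = _tokens(reference), _tokens(hypothesis)
    return len(ref & hyp) / len(ref) if ref else 1.0

def token_precision(reference: str, hypothesis: str) -> float:
    """How much of the hypothesis is supported by the reference."""
    ref, hyp = _tokens(reference), _tokens(hypothesis)
    return len(ref & hyp) / len(hyp) if hyp else 1.0

reference = "Paris hosted the fair in 1889."
padded = "Paris hosted the fair in 1889. The weather was lovely that year."

print(token_recall(reference, padded))     # → 1.0 (all reference tokens covered)
print(token_precision(reference, padded))  # drops below 1.0 due to the extra sentence
```

Padding the hypothesis with unrelated detail leaves recall at 1.0 but lowers precision, which is exactly the asymmetry the bullet above describes.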

Comparing Recall Score with Similar Evals

  • ROUGE Score: While Recall Score focuses on information coverage, ROUGE Score uses n-gram overlap to evaluate text similarity.
  • BLEU Score: Recall Score measures how much reference information is captured, while BLEU Score emphasizes precision by measuring how much of the hypothesis matches the reference.
  • Completeness: Recall Score measures information coverage from a reference text, whereas Completeness evaluates whether a response fully answers a given query.