Eval Definition
Recall Score
Measures how much of the information in the reference text is captured in the hypothesis text.
Evaluation Using Interface
Input:
- Required Inputs:
- reference: The reference text containing the information to be captured.
- hypothesis: The text to be evaluated for recall against the reference.
Output:
- Score: A numeric score between 0 and 1, where 1 represents perfect recall.
- Reason: A detailed explanation of the recall evaluation.
Evaluation Using Python SDK
Click here to learn how to set up evaluation using the Python SDK.
Input:
- Required Inputs:
- reference: string - The reference text containing the information to be captured.
- hypothesis: string - The text to be evaluated for recall against the reference.
Output:
- Score: Returns a float value between 0 and 1, where higher values indicate better recall.
- Reason: Provides a detailed explanation of the recall evaluation.
Example Output:
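The platform's exact return payload is not reproduced here. As a minimal, self-contained sketch of a recall-style score in the documented Score/Reason shape, here is a token-overlap illustration (the function name `recall_score` and the token-overlap logic are assumptions for illustration, not the platform's implementation, which may use an LLM judge):

```python
def recall_score(reference: str, hypothesis: str) -> dict:
    """Toy recall: fraction of reference tokens that appear in the hypothesis."""
    ref_tokens = set(reference.lower().split())
    hyp_tokens = set(hypothesis.lower().split())
    if not ref_tokens:
        return {"score": 0.0, "reason": "Empty reference text."}
    captured = ref_tokens & hyp_tokens
    missing = ref_tokens - hyp_tokens
    score = len(captured) / len(ref_tokens)
    reason = (
        f"{len(captured)} of {len(ref_tokens)} reference tokens captured; "
        f"missing: {sorted(missing) if missing else 'none'}"
    )
    return {"score": score, "reason": reason}

result = recall_score(
    reference="The meeting is on 5 March in Berlin",
    hypothesis="The meeting takes place in Berlin on 5 March",
)
print(result)
```

Note that the paraphrased hypothesis still scores highly because nearly all reference tokens are present; only the word "is" is missing from the hypothesis.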
Overview
Recall Score measures how completely a hypothesis text captures the information present in a reference text. Unlike metrics that focus on exact wording, Recall Score evaluates whether the essential information is preserved, regardless of how it’s phrased.
A high recall score indicates that the hypothesis contains most or all of the information from the reference, while a low score suggests significant information has been omitted.
What to Do If You Get Undesired Results
If the recall score is lower than expected:
- Ensure that all key facts, entities, and relationships from the reference are included in the hypothesis
- Check for missing details, numbers, dates, or proper nouns that might be important
- Verify that important contextual information isn’t omitted
- Consider that paraphrasing may preserve recall as long as the core information is included
- For summaries, focus on including the most critical information from the reference
- Be aware that recall doesn’t penalize for additional information in the hypothesis (that’s measured by precision)
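To see why extra content in the hypothesis affects precision but not recall, consider this toy token-overlap comparison (the overlap logic is an illustration under simplified assumptions, not the platform's scoring):

```python
def overlap_scores(reference: str, hypothesis: str) -> tuple[float, float]:
    """Return (recall, precision) from a simple token-set overlap."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    shared = ref & hyp
    recall = len(shared) / len(ref) if ref else 0.0
    precision = len(shared) / len(hyp) if hyp else 0.0
    return recall, precision

reference = "the product launches in june"
concise = "the product launches in june"
padded = "the product launches in june and the team is very excited"

r1, p1 = overlap_scores(reference, concise)   # full coverage, no extras
r2, p2 = overlap_scores(reference, padded)    # full coverage, extra tokens
# The padded hypothesis keeps recall at its maximum but drops precision,
# since the added tokens are not in the reference.
```

This is why a hypothesis that includes everything from the reference, plus additional commentary, can still earn a perfect recall score.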
Comparing Recall Score with Similar Evals
- ROUGE Score: While Recall Score focuses on information coverage, ROUGE Score uses n-gram overlap to evaluate text similarity.
- BLEU Score: Recall Score measures how much reference information is captured, while BLEU Score emphasizes precision by measuring how much of the hypothesis matches the reference.
- Completeness: Recall Score measures information coverage from a reference text, whereas Completeness evaluates whether a response fully answers a given query.