import json
from fi.evals import Evaluator

evaluator = Evaluator()

result = evaluator.evaluate(
    eval_templates="recall_at_k",
    inputs={
        "hypothesis": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "France is in Europe.",
            "The Louvre is in Paris.",
            "Napoleon was born in Corsica."
        ]),
        "reference": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ])
    },
    eval_config={"k": 5}
)

print(result.eval_results[0].output)   # 1.0
print(result.eval_results[0].reason)

In this example, 5 chunks are retrieved and 3 are in the ground truth. With K set to 5 (the full list), all 3 relevant chunks appear in the retrieved results, giving a recall of 3/3 = 1.0. Try setting eval_config={"k": 3} to see how recall drops when only the top 3 chunks are considered.
Input

Required Input | Type   | Description
hypothesis     | string | JSON-serialized list of retrieved chunks in ranked order
reference      | string | JSON-serialized list of ground-truth relevant chunks
Output

Field  | Description
Result | Returns a score between 0 and 1, where 1 means all relevant chunks were found in the top K results
Reason | Short summary string of the score, e.g. Recall@3: 0.5
Parameter

Name                              | Type                        | Description
eval_config (evalConfig in JS/TS) | dict / Record<string, any>  | Optional. Pass {"k": N} to limit evaluation to the top N retrieved chunks. Defaults to using the full list.

Batch evaluation

To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation:
Python
results = evaluator.evaluate(
    eval_templates="recall_at_k",
    inputs={
        "hypothesis": [
            json.dumps(["Paris is the capital of France.", "France is in Europe.", "Napoleon was born in Corsica."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]),
        ],
        "reference": [
            json.dumps(["Paris is the capital of France.", "The Eiffel Tower was built in 1889."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["The Louvre is in Paris."]),
        ],
    },
    eval_config={"k": 3},
)

for i, r in enumerate(results.eval_results):
    print(f"Query {i+1}: {r.output}")
# Query 1: 0.5   (1 of 2 relevant found in top 3)
# Query 2: 1.0   (2 of 2 relevant found)
# Query 3: 0.0   (relevant chunk at position 4, outside top 3)

How it works

Recall@K answers the question: “Of all the chunks that should have been retrieved, how many actually appear in the top K results?” Formula:
Recall@K = (number of relevant items in top K) / (total number of relevant items)
Matching is based on exact string equality between retrieved chunks and ground-truth chunks. A recall of 1.0 means the retriever found every relevant chunk; a recall of 0.5 means half of the relevant chunks are missing. By default (without eval_config), the evaluator scores the full retrieved list; pass eval_config={"k": N} to restrict scoring to the top N retrieved chunks. For example, eval_config={"k": 3} checks whether relevant chunks appear in the first 3 results.
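
The formula above can be sketched in a few lines of plain Python. This is a simplified local illustration using exact string matching, not the library's internal implementation:

```python
def recall_at_k(hypothesis, reference, k=None):
    """Fraction of ground-truth chunks found in the top-K retrieved chunks."""
    top_k = hypothesis if k is None else hypothesis[:k]
    found = sum(1 for chunk in reference if chunk in top_k)
    return found / len(reference) if reference else 0.0

retrieved = [
    "Paris is the capital of France.",
    "France is in Europe.",
    "Napoleon was born in Corsica.",
]
relevant = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
]

print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (1 of 2 relevant found)
```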

What to do when Recall@K is Low

If recall is low, the retriever is missing relevant context:
  • Increase the number of chunks retrieved (higher K) to capture more relevant results
  • Improve the embedding model or chunking strategy so relevant content ranks higher
  • Check if ground-truth chunks are being split across multiple smaller chunks, causing partial matches
  • Ensure the query is being embedded with the same model used for document embeddings
  • Consider hybrid retrieval (combining dense and sparse methods) to catch different types of relevance
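
The chunk-splitting pitfall in the third point can be seen concretely. In this hedged sketch, a ground-truth chunk that was split in two during indexing no longer matches under exact string equality, dragging recall to zero even though the content was retrieved:

```python
# Ground truth stored as one chunk:
ground_truth = ["The Eiffel Tower was built in 1889 for the World's Fair."]

# After aggressive chunking, the retriever returns the same content as two halves:
retrieved = ["The Eiffel Tower was built in 1889", "for the World's Fair."]

# Exact string matching finds no hit, so recall is 0.0 despite the content being present:
hits = [chunk for chunk in ground_truth if chunk in retrieved]
print(len(hits) / len(ground_truth))  # 0.0
```

Keeping chunk boundaries consistent between the ground-truth set and the index avoids this mismatch.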

Differentiating Recall@K from Similar Evals

  • Precision@K: Recall@K measures how many relevant chunks were found, while Precision@K measures how many retrieved chunks are actually relevant. High recall with low precision means the retriever finds everything but also returns noise.
  • NDCG@K: NDCG@K goes beyond recall by also considering ranking order, giving more credit when relevant chunks appear earlier in results.
  • Hit Rate: Hit Rate only checks if at least one relevant chunk was retrieved, while Recall@K measures the fraction of all relevant chunks found.
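
The distinctions above can be shown side by side with a toy comparison. These are simplified local computations for illustration, not the library's own implementations:

```python
def recall_at_k(hyp, ref, k):
    """Fraction of relevant chunks found in the top K."""
    window = hyp[:k]
    return sum(1 for c in ref if c in window) / len(ref)

def precision_at_k(hyp, ref, k):
    """Fraction of the top K that is actually relevant."""
    window = hyp[:k]
    return sum(1 for c in window if c in ref) / len(window)

def hit_rate(hyp, ref, k):
    """1.0 if at least one relevant chunk appears in the top K."""
    return 1.0 if any(c in hyp[:k] for c in ref) else 0.0

hyp = ["relevant A", "noise 1", "noise 2", "relevant B"]
ref = ["relevant A", "relevant B"]

print(recall_at_k(hyp, ref, k=3))     # 0.5  (found A, missed B)
print(precision_at_k(hyp, ref, k=3))  # ~0.33 (1 of 3 retrieved is relevant)
print(hit_rate(hyp, ref, k=3))        # 1.0  (at least one relevant chunk found)
```

Note how the same retrieval scores differently under each metric: recall penalizes the missed chunk, precision penalizes the noise, and hit rate is satisfied by a single match.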