import json
from fi.evals import Evaluator

evaluator = Evaluator()

result = evaluator.evaluate(
    eval_templates="mrr",
    inputs={
        "hypothesis": json.dumps([
            "France is in Europe.",
            "Napoleon was born in Corsica.",
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ]),
        "reference": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ])
    }
)

print(result.eval_results[0].output)   # 0.333
print(result.eval_results[0].reason)
In this example, the first relevant chunk (“Paris is the capital of France.”) appears at position 3, so the reciprocal rank is 1/3 = 0.333.
Input
Required Input  Type    Description
hypothesis      string  JSON-serialized list of retrieved chunks in ranked order
reference       string  JSON-serialized list of ground-truth relevant chunks
Output
Field   Description
Result  Returns a score between 0 and 1, where 1 means the first relevant chunk is at position 1
Reason  Short summary string of the score, e.g. MRR: 0.333
MRR does not take a k parameter. It scans the entire retrieved list to find the first relevant item.

Batch evaluation

To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation:
results = evaluator.evaluate(
    eval_templates="mrr",
    inputs={
        "hypothesis": [
            json.dumps(["Paris is the capital of France.", "France is in Europe.", "Napoleon was born in Corsica."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]),
        ],
        "reference": [
            json.dumps(["Paris is the capital of France.", "The Eiffel Tower was built in 1889."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["The Louvre is in Paris."]),
        ],
    },
)

for i, r in enumerate(results.eval_results):
    print(f"Query {i+1}: {r.output}")
# Query 1: 1.0    (first relevant at position 1)
# Query 2: 1.0    (first relevant at position 1)
# Query 3: 0.25   (first relevant at position 4)
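
The per-query scores above can then be averaged to obtain the mean reciprocal rank for the whole batch:

```python
# Reciprocal ranks from the three queries above.
per_query_scores = [1.0, 1.0, 0.25]

# The "mean" in MRR: average the reciprocal ranks across queries.
mrr = sum(per_query_scores) / len(per_query_scores)
print(mrr)  # 0.75
```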

How it works

MRR (Mean Reciprocal Rank) measures how quickly the retriever surfaces the first relevant result. For a single query, the score is the reciprocal of the rank position at which the first relevant chunk appears (the "mean" refers to averaging these scores across queries, as in the batch example above). Formula:
MRR = 1 / (position of the first relevant item)
If the first relevant chunk is at position 1, the score is 1.0. At position 2, it is 0.5. At position 3, it is 0.333. If no relevant chunk is found, the score is 0.0. MRR is particularly useful for question-answering RAG systems, where the first relevant chunk often contains the answer; it directly measures the user experience of finding information quickly. Matching is based on exact string equality between retrieved chunks and ground-truth chunks.
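
The formula above can be sketched as a small function (a plain-Python illustration of the described behavior, not the library's internal implementation), using exact string equality as stated:

```python
def reciprocal_rank(hypothesis: list[str], reference: list[str]) -> float:
    """Return 1 / (1-based position of the first retrieved chunk that
    exactly matches a ground-truth chunk), or 0.0 if none match."""
    relevant = set(reference)
    for position, chunk in enumerate(hypothesis, start=1):
        if chunk in relevant:  # exact string equality
            return 1.0 / position
    return 0.0

# Reproduces the first example: the first relevant chunk is at position 3.
score = reciprocal_rank(
    ["France is in Europe.",
     "Napoleon was born in Corsica.",
     "Paris is the capital of France.",
     "The Eiffel Tower was built in 1889.",
     "The Louvre is in Paris."],
    ["Paris is the capital of France.",
     "The Eiffel Tower was built in 1889.",
     "The Louvre is in Paris."],
)
print(round(score, 3))  # 0.333
```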

What to do when MRR is Low

If MRR is low, the first relevant chunk is appearing too far down in the ranked results:
  • Apply a re-ranking step to push the most relevant chunk to the top position
  • Check if irrelevant but semantically similar chunks are outranking the correct answer
  • Ensure query formatting matches the style of your indexed documents
  • For short queries, consider query expansion to add context that helps the retriever identify the best match
  • Verify that the first relevant chunk in your ground truth is actually the most directly relevant one
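
As an illustration of the first suggestion, a re-ranking step amounts to re-sorting the retrieved chunks by a second-stage relevance score. The `overlap_score` function here is a deliberately simple stand-in for a real re-ranker (e.g. a cross-encoder):

```python
def rerank(query: str, chunks: list[str], score_fn) -> list[str]:
    """Sort retrieved chunks by a second-stage relevance score, highest first."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)

# Toy scorer: count how many query words appear in the chunk.
def overlap_score(query: str, chunk: str) -> int:
    return sum(word in chunk.lower() for word in query.lower().split())

chunks = ["France is in Europe.", "Paris is the capital of France."]
print(rerank("capital of France", chunks, overlap_score))
# ['Paris is the capital of France.', 'France is in Europe.']
```

After re-ranking, the relevant chunk sits at position 1, so the reciprocal rank rises from 0.5 to 1.0.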

Differentiating MRR from Similar Evals

  • NDCG@K: NDCG@K evaluates the ranking quality across all relevant chunks, while MRR only looks at where the first relevant chunk appears.
  • Hit Rate: Hit Rate is binary (was any relevant chunk retrieved?), while MRR measures exactly how high the first relevant chunk ranks.
  • Recall@K: Recall@K measures how many relevant chunks were found regardless of position, while MRR focuses solely on the position of the first relevant chunk.
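
The differences can be seen by computing all three metrics on the same retrieved list. These are plain-Python sketches of the standard definitions, not the library's API:

```python
def reciprocal_rank(hyp, ref):
    # MRR's per-query score: position of the first relevant chunk.
    for pos, chunk in enumerate(hyp, start=1):
        if chunk in ref:
            return 1.0 / pos
    return 0.0

def hit_rate(hyp, ref):
    # Binary: was any relevant chunk retrieved at all?
    return 1.0 if any(c in ref for c in hyp) else 0.0

def recall_at_k(hyp, ref, k):
    # Fraction of relevant chunks found in the top k, regardless of position.
    return len(set(hyp[:k]) & set(ref)) / len(ref)

hyp = ["Unrelated 1.", "The Louvre is in Paris.", "The Eiffel Tower was built in 1889."]
ref = ["The Louvre is in Paris.", "The Eiffel Tower was built in 1889."]

print(reciprocal_rank(hyp, ref))   # 0.5 -- first relevant chunk at position 2
print(hit_rate(hyp, ref))          # 1.0 -- at least one relevant chunk retrieved
print(recall_at_k(hyp, ref, 3))    # 1.0 -- both relevant chunks in the top 3
```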