## Input

| Required Input | Type | Description |
|---|---|---|
| `hypothesis` | string | JSON-serialized list of retrieved chunks in ranked order |
| `reference` | string | JSON-serialized list of ground-truth relevant chunks |
## Output

| Field | Description |
|---|---|
| Result | A score between 0 and 1, where 1 means the first relevant chunk is at position 1 |
| Reason | A short summary string of the score, e.g. `MRR: 0.333` |
MRR does not take a `k` parameter. It scans the entire retrieved list to find the first relevant item.

## Batch evaluation

To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation.
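A minimal sketch of constructing such a batch. The `reciprocal_rank` helper below is illustrative, not part of the library's API; it shows how each JSON-serialized pair is scored:

```python
import json

def reciprocal_rank(hypothesis: str, reference: str) -> float:
    """Score one evaluation: 1/rank of the first relevant chunk, 0 if none found."""
    retrieved = json.loads(hypothesis)
    relevant = set(json.loads(reference))
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0

# Each element of the batch is one retrieval evaluation.
batch = [
    {"hypothesis": json.dumps(["doc_a", "doc_b", "doc_c"]), "reference": json.dumps(["doc_a"])},
    {"hypothesis": json.dumps(["doc_x", "doc_y", "doc_z"]), "reference": json.dumps(["doc_z"])},
]

scores = [reciprocal_rank(item["hypothesis"], item["reference"]) for item in batch]
mrr = sum(scores) / len(scores)
print(f"MRR: {mrr:.3f}")  # per-query scores 1.0 and 1/3 -> mean 0.667
```

The per-query scores are averaged to produce the batch-level MRR, matching the formula in "How it works" below.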
## How it works

MRR (Mean Reciprocal Rank) measures how quickly the retriever surfaces the first relevant result. The score for a single query is the reciprocal of the rank position at which the first relevant chunk appears (and 0 if no relevant chunk is retrieved). Over a batch of queries Q, the per-query scores are averaged:

MRR = (1 / |Q|) * Σ (1 / rank_i)

where rank_i is the position of the first relevant chunk for query i.

## What to do when MRR is Low
If MRR is low, the first relevant chunk is appearing too far down in the results:

- Apply a re-ranking step to push the most relevant chunk to the top position
- Check if irrelevant but semantically similar chunks are outranking the correct answer
- Ensure query formatting matches the style of your indexed documents
- For short queries, consider query expansion to add context that helps the retriever identify the best match
- Verify that the first relevant chunk in your ground truth is actually the most directly relevant one
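The first suggestion above, re-ranking, can be sketched with a toy lexical scorer. A real system would use a cross-encoder or similar model; the `overlap_score` function here is purely illustrative:

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query terms present in the chunk."""
    terms = query.lower().split()
    chunk_terms = set(chunk.lower().split())
    return sum(t in chunk_terms for t in terms) / len(terms)

def rerank(query: str, chunks: list[str]) -> list[str]:
    """Re-order retrieved chunks by descending relevance score (stable for ties)."""
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)

query = "reset a forgotten password"
retrieved = [
    "Billing FAQ and invoices",
    "Account settings overview",
    "How to reset a forgotten password",  # relevant chunk, initially ranked 3rd
]

reranked = rerank(query, retrieved)
print(reranked[0])  # the relevant chunk moves to rank 1, raising 1/rank from 1/3 to 1
```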
## Differentiating MRR from Similar Evals
- NDCG@K: NDCG@K evaluates the ranking quality across all relevant chunks, while MRR only looks at where the first relevant chunk appears.
- Hit Rate: Hit Rate is binary (was any relevant chunk retrieved?), while MRR measures exactly how high the first relevant chunk ranks.
- Recall@K: Recall@K measures how many relevant chunks were found regardless of position, while MRR focuses solely on the position of the first relevant chunk.
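The contrast between these three metrics can be made concrete on a single example (the inline computations are illustrative, not library code):

```python
retrieved = ["doc_b", "doc_c", "doc_a", "doc_d"]  # ranked hypothesis list
relevant = {"doc_a", "doc_c"}                     # ground-truth relevant chunks
k = 3

# MRR: looks only at the FIRST relevant chunk (doc_c at rank 2).
first_rank = next(i for i, c in enumerate(retrieved, start=1) if c in relevant)
mrr = 1.0 / first_rank

# Hit Rate: binary - was ANY relevant chunk retrieved?
hit = any(c in relevant for c in retrieved)

# Recall@K: fraction of relevant chunks found in the top k, position ignored.
recall_at_k = len(set(retrieved[:k]) & relevant) / len(relevant)

print(mrr, hit, recall_at_k)  # 0.5 True 1.0
```

Here Hit Rate and Recall@3 are both perfect, yet MRR is only 0.5 because the first relevant chunk sits at rank 2 rather than rank 1.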