import json
from fi.evals import Evaluator
evaluator = Evaluator()
result = evaluator.evaluate(
    eval_templates="precision_at_k",
    inputs={
        "hypothesis": json.dumps([
            "Paris is the capital of France.",
            "France is in Europe.",
            "The Eiffel Tower was built in 1889.",
            "Napoleon was born in Corsica.",
            "The Louvre is in Paris."
        ]),
        "reference": json.dumps([
            "Paris is the capital of France.",
            "The Eiffel Tower was built in 1889.",
            "The Louvre is in Paris."
        ])
    },
    eval_config={"k": 5}
)
print(result.eval_results[0].output) # 0.6
print(result.eval_results[0].reason)
In this example, 5 chunks are retrieved. Of those 5, 3 are in the ground truth (“Paris is the capital…”, “The Eiffel Tower…”, and “The Louvre is in Paris.”), giving a precision of 3/5 = 0.6.
Input

| Required Input | Type | Description |
|---|---|---|
| hypothesis | string | JSON-serialized list of retrieved chunks in ranked order |
| reference | string | JSON-serialized list of ground-truth relevant chunks |
Output

| Field | Description |
|---|---|
| Result | Returns a score between 0 and 1, where 1 means every chunk in the top K is relevant |
| Reason | Short summary string of the score, e.g. Precision@3: 0.333 |
Parameter

| Name | Type | Description |
|---|---|---|
| eval_config (evalConfig in JS/TS) | dict / Record<string, any> | Optional. Pass {"k": N} to limit evaluation to the top N retrieved chunks. Defaults to using the full list. |
Batch evaluation
To evaluate multiple queries in a single call, pass a list of JSON-serialized inputs. Each element represents one retrieval evaluation:
results = evaluator.evaluate(
    eval_templates="precision_at_k",
    inputs={
        "hypothesis": [
            json.dumps(["Paris is the capital of France.", "France is in Europe.", "Napoleon was born in Corsica."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["Unrelated 1.", "Unrelated 2.", "Unrelated 3.", "The Louvre is in Paris."]),
        ],
        "reference": [
            json.dumps(["Paris is the capital of France.", "The Eiffel Tower was built in 1889."]),
            json.dumps(["The sky is blue.", "Water is wet."]),
            json.dumps(["The Louvre is in Paris."]),
        ],
    },
    eval_config={"k": 3},
)

for i, r in enumerate(results.eval_results):
    print(f"Query {i+1}: {r.output}")
# Query 1: 0.333 (1 relevant in top 3 / K=3)
# Query 2: 0.667 (2 relevant retrieved / K=3; only 2 chunks were returned)
# Query 3: 0.0   (0 relevant in top 3 / K=3)
How it works
Precision@K answers the question: “Of the top K chunks the retriever returned, how many are actually relevant?”
Formula:
Precision@K = (number of relevant items in top K) / K
The denominator is always K, even if fewer than K items were retrieved. Matching is based on exact string equality between retrieved chunks and ground-truth chunks.
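The formula and matching rule above can be sketched in plain Python. This is an illustrative reimplementation, not the library's internal code; it assumes exact string equality and a fixed-K denominator, as described above.

```python
def precision_at_k(retrieved, relevant, k):
    """Precision@K = (relevant items in top K) / K, with exact string matching."""
    relevant_set = set(relevant)
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant_set)
    return hits / k  # denominator is always K, even if fewer items were retrieved

retrieved = [
    "Paris is the capital of France.",
    "France is in Europe.",
    "The Eiffel Tower was built in 1889.",
    "Napoleon was born in Corsica.",
    "The Louvre is in Paris.",
]
relevant = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889.",
    "The Louvre is in Paris.",
]

print(precision_at_k(retrieved, relevant, k=5))  # 0.6
```

With k=3 the same inputs score 2/3, since only the first two relevant chunks fall inside the top 3.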
By default, the evaluator scores the full retrieved list. Pass eval_config={"k": N} to evaluate only the top N retrieved chunks; for example, eval_config={"k": 3} checks precision within the first 3 results only.
A precision of 1.0 means every retrieved chunk is useful; a precision of 0.5 means half the results are noise. Low precision means your LLM receives irrelevant context, which can increase cost (more tokens) and in some cases cause the model to hallucinate based on misleading information.
What to Do When Precision@K Is Low
If precision is low, the retriever is returning too much irrelevant content:
- Reduce the number of chunks retrieved (lower K) to keep only the most confident matches
- Improve the embedding model to better distinguish relevant from irrelevant content
- Apply a similarity threshold to filter out low-confidence results before passing to the LLM
- Review your chunking strategy: chunks that are too large may contain a mix of relevant and irrelevant content
- Consider re-ranking retrieved results with a cross-encoder before passing them to the generator
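The similarity-threshold idea above can be sketched as a simple filter step. The scores and cutoff here are hypothetical; in practice they would come from your vector store (e.g. cosine similarities), and the threshold should be tuned per embedding model.

```python
# Hypothetical (chunk, similarity score) pairs, as a vector store might return.
scored_chunks = [
    ("Paris is the capital of France.", 0.91),
    ("France is in Europe.", 0.62),
    ("The Louvre is in Paris.", 0.84),
]

THRESHOLD = 0.75  # assumed cutoff; tune against your own precision measurements

# Keep only high-confidence chunks before passing context to the LLM.
kept = [chunk for chunk, score in scored_chunks if score >= THRESHOLD]
print(kept)  # ['Paris is the capital of France.', 'The Louvre is in Paris.']
```

Filtering this way trades recall for precision: borderline-relevant chunks may be dropped, so re-measure both metrics after changing the threshold.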
Differentiating Precision@K from Similar Evals
- Recall@K: Precision@K measures retrieval quality (how clean the results are), while Recall@K measures retrieval coverage (how many relevant items were found). Optimizing one often trades off against the other.
- NDCG@K: NDCG@K considers both relevance and ranking position, while Precision@K treats all positions equally within the top K.
- Chunk Utilization: Precision@K evaluates retrieval quality before generation, while Chunk Utilization measures how well the generator actually uses the retrieved chunks.
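The precision/recall trade-off noted above can be made concrete with a small sketch (illustrative only, using the letter chunks below as stand-in data): the same retrieval can be clean but incomplete, or complete but noisy.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top K results that are relevant."""
    return sum(c in set(relevant) for c in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant items that appear in the top K results."""
    return sum(c in set(retrieved[:k]) for c in relevant) / len(relevant)

retrieved = ["A", "B", "C", "D", "E"]  # ranked retrieval results
relevant = ["A", "C", "F"]             # ground truth; "F" was never retrieved

print(precision_at_k(retrieved, relevant, k=5))  # 0.4  -> 2 of 5 results are relevant
print(recall_at_k(retrieved, relevant, k=5))     # ~0.667 -> 2 of 3 relevant items found
```

Lowering K here would raise precision (fewer irrelevant results) but can never raise recall, which is why the two metrics are usually tracked together.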