RAG Metrics
19 local metrics for evaluating RAG pipelines — retrieval quality, generation faithfulness, advanced reasoning, and composite scores.
- 19 metrics: 8 retrieval, 6 generation, 3 advanced, 2 composite
- All run locally via `from fi.evals import evaluate`
- Pass `output`, `context`, and optionally `expected_output` or `query`
RAG metrics evaluate both sides of a retrieval-augmented generation pipeline: did the retriever fetch the right context, and did the generator use it well?
```python
from fi.evals import evaluate

result = evaluate(
    "groundedness",
    output="Paris is the capital of France and has about 2.1 million residents.",
    context="France is a country in Western Europe. Its capital is Paris, population approximately 2.1 million.",
)
print(result.score)  # 1.0 (all claims grounded in context)
```
Retrieval Metrics
How well did the retriever fetch relevant context?
| Metric | What it measures |
|---|---|
| context_recall | How much relevant information was retrieved |
| context_precision | How much of retrieved context is actually relevant |
| context_entity_recall | How many named entities from context appear in output |
| noise_sensitivity | How well the pipeline handles noisy context |
| ndcg | Ranking quality via Normalized Discounted Cumulative Gain |
| mrr | How early the first relevant result appears |
| precision_at_k | Fraction of top-K retrieved chunks that are relevant |
| recall_at_k | Fraction of all relevant chunks in top-K results |
context_recall
How much relevant information was retrieved. Compares context against expected output.
```python
result = evaluate(
    "context_recall",
    output="The Eiffel Tower is 330 metres tall and was built in 1889.",
    context="The Eiffel Tower stands 330 metres tall. It was constructed for the 1889 World's Fair.",
    expected_output="The Eiffel Tower is 330 metres tall, built in 1889.",
)
# score → 1.0
```
context_precision
How much of the retrieved context is actually relevant. Penalizes noisy chunks.
```python
result = evaluate(
    "context_precision",
    output="Python was created by Guido van Rossum.",
    context="Python was created by Guido van Rossum in 1991. Java was created by James Gosling. C++ by Bjarne Stroustrup.",
    expected_output="Guido van Rossum created Python.",
)
# score penalized by irrelevant Java/C++ context
```
context_entity_recall
How many named entities from context appear in the output (names, dates, places).
```python
result = evaluate(
    "context_entity_recall",
    output="Marie Curie won the Nobel Prize in Physics in 1903.",
    context="Marie Curie won the Nobel Prize in Physics in 1903 and Chemistry in 1911.",
)
# High — key entities carried through
```
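The idea behind entity recall can be sketched in plain Python. This is a toy illustration with a hand-listed entity set; the library performs its own entity extraction and scoring:

```python
# Toy entity-recall check: what fraction of context entities survive into the output?
context_entities = {"Marie Curie", "Nobel Prize", "Physics", "1903", "Chemistry", "1911"}
output = "Marie Curie won the Nobel Prize in Physics in 1903."

recalled = {e for e in context_entities if e in output}
entity_recall = len(recalled) / len(context_entities)
print(round(entity_recall, 2))  # 4 of 6 entities recalled → 0.67
```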
noise_sensitivity
How well the pipeline handles noisy or irrelevant context.
```python
result = evaluate(
    "noise_sensitivity",
    output="The speed of light is approximately 299,792 km/s.",
    context="Speed of light is 299,792,458 m/s. Bananas are a good source of potassium.",
)
# High if the irrelevant banana fact does not leak into the output
```
ndcg
Normalized Discounted Cumulative Gain. Higher-ranked relevant items contribute more to the score.
```python
result = evaluate(
    "ndcg",
    output="The Great Wall of China is over 13,000 miles long.",
    context="The Great Wall of China stretches over 13,000 miles. It was built over many centuries.",
    expected_output="The Great Wall is over 13,000 miles long.",
)
```
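NDCG itself is a standard ranking formula and can be reproduced without the library. A minimal reference implementation over a list of graded relevances (the example relevance lists are illustrative, not tied to the call above):

```python
import math

def dcg(relevances):
    # Each item's relevance is discounted by log2(rank + 1), with ranks starting at 1
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (best-possible) ordering
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

print(ndcg([1, 0, 0]))  # 1.0 — the relevant chunk is ranked first
print(ndcg([0, 0, 1]))  # 0.5 — the relevant chunk is ranked last, so discounted
```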
mrr
Mean Reciprocal Rank. Measures how early the first relevant result appears in the retrieved set.
```python
result = evaluate(
    "mrr",
    output="Water boils at 100 degrees Celsius at sea level.",
    context="Water boils at 100°C at standard atmospheric pressure. Ice melts at 0°C.",
    expected_output="Water boils at 100 degrees Celsius.",
)
```
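The reciprocal-rank formula is likewise easy to state directly. A reference implementation, not the library's internals:

```python
def mrr(rank_lists):
    # Average, over queries, of 1 / rank of the first relevant result (0 if none)
    total = 0.0
    for relevances in rank_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rank_lists)

# Query 1: first relevant result at rank 1; query 2: first relevant result at rank 3
print(mrr([[1, 0, 0], [0, 0, 1]]))  # (1/1 + 1/3) / 2 ≈ 0.67
```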
precision_at_k
Fraction of top-K retrieved chunks that are relevant.
```python
result = evaluate(
    "precision_at_k",
    output="Photosynthesis converts sunlight into chemical energy.",
    context="Photosynthesis converts light energy into chemical energy in plants. Mitosis is a type of cell division.",
    expected_output="Photosynthesis converts sunlight into chemical energy.",
)
```
recall_at_k
Fraction of all relevant chunks that appear in top-K results.
```python
result = evaluate(
    "recall_at_k",
    output="DNA carries genetic information and is shaped as a double helix.",
    context="DNA is a molecule that carries genetic instructions. Its structure is a double helix discovered by Watson and Crick.",
    expected_output="DNA carries genetic information in a double helix structure.",
)
```
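Both top-K formulas reduce to simple ratios over a ranked relevance list; a reference sketch, independent of the library (the ranking here is illustrative):

```python
def precision_at_k(relevances, k):
    # Fraction of the top-k retrieved chunks that are relevant
    return sum(relevances[:k]) / k

def recall_at_k(relevances, k, total_relevant):
    # Fraction of all relevant chunks that appear in the top-k
    return sum(relevances[:k]) / total_relevant

ranked = [1, 0, 1, 0, 0]  # 1 = relevant chunk at that rank
print(precision_at_k(ranked, 3))                 # 2 relevant in top-3 → 2/3
print(recall_at_k(ranked, 3, total_relevant=2))  # both relevant chunks retrieved → 1.0
```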
Generation Metrics
How well did the generator use the retrieved context?
| Metric | What it measures |
|---|---|
| answer_relevancy | How relevant the answer is to the query |
| context_utilization | How much of the context the generator used |
| context_relevance_to_response | Whether context supports the response |
| rag_faithfulness | Whether the output is faithful to context |
| rag_faithfulness_with_reference | Faithfulness checked against context and a reference |
| groundedness | Whether every claim traces back to context |
answer_relevancy
How relevant the answer is to the original query. Correct info that doesn’t answer the question scores low.
```python
result = evaluate("answer_relevancy", output="The capital of France is Paris.", query="What is the capital of France?")
# score → 1.0

result = evaluate("answer_relevancy", output="France has 67 million people.", query="What is the capital of France?")
# score → low (correct but irrelevant)
```
context_utilization
How much of the provided context the generator actually used.
```python
result = evaluate(
    "context_utilization",
    output="Jupiter is the largest planet with a diameter of 139,820 km and at least 95 moons.",
    context="Jupiter is the largest planet. Diameter: 139,820 km. At least 95 known moons including the four Galilean moons.",
)
# High — used multiple facts
```
context_relevance_to_response
Whether the context supports what was said in the output. This is the reverse direction of context_utilization: the context is checked against the response, rather than the response against the context.
```python
result = evaluate(
    "context_relevance_to_response",
    output="The Nile is the longest river in Africa.",
    context="The Nile River, at about 6,650 km, is the longest river in Africa. It flows through eleven countries.",
)
```
rag_faithfulness
Whether the output is faithful to context. Penalizes claims that go beyond the context, even if true.
```python
# Faithful
result = evaluate("rag_faithfulness", output="Mars has two moons: Phobos and Deimos.", context="Mars has two satellites: Phobos and Deimos.")
# score → 1.0

# Unfaithful — adds info not in context
result = evaluate("rag_faithfulness", output="Mars has two moons and a thin CO2 atmosphere.", context="Mars has two satellites: Phobos and Deimos.")
# score penalized
```
rag_faithfulness_with_reference
Faithfulness checked against both context and a reference answer.
```python
result = evaluate(
    "rag_faithfulness_with_reference",
    output="Einstein developed the theory of general relativity in 1915.",
    context="Albert Einstein published his theory of general relativity in 1915.",
    expected_output="Einstein published general relativity in 1915.",
)
```
groundedness
Whether every claim in the response can be traced back to the context.
```python
result = evaluate(
    "groundedness",
    output="The Amazon River is the longest in South America, flowing through Brazil, Peru, and Colombia.",
    context="The Amazon is the longest river in South America at ~6,400 km, flowing through Brazil, Peru, and Colombia.",
)
# score → 1.0
```
Advanced Metrics
Metrics that evaluate deeper reasoning and attribution.
| Metric | What it measures |
|---|---|
| multi_hop_reasoning | Whether output correctly chains facts from different parts of context |
| source_attribution | Whether information is properly attributed to its source |
| citation_presence | Whether the output includes citations or references |
multi_hop_reasoning
Whether the output correctly chains facts from different parts of the context.
```python
result = evaluate(
    "multi_hop_reasoning",
    output="Since Alice manages Bob and Bob leads engineering, Alice oversees engineering.",
    context="Alice is VP of Engineering and manages Bob. Bob leads the backend engineering team.",
)
# High — correctly chains two facts
```
source_attribution
Whether information is properly attributed to its source within the context.
```python
result = evaluate(
    "source_attribution",
    output="According to the WHO report, global life expectancy increased to 73 years in 2019.",
    context="The WHO World Health Statistics report states that global life expectancy reached 73.3 years in 2019.",
)
```
citation_presence
Whether the output includes citations or references to source material.
```python
result = evaluate(
    "citation_presence",
    output="The study found a 15% improvement in accuracy [1]. Processing time decreased by 20% [2].",
    context="[1] Smith et al. reported 15% accuracy gains. [2] Jones et al. observed 20% faster processing.",
)
```
Composite Metrics
Single scores that combine retrieval and generation quality.
| Metric | What it measures |
|---|---|
| rag_score | Single composite score for overall RAG quality |
| rag_score_detailed | Same as rag_score with per-metric breakdown in metadata |
rag_score
Single composite score combining retrieval and generation quality.
```python
result = evaluate(
    "rag_score",
    output="Quantum entanglement links particles so measuring one instantly affects the other regardless of distance.",
    context="Quantum entanglement is a phenomenon where particles become correlated such that measuring one instantly influences the other, regardless of distance.",
    expected_output="Quantum entanglement links particles so measuring one affects the other instantly.",
)
print(result.score)
```
rag_score_detailed
Same as rag_score but returns a per-metric breakdown in result.metadata.
```python
result = evaluate(
    "rag_score_detailed",
    output="Mitochondria are the powerhouses of the cell, producing ATP through cellular respiration.",
    context="Mitochondria generate most of the cell's supply of ATP, used as a source of chemical energy. This process is called cellular respiration.",
    expected_output="Mitochondria produce ATP via cellular respiration.",
)
print(result.score)     # composite score
print(result.metadata)  # per-metric breakdown
```
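The weighting behind the composite is internal to the library, but conceptually it aggregates the per-metric scores in the breakdown. A minimal sketch using an unweighted mean over a hypothetical breakdown (the real metadata keys and weights may differ):

```python
# Hypothetical per-metric scores; the actual breakdown keys come from the library
breakdown = {"context_recall": 0.9, "context_precision": 0.8, "groundedness": 1.0}

composite = sum(breakdown.values()) / len(breakdown)
print(round(composite, 2))  # 0.9
```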