RAG Metrics

19 local metrics for evaluating RAG pipelines — retrieval quality, generation faithfulness, advanced reasoning, and composite scores.

📝 TL;DR
  • 19 metrics: 8 retrieval, 6 generation, 3 advanced, 2 composite
  • All run locally via from fi.evals import evaluate
  • Pass output, context, and optionally expected_output or query

RAG metrics evaluate both sides of a retrieval-augmented generation pipeline: did the retriever fetch the right context, and did the generator use it well?

from fi.evals import evaluate

result = evaluate(
    "groundedness",
    output="Paris is the capital of France and has about 2.1 million residents.",
    context="France is a country in Western Europe. Its capital is Paris, population approximately 2.1 million.",
)
print(result.score)   # 1.0 (all claims grounded in context)

Retrieval Metrics

How well did the retriever fetch relevant context?

Metric | What it measures
context_recall | How much relevant information was retrieved
context_precision | How much of the retrieved context is actually relevant
context_entity_recall | How many named entities from the context appear in the output
noise_sensitivity | How well the pipeline handles noisy context
ndcg | Ranking quality via Normalized Discounted Cumulative Gain
mrr | How early the first relevant result appears
precision_at_k | Fraction of top-K retrieved chunks that are relevant
recall_at_k | Fraction of all relevant chunks in the top-K results

context_recall

How much of the relevant information was retrieved. Compares the retrieved context against the expected output.

result = evaluate(
    "context_recall",
    output="The Eiffel Tower is 330 metres tall and was built in 1889.",
    context="The Eiffel Tower stands 330 metres tall. It was constructed for the 1889 World's Fair.",
    expected_output="The Eiffel Tower is 330 metres tall, built in 1889.",
)
# score → 1.0

context_precision

How much of the retrieved context is actually relevant. Penalizes noisy chunks.

result = evaluate(
    "context_precision",
    output="Python was created by Guido van Rossum.",
    context="Python was created by Guido van Rossum in 1991. Java was created by James Gosling. C++ by Bjarne Stroustrup.",
    expected_output="Guido van Rossum created Python.",
)
# score penalized by irrelevant Java/C++ context

context_entity_recall

How many named entities from context appear in the output (names, dates, places).

result = evaluate(
    "context_entity_recall",
    output="Marie Curie won the Nobel Prize in Physics in 1903.",
    context="Marie Curie won the Nobel Prize in Physics in 1903 and Chemistry in 1911.",
)
# High — key entities carried through
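
The underlying ratio is simple: the share of context entities that survive into the output. A toy sketch with hand-listed entity sets (illustrative only; a real implementation would extract entities with named-entity recognition rather than take them as input):

```python
def entity_recall(output_entities, context_entities):
    # Fraction of context entities that survive into the output
    if not context_entities:
        return 1.0
    return len(output_entities & context_entities) / len(context_entities)

# Hand-listed entities for illustration
context_entities = {"Marie Curie", "Nobel Prize", "Physics", "1903", "Chemistry", "1911"}
output_entities = {"Marie Curie", "Nobel Prize", "Physics", "1903"}
print(round(entity_recall(output_entities, context_entities), 3))  # 0.667
```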

noise_sensitivity

How well the pipeline handles noisy or irrelevant context.

result = evaluate(
    "noise_sensitivity",
    output="The speed of light is approximately 299,792 km/s.",
    context="Speed of light is 299,792,458 m/s. Bananas are a good source of potassium.",
)

ndcg

Normalized Discounted Cumulative Gain. Higher-ranked relevant items contribute more to the score.

result = evaluate(
    "ndcg",
    output="The Great Wall of China is over 13,000 miles long.",
    context="The Great Wall of China stretches over 13,000 miles. It was built over many centuries.",
    expected_output="The Great Wall is over 13,000 miles long.",
)
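
The idea behind the score is small enough to sketch in pure Python: each relevant item is discounted by the log of its rank, then normalized against the ideal ordering. Illustrative only; the metric itself runs via evaluate:

```python
from math import log2

def dcg(relevances):
    # Each item's relevance is discounted by log2(rank + 1), ranks 1-indexed
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize against the ideal (descending) ordering of the same items
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([1, 0, 0]))  # 1.0 (relevant chunk ranked first)
print(ndcg([0, 0, 1]))  # 0.5 (same chunk buried at rank 3)
```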

mrr

Mean Reciprocal Rank. Measures how early the first relevant result appears in the retrieved set.

result = evaluate(
    "mrr",
    output="Water boils at 100 degrees Celsius at sea level.",
    context="Water boils at 100°C at standard atmospheric pressure. Ice melts at 0°C.",
    expected_output="Water boils at 100 degrees Celsius.",
)
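
The formula itself is compact: the reciprocal of the rank of the first relevant hit, averaged over queries. A minimal sketch (illustrative only):

```python
def reciprocal_rank(ranked_relevance):
    # 1 / position of the first relevant result (1-indexed); 0 if none found
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_query_relevance):
    return sum(reciprocal_rank(q) for q in per_query_relevance) / len(per_query_relevance)

# Query 1: first hit at rank 1; query 2: first hit at rank 2
print(mean_reciprocal_rank([[1, 0, 0], [0, 1, 0]]))  # 0.75
```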

precision_at_k

Fraction of top-K retrieved chunks that are relevant.

result = evaluate(
    "precision_at_k",
    output="Photosynthesis converts sunlight into chemical energy.",
    context="Photosynthesis converts light energy into chemical energy in plants. Mitosis is a type of cell division.",
    expected_output="Photosynthesis converts sunlight into chemical energy.",
)

recall_at_k

Fraction of all relevant chunks that appear in top-K results.

result = evaluate(
    "recall_at_k",
    output="DNA carries genetic information and is shaped as a double helix.",
    context="DNA is a molecule that carries genetic instructions. Its structure is a double helix discovered by Watson and Crick.",
    expected_output="DNA carries genetic information in a double helix structure.",
)
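
precision_at_k and recall_at_k share the same counting logic and differ only in the denominator. A minimal sketch (the chunk IDs are illustrative placeholders):

```python
def precision_at_k(retrieved, relevant, k):
    # Of the top-K retrieved chunks, what fraction are relevant?
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    # Of all relevant chunks, what fraction made it into the top-K?
    hits = sum(1 for chunk in retrieved[:k] if chunk in relevant)
    return hits / len(relevant)

retrieved = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]   # ranked results
relevant = {"chunk_a", "chunk_c", "chunk_e"}               # ground truth
print(round(precision_at_k(retrieved, relevant, 3), 3))  # 0.667 (2 of top-3 are relevant)
print(round(recall_at_k(retrieved, relevant, 3), 3))     # 0.667 (2 of 3 relevant retrieved)
```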

Generation Metrics

How well did the generator use the retrieved context?

Metric | What it measures
answer_relevancy | How relevant the answer is to the query
context_utilization | How much of the context the generator used
context_relevance_to_response | Whether the context supports the response
rag_faithfulness | Whether the output is faithful to the context
rag_faithfulness_with_reference | Faithfulness checked against context and a reference
groundedness | Whether every claim traces back to the context

answer_relevancy

How relevant the answer is to the original query. Correct info that doesn’t answer the question scores low.

result = evaluate("answer_relevancy", output="The capital of France is Paris.", query="What is the capital of France?")
# score → 1.0

result = evaluate("answer_relevancy", output="France has 67 million people.", query="What is the capital of France?")
# score → low (correct but irrelevant)

context_utilization

How much of the provided context the generator actually used.

result = evaluate(
    "context_utilization",
    output="Jupiter is the largest planet with a diameter of 139,820 km and at least 95 moons.",
    context="Jupiter is the largest planet. Diameter: 139,820 km. At least 95 known moons including the four Galilean moons.",
)
# High — used multiple facts

context_relevance_to_response

Whether the context supports what was said in the output (the reverse direction of context_utilization: it starts from the response and checks back against the context).

result = evaluate(
    "context_relevance_to_response",
    output="The Nile is the longest river in Africa.",
    context="The Nile River, at about 6,650 km, is the longest river in Africa. It flows through eleven countries.",
)

rag_faithfulness

Whether the output is faithful to context. Penalizes claims that go beyond the context, even if true.

# Faithful
result = evaluate("rag_faithfulness", output="Mars has two moons: Phobos and Deimos.", context="Mars has two satellites: Phobos and Deimos.")
# score → 1.0

# Unfaithful — adds info not in context
result = evaluate("rag_faithfulness", output="Mars has two moons and a thin CO2 atmosphere.", context="Mars has two satellites: Phobos and Deimos.")
# score penalized

rag_faithfulness_with_reference

Faithfulness checked against both context and a reference answer.

result = evaluate(
    "rag_faithfulness_with_reference",
    output="Einstein developed the theory of general relativity in 1915.",
    context="Albert Einstein published his theory of general relativity in 1915.",
    expected_output="Einstein published general relativity in 1915.",
)

groundedness

Whether every claim in the response can be traced back to the context.

result = evaluate(
    "groundedness",
    output="The Amazon River is the longest in South America, flowing through Brazil, Peru, and Colombia.",
    context="The Amazon is the longest river in South America at ~6,400 km, flowing through Brazil, Peru, and Colombia.",
)
# score → 1.0
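
As a rough intuition for the grounded-fraction idea, here is a crude lexical proxy: split the output into sentences and count how many have most of their words present in the context. This is only an illustration; the actual metric checks claims far more carefully than word overlap, and the 0.6 threshold is an arbitrary assumption:

```python
import re

def naive_groundedness(output, context, threshold=0.6):
    # Crude lexical stand-in for per-claim grounding: an output sentence
    # counts as grounded when most of its words also occur in the context
    context_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]
    grounded = 0
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap >= threshold:
            grounded += 1
    return grounded / len(sentences)

output = "The Amazon is the longest river in South America. It is haunted by ghosts."
context = "The Amazon is the longest river in South America at about 6,400 km."
print(naive_groundedness(output, context))  # 0.5 (second sentence unsupported)
```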

Advanced Metrics

Metrics that evaluate deeper reasoning and attribution.

Metric | What it measures
multi_hop_reasoning | Whether the output correctly chains facts from different parts of the context
source_attribution | Whether information is properly attributed to its source
citation_presence | Whether the output includes citations or references

multi_hop_reasoning

Whether the output correctly chains facts from different parts of the context.

result = evaluate(
    "multi_hop_reasoning",
    output="Since Alice manages Bob and Bob leads engineering, Alice oversees engineering.",
    context="Alice is VP of Engineering and manages Bob. Bob leads the backend engineering team.",
)
# High — correctly chains two facts

source_attribution

Whether information is properly attributed to its source within the context.

result = evaluate(
    "source_attribution",
    output="According to the WHO report, global life expectancy increased to 73 years in 2019.",
    context="The WHO World Health Statistics report states that global life expectancy reached 73.3 years in 2019.",
)

citation_presence

Whether the output includes citations or references to source material.

result = evaluate(
    "citation_presence",
    output="The study found a 15% improvement in accuracy [1]. Processing time decreased by 20% [2].",
    context="[1] Smith et al. reported 15% accuracy gains. [2] Jones et al. observed 20% faster processing.",
)

Composite Metrics

Single scores that combine retrieval and generation quality.

Metric | What it measures
rag_score | Single composite score for overall RAG quality
rag_score_detailed | Same as rag_score with a per-metric breakdown in metadata

rag_score

Single composite score combining retrieval and generation quality.

result = evaluate(
    "rag_score",
    output="Quantum entanglement links particles so measuring one instantly affects the other regardless of distance.",
    context="Quantum entanglement is a phenomenon where particles become correlated such that measuring one instantly influences the other, regardless of distance.",
    expected_output="Quantum entanglement links particles so measuring one affects the other instantly.",
)
print(result.score)

rag_score_detailed

Same as rag_score but returns a per-metric breakdown in result.metadata.

result = evaluate(
    "rag_score_detailed",
    output="Mitochondria are the powerhouses of the cell, producing ATP through cellular respiration.",
    context="Mitochondria generate most of the cell's supply of ATP, used as a source of chemical energy. This process is called cellular respiration.",
    expected_output="Mitochondria produce ATP via cellular respiration.",
)
print(result.score)     # composite score
print(result.metadata)  # per-metric breakdown
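
The weighting that produces the composite is internal to the library. Purely as an illustration of how sub-scores can roll up into one number, a harmonic mean over hypothetical sub-scores (the metric names and values below are invented for the example) penalizes any single weak dimension more sharply than an arithmetic mean would:

```python
from statistics import harmonic_mean

# Hypothetical sub-scores of the kind a detailed breakdown might expose
sub_scores = {"context_precision": 0.9, "context_recall": 0.8, "groundedness": 1.0}

# Harmonic mean: one weak dimension drags the composite down hard
composite = harmonic_mean(list(sub_scores.values()))
print(round(composite, 3))  # 0.893
```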