Retrieval-Augmented Generation (RAG) is an information retrieval technique that combines the generative capabilities of LLMs with external data sources. Evaluating RAG systems is critical to ensuring the generated outputs are contextually relevant, factually accurate, and aligned with the input queries.

A typical RAG architecture includes a retriever and a generator. The retriever uses similarity search over a knowledge base of vector embeddings, returning the most relevant documents. The generator then takes the retriever's output to produce a final answer in natural language, as the sketch below illustrates.
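
The snippet below is a minimal sketch of this retrieve-then-generate loop. The embed and generate functions are toy placeholders (not part of any SDK referenced later); a real system would use an embedding model and an LLM in their place.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: normalised character-frequency vector, standing in for a real embedding model
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

knowledge_base = [
    "Black holes are cosmic objects with extremely strong gravitational fields.",
    "The solar system consists of the Sun and the celestial objects bound to it.",
]
kb_vectors = np.stack([embed(doc) for doc in knowledge_base])

def retrieve(query: str, top_k: int = 1) -> list[str]:
    # Retriever: cosine similarity search over the embedded knowledge base
    scores = kb_vectors @ embed(query)
    return [knowledge_base[i] for i in np.argsort(scores)[::-1][:top_k]]

def generate(query: str, contexts: list[str]) -> str:
    # Generator placeholder: a real system would prompt an LLM with the query and retrieved context
    return f"Answer to '{query}', based on: {contexts[0]}"

print(generate("What are black holes?", retrieve("What are black holes?")))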

Without frequent evaluation, a RAG system can generate hallucinated responses or rely on irrelevant or outdated context, leading to incorrect or incomplete answers. By implementing a structured evaluation framework, developers can ensure their RAG systems perform reliably, remain contextually grounded, and deliver results that align with user expectations and business requirements.

Evaluating RAG applications involves analysing the interaction between retrieved contexts, the AI-generated responses, and the queries to guarantee precision, coherence, and completeness. These evaluations help optimise the performance of RAG-based systems in applications like search engines, knowledge retrieval, and Q&A systems.

Below are the key evals used to assess RAG applications:


1. Eval Context Retrieval Quality

Assesses the relevance and quality of retrieved contexts in relation to the input query. Ensures that the selected contexts provide sufficient and accurate information for generating responses.

Click here to read the eval definition of Context Retrieval Quality

a. Using Interface

Required Parameters

  • Inputs:
    • input: Query/question
    • output: Generated response
    • context: Retrieved context
  • Config
    • check_internet: Boolean - Whether to verify against internet sources

Output: Returns float between 0 and 1, where higher values indicate better context retrieval quality

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import TestCase
from fi.evals.templates import ContextRetrieval

retrieval_eval = ContextRetrieval(config={
    "criteria": "Return quality of output based on relevance to the input and context"
})

test_case = TestCase(
    input="What are black holes?",
    output="Black holes are regions of spacetime where gravity is so strong that nothing can escape.",
    context="Black holes are cosmic objects with extremely strong gravitational fields"
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[retrieval_eval], inputs=[test_case])
retrieval_score = result.eval_results[0].metrics[0].value
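
If the score needs to drive an automated check (for example in a CI pipeline), a simple threshold comparison works. The 0.7 cut-off below is an arbitrary choice for illustration, not an SDK default.

PASS_THRESHOLD = 0.7  # arbitrary example threshold, not an SDK default
if retrieval_score < PASS_THRESHOLD:
    raise ValueError(f"Context retrieval quality too low: {retrieval_score:.2f}")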

2. Eval Ranking

Ranks retrieved contexts based on their relevance to the query. This evaluation ensures that the most relevant pieces of context are prioritised during response generation.

Click here to read the eval definition of Eval Ranking

a. Using Interface

Required Parameters

  • Inputs:
    • input: Query
    • context: List of contexts to rank
  • Config:
    • criteria: Ranking criteria description

Output: Returns float between 0 and 1, where higher values indicate better ranking quality

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import TestCase
from fi.evals.templates import Ranking

ranking_eval = Ranking(config={
    "criteria": "Rank contexts based on relevance to the query"
})

test_case = TestCase(
    input="What is the solar system?",
    context=[
        "The solar system consists of the Sun and celestial objects bound to it",
        "Our solar system formed 4.6 billion years ago"
    ]
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[ranking_eval], inputs=[test_case])
ranking_score = result.eval_results[0].metrics[0].value
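
Because both eval_templates and inputs are list-valued, it should also be possible to score several queries in one call. The snippet below is a sketch that assumes evaluate() accepts multiple TestCase objects and returns one result per case in the same order; verify this against the SDK documentation before relying on it.

# Assumes evaluate() accepts multiple test cases and returns results in the same order
test_cases = [
    TestCase(
        input="What is the solar system?",
        context=[
            "The solar system consists of the Sun and celestial objects bound to it",
            "Our solar system formed 4.6 billion years ago",
        ],
    ),
    TestCase(
        input="How old is the solar system?",
        context=[
            "Our solar system formed 4.6 billion years ago",
            "The solar system consists of the Sun and celestial objects bound to it",
        ],
    ),
]
batch_result = evaluator.evaluate(eval_templates=[ranking_eval], inputs=test_cases)
ranking_scores = [r.metrics[0].value for r in batch_result.eval_results]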

3. Context Relevance

Determines the relevancy of retrieved context in addressing the user query. Evaluates whether the retrieved data aligns with the requirements of the query.

Click here to read the eval definition of Context Relevance

a. Using Interface

Required Parameters

  • Inputs:
    • context: Context provided
    • input: Query
    • output: Generated response
  • Config:
    • model: LLM model to use
    • check_internet: Boolean (True/False)

Output: Returns float between 0 and 1, where higher values indicate more relevant context

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import TestCase
from fi.evals.templates import ContextRelevance

relevance_eval = ContextRelevance(config={
    "check_internet": False
})

test_case = TestCase(
    context="The current temperature is 72°F with partly cloudy skies.",
    input="What is the weather like?",
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[relevance_eval], inputs=[test_case])
relevance_score = result.eval_results[0].metrics[0].value
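
For contrast, an off-topic context should score noticeably lower. The example below reuses the same template with a deliberately mismatched context; this is an expected tendency, not a guaranteed score.

mismatched_case = TestCase(
    context="The stock market closed higher today after strong earnings reports.",
    input="What is the weather like?",
)
mismatch_result = evaluator.evaluate(eval_templates=[relevance_eval], inputs=[mismatched_case])
print(mismatch_result.eval_results[0].metrics[0].value)  # expected to be lower than the weather example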

4. Context Similarity

Compares the provided context with the expected or ideal context. Checks the semantic similarity between the two contexts to validate their relevance.

Click here to read the eval definition of Context Similarity

a. Using Interface

Required Parameters

  • Inputs:
    • context: Reference context
    • response: Generated response
  • Config:
    • comparator: Method to use for comparison
    • failure_threshold: The threshold below which the evaluation fails. Float (e.g., 0.7)

Output: Returns float between 0 and 1, where values ≥ failure_threshold indicate sufficient similarity

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import TestCase
from fi.evals.types import Comparator
from fi.evals.templates import ContextSimilarity

template = ContextSimilarity(
    config={
        "comparator": Comparator.COSINE.value,
        "failure_threshold": 0.7
    }
)

test_case = TestCase(
    context="The Earth orbits around the Sun in an elliptical path.",
    response="The Earth's orbit around the Sun is not perfectly circular but elliptical."
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[template], inputs=[test_case])
similarity_score = result.eval_results[0].metrics[0].value
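
Since values at or above the configured failure_threshold indicate sufficient similarity, the score can be turned into a pass/fail flag. The comparison below is illustrative and simply mirrors the threshold set in the config; whether the SDK also reports pass/fail directly is not shown here.

is_similar = similarity_score >= 0.7  # mirrors the failure_threshold configured above
print("similar" if is_similar else "not similar", similarity_score)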

5. Answer Similarity

Evaluates how closely the AI-generated response matches the expected response. Used to ensure that the system produces accurate and precise answers based on the input query.

Click here to read the eval definition of Answer Similarity

a. Using Interface

Required Parameters

  • Inputs:
    • response: String - Generated answer
    • expected_response: String - Reference answer
  • Config:
    • comparator: Method to use for comparison
    • failure_threshold: The threshold below which the evaluation fails. Float (e.g., 0.7)

Output: Returns float between 0 and 1, where values ≥ failure_threshold indicate sufficient similarity

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import LLMTestCase
from fi.evals.templates import AnswerSimilarity
from fi.evals.types import Comparator

similarity_eval = AnswerSimilarity(config={
    "comparator": Comparator.COSINE.value,
    "failure_threshold": 0.8
})

test_case = LLMTestCase(
    query="What is the example?",
    response="example response",
    expected_response="example of expected response"
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[similarity_eval], inputs=[test_case])
similarity_score = result.eval_results[0].metrics[0].value
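
A more concrete pair makes the score easier to interpret: paraphrases of the same fact should land above the configured 0.8 threshold, while unrelated answers should fall below it. This is an illustrative expectation, not a guaranteed value.

concrete_case = LLMTestCase(
    query="When did Apollo 11 land on the Moon?",
    response="Apollo 11 landed on the Moon in July 1969.",
    expected_response="The Apollo 11 lunar landing took place on 20 July 1969."
)
concrete_result = evaluator.evaluate(eval_templates=[similarity_eval], inputs=[concrete_case])
print(concrete_result.eval_results[0].metrics[0].value)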

6. Completeness

Evaluates whether the generated response addresses all aspects of the input query. Ensures that no key details are missed in the AI’s output.

Click here to read the eval definition of Completeness

a. Using Interface

Required Parameters

  • Inputs:
    • input: Original text
    • output: AI generated content

Output: Returns float between 0 and 1, where higher values indicate more complete content

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import TestCase
from fi.evals.templates import Completeness

completeness_eval = Completeness()

test_case = TestCase(
    input="Comprehensive response covering all aspects...",
    output="example of output content"
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[completeness_eval], inputs=[test_case])
completeness_score = result.eval_results[0].metrics[0].value
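
To see the metric move, compare a response that answers only part of a multi-part question with one that covers every part. The cases below are illustrative, and the exact scores will depend on the underlying model.

partial_case = TestCase(
    input="List the three states of matter and give one example of each.",
    output="The three states of matter are solid, liquid, and gas."
)
full_case = TestCase(
    input="List the three states of matter and give one example of each.",
    output="The three states of matter are solid (e.g. ice), liquid (e.g. water), and gas (e.g. steam)."
)
for case in (partial_case, full_case):
    res = evaluator.evaluate(eval_templates=[completeness_eval], inputs=[case])
    print(res.eval_results[0].metrics[0].value)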

7. Groundedness

Determines if the generated response is grounded in the provided context. Verifies that the output is based on accurate and relevant information, avoiding hallucination.

Click here to read the eval definition of Groundedness

a. Using Interface

Required Parameters

  • Inputs:
    • response: Generated response
    • context: Source context
  • Config:
    • model: LLM model to use

Output: Returns float between 0 and 1, where higher values indicate better grounding in source context

b. Using SDK

from fi.evals import EvalClient
from fi.testcases import TestCase
from fi.evals.templates import Groundedness

groundedness_eval = Groundedness(config={"model": "gpt-4o-mini"})

test_case = TestCase(
    response="The Earth orbits around the Sun",
    context="The Earth completes one orbit around the Sun every 365.25 days"
)

evaluator = EvalClient()
result = evaluator.evaluate(eval_templates=[groundedness_eval], inputs=[test_case])
groundedness_score = result.eval_results[0].metrics[0].value
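
The groundedness score can serve as a simple hallucination gate. The 0.5 cut-off below is an arbitrary example; in practice the threshold should be tuned on labelled data.

GROUNDEDNESS_THRESHOLD = 0.5  # arbitrary example cut-off, tune on your own data
if groundedness_score < GROUNDEDNESS_THRESHOLD:
    print("Warning: the response may not be grounded in the retrieved context")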