Retrieval-Augmented Generation (RAG) is an information retrieval technique that combines the generative capabilities of LLMs with external data sources. Evaluating RAG systems is critical to ensuring that generated outputs are contextually relevant, factually accurate, and aligned with the input queries.

A typical RAG architecture includes a retriever and a generator. The retriever uses similarity search over a knowledge base of vector embeddings, returning the most relevant vectors. The generator then takes the retriever's output and produces a final answer in natural language. Without frequent evaluation, a RAG system can generate hallucinated responses or rely on irrelevant or outdated context, leading to incorrect or incomplete answers. By implementing a structured evaluation framework, developers can ensure their RAG systems perform reliably, remain contextually grounded, and deliver results that align with user expectations and business requirements.

Evaluating RAG applications involves analysing the interaction between the retrieved contexts, the AI-generated responses, and the queries to guarantee precision, coherence, and completeness. These evaluations help optimise the performance of RAG-based systems in applications like search engines, knowledge retrieval, and Q&A systems. Below are the key evals used to assess RAG applications:
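To ground the terminology, here is a minimal, hypothetical sketch of the retriever step: contexts and the query are represented as embedding vectors, and cosine similarity selects the most relevant context. The toy vectors and the `retrieve` helper are illustrative assumptions; a real system would use a trained embedding model and a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, knowledge_base, top_k=1):
    """Return the top_k contexts whose embeddings are closest to the query.

    knowledge_base is a list of (context_text, embedding) pairs; in a real
    RAG system the embeddings would come from an embedding model.
    """
    ranked = sorted(
        knowledge_base,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# Toy 3-dimensional "embeddings", purely for illustration.
kb = [
    ("Honey never spoils due to low moisture and high acidity.", [0.9, 0.1, 0.0]),
    ("The Eiffel Tower is in Paris.", [0.0, 0.2, 0.9]),
]
print(retrieve([0.8, 0.2, 0.1], kb))  # the honey context scores highest
```

The generator would then be prompted with the returned context plus the user query; the evals below assess the quality of exactly these retrieved contexts and generated answers.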

1. Eval Ranking

Ranks retrieved contexts based on their relevance to the query. This evaluation ensures that the most relevant pieces of context are prioritised during response generation. Click here to read the eval definition of Eval Ranking

a. Using Interface

Required Parameters
  • Inputs:
    • input: Query
    • context: List of contexts to rank
  • Config:
    • criteria: Ranking criteria description
Output: Returns a float between 0 and 1, where higher values indicate better ranking quality

b. Using SDK

result = evaluator.evaluate(
    eval_templates="eval_ranking",
    inputs={
        "context": '''Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms.
        These properties make it inhospitable for most microbes that would typically cause food to rot or ferment.
        Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.''',
        "input": "Why doesn’t honey go bad?",
        "output": "Honey doesn’t spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)

2. Context Relevance

Determines the relevancy of retrieved context in addressing the user query. Evaluates whether the retrieved data aligns with the requirements of the query. Click here to read the eval definition of Context Relevance

a. Using Interface

Required Parameters
  • Inputs:
    • context: Context provided
    • input: Query
  • Config:
    • Check Internet: Boolean - Whether to verify against internet sources
Output: Returns a float between 0 and 1, where higher values indicate more relevant context

b. Using SDK

result = evaluator.evaluate(
    eval_templates="context_relevance",
    inputs={
        "context": "Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms. Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.",
        "input": "Why doesn’t honey go bad?"
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)

3. Completeness

Evaluates whether the generated response addresses all aspects of the input query. Ensures that no key details are missed in the AI’s output. Click here to read the eval definition of Completeness

a. Using Interface

Required Parameters
  • Inputs:
    • input: Original text
    • output: AI generated content
Output: Returns a float between 0 and 1, where higher values indicate more complete content

b. Using SDK

result = evaluator.evaluate(
    eval_templates="completeness",
    inputs={
        "input": "Why doesn’t honey go bad?",
        "output": "Honey doesn’t spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)

4. Groundedness

Determines whether the generated response is grounded in the provided context. Verifies that the output is based on accurate and relevant information, avoiding hallucination. Click here to read the eval definition of Groundedness

a. Using Interface

Required Parameters
  • Inputs:
    • output: Generated response from the model
    • input: User-provided input to the model
Output: Returns a float between 0 and 1, where higher values indicate better grounding in source context

b. Using SDK

result = evaluator.evaluate(
    eval_templates="groundedness",
    inputs={
        "input": "Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms. Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.",
        "output": "Honey doesn’t spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes."
    },
    model_name="turing_flash"
)

print(result.eval_results[0].output)
print(result.eval_results[0].reason)
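Since all four evals return scores in the range 0 to 1, a common pattern is to gate responses against minimum thresholds before serving them. The helper below is a hypothetical sketch: the threshold values and the shape of the `scores` dict are assumptions for illustration, not defaults shipped with the SDK.

```python
# Hypothetical quality gate: the eval names are the templates documented
# above, but the threshold values are illustrative assumptions.
THRESHOLDS = {
    "eval_ranking": 0.5,
    "context_relevance": 0.6,
    "completeness": 0.7,
    "groundedness": 0.8,
}

def passes_quality_gate(scores, thresholds=THRESHOLDS):
    """Return (ok, failures), where failures lists evals below threshold.

    `scores` maps an eval template name to the float read from
    result.eval_results[0].output after each evaluate() call.
    """
    failures = [
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    return (not failures, failures)

ok, failures = passes_quality_gate({
    "eval_ranking": 0.9,
    "context_relevance": 0.95,
    "completeness": 0.4,   # answer missed key details
    "groundedness": 0.85,
})
print(ok, failures)  # False ['completeness']
```

Responses that fail the gate can be regenerated with additional context or flagged for review, keeping the RAG system within the quality bounds the evals measure.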