Evaluate RAG Applications
Retrieval-Augmented Generation (RAG) is a technique that combines the generative capabilities of LLMs with external data sources. Evaluating RAG systems is critical to ensuring that the generated outputs are contextually relevant, factually accurate, and aligned with the input queries.
A typical RAG architecture includes a retriever and a generator. The retriever runs a similarity search over a knowledge base of vector embeddings and returns the most relevant documents. The generator then takes the retriever's output and produces a final answer in natural language, as sketched below.
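As a rough illustration of this flow, the sketch below assumes a pre-built `vector_index` with a `search()` method and a generic `llm.generate()` call; both are placeholder names rather than a specific library.

```python
# Minimal retrieve-then-generate sketch. `vector_index.search()` and
# `llm.generate()` are placeholders for your embedding store and LLM client.
def answer_query(query: str, vector_index, llm, top_k: int = 3) -> str:
    # Retriever: similarity search over the embedded knowledge base
    retrieved_docs = vector_index.search(query, top_k=top_k)
    context = "\n".join(doc.text for doc in retrieved_docs)

    # Generator: produce a natural-language answer from the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt)
```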
Without frequent evaluation, a RAG system can generate hallucinated responses or rely on irrelevant or outdated context, leading to incorrect or incomplete answers. By implementing a structured evaluation framework, developers can ensure their RAG systems perform reliably, remain contextually grounded, and deliver results that align with user expectations and business requirements.
Evaluating RAG applications involves analysing the interaction between retrieved contexts, the AI-generated responses, and the queries to guarantee precision, coherence, and completeness. These evaluations help optimise the performance of RAG-based systems in applications like search engines, knowledge retrieval, and Q&A systems.
Below are the key evals used to assess RAG applications:
- Eval Context Retrieval Quality
- Eval Ranking
- Context Relevance
- Context Similarity
- Answer Similarity
- Completeness
- Groundedness
1. Eval Context Retrieval Quality
Assesses the relevance and quality of retrieved contexts in relation to the input query. Ensures that the selected contexts provide sufficient and accurate information for generating responses.
Click here to read the eval definition of Context Retrieval Quality
a. Using Interface
Required Parameters
- Inputs:
- input: Query/question
- output: Generated response
- context: Retrieved context
- Config:
- check_internet: Boolean - Whether to verify against internet sources
Output: Returns a float between 0 and 1, where higher values indicate better context retrieval quality
b. Using SDK
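A minimal sketch of what the SDK call might look like. The import path, `EvalClient` class, and `evaluate()` signature are illustrative assumptions rather than the documented API; only the input and config keys mirror the parameters listed above.

```python
# Illustrative sketch -- the import path, EvalClient class, and evaluate()
# signature are assumptions; only the parameter names mirror the docs above.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="context_retrieval_quality",
    inputs={
        "input": "What is the return window for online orders?",
        "output": "Online orders can be returned within 30 days of delivery.",
        "context": "Our policy allows returns of online orders within 30 days of delivery.",
    },
    config={"check_internet": False},  # set True to verify against internet sources
)

print(result.score)  # float in [0, 1]; higher means better context retrieval quality
```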
2. Eval Ranking
Ranks retrieved contexts based on their relevance to the query. This evaluation ensures that the most relevant pieces of context are prioritised during response generation.
Click here to read the eval definition of Eval Ranking
a. Using Interface
Required Parameters
- Inputs:
- input: Query
- context: List of contexts to rank
- Config:
- criteria: Ranking criteria description
Output: Returns a float between 0 and 1, where higher values indicate better ranking quality
b. Using SDK
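A hedged sketch reusing the same assumed `EvalClient` interface as above; the ranking criteria string is an illustrative example.

```python
# Illustrative sketch -- client and method names are assumptions.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="ranking",
    inputs={
        "input": "How do I reset my router?",
        "context": [
            "Hold the reset button for 10 seconds to restore factory settings.",
            "The router supports dual-band Wi-Fi.",
            "Unplug the router, wait 30 seconds, then plug it back in.",
        ],
    },
    config={"criteria": "Rank contexts by how directly they answer the query"},
)

print(result.score)  # float in [0, 1]; higher means better ranking quality
```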
3. Context Relevance
Determines the relevancy of retrieved context in addressing the user query. Evaluates whether the retrieved data aligns with the requirements of the query.
Click here to read the eval definition of Context Relevance
a. Using Interface
Required Parameters
- Inputs:
- context: Context provided
- input: Query
- output: Generated response
- Config:
- model: LLM model to use
- check_internet: Boolean (True/False)
Output: Returns a float between 0 and 1, where higher values indicate more relevant context
b. Using SDK
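Another sketch under the same assumed client; the model name is a placeholder for whichever LLM your deployment supports.

```python
# Illustrative sketch -- client, method names, and model name are assumptions.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="context_relevance",
    inputs={
        "context": "The premium plan includes 24/7 phone support and a dedicated account manager.",
        "input": "What support do premium customers get?",
        "output": "Premium customers get 24/7 phone support and a dedicated account manager.",
    },
    config={
        "model": "gpt-4o-mini",   # placeholder model name
        "check_internet": False,
    },
)

print(result.score)  # float in [0, 1]; higher means more relevant context
```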
4. Context Similarity
Compares the provided context with the expected or ideal context. Checks the semantic similarity between the two contexts to validate their relevance.
Click here to read the eval definition of Context Similarity
a. Using Interface
Required Parameters
- Inputs:
- context: Reference context
- response: Generated response
- Config:
- Comparator: The comparison method to use
- Failure Threshold: Float - The score threshold below which the evaluation fails (e.g., 0.7)
Output: Returns a float between 0 and 1, where values ≥ failure_threshold indicate sufficient similarity
b. Using SDK
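A sketch under the same assumed client; the `comparator` and `failure_threshold` keys are snake_case guesses at the Comparator and Failure Threshold settings listed above.

```python
# Illustrative sketch -- client, method names, and config keys are assumptions.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="context_similarity",
    inputs={
        "context": "Shipping to EU countries takes 5-7 business days.",
        "response": "Orders shipped to the EU typically arrive within 5 to 7 business days.",
    },
    config={
        "comparator": "cosine_similarity",  # example comparison method
        "failure_threshold": 0.7,           # scores below this fail the eval
    },
)

print(result.score)  # float in [0, 1]; >= failure_threshold means sufficient similarity
```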
5. Answer Similarity
Evaluates how closely the AI-generated response matches the expected response. Used to ensure that the system produces accurate and precise answers based on the input query.
Click here to read the eval definition of Answer Similarity
a. Using Interface
Required Parameters
- Inputs:
- response: String - Generated answer
- expected_response: String - Reference answer
- Config:
- Comparator: The comparison method to use
- Failure Threshold: Float - The score threshold below which the evaluation fails (e.g., 0.7)
Output: Returns a float between 0 and 1, where values ≥ failure_threshold indicate sufficient similarity
b. Using SDK
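Same assumed interface as the previous sketches; the reference answer shown here is example data.

```python
# Illustrative sketch -- client, method names, and config keys are assumptions.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="answer_similarity",
    inputs={
        "response": "The capital of Australia is Canberra.",
        "expected_response": "Canberra is the capital city of Australia.",
    },
    config={
        "comparator": "cosine_similarity",  # example comparison method
        "failure_threshold": 0.7,           # scores below this fail the eval
    },
)

print(result.score)  # float in [0, 1]; >= failure_threshold means sufficient similarity
```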
6. Completeness
Evaluates whether the generated response addresses all aspects of the input query. Ensures that no key details are missed in the AI’s output.
Click here to read the eval definition of Completeness
a. Using Interface
Required Parameters
- Inputs:
- input: Original text
- output: AI generated content
Output: Returns a float between 0 and 1, where higher values indicate more complete content
b. Using SDK
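Same assumed client interface; Completeness takes no config parameters, so only the inputs are passed.

```python
# Illustrative sketch -- client and method names are assumptions.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="completeness",
    inputs={
        "input": "List the steps to set up two-factor authentication and explain why it matters.",
        "output": "Go to Settings > Security, choose 'Enable 2FA', and scan the QR code with an authenticator app.",
    },
)

print(result.score)  # float in [0, 1]; higher means the response covers more of the query
```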
7. Groundedness
Determines whether the generated response is grounded in the provided context. Verifies that the output is based on accurate and relevant information, avoiding hallucination.
Click here to read the eval definition of Groundedness
a. Using Interface
Required Parameters
- Inputs:
- response: Generated response
- context: Source context
- Config:
- model: LLM model to use
Output: Returns a float between 0 and 1, where higher values indicate better grounding in the source context
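b. Using SDK
A final sketch under the same assumed client interface; the model name remains a placeholder.

```python
# Illustrative sketch -- client, method names, and model name are assumptions.
from eval_sdk import EvalClient  # hypothetical import

client = EvalClient(api_key="YOUR_API_KEY")

result = client.evaluate(
    eval_name="groundedness",
    inputs={
        "response": "The museum is open until 8 pm on Fridays.",
        "context": "Opening hours: Mon-Thu 9 am-6 pm, Fri 9 am-8 pm, closed weekends.",
    },
    config={"model": "gpt-4o-mini"},  # placeholder model name
)

print(result.score)  # float in [0, 1]; higher means better grounding in the source context
```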