Retrieval-Augmented Generation (RAG) is a technique that combines the generative capabilities of LLMs with external data sources. Evaluating RAG systems is critical to ensuring that generated outputs are contextually relevant, factually accurate, and aligned with the input queries.

A typical RAG architecture includes a retriever and a generator. The retriever uses similarity search over a knowledge base of vector embeddings, returning the most relevant vectors. The generator then takes the retriever's output and produces a final answer in natural language.

Without frequent evaluation, a RAG system can produce hallucinated responses or rely on irrelevant or outdated context, leading to incorrect or incomplete answers. By implementing a structured evaluation framework, developers can ensure their RAG systems perform reliably, remain contextually grounded, and deliver results that align with user expectations and business requirements.

Evaluating RAG applications involves analysing the interaction between the retrieved contexts, the AI-generated responses, and the queries to guarantee precision, coherence, and completeness. These evaluations help optimise the performance of RAG-based systems in applications like search engines, knowledge retrieval, and Q&A systems.

Below are the key evals used to assess RAG applications:
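As a reference point for the evals that follow, the retriever–generator flow described above can be sketched as a minimal pipeline. This is an illustrative toy, not a real implementation: the keyword-overlap `retrieve` function stands in for vector similarity search, and `generate` only builds the grounded prompt a real system would send to an LLM.

```python
def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words token set."""
    return set(text.lower().split())


def retrieve(query, knowledge_base, k=2):
    """Rank documents by token overlap with the query -- a crude proxy
    for cosine similarity over real vector embeddings."""
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(embed(query) & embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def generate(query, contexts):
    """Stand-in for the LLM call: builds a context-grounded prompt.
    A real system would send this prompt to a model and return its answer."""
    joined = "\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


knowledge_base = [
    "Honey never spoils because of its low moisture content and high acidity.",
    "The Great Wall of China is visible from low Earth orbit.",
]
contexts = retrieve("Why does honey never go bad?", knowledge_base, k=1)
prompt = generate("Why does honey never go bad?", contexts)
```

Every eval below probes one link in this chain: whether the right contexts were retrieved, and whether the final answer actually uses them.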
Eval Ranking ranks retrieved contexts based on their relevance to the query. This evaluation ensures that the most relevant pieces of context are prioritised during response generation.
```python
result = evaluator.evaluate(
    eval_templates="eval_ranking",
    inputs={
        "context": '''Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms. These properties make it inhospitable for most microbes that would typically cause food to rot or ferment. Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.''',
        "input": "Why doesn’t honey go bad?",
        "output": "Honey doesn’t spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes.",
    },
    model_name="turing_flash",
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```
Context Relevance evaluates whether the retrieved context is relevant to the input query.

```python
result = evaluator.evaluate(
    eval_templates="context_relevance",
    inputs={
        "context": "Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms. Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.",
        "input": "Why doesn’t honey go bad?",
    },
    model_name="turing_flash",
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```
Completeness checks whether the generated response fully answers the input query.

```python
result = evaluator.evaluate(
    eval_templates="completeness",
    inputs={
        "input": "Why doesn’t honey go bad?",
        "output": "Honey doesn’t spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes.",
    },
    model_name="turing_flash",
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```
Groundedness determines whether the generated response is grounded in the provided context. It verifies that the output is based on accurate and relevant information, avoiding hallucination.
```python
result = evaluator.evaluate(
    eval_templates="groundedness",
    inputs={
        "input": "Honey never spoils because it has low moisture content and high acidity, creating an environment that resists bacteria and microorganisms. Archaeologists have even found pots of honey in ancient Egyptian tombs that are still perfectly edible.",
        "output": "Honey doesn’t spoil because its low moisture and high acidity prevent the growth of bacteria and other microbes.",
    },
    model_name="turing_flash",
)
print(result.eval_results[0].output)
print(result.eval_results[0].reason)
```
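In practice these templates are rarely run one call at a time; the same pattern extends to batching every template over a dataset of query/context/response examples. The sketch below assumes an `evaluator` object exposing the same `evaluate(...)` interface used in the snippets above; `run_evals` and its parameters are illustrative helpers, not part of any library.

```python
def run_evals(evaluator, dataset, templates, model_name="turing_flash"):
    """Run each eval template over every example and collect the results.

    `dataset` is a list of `inputs` dicts shaped like the snippets above
    (e.g. {"input": ..., "context": ..., "output": ...}).
    """
    report = []
    for inputs in dataset:
        for template in templates:
            result = evaluator.evaluate(
                eval_templates=template,
                inputs=inputs,
                model_name=model_name,
            )
            # Keep the score and the judge's reasoning side by side,
            # so failures can be traced back to a specific template.
            report.append({
                "template": template,
                "output": result.eval_results[0].output,
                "reason": result.eval_results[0].reason,
            })
    return report
```

Aggregating the per-template outputs this way makes it easy to spot which link in the pipeline (retrieval relevance, completeness, or groundedness) degrades as the knowledge base or prompts change.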