How to Decrease RAG Hallucinations with Future AGI
Objective
This cookbook aims to minimise hallucinations in a typical RAG workflow by carefully assessing and refining key components of the RAG pipeline. The goal is to discover the optimal configuration, one that yields accurate, context-grounded responses, using Future AGI’s evaluation suite. Using a structured benchmark dataset composed of user questions, retrieved context passages, and model-generated answers, we assess how well different RAG setups utilise the provided information to minimise factual inconsistencies.
This involves tuning three core aspects of a RAG pipeline: chunking strategies, retrieval strategies, and chain strategies, and then assessing every unique combination for its effect on hallucination rates. The result is a quantitative method for selecting the RAG configuration with the best contextual relevance and factual alignment, improving the overall trustworthiness of the RAG application’s outputs.
About The Dataset
We use a benchmark dataset to evaluate response alignment in RAG workflows. It allows us to measure how models use retrieved context to generate relevant responses. The dataset contains the following columns:
- question: The user query that was asked to the language model.
- context: The retrieved text provided to the model to help answer the query.
- answer: The response generated by the model using the given context and question.
Methodology
To systematically reduce hallucinations in RAG workflows, this cookbook adopts a structured evaluation pipeline driven by Future AGI’s automated instrumentation framework. The methodology is centered around three phases: configuration-driven RAG setup, model response generation, and automated evaluation of factual alignment and context adherence.
- Configuration-Driven RAG Setup: The RAG system is parameterised in a configuration file, which enables reproducible experimentation across strategies; every unique combination of the three strategies below is evaluated (a sketch of the resulting grid follows this list). The key components include:
    - Chunking Strategy: The input documents are chunked using either `RecursiveCharacterTextSplitter` or `CharacterTextSplitter`.
    - Retrieval Strategy: FAISS-based vector stores perform document retrieval via either `similarity` or `mmr` (Maximal Marginal Relevance) search modes.
    - Chain Strategy: The retrieved documents and the input query are fed into a LangChain-based chain (`stuff`, `map_reduce`, `refine`, or `map_rerank`) to produce final responses via OpenAI’s GPT-4o-mini.
- Instrumentation: Future AGI’s evaluation is provided through the `fi_instrumentation` SDK. This setup allows real-time evaluation across the following metrics:
    - Groundedness: Evaluates whether a response is firmly based on the provided context. (Learn more)
    - Context Adherence: Evaluates how well a response stays within the provided context. (Learn more)
    - Context Retrieval Quality: Evaluates the quality of the context retrieved for generating a response. (Learn more)
Click here to learn how to setup trace provider in Future AGI
- Automated Evaluation Execution: A predefined set of queries is executed against each RAG configuration. For each query:
- The RAG pipeline generates a response based on the configured setup.
- Evaluation spans are automatically captured and sent to Future AGI.
- Scores for groundedness, context adherence, and retrieval quality are logged and analysed.
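Concretely, the two chunkers, two retrieval modes, and four chain types give 2 × 2 × 4 = 16 unique configurations. A minimal sketch of enumerating that grid (the run names are illustrative, though the Result section’s winner follows the same chunker_retriever_chain pattern):

```python
from itertools import product

# The three strategy axes from the methodology above (2 x 2 x 4 = 16 runs).
CHUNKERS = ["RecursiveCharacterTextSplitter", "CharacterTextSplitter"]
RETRIEVERS = ["similarity", "mmr"]
CHAINS = ["stuff", "map_reduce", "refine", "map_rerank"]

def all_configurations():
    """Yield one run name per unique (chunker, retriever, chain) combination."""
    for chunker, retriever, chain in product(CHUNKERS, RETRIEVERS, CHAINS):
        yield f"{chunker}_{retriever}_{chain}"

for run_name in all_configurations():
    print(run_name)  # e.g. CharacterTextSplitter_mmr_map_rerank
```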
Experimentation
1. Importing Required Libraries
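The cookbook’s exact import list is not reproduced here; a plausible minimal set, assuming the post-split LangChain packages (`langchain-community`, `langchain-openai`, `langchain-text-splitters`) plus PyYAML, would be:

```python
import argparse
import os

import yaml
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
```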
2. Configuration Loading
Loads settings from a YAML configuration file. These parameters control document loading, chunking strategies, retrieval logic, and model details.
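A minimal sketch of the loader; the `config.yaml` filename is illustrative:

```python
import yaml

def load_config(path: str) -> dict:
    """Load experiment settings (chunking, retrieval, chain, model) from YAML."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

config = load_config("config.yaml")
```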
3. Environment Setup
This sets the OpenAI and Future AGI API keys from the config as environment variables.
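A sketch of that step; `FI_API_KEY` and `FI_SECRET_KEY` are the environment variables the Future AGI SDK reads, while the YAML key names are assumptions:

```python
import os

def setup_environment(config: dict) -> None:
    """Export API keys so the OpenAI and Future AGI SDKs can pick them up."""
    os.environ["OPENAI_API_KEY"] = config["openai_api_key"]  # YAML key names are
    os.environ["FI_API_KEY"] = config["fi_api_key"]          # illustrative; match
    os.environ["FI_SECRET_KEY"] = config["fi_secret_key"]    # your own config file
```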
4. Future AGI Instrumentation Setup
This section defines the evaluation metrics used to score each RAG response.
Click here to learn more about setting up instrumentation
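As a rough sketch, registering a trace provider with `fi_instrumentation` and auto-instrumenting LangChain might look like the following. The `EvalTag` fields, enum members, and mapping keys are assumptions based on Future AGI’s public examples; check the documentation linked above for the exact identifiers:

```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalName,
    EvalSpanKind,
    EvalTag,
    EvalTagType,
    ProjectType,
)
from traceai_langchain import LangChainInstrumentor

# Register a trace provider for this experiment. Each EvalTag attaches one of
# Future AGI's model-based metrics to the spans the pipeline emits.
trace_provider = register(
    project_type=ProjectType.EXPERIMENT,
    project_name="rag-hallucination-cookbook",  # illustrative name
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,  # enum member is an assumption
            custom_eval_name="Groundedness",
            mapping={"input": "raw.input", "output": "raw.output"},  # illustrative
        ),
        # ...add one EvalTag each for Context Adherence and
        # Context Retrieval Quality in the same fashion.
    ],
)

# Auto-instrument LangChain so every chain invocation produces spans
# without any manual tracing code.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```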
5. RAG Setup
It reads the data, chunks documents, creates embeddings, indexes them in a FAISS vector store, and then builds a LangChain-powered RetrievalQA chain.
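A sketch of such a builder, assuming the config keys from the sketches above (`data_path`, `splitter_type`, `chunk_size`, `retrieval_type`, and `chain_type` are illustrative names, not necessarily the cookbook’s exact ones):

```python
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

def build_rag_chain(config: dict) -> RetrievalQA:
    """Load -> chunk -> embed -> index -> chain, as described above."""
    docs = TextLoader(config["data_path"]).load()

    # Chunking strategy: recursive or plain character splitting.
    splitter_cls = (
        RecursiveCharacterTextSplitter
        if config["splitter_type"] == "RecursiveCharacterTextSplitter"
        else CharacterTextSplitter
    )
    splitter = splitter_cls(
        chunk_size=config["chunk_size"],
        chunk_overlap=config.get("chunk_overlap", 0),
    )
    chunks = splitter.split_documents(docs)

    # Retrieval strategy: a FAISS index queried via "similarity" or "mmr".
    vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
    retriever = vectorstore.as_retriever(search_type=config["retrieval_type"])

    # Chain strategy: stuff / map_reduce / refine / map_rerank over GPT-4o-mini.
    return RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
        chain_type=config["chain_type"],
        retriever=retriever,
    )
```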
6. Query Processing
Runs a single query through the RAG pipeline and retrieves the model’s answer.
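For a `RetrievalQA` chain, the question goes in under the `query` key and the answer comes back under `result`:

```python
def run_query(chain, question: str) -> str:
    """Run one query through the RAG pipeline and return the model's answer."""
    result = chain.invoke({"query": question})  # RetrievalQA expects "query"
    return result["result"]                     # ...and answers under "result"
```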
7. Evaluation Execution
It sets up the RAG pipeline and loads queries from the configuration file. For each query, it invokes the pipeline and sends data to Future AGI for scoring.
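A sketch of the loop, assuming the queries live under an illustrative `evaluation_queries` key:

```python
def run_evaluation(config: dict) -> None:
    """Run every benchmark query through the configured pipeline.

    Evaluation spans are exported to Future AGI automatically by the
    instrumentation registered in step 4; nothing is sent by hand.
    """
    chain = build_rag_chain(config)                 # step 5
    for question in config["evaluation_queries"]:   # key name is illustrative
        answer = run_query(chain, question)         # step 6
        print(f"Q: {question}\nA: {answer}\n")
```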
8. Main Function
It parses command-line arguments, loads the config, sets up environment variables and instrumentation, and runs the full evaluation process.
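Tying the sketches together, a hypothetical entry point might look like:

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Evaluate a RAG configuration with Future AGI"
    )
    parser.add_argument("--config", default="config.yaml",
                        help="Path to the YAML config file")
    args = parser.parse_args()

    config = load_config(args.config)  # step 2
    setup_environment(config)          # step 3
    # The trace provider from step 4 must be registered before any queries
    # run so that evaluation spans are captured.
    run_evaluation(config)             # step 7

if __name__ == "__main__":
    main()
```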
9. Configuration File
Defines all the experiment parameters (an illustrative file is sketched after this list), such as:
- API keys and service URLs
- Chunking strategy (`splitter_type`, `chunk_size`)
- Retrieval type (`similarity`, `mmr`)
- Chain strategy (`map_reduce`, `stuff`, `refine`, `map_rerank`)
- Evaluation queries for benchmarking hallucination and context relevance
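For reference, an illustrative config consistent with the sketches above; all keys and values are examples, not the cookbook’s exact file:

```yaml
# Illustrative config.yaml
openai_api_key: "sk-..."        # OpenAI credentials
fi_api_key: "fi-..."            # Future AGI credentials
fi_secret_key: "..."

data_path: "data/corpus.txt"    # document(s) to index

splitter_type: "CharacterTextSplitter"  # or "RecursiveCharacterTextSplitter"
chunk_size: 1000
chunk_overlap: 100

retrieval_type: "mmr"           # or "similarity"
chain_type: "map_rerank"        # stuff | map_reduce | refine | map_rerank

evaluation_queries:
  - "What does the policy say about data retention?"
  - "Which regions does the warranty cover?"
```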
Result
Future AGI’s automated scoring framework was used to assess each experimental run and establish which RAG configuration was most effective. The evaluation covered both quality metrics (groundedness, context adherence, and context retrieval quality) and system metrics such as cost and latency. A weighted preference model was then used to rank the runs, reflecting real-world tradeoffs between performance and efficiency.
Inside the ‘Choose Winner’ option in the top right corner of the All Runs view, the evaluation sliders were positioned to place higher value on model accuracy than on operational efficiency. Weights were assigned as follows:
- Evaluation Metrics:
- Avg. Groundedness: 8/10
- Avg. Context Adherence: 8/10
- Avg. Context Retrieval Quality: 8/10
- System Metrics:
- Avg. Cost: 6/10
- Avg. Latency: 6/10
This setup prioritises accuracy and contextual alignment while keeping cost and responsiveness at reasonable levels.
The winning configuration was CharacterTextSplitter_mmr_map_rerank, which combines character-based chunking, MMR (Maximal Marginal Relevance) retrieval, and a map-rerank generation chain. This approach offers a solid trade-off between reliability and resource efficiency, making it a good fit for production-level RAG pipelines where hallucination minimisation is a concern.
Frequently Asked Questions (FAQs)
- Will I be able to re-use this evaluation setup for other RAG use cases or datasets?
Yes. The evaluation pipeline described in this cookbook is configuration-driven and task-agnostic. The same instrumentation and metric setup applies to any RAG dataset.
- Will I require labeled data in order to evaluate hallucinations when using Future AGI?
No. Future AGI performs model-based evaluation: it rates your outputs with AI evaluators without needing labeled ground-truth answers beforehand. This enables rapid, scalable testing across configurations without the manual annotation burden.
- I am using a different framework for my RAG application. Can I still use Future AGI for evaluation purposes?
Yes. It is compatible with a variety of frameworks via automatic tracing and SDK integrations, such as LangChain, Haystack, DSPy, LlamaIndex, Instructor, Crew AI, and others. With little to no setup, most major RAG stacks can have their evaluations instrumented.
- How can I be sure my RAG pipeline isn’t hallucinating?
One way to identify hallucinations is to check whether the responses generated by the model are directly supported by the retrieved context. This lets you measure factual alignment with automated metrics like Groundedness and Context Adherence instead of relying on human reviewers.
- Can I create custom evaluations tailored to my RAG use case in Future AGI?
Yes. The Deterministic Eval template in Future AGI supports custom evaluations (Click here to learn more about deterministic eval). This lets you apply stringent criteria to your RAG outputs, minimising variability.
Ready to Reduce Hallucinations in Your RAG Applications?
Start evaluating your RAG workflows with confidence using Future AGI’s automated, no-label-required evaluation framework. Future AGI provides the tools you need to systematically reduce hallucinations.