How to Decrease RAG Hallucinations with Future AGI
Objective
This cookbook aims to minimise hallucinations in a typical RAG workflow by carefully assessing and refining key components of the RAG pipeline. The goal is to discover the optimal setup that yields accurate, context-grounded responses using Future AGI's evaluation suite. Using a structured benchmark dataset composed of user questions, retrieved context passages, and model-generated answers, we assess how well different RAG setups utilise the provided information to minimise factual inconsistencies.
This involves tuning three core aspects of a RAG pipeline (chunking strategies, retrieval strategies, and chain strategies) and then assessing every unique combination for its effect on hallucination rates. The result is a quantitative method for selecting RAG configurations with optimal contextual relevance and factual alignment, contributing to the overall trustworthiness of outputs from the RAG application.
About The Dataset
We use a benchmark dataset designed for evaluating response alignment in RAG workflows. It lets us measure how well models use retrieved context to generate relevant responses. The dataset contains the following columns:
- question: The user query that was asked to the language model.
- context: The retrieved text provided to the model to help answer the query.
- answer: The response generated by the model using the given context and question.
Below are a few sample rows from the dataset:
| context | question | answer |
|---|---|---|
| Francisco Rogers found the answer to a search query collar george herbert write my essay constitution research paper ideas definition essay humility … | Who found the answer to a search query collar george herbert essay? | Francisco Rogers found the answer to a search query collar george herbert essay. |
| Game Notes EDM vs BUF Buffalo Sabres (Head Coach: Dan Bylsma) at Edmonton Oilers (Head Coach: Todd McLellan) NHL Game #31, Rogers Place, 2016-10-16 05:00:00PM (GMT -0600) … | Who were the three stars in the NHL game between Buffalo Sabres and Edmonton Oilers? | The three stars were Ryan O’Reilly, Brian Gionta, and Leon Draisaitl. |
Methodology
To systematically reduce hallucinations in RAG workflows, this cookbook adopts a structured evaluation pipeline driven by Future AGI’s automated instrumentation framework. The methodology is centered around three phases: configuration-driven RAG setup, model response generation, and automated evaluation of factual alignment and context adherence.
- Configuration-Driven RAG Setup: The RAG system is parameterised in a configuration file, which enables reproducible experimentation across strategies. The key components include:
  - Chunking Strategy: The input documents are chunked using either `RecursiveCharacterTextSplitter` or `CharacterTextSplitter`.
  - Retrieval Strategy: FAISS-based vector stores perform document retrieval via either `similarity` or `mmr` (Maximal Marginal Relevance) search modes.
  - Chain Strategy: Retrieved documents and the input query are fed into a LangChain-based chain (`stuff`, `map_reduce`, `refine`, or `map_rerank`) to produce final responses via OpenAI's GPT-4o-mini.
- Instrumentation: Evaluation from Future AGI is provided through the `fi_instrumentation` SDK. This setup allows real-time evaluation across the following metrics:
  - Groundedness: Evaluates whether a response is firmly based on the provided context.
  - Context Adherence: Evaluates how well responses stay within the provided context.
  - Context Retrieval Quality: Evaluates the quality of the context retrieved for generating a response.
- Automated Evaluation Execution: A predefined set of queries is executed against each RAG configuration. For each query:
- The RAG pipeline generates a response based on the configured setup.
- Evaluation spans are automatically captured and sent to Future AGI.
- Scores for groundedness, context adherence, and retrieval quality are logged and analysed.
Experimentation
1. Project Structure Overview
2. Configuration File (config.yaml)
Defines all the experiment parameters (a sketch of such a file follows the list), such as:
- API keys, such as OpenAI's and Future AGI's keys (click here to access Future AGI API keys)
- Chunking strategy (`splitter_type`, `chunk_size`)
- Retrieval type (`similarity`, `mmr`)
- Chain strategy (`map_reduce`, `stuff`, `refine`, `map_rerank`)
- Evaluation queries for benchmarking hallucination and context relevance
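A minimal sketch of what such a `config.yaml` might look like. Every field name below is an illustrative assumption (the `future_agi` keys mirror the `config['future_agi'][...]` accesses used later in this cookbook); adapt the schema to however your code reads the config:

```yaml
openai:
  api_key: "sk-..."            # your OpenAI API key

future_agi:
  api_key: "fi-..."            # Future AGI API key
  secret_key: "fi-secret-..."  # Future AGI secret key
  project_name: "rag-hallucination-cookbook"
  project_version: "CharacterTextSplitter_mmr_map_rerank"

chunking:
  splitter_type: "RecursiveCharacterTextSplitter"  # or "CharacterTextSplitter"
  chunk_size: 1000

retrieval:
  search_type: "mmr"           # or "similarity"

chain:
  strategy: "map_rerank"       # stuff | map_reduce | refine | map_rerank

evaluation:
  queries:
    - "Who found the answer to a search query collar george herbert essay?"
    - "Who were the three stars in the NHL game between Buffalo Sabres and Edmonton Oilers?"
```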
3. Installing Required Libraries
Install the essential libraries required for the experimentation performed in this cookbook, covering configuration management, model integration, and LangChain capabilities.
Also install the tracing and observability package provided by Future AGI for your LangChain application.
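The installs might look like the following. The LangChain and FAISS package names are standard; the Future AGI package names are assumptions based on the SDK and instrumentor referenced in this cookbook, so check Future AGI's documentation for the exact names:

```bash
pip install langchain langchain-community langchain-openai faiss-cpu pyyaml
# Future AGI tracing/eval SDK (package names assumed; see Future AGI docs)
pip install fi-instrumentation traceai-langchain
```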
4. Importing Required Libraries
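A sketch of the imports the script likely needs. The LangChain paths below follow the current split-package layout; the `fi_instrumentation` and `traceai_langchain` module paths are assumptions inferred from the identifiers used later in this cookbook:

```python
import argparse
import os

import yaml
from langchain.chains import RetrievalQA
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Future AGI instrumentation (module paths assumed from the SDK references below)
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalName, EvalSpanKind, EvalTag, EvalTagType, ProjectType,
)
from traceai_langchain import LangChainInstrumentor
```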
5. Configuration Loading
Loads settings from a YAML configuration file. These parameters control document loading, chunking strategies, retrieval logic, and model details.
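Building on the imports from step 4, a minimal sketch of the loader:

```python
def load_config(path: str = "config.yaml") -> dict:
    """Read experiment parameters (chunking, retrieval, chain, evals) from YAML."""
    with open(path, "r") as f:
        return yaml.safe_load(f)
```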
6. Environment Setup
This sets the OpenAI API and Future AGI API keys from the config into environment variables.
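A sketch of that step; the Future AGI environment variable names are assumptions, and the config keys mirror the `config.yaml` sketch above:

```python
def setup_environment(config: dict) -> None:
    # Export keys so the OpenAI client and Future AGI SDK can pick them up.
    os.environ["OPENAI_API_KEY"] = config["openai"]["api_key"]
    os.environ["FI_API_KEY"] = config["future_agi"]["api_key"]        # name assumed
    os.environ["FI_SECRET_KEY"] = config["future_agi"]["secret_key"]  # name assumed
```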
7. Instrumentation Setup
It is the process of adding tracing to your LLM applications. Tracing helps you monitor critical metrics like cost, latency, and evaluation results.
Where a span represents a single operation within an execution flow, recording input-output data, execution time, and errors, a trace connects multiple spans to represent the full execution flow of a request.
This experimentation is done to find the configuration of your application that best fits your use case before deploying to production.
7.1 Setting Up Eval Tags
To quantify the performance of each RAG setup combination, a set of evals is chosen according to the use case. Since this cookbook deals with RAG hallucination, the following evals are chosen:
- Groundedness:
  - Evaluates if the response of the model is based on the provided context.
  - Input Mapping:
    - `output`: The generated response from the model.
    - `input`: The user-provided input to the model.
  - Returns a percentage score, where a high score indicates that the `output` is well-grounded in the `input`.
- Context Adherence:
  - Evaluates how well responses stay within the provided context by measuring whether the output contains any information not present in the given context.
  - Input Mapping:
    - `output`: The output response generated by the model.
    - `context`: The context provided to the model.
  - Returns a percentage score, where a high score indicates that the output is more contextually consistent.
- Context Retrieval Quality:
  - Evaluates the quality of the context retrieved for generating a response.
  - Input Mapping:
    - `input`: The user-provided input to the model.
    - `output`: The output response generated by the model.
    - `context`: The context provided to the model.
  - Config:
    - `criteria`: Description of the criteria for evaluation.
  - Returns a percentage score, where a high score indicates that the context is relevant and sufficient to produce an accurate and coherent output.
The `eval_tags` list contains multiple instances of `EvalTag`. Each `EvalTag` represents a specific evaluation configuration to be applied at runtime, encapsulating all necessary parameters for the evaluation process.

Parameters of `EvalTag`:
- `type`: Specifies the category of the evaluation tag. In this cookbook, `EvalTagType.OBSERVATION_SPAN` is used.
- `value`: Defines the kind of operation the evaluation tag is concerned with. `EvalSpanKind.LLM` indicates that the evaluation targets operations involving Large Language Models; `EvalSpanKind.TOOL` targets operations involving tools.
- `eval_name`: The name of the evaluation to be performed.
  - For the Groundedness eval, `EvalName.GROUNDEDNESS`
  - For the Context Adherence eval, `EvalName.CONTEXT_ADHERENCE`
  - For Context Retrieval Quality, `EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY`

  Click here to get the complete list of evals provided by Future AGI
- `config`: Dictionary for providing specific configurations for the evaluation. An empty dictionary means that default configuration parameters will be used. Click here to learn more about what config is required for corresponding evals.
- `mapping`: This dictionary maps the required inputs for the evaluation to specific attributes of the operation. Click here to learn more about what inputs are required for corresponding evals.
- `custom_eval_name`: A user-defined name for the specific evaluation instance.
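Putting these parameters together, a hedged sketch of how the `eval_tags` list might be built. The enum values come straight from the walkthrough above; the `mapping` values (span attribute names) are placeholders you would replace with the attributes your spans actually expose:

```python
def build_eval_tags() -> list:
    """Assemble the three hallucination-focused evals used in this cookbook."""
    return [
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            config={},  # empty dict -> default configuration
            mapping={"input": "raw.input", "output": "raw.output"},  # placeholder attribute names
            custom_eval_name="rag_groundedness",
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            config={},
            mapping={"output": "raw.output", "context": "retrieved.context"},  # placeholders
            custom_eval_name="rag_context_adherence",
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY,
            config={"criteria": "Context should be sufficient to answer the question factually."},
            mapping={
                "input": "raw.input",
                "output": "raw.output",
                "context": "retrieved.context",  # placeholders
            },
            custom_eval_name="rag_context_retrieval_quality",
        ),
    ]
```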
7.2 Setting Up Trace Provider
The trace provider is part of the traceAI ecosystem, which is an OSS package that enables tracing of AI applications and frameworks. It works in conjunction with OpenTelemetry to monitor code executions across different models, frameworks, and vendors.
To configure a `trace_provider`, we need to pass the following parameters to the `register` function:

- `project_type`: Specifies the type of project. In this cookbook, `ProjectType.EXPERIMENT` is used since we are experimenting to find the best RAG setup before deploying to production. `ProjectType.OBSERVE` is used to observe your AI application in production and measure its performance in real time.
- `project_name`: The name of the project. This is dynamically set from a configuration dictionary, `config['future_agi']['project_name']`.
- `project_version_name`: The version name of the project. Like `project_name`, this is dynamically set from the configuration dictionary, `config['future_agi']['project_version']`.
- `eval_tags`: A list of evaluation tags that define specific evaluations to be applied.
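Continuing the script sketched so far (with `config` and `eval_tags` already in scope), the registration might look like:

```python
trace_provider = register(
    project_type=ProjectType.EXPERIMENT,  # experimenting pre-production
    project_name=config["future_agi"]["project_name"],
    project_version_name=config["future_agi"]["project_version"],
    eval_tags=eval_tags,
)
```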
7.3 Setting Up LangChain Instrumentor
This is done to integrate with the LangChain framework for the collection of telemetry data.
The `instrument` method is called on the `LangChainInstrumentor` instance. This method sets up instrumentation of the LangChain framework using the provided `tracer_provider`.

Putting it all together, below is a function that configures `eval_tags` and sets up a `trace_provider`, which is then passed to the `LangChainInstrumentor` instance.
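A hedged sketch of that function, under the same import-path assumptions noted in step 4:

```python
def setup_instrumentation(config: dict) -> None:
    """Configure eval tags, register a trace provider, and instrument LangChain."""
    eval_tags = build_eval_tags()  # from section 7.1
    trace_provider = register(
        project_type=ProjectType.EXPERIMENT,
        project_name=config["future_agi"]["project_name"],
        project_version_name=config["future_agi"]["project_version"],
        eval_tags=eval_tags,
    )
    LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```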
8. RAG Setup
It reads the data, chunks documents, creates embeddings, indexes them in a FAISS vector database, and then builds a LangChain-powered RetrievalQA chain.
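A minimal sketch of that setup, assuming the config schema sketched earlier and a list of LangChain `Document` objects as input. The LangChain calls (`FAISS.from_documents`, `as_retriever`, `RetrievalQA.from_chain_type`) are standard; the config key paths are assumptions:

```python
def build_rag_pipeline(config: dict, documents: list) -> RetrievalQA:
    # 1. Chunk documents with the splitter named in the config.
    splitter_cls = (
        RecursiveCharacterTextSplitter
        if config["chunking"]["splitter_type"] == "RecursiveCharacterTextSplitter"
        else CharacterTextSplitter
    )
    splitter = splitter_cls(chunk_size=config["chunking"]["chunk_size"])
    chunks = splitter.split_documents(documents)

    # 2. Embed and index the chunks in FAISS; expose the configured search mode.
    vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
    retriever = vectorstore.as_retriever(search_type=config["retrieval"]["search_type"])

    # 3. Build the RetrievalQA chain with the configured chain strategy.
    return RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
        chain_type=config["chain"]["strategy"],  # stuff | map_reduce | refine | map_rerank
        retriever=retriever,
    )
```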
9. Query Processing
Runs a single query through the RAG pipeline and retrieves the model’s answer.
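A short sketch; `RetrievalQA` returns a dict whose `"result"` key holds the generated answer:

```python
def process_query(qa_chain: RetrievalQA, query: str) -> str:
    # Each invocation is traced by the instrumentor set up earlier.
    return qa_chain.invoke({"query": query})["result"]
```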
10. Evaluation Execution
It sets up the RAG pipeline and loads queries from the configuration file. For each query, it invokes the pipeline and sends data to Future AGI for scoring.
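A sketch of the evaluation loop, assuming the `evaluation.queries` config key from the earlier `config.yaml` sketch:

```python
def run_evaluation(config: dict, documents: list) -> None:
    qa_chain = build_rag_pipeline(config, documents)
    # Each invocation produces traced spans, so eval scores flow to Future AGI.
    for query in config["evaluation"]["queries"]:
        answer = process_query(qa_chain, query)
        print(f"Q: {query}\nA: {answer}\n")
```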
11. Main Function
It parses command-line arguments, loads the config, sets up environment variables and instrumentation, and runs the full evaluation process.
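A sketch tying the steps together; `load_documents` is a hypothetical loader for the benchmark documents, not part of any library shown above:

```python
def main() -> None:
    parser = argparse.ArgumentParser(description="Run RAG hallucination evals")
    parser.add_argument("--config", default="config.yaml")
    args = parser.parse_args()

    config = load_config(args.config)
    setup_environment(config)
    setup_instrumentation(config)
    documents = load_documents(config)  # hypothetical loader for the benchmark docs
    run_evaluation(config, documents)


if __name__ == "__main__":
    main()
```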
Result
Future AGI’s automated scoring framework was used to assess each experimental run and establish which RAG configuration was the most effective. The evaluation covered both quality metrics (groundedness, context adherence, and retrieval quality) and system metrics such as cost and latency. A weighted preference model was employed to rank the runs, reflecting real-world tradeoffs between performance and efficiency.
Inside the ‘Choose Winner’ option provided in the top right corner of All Runs, the evaluation sliders were positioned to place higher value on model accuracy than on operational efficiency. Weights were assigned as follows:
This setup prioritises accuracy and contextual alignment while keeping cost, latency, and responsiveness reasonable.
The winning configuration was CharacterTextSplitter_mmr_map_rerank, which combines character-based chunking, MMR (Maximal Marginal Relevance) retrieval, and map-rerank generation. This approach provides a solid trade-off between reliability and resource efficiency, making it a good fit for production-level RAG pipelines where minimising hallucinations is a concern.
Frequently Asked Questions (FAQs)
-
Will I be able to re-use this evaluation setup for other RAG use cases or datasets?
Yes. The evaluation pipeline described in this cookbook is configuration-based and task-agnostic. The instrumentation and metric setup applies to any RAG dataset.
-
Will I require labeled data in order to evaluate the hallucinations when using Future AGI?
No. Future AGI performs model-based evaluation: it rates your outputs with AI evaluators without needing labeled ground-truth answers beforehand. This enables rapid, scalable testing across configurations without the manual annotation burden.
-
I am using a different framework for my RAG application. Can I still use Future AGI for evaluation purposes?
Yes. Future AGI is compatible with a variety of frameworks via automatic tracing and SDK integrations, including LangChain, Haystack, DSPy, LlamaIndex, Instructor, CrewAI, and others. With little to no setup, most major RAG stacks can have their evaluations instrumented.
-
How can I be sure my RAG pipeline isn’t hallucinating?
One way to identify hallucinations is to check if the responses generated by the model are directly supported by the context that is retrieved. This way, you will be able to measure factual alignment with automated metrics like Groundedness and Context Adherence instead of human reviewers.
-
Can I create custom evaluations tailored to my RAG use case in Future AGI?
Yes. The Deterministic Eval template in Future AGI supports custom evaluations (Click here to learn more about deterministic eval). This lets you apply stringent criteria to your RAG outputs minimising variability.
Ready to Reduce Hallucinations in Your RAG Applications?
Start evaluating your RAG workflows with confidence using Future AGI’s automated, no-label-required evaluation framework. Future AGI provides the tools you need to systematically reduce hallucinations.