Objective

This cookbook aims to minimise hallucinations in a typical RAG workflow by carefully assessing and refining key components of the RAG pipeline. The goal is to use Future AGI’s evaluation suite to discover the configuration that yields accurate, context-grounded responses. Using a structured benchmark dataset composed of user questions, retrieved context passages, and model-generated answers, we assess how well different RAG setups use the provided information to minimise factual inconsistencies.

This involves tuning three core aspects of a RAG pipeline (chunking strategy, retrieval strategy, and chain strategy) and then assessing every unique combination for its effect on hallucination rates. The result is a quantitative method for selecting the RAG configuration with the best contextual relevance and factual alignment, improving the overall trustworthiness of the RAG application’s outputs.


About The Dataset

We use a benchmark dataset designed to evaluate response alignment in RAG workflows. It lets us measure how well models use retrieved context to generate relevant responses. The dataset contains the following columns (a quick schema check is sketched below the list):

  • question: The user query that was asked to the language model.
  • context: The retrieved text provided to the model to help answer the query.
  • answer: The response generated by the model using the given context and question.
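
Before wiring the dataset into the pipeline, it is worth confirming that the CSV actually exposes these three columns. A minimal sketch of such a check, assuming the benchmark is saved as data.csv (the path used in the configuration file later in this cookbook):

import csv

EXPECTED_COLUMNS = {"question", "context", "answer"}

with open("data.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    columns = set(reader.fieldnames or [])
    missing = EXPECTED_COLUMNS - columns
    if missing:
        print(f"Warning: missing expected columns: {missing}")
    else:
        print(f"Found all expected columns: {sorted(EXPECTED_COLUMNS)}")
    # Peek at the first row to confirm the content looks sensible.
    first_row = next(reader, None)
    if first_row:
        print({k: first_row[k][:80] for k in EXPECTED_COLUMNS & columns})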

Methodology

To systematically reduce hallucinations in RAG workflows, this cookbook adopts a structured evaluation pipeline driven by Future AGI’s automated instrumentation framework. The methodology is centered around three phases: configuration-driven RAG setup, model response generation, and automated evaluation of factual alignment and context adherence.

  • Configuration-Driven RAG Setup: The RAG system is parameterised in a configuration file, which enables reproducible experimentation across strategies (the full experiment grid is sketched after this list). The key components are:
    • Chunking Strategy: The input documents are chunked using either RecursiveCharacterTextSplitter or CharacterTextSplitter.
    • Retrieval Strategy: FAISS-based vector stores perform document retrieval via either similarity or mmr (Maximal Marginal Relevance) search modes.
    • Chain Strategy: The retrieved documents and the input query are fed into a LangChain-based chain (stuff, map_reduce, refine, or map_rerank) to generate the final response via OpenAI’s GPT-4o-mini.
  • Instrumentation: Future AGI’s evaluations are attached through the fi_instrumentation SDK. This setup enables real-time evaluation across the following metrics:
    • Groundedness: Evaluates whether a response is firmly based on the provided context. (Learn more)
    • Context Adherence: Evaluates how well the response stays within the provided context. (Learn more)
    • Context Retrieval Quality: Evaluates the quality of the context retrieved for generating a response. (Learn more)

Click here to learn how to set up the trace provider in Future AGI

  • Automated Evaluation Execution: A predefined set of queries is executed against each RAG configuration. For each query:
    • The RAG pipeline generates a response based on the configured setup.
    • Evaluation spans are automatically captured and sent to Future AGI.
    • Scores for groundedness, context adherence, and retrieval quality are logged and analysed.
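
Since the three strategies vary independently, the experiment grid is easy to enumerate: two chunkers, two retrieval modes, and four chain types yield 16 configurations. A minimal sketch of the grid, using the same naming convention as the project_version field in the configuration file shown later:

from itertools import product

splitters = ["RecursiveCharacterTextSplitter", "CharacterTextSplitter"]
search_types = ["similarity", "mmr"]
chain_types = ["stuff", "map_reduce", "refine", "map_rerank"]

# 2 chunkers x 2 retrieval modes x 4 chain types = 16 configurations
for splitter, search, chain in product(splitters, search_types, chain_types):
    print(f"{splitter}_{search}_{chain}")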

Experimentation

1. Importing Required Libraries

import os
import csv
import yaml
import argparse
import traceback

from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)
from langchain.chains import RetrievalQA
from traceai_langchain import LangChainInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalName,
    EvalSpanKind,
    EvalTag,
    EvalTagType,
    ProjectType
)

2. Configuration Loading

Loads settings from a YAML configuration file. These parameters control document loading, chunking strategies, retrieval logic, and model details.

def load_config(config_path: str) -> dict:
    try:
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)
        print(f"Configuration loaded successfully from {config_path}")
        return config
    except FileNotFoundError:
        print(f"Error: Configuration file not found at {config_path}")
        exit(1)
    except yaml.YAMLError as e:
        print(f"Error parsing YAML file {config_path}: {e}")
        exit(1)
    except Exception as e:
        print(f"An unexpected error occurred while loading config: {e}")
        exit(1)

3. Environment Setup

This sets the OpenAI and Future AGI API keys from the config as environment variables.

def setup_environment(config: dict):
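    # API keys are read from config.yaml for convenience; in shared or
    # production environments, consider exporting these variables outside the
    # script rather than storing secrets in the config file.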
    os.environ["FI_API_KEY"] = config['future_agi'].get('api_key')
    os.environ["FI_SECRET_KEY"] = config['future_agi'].get('secret_key')
    os.environ["OPENAI_API_KEY"] = config['openai'].get('api_key')
    os.environ["FI_BASE_URL"] = config['future_agi'].get('base_url', os.environ.get('FI_BASE_URL', 'https://api.futureagi.com'))

4. Future AGI Instrumentation Setup

This section defines the evaluation metrics used to score each RAG response.

Click here to learn more about setting up instrumentation

def setup_instrumentation(config: dict):
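    # Each EvalTag maps an evaluation's inputs to span attributes captured by
    # the LangChain instrumentor. The message indices used below reflect how
    # this chain lays out its prompt; if your chain builds prompts differently,
    # inspect a captured trace and adjust the attribute paths accordingly.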
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            config={},
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content"
            },
            custom_eval_name="Groundedness"
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            config={},
            mapping={
                "context": "llm.input_messages.0.message.content",
                "output": "llm.output_messages.0.message.content"
            },
            custom_eval_name="Context_Adherence"
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY,
            config={
                "criteria": "Evaluate if the context is relevant and sufficient to support the output."
            },
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
                "context": "llm.input_messages.0.message.content"
            },
            custom_eval_name="Context_Retrieval_Quality"
        )
    ]

    trace_provider = register(
        project_type=ProjectType.EXPERIMENT,
        project_name=config['future_agi']['project_name'],
        project_version_name=config['future_agi']['project_version'],
        eval_tags=eval_tags
    )
    LangChainInstrumentor().instrument(tracer_provider=trace_provider)
    print(f"FutureAGI instrumentation setup for Project: {config['future_agi']['project_name']}, Version: {config['future_agi']['project_version']}")

5. RAG Setup

It reads the data, chunks the documents, creates embeddings, indexes them in a FAISS vector store, and then builds a LangChain-powered RetrievalQA chain.

def setup_rag(config: dict):
    data_config = config['data']
    chunking_config = config['chunking']
    retrieval_config = config['retrieval']
    chain_config = config['chain']
    openai_config = config['openai']

    print(f"--- RAG Setup using Configuration ---")
    print(f"Data Path: {data_config['file_path']}")
    print(f"Chunking Enabled: {chunking_config['enabled']}")
    if chunking_config['enabled']:
        print(f"Chunker: {chunking_config['splitter_type']}, Size: {chunking_config['chunk_size']}, Overlap: {chunking_config['chunk_overlap']}")
    print(f"Retrieval Type: {retrieval_config['search_type']}, k: {retrieval_config['k']}")
    if retrieval_config['search_type'] == 'mmr':
         print(f"MMR Fetch K: {retrieval_config.get('fetch_k', 20)}, Lambda: {retrieval_config.get('lambda_mult', 0.5)}")
    print(f"Chain Type: {chain_config['type']}")
    print(f"LLM Model: {openai_config['llm_model']}, Temp: {openai_config['llm_temperature']}")
    print(f"Embedding Model: {openai_config.get('embedding_model', 'Default')}")
    print("-" * 30)

    try:
        # 1. Load Documents
        loader_args = {
            "file_path": data_config['file_path'],
            "encoding": data_config['encoding'],
        }
        if data_config.get('source_column'):
            loader_args['source_column'] = data_config['source_column']
        if data_config.get('metadata_columns'):
             loader_args['csv_args'] = {'fieldnames': data_config['metadata_columns']}

        loader = CSVLoader(**loader_args)
        documents = loader.load()
        print(f"Loaded {len(documents)} documents.")

        if not documents:
            print(f"No documents loaded. Check file content and CSVLoader configuration.")
            return None

        # 2. Chunk Documents (if enabled)
        if chunking_config['enabled']:
            splitter_type = chunking_config['splitter_type']
            if splitter_type == "RecursiveCharacterTextSplitter":
                text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size=chunking_config['chunk_size'],
                    chunk_overlap=chunking_config['chunk_overlap'],
                    length_function=len,
                    add_start_index=True,
                )
            elif splitter_type == "CharacterTextSplitter":
                 text_splitter = CharacterTextSplitter(
                    separator="\n\n",
                    chunk_size=chunking_config['chunk_size'],
                    chunk_overlap=chunking_config['chunk_overlap'],
                    length_function=len,
                )
            else:
                print(f"Warning: Unknown splitter_type '{splitter_type}'. Defaulting to RecursiveCharacterTextSplitter.")
                text_splitter = RecursiveCharacterTextSplitter(
                    chunk_size=chunking_config['chunk_size'],
                    chunk_overlap=chunking_config['chunk_overlap']
                )

            docs_to_index = text_splitter.split_documents(documents)
            print(f"Split into {len(docs_to_index)} chunks.")
        else:
            docs_to_index = documents
            print("Chunking disabled, indexing whole documents.")

        # 3. Create Embeddings
        embedding_model_name = openai_config.get('embedding_model')
        if embedding_model_name:
            embeddings = OpenAIEmbeddings(model=embedding_model_name)
        else:
            embeddings = OpenAIEmbeddings()

        # 4. Create Vector Store
        print("Creating vector store...")
        vectorstore = FAISS.from_documents(docs_to_index, embeddings)
        print("Vector store created successfully.")

        # 5. Create Retriever
        retriever_kwargs = {"k": retrieval_config['k']}
        search_type = retrieval_config['search_type']
        if search_type == "mmr":
            retriever_kwargs['fetch_k'] = retrieval_config.get('fetch_k', 20)
            retriever_kwargs['lambda_mult'] = retrieval_config.get('lambda_mult', 0.5)

        retriever = vectorstore.as_retriever(
            search_type=search_type,
            search_kwargs=retriever_kwargs
        )

        # 6. Create LLM
        llm = ChatOpenAI(
            temperature=openai_config['llm_temperature'],
            model=openai_config['llm_model']
        )

        # 7. Create RetrievalQA Chain
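        # chain_type controls how retrieved chunks reach the LLM: "stuff" puts
        # them all in one prompt, "map_reduce" answers per chunk and then
        # combines the answers, "refine" iteratively improves an answer chunk
        # by chunk, and "map_rerank" scores per-chunk answers and keeps the best.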
        rag_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type=chain_config['type'],
            retriever=retriever,
            return_source_documents=chain_config['return_source_documents']
        )
        print("RAG chain setup complete.")
        return rag_chain

    except ValueError as ve:
        print(f"ValueError during RAG setup: {ve}")
        if "got an unexpected keyword argument 'fieldnames'" in str(ve):
             print("Hint: Check 'metadata_columns' in config.yaml. CSVLoader might expect them differently or they might not exist.")
        elif "must have a source_column" in str(ve):
             print("Hint: Check 'source_column' in config.yaml. It might be missing or incorrect.")
        else:
             print("This might relate to CSV column names specified in config.yaml (source_column, metadata_columns) not matching data.csv.")
        traceback.print_exc()
        return None
    except Exception as e:
        print(f"Error setting up RAG system: {str(e)}")
        traceback.print_exc()
        return None

6. Query Processing

Runs a single query through the RAG pipeline and retrieves the model’s answer.

def process_query(rag_chain, query: str, data_file_path: str):
    if rag_chain is None:
         return f"Sorry, the knowledge base from '{data_file_path}' could not be loaded. RAG chain is None."

    try:
        print(f"Invoking RAG chain for query: '{query}'")
        result = rag_chain.invoke({"query": query})
        response = result.get("result", "No answer could be generated based on the documents.")

        if rag_chain.return_source_documents:
            source_docs = result.get("source_documents", [])
            print(f"Retrieved {len(source_docs)} source documents for the answer.")
        return response

    except Exception as e:
        print(f"Error processing RAG query: {str(e)}")
        traceback.print_exc()
        return f"Sorry, I encountered an error during retrieval or generation: {str(e)}"

7. Evaluation Execution

It sets up the RAG pipeline and loads the queries from the configuration file. For each query, it invokes the pipeline; evaluation spans are automatically captured and sent to Future AGI for scoring.

def run_evaluation_queries(config: dict):
    print("\n--- Initializing RAG based on Configuration ---")
    rag_chain = setup_rag(config)

    if rag_chain is None:
        print("\n--- RAG Setup Failed. Cannot run evaluation queries. Please check errors above. ---")
        return {}

    print("\n--- Starting RAG Evaluation Queries ---")
    queries = config['evaluation']['queries']
    data_file_path = config['data']['file_path']

    if not queries or any("[Your Column Name]" in q for q in queries):
        print("\n*** WARNING: Please replace placeholder queries in config.yaml under 'evaluation.queries'")
        print("*** with questions relevant to your specific data.csv file for meaningful evaluation! ***\n")

    results = {}
    for i, query in enumerate(queries):
        print(f"\n--- Query {i+1}/{len(queries)} ---")
        print(f"Query: {query}")
        response = process_query(rag_chain, query, data_file_path)
        print(f"Response: {response}")
        results[query] = response
        print("-" * 20)

    print("\n--- RAG Evaluation Queries Finished ---")
    print("Check the FutureAGI platform for traces and evaluation results.")
    print(f"Project: {config['future_agi']['project_name']}, Version: {config['future_agi']['project_version']}")
    return results

8. Main Function

It parses command-line arguments, loads the config, sets up environment variables and instrumentation, and runs the full evaluation process.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run RAG evaluation with configuration from a YAML file.")
    parser.add_argument(
        "-c", "--config",
        default="config.yaml",
        help="Path to the YAML configuration file (default: config.yaml)"
    )
    args = parser.parse_args()

    # Load Configuration
    config = load_config(args.config)

    # Setup Environment (API Keys etc.)
    setup_environment(config)

    # Setup FutureAGI Instrumentation
    setup_instrumentation(config)

    # Run Evaluation
    run_evaluation_queries(config)

    print("\nScript finished.")

9. Configuration File

Defines all the experiment parameters such as:

  • API keys and service URLs
  • Chunking strategy (splitter_type, chunk_size)
  • Retrieval type (similarity, mmr)
  • Chain strategy (stuff, map_reduce, refine, map_rerank)
  • Evaluation queries for benchmarking hallucination and context relevance

future_agi:
  api_key: "API_KEY"
  secret_key: "SECRET_KEY"
  base_url: "https://api.futureagi.com"
  project_name: "Experiment_RAG_Evaluation"
  project_version: "RecursiveCharacterTextSplitter_similarity_map_reduce"

openai:
  api_key: "OPENAI_API_KEY"
  llm_model: "gpt-4o-mini"
  llm_temperature: 0.5
  embedding_model: "text-embedding-3-small"

# --- Data Loading ---
data:
  file_path: "./data.csv"
  encoding: "utf-8"

# --- Chunking Strategy ---
chunking:
  enabled: true # Set to false to load documents whole (1 doc per CSV row)
  # Options: RecursiveCharacterTextSplitter, CharacterTextSplitter
  splitter_type: "RecursiveCharacterTextSplitter"
  chunk_size: 1000
  chunk_overlap: 150

# --- Retrieval Strategy ---
retrieval:
  # Options: "similarity", "mmr" (Maximal Marginal Relevance)
  search_type: "similarity"
  k: 3 # Number of documents to retrieve and pass to the LLM

# --- Chain Strategy ---
chain:
  # Options: "stuff", "map_reduce", "refine", "map_rerank"
  type: "map_reduce"
  return_source_documents: true

# --- Evaluation ---
evaluation:
  queries:
    - "Who found the answer to a search query collar george herbert essay?"
    - "What are some of the potential negative impacts of charity as discussed in the context?"
    - "Who were the three stars in the NHL game between Buffalo Sabres and Edmonton Oilers?"
    - "What services does Pearl Moving Company in Santa Clarita, 91390 offer?"
    - "What are the responsibilities of a Senior Planning Engineer in London, United Kingdom?"

Result

Future AGI’s automated scoring framework was used to assess each experimental run and establish which RAG configuration was the most effective. The evaluation covered both quality metrics (groundedness, context adherence, and context retrieval quality) and system metrics such as cost and latency. A weighted preference model was then used to rank the runs, reflecting real-world tradeoffs between performance and efficiency.

In the ‘Choose Winner’ option in the top right corner of All Runs, the evaluation sliders were positioned to weight model accuracy more heavily than operational efficiency. Weights were assigned as follows:

  • Evaluation Metrics:
    • Avg. Groundedness: 8/10
    • Avg. Context Adherence: 8/10
    • Avg. Context Retrieval Quality: 8/10
  • System Metrics:
    • Avg. Cost: 6/10
    • Avg. Latency: 6/10

This setup prioritises accuracy and contextual alignment while still giving reasonable weight to cost and responsiveness.

The winning configuration was CharacterTextSplitter_mmr_map_rerank, which combines character-based chunking, MMR (Maximal Marginal Relevance) retrieval, and a map_rerank generation chain. This approach offers a solid trade-off between reliability and resource efficiency, making it a good fit for production-level RAG pipelines where minimising hallucination is a priority.


Frequently Asked Questions (FAQs)

  • Will I be able to re-use this evaluation setup for other RAG use cases or datasets?

    Yes. The evaluation pipeline described in this cookbook is configuration-based and task-agnostic. The same instrumentation and metric setup applies to any RAG dataset.

  • Will I require labeled data in order to evaluate the hallucinations when using Future AGI?

    No. Future AGI performs model-based evaluation: it rates your outputs with AI evaluators, without needing labeled ground-truth answers beforehand. This enables rapid, scalable testing across configurations without the manual annotation burden.

  • I am using a different framework for my RAG application. Can I still use Future AGI for evaluation purposes?

    Yes. Future AGI is compatible with a variety of frameworks through automatic tracing and SDK integrations, including LangChain, Haystack, DSPy, LlamaIndex, Instructor, Crew AI, and others. Most major RAG stacks can have their evaluations instrumented with little to no setup.

  • How can I be sure my RAG pipeline isn’t hallucinating?

    One way to identify hallucinations is to check whether the responses generated by the model are directly supported by the retrieved context. This lets you measure factual alignment with automated metrics like Groundedness and Context Adherence instead of relying on human reviewers.

  • Can I create custom evaluations tailored to my RAG use case in Future AGI?

    Yes. The Deterministic Eval template in Future AGI supports custom evaluations (Click here to learn more about deterministic eval). This lets you apply stringent criteria to your RAG outputs, minimising variability.


Ready to Reduce Hallucinations in Your RAG Applications?

Start evaluating your RAG workflows with confidence using Future AGI’s automated, no-label-required evaluation framework. Future AGI provides the tools you need to systematically reduce hallucination.

Schedule a demo with us now!