Learn how to develop a trustworthy and production-ready LlamaIndex PDF RAG chatbot by integrating Future AGI’s evaluation and optimisation framework. GitHub Repo

1. Introduction

LLM applications that answer questions over enterprise documents often rely on retrieval-augmented generation (RAG). These systems must not only find relevant passages in PDFs and other documents, but also generate faithful and complete answers. However, RAG pipelines are prone to failure modes such as irrelevant retrieval, hallucination, or incomplete responses. Ensuring that each response in production is grounded in context, adheres to the query, and is task-complete is no longer optional. Developers also need transparency into how each response was generated: which chunks were retrieved, how embeddings were used, and how the final answer was assembled. This cookbook demonstrates how to build a PDF-based RAG chatbot using LlamaIndex, instrument it with Future AGI’s observability SDK, and run evaluations on traces. This makes the chatbot not only intelligent, but also explainable and production-ready.

2. Methodology

We will learn how to construct and evaluate (in real time) a conversational RAG workflow that ingests PDFs, builds a vector index, retrieves relevant chunks, and then responds to user queries with citations, as shown in Fig 1 below.

Fig 1. Methodology for integrating Future AGI’s observability into LlamaIndex RAG Chatbot

The goal is not only to generate an answer, but to systematically observe and assess the quality of each response using span-level metrics captured across retrieval and generation.

To achieve this, we use LlamaIndex to create a pipeline that ingests documents, processes them, and enables natural question-answering. Users can upload PDFs, which are automatically indexed so that relevant information is easy to retrieve later. The system splits documents into semantically meaningful chunks and converts them into embeddings using OpenAI’s text-embedding-3-large model. These embeddings are stored in a persistent vector index on disk, ensuring efficient lookups even across sessions.

Whenever a user asks a question, the query is analysed and, if necessary, rewritten so that follow-up interactions are handled effectively. The system then retrieves the most relevant document passages by comparing the query’s embedding against the indexed embeddings and ranking them by similarity. Once the top passages are identified, the assistant uses an OpenAI model to generate a concise, context-aware response grounded entirely in the retrieved content. To ensure transparency, the assistant also provides references to the original documents, including file names, page numbers, and similarity scores, so users can trace each answer back to its supporting evidence.

To make the system observable and debuggable, we integrate traceAI-llamaindex, Future AGI’s Python package for instrumenting applications built with the LlamaIndex framework.
Every user interaction produces a comprehensive execution trace that captures key details, including embedding generation, retrieval results, response synthesis steps, and latency metrics. These traces make the assistant’s decision-making process fully transparent, helping developers understand exactly how an answer was derived and quickly diagnose potential issues. Finally, we leverage Future AGI’s evaluation framework to continuously assess the quality of responses. Each query is evaluated along four critical dimensions:
  • Did the response fully solve what the user asked for?
  • Did the model introduce unsupported or fabricated facts?
  • Were the retrieved chunks the right ones to answer the query?
  • Did the model stay within retrieved context and avoid drifting into unrelated information?
These evaluations provide actionable insights, enabling developers to refine chunking strategies, optimize retrieval accuracy, and improve overall reliability over time. By combining LlamaIndex for document understanding, OpenAI models for reasoning, and Future AGI for observability and automated evaluation, this methodology delivers a conversational assistant that is not only intelligent but also explainable, trustworthy, and production-ready.

3. Observability With Future AGI

As RAG systems move from prototyping into production, the central challenge is no longer “Can the model generate an answer?” but “Can I trust this answer, and can I diagnose issues when it fails?” Traditional application monitoring, which focuses on CPU load, API uptime, or request throughput, is insufficient for LLM applications. A chatbot may remain online and perform at the infrastructure level while producing answers that are hallucinated, incomplete, or biased at the model level. Future AGI’s Observe platform addresses this gap by bringing enterprise-grade observability into the heart of AI-driven systems.

Unlike deterministic software, LLMs are probabilistic systems. The same query may produce different answers depending on context, retrieved chunks, or even subtle prompt variations. Without structured monitoring, debugging issues becomes guesswork. Future AGI Observe solves this by automatically capturing execution traces from your LlamaIndex pipeline:
  • Which PDFs were retrieved, and which specific chunks were selected?
  • What embeddings were generated, and how long did they take?
  • What prompt was sent to the model, with what temperature, and how many tokens were consumed?
  • Did the final answer align with the retrieved evidence, or did the model hallucinate?
By answering these questions in real time, Observe makes your RAG pipeline explainable and diagnosable. It transforms a black-box chatbot into a system you can trust, evaluate, and continuously improve.

4. Building Blocks of Observability

At the heart of Observe are spans and traces.
  • A span is a single operation within your pipeline: an embedding call, a retrieval query, or an LLM generation step. Each span records metadata such as execution time, input and output payloads, model configuration, and errors if they occur.
  • A trace connects multiple spans together to represent the full lifecycle of a user request. In a PDF chatbot, one trace might contain:
    • A retriever span showing which chunks were selected and from which file/page.
    • An embedding span with input text length and latency.
    • An LLM span capturing the prompt, temperature, and token usage.
    • The final chat span with the user’s question and the assistant’s answer.
This hierarchical view allows you to replay any request end-to-end, debug where it went wrong, and validate whether outputs were grounded in the right evidence.

5. Instrumenting LlamaIndex Project

Future AGI builds on OpenTelemetry (OTel), the industry-standard open-source observability framework. OTel ensures traces are vendor-neutral, scalable, and exportable across monitoring backends. But OTel is infrastructure-centric: it understands function calls, API latencies, and database queries, but not embeddings, prompts, or hallucinations. traceAI defines conventions for AI workloads and provides auto-instrumentation packages for frameworks such as LlamaIndex. With traceAI-llamaindex, every LlamaIndex operation is automatically traced with meaningful attributes.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="llamaindex_project",
)

LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)
  • register() sets up an OpenTelemetry tracer that ships spans to Future AGI.
  • LlamaIndexInstrumentor().instrument() auto-instruments LlamaIndex so you get more AI-aware spans (Embedding, Retriever, LLM, Index build) with rich attributes (model name, token usage, prompt, chunk metadata, latencies, errors).
Click here to learn more about auto-instrumentation.
This level of detail allows teams to move from “The chatbot failed” to “The chatbot failed because it retrieved irrelevant chunks from document X, page 14, due to an overly generic embedding query.”

6. LlamaIndex PDF Chatbot Application

The application we have built is a document-grounded chatbot powered by LlamaIndex, OpenAI models, and a simple Gradio UI. Its purpose is to allow users to upload enterprise PDFs, automatically index them into a vector database, and then ask natural-language questions whose answers are generated based strictly on retrieved content. Let’s break down how it works:

6.1 Document Ingestion and Indexing

Uploaded files are stored in the ./documents directory and indexed into a persistent ./vectorstore. This is handled by the following workflow:
docs = SimpleDirectoryReader(str(DOCUMENTS_PATH), recursive=True).load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir=str(STORAGE_PATH))
  • SimpleDirectoryReader parses PDFs (or text-based files) and splits them into nodes.
  • VectorStoreIndex converts these nodes into embeddings using OpenAI’s text-embedding-3-large model.
  • The embeddings are persisted locally, so queries remain efficient across sessions.
Whenever users upload new files, rebuild_index() is invoked to clear the old vectorstore and regenerate a fresh one.

6.2 Query Handling and Response Generation

When a user types a question in the Gradio chat interface, the respond() function orchestrates the pipeline:
response = engine.chat(message)
The query is embedded, relevant chunks are retrieved from the vectorstore, and the OpenAI model (gpt-4o-mini by default) generates an answer grounded in those retrieved chunks. The assistant then attaches citations (file names, page numbers, similarity scores) from the top source nodes, ensuring every answer is traceable back to its evidence.

6.3 Conversational Memory

The chatbot uses LlamaIndex’s ChatMemoryBuffer to maintain dialogue history. This allows follow-up questions to be condensed into standalone queries, making multi-turn conversations consistent and context-aware.

6.4 User Interface

Fig 2. LlamaIndex-based PDF-ingested chatbot with Gradio UI

The Gradio app ties everything together:
  • Upload Panel: Users drag and drop files, triggering upload_and_process().
  • Chat Panel: A conversational interface (gr.ChatInterface) where users ask questions and receive grounded answers.
  • Examples: Pre-set queries (summarize, extract key points, compare concepts) to showcase functionality.

6.5 Why Observability Matters Here

Although the app is simple to use, internally it executes multiple hidden steps, such as embedding generation, retrieval ranking, prompt assembly, and LLM generation, that can fail silently or degrade quality. Without observability, developers only see the final text output, not the process that produced it. By instrumenting this app with Future AGI’s traceAI-llamaindex, each of these operations is automatically traced and turned into spans. This transforms the chatbot into a fully observable pipeline, where developers can validate whether answers are complete, grounded, and non-hallucinatory.

7. Tracing the LlamaIndex PDF Chatbot

With instrumentation enabled, every step of your Document Chat Assistant becomes transparent inside Future AGI Observe. Instead of treating the chatbot as a monolithic black box, traces break the flow into observable units that match your app’s architecture. Let’s map the app we have just built to what Observe will capture:

7.1 Document Upload and Indexing

When a user uploads PDFs through the Gradio interface, Observe records a chain of spans covering:
  • File Handling (save_uploaded) – which files were added, how large they were, and whether writes to ./documents succeeded.
  • Rebuild Index (rebuild_index) – deletion of the old ./vectorstore and creation of a new one.
  • Ingestion Spans (inside initialize_index) – SimpleDirectoryReader loading text, chunking documents into nodes, generating embeddings for each chunk, and persisting them.
If ingestion slows down or fails for certain files, you’ll see it here. Large PDFs create long embedding spans, while corrupted files show up as failed Reader spans.

7.2 Query Processing

When a user asks a question via the Gradio chat interface, it expands into multiple spans:
  • Embedding Span (Query): Observe logs the embedding request made for the user’s query, including model (text-embedding-3-large), input token count, and latency.
  • Retriever Span: This shows which chunks were selected from the vectorstore, their similarity scores, and their source metadata (file_name, page_number). You can directly validate whether the retrieved evidence is relevant.
  • LLM Span (Response Synthesis): The OpenAI model call (gpt-4o-mini by default) is captured in full: the constructed prompt (including condensed history), generation parameters (temperature, max tokens), token usage, latency, and the final output text.
Together, these spans reconstruct the entire reasoning path of the chatbot for a single question from query embedding to chunk selection to final answer.

7.3 Source Attribution

The app explicitly surfaces citations in responses. These same metadata fields are recorded in Retriever spans, which allows you to check whether the assistant is faithfully reporting sources or omitting them. By mapping spans directly onto your LlamaIndex PDF Chatbot, developers don’t just see metrics; they see their actual app behavior unfolding in real time. This closes the gap between code, model behavior, and user-facing output.

8. Evaluation

Instrumenting the chatbot gives you traces. But raw traces are only half the story. To ensure reliability, you also need evaluations. Future AGI lets you attach evaluation tasks from the dashboard/UI directly to spans in your pipeline. For a LlamaIndex PDF chatbot, the most relevant evaluations include:
  • Task Completion: Did the response fully solve what the user asked for? This ensures answers are not partial or evasive.
  • Detect Hallucination: Did the model introduce unsupported or fabricated facts? This prevents users from being misled.
  • Context Relevance: Were the retrieved chunks the right ones to answer the query? This checks if retrieval is working properly.
  • Context Adherence: Did the model stay within retrieved context and avoid drifting into unrelated information? This reinforces factual consistency.
  • Chunk Utilization: Quantifies how effectively the assistant incorporated retrieved context into its response.
  • Chunk Attribution: Validates whether the response referenced the retrieved chunks at all.
Click here to learn more about all the built-in evals Future AGI provides
These built-in evaluators provide strong coverage of the core failure modes in RAG pipelines: failing to answer the task, hallucinating unsupported facts, retrieving irrelevant context, ignoring retrieved content, or failing to attribute sources. Running them ensures a baseline level of quality monitoring across the system. However, no two enterprises share identical requirements. Built-in evaluations are general-purpose, but in many cases, domain-specific validation is needed. For example, a financial assistant may need to verify regulatory compliance, while a medical assistant must ensure responses align with clinical guidelines. This is where custom evaluations become essential. Future AGI supports creating custom evaluations that allow teams to define their own rules, scoring mechanisms, and validation logic. Custom evaluators are particularly useful when:
  • Standard checks are not enough to capture domain-specific risks.
  • Outputs must conform to strict business rules or regulatory frameworks.
  • Multi-factor scoring or weighted metrics are required.
  • You want guarantees about output format, citation correctness, or evidence alignment beyond generic grounding tests.
Click here to learn more about creating and using custom evals in Future AGI
For this project, we implemented a custom evaluation called citation_verification. Its purpose is to enforce strict fidelity between the generated response and the retrieved context. Unlike hallucination detection, which flags unsupported content broadly, this custom citation verification eval narrows the check to a stronger guarantee: every claim in the assistant’s output must be traceable to the retrieved chunks. This is especially critical in document-grounded workflows like our PDF chatbot, where end users expect answers not only to be “hallucination-free,” but also to cite the correct source evidence.

In the Future AGI dashboard, we define evals as tasks and attach them to the appropriate span types, as shown in Fig 3.

Fig 3. Setting up evals at span level

This way, each span in a trace is automatically evaluated as soon as it’s generated. When a user asks a question, the trace view shows every operation (Embedding → Retriever → LLM → Synthesizer) alongside evaluation results, as shown in Fig 4.

Fig 4. Trace-level details of chatbot

On the left you can see the hierarchy of spans (embedding, retrieval, generation). On the right you can see the inputs and outputs (query + generated response). The bottom panel shows the eval results applied span-by-span. For example, in this run:
  • Task Completion shows “Passed”, meaning the model generated a summary in direct response to the user’s query. This shows that the assistant fulfilled the requested task, producing an output aligned with the input intent.
  • Detect Hallucination shows “Passed”, meaning the generated response did not include fabricated information or unsupported claims. This confirms that the assistant remained faithful to the retrieved content, with no invented facts.
  • Context Adherence scored 80%, meaning most of the response stayed within the retrieved context, but some parts drifted slightly. While this does not invalidate the answer, it suggests minor instances where the model included information not strictly found in the provided chunks. Monitoring this score helps minimise subtle inconsistencies.
  • Context Relevance scored 40%, meaning retrieval surfaced only partially useful chunks for the task. Although the assistant still produced an acceptable summary, the evidence provided by the retriever was suboptimal. This signals a need to refine chunking or retriever configurations to ensure the model consistently receives the most relevant inputs.
Future AGI provides a comprehensive dashboard, as shown in Fig 5, to visually analyse the eval results along with system metrics such as latency and cost, making it easy to compare the performance of your application.

Fig 5. Charts of eval metrics and system metrics

These evaluations reveal that while the chatbot can complete tasks and avoid hallucinations, there is room for improvement in how context is retrieved and adhered to. High task completion and no hallucination confirm reliability at the generation stage, but weaker relevance and adherence scores highlight weaknesses in retrieval. Addressing these gaps through better chunking, reranking, or retriever tuning can significantly improve grounding quality and user trust.

What makes this approach powerful is that evaluations run continuously and automatically across every user interaction. The system generates real-time quality signals that reflect how the pipeline performs under actual workloads. For example, a sudden dip in context relevance immediately points developers to retrieval as the root cause, while a drop in context adherence highlights drift during synthesis. In production environments, this continuous scoring becomes more than diagnostic; it forms the foundation for proactive monitoring. Once thresholds are defined, for example, hallucination must remain below x%, or relevance must stay above y%, Future AGI can automatically trigger alerts the moment performance begins to degrade. Instead of discovering weeks later that users were served incomplete or poorly grounded answers, teams receive real-time Slack/email notifications and can intervene before quality issues reach end users. Figure 6 below shows how an alert rule can be created directly from evaluation metrics.
Here, the developer selects a metric to set an alert on (e.g., token usage or context relevance), then defines an interval for monitoring and sets thresholds that represent acceptable performance. Filters can further refine conditions to monitor specific spans, datasets, or user cohorts. This ensures that alerts are tuned to operational and business priorities rather than being generic warnings.

Fig 6. Creating alert rule

Once active, alerts appear in a centralised alerts dashboard, shown in Figure 7. This dashboard consolidates triggered alerts across projects, classifying them by type (e.g., API failures, credit exhaustion, low context relevance), along with their status (Healthy vs Triggered) and the time last triggered. Developers can immediately see which parts of the pipeline require attention, mute or resolve alerts, and review historical patterns to detect recurring issues.

Fig 7. Alerts dashboard

By combining continuous evaluations with automated alerting, Future AGI transforms observability from a passive reporting system into an active safeguard. Teams no longer just understand how their RAG pipelines behave; they are warned the moment reliability drifts, enabling faster intervention, reduced risk, and stronger user trust.

Conclusion

This cookbook has walked through the end-to-end process of building a PDF-grounded chatbot with LlamaIndex, powering it with OpenAI models, and making it observable and trustworthy using Future AGI’s observability framework. We began by constructing a pipeline that ingests enterprise PDFs, splits them into semantic chunks, and stores them in a vector index for fast and accurate retrieval. On top of this, we built a conversational assistant capable of answering natural-language questions with citations, giving users traceable, document-backed responses. The real differentiator came with observability. By instrumenting the application with traceAI-llamaindex, every step of the pipeline, from embeddings to retrieval to LLM output, became transparent and traceable. What was once a black-box chatbot turned into an explainable system where developers can see exactly how each answer is assembled, diagnose failures, and track performance over time. Finally, we configured evaluations and the results demonstrated that while the chatbot reliably completes tasks and avoids hallucinations, retrieval quality remains the most critical factor to optimize. These insights help developers go beyond functionality and focus on quality, grounding, and trustworthiness.

Ready to Make your LlamaIndex Application Reliable?

Start evaluating your LlamaIndex applications with confidence using Future AGI’s observability framework. Future AGI provides the tools you need to build applications that are reliable, explainable, and production-ready. Click here to schedule a demo with us now!