1. Introduction
LLM applications that answer questions over enterprise documents often rely on retrieval-augmented generation (RAG). These systems must not only find relevant passages in PDFs and other documents, but also generate faithful and complete answers. However, RAG pipelines are prone to failure modes such as irrelevant retrieval, hallucination, or incomplete responses. Ensuring that each response in production is grounded in context, adheres to the query, and is task-complete is no longer optional. Developers also need transparency into how each response was generated: which chunks were retrieved, how embeddings were used, and how the final answer was assembled. This cookbook demonstrates how to build a PDF-based RAG chatbot using LlamaIndex, instrument it with Future AGI’s observability SDK, and run evaluations on traces. This makes the chatbot not only intelligent, but also explainable and production-ready.

2. Methodology
We will learn how to construct and evaluate (in real time) a conversational RAG workflow that ingests PDFs, builds a vector index, retrieves relevant chunks, and then responds to user queries with citations, as shown in Fig 1 below.
The workflow is instrumented with traceAI-llamaindex, Future AGI’s Python package for instrumenting applications built with the LlamaIndex framework. Every user interaction produces a comprehensive execution trace that captures key details, including embedding generation, retrieval results, response synthesis steps, and latency metrics. These traces make the assistant’s decision-making process fully transparent, helping developers understand exactly how an answer was derived and quickly diagnose potential issues.
Finally, we leverage Future AGI’s evaluation framework to continuously assess the quality of responses. Each query is evaluated along four critical dimensions:
- Did the response fully solve what the user asked for?
- Did the model introduce unsupported or fabricated facts?
- Were the retrieved chunks the right ones to answer the query?
- Did the model stay within retrieved context and avoid drifting into unrelated information?
3. Observability With Future AGI
As RAG systems move from prototyping into production, the central challenge is no longer “Can the model generate an answer?” but “Can I trust this answer, and can I diagnose issues when it fails?” Traditional application monitoring, which focuses on CPU load, API uptime, or request throughput, is insufficient for LLM applications. A chatbot may remain online and perform well at the infrastructure level while producing answers that are hallucinated, incomplete, or biased at the model level. Future AGI’s Observe platform addresses this gap by bringing enterprise-grade observability into the heart of AI-driven systems. Unlike deterministic software, LLMs are probabilistic systems. The same query may produce different answers depending on context, retrieved chunks, or even subtle prompt variations. Without structured monitoring, debugging issues becomes guesswork. Future AGI Observe solves this by automatically capturing execution traces from your LlamaIndex pipeline:
- Which PDFs were retrieved, and which specific chunks were selected?
- What embeddings were generated, and how long did they take?
- What prompt was sent to the model, with what temperature, and how many tokens were consumed?
- Did the final answer align with the retrieved evidence, or did the model hallucinate?
4. Building Blocks of Observability
At the heart of Observe are spans and traces.
- A span is a single operation within your pipeline: an embedding call, a retrieval query, or an LLM generation step. Each span records metadata such as execution time, input and output payloads, model configuration, and errors if they occur.
- A trace connects multiple spans together to represent the full lifecycle of a user request. In a PDF chatbot, one trace might contain:
  - A retriever span showing which chunks were selected and from which file/page.
  - An embedding span with input text length and latency.
  - An LLM span capturing the prompt, temperature, and token usage.
  - The final chat span with the user’s question and the assistant’s answer.
5. Instrumenting a LlamaIndex Project
Future AGI builds on OpenTelemetry (OTel), the industry-standard open-source observability framework. OTel ensures traces are vendor-neutral, scalable, and exportable across monitoring backends. But OTel is infrastructure-centric: it understands function calls, API latencies, and database queries, but not embeddings, prompts, or hallucinations. traceAI defines conventions for AI workloads and provides auto-instrumentation packages for frameworks such as LlamaIndex. With traceAI-llamaindex, every LlamaIndex operation is automatically traced with meaningful attributes.
Two calls at application startup wire this up:
- register() sets up an OpenTelemetry tracer that ships spans to Future AGI.
- LlamaIndexInstrumentor().instrument() auto-instruments LlamaIndex so you get AI-aware spans (Embedding, Retriever, LLM, Index build) with rich attributes (model name, token usage, prompt, chunk metadata, latencies, errors).
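The sketch below shows how this might look in code. It is illustrative: the import paths, the project_type/project_name arguments, and the FI_API_KEY/FI_SECRET_KEY environment variables are assumptions based on typical traceAI setups, so check the SDK documentation for the exact names.

```python
import os

# Assumed import paths; verify against the traceAI-llamaindex docs.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_llamaindex import LlamaIndexInstrumentor

# Authentication is assumed to come from these environment variables.
os.environ.setdefault("FI_API_KEY", "<your-api-key>")
os.environ.setdefault("FI_SECRET_KEY", "<your-secret-key>")

# Ship spans to Future AGI under a named project (project name is hypothetical).
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="pdf-chatbot",
)

# Auto-instrument LlamaIndex: embedding, retriever, LLM, and index-build spans.
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)
```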
Click here to learn more about auto-instrumentation
6. LlamaIndex PDF Chatbot Application
The application we have built is a document-grounded chatbot powered by LlamaIndex, OpenAI models, and a simple Gradio UI. Its purpose is to allow users to upload enterprise PDFs, automatically index them into a vector database, and then ask natural-language questions whose answers are generated based strictly on retrieved content. Let’s break down how it works.

6.1 Document Ingestion and Indexing
Uploaded files are stored in the ./documents directory and indexed into a persistent ./vectorstore. This is handled by the following workflow:
- SimpleDirectoryReader parses PDFs (or text-based files) and splits them into nodes.
- VectorStoreIndex converts these nodes into embeddings using OpenAI’s text-embedding-3-large model.
- The embeddings are persisted locally, so queries remain efficient across sessions.
When new documents are uploaded, rebuild_index() is invoked to clear the old vectorstore and regenerate a fresh one.
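A minimal sketch of this ingestion path is shown below. The rebuild_index() name comes from the app, but its body here is an assumption built on standard LlamaIndex APIs (SimpleDirectoryReader, VectorStoreIndex, and local persistence).

```python
import shutil

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Use the embedding model described above for every chunk.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

def rebuild_index():
    """Re-parse ./documents and regenerate ./vectorstore from scratch."""
    shutil.rmtree("./vectorstore", ignore_errors=True)            # drop the stale index
    documents = SimpleDirectoryReader("./documents").load_data()  # PDFs/text -> Documents
    index = VectorStoreIndex.from_documents(documents)            # chunk, embed, index
    index.storage_context.persist(persist_dir="./vectorstore")    # persist across sessions
    return index
```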
6.2 Query Handling and Response Generation
When a user types a question in the Gradio chat interface, the respond() function orchestrates the pipeline: the query (condensed with chat history) is embedded, the most relevant chunks are retrieved from the vectorstore, and the LLM (the gpt-4o-mini model) generates an answer grounded in those retrieved chunks. The assistant attaches citations (file names, page numbers, similarity scores) from the top source nodes. This ensures every answer is traceable back to its evidence.
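One way to wire respond() is sketched below. The condense-plus-context chat mode, the similarity_top_k value, and the citation formatting are assumptions rather than the app’s exact implementation, and the metadata keys (file_name, page_label) depend on which reader ingested the PDFs.

```python
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# Load the index persisted during ingestion (section 6.1).
storage = StorageContext.from_defaults(persist_dir="./vectorstore")
index = load_index_from_storage(storage)

# Condense follow-up questions into standalone queries, retrieve, then answer.
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=4),
    llm=OpenAI(model="gpt-4o-mini", temperature=0.1),
    memory=ChatMemoryBuffer.from_defaults(token_limit=3000),
)

def respond(message, history):
    """Gradio ChatInterface callback: answer a question with citations."""
    response = chat_engine.chat(message)
    citations = [
        f"{n.node.metadata.get('file_name')} "
        f"(page {n.node.metadata.get('page_label')}, score {(n.score or 0.0):.2f})"
        for n in response.source_nodes
    ]
    return response.response + "\n\nSources: " + "; ".join(citations)
```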
6.3 Conversational Memory
The chatbot uses LlamaIndex’s ChatMemoryBuffer to maintain dialogue history. This allows follow-up questions to be condensed into standalone queries, making multi-turn conversations consistent and context-aware.
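As a quick illustration with the respond() sketch above (the queries are made up), the second turn is rewritten into a standalone question using the buffered history before retrieval:

```python
# Turn 1 establishes the topic; turn 2 is an ambiguous follow-up.
print(respond("What does the report say about revenue growth?", []))
# "it" is resolved against the buffered chat history, so the condensed query
# still retrieves the right chunks about revenue growth.
print(respond("What risks does it mention?", []))
```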
6.4 User Interface
Fig 2. LlamaIndex-based PDF-ingested chatbot with Gradio UI

The Gradio app ties everything together:
- Upload Panel: Users drag and drop files, triggering upload_and_process().
- Chat Panel: A conversational interface (gr.ChatInterface) where users ask questions and receive grounded answers.
- Examples: Pre-set queries (summarize, extract key points, compare concepts) to showcase functionality.
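The layout below is a hedged sketch of that wiring. The component choices, labels, and the save_uploaded() helper body are assumptions; it reuses rebuild_index() and respond() from the earlier sketches.

```python
import shutil
from pathlib import Path

import gradio as gr

def save_uploaded(files):
    """Copy uploaded files into ./documents and return their new paths."""
    dest = Path("./documents")
    dest.mkdir(exist_ok=True)
    saved = []
    for f in files:
        src = Path(getattr(f, "name", f))   # file object or plain path
        shutil.copy(src, dest / src.name)
        saved.append(dest / src.name)
    return saved

def upload_and_process(files):
    """Save uploads, then regenerate the vector index."""
    paths = save_uploaded(files)
    rebuild_index()  # a real app would also refresh the in-memory chat engine
    return f"Indexed {len(paths)} file(s)."

chat = gr.ChatInterface(
    fn=respond,
    examples=["Summarize the uploaded documents",
              "Extract the key points",
              "Compare the main concepts"],
)

with gr.Blocks() as demo:
    gr.Markdown("## Document Chat Assistant")
    files = gr.Files(label="Upload PDFs")
    status = gr.Textbox(label="Status", interactive=False)
    files.upload(upload_and_process, inputs=files, outputs=status)
    chat.render()   # embed the chat panel below the upload panel

demo.launch()
```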
6.5 Why Observability Matters Here
Although the app is simple to use, internally it executes multiple hidden steps, such as embedding generation, retrieval ranking, prompt assembly, and LLM generation, that can fail silently or degrade quality. Without observability, developers only see the final text output, not the process that produced it. By instrumenting this app with Future AGI’s traceAI-llamaindex, each of these operations is automatically traced and turned into spans. This transforms the chatbot into a fully observable pipeline, where developers can validate whether answers are complete, grounded, and non-hallucinatory.
7. Tracing the LlamaIndex PDF Chatbot
With instrumentation enabled, every step of your Document Chat Assistant becomes transparent inside Future AGI Observe. Instead of treating the chatbot as a monolithic black box, traces break the flow into observable units that match your app’s architecture. Let’s map the app we have just built to what Observe will capture.

7.1 Document Upload and Indexing
When a user uploads PDFs through the Gradio interface, Observe records a chain of spans covering:
- File Handling (save_uploaded) – which files were added, how large they were, and whether writes to ./documents succeeded.
- Rebuild Index (rebuild_index) – deletion of the old ./vectorstore and creation of a new one.
- Ingestion Spans (inside initialize_index) – SimpleDirectoryReader loading text, chunking documents into nodes, generating embeddings for each chunk, and persisting them.
7.2 Query Processing
When a user asks a question via the Gradio chat interface, it expands into multiple spans:
- Embedding Span (Query): Observe logs the embedding request made for the user’s query, including model (text-embedding-3-large), input token count, and latency.
- Retriever Span: This shows which chunks were selected from the vectorstore, their similarity scores, and their source metadata (file_name, page_number). You can directly validate whether the retrieved evidence is relevant.
- LLM Span (Response Synthesis): The OpenAI model call (gpt-4o-mini by default) is captured in full: the constructed prompt (including condensed history), generation parameters (temperature, max tokens), token usage, latency, and the final output text.
7.3 Source Attribution
The app explicitly surfaces citations in responses. These same metadata fields are recorded in Retriever spans, which allows you to check whether the assistant is faithfully reporting sources or omitting them. By mapping spans directly onto your LlamaIndex PDF Chatbot, developers don’t just see metrics; they see their actual app behavior unfolding in real time. This closes the gap between code, model behavior, and user-facing output.

8. Evaluation
Instrumenting the chatbot gives you traces. But raw traces are only half the story. To ensure reliability, you also need evaluations. Future AGI lets you attach evaluation tasks from the dashboard/UI directly to spans in your pipeline. For a LlamaIndex PDF chatbot, the most relevant evaluations include:
- Task Completion: Did the response fully solve what the user asked for? This ensures answers are not partial or evasive.
- Detect Hallucination: Did the model introduce unsupported or fabricated facts? This prevents users from being misled.
- Context Relevance: Were the retrieved chunks the right ones to answer the query? This checks if retrieval is working properly.
- Context Adherence: Did the model stay within retrieved context and avoid drifting into unrelated information? This reinforces factual consistency.
- Chunk Utilization: Quantifies how effectively the assistant incorporated retrieved context into its response.
- Chunk Attribution: Validates whether the response referenced the retrieved chunks at all.
Click here to learn more about all the built-in evals Future AGI provides
These built-in evaluators provide strong coverage of the core failure modes in RAG pipelines: failing to answer the task, hallucinating unsupported facts, retrieving irrelevant context, ignoring retrieved content, or failing to attribute sources. Running them ensures a baseline level of quality monitoring across the system. However, no two enterprises share identical requirements. Built-in evaluations are general-purpose, but in many cases, domain-specific validation is needed. For example, a financial assistant may need to verify regulatory compliance, while a medical assistant must ensure responses align with clinical guidelines. This is where custom evaluations become essential. Future AGI supports creating custom evaluations that allow teams to define their own rules, scoring mechanisms, and validation logic. Custom evaluators are particularly useful when:
- Standard checks are not enough to capture domain-specific risks.
- Outputs must conform to strict business rules or regulatory frameworks.
- Multi-factor scoring or weighted metrics are required.
- You want guarantees about output format, citation correctness, or evidence alignment beyond generic grounding tests.
Click here to learn more about creating and using custom evals in Future AGI
Attaching these evaluations to the chatbot’s traces produced the following results for a sample summarization query:
- Task Completion shows “Passed” meaning the model generated a summary in direct response to the user’s query. This shows that the assistant fulfilled the requested task, producing an output aligned with the input intent.
- Detect Hallucination shows “Passed” meaning the generated response did not include fabricated information or unsupported claims. This confirms that the assistant remained faithful to the retrieved content, with no invented facts.
- Context Adherence scored 80%, meaning most of the response stayed within the retrieved context, but some parts drifted slightly. While this does not invalidate the answer, it suggests minor instances where the model included information not strictly found in the provided chunks. Monitoring this score helps minimise subtle inconsistencies.
- Context Relevance scored 40%, meaning retrieval surfaced only partially useful chunks for the task. Although the assistant still produced an acceptable summary, the evidence provided by the retriever was suboptimal. This signals a need to refine chunking or retriever configurations to ensure the model consistently receives the most relevant inputs.



9. Conclusion
This cookbook has walked through the end-to-end process of building a PDF-grounded chatbot with LlamaIndex, powering it with OpenAI models, and making it observable and trustworthy using Future AGI’s observability framework. We began by constructing a pipeline that ingests enterprise PDFs, splits them into semantic chunks, and stores them in a vector index for fast and accurate retrieval. On top of this, we built a conversational assistant capable of answering natural-language questions with citations, giving users traceable, document-backed responses. The real differentiator came with observability. By instrumenting the application with traceAI-llamaindex, every step of the pipeline, from embeddings to retrieval to LLM output, became transparent and traceable. What was once a black-box chatbot turned into an explainable system where developers can see exactly how each answer is assembled, diagnose failures, and track performance over time.
Finally, we configured evaluations, and the results demonstrated that while the chatbot reliably completes tasks and avoids hallucinations, retrieval quality remains the most critical factor to optimize. These insights help developers go beyond functionality and focus on quality, grounding, and trustworthiness.