Learn how to build production-grade PDF RAG chatbots using MongoDB Atlas for vector search and Future AGI to trace, evaluate, and monitor the real-time performance of LLM pipelines.
Observability comes from traceai-langchain instrumentation. Each user interaction produces a detailed trace that captures embeddings, retrieval results, prompt construction, and model outputs. These traces help developers understand how answers are generated, identify potential failure points such as retrieval errors or hallucinations, and monitor system latency. This observability layer makes the assistant not only functional but also transparent and easier to maintain.
The entire system is surfaced through a simple Gradio interface. After uploading and processing documents, users can ask questions and receive contextual answers along with referenced file names and page numbers. This end-to-end design provides a practical and extensible foundation for building explainable RAG applications.
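As a rough illustration of that interface, a minimal Gradio layout might look like the sketch below; build_index and answer_question are hypothetical helpers standing in for the ingestion and question-answering logic described in the following sections.

```python
import gradio as gr

# Hypothetical helpers standing in for the pipeline described later in this article.
from rag_pipeline import build_index, answer_question

with gr.Blocks() as demo:
    gr.Markdown("## PDF RAG Chatbot")
    pdf_files = gr.File(label="Upload PDFs", file_count="multiple", file_types=[".pdf"])
    ingest_btn = gr.Button("Process documents")
    status = gr.Textbox(label="Status", interactive=False)
    question = gr.Textbox(label="Ask a question")
    answer = gr.Textbox(label="Answer (with file names and page numbers)", interactive=False)

    # build_index() loads, chunks, embeds, and stores the uploaded PDFs.
    ingest_btn.click(build_index, inputs=pdf_files, outputs=status)
    # answer_question() runs retrieval + generation and formats source references.
    question.submit(answer_question, inputs=question, outputs=answer)

demo.launch()
```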
Future AGI maintains traceAI, an open-source package that enables standardised tracing of AI applications and frameworks. traceAI integrates seamlessly with OTel and provides auto-instrumentation packages for popular frameworks such as LangChain. With traceAI-langchain, every LangChain operation is automatically traced with meaningful attributes.
register() sets up an OpenTelemetry tracer that ships spans to Future AGI. LangChainInstrumentor().instrument() auto-instruments LangChain so you get AI-aware spans (Embedding, Retriever, LLM, Index build) with rich attributes such as model name, token usage, prompt, chunk metadata, latencies, and errors.
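A minimal setup sketch, assuming the fi-instrumentation and traceai-langchain packages are installed and the Future AGI API keys are present in the environment; the project name and project type are illustrative, and exact import paths may vary between traceAI versions:

```python
# Import paths follow the traceAI documentation; they may differ by version.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor

# register() builds an OpenTelemetry TracerProvider that exports spans to Future AGI.
# Assumes the FI API credentials are already set as environment variables.
trace_provider = register(
    project_type=ProjectType.OBSERVE,   # illustrative project type
    project_name="pdf-rag-chatbot",     # illustrative project name
)

# Auto-instrument LangChain so every chain, retriever, embedding, and LLM call
# emits an AI-aware span with model, token, prompt, and latency attributes.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```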
Rather than hard-coding the embedding dimension, the application calls the embed_query() method on OpenAIEmbeddings from the langchain-openai package. This allows the application to query the model for a real vector and extract its true shape.
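A sketch of that probe, using the same text-embedding-3-small model as the ingestion pipeline:

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a throwaway string and read the vector length instead of hard-coding it.
probe_vector = embeddings.embed_query("dimension probe")
embedding_dim = len(probe_vector)   # 1536 for text-embedding-3-small
```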
With the correct dimension in hand, the application proceeds to configure the search index in Atlas. It first tries the modern schema (knnVector) supported by MongoDB’s native vector search. If that isn’t available (e.g. on older clusters), it falls back to a legacy format (vector). Both configurations use cosine similarity, which is suitable for semantic search tasks like document retrieval. This two-step approach keeps the system compatible, robust, and aligned with whatever embedding model is in use, without manual tweaks.
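The fallback might be implemented roughly as follows; this sketch assumes pymongo 4.7+, an embedding field on the stored documents, and an index named vector_index (all illustrative choices):

```python
from pymongo.operations import SearchIndexModel

def ensure_vector_index(collection, dim: int, index_name: str = "vector_index"):
    """Try the knnVector schema first, then fall back to the 'vector' field format."""
    # First attempt: knnVector field mapping.
    knn_definition = {
        "mappings": {
            "dynamic": True,
            "fields": {
                "embedding": {
                    "type": "knnVector",
                    "dimensions": dim,
                    "similarity": "cosine",
                }
            },
        }
    }
    # Fallback: 'vector' field definition (Atlas Vector Search index type).
    vector_definition = {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": dim,
                "similarity": "cosine",
            }
        ]
    }
    try:
        collection.create_search_index(
            SearchIndexModel(definition=knn_definition, name=index_name)
        )
    except Exception:
        collection.create_search_index(
            SearchIndexModel(definition=vector_definition, name=index_name, type="vectorSearch")
        )
```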
Uploaded PDFs are loaded with PyPDFLoader. Since language models perform better with shorter inputs, the full text is split into overlapping chunks using RecursiveCharacterTextSplitter. These chunks help preserve context across boundaries while keeping the input size manageable.
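As a sketch of the loading and splitting step (file path, chunk size, and overlap are illustrative values):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load one Document per page; the page number lands in doc.metadata["page"].
pages = PyPDFLoader("docs/example.pdf").load()

# Overlapping chunks keep context across page and paragraph boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(pages)
```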
Each chunk is then embedded using OpenAI’s text-embedding-3-small embedding model. These embedding vectors, along with the original chunk text and relevant metadata, are stored in MongoDB Atlas.
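One way to express the ingestion step with LangChain’s MongoDB integration, reusing the chunks from the splitting sketch above; the connection string, database, collection, and index names are placeholders:

```python
from pymongo import MongoClient
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["rag_db"]["pdf_chunks"]                         # placeholder names

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embeds every chunk and writes {text, embedding, metadata} documents to Atlas.
vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection=collection,
    index_name="vector_index",
)
```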
When a user submits a question, the system embeds the query in the same vector space and performs a similarity search against the stored embeddings. The most relevant chunks (here, the top 6) are retrieved and passed to a RetrievalQA chain. This chain generates an answer using the retrieved context and a structured prompt that keeps responses grounded in the source material.
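A sketch of the retrieval-and-answer step, assuming the vector_store created above; the chat model and grounding prompt are illustrative:

```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Retrieve the 6 most similar chunks for each query.
retriever = vector_store.as_retriever(search_kwargs={"k": 6})

# Illustrative prompt that keeps the model grounded in the retrieved context.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),   # illustrative model choice
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,          # keep chunks for file/page citations
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain.invoke({"query": "What does the report conclude?"})
```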
These built-in evaluators provide strong coverage of the core failure modes in RAG pipelines: failing to answer the task, hallucinating unsupported facts, retrieving irrelevant context, ignoring retrieved content, or failing to attribute sources. Running them ensures a baseline level of quality monitoring across the system.
We built a custom evaluation called reference_verification to ensure strict fidelity between responses and retrieved context. Unlike general hallucination detection, which flags unsupported content, this evaluation enforces a stronger rule: every claim must be traceable to retrieved chunks. This is crucial for document-grounded workflows like our PDF chatbot, where users expect not just hallucination-free answers but also correctly cited evidence.
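Future AGI’s evaluation tooling defines the actual evaluator; purely to illustrate the rule it enforces, a simplified LLM-as-judge check could look like the following hypothetical sketch (the prompt, model, and pass/fail convention are not Future AGI’s API):

```python
from langchain_openai import ChatOpenAI

JUDGE_PROMPT = (
    "You are verifying a RAG answer. For every claim in the ANSWER, check whether it is "
    "directly supported by the CONTEXT. Reply with 'PASS' if all claims are supported, "
    "otherwise 'FAIL' followed by the unsupported claims.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)

def reference_verification(answer: str, retrieved_chunks: list[str]) -> bool:
    """Hypothetical stand-in for the custom eval: True only if every claim in the
    answer is traceable to the retrieved chunks."""
    judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    verdict = judge.invoke(
        JUDGE_PROMPT.format(context="\n\n".join(retrieved_chunks), answer=answer)
    ).content
    return verdict.strip().upper().startswith("PASS")
```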