traceAI-llamaindex is Future AGI's Python package for instrumenting applications built with the LlamaIndex framework. Every user interaction produces a comprehensive execution trace that captures key details, including embedding generation, retrieval results, response-synthesis steps, and latency metrics. These traces make the assistant's decision-making process fully transparent, helping developers understand exactly how an answer was derived and quickly diagnose potential issues.
Finally, we leverage Future AGI’s evaluation framework to continuously assess the quality of responses. Each query is evaluated along four critical dimensions:
traceAI defines conventions for AI workloads and provides auto-instrumentation packages for frameworks such as LlamaIndex. With traceAI-llamaindex, every LlamaIndex operation is automatically traced with meaningful attributes.
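As a rough sketch, the setup boils down to two calls; the module paths and the project_name argument below are assumptions based on the package names in this post, so verify them against Future AGI's documentation:

```python
# Instrumentation setup sketch. Module paths are assumed from the
# package names mentioned in this post; check Future AGI's docs.
from fi_instrumentation import register
from traceai_llamaindex import LlamaIndexInstrumentor

# Create an OpenTelemetry tracer provider that exports spans to Future AGI.
trace_provider = register(project_name="rag-chatbot")  # name is illustrative

# Patch LlamaIndex so every embedding, retrieval, and LLM call emits a span.
LlamaIndexInstrumentor().instrument(tracer_provider=trace_provider)
```

Once this boilerplate runs at startup, no further changes to the application code are needed; the instrumentor hooks into LlamaIndex itself.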
register() sets up an OpenTelemetry tracer that ships spans to Future AGI. LlamaIndexInstrumentor().instrument() auto-instruments LlamaIndex so you get AI-aware spans (Embedding, Retriever, LLM, Index build) with rich attributes: model name, token usage, prompt, chunk metadata, latencies, and errors.

Documents are placed in the ./documents directory and indexed into a persistent ./vectorstore. This is handled by the following workflow:
Embeddings are generated with the text-embedding-3-large model. Whenever the document set changes, rebuild_index() is invoked to clear the old vectorstore and regenerate a fresh one.
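The rebuild step can be sketched as follows. This is a simplification of the app's own rebuild_index(): the paths are passed in explicitly here, and build_fn stands in for the actual LlamaIndex build (SimpleDirectoryReader → index → persist):

```python
import shutil
from pathlib import Path

def rebuild_index(docs_dir, persist_dir, build_fn):
    """Clear the old vectorstore and regenerate a fresh one.

    build_fn is a placeholder for the real LlamaIndex build step
    (SimpleDirectoryReader -> VectorStoreIndex -> persist).
    """
    docs_dir, persist_dir = Path(docs_dir), Path(persist_dir)
    if persist_dir.exists():
        shutil.rmtree(persist_dir)      # drop the stale index entirely
    persist_dir.mkdir(parents=True)
    build_fn(docs_dir, persist_dir)     # re-embed every document and persist
```

Deleting the whole persist directory before rebuilding avoids mixing stale and fresh embeddings in the same store.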
The respond() function orchestrates the pipeline:
Relevant chunks are retrieved from the vectorstore, and the LLM (the gpt-4o-mini model) generates an answer grounded in those retrieved chunks. The assistant attaches citations (file names, page numbers, similarity scores) from the top source nodes, ensuring every answer is traceable back to its evidence.
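The citation step might look like the following sketch. For simplicity it assumes source nodes are plain dicts with a score and a metadata mapping; in the real app they would be LlamaIndex NodeWithScore objects:

```python
def format_citations(source_nodes, top_k=3):
    """Build citation strings from the top retrieved source nodes.

    Each node is assumed (for this sketch) to be a dict with a 'score'
    and a 'metadata' mapping carrying file_name and page_number.
    """
    citations = []
    for node in source_nodes[:top_k]:
        meta = node.get("metadata", {})
        citations.append(
            f"{meta.get('file_name', 'unknown')} "
            f"(p. {meta.get('page_number', '?')}, "
            f"score {node.get('score', 0.0):.2f})"
        )
    return citations
```

Appending these strings to the answer is what makes each response auditable against its sources.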
A ChatMemoryBuffer maintains dialogue history. This allows follow-up questions to be condensed into standalone queries, making multi-turn conversations consistent and context-aware.
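ChatMemoryBuffer's core behavior, keeping only as much recent history as fits within a token budget, can be illustrated with a small sketch (whitespace word counts stand in for real tokenization):

```python
def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within a token budget,
    mirroring the idea behind ChatMemoryBuffer. Token counting here is
    a crude whitespace approximation, not a real tokenizer."""
    kept, total = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break                       # budget exhausted; drop older turns
        kept.append(msg)
        total += cost
    return list(reversed(kept))         # restore chronological order
```

Dropping the oldest turns first keeps recent context intact, which is what the question-condensing step relies on.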
Uploaded files are handled by upload_and_process(). A Gradio chat interface (gr.ChatInterface) lets users ask questions and receive grounded answers. With traceAI-llamaindex, each of these operations is automatically traced and turned into spans. This transforms the chatbot into a fully observable pipeline, where developers can validate whether answers are complete, grounded, and non-hallucinatory.
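A minimal version of the file-saving part of the upload step might look like this; the app's actual save_uploaded may differ, and the signature here (a list of source paths plus a target directory) is an assumption:

```python
from pathlib import Path

def save_uploaded(file_paths, docs_dir="./documents"):
    """Copy uploaded files into the documents directory and report each
    file's name and size -- the same details the upload span records."""
    docs_dir = Path(docs_dir)
    docs_dir.mkdir(parents=True, exist_ok=True)
    report = []
    for src in map(Path, file_paths):
        dest = docs_dir / src.name
        dest.write_bytes(src.read_bytes())  # persist the upload
        report.append((dest.name, dest.stat().st_size))
    return report
```

Returning the names and sizes makes it easy to surface exactly what was written, both in the UI and in the trace.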
- Upload handling (save_uploaded) – which files were added, how large they were, and whether writes to ./documents succeeded.
- Index rebuilds (rebuild_index) – deletion of the old ./vectorstore and creation of a new one.
- Index initialization (initialize_index) – SimpleDirectoryReader loading text, chunking documents into nodes, generating embeddings for each chunk, and persisting them.
- Embedding calls – the model used (text-embedding-3-large), input token count, and latency.
- Retrieval – the retrieved chunks and their metadata (file_name, page_number), so you can directly validate whether the retrieved evidence is relevant.
- LLM generation – every call (gpt-4o-mini by default) is captured in full: the constructed prompt (including condensed history), generation parameters (temperature, max tokens), token usage, latency, and the final output text.

These built-in evaluators provide strong coverage of the core failure modes in RAG pipelines: failing to answer the task, hallucinating unsupported facts, retrieving irrelevant context, ignoring retrieved content, or failing to attribute sources. Running them ensures a baseline level of quality monitoring across the system.

However, no two enterprises share identical requirements. Built-in evaluations are general-purpose, but in many cases domain-specific validation is needed. For example, a financial assistant may need to verify regulatory compliance, while a medical assistant must ensure responses align with clinical guidelines. This is where custom evaluations become essential. Future AGI supports creating custom evaluations that allow teams to define their own rules, scoring mechanisms, and validation logic. Custom evaluators are particularly useful when:
With traceAI-llamaindex, every step of the pipeline, from embeddings to retrieval to LLM output, became transparent and traceable. What was once a black-box chatbot turned into an explainable system where developers can see exactly how each answer is assembled, diagnose failures, and track performance over time.
Finally, we configured evaluations, and the results showed that while the chatbot reliably completes tasks and avoids hallucinations, retrieval quality remains the most critical factor to optimize. These insights help developers go beyond basic functionality and focus on quality, grounding, and trustworthiness.