Adding Reliability to Your LangChain/LangGraph Application with Future AGI
Learn how to enhance the reliability of your LangChain/LangGraph application by integrating Future AGI’s observability framework
Introduction
LLM applications often rely on agents that retrieve data, invoke tools, and respond to user queries, which can lead to unpredictable behavior. Ensuring that each response such an application produces in a production environment is complete, grounded, and reliable has become essential.
As these applications grow in complexity, simply returning an answer is no longer enough. Developers need visibility into how each response is generated, what tools were used, what data was retrieved, and how decisions were made. This level of transparency is critical for debugging, monitoring, and improving the reliability of such applications over time.
This tutorial demonstrates how to add reliability to your LLM application by incorporating evaluation and observability into your LangChain or LangGraph application using Future AGI’s instrumentation SDK.
Methodology
In this tutorial, we focus on building and evaluating a tool-augmented LLM agent capable of answering user queries using both its internal knowledge and real-time web search as shown in Fig 1. The objective is not just to generate responses, but to systematically monitor and assess their quality based on relevant metrics.
Fig 1: Framework for evaluating LangChain chatbot using Future AGI
To achieve this, we will build a conversational agent using LangGraph that combines OpenAI’s model with the Google Search API as a tool. The agent receives a user query and decides whether it can respond directly or needs a web search for up-to-date information. When the tool is required, it performs a real-time Google Search and incorporates the results into its response.
To monitor how the agent behaves at each step, we will use Future AGI’s `traceAI-langchain` Python package, which records detailed traces of the model’s reasoning, tool usage, and responses. These traces are then evaluated for quality aspects like completeness, groundedness, hallucination, and correct use of tools. Completeness ensures the answer fully addresses the user’s query, groundedness verifies that the response is based on retrieved evidence, hallucination detection flags unsupported or fabricated content, and the tool usage eval checks whether the agent invokes external tools appropriately and integrates results correctly. Together, these metrics help developers build agents that are not only intelligent, but also reliable, explainable, and production-ready.
Installing Required Packages
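Something like the following installs the dependencies used in this tutorial. Only `traceAI-langchain` is named explicitly in this cookbook; the other package names are assumptions based on a typical LangChain/LangGraph setup, so adjust them to your environment:

```bash
pip install traceai-langchain langgraph langchain langchain-openai langchain-community
```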
Importing Required Packages
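The imports below cover everything used in the rest of the tutorial. The `fi_instrumentation` and `traceai_langchain` import paths follow Future AGI’s traceAI documentation at the time of writing; if your installed version exposes different paths, adjust accordingly:

```python
import os

# Future AGI tracing and evaluation (paths as documented for traceAI)
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalName,
    EvalSpanKind,
    EvalTag,
    EvalTagType,
    ProjectType,
)
from traceai_langchain import LangChainInstrumentor

# LangChain / LangGraph building blocks
from langchain_community.utilities import GoogleSearchAPIWrapper
from langchain_core.messages import HumanMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, MessagesState, StateGraph
```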
Setting Up Environment
- Click here to learn how to access your `GOOGLE_API_KEY` and `GOOGLE_CSE_ID`
- Click here to access your `OPENAI_API_KEY`
- Click here to access your `FI_API_KEY` and `FI_SECRET_KEY`
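With the keys in hand, one simple way to expose them to the application is via environment variables, for example:

```python
# Replace the placeholders with your own keys; avoid committing real secrets to source control.
os.environ["GOOGLE_API_KEY"] = "<your-google-api-key>"
os.environ["GOOGLE_CSE_ID"] = "<your-google-cse-id>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["FI_API_KEY"] = "<your-future-agi-api-key>"
os.environ["FI_SECRET_KEY"] = "<your-future-agi-secret-key>"
```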
Instrumenting LangGraph Project
Instrumentation is the process of adding tracing to your LLM application. Tracing helps you monitor critical metrics like cost, latency, and evaluation results.
A span represents a single operation within an execution flow, recording input-output data, execution time, and errors; a trace connects multiple spans to represent the full execution flow of a request.
Instrumenting such a project requires three steps:
- **Setting Up Eval Tags:**
To evaluate traces, we will use the appropriate eval templates provided by Future AGI. Since we are building a tool-based chatbot agent, we will evaluate its behavior on these metrics:
- Completeness: Evaluates whether the response fully addresses the input query.
- Groundedness: Evaluates whether the response is firmly based on provided input context.
- LLM Function Calling: Evaluates whether the output correctly identifies the need for a tool call and whether it accurately includes the tool.
- Detect Hallucination: Evaluates whether the model fabricated facts or added information that was not present in the input.
While these are the metrics we use in this tutorial, Future AGI supports 50+ pre-built eval templates for different use-cases, such as context adherence if you want to evaluate how well the model’s response stays within the given context, or context retrieval quality if you want to measure the usefulness of the retrieved documents. You can also create a custom eval if the existing templates don’t fit your use-case.
Depending on your application’s requirements, additional metrics such as factual accuracy, chunk attribution, or stylistic quality can also be incorporated to provide a more comprehensive evaluation.
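A sketch of how these four evaluations could be expressed as eval tags is shown below. The `EvalName` members, `mapping` keys, and `custom_eval_name` values are illustrative; check Future AGI’s eval documentation for the exact identifiers:

```python
# Illustrative eval tag configuration; EvalName members and mapping keys are
# assumptions and may need to be adjusted to the exact names in the Future AGI docs.
eval_tags = [
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.AGENT,
        eval_name=EvalName.COMPLETENESS,
        config={},  # empty dict -> default configuration
        mapping={"input": "raw.input", "output": "raw.output"},
        custom_eval_name="Completeness",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.AGENT,
        eval_name=EvalName.GROUNDEDNESS,
        config={},
        mapping={"input": "raw.input", "output": "raw.output"},
        custom_eval_name="Groundedness",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.TOOL,
        eval_name=EvalName.EVAL_FUNCTION_CALLING,
        config={},
        mapping={"input": "raw.input", "output": "raw.output"},
        custom_eval_name="Tool_Calling",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.AGENT,
        eval_name=EvalName.DETECT_HALLUCINATION,
        config={},
        mapping={"input": "raw.input", "output": "raw.output"},
        custom_eval_name="Hallucination",
    ),
]
```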
The `eval_tags` list contains multiple instances of `EvalTag`. Each `EvalTag` represents a specific evaluation configuration to be applied at runtime, encapsulating all necessary parameters for the evaluation process:
- `type`: Specifies the category of the evaluation tag. In this cookbook, `EvalTagType.OBSERVATION_SPAN` is used.
- `value`: Defines the kind of operation the evaluation tag is concerned with. `EvalSpanKind.AGENT` targets operations involving the agent; `EvalSpanKind.TOOL` targets operations involving tools.
- `eval_name`: The name of the evaluation to be performed.
- `config`: A dictionary of evaluation-specific configuration. An empty dictionary means the default parameters are used.
- `mapping`: A dictionary that maps the inputs required by the evaluation to specific attributes of the operation.
- `custom_eval_name`: A user-defined name for the specific evaluation instance.
Click here to learn more about the evals provided by Future AGI
- **Setting Up Trace Provider:**
The trace provider is part of the traceAI ecosystem, an open-source package that enables tracing of AI applications and frameworks. It works in conjunction with OpenTelemetry to monitor code execution across different models, frameworks, and vendors.
To configure a `trace_provider`, we pass the following parameters to the `register` function (a snippet follows the list):
- `project_type`: Specifies the type of project. Here, `ProjectType.EXPERIMENT` is used, since this setup is geared towards experimenting with and evaluating the chatbot.
- `project_name`: A user-defined name for the project.
- `project_version_name`: The version name of the project, used to track different runs of the experiment.
- `eval_tags`: A list of evaluation tags that define the specific evaluations to be applied.
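Putting these parameters together, registering the trace provider might look like this (the project name and version are arbitrary labels):

```python
trace_provider = register(
    project_type=ProjectType.EXPERIMENT,       # experiment-style project for evaluating the chatbot
    project_name="langgraph-search-chatbot",   # any descriptive name
    project_version_name="v1",                 # bump to track separate runs
    eval_tags=eval_tags,
)
```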
- **Setting Up LangChain Instrumentor:**
This step integrates with the LangChain framework to collect telemetry data. The `instrument` method is called on the `LangChainInstrumentor` instance; it sets up instrumentation of the LangChain framework using the provided `tracer_provider`.
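In code, this final step is a single call, using the `trace_provider` from the previous step:

```python
# Route all LangChain/LangGraph telemetry through the Future AGI trace provider
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```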
Creating LangGraph Application
We start by setting up a Google Search tool using the `GoogleSearchAPIWrapper`. This tool acts as an external data source the agent can call when it needs current information. We then use `ChatOpenAI` with the `gpt-4o-mini` model and bind it to the search tool.
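A minimal version of that setup could look as follows; the tool name and description are our own choices, and `GoogleSearchAPIWrapper` reads the `GOOGLE_API_KEY` and `GOOGLE_CSE_ID` set earlier:

```python
search = GoogleSearchAPIWrapper()

@tool
def google_search(query: str) -> str:
    """Search Google for recent or up-to-date information."""
    return search.run(query)

# Chat model bound to the search tool so it can emit tool calls when needed
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_with_tools = llm.bind_tools([google_search])
```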
In LangGraph, each step in the agent’s logic is represented as a node in a graph. Each node handles a specific task, and the application moves from one node to another depending on the current state of the conversation. In our chatbot, we define three main nodes (a code sketch follows the list):
- Agent Node: This is the primary reasoning step. It receives the current conversation history, optionally includes past tool results, and generates a response or triggers a tool call.
- Tool Node: If the agent requests a tool, this node executes the Google Search and appends the result to the conversation context. It also logs the intermediate interaction.
- Final Node: If no further tools are needed, this node finalises the answer and returns it to the user.
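The three nodes can be sketched roughly as below, using LangGraph’s prebuilt `MessagesState` to carry the conversation history; the exact prompt handling and logging in the full cookbook may differ:

```python
def agent_node(state: MessagesState) -> dict:
    """Primary reasoning step: answer directly or request a tool call."""
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}


def tool_node(state: MessagesState) -> dict:
    """Run the Google Search requested by the agent and append the results."""
    last_message = state["messages"][-1]
    tool_messages = []
    for tool_call in last_message.tool_calls:
        result = google_search.invoke(tool_call["args"])
        tool_messages.append(ToolMessage(content=result, tool_call_id=tool_call["id"]))
    return {"messages": tool_messages}


def final_node(state: MessagesState) -> dict:
    """Compose the final, user-facing answer once no further tools are needed."""
    response = llm.invoke(state["messages"])
    return {"messages": [response]}
```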
A `router` function then checks whether the agent has requested a tool. If it has, the flow moves to the tool node; if not, the agent proceeds directly to the final node to generate the response. This allows the agent to make decisions dynamically based on the query.
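A simple router along these lines inspects the last message for tool calls:

```python
def router(state: MessagesState) -> str:
    """Send the flow to the tool node if the agent requested a tool, otherwise finish."""
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "final"
```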
We then combine all the nodes into a complete graph using `StateGraph`. This graph keeps track of the message history and tool results as the conversation progresses. Finally, we test the chatbot by running it on a few sample queries.
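Wiring the nodes together and running a sample query might look like this; the query itself is just an example:

```python
graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_node("final", final_node)

graph.set_entry_point("agent")
graph.add_conditional_edges("agent", router, {"tools": "tools", "final": "final"})
graph.add_edge("tools", "agent")  # after a search, the agent interprets the results
graph.add_edge("final", END)

app = graph.compile()

# Example query; every step is traced and evaluated via the instrumentation set up earlier
result = app.invoke(
    {"messages": [HumanMessage(content="What did the latest LangGraph release add?")]}
)
print(result["messages"][-1].content)
```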
Fig 2 shows the LangGraph execution hierarchy as a tree, displaying the full call stack. It starts with the agent node, which uses the GPT-4o-mini model (`ChatOpenAI`) to interpret the user’s query. The model decides to use the tool node, which performs a Google Search (`google_search`) using LangChain’s wrapper. After fetching results, control returns to the agent node to interpret the tool response. Finally, the system reaches the final node, which generates the output. The bottom panel shows the results of the evals applied at the span level.
Fig 2: Future AGI dashboard for visualising traces and evals
Fig 3 shows an aggregated view of all spans, including the average latency, token usage and cost, along with evaluation scores. These scores provide quick insight into the quality of the agent’s behavior. In this example, the agent achieved 100% pass rates on Tool_Calling, Hallucination, and Groundedness, indicating correct tool usage, factual accuracy, and strong contextual grounding. However, the Completeness score is only 50%, suggesting that some responses did not fully address the user’s query.
Fig 3: Aggregated scores of evals
Conclusion
In this tutorial, we demonstrated how to build a trustworthy and reliable LangGraph-based conversational agent by combining an OpenAI model with the Google Search API. To ensure transparency and reliability, we integrated Future AGI’s evaluation and tracing framework, which allowed us to automatically capture detailed execution traces and assess the agent’s behavior.
Ready to Make your LangChain Application Reliable?
Start evaluating your LangChain/LangGraph applications with confidence using Future AGI’s observability framework. Future AGI provides the tools you need to build applications that are reliable, explainable, and production-ready.
Click here to schedule a demo with us now!