Production-Grade Observability for AI Agents: A Minimal-Code, Configuration-First Approach


As AI agents grow more complex, traditional logging and monitoring fall short. What teams actually need is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.

Teams should therefore adopt the right observability platform and integrate their application with it at minimal overhead to the functional code, so they can focus on the core task of building and improving agent orchestration. In this article, I will demonstrate how you can set up an open-source AI observability platform to do the following with a minimal-code approach:

  • LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination, and more. Display scores across runs with detailed logs and analytics.
  • Testing at scale: Set up datasets to store regression test cases for measuring accuracy against expected ground-truth responses. Proactively detect LLM and agent drift.
  • MELT data: Track metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage), and logs (user interactions, tool executions, agent decisions), along with detailed traces, all without writing custom telemetry or instrumentation code.

We will use Langfuse for observability. It is open-source and framework-agnostic, and works with popular orchestration frameworks and LLM providers.
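As a quick illustration of how light the integration is, here is a minimal sketch (assuming the langfuse and langchain-openai packages are installed, and that the Langfuse keys and Azure OpenAI credentials are set as environment variables): attaching the Langfuse CallbackHandler is essentially the only instrumentation the application code needs.

# Minimal integration sketch. Assumes `pip install langfuse langchain-openai`
# and that LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, AZURE_OPENAI_ENDPOINT and
# AZURE_OPENAI_API_KEY are set in the environment.
import os

from langchain_openai import AzureChatOpenAI
from langfuse.langchain import CallbackHandler

langfuse_callback = CallbackHandler()

# Attaching the callback is the only instrumentation step; every call made
# through this client is traced to Langfuse (inputs, outputs, latency, tokens).
llm = AzureChatOpenAI(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    callbacks=[langfuse_callback],
)

print(llm.invoke("Say hello to the observability demo.").content)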

Multi-agent application

For this demonstration, I have attached the LangGraph code of a Customer Service application. The application accepts tickets from the user, classifies each ticket as Technical, Billing, or Both using a Triage agent, and routes it to the Technical Support agent, the Billing Support agent, or both. A Finalizer agent then synthesizes the agent responses into a coherent, more readable answer. The flowchart is as follows:

Customer Service agentic application
The code is attached here:
# --------------------------------------------------
# 0. Load .env
# --------------------------------------------------
from dotenv import load_dotenv
load_dotenv(override=True)

# --------------------------------------------------
# 1. Imports
# --------------------------------------------------
import os
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import AzureChatOpenAI

from langfuse import Langfuse
from langfuse.langchain import CallbackHandler

# --------------------------------------------------
# 2. Langfuse Client (WORKING CONFIG)
# --------------------------------------------------
langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)
langfuse_callback = CallbackHandler()
os.environ["LANGGRAPH_TRACING"] = "false"


# --------------------------------------------------
# 3. Azure OpenAI Setup
# --------------------------------------------------
llm = AzureChatOpenAI(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
    callbacks=[langfuse_callback],  # 🔑 enables token usage
)

# --------------------------------------------------
# 4. Shared State
# --------------------------------------------------
class AgentState(TypedDict, total=False):
    ticket: str
    category: str
    technical_response: str
    billing_response: str
    final_response: str

# --------------------------------------------------
# 5. Agent Definitions
# --------------------------------------------------

def triage_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="triage_agent",
        input={"ticket": state["ticket"]},
    ) as span:
        span.update_trace(name="Customer Service Query - LangGraph Demo") 

        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "Classify the query as one of: "
                    "Technical, Billing, Both. "
                    "Respond with only the label."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])

        raw = response.content.strip().lower()

        if "both" in raw:
            category = "Both"
        elif "technical" in raw:
            category = "Technical"
        elif "billing" in raw:
            category = "Billing"
        else:
            category = "Technical"  # ✅ safe fallback

        span.update(output={"raw": raw, "category": category})

        return {"category": category}



def technical_support_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="technical_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:

        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a technical support specialist. "
                    "Provide a clear, step-by-step solution."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])

        answer = response.content

        span.update(output={"technical_response": answer})

        return {"technical_response": answer}


def billing_support_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="billing_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:

        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a billing support specialist. "
                    "Answer clearly about payments, invoices, or accounts."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])

        answer = response.content

        span.update(output={"billing_response": answer})

        return {"billing_response": answer}

def finalizer_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="finalizer_agent",
        input={
            "ticket": state["ticket"],
            "technical": state.get("technical_response"),
            "billing": state.get("billing_response"),
        },
    ) as span:

        parts = []
        if state.get("technical_response"):
            parts.append(f"Technical:\n{state['technical_response']}")
        if state.get("billing_response"):
            parts.append(f"Billing:\n{state['billing_response']}")

        if not parts:
            final = "Error: No agent responses available."
        else:
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "Combine the following agent responses into ONE clear, professional, "
                        "customer-facing answer. Do not mention agents or internal labels. "
                        f"Answer the user's query: '{state['ticket']}'."
                    ),
                },
                {"role": "user", "content": "\n\n".join(parts)},
            ])
            final = response.content

        span.update(output={"final_response": final})
        return {"final_response": final}


# --------------------------------------------------
# 6. LangGraph Construction 
# --------------------------------------------------
builder = StateGraph(AgentState)

builder.add_node("triage", triage_agent)
builder.add_node("technical", technical_support_agent)
builder.add_node("billing", billing_support_agent)
builder.add_node("finalizer", finalizer_agent)

builder.set_entry_point("triage")

# Conditional routing
builder.add_conditional_edges(
    "triage",
    lambda state: state["category"],
    {
        "Technical": "technical",
        "Billing": "billing",
        "Both": "technical",
        "__default__": "technical",  # ✅ never dead-end
    },
)

# Sequential resolution
builder.add_conditional_edges(
    "technical",
    lambda state: state["category"],
    {
        "Both": "billing",         # Proceed to billing if Both
        "__default__": "finalizer",
    },
)
builder.add_edge("billing", "finalizer")
builder.add_edge("finalizer", END)

graph = builder.compile()


# --------------------------------------------------
# 7. Main
# --------------------------------------------------
if __name__ == "__main__":

    print("===============================================")
    print(" Conditional Multi-Agent Support System (Ready)")
    print("===============================================")
    print("Enter 'exit' or 'quit' to stop the program.\n")
    
    while True:
        # Get user input for the ticket
        ticket = input("Enter your support query (ticket): ")

        # Check for exit command
        if ticket.lower() in ["exit", "quit"]:
            print("\nExiting the support system. Goodbye!")
            break

        if not ticket.strip():
            print("Please enter a non-empty query.")
            continue
            
        try:
            # --- Run the graph with the user's ticket ---
            result = graph.invoke(
                {"ticket": ticket},
                config={"callbacks": [langfuse_callback]},
            )

            # --- Print Results ---
            category = result.get('category', 'N/A')
            print(f"\n✅ Triage Classification: **{category}**")
            
            # Check which agents were executed based on the presence of a response
            executed_agents = []
            if result.get("technical_response"):
                executed_agents.append("Technical")
            if result.get("billing_response"):
                executed_agents.append("Billing")
            
            
            print(f"🛠 Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")

            print("\n================ FINAL RESPONSE ================\n")
            print(result["final_response"])
            print("\n" + "="*60 + "\n")

        except Exception as e:
            # This is important for debugging: print the exception type and message
            print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
            print("\nPlease try another query.")
            print("\n" + "="*60 + "\n")

Observability Configuration

To set up Langfuse, go to https://cloud.langfuse.com/ and create an account under a billing tier (a Hobby tier with generous limits is available), then create a Project. In the project settings, you can generate the public and secret keys that the code reads at startup. You also need to add an LLM connection, which will be used for the LLM-as-a-Judge evaluations.

Langfuse project set up
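For reference, the environment variables the demo code expects can be kept in a .env file along these lines (the values below are placeholders; AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY are the variables AzureChatOpenAI conventionally reads):

# .env (placeholder values)
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxx
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_DEPLOYMENT_NAME=<your-deployment-name>
AZURE_OPENAI_API_VERSION=2025-01-01-preview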

LLM-as-a-Judge setup

This is the core of the performance evaluation setup for agents. Here you can configure pre-built evaluators from the Evaluator Library, which score responses on criteria such as Conciseness, Correctness, Hallucination, and Answer Critic. These should suffice for most use cases; otherwise, custom evaluators can also be set up. Here is a view of the Evaluator Library:

Evaluator library

Select the evaluator you wish to use, say Relevance. You can choose to run it for new or existing traces, or for Dataset runs. Also review the evaluation prompt to ensure it satisfies your evaluation objective. Most importantly, the query, generation, and other variables should be correctly mapped to their sources (usually the Input and Output of the application trace). In our case, these are the ticket entered by the user and the response generated by the Finalizer agent, respectively. For Dataset runs, you can additionally compare the generated responses to the ground-truth responses stored as expected outputs (explained in the next sections).

Here is the configuration for the ‘GT Accuracy’ evaluation I set up for new Dataset runs, along with the Variable mapping. The evaluation prompt preview is also depicted. Most of the evaluators score within a range of 0 to 1:

Evaluator setup
Evaluator prompt

For the customer service demo, I have configured three evaluators: Relevance and Conciseness, which run for all new traces, and GT Accuracy, which runs only for Dataset runs.

Active evaluators

Datasets setup

Create a dataset to use as a test case repository. Here you can store test cases with the input query and the ideal expected response. There are three ways to populate the dataset: create records one at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs directly from application traces whose responses human experts have judged to be of good quality.
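If you prefer to seed the dataset from code rather than through the UI, the Langfuse Python SDK also exposes dataset methods. Here is a rough sketch, assuming the SDK's create_dataset and create_dataset_item methods (check the docs for your installed version) and purely illustrative test cases:

# Sketch: seeding a regression dataset programmatically (illustrative data).
# Assumes the Langfuse Python SDK's create_dataset / create_dataset_item
# methods and the same LANGFUSE_* environment variables as before.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env

langfuse.create_dataset(name="Regression")

test_cases = [
    {
        "input": {"ticket": "I was charged twice for my subscription this month."},
        "expected_output": "Acknowledge the duplicate charge and explain the refund process.",
    },
    {
        "input": {"ticket": "The app crashes every time I open the settings page."},
        "expected_output": "Ask for the app version and device, then provide reinstall and cache-clearing steps.",
    },
]

for case in test_cases:
    langfuse.create_dataset_item(
        dataset_name="Regression",
        input=case["input"],
        expected_output=case["expected_output"],
    )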

Here is the dataset I have created for the demo. It contains a mix of Technical, Billing, and ‘Both’ queries, and I have created all the records from application traces:

Dataset view

That’s it! The configuration is done and we are ready to run observability.

Observability Results

The Langfuse Home page is a dashboard of several useful charts. It shows, at a glance, the count of execution traces, scores and their averages, traces over time, model usage and cost, and more.

Observability overview dashboard

MELT data

The most useful observability data is available under the ‘Tracing’ option, which displays summarized and detailed views of all executions. Here is a view of the dashboard showing the time, name, input, output, and the crucial latency and token usage metrics. Note that for every execution of our application, two evaluation traces are generated, one each for the Conciseness and Relevance evaluators we set up.

Tracing overview
Conciseness and Relevance evaluation runs for each application execution

Let’s look at the details of one execution of the Customer Service application. On the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls along with their token usage. If our agents had tool calls or human-in-the-loop steps, they would be depicted here as well. The evaluation scores for Conciseness and Relevance are also shown at the top (0.40 and 1, respectively, for this run). Clicking on a score reveals the reasoning behind it and a link to the evaluator trace.

On the right, for each agent, LLM call, and tool call, we can see the input and the generated output. For instance, here we see that the query was categorized as ‘Both’, and the chart on the left accordingly shows that both the Technical and Billing support agents were called, confirming that our flow works as expected.

Multi-agent trace

At the top of the right-hand panel is the ‘Add to datasets’ button. At any step of the tree, clicking it opens a panel like the one shown below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful way for human experts to add frequently occurring user queries and good responses to the dataset during normal agent operations, building a regression test repository with minimal effort. Later, when there is a major upgrade or release of the application, the regression dataset can be run and the generated outputs scored against the expected outputs (ground truth) recorded here, using the ‘GT Accuracy’ evaluator we created during the LLM-as-a-Judge setup. This helps detect LLM (or agent) drift early so corrective steps can be taken.

Add to Dataset

Here is one of the evaluation traces (Conciseness) for this application trace. The evaluator provides the reasoning behind the score of 0.4 it assigned to this response.

Evaluator reasoning

Scores

The Scores option in Langfuse shows a list of all evaluation runs from the various active evaluators, along with their scores. Even more useful is the Analytics dashboard, where two scores can be selected to view metrics such as mean and standard deviation, along with trend lines.

Scores dashboard
Score analytics
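Evaluator-generated scores are not the only kind Langfuse can track; you can also push your own scores (for example, explicit user feedback) from code so they appear alongside the LLM-as-a-Judge results. A rough sketch, assuming the create_score method of the current Python SDK and a placeholder trace ID:

# Sketch: attaching a custom user-feedback score to an existing trace.
# Assumes the Langfuse Python SDK's create_score method; the trace ID below
# is a placeholder for one captured from a real run.
from langfuse import Langfuse

langfuse = Langfuse()

langfuse.create_score(
    trace_id="<trace-id-of-the-run>",   # placeholder
    name="user_feedback",
    value=1,                            # e.g. 1 = helpful, 0 = not helpful
    comment="Customer confirmed the billing explanation resolved the issue.",
)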

Regression testing

With the dataset in place, we are ready to run regression tests using the test case repository of queries and expected outputs. We have stored four queries in our Regression dataset, a mix of Technical, Billing, and ‘Both’ queries.

For this, we can run the code below, which fetches the relevant dataset and runs the experiment. All test runs are logged along with their average scores. The results of a selected run, with Conciseness, GT Accuracy, and Relevance scores for each test case, can be viewed in one dashboard, and the detailed trace can be opened as needed to see the reasoning behind each score.

You can view the code here.
import os

from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse

# Load environment variables and initialize the Langfuse client
load_dotenv(override=True)

langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

llm = AzureChatOpenAI(
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
)

# Define the task function: run each dataset item's ticket through the LLM
def my_task(*, item, **kwargs):
    question = item.input["ticket"]
    response = llm.invoke([{"role": "user", "content": question}])
    return response.content.strip()
 
# Get dataset from Langfuse
dataset = langfuse.get_dataset("Regression")
 
# Run experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=my_task # see above for the task definition
)
 
# Use format method to display results
print(result.format())
Test runs
Scores for a test run

Key Takeaways

  • AI observability does not need to be code-heavy.
    Most evaluation, tracing, and regression testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
  • Rich evaluation workflows can be defined declaratively.
    Capabilities such as LLM-as-a-Judge scoring (relevance, conciseness, hallucination, ground-truth accuracy), variable mapping, and evaluation prompts are configured directly in the observability platform—without writing bespoke evaluation logic.
  • Datasets and regression testing are configuration-first features.
    Test case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal additional code.
  • Full MELT observability comes “out of the box.”
    Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
  • Minimal instrumentation, maximum visibility.
    With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends—freeing developers to focus on agent logic rather than observability plumbing.

Conclusion

As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly turn into black boxes that are difficult to evaluate, debug, and improve.

An AI observability platform shifts this burden away from developers and application code. Using a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.

By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance—making AI systems easier to trust, evolve, and operate at scale.

Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which does not work in CrewAI. Read about how observability helped me fix this well-known issue with CrewAI's hierarchical manager-worker process, by tracing agent responses at each step and refining them to get the orchestration to work as it should. Full analysis here: Why CrewAI’s Manager-Worker Architecture Fails — and How to Fix It

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

All images and data used in this article are synthetically generated. Figures and code were created by me.


