LangGraph 2.0: Stateful Multi-Agent Orchestration Framework
The definitive guide to LangGraph 2.0 -- the graph-based framework for building stateful, multi-step AI agents. From core concepts (StateGraph, nodes, edges, conditional routing) to checkpointing, human-in-the-loop patterns, multi-agent orchestration (supervisor, swarm), LangGraph Platform deployment, LangSmith tracing, and production patterns for error handling, retries, and observability.
1. What Is LangGraph?
LangGraph is an open-source framework by LangChain for building stateful, multi-step AI agents as directed graphs. Instead of writing linear chains or monolithic agent loops, you model your agent's workflow as a graph of nodes (functions) and edges (transitions), with typed state flowing through every step. This gives you explicit control over branching, looping, parallelism, and error handling -- the exact properties that production agents need but ad-hoc agent loops lack.
The framework is model-agnostic: it works with any LLM provider (OpenAI, Anthropic, Google, open-source models via Ollama). LangGraph is available for both Python and TypeScript, with official SDKs maintained by the LangChain team. As of April 2026, the Python package (langgraph on PyPI) is the most widely adopted agent orchestration library in the Python ecosystem, with the broader LangChain organization surpassing 126,000 GitHub stars across its repositories.
LangGraph sits at a specific layer in the AI agent stack: it handles orchestration (what runs when, in what order, with what state) but delegates model calls, tool execution, and external integrations to whatever libraries you prefer. You can use LangChain's abstractions, call provider SDKs directly, or mix both. This composability is why LangGraph works well alongside Claude Agent SDK agents, MCP servers, and other frameworks rather than replacing them.
2. LangGraph 2.0: What Changed
LangGraph 2.0 (February 2026) marks the transition from rapid iteration to production stability. The release codifies patterns that emerged from thousands of production deployments into stable, documented APIs with backward-compatibility guarantees within the major version. All APIs without experimental prefixes are now considered stable and production-ready.
Key changes in 2.0 include: stable checkpointer interfaces with standardized serialization across all backends (SQLite, PostgreSQL, Redis); first-class interrupt/Command primitives for human-in-the-loop workflows, replacing the older breakpoint API; refined streaming with token-level and node-level modes; improved subgraph composition for multi-agent architectures; and typed state reducers that let multiple nodes update shared state without conflicts.
The release policy now follows semantic versioning strictly: major releases every 6-12 months for stability, minor releases every 1-2 months for features, and patch releases weekly for fixes. This maturity makes LangGraph viable for enterprise teams that need predictable upgrade paths and long-term support.
3. Core Concepts: StateGraph, Nodes, Edges
LangGraph models every agent workflow as a StateGraph -- a directed graph where typed state flows through nodes connected by edges. Understanding these three primitives is the foundation for everything else in the framework.
StateGraph
The top-level container for your agent workflow. You define a typed state schema (using TypedDict or Pydantic), then add nodes and edges. The state schema declares every field your agent tracks -- messages, tool results, counters, flags -- along with reducer functions that control how concurrent updates merge. Compile the graph to get a runnable that processes inputs through the graph.
Nodes
Nodes are functions (sync or async) that receive the current state, perform work (LLM calls, tool execution, data processing), and return state updates. Each node runs independently and communicates only through state. This isolation makes nodes testable, retryable, and composable. Special nodes include START (entry point) and END (terminal).
Edges
Edges define transitions between nodes. A direct edge always routes from A to B. A conditional edge uses a function to inspect the current state and return the name of the next node (or END to stop). This is how agents make decisions: the LLM output is written to state, then a conditional edge reads it and routes accordingly.
State Reducers
Reducers control how node outputs merge into shared state. The default reducer overwrites the field. The add_messages reducer appends to a message list. Custom reducers handle counters, sets, deduplication, or any merge logic. Reducers prevent race conditions when parallel nodes update the same field, making concurrent execution safe.
```python
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    next_action: str

def call_model(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    last = state["messages"][-1]
    if last.tool_calls:
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", tool_node)
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
app = graph.compile()
```
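Reducers are just two-argument merge functions, so they are easy to sketch and unit-test in isolation. The `dedup_merge` reducer below is hypothetical (not part of LangGraph), shown alongside the common pattern of reusing `operator.add` to concatenate lists:

```python
import operator
from typing import Annotated, TypedDict

def dedup_merge(existing: list, new: list) -> list:
    """Hypothetical custom reducer: append only items not already present."""
    seen = set(existing)
    return existing + [item for item in new if item not in seen]

class ResearchState(TypedDict):
    # Stock reducer: concatenates updates from parallel nodes
    sources: Annotated[list, operator.add]
    # Custom reducer: parallel nodes can contribute URLs without duplicates
    urls: Annotated[list, dedup_merge]
```

Because the reducer is a plain function, it can be tested without compiling a graph, which keeps merge logic for concurrent branches verifiable in isolation.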
4. Checkpointing and State Persistence
LangGraph's built-in persistence layer saves a snapshot of the graph state at every step of execution, organized into threads. When you compile a graph with a checkpointer, every node execution creates a checkpoint that can be inspected, replayed, or resumed. This is the foundation for conversational memory, human-in-the-loop workflows, fault tolerance, and time-travel debugging.
Three production-grade checkpointer backends are available:
SQLite (SqliteSaver)
File-based persistence ideal for local development, prototyping, and single-process deployments. Zero configuration -- just pass a file path. Supports async via aiosqlite. Use this for experimentation and workflows that do not need to share state across processes.
PostgreSQL (PostgresSaver)
The recommended backend for production deployments. Used internally by LangSmith. Supports concurrent access, ACID transactions, and scales with your existing PostgreSQL infrastructure. Async via asyncpg. Pair with connection pooling (PgBouncer) for high-throughput agent workloads.
Redis (RedisSaver)
High-performance in-memory persistence for latency-sensitive agents. The v0.1.0 release (2026) is a complete redesign optimizing checkpoint data structures for Redis's in-memory model. Ideal for real-time conversational agents, chatbots, and workflows where sub-millisecond state access matters. Supports Redis Cluster for horizontal scaling.
```python
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

# Local development
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
    result = app.invoke(
        {"messages": [("user", "Plan my trip")]},
        config={"configurable": {"thread_id": "trip-123"}}
    )

# Production with PostgreSQL (the async saver supports `async with` / `await`;
# run `await checkpointer.setup()` once to create the tables)
async with AsyncPostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
    # Resume from last checkpoint
    state = await app.aget_state({"configurable": {"thread_id": "trip-123"}})
```
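To build intuition for how threads and checkpoints relate, here is a toy in-memory model. It is deliberately not the LangGraph checkpointer interface, just the core idea: every step appends a snapshot under a thread_id, and resuming reads the latest one back.

```python
from collections import defaultdict
from itertools import count

class InMemoryCheckpointStore:
    """Toy model, NOT the LangGraph checkpointer interface:
    checkpoints are grouped by thread and appended at every step."""

    def __init__(self):
        self._threads = defaultdict(list)
        self._ids = count(1)

    def put(self, thread_id: str, state: dict) -> int:
        """Snapshot the state; returns the new checkpoint id."""
        checkpoint_id = next(self._ids)
        self._threads[thread_id].append((checkpoint_id, dict(state)))
        return checkpoint_id

    def latest(self, thread_id: str):
        """State to resume from, or None if the thread has no history."""
        history = self._threads[thread_id]
        return history[-1][1] if history else None

    def history(self, thread_id: str) -> list:
        """All (checkpoint_id, state) pairs, oldest first -- time travel."""
        return list(self._threads[thread_id])

store = InMemoryCheckpointStore()
store.put("trip-123", {"step": 1, "city": None})
store.put("trip-123", {"step": 2, "city": "Lisbon"})
```

The real backends add serialization, concurrency control, and durability on top of this shape, but the thread-keyed append-only history is the essential contract.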
5. Human-in-the-Loop Patterns
LangGraph provides first-class primitives for pausing agent execution, presenting information to a human, and resuming with their input. This is built on top of the checkpointing system: when a graph hits an interrupt() call, it saves state to the checkpointer and returns control to the caller. The caller collects human input and resumes the graph with a Command containing the response.
Three primary patterns cover most human-in-the-loop scenarios:
Approve / Reject
Pause before a critical action (API call, database write, email send) and wait for human approval. If approved, execution continues. If rejected, the graph routes to an alternative path or terminates. This pattern is essential for high-stakes operations where autonomous execution is too risky.
Review and Edit
Present the agent's proposed action or output to a human who can modify it before execution continues. The human edits the state directly -- changing tool arguments, rewriting a draft, or correcting extracted data. The graph resumes with the edited state as if the agent had produced it.
Collect Input
The agent determines it needs information it cannot obtain autonomously and pauses to ask the human. This handles clarification questions, preference selection, credential entry, and multi-step forms. The interrupt carries a structured prompt that the UI renders appropriately.
```python
from langgraph.types import interrupt, Command

def sensitive_action(state):
    # Pause and ask human for approval
    decision = interrupt({
        "action": "delete_records",
        "count": state["record_count"],
        "question": "Approve deletion of these records?"
    })
    if decision["approved"]:
        return execute_deletion(state)
    return {"messages": [("system", "Deletion cancelled by user.")]}

# Resume with human input
app.invoke(
    Command(resume={"approved": True}),
    config={"configurable": {"thread_id": "cleanup-456"}}
)
```
6. Multi-Agent Orchestration
LangGraph provides three architectures for coordinating multiple specialized agents. Each architecture trades off between control and autonomy, and you can combine them in hierarchies where a supervisor delegates to swarms or sequential chains.
Supervisor Pattern
A central supervisor agent routes tasks to specialized worker agents based on the current state. The supervisor is an LLM that decides which worker to invoke next, but never executes tools itself. Workers are simple, single-purpose agents (researcher, coder, reviewer). The langgraph-supervisor library provides a ready-made implementation. This pattern is best when you need structured, predictable workflows with clear delegation.
Swarm Pattern
Agents operate autonomously in a decentralized network, observing a shared workspace and contributing when their expertise is relevant. Unlike supervisor-based systems, swarm agents communicate directly via handoffs, reducing bottlenecks and enabling parallelization. The langgraph-swarm library implements this pattern. Best for emergent problem-solving where the optimal sequence of agents is not known in advance.
Collaborative Pattern
A hybrid that blends supervisor structure with swarm flexibility. A supervisor handles high-level routing while allowing worker agents to hand off to each other for sub-tasks. This works well for complex workflows where the overall structure is known but individual steps require adaptive collaboration between specialists.
```python
from langgraph_supervisor import create_supervisor
from langgraph.prebuilt import create_react_agent

# Specialized worker agents (named so the supervisor can address them)
researcher = create_react_agent(model, tools=[search, wiki], name="researcher")
coder = create_react_agent(model, tools=[run_code, read_file], name="coder")
reviewer = create_react_agent(model, tools=[lint, test], name="reviewer")

# Supervisor orchestrates workers
supervisor = create_supervisor(
    model=model,
    agents=[researcher, coder, reviewer],
    prompt="Route tasks to the appropriate specialist."
)
app = supervisor.compile()
```
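The control flow of the supervisor pattern reduces to: an LLM picks a worker name, the framework dispatches to that worker, and the result flows back. A toy sketch with a rule-based stand-in for the routing LLM (all names and rules here are illustrative, not the langgraph-supervisor internals):

```python
def route(task: str) -> str:
    """Stand-in for the supervisor LLM's routing decision."""
    if "implement" in task or "fix" in task:
        return "coder"
    if "research" in task or "find" in task:
        return "researcher"
    return "reviewer"

# Single-purpose workers keyed by the names the router returns
WORKERS = {
    "researcher": lambda task: f"notes on: {task}",
    "coder": lambda task: f"patch for: {task}",
    "reviewer": lambda task: f"review of: {task}",
}

def supervise(task: str) -> str:
    """Dispatch the task to whichever worker the router picked."""
    return WORKERS[route(task)](task)
```

In the real pattern the router's decision is an LLM call whose output is written to state, and a conditional edge reads that state to pick the next node, so routing stays inspectable in traces.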
7. LangGraph Platform (LangSmith Deployment)
LangGraph Platform (renamed to LangSmith Deployment in October 2025) is the managed infrastructure layer for deploying and scaling long-running, stateful agents. It handles the operational complexity that production agents demand: persistent state across requests, horizontal scaling, fault recovery, and monitoring -- without requiring you to build this infrastructure yourself.
Four deployment options are available:
Cloud SaaS
Fully managed, hosted within LangSmith. The fastest path from development to production -- deploy directly from the LangSmith UI with automatic updates and zero maintenance. Best for teams that want to focus on agent logic without managing infrastructure.
Bring Your Own Cloud (BYOC)
Run LangGraph Platform in your VPC while LangChain handles provisioning and maintenance. Your data stays in your environment, but you get managed upgrades, scaling, and monitoring. Ideal for teams with data residency requirements or existing cloud commitments.
Self-Hosted Enterprise
Deploy entirely on your own infrastructure for maximum control. Run on Kubernetes with Docker containers. You manage upgrades, scaling, and security. Best for organizations with strict compliance requirements or air-gapped environments.
Self-Hosted Lite
A free, limited version of LangGraph Platform (up to 1 million node executions). Run locally or self-hosted for development, testing, and small-scale production. No license required. A practical way to evaluate the platform before committing to a paid tier.
8. Tracing with LangSmith
LangSmith is the observability platform for LangGraph agents. When you set LANGCHAIN_TRACING_V2=true and provide an API key, every graph execution is automatically traced -- no custom instrumentation required. Traces capture the full execution tree: which nodes ran, in what order, with what inputs and outputs, how long each step took, and how many tokens were consumed.
The trace view in LangSmith shows a hierarchical tree representing the complete execution. You can drill into individual node runs, inspect state at each checkpoint, view LLM prompts and completions, and see tool call arguments and results. For production debugging, you can filter traces by latency, error status, token usage, or custom metadata. The Insights Agent (available for self-hosted LangSmith) automatically analyzes traces to detect usage patterns, common agent behaviors, and failure modes.
LangSmith is not limited to LangGraph -- it traces applications built with the OpenAI SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, or custom implementations. But the integration with LangGraph is deepest: server logs from LangSmith Deployment are linked directly to trace views, giving you a single window into both application-level behavior and infrastructure-level events.
```python
# Enable tracing -- that's it
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."
os.environ["LANGCHAIN_PROJECT"] = "my-agent"

# Every graph.invoke() and graph.astream() is now traced
result = app.invoke(
    {"messages": [("user", "Analyze Q1 revenue")]},
    config={"configurable": {"thread_id": "analysis-789"}}
)
# View trace at: https://smith.langchain.com/
```
9. Framework Comparison
LangGraph occupies a specific position in the agent framework landscape. Here is how it compares to the other major frameworks as of April 2026:
| Dimension | LangGraph | Claude Agent SDK | OpenAI Agents SDK | Google ADK | CrewAI |
|---|---|---|---|---|---|
| Orchestration | Directed graph with conditional edges | Tool-use chains with sub-agents | Explicit handoffs between agents | Hierarchical agent tree | Role-based crews with process types |
| Model Support | Fully model-agnostic | Claude models only | OpenAI models only | Optimized for Gemini, supports others | Fully model-agnostic |
| State Persistence | Built-in checkpointing with time travel | MCP server state | Context variables (ephemeral) | Session state with pluggable backends | Crew memory with configurable stores |
| Human-in-the-Loop | First-class interrupt/Command primitives | MCP elicitation | Manual via guardrails | Built-in approval steps | Human input tool |
| Observability | LangSmith (deep integration) | Built-in tracing | Built-in tracing | Cloud Trace, Cloud Logging | Community integrations |
| Learning Curve | Medium (graph concepts) | Medium (tool-use patterns) | Low (clean opinionated API) | Medium (GCP ecosystem) | Low (role-based DSL) |
| Best For | Complex stateful workflows, precise control | MCP-native development, safety-first | Fast prototyping with OpenAI models | GCP-native, multimodal, A2A protocol | Multi-agent collaboration, rapid prototyping |
The frameworks are increasingly interoperable rather than mutually exclusive. Google ADK can treat a LangGraph agent as an AgentTool, LangGraph can call ADK agents as subgraphs via API, and both support the MCP protocol for tool integration. Choose based on your primary constraint: LangGraph for complex stateful orchestration, Claude Agent SDK for MCP-native safety-first development, OpenAI SDK for the fastest path to a working agent, Google ADK for GCP-native multimodal agents, and CrewAI for rapid multi-agent prototyping with the broadest protocol support.
10. Production Patterns
Deploying LangGraph agents to production requires patterns for error handling, retries, observability, and resource management. These patterns leverage LangGraph's graph structure to make failure handling explicit and testable rather than hidden in try-catch blocks.
Retry Policies and Error Routing
LangGraph supports per-node retry policies with configurable max attempts, backoff strategies, and retry conditions. Transient failures (API timeouts, rate limits) get automatic retries with exponential backoff. LLM-recoverable errors loop back to the model with error context. User-fixable problems pause for human input via interrupt. Unexpected errors bubble up for debugging. Conditional edges route based on error type, making failure paths explicit in the graph.
Guardrails and Circuit Breakers
Bounded retries and step limits prevent runaway agent loops. Set maximum iterations per graph execution to cap cost and latency. Circuit breakers detect repeated failures to the same external service and fail fast instead of burning tokens on doomed retries. Timeout policies on individual nodes prevent single slow operations from blocking the entire workflow.
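A circuit breaker is straightforward to implement as a wrapper a node uses around its external calls. A minimal sketch, with illustrative thresholds (production implementations add per-service state and metrics):

```python
import time

class CircuitBreaker:
    """Minimal sketch: open after `threshold` consecutive failures,
    fail fast during `cooldown`, then allow a single probe call."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Fail fast: don't burn tokens/requests on a known-bad service
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Wrapping each external dependency in its own breaker instance keeps one failing service from tripping calls to healthy ones.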
Observability Stack
Production agents need tracing showing which nodes ran with their inputs and outputs (LangSmith), metrics for system health (Prometheus/Grafana), and structured logs tying execution to business events. LangSmith captures token usage, latency, and error rates per node. You can replay historical runs with modified parameters for debugging and regression testing.
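For the structured-logs piece, stdlib logging is enough. The sketch below emits one JSON object per event so logs can be joined with traces on thread_id downstream; the field names are illustrative, not a LangSmith convention:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the agent's
    thread_id and node so logs can be correlated with traces."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "thread_id": getattr(record, "thread_id", None),
            "node": getattr(record, "node", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Nodes attach context via `extra`; the formatter picks it up
logger.info("node_completed", extra={"thread_id": "trip-123", "node": "agent"})
```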
Scaling and State Management
Use PostgreSQL checkpointers for multi-process deployments behind load balancers. Redis checkpointers add sub-millisecond state access for latency-sensitive workloads. Deploy on Kubernetes with horizontal pod autoscaling based on queue depth or active thread count. The LangGraph Platform handles this automatically for managed deployments.
```python
from langgraph.types import RetryPolicy

# Per-node retry with exponential backoff
# (RateLimitError stands in for your provider SDK's rate-limit exception)
retry = RetryPolicy(
    max_attempts=3,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=lambda e: isinstance(e, (TimeoutError, RateLimitError))
)
graph.add_node("api_call", call_external_api, retry=retry)

# Step limit to prevent runaway loops: recursion_limit is part of the
# run config and caps node executions per invocation
app = graph.compile(checkpointer=checkpointer)
result = app.invoke(
    inputs,
    config={"recursion_limit": 50}
)
```
11. langgraph-checkpoint v4.0.2 and TTL Strategies
The langgraph-checkpoint v4.0.2 release (April 2026) introduces the keep_latest TTL strategy for automatic checkpoint pruning. Instead of accumulating unbounded state history, you configure a retention policy that keeps only the most recent N checkpoints per thread, with older snapshots purged asynchronously. This dramatically reduces storage costs for high-volume production agents running thousands of threads.
The new RemoteCheckpointer enables cross-process subgraph checkpointing. When a supervisor graph delegates to a subgraph running in a separate process (or even on a different machine), the RemoteCheckpointer synchronizes state via an HTTP transport layer. This eliminates the previous limitation where subgraph state was only accessible from the parent process, enabling true distributed multi-agent architectures with independent scaling of supervisor and worker agents.
```python
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.checkpoint.base import CheckpointConfig

# TTL strategy: keep only latest 50 checkpoints per thread
config = CheckpointConfig(
    ttl_strategy="keep_latest",
    keep_latest=50
)
checkpointer = PostgresSaver.from_conn_string(
    conn_string="postgresql://...",
    config=config
)

# RemoteCheckpointer for subgraph state across processes
from langgraph.checkpoint.remote import RemoteCheckpointer

sub_ckpt = RemoteCheckpointer(endpoint="https://worker-agent:8080/checkpoints")
```
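The effect of keep_latest is easy to model: given checkpoint lists ordered oldest-first, pruning keeps only the tail of each thread's history. A toy version (not the library's implementation, which prunes asynchronously in the backend):

```python
def prune_keep_latest(threads: dict, keep_latest: int) -> dict:
    """Toy model of the keep_latest TTL strategy: retain only the
    newest `keep_latest` checkpoints per thread (lists are oldest-first)."""
    if keep_latest <= 0:
        return {tid: [] for tid in threads}
    return {tid: ckpts[-keep_latest:] for tid, ckpts in threads.items()}
```

Note the trade-off: pruned checkpoints are gone for time-travel debugging, so keep_latest should be at least as deep as the longest replay window you need.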