LangGraph 2.0: Stateful Multi-Agent Orchestration Framework
The definitive guide to LangGraph 2.0 -- the graph-based framework for building stateful, multi-step AI agents. From core concepts (StateGraph, nodes, edges, conditional routing) to checkpointing, human-in-the-loop patterns, multi-agent orchestration (supervisor, swarm), LangGraph Platform deployment, LangSmith tracing, and production patterns for error handling, retries, and observability.
1. What Is LangGraph?
LangGraph is an open-source framework by LangChain for building stateful, multi-step AI agents as directed graphs. Instead of writing linear chains or monolithic agent loops, you model your agent's workflow as a graph of nodes (functions) and edges (transitions), with typed state flowing through every step. This gives you explicit control over branching, looping, parallelism, and error handling -- the exact properties that production agents need but ad-hoc agent loops lack.
The framework is model-agnostic: it works with any LLM provider (OpenAI, Anthropic, Google, open-source models via Ollama). LangGraph is available for both Python and TypeScript, with official SDKs maintained by the LangChain team. As of April 2026, the Python package (langgraph on PyPI) is the most widely adopted agent orchestration library in the Python ecosystem, with the broader LangChain organization surpassing 126,000 GitHub stars across its repositories.
LangGraph sits at a specific layer in the AI agent stack: it handles orchestration (what runs when, in what order, with what state) but delegates model calls, tool execution, and external integrations to whatever libraries you prefer. You can use LangChain's abstractions, call provider SDKs directly, or mix both. This composability is why LangGraph works well alongside Claude Agent SDK agents, MCP servers, and other frameworks rather than replacing them.
2. LangGraph 2.0: What Changed
LangGraph 2.0 (February 2026) marks the transition from rapid iteration to production stability. The release codifies patterns that emerged from thousands of production deployments into stable, documented APIs with backward-compatibility guarantees within the major version. All APIs without experimental prefixes are now considered stable and production-ready.
Key changes in 2.0 include: stable checkpointer interfaces with standardized serialization across all backends (SQLite, PostgreSQL, Redis); first-class interrupt/Command primitives for human-in-the-loop workflows, replacing the older breakpoint API; refined streaming with token-level and node-level modes; improved subgraph composition for multi-agent architectures; and typed state reducers that let multiple nodes update shared state without conflicts.
The release policy now follows semantic versioning strictly: major releases every 6-12 months for stability, minor releases every 1-2 months for features, and patch releases weekly for fixes. This maturity makes LangGraph viable for enterprise teams that need predictable upgrade paths and long-term support.
3. Core Concepts: StateGraph, Nodes, Edges
LangGraph models every agent workflow as a StateGraph -- a directed graph where typed state flows through nodes connected by edges. Understanding these three primitives is the foundation for everything else in the framework.
StateGraph
The top-level container for your agent workflow. You define a typed state schema (using TypedDict or Pydantic), then add nodes and edges. The state schema declares every field your agent tracks -- messages, tool results, counters, flags -- along with reducer functions that control how concurrent updates merge. Compile the graph to get a runnable that processes inputs through the graph.
Nodes
Nodes are functions (sync or async) that receive the current state, perform work (LLM calls, tool execution, data processing), and return state updates. Each node runs independently and communicates only through state. This isolation makes nodes testable, retryable, and composable. Special nodes include START (entry point) and END (terminal).
Edges
Edges define transitions between nodes. A direct edge always routes from A to B. A conditional edge uses a function to inspect the current state and return the name of the next node (or END to stop). This is how agents make decisions: the LLM output is written to state, then a conditional edge reads it and routes accordingly.
State Reducers
Reducers control how node outputs merge into shared state. The default reducer overwrites the field. The add_messages reducer appends to a message list. Custom reducers handle counters, sets, deduplication, or any merge logic. Reducers prevent race conditions when parallel nodes update the same field, making concurrent execution safe.
```python
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    next_action: str

def call_model(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    last = state["messages"][-1]
    if last.tool_calls:
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", tool_node)
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
app = graph.compile()
```
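Reducers are just two-argument merge functions, so they are easy to sketch and unit-test in isolation. The `dedup_merge` reducer below is hypothetical (not part of LangGraph), shown alongside the common pattern of reusing `operator.add` to concatenate lists:

```python
import operator
from typing import Annotated, TypedDict

def dedup_merge(existing: list, new: list) -> list:
    """Hypothetical custom reducer: append only items not already present."""
    seen = set(existing)
    return existing + [item for item in new if item not in seen]

class ResearchState(TypedDict):
    # Stock reducer: concatenates updates from parallel nodes
    sources: Annotated[list, operator.add]
    # Custom reducer: parallel nodes can contribute URLs without duplicates
    urls: Annotated[list, dedup_merge]
```

Because the reducer is a plain function, it can be tested without compiling a graph, which keeps merge logic for concurrent branches verifiable in isolation.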
4. Checkpointing and State Persistence
LangGraph's built-in persistence layer saves a snapshot of the graph state at every step of execution, organized into threads. When you compile a graph with a checkpointer, every node execution creates a checkpoint that can be inspected, replayed, or resumed. This is the foundation for conversational memory, human-in-the-loop workflows, fault tolerance, and time-travel debugging.
Three production-grade checkpointer backends are available:
SQLite (SqliteSaver)
File-based persistence ideal for local development, prototyping, and single-process deployments. Zero configuration -- just pass a file path. Supports async via aiosqlite. Use this for experimentation and workflows that do not need to share state across processes.
PostgreSQL (PostgresSaver)
The recommended backend for production deployments. Used internally by LangSmith. Supports concurrent access, ACID transactions, and scales with your existing PostgreSQL infrastructure. Async via asyncpg. Pair with connection pooling (PgBouncer) for high-throughput agent workloads.
Redis (RedisSaver)
High-performance in-memory persistence for latency-sensitive agents. The v0.1.0 release (2026) is a complete redesign optimizing checkpoint data structures for Redis's in-memory model. Ideal for real-time conversational agents, chatbots, and workflows where sub-millisecond state access matters. Supports Redis Cluster for horizontal scaling.
```python
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

# Local development
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
    result = app.invoke(
        {"messages": [("user", "Plan my trip")]},
        config={"configurable": {"thread_id": "trip-123"}}
    )

# Production with PostgreSQL (the async saver supports `async with` / `await`;
# run `await checkpointer.setup()` once to create the tables)
async with AsyncPostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
    # Resume from last checkpoint
    state = await app.aget_state({"configurable": {"thread_id": "trip-123"}})
```
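To build intuition for how threads and checkpoints relate, here is a toy in-memory model. It is deliberately not the LangGraph checkpointer interface, just the core idea: every step appends a snapshot under a thread_id, and resuming reads the latest one back.

```python
from collections import defaultdict
from itertools import count

class InMemoryCheckpointStore:
    """Toy model, NOT the LangGraph checkpointer interface:
    checkpoints are grouped by thread and appended at every step."""

    def __init__(self):
        self._threads = defaultdict(list)
        self._ids = count(1)

    def put(self, thread_id: str, state: dict) -> int:
        """Snapshot the state; returns the new checkpoint id."""
        checkpoint_id = next(self._ids)
        self._threads[thread_id].append((checkpoint_id, dict(state)))
        return checkpoint_id

    def latest(self, thread_id: str):
        """State to resume from, or None if the thread has no history."""
        history = self._threads[thread_id]
        return history[-1][1] if history else None

    def history(self, thread_id: str) -> list:
        """All (checkpoint_id, state) pairs, oldest first -- time travel."""
        return list(self._threads[thread_id])

store = InMemoryCheckpointStore()
store.put("trip-123", {"step": 1, "city": None})
store.put("trip-123", {"step": 2, "city": "Lisbon"})
```

The real backends add serialization, concurrency control, and durability on top of this shape, but the thread-keyed append-only history is the essential contract.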
5. Human-in-the-Loop Patterns
LangGraph provides first-class primitives for pausing agent execution, presenting information to a human, and resuming with their input. This is built on top of the checkpointing system: when a graph hits an interrupt() call, it saves state to the checkpointer and returns control to the caller. The caller collects human input and resumes the graph with a Command containing the response.
Three primary patterns cover most human-in-the-loop scenarios:
Approve / Reject
Pause before a critical action (API call, database write, email send) and wait for human approval. If approved, execution continues. If rejected, the graph routes to an alternative path or terminates. This pattern is essential for high-stakes operations where autonomous execution is too risky.
Review and Edit
Present the agent's proposed action or output to a human who can modify it before execution continues. The human edits the state directly -- changing tool arguments, rewriting a draft, or correcting extracted data. The graph resumes with the edited state as if the agent had produced it.
Collect Input
The agent determines it needs information it cannot obtain autonomously and pauses to ask the human. This handles clarification questions, preference selection, credential entry, and multi-step forms. The interrupt carries a structured prompt that the UI renders appropriately.
```python
from langgraph.types import interrupt, Command

def sensitive_action(state):
    # Pause and ask human for approval
    decision = interrupt({
        "action": "delete_records",
        "count": state["record_count"],
        "question": "Approve deletion of these records?"
    })
    if decision["approved"]:
        return execute_deletion(state)
    return {"messages": [("system", "Deletion cancelled by user.")]}

# Resume with human input
app.invoke(
    Command(resume={"approved": True}),
    config={"configurable": {"thread_id": "cleanup-456"}}
)
```
6. Multi-Agent Orchestration
LangGraph provides three architectures for coordinating multiple specialized agents. Each architecture trades off between control and autonomy, and you can combine them in hierarchies where a supervisor delegates to swarms or sequential chains.
Supervisor Pattern
A central supervisor agent routes tasks to specialized worker agents based on the current state. The supervisor is an LLM that decides which worker to invoke next, but never executes tools itself. Workers are simple, single-purpose agents (researcher, coder, reviewer). The langgraph-supervisor library provides a ready-made implementation. This pattern is best when you need structured, predictable workflows with clear delegation.
Swarm Pattern
Agents operate autonomously in a decentralized network, observing a shared workspace and contributing when their expertise is relevant. Unlike supervisor-based systems, swarm agents communicate directly via handoffs, reducing bottlenecks and enabling parallelization. The langgraph-swarm library implements this pattern. Best for emergent problem-solving where the optimal sequence of agents is not known in advance.
Collaborative Pattern
A hybrid that blends supervisor structure with swarm flexibility. A supervisor handles high-level routing while allowing worker agents to hand off to each other for sub-tasks. This works well for complex workflows where the overall structure is known but individual steps require adaptive collaboration between specialists.
```python
from langgraph_supervisor import create_supervisor
from langgraph.prebuilt import create_react_agent

# Specialized worker agents (named so the supervisor can address them)
researcher = create_react_agent(model, tools=[search, wiki], name="researcher")
coder = create_react_agent(model, tools=[run_code, read_file], name="coder")
reviewer = create_react_agent(model, tools=[lint, test], name="reviewer")

# Supervisor orchestrates workers
supervisor = create_supervisor(
    model=model,
    agents=[researcher, coder, reviewer],
    prompt="Route tasks to the appropriate specialist."
)
app = supervisor.compile()
```
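The control flow of the supervisor pattern reduces to: an LLM picks a worker name, the framework dispatches to that worker, and the result flows back. A toy sketch with a rule-based stand-in for the routing LLM (all names and rules here are illustrative, not the langgraph-supervisor internals):

```python
def route(task: str) -> str:
    """Stand-in for the supervisor LLM's routing decision."""
    if "implement" in task or "fix" in task:
        return "coder"
    if "research" in task or "find" in task:
        return "researcher"
    return "reviewer"

# Single-purpose workers keyed by the names the router returns
WORKERS = {
    "researcher": lambda task: f"notes on: {task}",
    "coder": lambda task: f"patch for: {task}",
    "reviewer": lambda task: f"review of: {task}",
}

def supervise(task: str) -> str:
    """Dispatch the task to whichever worker the router picked."""
    return WORKERS[route(task)](task)
```

In the real pattern the router's decision is an LLM call whose output is written to state, and a conditional edge reads that state to pick the next node, so routing stays inspectable in traces.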
7. LangGraph Platform (LangSmith Deployment)
LangGraph Platform (renamed to LangSmith Deployment in October 2025) is the managed infrastructure layer for deploying and scaling long-running, stateful agents. It handles the operational complexity that production agents demand: persistent state across requests, horizontal scaling, fault recovery, and monitoring -- without requiring you to build this infrastructure yourself.
Four deployment options are available:
Cloud SaaS
Fully managed, hosted within LangSmith. The fastest path from development to production -- deploy directly from the LangSmith UI with automatic updates and zero maintenance. Best for teams that want to focus on agent logic without managing infrastructure.
Bring Your Own Cloud (BYOC)
Run LangGraph Platform in your VPC while LangChain handles provisioning and maintenance. Your data stays in your environment, but you get managed upgrades, scaling, and monitoring. Ideal for teams with data residency requirements or existing cloud commitments.
Self-Hosted Enterprise
Deploy entirely on your own infrastructure for maximum control. Run on Kubernetes with Docker containers. You manage upgrades, scaling, and security. Best for organizations with strict compliance requirements or air-gapped environments.
Self-Hosted Lite
A free, limited version of LangGraph Platform (up to 1 million node executions). Run locally or self-hosted for development, testing, and small-scale production. No license required. A practical way to evaluate the platform before committing to a paid tier.
8. Tracing with LangSmith
LangSmith is the observability platform for LangGraph agents. When you set LANGCHAIN_TRACING_V2=true and provide an API key, every graph execution is automatically traced -- no custom instrumentation required. Traces capture the full execution tree: which nodes ran, in what order, with what inputs and outputs, how long each step took, and how many tokens were consumed.
The trace view in LangSmith shows a hierarchical tree representing the complete execution. You can drill into individual node runs, inspect state at each checkpoint, view LLM prompts and completions, and see tool call arguments and results. For production debugging, you can filter traces by latency, error status, token usage, or custom metadata. The Insights Agent (available for self-hosted LangSmith) automatically analyzes traces to detect usage patterns, common agent behaviors, and failure modes.
LangSmith is not limited to LangGraph -- it traces applications built with the OpenAI SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, or custom implementations. But the integration with LangGraph is deepest: server logs from LangSmith Deployment are linked directly to trace views, giving you a single window into both application-level behavior and infrastructure-level events.
```python
# Enable tracing -- that's it
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_..."
os.environ["LANGCHAIN_PROJECT"] = "my-agent"

# Every graph.invoke() and graph.astream() is now traced
result = app.invoke(
    {"messages": [("user", "Analyze Q1 revenue")]},
    config={"configurable": {"thread_id": "analysis-789"}}
)
# View trace at: https://smith.langchain.com/
```
9. Framework Comparison
LangGraph occupies a specific position in the agent framework landscape. Here is how it compares to the other major frameworks as of April 2026:
| Dimension | LangGraph | Claude Agent SDK | OpenAI Agents SDK | Google ADK | CrewAI |
|---|---|---|---|---|---|
| Orchestration | Directed graph with conditional edges | Tool-use chains with sub-agents | Explicit handoffs between agents | Hierarchical agent tree | Role-based crews with process types |
| Model Support | Fully model-agnostic | Claude models only | OpenAI models only | Optimized for Gemini, supports others | Fully model-agnostic |
| State Persistence | Built-in checkpointing with time travel | MCP server state | Context variables (ephemeral) | Session state with pluggable backends | Crew memory with configurable stores |
| Human-in-the-Loop | First-class interrupt/Command primitives | MCP elicitation | Manual via guardrails | Built-in approval steps | Human input tool |
| Observability | LangSmith (deep integration) | Built-in tracing | Built-in tracing | Cloud Trace, Cloud Logging | Community integrations |
| Learning Curve | Medium (graph concepts) | Medium (tool-use patterns) | Low (clean opinionated API) | Medium (GCP ecosystem) | Low (role-based DSL) |
| Best For | Complex stateful workflows, precise control | MCP-native development, safety-first | Fast prototyping with OpenAI models | GCP-native, multimodal, A2A protocol | Multi-agent collaboration, rapid prototyping |
The frameworks are increasingly interoperable rather than mutually exclusive. Google ADK can treat a LangGraph agent as an AgentTool, LangGraph can call ADK agents as subgraphs via API, and both support the MCP protocol for tool integration. Choose based on your primary constraint: LangGraph for complex stateful orchestration, Claude Agent SDK for MCP-native safety-first development, OpenAI SDK for the fastest path to a working agent, Google ADK for GCP-native multimodal agents, and CrewAI for rapid multi-agent prototyping with the broadest protocol support.
10. Production Patterns
Deploying LangGraph agents to production requires patterns for error handling, retries, observability, and resource management. These patterns leverage LangGraph's graph structure to make failure handling explicit and testable rather than hidden in try-catch blocks.
Retry Policies and Error Routing
LangGraph supports per-node retry policies with configurable max attempts, backoff strategies, and retry conditions. Transient failures (API timeouts, rate limits) get automatic retries with exponential backoff. LLM-recoverable errors loop back to the model with error context. User-fixable problems pause for human input via interrupt. Unexpected errors bubble up for debugging. Conditional edges route based on error type, making failure paths explicit in the graph.
Guardrails and Circuit Breakers
Bounded retries and step limits prevent runaway agent loops. Set maximum iterations per graph execution to cap cost and latency. Circuit breakers detect repeated failures to the same external service and fail fast instead of burning tokens on doomed retries. Timeout policies on individual nodes prevent single slow operations from blocking the entire workflow.
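A circuit breaker is straightforward to implement as a wrapper a node uses around its external calls. A minimal sketch, with illustrative thresholds (production implementations add per-service state and metrics):

```python
import time

class CircuitBreaker:
    """Minimal sketch: open after `threshold` consecutive failures,
    fail fast during `cooldown`, then allow a single probe call."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Fail fast: don't burn tokens/requests on a known-bad service
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Wrapping each external dependency in its own breaker instance keeps one failing service from tripping calls to healthy ones.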
Observability Stack
Production agents need tracing showing which nodes ran with their inputs and outputs (LangSmith), metrics for system health (Prometheus/Grafana), and structured logs tying execution to business events. LangSmith captures token usage, latency, and error rates per node. You can replay historical runs with modified parameters for debugging and regression testing.
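For the structured-logs piece, stdlib logging is enough. The sketch below emits one JSON object per event so logs can be joined with traces on thread_id downstream; the field names are illustrative, not a LangSmith convention:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the agent's
    thread_id and node so logs can be correlated with traces."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "thread_id": getattr(record, "thread_id", None),
            "node": getattr(record, "node", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Nodes attach context via `extra`; the formatter picks it up
logger.info("node_completed", extra={"thread_id": "trip-123", "node": "agent"})
```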
Scaling and State Management
Use PostgreSQL checkpointers for multi-process deployments behind load balancers. Redis checkpointers add sub-millisecond state access for latency-sensitive workloads. Deploy on Kubernetes with horizontal pod autoscaling based on queue depth or active thread count. The LangGraph Platform handles this automatically for managed deployments.
```python
from langgraph.types import RetryPolicy

# Per-node retry with exponential backoff
# (RateLimitError stands in for your provider SDK's rate-limit exception)
retry = RetryPolicy(
    max_attempts=3,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=lambda e: isinstance(e, (TimeoutError, RateLimitError))
)
graph.add_node("api_call", call_external_api, retry=retry)

# Step limit to prevent runaway loops: recursion_limit is part of the
# run config and caps node executions per invocation
app = graph.compile(checkpointer=checkpointer)
result = app.invoke(
    inputs,
    config={"recursion_limit": 50}
)
```
11. langgraph-checkpoint v4.0.2 and TTL Strategies
The langgraph-checkpoint v4.0.2 release (April 2026) introduces the keep_latest TTL strategy for automatic checkpoint pruning. Instead of accumulating unbounded state history, you configure a retention policy that keeps only the most recent N checkpoints per thread, with older snapshots purged asynchronously. This dramatically reduces storage costs for high-volume production agents running thousands of threads.
The new RemoteCheckpointer enables cross-process subgraph checkpointing. When a supervisor graph delegates to a subgraph running in a separate process (or even on a different machine), the RemoteCheckpointer synchronizes state via an HTTP transport layer. This eliminates the previous limitation where subgraph state was only accessible from the parent process, enabling true distributed multi-agent architectures with independent scaling of supervisor and worker agents.
```python
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.checkpoint.base import CheckpointConfig

# TTL strategy: keep only latest 50 checkpoints per thread
config = CheckpointConfig(
    ttl_strategy="keep_latest",
    keep_latest=50
)
checkpointer = PostgresSaver.from_conn_string(
    conn_string="postgresql://...",
    config=config
)

# RemoteCheckpointer for subgraph state across processes
from langgraph.checkpoint.remote import RemoteCheckpointer

sub_ckpt = RemoteCheckpointer(endpoint="https://worker-agent:8080/checkpoints")
```
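The effect of keep_latest is easy to model: given checkpoint lists ordered oldest-first, pruning keeps only the tail of each thread's history. A toy version (not the library's implementation, which prunes asynchronously in the backend):

```python
def prune_keep_latest(threads: dict, keep_latest: int) -> dict:
    """Toy model of the keep_latest TTL strategy: retain only the
    newest `keep_latest` checkpoints per thread (lists are oldest-first)."""
    if keep_latest <= 0:
        return {tid: [] for tid in threads}
    return {tid: ckpts[-keep_latest:] for tid, ckpts in threads.items()}
```

Note the trade-off: pruned checkpoints are gone for time-travel debugging, so keep_latest should be at least as deep as the longest replay window you need.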