Search "how to build an agent with ADK" and the results are a feature tour: here is LlmAgent, here is a tool, here is a callback, each shown in isolation. None of them is a project. The gap between "I can call LlmAgent" and "I have an agent in production with tests and traces" is where most teams stall — not because any single piece is hard, but because nobody shows the seams.
This is the project, end to end. We build one agent — a support-ticket triage agent, concrete enough to be real — and take it from an empty directory to a deployed Agent Runtime endpoint with an eval set wired into CI. Every code block is paste-able. The companion repo is zinch-ai/adk-starter-kit on GitHub; this article is the narrated version of that repo. The Agent Development Kit (ADK) is not hard to start with — it is the second week, the eval set and the Memory Bank decision and the deploy, where teams hit the parts the quickstart skipped. We do not skip them.
1. What ADK is — and what it is not
The Agent Development Kit (ADK) v1.0 is an open-source, code-first framework for building agents. It is the developer-facing entry point to the Build pillar of the Gemini Enterprise Agent Platform. You write Python (or TypeScript — more on language status below), you get an agent that runs locally, deploys to managed infrastructure, and carries first-class tool calling, session state, and evaluation.
What ADK is:
- A code framework, not a builder. Your agent is a Python package in your repo. It diffs, it reviews, it has a test suite, it ships through your CI. There is no canvas, no proprietary export format, no "open the project in the vendor console to change a prompt."
- The opinionated layer above the model SDK. The Google Gen AI SDK makes model calls — generate, stream, embed. ADK is the orchestration on top: the agent loop, tool dispatch, state, sub-agent composition, the eval harness. If you have a hand-rolled `while` loop dispatching tools yourself, ADK is what replaces it.
- A deployment target, not just a dev tool. The same agent object you run with `adk run` locally deploys to Agent Runtime — managed, autoscaling, observable infrastructure — without a rewrite.
What ADK is not:
- Not Agent Studio. Agent Studio is the low-code canvas, also part of the Build pillar, aimed at a non-engineering builder. ADK is correct the moment the agent needs version control, a real test suite, or more than two non-trivial integrations — the FAQ covers the full decision.
- Not Gemini-only. ADK runs Gemini models natively, but its model layer is pluggable — you can run an ADK agent on Claude, Llama, or a self-hosted model through the same interface. The agent code does not change when the model does.
- Not a multi-agent protocol. ADK composes sub-agents within a project. Getting agents to talk across vendors and runtimes is the Agent2Agent (A2A) protocol, a separate layer — see our A2A enterprise interoperability guide for where that boundary sits.
2. ADK status as of publication
ADK ships per language, and the languages are not at the same maturity. As of May 2026:
| Language | Status | Notes |
|---|---|---|
| Python | v1.32 stable; v2.0 in beta | The reference implementation. Everything in this guide is Python. v2.0 beta is in evaluation — start on v1.32 stable for anything going to production. |
| TypeScript | v1.0 stable | GA and production-viable. API shapes mirror the Python SDK closely; the concepts in this guide port directly. |
| Java | In active development | Not yet GA. Track the repo before committing a Java service to it. |
| Go | In active development | Earliest of the four. Not production-ready. |
For a new agent today, the decision is Python v1.32 or TypeScript v1.0 — both stable, both deployable. We use Python because it is the reference implementation: features land there first and the docs are deepest. v2.0 beta is worth tracking, but "beta" means what it says — keep production on v1.32 until v2.0 reaches stable. The ADK engineering team's v1.0 retrospective is the best account of how the API stabilized and what the v1.x line guarantees.
3. Project setup
Start with the install. ADK is one package:
```bash
python -m venv .venv && source .venv/bin/activate
pip install google-adk
```
That pulls in the Gen AI SDK as a dependency — you do not install it separately. Now the project layout. This is the part the quickstarts skip, and it is the part that decides whether week two is pleasant. The `support-triage-agent/` project is five files that matter and two that are scaffolding:
- `pyproject.toml` — dependencies and packaging.
- `.env` — local secrets, gitignored.
- `support_triage/` — the agent package.
  - `agent.py` — the agent definition; the entry point ADK looks for.
  - `tools.py` — the function tools.
  - `prompts.py` — system instructions, kept out of `agent.py`.
- `eval/triage.evalset.json` — the eval set, built in section 8.
- `deployment/runtime.py` — the Agent Runtime deploy script, section 10.
Two conventions to internalize now. First, `support_triage/agent.py` must expose a module-level `root_agent` — that is the symbol `adk run`, `adk web`, and the eval runner all look for. Name it anything else and the tooling cannot find your agent. Second, prompts live in their own file — a system instruction grows to hundreds of lines, and keeping it in `agent.py` turns every prompt tweak into a noisy diff over your wiring code.
Configuration goes in .env, read by ADK automatically. For an agent running against the Agent Platform backend:
```bash
# .env — gitignored. Real values, never committed.
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
```
4. Defining your first agent
The minimum viable agent is three things: a model, a name, and an instruction. Here is the whole of `agent.py` before tools:
```python
# support_triage/agent.py
from google.adk.agents import LlmAgent

from .prompts import TRIAGE_INSTRUCTION

root_agent = LlmAgent(
    name="support_triage",
    model="gemini-2.5-flash",
    description="Triages inbound support tickets: severity, category, routing.",
    instruction=TRIAGE_INSTRUCTION,
)
```
Four fields, each doing real work:
- `name` — a stable identifier. It shows up in traces and, in a multi-agent system, it is how a parent agent addresses this one. Treat it like a function name: lowercase, specific, permanent.
- `model` — the model string. `gemini-2.5-flash` is the right default for a triage workload: fast, cheap, strong enough for classification and extraction. Reserve a pro-tier model for steps that genuinely need deeper reasoning — paying pro latency and cost on a severity label is waste.
- `description` — a one-line statement of what this agent does. In a single-agent app it is documentation. In a multi-agent app it is functional: a parent agent reads sub-agent `description`s to decide where to route. Write it as if another agent will act on it, because one will.
- `instruction` — the system prompt, imported from `prompts.py`. This is where the agent's behavior lives, and it is not boilerplate. Keep it concrete:
```python
# support_triage/prompts.py
TRIAGE_INSTRUCTION = """\
You are a support-ticket triage agent. For every inbound ticket you receive:
1. Assign a severity: P1 (production down, data loss, security), P2 (major
   feature broken, no workaround), P3 (minor bug, workaround exists), or P4
   (question, cosmetic, feature request).
2. Assign one category: billing, authentication, data-pipeline, api, ui, other.
3. Decide a routing target by calling the `lookup_routing` tool with the
   category. Do not guess the team name — always call the tool.
Respond with exactly: severity, category, routing target, and a one-sentence
justification. If the ticket text is too vague to classify, ask one specific
clarifying question instead of guessing.
"""
```
Note what this instruction does that a vague one would not: it enumerates the severity rubric instead of saying "assess severity," it forbids guessing the routing target, and it gives the agent an explicit escape hatch for ambiguous input. Those three properties are what make the agent's behavior testable in section 8.
5. Adding tools
An agent that only talks is a chatbot. Tools are what make it an agent. ADK has three tool surfaces, in increasing order of scope.
Function tools are the workhorse — a plain Python function ADK exposes to the model. The function's signature and docstring are the tool schema, so both matter:
```python
# support_triage/tools.py
def lookup_routing(category: str) -> dict:
    """Look up the on-call team that owns a ticket category.

    Args:
        category: One of billing, authentication, data-pipeline, api, ui, other.

    Returns:
        A dict with `team` (the owning team's name) and `slack_channel`.
    """
    routing_table = {
        "billing": {"team": "Revenue Platform", "slack_channel": "#oncall-billing"},
        "authentication": {"team": "Identity", "slack_channel": "#oncall-identity"},
        "data-pipeline": {"team": "Data Platform", "slack_channel": "#oncall-data"},
        "api": {"team": "API Platform", "slack_channel": "#oncall-api"},
        "ui": {"team": "Web", "slack_channel": "#oncall-web"},
        "other": {"team": "Support Triage", "slack_channel": "#support-triage"},
    }
    return routing_table.get(category, routing_table["other"])
```
Then attach it — `tools` is a list on the agent:
```python
# support_triage/agent.py
from google.adk.agents import LlmAgent

from .prompts import TRIAGE_INSTRUCTION
from .tools import lookup_routing

root_agent = LlmAgent(
    name="support_triage",
    model="gemini-2.5-flash",
    description="Triages inbound support tickets: severity, category, routing.",
    instruction=TRIAGE_INSTRUCTION,
    tools=[lookup_routing],
)
```
Two non-negotiables for function tools. The docstring is not optional — the model decides whether and how to call the tool from the docstring and type hints, so a function with no docstring is a tool the model is guessing about. Return structured data — a dict, not a prose string. The model handles `{"team": "Identity", ...}` far more reliably than "The Identity team owns this, ping #oncall-identity".
The second surface is the MCP toolset. The Model Context Protocol is the open standard for tool servers; ADK consumes any MCP server as a set of tools, which is how you reach an existing internal toolset without rewriting it as Python functions:
```python
from google.adk.tools.mcp_tool import MCPToolset, StdioServerParameters

incident_tools = MCPToolset(
    connection_params=StdioServerParameters(
        command="npx",
        args=["-y", "@your-org/incident-mcp-server"],
    ),
)
# then: tools=[lookup_routing, incident_tools]
```
The third surface is third-party integrations — ADK ships connectors and supports tools from frameworks like LangChain, so you are not re-implementing a wrapper that already exists. The decision order is simple: a plain function for your own logic, an MCP toolset for an existing tool server, a connector for a SaaS system someone already wrapped.
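For the LangChain path specifically, the adapter is a thin wrapper class. A sketch, assuming ADK's `LangchainTool` adapter and the community Tavily search tool as the wrapped example; check both names against the versions you have installed:
```python
# Sketch: exposing an existing LangChain tool to an ADK agent via the
# LangchainTool adapter. Tavily is a stand-in for whatever LangChain
# tool you already have; it needs TAVILY_API_KEY set in the environment.
from google.adk.tools.langchain_tool import LangchainTool
from langchain_community.tools import TavilySearchResults

kb_search = LangchainTool(tool=TavilySearchResults(max_results=3))

# then: tools=[lookup_routing, incident_tools, kb_search]
```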
6. State management: Sessions vs. Memory Bank
This is the ADK design decision teams get wrong most often. ADK has two kinds of state, and they are not interchangeable.
Session state is within-conversation memory. It lives for the life of one conversation and holds the working context of that exchange — a value a tool returned that a later step needs. You read and write it through the `ToolContext` and through `callback_context.state` in callbacks:
```python
from google.adk.tools import ToolContext

# _routing_table is the routing dict from above, hoisted to module level
def lookup_routing(category: str, tool_context: ToolContext) -> dict:
    """...docstring as before..."""
    result = _routing_table.get(category, _routing_table["other"])
    # stash on session state so a later step in THIS conversation can read it
    tool_context.state["last_routing"] = result
    return result
```
Memory Bank is cross-session memory — a managed Agent Platform service that persists structured memory about a user or entity across conversations, so the agent next week recalls what mattered last week. You do not hand-roll this in Firestore or Redis; Memory Bank is the platform primitive for it, and ADK integrates with it through a memory service you configure on the runner.
The rule that prevents the mistake:
`Session` state is scoped to one conversation and is gone when it ends. Memory Bank is scoped to a user and persists across every conversation they ever have. If you find yourself writing conversation context into a database to read it back next week, you want Memory Bank. If you are passing a value between two steps of the same exchange, you want `Session` state.
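The same Session state also shows up in callbacks, which is where you observe a run without touching tool code. A minimal sketch, assuming the `before_model_callback` signature (`CallbackContext`, `LlmRequest`) from the ADK Python SDK:
```python
# Sketch: reading the session state written by lookup_routing from a
# callback later in the same conversation. Signature assumed from the
# ADK Python SDK; verify against your installed version.
from typing import Optional

from google.adk.agents.callback_context import CallbackContext
from google.adk.models import LlmRequest, LlmResponse

def log_last_routing(
    callback_context: CallbackContext, llm_request: LlmRequest
) -> Optional[LlmResponse]:
    last = callback_context.state.get("last_routing")
    if last:
        print(f"[trace] routing decided earlier this conversation: {last}")
    return None  # None means: proceed with the model call unchanged

# attach it on the agent: LlmAgent(..., before_model_callback=log_last_routing)
```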
For the triage agent, Session state is enough — each ticket is its own conversation and nothing needs to persist after it routes. An agent that remembered a specific customer's history across every ticket they filed would be the Memory Bank case; the wiring for that case is sketched below. Our HIPAA-compliant agents reference architecture walks the controls that apply when the thing Memory Bank persists is regulated data — the version of this decision that actually has stakes.
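When you do hit the Memory Bank case, the integration point is the runner, not the agent definition. A minimal sketch, assuming the `VertexAiMemoryBankService` class and `Runner` keyword names from the ADK Python SDK; verify both against the version you are on:
```python
# Sketch: configuring Memory Bank as the runner's memory service. Class
# and parameter names are assumed from the ADK Python SDK, and the
# agent_engine_id is a placeholder for your deployed engine's ID.
from google.adk.memory import VertexAiMemoryBankService
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

from support_triage.agent import root_agent

runner = Runner(
    agent=root_agent,
    app_name="support_triage",
    session_service=InMemorySessionService(),
    memory_service=VertexAiMemoryBankService(
        project="your-project-id",
        location="us-central1",
        agent_engine_id="your-agent-engine-id",  # placeholder
    ),
)
```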
7. Multi-agent composition
One agent doing one job is the right starting point and often the right ending point. But when a workflow has genuinely distinct responsibilities, ADK composes agents into a graph: a coordinator with specialist sub-agents, each a full LlmAgent.
Suppose triage grows a second responsibility — after classifying a ticket, draft a first-response reply. That is a different skill with a different instruction and a different model tier. Model it as two specialists under a coordinator:
```python
# support_triage/agent.py
from google.adk.agents import LlmAgent

from .prompts import TRIAGE_INSTRUCTION, DRAFTER_INSTRUCTION, COORDINATOR_INSTRUCTION
from .tools import lookup_routing

classifier = LlmAgent(
    name="ticket_classifier",
    model="gemini-2.5-flash",
    description="Assigns severity, category, and routing target to a ticket.",
    instruction=TRIAGE_INSTRUCTION,
    tools=[lookup_routing],
)

drafter = LlmAgent(
    name="response_drafter",
    model="gemini-2.5-pro",
    description="Drafts a first-response reply once a ticket is classified.",
    instruction=DRAFTER_INSTRUCTION,
)

root_agent = LlmAgent(
    name="support_triage_coordinator",
    model="gemini-2.5-flash",
    description="Coordinates ticket triage: classify, then draft a response.",
    instruction=COORDINATOR_INSTRUCTION,
    sub_agents=[classifier, drafter],
)
```
The coordinator does not re-implement classification or drafting — it delegates. It reads each sub-agent's description, decides which one a given step needs, and routes. This is why the `description` field is functional, not cosmetic: in `sub_agents` it is the routing signal. The coordinator's own instruction is short — it describes the sequence (classify, then draft), not the work.
Note the deliberate model split: flash for the classifier and coordinator, pro for the drafter. Drafting a customer-facing reply benefits from a stronger model; classification does not. Composition lets you spend model budget per step instead of per agent.
8. Evaluations: building an eval set
An agent without an eval set is a prompt you are hoping about. ADK treats evaluation as first-class: an eval set is a JSON file of test cases — inputs, the expected tool calls, and a reference response — and `adk eval` runs your agent against it and scores the result. A minimal `eval/triage.evalset.json`:
```json
{
  "eval_set_id": "triage_core",
  "name": "Support triage — core cases",
  "eval_cases": [
    {
      "eval_id": "p1_data_loss",
      "conversation": [
        {
          "user_content": {
            "parts": [{ "text": "Our production database is returning empty results for all customers since the 3am deploy." }],
            "role": "user"
          },
          "final_response": {
            "parts": [{ "text": "Severity: P1. Category: data-pipeline. Routing target: Data Platform (#oncall-data). Justification: production data is unavailable for all customers." }],
            "role": "model"
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "lookup_routing", "args": { "category": "data-pipeline" } }
            ]
          }
        }
      ]
    },
    {
      "eval_id": "p4_feature_request",
      "conversation": [
        {
          "user_content": {
            "parts": [{ "text": "It would be nice if the dashboard had a dark mode." }],
            "role": "user"
          },
          "final_response": {
            "parts": [{ "text": "Severity: P4. Category: ui. Routing target: Web (#oncall-web). Justification: this is a cosmetic feature request, not a defect." }],
            "role": "model"
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "lookup_routing", "args": { "category": "ui" } }
            ]
          }
        }
      ]
    }
  ]
}
```
Run it:
```bash
adk eval support_triage eval/triage.evalset.json
```
ADK scores two things, and the distinction is the whole point:
- Tool trajectory — did the agent call the right tools with the right arguments? This is a near-exact match against `tool_uses`. For the `p1_data_loss` case, the agent must call `lookup_routing` with `category="data-pipeline"`. Call it with the wrong category, or skip it and guess the team, and the trajectory score drops. This catches the failure mode prose review misses — an answer that reads fine but was reached by the wrong path.
- Response quality — does the final response match the reference? Because the wording will never be character-identical, this is scored by an LLM-as-judge against `final_response`, producing a similarity score rather than a pass/fail.
Two principles for an eval set that earns its keep. Seed it from real tickets, not invented ones — the cases that matter are the inputs your users actually send, including the ugly ambiguous ones. Test the escape hatch — add a case where the input is genuinely too vague to classify and the expected behavior is the clarifying question, not a guess. An eval set of only clean inputs tells you nothing about the agent's failure behavior, which is the behavior you most need to trust.
This is also where ADK becomes a CI artifact. adk eval exits non-zero on failure, so it drops straight into a pipeline step — the eval set runs on every pull request, and a prompt change that quietly breaks P1 detection fails the build instead of reaching production. Our six-week pilot methodology makes this the gate between "the agent works on my machine" and "the agent ships."
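Wired as a pipeline step, the gate needs nothing beyond that exit code. A minimal sketch as a pytest test, using plain `subprocess`; the filename is hypothetical:
```python
# eval/test_eval_gate.py (hypothetical filename) — CI gate on the eval
# set, relying only on `adk eval` exiting non-zero when a case fails.
import subprocess

def test_triage_evalset_passes():
    result = subprocess.run(
        ["adk", "eval", "support_triage", "eval/triage.evalset.json"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, f"eval failed:\n{result.stdout}\n{result.stderr}"
```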
9. The local development loop
Before any deploy, the inner loop. ADK gives you three local commands and you will live in them.
`adk run` — the agent in your terminal, interactive:
```bash
adk run support_triage
```
You type a ticket, the agent responds, you iterate on the prompt. Fastest feedback for "does this instruction do what I meant."
`adk web` — a local web UI on localhost:
```bash
adk web
```
This is the one to develop in. It is not a chat box — it is a trace inspector. Every run shows the full execution: which sub-agent handled the step, every tool call with its arguments and return value, the model's reasoning, token counts. When the agent does something surprising, `adk web` shows you where — the tool that returned something unexpected, the step that got routed to the wrong sub-agent. Reading prose output guesses at the cause; reading the trace shows it.
`adk eval` — the eval set from section 8, run locally before you push.
The loop in practice: `adk web` to develop and watch traces, `adk run` for a quick terminal check, `adk eval` before every commit. All three run against the same agent object that deploys to production — there is no separate "local mode." What you debug is what you ship.
10. Deploying to Agent Runtime
The deploy is the part teams expect to be hard and it is not — because the project was laid out correctly in section 3. The same root_agent object you have been running locally is what deploys. Agent Runtime is the managed target: autoscaling, observable infrastructure, no servers to run. The deployment/runtime.py script:
```python
# deployment/runtime.py
import vertexai
from vertexai import agent_engines

from support_triage.agent import root_agent

vertexai.init(
    project="your-project-id",
    location="us-central1",
    staging_bucket="gs://your-staging-bucket",
)

remote_agent = agent_engines.create(
    agent_engine=root_agent,
    requirements=[
        "google-adk",
    ],
    display_name="support-triage-agent",
)

print(f"Deployed: {remote_agent.resource_name}")
```
```bash
python deployment/runtime.py
```
What this does: packages the `support_triage` package, uploads it through the staging bucket, provisions a managed Agent Runtime instance, and returns a resource name — the addressable endpoint for the deployed agent. From there it scales with traffic, and you query it through the returned handle or the Agent Platform APIs.
Three things to get right before the first deploy:
- `requirements` must list every dependency your agent imports beyond ADK itself. The runtime environment is not your laptop — if `tools.py` imports a library, it goes in this list. A missing requirement is the most common first-deploy failure, and it surfaces at runtime, not at deploy time.
- The staging bucket is real infrastructure — a Cloud Storage bucket in the same project, created ahead of time. The deploy uploads your packaged agent through it.
- Service account permissions — the runtime executes as a service account that needs permission to call the models and any tools the agent reaches. Provision this before the deploy, not after the first permission error in production.
Once this script is in the repo, deployment is a CI step like any other — the same artifact that passed adk eval is the artifact that deploys. The platform overview maps how Agent Runtime sits in the Scale pillar alongside the rest of the production surface.
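To smoke-test the endpoint from a script, the deployed agent is queryable through the same SDK. A sketch, assuming the `agent_engines.get`, `create_session`, and `stream_query` surface for ADK agents on Agent Runtime; the resource name is a placeholder for the one the deploy printed, and the method names are worth confirming against your SDK version:
```python
# Sketch: querying the deployed agent by resource name. The resource
# name below is a placeholder; method names assumed from the Vertex AI
# agent_engines surface for ADK agents.
import vertexai
from vertexai import agent_engines

vertexai.init(project="your-project-id", location="us-central1")

remote_agent = agent_engines.get(
    "projects/123/locations/us-central1/reasoningEngines/456"  # from the deploy output
)

session = remote_agent.create_session(user_id="smoke-test")
for event in remote_agent.stream_query(
    user_id="smoke-test",
    session_id=session["id"],
    message="Login page returns 500 for every SSO user since this morning.",
):
    print(event)
```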
11. Observability: traces and what to log
A deployed agent you cannot see inside is a liability. ADK is instrumented with OpenTelemetry out of the box — the agent emits spans for every run, every sub-agent invocation, every tool call and model call, with no extra instrumentation code.
The ADK trace view is the production analogue of what adk web showed you locally — the same full execution trace, now for live traffic. The continuity is the point: the trace you debugged in development is the trace you read in production. For your existing observability stack, point the OpenTelemetry export at any OTLP-compatible backend — Cloud Trace, or a third-party APM — by configuring the exporter at runtime startup. The spans ADK emits are standard OpenTelemetry; any conformant backend ingests them, which keeps agent traffic in the same observability picture as the rest of your services.
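Wiring that exporter is standard OpenTelemetry setup rather than anything ADK-specific. A minimal sketch with the OpenTelemetry Python SDK; the collector endpoint is a placeholder:
```python
# Sketch: route this process's spans — ADK's included — to an OTLP
# collector. Standard OpenTelemetry SDK; run before the first agent call.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "support-triage-agent"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://your-collector:4317"))
)
trace.set_tracer_provider(provider)
```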
What to pay attention to once traces are flowing:
- Tool-call arguments and returns. The highest-signal data. A misbehaving agent is almost always a tool getting the wrong arguments or returning something the model then misreads — the trace shows both sides.
- Token counts per step. Your cost and latency budget, made visible. A step quietly consuming more tokens than expected is a regression you want to catch from a dashboard, not an invoice.
- `finish_reason` on model calls. A response that hit a token limit (`MAX_TOKENS`) instead of completing (`STOP`) is silently truncated — a real failure hiding behind output that reads fine.
- Sub-agent routing decisions. In a multi-agent system, which sub-agent the coordinator chose, for which input. A coordinator routing to the wrong specialist is a bug only the trace makes legible.
One discipline: do not log raw prompt and response content indiscriminately. Prompts and completions carry user data, and in a regulated context that is data with handling rules. Log the structural telemetry — trajectories, token counts, latencies, finish reasons — freely; gate content logging behind the same controls as any other sensitive data path.
12. Putting it together
The finished project is everything above in one repo — the same support-triage-agent/ layout from section 3, now with every file filled in:
- `pyproject.toml` — dependencies and packaging.
- `.env` — local secrets, gitignored.
- `support_triage/` — the agent package.
  - `agent.py` — the coordinator, classifier, and drafter.
  - `tools.py` — `lookup_routing` plus any MCP toolsets.
  - `prompts.py` — the three instructions.
- `eval/triage.evalset.json` — the trajectory and response-quality cases.
- `deployment/runtime.py` — the Agent Runtime deploy.
That is a production agent. Not a prototype with a "harden it later" backlog — an agent with version control, a test suite that runs in CI, traces that flow to your observability stack, and a deploy that is one script. It fits in one readable project because ADK makes the hard parts — tool dispatch, state, sub-agent routing, evaluation, the managed deploy — first-class instead of bespoke. The work is in the design decisions (which state model, when to compose, what to put in the eval set), not the plumbing.
The companion repo, zinch-ai/adk-starter-kit, is this exact project, runnable — clone it, set your .env, and you have the loop from section 9 working in a few minutes. For the official API reference underneath this walkthrough, the canonical sources are the ADK Python repository and Google's ADK documentation.
Where to go from the starter kit depends on what you are building. The Engineering Code Review blueprint is this pattern applied to a real engineering workflow — a single agent, an eval set, Memory Bank for cross-session context. For a healthcare worked example, our prior authorization automation reference architecture takes the same ADK build pattern — focused tool inventory, Memory Bank, an eval set — into a regulated payer workflow, with a human-in-the-loop confirmation gate added for medical-necessity review; the prior authorization blueprint is the engagement shape around it. The full set of agent blueprints covers the rest, and our approach is how a project like this becomes a production system on a six-week clock rather than an open-ended build. If you want that scoped against your own workload, start a conversation.
13. FAQ
**ADK or Agent Studio: which one should we use?**
Both are part of the Build pillar; they target different owners. Agent Studio is the low-code canvas — fast for a non-engineering builder to assemble an agent without writing code. ADK is the code-first path — your agent is a Python or TypeScript package in your repo. Choose ADK the moment the agent needs version control, a real test suite, CI integration, or more than two non-trivial integrations — which is most agents headed for production. Choose Agent Studio for a quick internal tool that a business team will own and where a code review would be overhead. The honest tell: if the answer to "who edits this in six months" is "an engineer," start in ADK.
**Is ADK Gemini-only?**
Not only Gemini. ADK runs Gemini models natively, but its model layer is pluggable — you can run an ADK agent on Claude, Llama, or a self-hosted model through the same interface, and the agent code does not change when the model does. The `model` field on `LlmAgent` is the seam. This matters in practice because it lets you pick a model per workload — a fast model for classification, a stronger one for a reasoning step — without rebuilding the agent, and it keeps you off single-vendor model lock-in.
**What is the difference between Session state and Memory Bank?**
Scope and lifetime. Session state is within-conversation memory — it holds the working context of one exchange and is gone when that conversation ends. You use it to pass a value between two steps of the same interaction. Memory Bank is a managed cross-session service — it persists structured memory about a user or entity across every conversation they have, so the agent next week recalls what mattered last week. The rule that prevents the common mistake: if you are tempted to write conversation context into your own database to read back later, you want Memory Bank; if you are passing a value between steps of the same exchange, you want Session state. Do not hand-roll Memory Bank's job in Firestore or Redis — it is the platform primitive for exactly that.
**How does a local agent get to production?**
The same `root_agent` object you run locally is what deploys — there is no rewrite and no separate "production version" of the agent. You write a short deploy script that calls `agent_engines.create()` with your agent, its `requirements` list, and a staging bucket; running it packages the agent, provisions a managed Agent Runtime instance, and returns an addressable resource name. The three things to get right first: list every non-ADK dependency in `requirements`, create the Cloud Storage staging bucket ahead of time, and grant the runtime service account permission to call your models and tools. Once the script is in the repo, deploying is a CI step like any other.
**Which ADK version should we build on?**
For anything going to production, v1.32 stable. It is the maturity line that carries the v1.x API guarantees, and it is what this guide is written against. The v2.0 beta is worth reading the changelog for and evaluating in a branch, but "beta" means the API surface can still move — you do not want a beta SDK under a production agent. Start on v1.32, ship on v1.32, and treat the move to v2.0 as its own scoped piece of work once it reaches stable. The same logic applies across languages: Python v1.32 and TypeScript v1.0 are both production-viable today; Java and Go are in active development and not yet there.
