
Mastering the OpenAI Agents Python SDK: Build Intelligent AI Workflows with Tools, Guardrails & Multi-Agent Coordination


OpenAI Agents Python SDK – Comprehensive Tutorial

Introduction

The OpenAI Agents SDK is a Python framework for building agent-based AI systems that combine large language models (LLMs) with tools and structured workflows. It provides a lightweight set of primitives – Agents, Handoffs, and Guardrails – which can be composed to create complex, goal-driven AI applications (OpenAI Agents SDK). An Agent in this SDK is essentially an LLM (like GPT-4 or GPT-3.5) configured with instructions and equipped with tools it can use (OpenAI Agents SDK) (Orchestrating multiple agents - OpenAI Agents SDK). These agents operate in an agent loop: they receive an input (e.g. a user question), decide on actions (like calling a tool or delegating to another agent), and continue the loop until a final answer is produced (Running agents - OpenAI Agents SDK).

Key architectural concepts include:

  • Agents: The primary actors of your application. An agent consists of an LLM plus a set of instructions (often provided as a system prompt) and optional tools it can invoke. Given a task, an agent can autonomously plan how to solve it by reasoning through the problem, calling tools for assistance, and generating answers (Orchestrating multiple agents - OpenAI Agents SDK). For example, you might have an agent configured as a “Research Assistant” that uses a web search tool to find information and then answers the query.

  • Tools: Functions or capabilities that an agent can use to perform actions beyond the LLM’s built-in knowledge (OpenAI Agents SDK). Tools could be simple Python functions (file I/O, math calculations), external API calls (weather lookup), or even other agents (allowing an agent to delegate subtasks). The Agents SDK makes it easy to turn any Python function into a tool and handle tool inputs/outputs via function calling (Tools - OpenAI Agents SDK). Tools empower agents to interact with the world (e.g., search the web, query databases, execute code) rather than just generate text.

  • Handoffs: A mechanism for an agent to delegate or transfer control to another agent mid-task (OpenAI Agents SDK). Using handoffs, you can compose multiple specialized agents in a workflow – for example, a triage agent that hands off to a billing agent or refund agent based on the user’s request (OpenAI Agents SDK — Getting Started). Handoffs enable modular multi-agent systems where each agent focuses on a specific domain or skill.

  • Guardrails: Built-in validation and safety checks that run alongside agents (OpenAI Agents SDK). Guardrails can screen inputs (to prevent malicious or irrelevant queries) and validate outputs (to ensure correctness or policy compliance) (OpenAI Agents SDK — Getting Started). If a guardrail check fails (called a tripwire), the SDK can halt the agent’s execution or raise an exception before any undesired action is taken (Guardrails - OpenAI Agents SDK). This helps maintain reliability and safety in production applications.

  • Tracing: The SDK has integrated tracing support for debugging and analysis (OpenAI Agents SDK — Getting Started). Every step of an agent’s reasoning – tool usage, decisions, and LLM messages – can be logged as a trace. Developers can inspect these traces (for example, via the OpenAI dashboard) to understand agent behavior, measure performance, and iteratively improve prompts or code (OpenAI Agents SDK — Getting Started). Tracing is enabled by default, giving you visibility into the agent’s thought process and actions.

These components work together to form an agent-based system architecture. An agent receives a goal or query, uses its LLM backbone to reason about the task, calls tools or delegates to other agents as needed, and eventually produces a result. Thanks to the SDK’s design, much of this complexity is handled for you by the agent loop and SDK runtime. You can start simple (a single agent answering questions) and gradually add tools, multiple agents, guardrails, and custom controls as your application grows in sophistication. The SDK emphasizes flexibility with minimal abstraction – it “works great out of the box” but lets you customize every piece of the workflow if necessary (OpenAI Agents SDK).

In the rest of this tutorial, we’ll explore all major features of the OpenAI Agents Python SDK and demonstrate how to use them effectively. We’ll start by installing the SDK and creating a simple agent. Then we’ll delve into configuring custom agents, adding and building tools, managing agent context and state (memory), handling multi-agent orchestration with handoffs, integrating guardrails for safety, and more. Code examples are provided throughout, and by the end we’ll combine these concepts into a full example project. Let’s get started!

Installation and Setup

Before using the Agents SDK, you need to install it. The package name is openai-agents. Install it via pip:

pip install openai-agents

This will install the core SDK and its dependencies. By default, the SDK will use OpenAI’s API for LLM calls, so you should have an OpenAI API key ready. Set the `OPENAI_API_KEY` environment variable to your API key (or use the SDK’s programmatic method to configure it, shown below) before running agents:

export OPENAI_API_KEY="sk-YourOpenAIAPIKey"
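If you prefer to configure the key from Python (e.g. in a notebook) rather than the shell, a standard-library option is to set it in the process environment before any API calls are made. (The SDK also ships a programmatic helper, set_default_openai_key in recent versions; verify the import against your installed version.)

```python
import os

# Set the API key for this process only (placeholder value shown).
# Libraries that read OPENAI_API_KEY at call time will pick it up.
os.environ["OPENAI_API_KEY"] = "sk-YourOpenAIAPIKey"

print(os.environ["OPENAI_API_KEY"].startswith("sk-"))  # True
```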

Optional dependencies: If you plan to use LiteLLM (a plugin system to integrate other model providers or open-source models) or voice features, there are extra dependencies. For example, to enable LiteLLM support (allowing use of non-OpenAI models), install with the litellm extra:

pip install "openai-agents[litellm]"

We’ll cover using custom models via LiteLLM in a later section. For now, with the SDK installed and your API key set, we can jump into creating agents.

Creating and Configuring an Agent

Creating a basic agent is straightforward. The SDK provides an Agent class, which you can instantiate with a name and optional parameters like instructions, tools, handoffs, and guardrails. At minimum, an agent needs a name (used for identification and logging) and can optionally have an instruction prompt that guides its behavior.

Let’s create a simple agent:

from agents import Agent, Runner

# Create a basic agent with a name and role instructions
agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant."
)

In this example, we configure the agent to behave as a helpful assistant by providing a brief instruction (which acts like a system prompt). If we run this agent on a user query, it will use an OpenAI GPT model under the hood to generate a response following that instruction.

Running the agent: The SDK’s Runner is used to execute agents. You can call Runner.run_sync(agent, input) for a quick synchronous call, or use await Runner.run(agent, input) in an async context. For example:

result = Runner.run_sync(agent, "Write a haiku about recursion in programming.")
print(result.final_output)

This will prompt the agent to answer the query. Under the hood, the agent’s LLM is called with the system instructions and user question, then the agent loop handles the response. Given our simple setup, the agent will directly answer. The result object (of type RunResult) contains the final output in result.final_output. In this case, it might print a haiku such as:

Code within the code,
Functions calling themselves,
Infinite loop's dance.

(This example is the “hello world” from the documentation (OpenAI Agents SDK).)

By default, Agent() without specifying a model will use OpenAI’s default chat model via the Responses API as the LLM. Make sure your OpenAI API key is set so the agent can call the model. You can change which model or API is used – more on that soon.

Agent configuration options: The Agent class provides several parameters to customize its behavior:

  • name (str): A descriptive name for the agent (e.g., "ResearchAgent"). This name is also used when the agent is used as a tool or in logs/traces.

  • instructions (str or callable): The prompt or system message that sets the role of the agent. This can be a fixed string or a function that generates a string (potentially using dynamic context). The instructions are placed at the top of the LLM’s context (as the system role) (Context management - OpenAI Agents SDK). Good instructions are crucial for guiding the agent’s behavior and informing it about available tools or constraints.

  • tools (list): A list of tools the agent can use. Tools can be added here as function references (for function tools) or tool instances (for certain built-in tools). We will discuss tools in depth in the next section. If no tools are provided, the agent can only rely on the LLM’s own knowledge and reasoning.

  • handoffs (list): A list of other agents or handoff configurations that this agent can delegate to. By specifying handoffs, you allow your agent to transfer control to another agent in certain situations. We will explore handoffs in the Multi-Agent and Handoff section.

  • guardrails (list): A list of guardrail functions (input or output guardrails) to apply for this agent. Guardrails are used to enforce validation on inputs or outputs – for example, preventing the agent from solving a user’s math homework or ensuring the output is in a certain format. Guardrails are covered in Guardrails and Validation below.

  • model: By default, each agent uses the global default model (OpenAI’s GPT via the Responses API). You can override this per agent by passing a model parameter. The SDK includes model classes like OpenAIChatCompletionsModel and OpenAIResponsesModel, and with LiteLLM you can use models from Anthropic and other providers. For example, you can pass an OpenAIChatCompletionsModel instance to force using the Chat Completions API, or a LitellmModel to use a model from another provider. If you’re fine with the default, you can omit this parameter.

  • output_type: (Optional) A Pydantic model or data type specifying the expected structure of the final output. If provided, the SDK will attempt to parse the agent’s final answer into this type (and the LLM will be instructed to format its answer accordingly). This is useful for enforcing structured outputs (like JSON responses with specific fields). It’s an advanced feature – for most basic use cases, the final output will just be a string.

  • Other settings: Model behavior such as temperature or max_tokens can be tuned per agent via its model settings (the ModelSettings object). Loop limits such as max_turns (how many iterations the agent loop may run before stopping) are set when executing the agent, via Runner.run options. We’ll touch on these in context.
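To make output_type concrete, here is a rough standard-library sketch of what structured-output parsing amounts to. The real SDK uses Pydantic models and instructs the LLM to emit matching JSON; WeatherReport and the raw output string below are invented for illustration:

```python
import json
from dataclasses import dataclass

@dataclass
class WeatherReport:
    city: str
    temperature_c: float

# Pretend the model was instructed to answer with JSON matching WeatherReport.
raw_model_output = '{"city": "San Francisco", "temperature_c": 18.5}'

# Parse and validate the final answer into the declared type -
# roughly what the SDK does for you when output_type is set.
report = WeatherReport(**json.loads(raw_model_output))
print(report.city, report.temperature_c)  # San Francisco 18.5
```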

With a basic agent defined, let’s move on to giving it more capabilities with tools.

Adding and Using Tools

One of the most powerful features of agentic systems is the ability for agents to use tools. Tools extend what an LLM agent can do – they allow the agent to take actions like browsing the web, doing calculations, retrieving information from databases, or even running code. In the OpenAI Agents SDK, tools are implemented as functions (Python callables) that the agent can invoke through a standardized interface (think of OpenAI’s function calling mechanism).

There are two main kinds of tools in this SDK:

  1. Hosted Tools (Built-in) – Tools provided by OpenAI via the API, such as web search, file retrieval, or computer control. These run on OpenAI’s side in the “Responses API” environment (Tools - OpenAI Agents SDK).
  2. Function Tools (Custom) – Tools that you implement in Python. The SDK makes it simple to expose a Python function (or coroutine) as a tool an agent can call (Tools - OpenAI Agents SDK).

Let’s explore each type and how to use them.

Built-in Hosted Tools

OpenAI’s Responses API offers a few built-in tools that agents can use without you writing any code for those capabilities. Currently, the built-in tools include:

  • Web Search – Allows an agent to search the web and retrieve live information (Tools - OpenAI Agents SDK).
  • File Search (Retrieval) – Allows querying a vector store of documents (useful for retrieval-augmented generation with your own data) (Tools - OpenAI Agents SDK).
  • Computer (Operating) – Allows an agent to simulate using a computer (mouse, keyboard actions) to perform tasks in a virtual environment (Tools - OpenAI Agents SDK).

To use hosted tools, your agent’s model must be the OpenAI Responses API model (which is the default). For example, to enable web search and file retrieval in an agent:

from agents import Agent, Runner, WebSearchTool, FileSearchTool

# Agent that can use web search and file search tools
agent = Agent(
    name="ResearchAgent",
    instructions="You are a research assistant that finds information online and in documents.",
    tools=[
        WebSearchTool(),  # enables web searches
        FileSearchTool(
            max_num_results=3,
            vector_store_ids=["<YOUR_VECTOR_STORE_ID>"]  # specify your vector store(s)
        )
    ]
)

result = Runner.run_sync(
    agent, 
    "Which coffee shop should I go to, taking into account my preferences and the weather today in SF?"
)
print(result.final_output)

In this code, we created a ResearchAgent that has access to two tools: a web search and a file search. The agent’s instructions suggest it should use these tools as needed. When we run it with a query, the agent can choose to call WebSearchTool to search the web for coffee shops and check weather, or use FileSearchTool if relevant data is stored in a vector DB. The final output will be the agent’s answer to the question, presumably using the information gathered.

Note: Hosted tools like WebSearchTool and FileSearchTool are executed by OpenAI’s systems (the model “plugs” into them) rather than running locally. They may incur additional costs or usage limits (for example, OpenAI prices the File Search tool usage separately (New tools for building agents | OpenAI)). Ensure you consult OpenAI’s tool documentation and have appropriate access if you use them. If your use-case doesn’t require these specific tools or you prefer full local control, you can implement similar functionality as custom function tools (for instance, using the requests library for web search or a local vector DB library).
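As a toy illustration of that last point, a local stand-in for file search can be an ordinary Python function over an in-memory document store. This naive keyword-overlap scorer (all names below are invented) only shows the shape of such a tool; a real implementation would query a vector database:

```python
# Toy local "file search": naive keyword-overlap scoring over an
# in-memory document store (a real tool would use embeddings).
DOCS = {
    "coffee.md": "Blue Bottle and Ritual are popular coffee shops in SF.",
    "weather.md": "SF summers are foggy; bring a light jacket.",
}

def file_search(query: str, max_num_results: int = 3) -> list:
    query_words = set(query.lower().split())
    scored = []
    for name, text in DOCS.items():
        doc_words = set(text.lower().replace(";", " ").replace(".", " ").split())
        score = len(query_words & doc_words)
        if score > 0:
            scored.append((score, name))
    scored.sort(reverse=True)  # highest keyword overlap first
    return [name for _, name in scored[:max_num_results]]

print(file_search("best coffee shops in SF"))  # ['coffee.md', 'weather.md']
```

Decorated with @function_tool (covered next), a function like this becomes callable by the agent in the same way as the hosted FileSearchTool.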

Custom Function Tools

For most use cases, you will create your own tools by writing Python functions. The Agents SDK can automatically convert a Python function into a tool interface that the LLM can call. It uses Python’s introspection and Pydantic to generate a JSON schema for the function’s arguments, and even uses the function’s docstring as the tool’s description (Tools - OpenAI Agents SDK). This is similar to OpenAI function calling – the LLM will be given the function name, description, and parameter schema, and if it decides to call the function, it will produce a JSON payload that the SDK parses and then calls your Python function with.

To declare a function as a tool, simply decorate it with @function_tool (imported from agents). For example:

from typing_extensions import TypedDict
from agents import Agent, Runner, function_tool

# Define a TypedDict for structured input (if needed)
class Location(TypedDict):
    lat: float
    long: float

@function_tool
async def fetch_weather(location: Location) -> str:
    """Fetch the weather for a given location.
    
    Args:
        location: The location to fetch the weather for (with 'lat' and 'long').
    """
    # In a real implementation, call a weather API using the coordinates.
    # Here we just return a dummy response.
    return "It is currently sunny at the given location."

@function_tool(name_override="fetch_data")
def read_file(path: str, directory: str | None = None) -> str:
    """Read the contents of a file.
    
    Args:
        path: The path to the file to read.
        directory: The directory to read the file from (optional).
    """
    # In a real implementation, open and read the file from disk or storage.
    return "<file contents>"

Let’s break down what’s happening:

  • We import function_tool and use it as a decorator on our functions fetch_weather and read_file. This tells the SDK to treat these functions as tools.

  • The fetch_weather function is async and takes a Location TypedDict with lat and long floats, returning a string. We gave it a docstring describing what it does. The SDK will automatically create a tool named "fetch_weather" (by default matching the function name) with a description taken from the docstring, and a JSON schema for an argument location (with lat and long as required properties) (Tools - OpenAI Agents SDK). When the agent calls this tool, the SDK will provide the location parameter as a Python dict per the schema.

  • The read_file function is synchronous (can be sync or async, both are fine) and demonstrates the use of name_override in the decorator. We override the tool name to "fetch_data" to present a more semantically appropriate name to the LLM, instead of the function name read_file. This tool reads a file from a given path (with an optional directory). Its docstring will be used as the description unless we disable that feature. The SDK will generate a schema for two parameters: path (string) and directory (optional string).

After defining the tools, integrate them with an agent:

agent = Agent(
    name="UtilityAgent",
    instructions="You are a helpful assistant with access to file and weather information.",
    tools=[fetch_weather, read_file]  # we pass the function objects themselves
)

# Run the agent to test the tools
result = Runner.run_sync(agent, "What is the weather at latitude 37.77, longitude -122.42 and show me the content of file README.md?")
for tool in agent.tools:
    print(f"Tool loaded: {tool.name} - {tool.description}")
print("Final answer:", result.final_output)

When the agent receives the question, it might decide to call fetch_weather with the given coordinates, then call fetch_data (our read_file) with "README.md". The SDK handles the function calls: it will parse the LLM’s requests, execute fetch_weather and read_file, get their return values, and provide those back to the LLM as function results. The agent’s LLM then incorporates those results into its final answer. The printed output would list the tools and their auto-generated descriptions and schema, and the final answer might look something like a combined response of weather and file content (depending on how the prompt was phrased and the agent’s reasoning).

How it works: The SDK uses the Python inspect module and the griffe library to extract parameter names, types, and docstrings from your function to build a schema (Tools - OpenAI Agents SDK). The @function_tool decorator can be configured with parameters like name_override (as shown) or use_docstring_info=False if you don’t want to auto-use the docstring for descriptions. Under the hood, when an agent with function tools runs, the LLM is provided function definitions (names, descriptions, parameter specs) and it will output a special JSON if it decides to invoke a function. The SDK captures that, calls your Python function, and then feeds the return value (or error) back to the LLM. This loop continues until the LLM produces a final non-function output (Running agents - OpenAI Agents SDK). All of this is abstracted away so you can simply focus on writing the Python function for the tool’s logic.
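A greatly simplified standard-library sketch of that schema extraction – the real SDK maps Python types through Pydantic and parses docstring sections via griffe, whereas here every parameter is just treated as a string:

```python
import inspect
from typing import Optional

def read_file(path: str, directory: Optional[str] = None) -> str:
    """Read the contents of a file."""
    return "<file contents>"

def build_tool_schema(func):
    """Derive a minimal function-calling schema from a signature,
    loosely imitating what @function_tool does internally."""
    sig = inspect.signature(func)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        properties[name] = {"type": "string"}  # simplified type mapping
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default => required parameter
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }

schema = build_tool_schema(read_file)
print(schema["name"], schema["parameters"]["required"])  # read_file ['path']
```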

Error handling in tools: If your tool function raises an exception or returns invalid data, how does the agent handle it? By default, the SDK will inform the LLM that an error occurred in the tool call (this is done via a default error handler that returns an error message to the LLM) (Tools - OpenAI Agents SDK). You can customize this behavior. The @function_tool decorator accepts a failure_error_function parameter where you can specify a custom error-handling function. For instance, you might want to handle certain exceptions and return a specific error message to the agent, or even decide to not catch it at all. If you pass failure_error_function=None, the raw exception will bubble up (e.g., a ModelBehaviorError if the model’s function call was malformed, or UserError if your tool code crashed) (Tools - OpenAI Agents SDK). In most cases, leaving the default behavior is fine – the agent will simply see an “error” result and can decide to apologize or try another strategy.
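As a rough stand-alone sketch of that default behavior (illustrative only, not the SDK’s actual code; run_tool and flaky_tool are invented), the runtime wraps each tool call and converts exceptions into an error string the model can react to:

```python
def run_tool(func, *args, error_handler=None, **kwargs):
    """Run a tool function, converting exceptions into messages for
    the model - the role a failure_error_function plays in the SDK."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        if error_handler is not None:
            return error_handler(exc)  # custom handler, like failure_error_function
        return f"Error running tool {func.__name__}: {exc}"  # default: report to model

def flaky_tool(path: str) -> str:
    raise FileNotFoundError(path)

print(run_tool(flaky_tool, "README.md"))
print(run_tool(flaky_tool, "README.md", error_handler=lambda e: "tool unavailable"))
```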

Agents as Tools

Besides regular function tools, you can also treat an Agent itself as a tool for another agent. This is a way to create a hierarchy of agents: a top-level agent can call a sub-agent as if it were calling a function. The difference between a handoff and an agent-tool (agent-as-tool) is that with a handoff, the sub-agent takes over completely (new conversation context), whereas with an agent tool, the main agent remains in control and simply gets the sub-agent’s result as a function result.

Use case: Suppose you want a central orchestrator agent that can call specialized translator agents for different languages, but you don’t want to fully hand off control (so that the main agent can potentially call multiple translators and then combine results). You can do this by adding agents as tools.

Example:

from agents import Agent, Runner

# Define two specialized agents (that could also be run standalone)
spanish_agent = Agent(name="SpanishAgent", instructions="Translate the user's message to Spanish.")
french_agent  = Agent(name="FrenchAgent",  instructions="Translate the user's message to French.")

# Create an orchestrator agent that can use the above agents as tools
orchestrator = Agent(
    name="OrchestratorAgent",
    instructions=(
        "You are a translation orchestrator. You have tools to translate to Spanish or French. "
        "If the user asks for a translation, call the relevant tool. Otherwise, just respond."
    ),
    tools=[
        spanish_agent.as_tool(tool_name="translate_to_spanish", 
                               tool_description="Translate the user's message to Spanish"),
        french_agent.as_tool(tool_name="translate_to_french", 
                              tool_description="Translate the user's message to French")
    ]
)

result = Runner.run_sync(orchestrator, "Please say 'Hello, how are you?' in French and Spanish.")
print(result.final_output)
# The orchestrator might use both tools and produce something like:
# "Spanish: Hola, ¿cómo estás?\nFrench: Bonjour, comment ça va ?"

Here we used agent.as_tool() to wrap each sub-agent as a tool with a given name and description (Tools - OpenAI Agents SDK). Internally, when the Orchestrator agent decides to call translate_to_french, the SDK will run the french_agent (using the query text as input) and return its final output as the “function result” to Orchestrator. This allows chaining agents without a formal handoff.

Customizing agent-tools: The as_tool method is a convenience for simple cases. It may not expose all agent configuration. For example, you might want the sub-agent to run with a specific max_turns limit or use a different model for that call. In such cases, you can always create a custom tool function that calls Runner.run on an agent. For instance:

from agents import function_tool

@function_tool
async def run_my_agent(subquery: str) -> str:
    """Run a custom agent with a given prompt."""
    custom_agent = Agent(name="MathAgent", instructions="Solve math problems step by step.")
    result = await Runner.run(custom_agent, subquery, max_turns=5)
    return str(result.final_output)

This defines a tool run_my_agent that, when called by an agent, will internally spin up a new MathAgent and execute it on the provided subquery (Tools - OpenAI Agents SDK). This technique gives you full control: you can set model, temperature, turns, etc., for the sub-agent call. The trade-off is that the logic is now on you to implement (whereas agent.as_tool did it automatically with defaults).

In summary, tools – whether built-in, custom, or agent-based – are how you give your agent the ability to take actions. Next, we’ll see how to manage multiple agents more directly via handoffs, and how to maintain context and state between these interactions.

Managing Agent Context and State

In any non-trivial application, you’ll have some notion of context or state that the agent should be aware of or carry between steps. There are two main kinds of context to consider in the Agents SDK (Context management - OpenAI Agents SDK):

  1. Local context (Environment state) – data and dependencies on the Python side that your tools or callbacks need access to.
  2. LLM context (Conversation state) – information the LLM sees in its prompt (conversation history, system instructions, etc.) which constitutes the “memory” from the model’s perspective.

The SDK provides mechanisms to handle both types of context.

Local Context (Python Environment Data)

Local context is any Python object or data that you want to pass into the agent’s run, which can then be accessed inside tool functions or lifecycle hooks (like an on_handoff callback). For example, you might have a database connection, a user profile, or a simple flag that your tools need. You don’t want to embed this in the LLM prompt, but you do want your code to have it handy.

The Agents SDK addresses this through the RunContextWrapper and context passing in the Runner.run call (Context management - OpenAI Agents SDK):

  • You can create an arbitrary Python object (could be as simple as a dict, or a dataclass, etc.) that holds your context data.
  • When calling Runner.run (or run_sync), pass this object via the context= argument.
  • Any tool function that needs to use this context can declare a parameter of type RunContextWrapper[YourType]. The SDK will then pass a wrapper object to that tool, through which you can access your context (wrapper.context gives the underlying object) (Context management - OpenAI Agents SDK).

Example:

from dataclasses import dataclass
from agents import Agent, RunContextWrapper, Runner, function_tool

@dataclass
class UserInfo:
    name: str
    id: int

# A tool that uses the context to fetch some user-specific info
@function_tool
def fetch_user_data(ctx: RunContextWrapper[UserInfo]) -> str:
    user = ctx.context  # this is the UserInfo object
    # Using data from the context (this data is NOT visible to the LLM directly)
    return f"User {user.name} (id={user.id}) has 42 notifications."

# Create an agent that has this tool
agent = Agent[UserInfo](
    name="PersonalAssistant",
    instructions="You are an assistant that knows the user's info.",
    tools=[fetch_user_data]
)

# Prepare a context object
user_info = UserInfo(name="Alice", id=123)

# Run the agent with the context
result = Runner.run_sync(agent, "How many notifications do I have?", context=user_info)
print(result.final_output)
# Expected output uses the tool result: "User Alice (id=123) has 42 notifications."

Key points in this example:

  • We define a UserInfo dataclass to hold some user-specific data.
  • The fetch_user_data tool has a parameter ctx: RunContextWrapper[UserInfo]. When the agent calls this tool, it will receive a RunContextWrapper that wraps our UserInfo object. We then retrieve the actual context via ctx.context.
  • The agent is defined as Agent[UserInfo] – using a Python generic type for the agent. This helps with static type checking (it ensures that the agent’s tools and context all expect the same UserInfo type) (Context management - OpenAI Agents SDK). It’s not strictly required to use the generic, but it’s a nice feature for catching mistakes.
  • We call Runner.run_sync with context=user_info. This attaches our UserInfo object to the run. All tool calls, guardrail checks, and handoff callbacks during this run will carry this context object via the wrapper.

The local context object is never sent to the LLM (Context management - OpenAI Agents SDK). It’s purely for your Python code’s use. This means you can put sensitive or large data there without worry – the LLM won’t see it unless you explicitly feed it through a tool or prompt.

Common uses of local context:

  • Passing in a database session or API client that tools can use.
  • Providing user-specific data (preferences, profile info) that some tools might need to personalize responses.
  • Caching data between tool calls, or accumulating results.

Remember that all components in a single agent run must share the same context type (Context management - OpenAI Agents SDK). If you start an agent with a certain context object type, you shouldn’t call a tool expecting a different context type on that same run.

LLM Context (Conversation and Memory)

Now let’s talk about context from the LLM’s perspective – essentially the conversation history and prompt that the model sees. LLMs don’t have persistent memory between runs (unless you supply it), and at any given step the model only knows what is in the prompt (system message, conversation turns so far, etc.) (Context management - OpenAI Agents SDK). The Agents SDK automatically manages the context for multi-step agent loops and multi-agent handoffs, but you as the developer should know how to provide additional info to the model when needed.

Ways to provide context to the LLM:

  • System Instructions: We already saw that you can put information into agent.instructions. This is akin to a system prompt that stays constant (or can be dynamically generated per run). Use this for global context the agent should always have. For example, if an agent should always know the user’s name or the current date, you could bake that into its instructions (perhaps by templating the instructions string with those values at runtime) (Context management - OpenAI Agents SDK).

  • User Query / Input: When you call Runner.run, you supply an input – often this is the user’s query. That becomes the last user message in the prompt. If you have some dynamic context that’s only relevant to this query, you might prepend or include it in the input. The SDK allows the input to be a raw string or a list of message dicts (each with a role and content) (Running agents - OpenAI Agents SDK). In advanced use, you could use result.to_input_list() from a previous run (which gives you a list of prior messages) and then add a new user message (as shown later) (Running agents - OpenAI Agents SDK). In short, whatever you pass as input will be seen by the agent as the latest user query (or full conversation if you pass a list).

  • Tools for Retrieval: Rather than stuffing all possible relevant info into the prompt (which could hit token limits), a common pattern is to provide tools that fetch information on demand. For instance, if the LLM might need some data, you can expose that via a tool (like a fetch_user_profile tool or a search_documents tool). The LLM will call the tool when it determines it needs that info (Context management - OpenAI Agents SDK). This way, the context is retrieved just-in-time. For example, using the FileSearchTool (vector DB) or a custom query_database tool allows the agent to get supplemental knowledge beyond its base prompt.

  • Handoff with Input Data: If you delegate to another agent, you might want to pass some info along. The SDK’s handoff() function supports an input_type to structure data going into the next agent (Handoffs - OpenAI Agents SDK). For example, a triage agent could hand off with a message like “escalation reason: X” to the next agent by specifying a model (like a Pydantic BaseModel) as the input_type. The LLM will provide values for that model when initiating the handoff. We’ll see more in the next section.
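The “System Instructions” option above often amounts to plain string templating done before the Agent is constructed – a minimal sketch (make_instructions is an invented helper):

```python
import datetime

def make_instructions(user_name: str) -> str:
    """Build a per-run system prompt, injecting data the agent
    should always have (here: the user's name and today's date)."""
    today = datetime.date.today().isoformat()
    return (
        f"You are a helpful assistant for {user_name}. "
        f"Today's date is {today}. Reply concisely."
    )

prompt = make_instructions("Alice")
print(prompt)
```

You would then pass the result as Agent(instructions=make_instructions("Alice")). The SDK also accepts a callable for instructions so the prompt is regenerated on each run; check your SDK version’s docs for the callable’s exact signature.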

In general, conversation history (memory) is automatically tracked within a single call to Runner.run. If the agent loops multiple times (due to tools or handoffs), the intermediate steps are part of the conversation context fed to subsequent LLM calls. For instance, if an agent calls a tool and gets a result, the next LLM call will include the tool’s result in the prompt (as a tool/function result message). The SDK assembles this growing list of items automatically, so within a run the model always sees the full chain of its prior actions.

However, between separate runs (separate calls to Runner.run), the SDK does not automatically remember anything. This statelessness is by design – it’s up to you to carry over relevant context if you want a multi-turn conversation with a user. The SDK helps via the RunResult object’s to_input_list() method, which provides the conversation from that run in a format ready to be used as input for the next run (Running agents - OpenAI Agents SDK).

Example of multi-turn conversation:

from agents import Agent, Runner, trace

agent = Agent(name="Assistant", instructions="Reply very concisely.")

# Suppose thread_id is an identifier for the conversation session
thread_id = "conv-12345" 

# First user question
with trace(workflow_name="Conversation", group_id=thread_id):
    result1 = Runner.run_sync(agent, "What city is the Golden Gate Bridge in?")
print("Agent:", result1.final_output)
# Agent: San Francisco

# User follow-up question, carrying context from previous turn
prev_conv = result1.to_input_list()  # get messages from turn 1
new_input = prev_conv + [{"role": "user", "content": "What state is it in?"}]
with trace(workflow_name="Conversation", group_id=thread_id):
    result2 = Runner.run_sync(agent, new_input)
print("Agent:", result2.final_output)
# Agent: California

In this conversation, we kept the same agent (which has a brief instruction just to be concise). After the first question, the agent answered "San Francisco." We then took the entire conversation (which includes: system message with instructions, user’s first question, assistant’s answer) via result1.to_input_list(). We append a new user message asking the follow-up. By passing that list as input to Runner.run_sync again, the agent is now aware of the prior context and can answer "California."

We also used trace(workflow_name="Conversation", group_id=thread_id) around each run – this ensures the traces from both turns are grouped together under the same conversation ID, which is useful in the OpenAI trace viewer (so you can see the whole conversation flow) (Running agents - OpenAI Agents SDK). The trace context manager is optional but recommended when you want to link multiple runs.

Thus, to maintain memory across turns, you as the developer manage the conversation history (which the SDK makes easy to serialize and pass forward). This explicit approach avoids the SDK making assumptions about which data to carry over, giving you full control.

Summary of Context Handling

  • Use local context (the context= parameter and RunContextWrapper) to keep Python-side state and dependencies that tools and hooks can access. This is your app state or environment available to the agent’s code.

  • Use LLM context (instructions, input messages, tool outputs, and conversation history) to provide information to the LLM. The agent’s own loop will include prior tool actions and handoffs automatically for that query, but for multi-turn dialogues, you feed the previous interaction history into the next call.

  • If you need long-term memory or knowledge base, consider using retrieval tools or external storage (e.g., storing summaries of conversation so far, or using vector search on dialogue history). The SDK’s primitives (tools & context) can be used to implement such memory systems.
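
As a sketch of that long-term-memory point, here is a minimal plain-Python memory layer (all names are hypothetical, not SDK API): after each run you would append result.to_input_list() for the thread, and feed only a recent window into the next run to stay under context limits.

```python
class ConversationMemory:
    """Minimal per-thread memory: keep full history, expose a truncated window."""

    def __init__(self, max_messages: int = 20):
        self.max_messages = max_messages
        self.threads: dict[str, list[dict]] = {}

    def append(self, thread_id: str, messages: list[dict]) -> None:
        # e.g. messages = result.to_input_list() from the latest run
        self.threads.setdefault(thread_id, []).extend(messages)

    def window(self, thread_id: str) -> list[dict]:
        # Only the most recent messages go back into the next run
        return self.threads.get(thread_id, [])[-self.max_messages:]

memory = ConversationMemory(max_messages=4)
memory.append("conv-1", [{"role": "user", "content": f"msg {i}"} for i in range(6)])
print(len(memory.window("conv-1")))  # only the last 4 messages survive
```

A production version might summarize dropped messages instead of discarding them, or store them in a vector index queried by a retrieval tool.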

With context and memory in mind, let’s return to orchestrating multiple agents and the concept of handoffs, which we’ve mentioned but not yet demonstrated.

Multi-Agent Orchestration and Handoffs

Complex tasks often benefit from dividing responsibilities among multiple specialized agents. For example, you might have one agent specialized in planning, another in mathematical reasoning, another in writing prose. The OpenAI Agents SDK supports multi-agent workflows primarily via handoffs, and secondarily via agent-tools and external orchestration (coding the logic in Python). We already touched on agent-tools; here we focus on handoffs and general orchestration patterns.

Handoffs: Delegating to Another Agent

A handoff means that an agent can invoke another agent as the next step in the conversation, effectively transferring control. When Agent A hands off to Agent B, Agent B takes over the dialogue (seeing the conversation history up to that point, by default) and will produce the final answer or possibly hand off further. This is different from treating B as a tool (where A would remain in charge); in a handoff, B becomes the active agent.

To use handoffs, you can configure an agent’s handoffs parameter with one or more target agents (or handoff objects). Each handoff target will appear to the original agent as a special “tool” with a name like "transfer_to_<AgentName>" by default (Handoffs - OpenAI Agents SDK). If the LLM decides that the query should be handled by a different agent, it can output a handoff action, and the SDK will switch the context to that agent.

Basic usage:

from agents import Agent

# Define some specialized agents
billing_agent = Agent(name="BillingAgent", instructions="Handle billing related queries.")
refund_agent  = Agent(name="RefundAgent",  instructions="Handle refund related queries.")

# A triage agent that can hand off to either billing or refund
triage_agent = Agent(
    name="TriageAgent",
    instructions="Figure out if the user's request is about billing or refund, then hand off appropriately.",
    handoffs=[billing_agent, refund_agent]  # we can list Agents directly
)

By listing billing_agent and refund_agent in triage_agent’s handoffs, we are allowing it to delegate to them. Under the hood, the SDK will create handoff “tools” for each, named "transfer_to_billing_agent" and "transfer_to_refund_agent" (it lowercases and underscores the name) (Handoffs - OpenAI Agents SDK). The triage_agent’s LLM instructions should guide it to decide when to use those. For instance, you might say in instructions: “If the question is about billing issues, transfer to the BillingAgent. If it’s about refunds, transfer to the RefundAgent.”

When running triage_agent, if the user asks “I want a refund for my last order”, the agent (with a good prompt) will decide this is a refund issue and respond with a handoff action to RefundAgent. The SDK will then initialize refund_agent with the conversation (by default, it passes the entire conversation so far to the refund agent, but this can be filtered; see below) and continue the loop with refund_agent now generating the answer.

Customizing handoffs: The SDK provides a handoff() function to create a Handoff object with more options (Handoffs - OpenAI Agents SDK). Using handoff() instead of directly listing an Agent allows you to:

  • Override the tool name or description presented to the LLM.
  • Set an on_handoff callback (a Python function) that will execute right when the handoff happens (useful for logging or prepping data).
  • Define an input_type for structured input to the next agent (so the LLM can pass some data to the next agent at the moment of handoff).
  • Provide an input_filter to tweak what portion of the conversation history the new agent sees.

For example:

from agents import handoff, RunContextWrapper

def on_handoff_callback(ctx: RunContextWrapper[None]):
    # This will run when handoff is triggered
    print("Handoff initiated!")

# Create a handoff with custom settings
refund_handoff = handoff(
    agent=refund_agent,
    on_handoff=on_handoff_callback,
    tool_name_override="escalate_to_refund",
    tool_description_override="Escalate the conversation to the refund department"
)

triage_agent = Agent(
    name="TriageAgent",
    handoffs=[billing_agent, refund_handoff]  # one handoff is plain agent, one is customized
)

Now, triage_agent has two handoff options: one to billing (default naming) and one to refund (with a custom name "escalate_to_refund" and a callback) (Handoffs - OpenAI Agents SDK). The on_handoff callback can also accept a second argument for input data if input_type is used.

Passing data with handoffs: Sometimes you want the handing off agent to provide some data or reason to the next agent. For instance, an escalation might include a reason code. You can achieve this by specifying input_type when creating the handoff (Handoffs - OpenAI Agents SDK). The input_type should be a Pydantic model or dataclass describing the data. Then in the prompt for the first agent, encourage it to fill out those fields when handing off.

Example:

from pydantic import BaseModel

class EscalationInfo(BaseModel):
    reason: str

async def on_handoff_with_data(ctx: RunContextWrapper[None], data: EscalationInfo):
    print(f"Escalating because: {data.reason}")

escalation_agent = Agent(name="EscalationAgent", instructions="Handle escalations.")

escalation_handoff = handoff(
    agent=escalation_agent,
    input_type=EscalationInfo,
    on_handoff=on_handoff_with_data
)

# Suppose support_agent can hand off to escalation with a reason
support_agent = Agent(
    name="SupportAgent",
    instructions="If user is very unhappy, escalate. Ask them for reason then escalate.",
    handoffs=[escalation_handoff]
)

Now if support_agent decides to hand off, it can do so with an accompanying JSON payload like {"reason": "User asked for a supervisor"}. The SDK will parse that into an EscalationInfo object and pass it to EscalationAgent (likely included in the conversation or as part of system message) and also to our on_handoff_with_data callback (Handoffs - OpenAI Agents SDK), which here just prints it.

Filtering conversation on handoff: By default, when Agent A hands off to Agent B, Agent B will see the entire conversation history that A had (this includes messages from earlier in the same run). In some cases, you might not want that. Maybe Agent B should only see the last user query and not the deliberation Agent A went through. The SDK allows you to define an input_filter function in the handoff to trim or alter the messages passed to B (Handoffs - OpenAI Agents SDK). For example, you could filter out system messages or previous assistant messages, or shorten long user messages. Implementing an input_filter gives fine control over what context is shared versus reset at the handoff boundary.
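
The real input_filter operates on the SDK's handoff input data type, but the underlying idea is an ordinary function over message history. Here is a sketch on plain message dicts (keep_user_and_assistant is hypothetical; the handoffs docs also mention ready-made filters in agents.extensions.handoff_filters):

```python
def keep_user_and_assistant(history: list[dict]) -> list[dict]:
    # Drop system messages and tool traffic so the next agent
    # only sees the plain dialogue turns
    return [m for m in history if m.get("role") in ("user", "assistant")]

history = [
    {"role": "system", "content": "You are TriageAgent."},
    {"role": "user", "content": "I want a refund."},
    {"role": "assistant", "content": "Let me check that."},
    {"role": "tool", "content": "lookup result..."},
]
print(keep_user_and_assistant(history))
```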

In summary, handoffs are a powerful way to route tasks to specialized agents dynamically. A common pattern is a triage agent that purely decides which specialized agent should handle a request (OpenAI Agents SDK — Getting Started). The specialized agent then does the heavy lifting (potentially with its own tools, etc.). This is analogous to a dispatcher in a customer service system.

Orchestrating via Code vs via LLM

The Agents SDK supports both LLM-driven orchestration (letting the agent decide when to use tools or handoff) and developer-driven orchestration (writing Python code to coordinate agents).

  • Orchestration via LLM (Autonomous planning): This is when you lean on a single agent (or a hierarchy of agents via handoffs) to figure out how to solve an open-ended task (Orchestrating multiple agents - OpenAI Agents SDK). For example, you give an agent a broad goal and tools, and it will iterate, call tools, delegate, etc., until done. The advantage is you leverage the full reasoning ability of the LLM to decide the sequence of actions. The main things to ensure here are good instructions (prompt engineering for telling the agent how and when to use tools) (Orchestrating multiple agents - OpenAI Agents SDK) and robust monitoring (since the LLM might take unexpected turns). The built-in tracing is very helpful to see what the agent is doing and debug its decision-making. A best practice is to keep agents specialized and tools clearly defined, so the LLM knows when to use what (Orchestrating multiple agents - OpenAI Agents SDK). You can also incorporate self-reflection loops (an agent critiquing its own output or approach) as part of its design (Orchestrating multiple agents - OpenAI Agents SDK).

  • Orchestration via Code (Deterministic flows): In some scenarios, you might not want the LLM to decide the whole workflow (maybe for efficiency or reliability reasons). Instead, you can write Python code to orchestrate multiple agent calls in a sequence or in parallel (Orchestrating multiple agents - OpenAI Agents SDK). For example, you might first call an agent to classify a task, then based on the result call one of several other agents (this is akin to doing the triage in code rather than via an LLM’s handoff). Or you might run two agents in parallel on different subtasks and then combine their results. The SDK doesn’t have a high-level abstraction for this (since you’re essentially bypassing the single-agent loop), but you can use normal Python control flow and call Runner.run on different agents as needed. Because the SDK is “Python-first” by design (OpenAI Agents SDK), it encourages using Python logic rather than over-engineering new abstractions for every scenario.

For example, orchestrating via code:

# Pseudocode for a manual orchestration:
query = "Write a blog post about AI agents."

# Step 1: Use an agent to create an outline
outline_agent = Agent(name="OutlineAgent", instructions="Create a brief outline for the blog post.")
outline_result = Runner.run_sync(outline_agent, query)

# Step 2: Feed the outline to a writing agent
writing_agent = Agent(name="WriterAgent", instructions="Write a detailed article following the given outline.")
full_draft_result = Runner.run_sync(writing_agent, outline_result.final_output)

# Step 3: Use an evaluation agent to critique the draft
critique_agent = Agent(name="CritiqueAgent", instructions="Critique the draft and point out improvements.")
critique = Runner.run_sync(critique_agent, full_draft_result.final_output)

# Step 4: If critique suggests changes, feed back for revision or loop
print("Draft:\n", full_draft_result.final_output)
print("Critique:\n", critique.final_output)

In this made-up flow, we manually broke the task into outline -> write -> critique steps. We could even loop the write and critique until the critique is satisfied (this would be a simple while loop around those calls) (Orchestrating multiple agents - OpenAI Agents SDK). This deterministically ensures structure and possibly saves tokens by not asking one agent to reason about the entire process.
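
That write-and-critique loop can be sketched with plain control flow. Here critique_fn and revise_fn are stubs standing in for Runner.run_sync calls on the critique and writer agents; the names and the "APPROVED" convention are illustrative, not SDK API:

```python
def revise_until_approved(draft: str, critique_fn, revise_fn, max_rounds: int = 3) -> str:
    # Loop: critique the draft; stop when the critic is satisfied, else revise
    for _ in range(max_rounds):
        critique = critique_fn(draft)
        if critique == "APPROVED":
            break
        draft = revise_fn(draft, critique)
    return draft

# Stub "agents" for illustration: approve once the draft mentions examples
critique_fn = lambda d: "APPROVED" if "examples" in d else "Add examples."
revise_fn = lambda d, c: d + " ...with examples."
print(revise_until_approved("AI agents intro.", critique_fn, revise_fn))
```

The max_rounds cap plays the same role as max_turns in the SDK's own loop: it guarantees termination even if the critic is never satisfied.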

Choosing an approach: Many applications will mix both approaches. You might use code to handle top-level decisions (like which high-level agent to invoke) and then let that agent use tools autonomously. Or vice versa: use an agent to decide something, then use code to carry out a batch of operations, etc. The SDK is flexible – its primitives (Agent, Runner.run, etc.) can be used wherever needed. The main difference is whether you want the LLM to drive the sequence of actions (faster to develop, but can be unpredictable) or prefer a fixed sequence (more controlled, but possibly less flexible or fluent).

We’ve now covered agents, tools, context, and multi-agent orchestration. Next, we will discuss Guardrails – a critical feature for production systems to ensure inputs and outputs meet certain criteria – and then look at tracing and debugging support. After that, we’ll tie everything together with an example project and best practices.

Guardrails and Output Validation

In agent systems, especially those deployed to end-users, it’s important to have checks on what goes in and out of your agents. The Agents SDK provides Guardrails as a built-in way to implement these checks. There are two types of guardrails you can define: input guardrails, which run on the user’s input, and output guardrails, which run on the agent’s final output.

Guardrails can be used for security (e.g., check for prompts trying to jailbreak the agent or inject malicious content), for policy enforcement (ensure no sensitive info is revealed), or for quality control (make sure the answer is in a required format).

A guardrail is essentially a Python function that takes some input (and possibly the agent or context) and returns a GuardrailFunctionOutput indicating the result of the check (Guardrails - OpenAI Agents SDK). The SDK provides decorators @input_guardrail and @output_guardrail to simplify creating these.

How guardrails work in the loop: When you attach guardrails to an agent, the SDK will run the input guardrails right after receiving the user’s input, and run the output guardrails right after the agent produces a final answer. If any guardrail indicates a failure (via a tripwire flag), the SDK will halt execution and raise an exception (InputGuardrailTripwireTriggered or OutputGuardrailTripwireTriggered) (Guardrails - OpenAI Agents SDK) (Guardrails - OpenAI Agents SDK). You can catch these exceptions around Runner.run if you want to handle them gracefully (e.g., tell the user “sorry, invalid request”).

Let’s walk through creating a simple guardrail. Suppose we have a math-solving agent, but we want to ensure it’s not being exploited to do a user’s homework. We might impose a guardrail: if the user explicitly asks for help with “homework” or similar, we trigger a refusal.

from agents import (
    Agent,
    GuardrailFunctionOutput,
    InputGuardrailTripwireTriggered,
    Runner,
    RunContextWrapper,
    input_guardrail,
)

@input_guardrail
def math_homework_check(ctx: RunContextWrapper[None], agent: Agent, input) -> GuardrailFunctionOutput:
    """Guardrail: Check if user is asking to solve math homework."""
    text = input if isinstance(input, str) else str(input)
    homework = "homework" in text.lower()
    # tripwire_triggered=True signals the SDK to halt the run
    return GuardrailFunctionOutput(output_info={"homework": homework}, tripwire_triggered=homework)

# Create an agent with this guardrail
math_agent = Agent(
    name="MathAgent",
    instructions="You solve math problems step by step.",
    input_guardrails=[math_homework_check]  # adding our input guardrail
)

# Now run the agent with a homework-like query
try:
    Runner.run_sync(math_agent, "Can you do my math homework on calculus?")
except InputGuardrailTripwireTriggered as e:
    print("Guardrail tripwire triggered:", e)

In this code:

  • We used @input_guardrail to mark math_homework_check as an input guardrail function. Guardrail functions receive the run context (RunContextWrapper), the agent, and the input (a string or a list of message items), and return a GuardrailFunctionOutput.
  • We check if "homework" is present, and if so, we return an output with tripwire_triggered=True. This signals the SDK to stop.
  • If the input is fine, we return tripwire_triggered=False; output_info can carry any extra data about the check (it is visible in traces).
  • We attach this guardrail to the agent via the input_guardrails parameter (output guardrails go in output_guardrails).
  • When running, if the guardrail triggers, an InputGuardrailTripwireTriggered exception will be raised (Guardrails - OpenAI Agents SDK). We catch it and print a message.

We could likewise implement an @output_guardrail to inspect math_agent’s output for certain content. For example, maybe ensure the solution doesn’t just give the final answer without steps (if we expect step-by-step). If an output guardrail’s tripwire triggers, an OutputGuardrailTripwireTriggered exception is raised (Guardrails - OpenAI Agents SDK).
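
The tripwire condition itself is just a predicate over the final text. Here is a hedged sketch of the step-by-step check described above, on plain strings (a real output guardrail would wrap this result in a GuardrailFunctionOutput; lacks_steps is a hypothetical helper):

```python
def lacks_steps(answer: str) -> bool:
    # Tripwire condition: a "step by step" solution should contain
    # more than one non-empty line, not just a bare final answer
    lines = [ln for ln in answer.splitlines() if ln.strip()]
    return len(lines) < 2

print(lacks_steps("42"))                          # bare answer: would trip
print(lacks_steps("Step 1: 40 + 2\nAnswer: 42"))  # worked steps: passes
```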

Guardrail functions have freedom to implement any logic. In advanced cases, you might even call another agent or an external service inside the guardrail function. In fact, the SDK documentation shows an example of running a guardrail agent inside the guardrail function for complex checks (like using an LLM to analyze whether the content violates a rule) (Guardrails - OpenAI Agents SDK). That example wraps an agent that checks if the user is asking the AI to do math homework, returning a structured result indicating yes/no (Guardrails - OpenAI Agents SDK).

Using guardrails in practice: You add guardrails to an agent when you construct it (or potentially globally via run_config, but typically on the agent). The guardrail functions can raise exceptions as shown, which you should handle. For input guardrails, you might handle it by not calling the agent at all if it fails, and instead respond to the user (e.g., “I’m sorry, I cannot assist with that request.”). For output guardrails, since the agent already finished by the time it triggers, you might log it and modify the answer or withhold it.

The guardrail system is flexible but requires careful setup and testing. The documentation notes that it can be “cumbersome to set up but very powerful if done well” (OpenAI Agents SDK — Getting Started). Make sure to test scenarios to avoid false positives or negatives in your guardrails.

Tripwires vs normal results: A guardrail function returns a GuardrailFunctionOutput, which has two fields: output_info (any data you want to record about the check, such as the analysis behind the decision – it shows up in traces) and tripwire_triggered (a bool). The tripwire is the flag that indicates a blocking condition. Guardrails that pass simply return tripwire_triggered=False; they do not rewrite the agent’s answer, so if you want to post-process output (say, fix its JSON format), do that in your application code after the run rather than inside a guardrail. Most often, guardrails are used to catch violations and stop execution.

Now that we have covered guardrails, let’s discuss how to debug and monitor these agent systems effectively using the SDK’s tracing and logging facilities.

Tracing and Debugging

Building agent applications involves a lot of prompt design, trial and error, and observation of agent behavior. The OpenAI Agents SDK comes with tracing built-in, which can record the flow of an agent’s execution (LLM calls, tool invocations, handoffs, etc.) in detail. This trace data can be viewed in OpenAI’s developer platform if you have it enabled, or you can process it yourself.

By default, tracing is enabled as long as you have set your OpenAI API key (Configuring the SDK - OpenAI Agents SDK). The trace data is sent to OpenAI’s tracing system. Each run (via Runner.run or run_sync) will create a trace with various spans (segments) for actions like the LLM thinking or a tool being called.

Using traces:

  • The simplest way to use tracing is just to allow it to run by default. When you execute your agent, behind the scenes it’s logging events. If you go to OpenAI’s dashboard (if available for the Agents/Tracing feature) you can see a timeline of the run, including all intermediate steps. This is extremely useful for understanding why an agent did something.
  • To group traces or give them names, use the trace() context manager as shown earlier. For example, with trace(workflow_name="OrderSupport", group_id=session_id): ... around a block of runs will label those traces (Running agents - OpenAI Agents SDK). workflow_name is like a category, and group_id can tie multiple runs together (e.g., a conversation thread). By default, each run might just be labeled by agent name and a timestamp; setting a workflow name helps when you have many different types of runs.
  • You can include trace_metadata (a dict) in run_config if you want to attach custom info to the trace (maybe user IDs or other context, ensuring not to include sensitive info unless you’ve enabled it).

Disabling or filtering traces:

  • If you’re in development and don’t want to hit the tracing API, you can disable tracing globally with set_tracing_disabled(True) (Configuring the SDK - OpenAI Agents SDK). This is useful for automated tests or if you temporarily don’t need tracing.
  • By default, traces do not include sensitive data like the exact content of inputs/outputs, to avoid accidentally logging PII. If you need full data in traces for debugging, you can set trace_include_sensitive_data=True in run_config or configure the SDK accordingly (Running agents - OpenAI Agents SDK). Be cautious and ensure compliance with any data handling policies if you do this.

Logging:

  • The SDK uses Python’s logging module. It has two loggers: openai.agents and openai.agents.tracing (the latter for trace-related logs) (Configuring the SDK - OpenAI Agents SDK). By default, only warnings and errors are shown on stdout.
  • You can enable more verbose logging by calling enable_verbose_stdout_logging() which sets up DEBUG level logging to stdout (Configuring the SDK - OpenAI Agents SDK). This will print out more information during execution (like decisions made, API call info, etc.) which can help debug issues locally without relying on external trace viewer.
  • You can also attach your own logging handlers if you want to log to a file or integrate with other monitoring.
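
Attaching your own handler is plain standard-library logging; only the logger name ("openai.agents") comes from the SDK docs:

```python
import logging

# Configure the SDK's main logger (name per the configuration docs)
logger = logging.getLogger("openai.agents")
logger.setLevel(logging.DEBUG)

# Send records to the console; swap for FileHandler("agents.log") to log to a file
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```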

In practice, you might use tracing for high-level insight and logging for low-level debugging. For example, when developing a new agent, run it with verbose logging to see immediate feedback in your console. Once it’s working, rely on the trace dashboards to monitor it in production or during integration tests.

Example of reading a trace: If your agent was using tools and handoffs, a trace might show:

  • Span 1: Agent “TriageAgent” received user input.
  • Span 2: TriageAgent decided to call tool “transfer_to_BillingAgent” (a handoff).
  • Span 3: BillingAgent started, got input X.
  • Span 4: BillingAgent called tool “lookup_invoice” with payload Y.
  • Span 5: BillingAgent final output produced.
  • Span 6: Back to TriageAgent (if any post-handoff steps) or end of run.

You can visually see this tree in the OpenAI trace UI, which helps ensure that the flow is correct. It also helps to pinpoint where errors occur (e.g., if a tool failed or the model returned invalid JSON, etc.).

Note: The tracing integration requires an OpenAI API key and likely logs data to OpenAI’s servers. If you prefer not to send trace data out, you can keep it disabled and rely on local logging. Or you could implement custom tracing by hooking into the lifecycle events (the SDK’s design likely allows custom Processor for traces, but that’s beyond the scope here).

By leveraging tracing and logging, you’ll gain confidence in what your agents are doing internally, making it easier to iterate on prompts, adjust tool definitions, or catch bugs in your handoff logic.

Advanced Tips and Best Practices

As you build more sophisticated agent systems with the OpenAI Agents SDK, here are some advanced tips and best practices to keep in mind:

  • Invest in Prompting and Instructions: Clear and detailed instructions (system prompts) for your agents are crucial. Enumerate what tools are available and how and when to use them. For example: “You have a tool WebSearchTool for web queries – use it when the user asks for information not in your knowledge.” Also specify any constraints (e.g., “Always answer in JSON format” or “Never reveal the calculation, just the result”). Well-crafted instructions make the agent’s reasoning more reliable (Orchestrating multiple agents - OpenAI Agents SDK).

  • Limit Agent Autonomy When Needed: While it’s tempting to have a single agent handle everything autonomously, in practice it’s often better to compose multiple specialized agents (Orchestrating multiple agents - OpenAI Agents SDK). Each agent can be an expert at one thing (one domain or one phase of a task). This specialization can be achieved via handoffs or orchestrating in code. It usually leads to better performance and easier debugging than one monolithic agent trying to do it all.

  • Use Guardrails for Safety and Quality: Don’t skip implementing guardrails for user input validation and output checking, especially if your app will face untrusted inputs or needs to meet certain output criteria. Even a simple profanity filter or format checker can catch issues early. Combine guardrails with test cases to ensure your agent doesn’t produce unwanted results.

  • Monitor and Iterate: Deploy with tracing or logging on, and watch how real inputs are handled. Look at where the agent might get confused or take too many steps. You might find that the agent isn’t using a tool when it should, or using one incorrectly. This is a prompt (instructions) problem – iterate on those. Or you might find a tool is too slow or error-prone – maybe refine the tool’s implementation or switch to a different approach. Continuous evaluation (potentially using OpenAI’s Evals or your own metrics) will help improve your agents (Orchestrating multiple agents - OpenAI Agents SDK).

  • Structured Outputs and Chaining: If you plan to chain agents in code, having one agent produce a structured output (like a JSON with specific fields) can make it easier to feed into the next. The SDK’s support for output_type and JSON schemas can help here. You can enforce an agent’s answer to be a JSON by setting an appropriate Pydantic model as output_type, which will make the LLM output parse into that. This is useful if, for instance, you want an agent to output an action plan that your code will parse and execute.

  • Async for Performance: The SDK’s Runner.run is async, meaning you can use asyncio to run things in parallel. For example, if you have independent questions to answer or you want to race two different agents to see who finishes first, you can await asyncio.gather(Runner.run(agent1, q1), Runner.run(agent2, q2)) etc. Similarly, the model may request multiple tool calls in one turn (though typically the agent invokes tools one at a time in its reasoning process). Also consider Runner.run_streamed, which streams partial output as it is generated, for a better UX when producing long responses (Running agents - OpenAI Agents SDK).

  • External Tools & Environments: The SDK is extensible. If you need integration with things like web browsers or complex environments, consider writing those as tools or even extending the SDK. The voice extension, for instance, adds speech-to-text and text-to-speech as part of an agent workflow, which shows the SDK can be expanded for multimodal interactions.

  • Model Choices: By default, you use OpenAI’s models. If you have needs for other models (Anthropic Claude, Azure OpenAI, local LLaMA, etc.), the LiteLLM integration allows that. Using LitellmModel with the appropriate model identifier and API key lets you swap in a different backend model without changing your agent logic (Using any model via LiteLLM - OpenAI Agents SDK). Keep in mind different models have different capabilities in function calling, context length, etc. Always test your agents on the target model; you might need to adjust instructions or tools if switching from, say, GPT-4 to an open-source model due to differences in understanding or output style.

  • Resource Limits: If your agent uses many tools or runs for many turns, keep an eye on token usage and latency. Use the max_turns parameter to prevent infinite loops if a prompt goes wrong (Running agents - OpenAI Agents SDK). If you anticipate long conversations, implement a strategy to summarize or truncate history to avoid hitting context length limits. The Agents SDK won’t automatically clear history unless you explicitly filter it or end the run.

  • Error Handling: Be prepared to handle exceptions from the SDK, such as MaxTurnsExceeded (if the agent didn’t conclude within the limit) (Running agents - OpenAI Agents SDK) or ModelBehaviorError (if the model output was not understood, e.g., it produced malformed JSON for a tool call) (Running agents - OpenAI Agents SDK). These errors can be caught and maybe used to retry or fallback to a simpler approach. For instance, if a complex agent fails, you might have a very basic QA agent as a backup to at least produce some answer.
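
The asyncio tip above can be sketched without the SDK; fake_run is a stand-in for awaiting Runner.run on a real agent:

```python
import asyncio

async def fake_run(agent_name: str, query: str) -> str:
    # Stand-in for `await Runner.run(agent, query)`; swap in real SDK calls
    await asyncio.sleep(0.01)
    return f"[{agent_name}] answer to: {query}"

async def main() -> list[str]:
    # Independent questions are answered concurrently
    return await asyncio.gather(
        fake_run("GeneralAgent", "What is the capital of France?"),
        fake_run("MathAgent", "What is 17 * 3?"),
    )

print(asyncio.run(main()))
```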

Now, to solidify understanding, let’s walk through a full example project that demonstrates many of these features working together.

Example Project: AI Task Assistant with Tools and Delegation

Imagine we want to build an AI Task Assistant that can do a few things:

  • Answer general knowledge questions (using web search if needed).
  • Solve math problems (using a calculator tool).
  • Handle customer support inquiries by either answering them or escalating to a human agent (we’ll simulate the human escalation with a specialized agent).

We will design this with multiple agents and tools:

  • A Calculator Tool (function tool) for arithmetic.
  • A Web Search Tool (hosted tool) for general knowledge.
  • A GeneralAgent that mainly answers questions using the above tools.
  • A SupportAgent that can answer support FAQs but will hand off to an EscalationAgent if it detects the user is asking for escalation or if it cannot handle the request.
  • A top-level TaskAgent that decides whether the user’s query is a general knowledge/math question or a support issue, and then hands off to either GeneralAgent or SupportAgent accordingly.

We’ll include guardrails to prevent solving explicit homework questions and to ensure the final answer isn’t too long (just as an example of output guardrail).

import math
from pydantic import BaseModel
from agents import (
    Agent,
    Runner,
    function_tool,
    input_guardrail,
    output_guardrail,
    GuardrailFunctionOutput,
    WebSearchTool,
    handoff,
)

# 1. Define tools

@function_tool
def calculator(expression: str) -> str:
    """Evaluate a math expression and return the result."""
    # VERY basic safe eval for arithmetic
    try:
        result = eval(expression, {"__builtins__": None}, {"sqrt": math.sqrt})
        return str(result)
    except Exception:
        return "Error: invalid expression."

# 2. Define guardrails

@input_guardrail
def no_homework(ctx, agent, input: str) -> GuardrailFunctionOutput:
    # If the user explicitly asks for homework help, trip the guardrail
    if "homework" in input.lower():
        return GuardrailFunctionOutput(output_info="Homework detected", tripwire_triggered=True)
    return GuardrailFunctionOutput(output_info=None, tripwire_triggered=False)

class AnswerLengthCheck(BaseModel):
    too_long: bool

@output_guardrail
async def length_guardrail(ctx, agent, output: str) -> GuardrailFunctionOutput:
    # Flag the answer as too long if it exceeds 100 words
    too_long = len(output.split()) > 100
    return GuardrailFunctionOutput(
        output_info=AnswerLengthCheck(too_long=too_long),
        tripwire_triggered=too_long,
    )

# 3. Define specialized agents

# General purpose agent with web search and calculator
general_agent = Agent(
    name="GeneralAgent",
    instructions=(
        "You are a general assistant. You can answer questions. "
        "Use the calculator tool for math, and the web search tool for factual questions if needed."
    ),
    tools=[calculator, WebSearchTool()],
    input_guardrails=[no_homework]  # block homework requests
)

# Customer support agent with some predefined knowledge (for simplicity, we'll just put in instructions)
support_agent = Agent(
    name="SupportAgent",
    instructions=(
        "You are a customer support AI. You can answer questions about orders and account issues. "
        "If the user is very unsatisfied or asks for a human, escalate the issue."
    ),
    # No tools or guardrails here; SupportAgent relies on its prompt knowledge
    # (input_guardrails could be added as needed)
)

# Escalation agent (simulating a human escalation or higher-level support)
escalation_agent = Agent(
    name="EscalationAgent",
    instructions="You are a human customer support manager. Provide a courteous response and assure the user that their issue will be handled.",
)

# 4. Set up handoffs
# SupportAgent can hand off to EscalationAgent
support_agent.handoffs = [escalation_agent]  # using default handoff (transfer_to_escalation_agent)

# Top-level TaskAgent decides between general vs support
task_agent = Agent(
    name="TaskAgent",
    instructions=(
        "Determine if the user needs general assistance or support:\n"
        "- If it's a factual or math question, use the GeneralAssistant tool.\n"
        "- If it's about account/orders or a complaint, use the SupportAssistant tool."
    ),
    handoffs=[
        handoff(agent=general_agent, tool_name_override="GeneralAssistant", tool_description_override="Handle general questions or math"),
        handoff(agent=support_agent, tool_name_override="SupportAssistant", tool_description_override="Handle customer support issues")
    ],
    output_guardrails=[length_guardrail]  # check the final answer's length
)

In this setup:

  • We created a calculator function tool for math.
  • We used WebSearchTool for factual lookup (OpenAI hosted tool).
  • GeneralAgent can use both tools.
  • SupportAgent has no tools (would rely on its prompt knowledge).
  • EscalationAgent is a simple agent for escalations.
  • We added a handoff from SupportAgent to EscalationAgent: a transfer_to_escalation_agent tool is exposed to SupportAgent’s LLM, which it can call to escalate.
  • We created a TaskAgent that can delegate to either GeneralAgent or SupportAgent using handoffs (renamed to GeneralAssistant/SupportAssistant for clarity in the prompt).
  • TaskAgent’s final answer is checked by the length_guardrail output guardrail, which returns structured data (AnswerLengthCheck) from the check.
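One caveat on the calculator tool: even a restricted `eval` is risky if an LLM controls the input. A safer pattern is an AST-walking evaluator that only permits arithmetic nodes. The sketch below is a hypothetical replacement (`safe_eval` is not part of the SDK); supporting functions like `sqrt` would require an explicit whitelist on top of it.

```python
import ast
import operator

# Permitted arithmetic operators
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expression: str) -> float:
    """Evaluate arithmetic only; reject names, calls, and attributes."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("Disallowed expression")
    return walk(ast.parse(expression, mode="eval"))

print(safe_eval("5 + 7 * 2"))  # 19
```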

Now, running the TaskAgent on different inputs:

# 5. Using the top-level agent
queries = [
    "What is 5 + 7 * 2?",                        # math question -> should use calculator via GeneralAgent
    "Who won the World Cup in 2018?",            # factual -> might use web search via GeneralAgent
    "I have an issue with my order delivery.",   # support issue -> should hand off to SupportAgent
    "This is unacceptable, I want a refund!",    # possibly escalate -> SupportAgent may hand off to EscalationAgent
    "Can you do my math homework for me?"        # should trigger homework guardrail and not answer
]

for q in queries:
    try:
        result = Runner.run_sync(task_agent, q)
        print(f"Q: {q}\nA: {result.final_output}\n{'-'*40}")
    except Exception as e:
        # Guardrail tripwires (e.g. InputGuardrailTripwireTriggered) and other
        # SDK errors surface here as exceptions from run_sync
        print(f"Q: {q}\n[Guardrail/Error]: {e}\n{'-'*40}")

What we expect:

  • For the math query, TaskAgent should recognize it’s a general query, hand off to GeneralAgent. GeneralAgent’s LLM sees a math expression, likely calls the calculator tool. The calculator returns “19” and the agent returns that.
  • For the World Cup query, GeneralAgent might call WebSearchTool to find the answer (France won in 2018) and return it.
  • For the order issue, TaskAgent should hand off to SupportAgent. SupportAgent might just answer from its prompt (like “I’m sorry about your order issue, I can help with that…”).
  • For the angry refund demand, SupportAgent may detect the sentiment and do a handoff to EscalationAgent. Then EscalationAgent will respond with a managerial tone.
  • For the homework request, the no_homework input guardrail on GeneralAgent should trip. When TaskAgent delegates, GeneralAgent raises InputGuardrailTripwireTriggered, which propagates as an exception from Runner.run_sync. We catch it and print a message indicating a guardrail was triggered (meaning we refused the request).

This example ties together tools, handoffs, guardrails, and multi-agent orchestration. In a real deployment, you’d refine each agent’s prompt, maybe add more tools (like retrieval for SupportAgent with company FAQs), and thoroughly test each pathway. But it demonstrates the SDK’s capabilities for mixing and matching components to build a custom agent solution.

Conclusion

In this tutorial, we covered the OpenAI Agents Python SDK in depth: from basic agent creation to advanced multi-agent orchestration. You learned how to equip agents with tools (both built-in and custom), manage contextual information and state, delegate tasks between agents with handoffs, enforce safety and validity with guardrails, and utilize tracing for debugging. By structuring your AI system into modular agents and tools, you can tackle complex tasks while maintaining clarity and control over each part of the process.

The Agents SDK is designed to be flexible yet developer-friendly – it leverages familiar Python constructs (functions, classes, context managers) to define AI behavior, rather than requiring you to learn a new DSL or framework. This means you can incrementally build up an agent system, test it like regular Python code, and integrate it with other software components or APIs.

As best practices, remember to iteratively refine your agents’ prompts and capabilities, monitor them in real usage, and apply guardrails and evaluations to keep them on track (Orchestrating multiple agents - OpenAI Agents SDK) (Orchestrating multiple agents - OpenAI Agents SDK). The landscape of LLM-based agents is evolving rapidly, and this SDK provides a solid foundation to experiment with advanced AI behaviors such as tool use, self-reflection, and multi-agent cooperation.

We encourage you to use this tutorial as a starting point. Try creating your own custom tools and agents for problems you care about – whether it’s an AI that can browse documentation and answer coding questions, or a multi-agent system that automates business workflows. With the OpenAI Agents SDK, much of the heavy lifting (LLM integration, message handling, function calling) is done for you, so you can focus on designing the intelligence and interactions of your agents.


Enjoyed this post? Found it insightful? Feel free to leave a comment below to share your thoughts or ask questions. A GitHub account is required to join the discussion.

Keep reading

Related posts