Right now, we are moving from models that excel at specific tasks to agents that can handle more complex workflows. When you prompt a model, you get only its trained knowledge; give it a computer environment and it can do much more, like run services, request data from APIs, and create useful artifacts such as spreadsheets and reports.
When building agents, some practical problems come up. For example:
- Deciding where to store intermediate files.
- Avoiding pasting large tables into prompts.
- Giving workflows network access without creating security risks.
- Handling timeouts and retries without building your own workflow system.
To address these agent-specific challenges, we built the components needed to give the Responses API a computer environment. This enables reliable handling of real-world tasks and frees developers from creating their own execution setups, setting the stage for tackling the broader practical problems of agent development.
OpenAI’s shell tool and hosted container workspace address these challenges. The model suggests steps and commands that run in a separate environment with its own filesystem, optional storage (e.g., SQLite), and limited network access.
With this foundation in place, let’s explore how we build a computer environment for agents and discuss early lessons from using it to accelerate, standardize, and improve safety in production workflows.
The Shell Tool
A good agent workload needs a tight execution loop:
- The model suggests an action.
- The platform executes it.
- The result informs the next step.
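The loop above can be sketched in a few lines; `toy_propose` and `toy_execute` are stand-ins for the model and the execution platform, not real API calls:

```python
def run_agent_loop(propose, execute, task, max_steps=10):
    """Minimal agent loop: the model proposes an action, the platform
    executes it, and the result informs the next proposal."""
    context = [task]
    for _ in range(max_steps):
        action = propose(context)            # model suggests an action
        if action["type"] == "final":        # no more tool calls: done
            return action["text"]
        context.append(execute(action))      # result feeds the next turn
    return None

# Toy stand-ins: the "model" asks for one shell command, then finishes.
def toy_propose(context):
    if any(item.startswith("output:") for item in context):
        return {"type": "final", "text": "done"}
    return {"type": "shell", "command": "echo hi"}

def toy_execute(action):
    return "output: hi"
```

Real deployments replace `propose` with a model call and `execute` with the container runtime; the control flow stays the same.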
We’ll start with the shell tool to illustrate this loop, then discuss the container workspace, networking, reusable skills, and context compaction.
To understand the shell tool, it helps to know how a model uses tools: it proposes tool calls after seeing step-by-step examples during training, but it cannot execute the calls itself.
The shell tool gives the model command-line access to perform tasks like text search or API requests using familiar Unix utilities such as grep, curl, and awk.
Unlike our current code interpreter, which runs only Python, the shell tool supports a much broader range of use cases: you can run Go or Java programs or start a Node.js server. This flexibility enables the model to handle more complex tasks.
Orchestrating The Agent Loop
On its own, a model can only propose shell commands, but how are these commands executed? We need an orchestrator to retrieve model output, invoke tools, and return the tools’ response to the model in a loop until the task is complete.
The Responses API is how developers interact with OpenAI models. When used with custom tools, it returns control to the client, who must provide their own harness to run the tools. With hosted tools, however, the API can orchestrate between the model and the tools out of the box.
When the Responses API receives a prompt, it assembles the model context: user prompt, prior dialog state, and tool instructions. For shell execution to work, the prompt must mention using the shell tool, and the selected model must be trained to propose shell commands. GPT-5.2 and later models are trained to do so given this context.
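As a rough sketch, the request side of this might be assembled like so; the tool type `"shell"`, the `container` field, and the payload shape are illustrative assumptions, not confirmed parameter names:

```python
# Hypothetical request payload for the Responses API with a shell tool
# enabled. Field names below are assumptions for illustration only.
def build_shell_request(prompt, container_id):
    return {
        "model": "gpt-5.2",
        "tools": [{"type": "shell", "container": container_id}],
        "input": [{
            "role": "user",
            "content": f"{prompt} Use the shell tool as needed.",
        }],
    }

req = build_shell_request("List the CSV files in /data.", "cntr_123")
```

The key points are that the request names the tool and the prompt invites the model to use it; the exact schema is defined by the API reference.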
The model then decides the next action. If it chooses shell execution, it returns one or more shell commands to the Responses API service. The API service forwards those commands to the container runtime, streams the shell output back, and feeds it to the model in the next request’s context. The model can inspect the results, issue follow-up commands, or produce a final answer. The Responses API repeats this loop until the model returns a completion without additional shell commands.
When the Responses API runs a shell command, it keeps a streaming connection to the container service open. As output appears, the API sends it to the model almost immediately. This lets the model decide whether to wait for more output, run another command, or issue a final response.
The model can suggest several shell commands at once. The Responses API can run these commands concurrently in separate container sessions. Each session streams its output separately, and the API combines the streams into structured tool outputs for the context. This allows the agent loop to run tasks such as searching files, fetching data, and checking results in parallel.
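A minimal sketch of this fan-out, with a stubbed `run_session` standing in for a real container session:

```python
from concurrent.futures import ThreadPoolExecutor

def run_session(command):
    """Stand-in for one container session executing a command and
    streaming its output back; here it just echoes the command."""
    return {"command": command, "output": f"ran: {command}"}

def run_concurrently(commands):
    """Run several proposed commands in separate sessions in parallel,
    then combine the streams into structured tool outputs."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_session, commands))

results = run_concurrently(
    ["grep -r TODO .", "curl https://api.example.com", "ls out/"]
)
```

`pool.map` preserves input order, so the combined tool outputs line up with the commands the model proposed.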
Commands that handle files or process data may generate lots of shell output, which can fill the context window without adding much value. To manage this, the model sets an output limit for each command. The Responses API enforces the limit and returns a result that keeps both the start and end of the output, marking skipped content. For example, you might set a 1,000-character limit, keeping the beginning and end.
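The truncation behavior can be illustrated with a small helper (a sketch of the idea, not the API's actual implementation):

```python
def truncate_output(text, limit=1000):
    """Keep the start and end of long shell output and mark the
    skipped middle, so the context window stays compact."""
    if len(text) <= limit:
        return text
    half = limit // 2
    skipped = len(text) - 2 * half
    return text[:half] + f"\n...[{skipped} chars skipped]...\n" + text[-half:]
```

Keeping both ends matters: the head usually carries headers or command banners, while the tail carries exit status and final results.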
By combining concurrent execution and output limits, the agent loop maintains speed and context efficiency. The agent loop controls which tool outputs are included in the context, helping the model focus on important results rather than being overwhelmed by raw terminal logs.
When The Context Window Gets Full: Compaction
A challenge with agent loops is that some tasks run for a long time. Each turn adds tool calls, responses, and summaries to the context, so the limited context window can fill up quickly. To keep important details while removing extraneous information, we built native compaction into the Responses API. Developers don’t need to create their own summarization or state systems, and the feature matches how the models were trained.
Our latest models are trained to review prior dialog states and generate a compaction item that stores key information in an encrypted, token-efficient format. After compaction, the context window includes this compaction item and the most important parts of the earlier window. This makes workflow progress smooth across window boundaries, even in long, multi-step, or tool-driven sessions. Codex uses this system to handle long programming tasks and repeated tool use without losing quality.
You can use compaction either as a built-in server feature or through a separate /compact endpoint. With server-side compaction, you can set a threshold, and the system takes care of compaction timing for you, so you don’t need complex client logic. This setup allows a slightly larger input context window, so small overages just before compaction are handled rather than rejected. As models improve, the native compaction feature updates with every OpenAI model release.
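The real compaction is performed server-side by the model itself, but the threshold idea can be illustrated with a toy client-side analogue; all names here (`maybe_compact`, `summarize`, `count_tokens`) are invented for the sketch:

```python
def maybe_compact(items, summarize, count_tokens, max_tokens):
    """If the running context exceeds the token budget, replace older
    items with a single compact summary item and keep recent turns."""
    total = sum(count_tokens(item) for item in items)
    if total <= max_tokens:
        return items
    keep = items[-2:]                         # most recent turns survive
    summary = {"type": "compaction", "summary": summarize(items[:-2])}
    return [summary] + keep

history = ["turn " + "x" * 100 for _ in range(5)]
compacted = maybe_compact(
    history,
    summarize=lambda old: f"{len(old)} earlier turns",
    count_tokens=len,                         # crude stand-in for tokens
    max_tokens=300,
)
```

The native feature replaces the lambda with a model-generated, token-efficient compaction item and handles the timing for you.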
Codex played a key role in building the compaction system and was one of the first to use it. If one Codex instance hit a compaction error, we started another instance to investigate. Working through the problem this way helped us build a robust compaction system. Codex’s ability to examine and improve itself has become unique to OpenAI: while most tools simply require users to learn them, Codex learns with us.
Container Context
Now let’s talk about state and resources. The container is more than merely a place to run commands; it’s also the model’s working environment. Inside the container, the model can read files, query databases, and reach external systems, all under network policy controls.
File Systems
The first part of the container context is the file system, which is used to upload, organize, and manage resources. We created container and file APIs to give the model a clear view of available data and help it select specific file operations rather than run broad exploratory scans.
By default, all inputs go directly into the prompt context. As inputs grow, filling the prompt becomes more expensive and harder for the model to navigate. A better approach is to stage resources in the container’s file system and let the model decide which to open, parse, or run via shell commands, much as humans do. Models work better with organized information.
Databases
The second part of the container context is databases. We recommend storing structured data in SQLite and querying it directly rather than copying a spreadsheet into the prompt. Describe the tables and columns, and explain their meanings so the model can pull only the needed rows.
For example, if you ask which products had declining sales this quarter, the model can look up only the relevant rows rather than search the entire spreadsheet. This approach is faster, cheaper, and better suited to large data sets.
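For instance, the declining-sales question becomes a single query once the data lives in SQLite (a self-contained sketch with made-up numbers):

```python
import sqlite3

# Stage structured data in SQLite so the model can pull only the rows
# it needs instead of reading a whole spreadsheet into the prompt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", "Q1", 120.0), ("widget", "Q2", 90.0),
     ("gadget", "Q1", 80.0), ("gadget", "Q2", 110.0)],
)

# "Which products had declining sales this quarter?"
declining = conn.execute("""
    SELECT cur.product
    FROM sales cur JOIN sales prev ON cur.product = prev.product
    WHERE cur.quarter = 'Q2' AND prev.quarter = 'Q1'
      AND cur.revenue < prev.revenue
""").fetchall()
```

Only the matching rows ever reach the model's context, regardless of how large the table grows.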
Network Access
The third part of the container context is network access, which is essential for agent workloads. Agents may need to fetch live data, call external APIs, or install packages. Giving containers full internet access is risky: it allows them to send information to outside sites, lets them reach sensitive systems, and makes leaks harder to prevent.
To solve these problems without limiting what agents can do, we set up hosted containers to use a central egress policy proxy. All outgoing network requests go through a central policy layer that enforces allow lists and access controls and keeps traffic visible. For credentials, we use domain-scoped secret injection at egress: the model and container see only placeholders, while the real secret values remain hidden and are used only for approved destinations. This reduces the risk of leaks while still allowing secure external calls.
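A toy sketch of the two ideas, an allow list at egress plus placeholder substitution for secrets; the hostnames and placeholder syntax are invented for illustration:

```python
# Illustrative egress policy layer: outgoing requests pass a host
# allow-list check, and secret placeholders are swapped for real
# values only for approved destinations.
ALLOWED_HOSTS = {"api.example.com"}
SECRETS = {"{{API_KEY}}": "real-secret-value"}  # never seen in-container

def egress(host, headers):
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"egress to {host} blocked by policy")
    # Inject real secrets only at the proxy, after the policy check.
    return {k: SECRETS.get(v, v) for k, v in headers.items()}
```

Because substitution happens after the policy check, a prompt-injected request to an unapproved host fails before any secret is attached.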
Agent Skills
Shell commands are powerful, but many tasks follow similar multi-step patterns, and agents often must replan and relearn them, leading to inconsistent results. Agent skills package these patterns into reusable building blocks. A skill is a folder with a SKILL.md file and resources such as API specs and UI assets.
This structure maps naturally to the runtime architecture we described earlier. The container provides persistent files and an execution context, and the shell tool provides the execution interface. With both in place, the model can discover skill files using shell commands (ls, cat, etc.) when needed, interpret their instructions, and run skill scripts within the same agent loop.
We provide an API to manage skills on the OpenAI platform. Developers upload and store skill folders as versioned bundles, which can later be retrieved by skill ID before sending the prompt to the model. The Responses API loads the skill and includes it in the model context. The sequence is deterministic:
- Fetch skill metadata, including name and description.
- Fetch the skill bundle, copy it into the container, and unpack it.
- Update model context with skill metadata and the container path.
When deciding if a skill is relevant, the model reviews its instructions step by step and runs its scripts using shell commands in the container.
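The three-step sequence above can be sketched as follows; `load_skill` and the bundle format are hypothetical helpers, not the platform's actual API:

```python
import pathlib
import tempfile

def load_skill(metadata, files, container_root):
    """Unpack a skill bundle into the container filesystem and return
    the context update: skill metadata plus the container path."""
    skill_dir = pathlib.Path(container_root) / metadata["name"]
    skill_dir.mkdir(parents=True, exist_ok=True)
    for rel_path, content in files.items():       # step 2: unpack bundle
        (skill_dir / rel_path).write_text(content)
    return {"skill": metadata, "path": str(skill_dir)}  # step 3: context

ctx = load_skill(
    {"name": "report-builder", "description": "Builds weekly reports"},
    {"SKILL.md": "# Steps\n1. Fetch data\n2. Render the report\n"},
    tempfile.mkdtemp(),
)
```

After this, the model can `cat` the SKILL.md at the returned path and follow its instructions inside the same agent loop.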
How Agents are Made
To put all the pieces together: the Responses API handles orchestration; the shell tool runs actions; the container provides a persistent runtime context; skills add reusable workflow logic; and compaction lets an agent run for a long time with the context needed for an end-to-end workflow.
An agent can discover the right skill, fetch data, transform it into local structured state, query it efficiently, and generate durable artifacts.
Make Your Own Agent
For a step-by-step example using the shell tool and computer environment, see our developer blog post and cookbook. These resources show how to package and run a skill with the Responses API.
We’re eager to see what developers build. Language models go beyond creating text, images, and audio. We’ll continue to enhance our platform for complex real-world tasks at scale.
Source: From model to agent: Equipping the Responses API with a computer environment
Tags:
OpenAI
GPT-5.4
OpenAI API
AI Agents
Generative AI