Right now, we are shifting from using AI systems that are strong at specific, isolated tasks to using agents that can autonomously manage and coordinate more advanced workflows. When you prompt a model, you only engage its pattern-matching intelligence on single-step tasks. But if you give a model the environment and autonomy of an agent, it can perform far more complex operations, such as running services, requesting data from APIs, or generating useful outputs, such as spreadsheets or reports.  

When building agent systems that orchestrate models for complex automated workflows, several technical hurdles arise. For instance, you must choose intermediate storage, avoid embedding large data structures in prompts, enable workflow-level network access while maintaining robust security, and implement timeout and retry mechanisms without developing a custom workflow engine.  

To streamline developer adoption, we engineered core infrastructure that gives the Responses API a controlled computational environment, enabling robust real-world task automation without requiring developers to manage execution contexts.  

OpenAI’s Responses API, together with the shell interface and hosted containerized workspace, addresses these engineering constraints. The model generates stepwise actions and shell commands executed within an isolated environment featuring its own virtual file system, optional transactional storage via SQLite, and restricted outbound network connectivity.  

In this post, we’ll explain how we built a computer environment for agents and share some early lessons on using it to speed up and standardize production workflows. The model suggests an action, such as reading files or fetching data with an API; the platform runs it, and the result is used in the next step. To illustrate this in practice, we’ll start by looking at the shell tool, then move on to the container workspace, networking, reusable skills, and context compaction.  

To understand the shell tool, it helps to know how language models and agents interact with tools differently. A model uses tools by suggesting actions such as calling functions or issuing computer commands. During training, the model observes examples of tool use and their effects, learning when and how to suggest a tool. However, a model operating alone can only suggest using a tool; it can’t execute it. An agent, in contrast, provides the environment and orchestration needed to execute the model’s suggestions and complete real tasks.  

The Shell tool significantly extends the model’s capabilities. It provides direct access to a system’s command-line interface, enabling operations such as text searches or API communication. Built atop standard Unix utilities, it provides immediate access to commands such as grep, curl, and awk.  

Compared to our existing code interpreter, which only executes Python, the shell tool enables a much wider range of use cases, such as running Go or Java programs or starting a Node.js server. Such flexibility enables the model to perform complex agentic tasks.  

Orchestrating the Agent Loop 

By itself, a model can only suggest shell commands. But how do these commands actually run? We need an orchestrator to take the model’s output, run the tools, and send the results back to the model in a loop until the task is complete.  

Developers use the Responses API to interact with OpenAI models. When you use custom tools, the Responses API gives control back to the client, so you need your own setup to run the tools. But the API can also automatically manage the connection between the model and hosted tools.  

When the Responses API receives a prompt, it assembles the model’s context, including the user prompt, prior dialogue, and tool instructions. For shell execution, the prompt must specify the shell tool, and the model must be trained to suggest shell commands. This applies to GPT 5.2 and later. With this context, the model decides what to do next. If it picks shell execution, it sends one or more shell commands to the Responses API. The API sends these commands to the container runtime, streams the shell output back, and includes it in the next request’s context. The model can then review the results, send more commands, or give a final answer. The loop continues until the model finishes without any more shell commands.  
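The loop described above can be sketched in a few lines. This is a simplified client-side version: `model_step` is a hypothetical stand-in for a model call, and the hosted Responses API performs the equivalent orchestration server-side.

```python
import subprocess

def run_shell(command, timeout=30):
    """Execute one suggested shell command and capture its output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def agent_loop(model_step, prompt, max_turns=10):
    """Alternate between model suggestions and tool execution.

    `model_step` takes the running context and returns either
    {"commands": [...]} (shell commands to run) or {"final": "..."}.
    The loop ends when the model answers without more commands.
    """
    context = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        action = model_step(context)
        if "final" in action:
            return action["final"]
        for cmd in action["commands"]:
            output = run_shell(cmd)
            context.append({"role": "tool", "content": output})
    raise RuntimeError("agent did not finish within max_turns")
```

The key property is that the model never executes anything itself; the orchestrator runs each command and feeds the result back as context for the next step.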

When the Responses API initiates shell command execution, it maintains a live stream from the container service. As output becomes available, the API transmits it to the model in near real time, enabling the model to decide whether to wait for further output, initiate a new command, or compose a final output.  

The model can suggest several shell commands at once, and the Responses API runs them simultaneously in separate container sessions. Each session streams its output independently, and the API combines these streams into structured tool outputs for context. This means the agent loop can run tasks such as file searches, data fetches, and result checks in parallel.  
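A minimal sketch of this fan-out, assuming each suggested command is independent: run them concurrently, then return structured outputs in the original order so they can be spliced back into context.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_command(command):
    """Run one command in its own session and capture its result."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return {
        "command": command,
        "output": result.stdout,
        "exit_code": result.returncode,
    }

def run_parallel(commands):
    """Execute suggested commands concurrently; preserve suggestion order."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_command, commands))
```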

To keep long terminal output manageable, the Responses API truncates it from the middle. For example, if you set a limit of 1,000 characters, the API will keep the first 500 and last 500 characters of the command output. The omitted section in the middle is clearly marked, so the model sees both the start and the end with the skipped content indicated.  
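The head-and-tail truncation just described is simple to express; this sketch keeps the first and last half of the budget and marks the gap:

```python
def truncate_middle(text, limit=1000, marker="\n...[output truncated]...\n"):
    """Keep the first and last half of the output, marking the omitted middle."""
    if len(text) <= limit:
        return text  # short output passes through unchanged
    half = limit // 2
    return text[:half] + marker + text[-half:]
```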

By running commands in parallel and limiting output size, the agent loop stays fast and uses context efficiently. This helps the model focus on important results instead of getting lost in long terminal logs.  

When the Context Window Gets Full: Compaction 

A challenge with agent loops is that some tasks can take a long time to finish. These long tasks fill up the context window, which is needed to keep track of information across steps and agents. For example, when an agent calls a skill, gets a response, adds tool calls, and writes summaries, the limited context window can fill up fast. To keep important details while removing extraneous information, we added built-in compaction to the Responses API. This means developers don’t have to build their own summarization or state systems. The compaction works with how the model is trained.  

Our latest models can review previous dialogue states and generate a compaction item that stores important information in an encrypted, token-efficient format. After compaction, the next context window includes this compaction item and the most valuable parts of the earlier window. This lets agents operate uninterruptedly across context windows, even during long multi-step or tool-driven sessions.  

Codex uses this system to handle long programming tasks and repeated tool use without losing quality.  

You can use compaction either as a built-in server feature or through a separate /compact endpoint. With server-side compaction, you set a threshold, and the system manages compaction timing for you, so you don’t need complex client logic. This setup allows a slightly larger input context window, so requests just over the limit can still be processed and compacted rather than rejected. As models improve, the compaction feature updates with each OpenAI release.  
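The threshold logic behind server-side compaction can be sketched as follows. The function names here are illustrative, and `compact` stands in for a call to a compaction endpoint; with server-side compaction, the platform applies this check for you.

```python
def maybe_compact(context_tokens, window_limit, threshold_ratio, compact):
    """Trigger compaction once usage crosses the configured threshold.

    `compact` is a placeholder for a compaction call; it returns the
    token count of the compacted context.
    """
    if context_tokens < threshold_ratio * window_limit:
        return context_tokens  # plenty of room; keep the full history
    return compact()  # replace old turns with a compaction item
```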

Codex played a key role in building the compaction system by being one of its first users. If a Codex instance encountered a compaction error, we started another instance to investigate the issue. This process helped Codex develop a strong built-in compaction system by solving real problems. Codex’s ability to review and improve itself is a unique part of working at OpenAI. Most tools only require users to learn them, but Codex learns with us.  

Container Context 

Now let’s cover the state and resources. The container is not only a place to run commands, but also the model’s working context. Inside the container, the model can read files, query databases, and access external systems under network policy controls.  

File Systems 

The first part of the container context is the file system. It is used to upload, organize, and manage resources. We created container and file APIs to give the model a clear view of available data. This helps it pick specific file operations rather than running broad scans.  

A common mistake is putting all input directly into the prompt context. As inputs get larger, the prompt becomes more expensive and harder for the model to use. It’s better to store resources in the container file system. Then the model can choose which files to open, read, or change using shell commands. Like most people, models work better when information is organized.  
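One way to follow this advice, sketched with a hypothetical `stage_inputs` helper: write each resource into the workspace and put only a short manifest in the prompt, so the model opens files selectively with commands like `cat` or `grep`.

```python
import os

def stage_inputs(workspace, inputs):
    """Write each resource to the workspace and return a short manifest.

    The manifest, not the file contents, goes into the prompt; the model
    then opens only the files it needs via shell commands.
    """
    os.makedirs(workspace, exist_ok=True)
    lines = []
    for name, content in inputs.items():
        path = os.path.join(workspace, name)
        with open(path, "w") as f:
            f.write(content)
        lines.append(f"{name} ({len(content)} bytes)")
    return "Available files:\n" + "\n".join(lines)
```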

Databases 

The second part of the container context is databases. We often recommend that developers store structured data in databases like SQLite and query them. Instead of putting a whole spreadsheet into the prompt, you can give the model a description of the tables, including the columns and their meanings, and let it pull only the rows it needs. To answer a question about this quarter’s sales, for example, the model can query only the relevant rows rather than scanning the entire spreadsheet. This is faster, cheaper, and more scalable for larger datasets.  
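A minimal sketch of this pattern with Python's built-in `sqlite3` module; the table and data are illustrative:

```python
import sqlite3

# Load structured rows once, instead of pasting the spreadsheet
# into the prompt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "Q1", 120.0), ("EMEA", "Q2", 95.0), ("APAC", "Q2", 140.0)],
)

# The prompt carries only the schema description; the model writes a
# narrow query like this to pull just the rows for the current quarter.
rows = conn.execute(
    "SELECT region, amount FROM sales WHERE quarter = ?", ("Q2",)
).fetchall()
```

Only the handful of matching rows enters the model's context, which is what makes this cheaper and more scalable than inlining the whole dataset.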

Network Access 

The third part of the container context is network access, an essential part of agent workloads. Agents may need to fetch live data, call external APIs, or install packages. But giving containers full internet access can be risky: it could expose information to outside websites, accidentally reach sensitive systems, or make it harder to stop leaks and data theft. Our approach is a sidecar egress proxy: all outbound network requests flow through a centralized policy layer that enforces allowlists and access controls while keeping traffic observable. For credentials, we use domain-scoped secret injection at egress. The model and container only see placeholders, while raw secret values remain outside the model’s visible context and are applied only to approved destinations. This reduces the risk of leakage while still enabling authenticated external calls.  
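A toy sketch of the policy layer's two checks; the host names, placeholder syntax, and secret store are illustrative, not the production design.

```python
# Illustrative policy layer: the model only ever sees the placeholder;
# the proxy swaps in the real credential for allowlisted hosts only.
ALLOWLIST = {"api.example.com"}
SECRETS = {"{{API_KEY}}": "sk-real-value"}  # held outside model context

def apply_egress_policy(host, headers):
    """Enforce the allowlist, then inject real secrets for placeholders."""
    if host not in ALLOWLIST:
        raise PermissionError(f"egress to {host} is blocked by policy")
    return {
        key: SECRETS.get(value, value)  # replace placeholders, pass the rest
        for key, value in headers.items()
    }
```

Because substitution happens at egress, a prompt-injected attempt to exfiltrate the key can only ever leak the placeholder string.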

Agent Skills 

Shell commands are powerful, but many tasks follow the same multi-step patterns. Agents often have to figure out the workflow each time, replanning, reissuing commands, and relearning steps, which leads to inconsistent results and wasted effort. Agent skills package these patterns into reusable building blocks. A skill is a folder bundle. It includes a SKILL.md file with metadata and instructions, along with any necessary resources, such as API specs and UI assets.  

This structure fits well with the runtime setup described earlier. The container gives persistent files and an execution context. The shell tool provides a way to run commands. With both, the model can find skill files using shell commands like ls and cat. It can read instructions and run skill scripts all within the same agent loop.  
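The discovery step can be sketched in Python much as the model would do it with `ls` and `cat`. The SKILL.md layout assumed here (first line is the skill name, the rest is instructions) is a simplification; real bundles carry richer metadata.

```python
import os

def discover_skills(skills_dir):
    """Scan for skill bundles: any folder containing a SKILL.md.

    Assumes a simple SKILL.md layout where the first line names the
    skill; returns {folder_name: skill_name}.
    """
    skills = {}
    for entry in sorted(os.listdir(skills_dir)):
        manifest = os.path.join(skills_dir, entry, "SKILL.md")
        if os.path.isfile(manifest):
            with open(manifest) as f:
                skills[entry] = f.readline().strip()
    return skills
```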

We offer APIs to manage skills on the OpenAI platform. Developers upload and store skill folders as versioned bundles, which can be retrieved later by skill ID before sending a prompt to the model. The Responses API loads the skill and adds it to the model context. This process follows the same steps each time:  

  • Fetch the skill metadata and include the name and description in the model context.  
  • Fetch the skill bundle, copy it into the container, and unpack it.  
  • Update the model context with the skill metadata and the container path.  

When deciding if a skill is relevant, the model reviews its instructions step by step. Then it runs its scripts using shell commands in the container.  

How Agents Are Made 

To sum up: the Responses API handles orchestration. The shell tool runs actions. The hosted container gives a persistent runtime context. Skills add reusable workflow logic. Compaction lets an agent run for a long time with the context it needs.  

With these building blocks, a single prompt can turn into a full workflow. It can find the right skill, get data, turn it into a structured local state, query it efficiently, and produce lasting results.  

Make Your Own Agent 

For an in-depth example of combining the shell tool and computer environment for end-to-end workflows, see our developer blog post and cookbook. These walk through packaging a skill and executing it through the Responses API.  

We look forward to seeing what developers create with these tools. Language models can do much more than just generate text, images, or audio. We will continue improving our platform to handle complex real-world tasks at scale.

Source: From model to agent: Equipping the Responses API with a computer environment 
