How OpenAI's Codex AI Coding Agent Works: Technical Details [2025]
Here's what just happened: OpenAI did something unusual. They actually explained how their stuff works.
A senior engineer named Michael Bolin published a detailed technical breakdown of how Codex, OpenAI's CLI coding agent, handles the core "agentic loop." This isn't the usual vague marketing post. It's a real engineer talking about real problems they solved.
Why does this matter? Because AI coding agents are legitimately becoming useful. Claude Code, Codex with GPT-5.2, and similar tools are reaching that inflection point where developers stop treating them as novelties and start building with them daily.
But here's the thing: these tools are deceptively complex under the hood. The magic you see on the surface—watching code get written, tests run, bugs fixed—requires solving some genuinely gnarly engineering challenges. Cache misses. Quadratic prompt growth. Sandboxing complexity. State management across conversations.
This article breaks down what OpenAI revealed and then goes deeper into what it means for the future of AI-powered development.
TL;DR
- The agentic loop is stateless: Every API call sends the entire conversation history. No server-side caching of state.
- Prompts grow quadratically: Each turn adds more history. Codex mitigates this with prompt caching and automatic context compaction.
- Cache hits are fragile: Changing tools, models, or sandbox settings mid-conversation can invalidate cached prompts.
- Conversation compaction is essential: When tokens exceed thresholds, Codex compresses history while preserving model understanding via encrypted content items.
- Production agents remain brittle: These tools excel at scaffolding and prototypes but struggle with edge cases and custom logic requiring human oversight.
What Is an AI Coding Agent, Really?
Let's step back. Most people think of coding assistants like GitHub Copilot as one thing. You type a comment, the AI suggests a function. Done.
An AI coding agent is completely different. It's autonomous (within limits). You say "build me a landing page component that accepts props and validates them," and it actually does it. No suggestion—an actual implementation. It writes code, runs tests, debugs failures, iterates on itself.
The difference is the loop.
Copilot is a one-turn system. Prompt → model → suggestion. Done.
Codex is a multi-turn system. Agent → model → code execution → results → model (again) → refinement → testing → debugging → more execution. It cycles until the task is complete or the agent decides human intervention is needed.
This loop is where all the complexity lives.
The Agentic Loop Explained: What Codex Actually Does
When you fire up Codex and ask it to do something, here's the actual sequence of events.
Step 1: Build the Initial Prompt
Codex constructs a structured prompt with several components, each with a defined role:
- System: Base instructions for how the model should behave
- Developer: Instructions from the CLI configuration file or defaults
- User: Your actual request
- Assistant: Previous responses and tool calls (on second turn onwards)
- Tools: Available functions the model can call
- Input: Metadata about environment, sandbox permissions, working directory
This isn't just a wall of text. It's carefully structured because the model uses these role distinctions to understand context priority.
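To make that concrete, here's a rough sketch of how such a role-structured prompt might be assembled. The field names and helper function are illustrative, not OpenAI's actual Responses API schema.

```python
# Hypothetical sketch of assembling a role-structured prompt each turn.
# Field names are illustrative, not OpenAI's actual Responses API schema.
def build_prompt(user_request, history, tools, env_metadata):
    """Assemble the role-tagged items that get sent on every turn."""
    items = [
        {"role": "system", "content": "Base instructions for how the agent behaves."},
        {"role": "developer", "content": "Instructions from the CLI config or defaults."},
        {"role": "user", "content": env_metadata},   # sandbox, cwd, permissions
        {"role": "user", "content": user_request},
    ]
    items.extend(history)  # assistant responses and tool calls from earlier turns
    return {"input": items, "tools": tools}

prompt = build_prompt(
    user_request="Add prop validation to the landing page component.",
    history=[],  # empty on the first turn
    tools=[{"name": "shell", "description": "Run a shell command in the sandbox"}],
    env_metadata="cwd=/workspace/app, sandbox=workspace-write, network=off",
)
```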
Step 2: Send to OpenAI's API
Codex sends this to OpenAI's Responses API for inference. Here's the critical insight: the entire conversation history gets sent with every request.
Why not cache it server-side? Because Codex is stateless by design. The engineer explains that this simplifies things for OpenAI (they don't need to store conversation state) and supports "Zero Data Retention" mode where OpenAI doesn't retain the data at all.
Trade-off: efficiency for privacy and simplicity.
Step 3: Model Responds or Requests a Tool
The model generates text. Either:
- It produces a response for you ("Here's your code")
- It requests a tool call ("I need to run `npm test` to check this")
If it's a direct response, the loop ends.
If it's a tool request, Codex executes it in a sandbox and feeds the results back into the conversation.
Step 4: Recursive Loop Until Done
Codex appends the tool output to the prompt and sends it back to the model. The model sees its own previous request, the execution results, and generates the next turn.
This repeats until the model stops requesting tools and produces a final message.
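Stripped of all the details, the whole loop fits in a few lines. This is an illustrative sketch: `call_model` and `run_in_sandbox` are placeholders for the Responses API call and the sandboxed executor, not real client functions.

```python
# Illustrative sketch of the agentic loop. call_model and run_in_sandbox are
# placeholders for the real API call and sandboxed executor, not actual APIs.
def agentic_loop(initial_prompt, call_model, run_in_sandbox, max_turns=20):
    history = list(initial_prompt)
    for _ in range(max_turns):
        response = call_model(history)           # the full history is sent every time
        history.append(response)                 # the model's output becomes context
        if response.get("type") != "tool_call":  # plain message: the task is done
            return response["content"]
        result = run_in_sandbox(response["tool"], response["arguments"])
        history.append({"type": "tool_result", "content": result})
    raise RuntimeError("No final answer after max_turns")
```

Everything else in this article (caching, compaction, sandboxing) exists to make this deceptively simple loop work at scale.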
The Prompt Engineering Problem: Quadratic Growth
Here's where it gets weird. Each conversation turn adds more text to the prompt.
Turn 1: Your request (100 tokens)
Turn 2: Your request + model response + tool output (300 tokens)
Turn 3: Everything above + model response + more tool output (700 tokens)
Turn 4: Everything above + more responses and outputs (1,200 tokens)
The prompt keeps growing, and because every request resends everything before it, the total tokens processed over a conversation grow quadratically. In a 10-turn conversation, the final prompt carries all 10 turns of context, and tool outputs can add massive amounts of data to each one.
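Here's a back-of-the-envelope illustration of that growth, using made-up numbers:

```python
# Back-of-the-envelope illustration with made-up numbers: ~200 new tokens per
# turn means each prompt grows linearly, but the total tokens the API processes
# across the whole conversation grows quadratically (without caching).
new_tokens_per_turn = 200
turns = 10
prompt_sizes = [new_tokens_per_turn * (t + 1) for t in range(turns)]
print(prompt_sizes[-1])    # 2000 tokens in the final request
print(sum(prompt_sizes))   # 11000 tokens processed across 10 turns
```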
Why is this a problem?
- API costs scale with prompt length: OpenAI charges per token. Long prompts = expensive.
- Latency increases: Larger prompts take longer to process.
- Context window limits: Models have maximum token limits. GPT-5.2 has a large context, but it's not infinite.
- Model performance degrades: Sometimes more context actually makes the model worse at the current task because it gets confused by earlier turns.
OpenAI addressed this with prompt caching. The API can cache prefixes of prompts so that repeated sections don't get reprocessed. But here's the catch: cache hits only work for exact prefix matches.
If you:
- Change the available tools mid-conversation
- Switch models
- Modify sandbox permissions
- Update environment variables
The cache becomes invalid because the prefix changed. Codex has to send the entire prompt again.
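One crude way to picture it: think of the cache key as a fingerprint of the exact serialized prefix. The real cache operates on tokenized prefixes rather than hashes, but the effect is the same, any change early in the prompt produces a different fingerprint and a miss. A sketch:

```python
import hashlib
import json

# Crude illustration: treat the cache key as a hash of the serialized prefix.
# The real system caches tokenized prefixes, but the exact-match requirement is
# the same: any early change yields a different key and therefore a cache miss.
def prefix_key(system, developer, tools):
    blob = json.dumps({"system": system, "developer": developer, "tools": tools},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

before = prefix_key("base instructions", "project defaults", ["shell", "file_edit"])
after = prefix_key("base instructions", "project defaults",
                   ["shell", "file_edit", "my_mcp_tool"])
print(before == after)  # False: adding a tool invalidates the cached prefix
```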
How Codex Manages Cache Performance
OpenAI learned hard lessons about cache invalidation. They designed Codex to be very careful about operations that trigger cache misses.
For example, when you add a new custom tool via Model Context Protocol (MCP), Codex knows this will break the cache. It doesn't just naively update the tools list in the middle of a conversation. It has logic to minimize the impact or warn you that performance will degrade.
The practical effect: experienced Codex users learn to set up all their tools at the beginning of a session, before any work starts. Change tools later? The agent becomes slower and more expensive.
This is counterintuitive for users expecting a fluid experience, but it's a direct consequence of the physics of prompt caching.
Context Window Limits and Automatic Compaction
Eventually, even with caching, prompts get too long. They exceed the context window or become prohibitively expensive.
Codex has an elegant solution: automatic conversation compaction.
When the token count exceeds a threshold (Bolin doesn't specify the exact number), Codex triggers a special API endpoint that compresses conversation history. This is different from simply truncating. The system:
- Keeps recent turns in full detail
- Summarizes older turns into a compressed representation
- Preserves the model's "understanding" of what happened through encrypted content items
- Reduces total tokens while maintaining continuity
Earlier versions of Codex required users to manually compact conversations using a `/compact` slash command. The new system does it automatically, which is much better UX.
But there's a subtle cost: the model only has compressed summaries of old work, not the full details. In edge cases, the agent might forget nuances about decisions made earlier.
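The mechanics can be sketched roughly like this: keep the most recent turns verbatim and collapse everything older into a single summary item. The `summarize` callable below is a placeholder, Codex delegates that step to a dedicated API endpoint and carries the model's state forward in encrypted content items.

```python
# Rough sketch of compaction: keep the last few turns verbatim and collapse
# older turns into one summary item. `summarize` is a placeholder for the
# compression step, which Codex delegates to a dedicated API endpoint.
def compact(history, summarize, keep_recent=6):
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary_item = {"type": "summary", "content": summarize(older)}
    return [summary_item] + recent
```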
Tool Integration: How Codex Knows What It Can Do
Codex doesn't have arbitrary access to your system. It has a defined set of tools it can call.
These include:
- Shell commands: Execute `npm install`, `python script.py`, etc.
- File operations: Read, write, modify files
- Planning tools: Break down complex tasks into subtasks
- Web search: Look up documentation or external information
- Custom tools: Via Model Context Protocol (MCP) servers
Each tool gets a definition in the prompt. The definition includes:
- Name
- Description
- Parameters (what inputs it accepts)
- Return type (what data it gives back)
When the model generates a response, it can request to call any of these tools by generating structured output like:
{"tool": "shell", "command": "npm test"}
Codex executes the request in a sandboxed environment, captures the output, and feeds it back.
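A minimal dispatch sketch shows the shape of that step: look up the requested tool, run it, and wrap the output so it can be appended to the conversation. The registry and handler below are hypothetical, and unlike real Codex, this runs the command directly instead of inside a sandbox.

```python
import subprocess

# Hypothetical dispatch: route the model's structured request to a handler and
# capture the output for the next turn. Unlike real Codex, this runs the
# command directly; the real system wraps it in a sandbox with permission checks.
def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

TOOL_REGISTRY = {"shell": run_shell}

def dispatch(tool_request: dict) -> dict:
    handler = TOOL_REGISTRY.get(tool_request["tool"])
    if handler is None:
        return {"type": "tool_result", "error": "unknown tool: " + tool_request["tool"]}
    return {"type": "tool_result", "content": handler(tool_request["command"])}

print(dispatch({"tool": "shell", "command": "echo hello"}))
```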
The sandboxing is critical. Codex can't actually delete your home directory or steal your API keys. It runs in an isolated container with restricted permissions.
Sandboxing: The Execution Environment
Let's talk about how Codex actually runs code safely.
When the model requests a shell command, Codex doesn't execute it directly on your machine. It spins up a containerized sandbox environment.
This sandbox:
- Has a filesystem isolated from your real system
- Has network access restrictions (controlled list of allowed domains)
- Has process limits (can't consume unlimited CPU/memory)
- Has timeout limits (commands that run too long get killed)
- Has permission restrictions (can't access sensitive files)
You, as the user, can configure these permissions. Tight sandbox = safer but limited. Loose sandbox = more capable but riskier.
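To make that trade-off concrete, a sandbox policy might look something like the sketch below. The field names are invented for illustration; the actual Codex configuration format differs.

```python
# Invented field names for illustration only; the real Codex configuration
# format differs. The point is the knobs being traded off: tighter is safer,
# looser is more capable.
sandbox_policy = {
    "filesystem": {"writable_roots": ["/workspace"], "read_only": ["/usr", "/etc"]},
    "network": {"allowed_domains": ["pypi.org", "registry.npmjs.org"]},
    "limits": {"cpu_seconds": 120, "memory_mb": 2048, "command_timeout_s": 60},
}
```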
Codex tracks the sandbox state (what files exist, what's been installed, current working directory) and includes this context in every prompt turn. The model knows what's available in the sandbox and what operations it's already performed.
One critical detail: the sandbox is ephemeral. When the session ends, it disappears. Everything the agent created exists only in memory (or in files you explicitly saved).
This is actually a feature for security, but it means you can't have long-running services or persistent state across sessions.
The Stateless API Design: Why Everything Gets Resent
Let's zoom out and talk about why OpenAI designed Codex as a stateless API client.
Stateful systems (where the server remembers conversation history) are easier to use but harder to scale and maintain. The server becomes a bottleneck. It needs to store conversation state. It needs to retrieve it efficiently. It becomes a database problem.
Stateless systems (where every request is self-contained) are harder to use but infinitely scalable. The client sends everything needed with each request. The server processes and responds. No stored state to manage.
OpenAI chose stateless for good reasons:
- Scalability: No need to manage conversation storage
- Privacy: Zero Data Retention mode means no server-side storage at all
- Simplicity: API clients are simpler to implement
- Cost: Can use cheaper, stateless infrastructure
The trade-off is prompt size and latency. Every request includes the full history.
But here's where prompt caching comes back into play. Even though the entire history gets sent with every request, the API server caches the tokenized representation of prefixes. The actual token processing only happens for new content.
So from the user's perspective, it feels like state. From the server's perspective, it's purely stateless. The caching layer bridges the gap.
Model Context Protocol: The Tool Ecosystem
Codex doesn't have hardcoded access to every possible tool. Instead, it uses Model Context Protocol (MCP), which is becoming an industry standard for AI agent tool integration.
MCP allows you to define custom tools as servers that Codex can communicate with. You could create:
- A database query tool
- A Slack integration
- A custom API wrapper
- A specialized linter
These tools are defined outside Codex and connected via the MCP protocol. When you start a session, you point Codex to the MCP servers you want available.
Codex discovers the available tools, includes them in the prompt, and can request them during execution.
This is how Codex scales without being rewritten. New tools = new MCP servers. No code changes needed.
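As a sketch of what a custom tool looks like from the agent's side once discovered, the definition below uses made-up names; the actual MCP schema carries more detail (transport, JSON-RPC plumbing) than shown here.

```python
# Made-up example of a custom tool exposed via an MCP server, as the agent
# might see it after discovery. The real MCP schema carries more detail.
query_orders_tool = {
    "name": "query_orders_db",
    "description": "Run a read-only SQL query against the orders database.",
    "parameters": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
    "returns": "Matching rows serialized as JSON",
}
```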
But remember: adding or removing MCP servers mid-conversation can invalidate the prompt cache. The tool list changed, so the cached prefix is no longer valid.
The Role of the System Prompt: Setting Behavioral Guardrails
Every turn of the agentic loop starts with a system prompt. These are the instructions for how the model should behave.
Codex's system prompt likely includes:
- Role clarification: "You are an AI coding assistant that helps developers write and debug code."
- Task limitations: "You must request permission before executing dangerous commands."
- Tool usage rules: "Only call tools that are listed in the available tools section."
- Output format: "Always provide a structured response with explanation."
- Safety guidelines: "Never attempt to access sensitive data or bypass security measures."
- Debugging approach: "If code fails, explain the error and suggest fixes."
These instructions influence every response the model generates. They're the guardrails that keep the agent focused and safe.
Bolin's post doesn't dive into the exact system prompt, probably for security reasons. But the engineering implication is clear: prompt engineering is crucial. A weak system prompt could lead to unsafe behavior or confused reasoning.
Developer Instructions: Customization and Configuration
You don't have to accept Codex's default behavior. The system allows developer instructions.
These come from:
- Base instructions: Built into the Codex CLI
- Configuration files: User-specified overrides (`.codex.json` or similar)
- Per-session flags: Command-line arguments
Examples of developer instructions you might provide:
- "Prefer async/await over callbacks"
- "Always write tests before implementation"
- "Use TypeScript for all code"
- "Prioritize performance over readability"
- "Target Python 3.9+"
These instructions get included in the prompt, shaping the model's decisions without requiring changes to the system prompt.
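The layering can be pictured as a simple precedence merge: built-in defaults first, then the config file, then per-session flags. The function and values below are illustrative, not the actual Codex merge logic.

```python
# Illustrative precedence merge: built-in defaults, then the config file, then
# per-session flags. Names and values are made up, not actual Codex logic.
def merge_developer_instructions(base, config_file, session_flags):
    instructions = list(base)
    instructions.extend(config_file.get("instructions", []))
    instructions.extend(session_flags)
    return "\n".join(instructions)

developer_message = merge_developer_instructions(
    base=["Explain changes before applying them."],
    config_file={"instructions": ["Use TypeScript for all code.", "Write tests first."]},
    session_flags=["Target Python 3.9+ for scripts."],
)
```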
This is why two people can use the same Codex version and get completely different behavior. User instructions matter enormously.
The Input Context: Environment Metadata
Before the model sees your request, Codex prepares input context.
This includes:
- Current working directory: The model knows where it is
- Project structure: File listing of the current directory
- Environment variables: Available configuration
- Installed dependencies: Package versions
- Available tools: The full list of callable tools
- Sandbox permissions: What the model is allowed to do
- User request: Your actual message
This context is included in every turn. The model builds a complete picture of the environment before generating responses.
Without this context, the model would be blind. It wouldn't know what's in your project, what packages are installed, what directory it's working in.
Without context: "Write a function that uses React." Ambiguous. Is React installed? What version? What's the project structure?
With context: "Write a function that uses React (v18.2.0, already installed in /node_modules). Here's the project structure..." Much more useful.
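Serialized into the prompt, that context might look roughly like the structure below (field names invented for illustration):

```python
# Invented field names; illustrates the kind of environment metadata that gets
# serialized into every turn so the model isn't working blind.
environment_context = {
    "cwd": "/workspace/my-app",
    "project_files": ["package.json", "src/App.tsx", "src/index.tsx"],
    "dependencies": {"react": "18.2.0", "typescript": "5.4.2"},
    "sandbox": {"write_access": ["/workspace/my-app"], "network": "restricted"},
    "tools": ["shell", "file_edit", "web_search"],
}
```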
Conversation History Management: The Memory Problem
As conversations extend, managing history becomes critical.
Codex keeps a structured history of:
- User messages
- Model responses
- Tool requests (what the model asked for)
- Tool responses (what the tool returned)
- Errors
- User interventions (when you corrected the agent)
All of this gets appended to the prompt. But append long enough, and you hit problems:
- Costs explode: Each request processes all previous tokens again (unless cached)
- Latency increases: Processing gets slower with more context
- Model confusion: The model might contradict earlier decisions or forget important context
- Token limits: You hit the context window ceiling
Codex's solution is automatic compaction. When history gets too large, it:
- Identifies older conversation turns
- Summarizes them into condensed representations
- Keeps recent turns in full detail
- Replaces old detail with encrypted content items that preserve semantic meaning
This is sophisticated. The system doesn't just delete old history (losing information). It compresses it intelligently.
The model still "knows" what happened earlier, but it doesn't have the full transcript of every command and output.
Performance Optimization: Caching Strategy
Optimizing performance in Codex is largely about optimizing caching.
Here's the caching hierarchy:
- Prompt prefix caching: The API server caches token encodings of prompt prefixes
- Exact match requirement: Cache hits only happen if the prefix is byte-for-byte identical
- Invalidation rules: Changing tools, models, or sandbox settings breaks the cache
- Latency impact: Cache hits are ~50-70% faster than cache misses
- Cost impact: Cache hits are significantly cheaper
To optimize cache performance, Codex:
- Freezes tool lists for the session duration
- Batches configuration changes
- Minimizes mutations to the sandbox
- Structures prompts to maximize prefix stability
Advanced users understand this and structure their sessions to maximize cache hits. Asking Codex to add a new tool in the middle of work is performance suicide.
But most users don't think about this. They just notice that certain operations feel slow and don't know why.
Common Failure Modes: When Codex Gets Stuck
OpenAI's post mentions several failure modes they've encountered and addressed:
MCP Tool Enumeration: Tools weren't being enumerated in a consistent order, causing cache misses when the tool list logically hadn't changed but syntactically had. They fixed the enumeration to be deterministic.
Cache Invalidation: The team discovered scenarios where cache was invalidated unnecessarily, hurting performance. They added logic to minimize cache breaks.
Context Confusion: In long conversations, the model sometimes forgot earlier decisions or contradicted previous work. The solution was better summarization during compaction.
Tool Response Parsing: When tool output was malformed or unexpected, the model could get confused. They added better error handling and fallback behaviors.
Sandbox Restrictions: Too-tight sandboxes prevented valid operations. Too-loose sandboxes created security risks. They built a system for granular permission management.
Timeout Handling: Commands that took too long would hang the entire session. They added timeout mechanisms and recovery paths.
Each of these is a real problem that emerged from real usage. OpenAI's transparency about them is valuable.
Future Directions: What's Coming
Bolin's post hints at future technical posts covering:
- CLI architecture: The structure of Codex's command-line interface
- Tool implementation details: How specific tools (shell, file operations, etc.) are built
- Sandboxing model: The technical specifics of how sandboxes are created and managed
These posts should provide even deeper insights into the engineering.
But beyond technical implementation, there are broader questions about AI agents:
- How do you make agents truly reliable for production work?
- How do you handle the irreducible need for human oversight?
- How do you design agents that fail gracefully instead of confidently making mistakes?
- How do you reason about security and safety as agents become more capable?
These are partially technical questions and partially design and philosophy questions.
Practical Implications for Developers Using AI Agents
Understanding how Codex works should inform how you use it.
Session Structure: Structure your Codex sessions carefully. Set up tools and configuration at the start. Avoid mid-stream changes that break caching.
Context Provision: Before asking Codex to do something, provide comprehensive context about your project, goals, and constraints. The better the context, the better the output.
Iterative Refinement: Don't expect Codex to nail complex features in one turn. Structure work as an iterative process: initial implementation → testing → debugging → refinement.
Verification Mindset: Assume Codex will make mistakes. Every generated function needs to be tested. Every architectural decision needs to be reviewed. AI agents are productivity tools, not replacements for engineering judgment.
Tool Awareness: Understand what tools Codex has available and what limitations exist. If Codex can't access a tool you need, request it via MCP before starting work.
Cache Consciousness: In long sessions, be aware that certain operations (adding tools, changing configs) will hurt performance. Batch them early or accept the performance penalty.
Comparing Codex to Other AI Agents
How does Codex compare to Claude Code or other agents?
Similarity: All modern AI agents use some version of the agentic loop described here. Claude Code, Codex, and others all request tools, execute them, and iterate.
Differences:
- Underlying model: Codex uses GPT-5.2, Claude Code uses Claude Opus 4.5. Different training = different behaviors.
- Tool ecosystem: Claude Code has integrations Anthropic has invested in. Codex has MCP flexibility.
- Philosophy: OpenAI emphasizes control and transparency (hence the technical post). Anthropic emphasizes safety and reasoning.
- Performance profile: GPT-5.2 excels at speed and breadth. Claude Opus excels at depth and reasoning.
- Pricing: Codex and Claude pricing differ, affecting cost-per-session calculations.
For coding tasks specifically, both are genuinely useful. The choice depends on your project's specific needs and preferences.
Building Your Own AI Agent: Lessons from Codex
If you're considering building an AI agent (perhaps using Runable or similar platforms that support agent automation), here are lessons from Codex's design:
1. Design for statelessness: Make each request self-contained. This scales better than storing conversation state on the server.
2. Implement prompt caching early: Token costs are your largest variable expense. Caching pays for itself immediately.
3. Build context management: Automatic context compaction is more important than you'd think. Long conversations become unusable without it.
4. Plan tool integration: Use standard protocols (like MCP) for tool definition. Don't hardcode tools into the agent.
5. Sandbox everything: Don't let agents execute arbitrary code. Use containers, permission restrictions, and timeouts.
6. Design for human oversight: Agents aren't autonomous. They need human-in-the-loop checkpoints, approval mechanisms, and clear logging.
7. Test edge cases obsessively: The difference between a novelty and a useful tool is handling edge cases gracefully.
OpenAI learned these lessons the hard way. You can learn them from their experience.
The Brittleness Problem: What Codex Still Can't Do
Let's be honest: Codex has real limitations.
It excels at:
- Scaffolding and boilerplate
- Routine tasks (API implementations, standard patterns)
- Quick prototypes
- Documentation and comments
It struggles with:
- Complex architectural decisions
- Cross-cutting concerns (performance, security, observability)
- Debugging tangled legacy code
- Custom domain logic
- Edge cases outside training data
The reason is fundamental: Codex is pattern-matching at scale. It works well when you're building something similar to millions of examples in its training data. It breaks down when you venture into the novel or obscure.
This brittleness isn't a bug to be fixed. It's a consequence of how large language models work. They're generalizers, not specialists. They're broad but shallow.
For production work, you need human oversight. Codex generates code. You integrate it. You test it. You debug it. You own the consequences.
The Competitive Landscape: Who's Building Similar Systems
Codex isn't alone. The entire industry is racing to build better AI agents.
Anthropic's Claude Code follows a similar pattern with some architectural differences.
GitHub Copilot is expanding from code completion toward agent-like capabilities.
Companies like Zapier are building agents for no-code automation.
The common thread: agentic loops are the pattern everyone's converging on. The differences are in implementation details, tool ecosystems, and model choice.
OpenAI's decision to publish technical details is partly genuine engineering transparency and partly competitive positioning. By explaining how Codex works, they're demonstrating sophistication and building confidence that OpenAI understands AI agents deeply.
Security Implications: Sandboxing and Safety
When you let an AI agent execute code, security becomes paramount.
Codex's approach:
- Container isolation: Code runs in a sandboxed container, isolated from your real system
- Permission restrictions: Fine-grained control over what the agent can access
- Network isolation: Controlled list of allowed domains
- Resource limits: CPU, memory, and timeout constraints
- Logging and auditing: Every action is tracked
But here's the scary part: determined attackers might find ways to escape the sandbox. Or the agent might be tricked into doing something harmful through clever prompts.
This is why human oversight isn't optional. It's essential.
OpenAI's sandboxing is better than most. But it's not perfect. The best security posture is: don't let the agent do anything you wouldn't trust a junior developer to do unsupervised.
Industry Impact: What This Means for Software Development
Codex and similar agents are genuinely shifting how software development works.
Not replacing it. Shifting it.
Developers using agents spend less time on boilerplate and routine work. They spend more time on architecture, edge cases, and testing.
Teams using agents move faster initially but sometimes move slower later when they discover debt from agent-generated code.
The skill that matters most isn't "can you code?" anymore. It's "can you direct an AI agent and evaluate its work?"
This is a real transition. The next generation of developers will learn differently. They'll spend less time memorizing syntax and API details. They'll spend more time on design and critical thinking.
Some worry this devalues human expertise. I'd argue it commoditizes routine work and increases the value of expertise. You hire experienced developers for architecture and judgment, not for typing out CRUD operations.
Looking Ahead: The Evolution of Agents
Where does this go?
Near term (next 12-24 months): Agents get better at handling longer conversations, more complex tool ecosystems, and better state management.
Medium term (2-3 years): Agents move from coding-focused to multi-domain (systems administration, data analysis, business process automation).
Longer term (3+ years): The question becomes whether "agentic" remains a distinct category or whether all software interactions become agentified.
Technical challenges to solve:
- Reliability: Agents still make mistakes. We need better error recovery.
- Specialization: Generic agents are okay. Specialized agents (trained for specific domains) will be much better.
- Real-time collaboration: Current agents are turn-based. What about continuous interaction?
- Emergent safety: As agents become more autonomous, how do we ensure they behave as intended?
OpenAI's transparency about Codex is valuable partly because it shows they're thinking deeply about these challenges. They're not just throwing a model at the problem. They're engineering thoughtfully.
Key Takeaways: The Big Picture
OpenAI revealed a lot by publishing Bolin's technical post. Here's what matters:
1. Agentic loops are the standard pattern: Input → model → tool request → execution → feedback → repeat.
2. Statelessness is a feature, not a limitation: Every request sends full history, but prompt caching makes it efficient.
3. Context management is critical: Automatic compaction and conversation summarization keep agents usable at scale.
4. Tool integration is key: Model Context Protocol enables extensibility without rewriting the core agent.
5. Caching fragility is real: Cache misses hurt performance significantly. Agents must be designed with caching in mind.
6. Sandboxing is non-negotiable: Agents execute code. Without isolation, it's dangerous.
7. Human oversight is essential: Agents aren't autonomous. They're supervised automation tools.
8. The industry is converging on similar patterns: Codex, Claude Code, and others are implementing variations of the same fundamental loop.
9. This shifts software development: Less boilerplate, more architecture. Less syntax knowledge, more design thinking.
10. The challenges are largely engineering, not scientific: The big problems (state management, caching, tool integration) are solved through careful engineering, not breakthrough research.
FAQ
What is an agentic loop in AI coding agents?
An agentic loop is the core iterative process where an AI agent receives input, sends it to a model, gets a response that either completes the task or requests tool execution, executes that tool, and repeats the process with the results appended to the prompt. This enables agents to perform multi-step tasks like writing code, running tests, debugging failures, and refining implementations automatically.
How does OpenAI's Codex agent prevent running out of context?
Codex uses automatic context compaction when conversations exceed a token threshold. Instead of truncating history, it compresses older conversation turns into encrypted content items that preserve semantic meaning while reducing token count. This allows the model to maintain understanding of earlier work without storing the full transcript, solving the quadratic prompt growth problem inherent in stateless agent systems.
Why does Codex send the entire conversation history with every API request?
Codex uses a stateless API design where the entire conversation history is sent with each request, rather than storing state server-side. This simplifies infrastructure, enables Zero Data Retention mode for privacy, and scales better than managing persistent conversation storage. Prompt caching mitigates the efficiency cost by caching token encodings of repeated prefix sections, so repeated content isn't reprocessed even though it's resent.
What happens when you change tools or settings mid-conversation in Codex?
Changing tools, models, or sandbox settings mid-conversation invalidates the cached prompt prefix, forcing the API to reprocess the entire prompt from scratch. This significantly impacts performance and latency. Advanced users structure sessions to freeze tool definitions and configurations at the start, minimizing cache-breaking operations during active work.
How does Codex execute code safely without compromising your system?
Codex executes all code in an isolated containerized sandbox environment with restricted permissions, network isolation, resource limits, and timeout constraints. The sandbox can't access your real filesystem, can only connect to whitelisted domains, and is ephemeral (deleted after the session). Every action is logged for auditing. This prevents the agent from causing damage even if it attempts malicious operations or gets tricked by adversarial prompts.
What's the difference between Codex and earlier coding assistants like GitHub Copilot?
GitHub Copilot uses one-turn completion where you provide a prompt and get a code suggestion. Codex is an agentic system that iterates: it writes code, runs tests, debugs failures, and refines implementations across multiple turns. Copilot is a suggestion tool. Codex is an agent that can complete multi-step tasks autonomously (with human oversight). This requires completely different architecture including loop management, state handling, tool integration, and conversation history management.
How much does the agentic loop cost compared to simple code completion?
Costs depend on conversation length and cache hit rate. Short conversations with good cache hits might cost 20-50% more than a single completion because of multiple API calls. Long conversations (10+ turns) without cache optimization could cost 3-5x more due to quadratic prompt growth. However, agents often complete tasks in a single multi-turn session that would require many manual iterations with a completion tool, so cost-per-task may actually be lower despite higher cost-per-token.
Can you use Codex with custom tools not included in the default set?
Yes, Codex supports custom tools through Model Context Protocol (MCP) servers. You can define custom integrations for databases, APIs, specialized linters, or domain-specific tools and connect them as MCP servers that Codex can discover and use. However, adding new tools mid-conversation invalidates the prompt cache, so optimal performance requires configuring all tools at session start.
What limitations does Codex have that humans still need to handle?
Codex excels at boilerplate and routine patterns but struggles with novel architecture, complex cross-cutting concerns (performance, security, observability), domain-specific logic, and edge cases outside its training data. It can't reason about business requirements, make judgment calls about trade-offs, or handle truly unexpected problems. Human oversight is essential for architecture decisions, testing, debugging, and any production code.
How is Codex's approach different from stateful agent systems?
Stateful systems store conversation history server-side, making the client simpler but requiring persistent database infrastructure. Codex's stateless design sends full history with each request, requiring prompt caching for efficiency but eliminating server-side storage. Stateless scales better and supports privacy features like Zero Data Retention, but requires more sophisticated client-side context management and makes cache optimization essential.
Conclusion
OpenAI's decision to publish technical details about Codex is significant. It's rare for companies to explain their infrastructure so thoroughly.
What emerges is a picture of careful engineering. The agentic loop seems simple (request → model → tool → feedback), but actually implementing it at scale requires solving hard problems around state management, context windows, caching, sandboxing, and tool integration.
The core insight is that AI agents are fundamentally different from single-turn completions. They require rethinking API design, conversation management, and tooling. OpenAI got this right, which is why Codex actually works for real tasks.
But Codex isn't magic. It's a sophisticated system with real constraints. It can't handle unlimited conversation length. It can't escape its training data. It can't reason reliably about novel problems. It needs human oversight.
The future of AI-assisted development likely involves more agents like Codex. Developers will spend less time on routine coding and more time on architecture, testing, and critical thinking. Teams that learn to work effectively with agents will move faster. Teams that resist or use them poorly will struggle.
The technical details matter because they inform how you should use these tools. Understanding prompt caching means you structure sessions to avoid cache invalidation. Understanding sandboxing means you know what's safe. Understanding tool integration means you can extend agents with custom capabilities.
Open AI showed their cards. The question now is whether other teams can catch up or innovate beyond this foundation. Either way, the agentic loop pattern is here to stay. Understanding it deeply is becoming essential for anyone building with AI.
Related Articles
- Claude MCP Apps: How AI Became Your Workplace Command Center [2025]
- ChatGPT Creativity Settings: Master Advanced Prompting Techniques [2025]
- Why Microsoft Is Adopting Claude Code Over GitHub Copilot [2025]
- Master AI Image Prompts Better Than Google Photos Remixing [2025]
- AI Coding Agents and Developer Burnout: 10 Lessons [2025]
- Anthropic's Economic Index 2025: What AI Really Does for Work [Data]
![How OpenAI's Codex AI Coding Agent Works: Technical Details [2025]](https://tryrunable.com/blog/how-openai-s-codex-ai-coding-agent-works-technical-details-2/image-1-1769470594446.jpg)


