I use Claude Code as my primary development tool. That bias is worth stating upfront because the strongest comparison writing comes from knowing one tool deeply and testing the other honestly. Over 36 blind duels (where I ran identical tasks through both tools and scored the outputs without knowing which produced what ¹) and hundreds of sessions with both, I’ve found the answer to “which is better?” is genuinely “it depends on the task.”

Claude Code is better for deep refactoring, code review, and programmable governance through its lifecycle hook system; Codex CLI is better for kernel-level sandboxing and cross-tool portability via AGENTS.md. Claude Code enforces safety at the application layer with more than two dozen hook events you wire up yourself, while Codex enforces safety at the OS kernel layer where the model cannot circumvent restrictions. Choose Claude Code for complex multi-file reasoning and the deepest customizable workflows. Choose Codex for maximum isolation and standardized agent instructions that work across 8+ tools.

Current as of June 5, 2026. Both tools ship weekly, so the facts here have a shelf life. As of this revision, Claude Code defaults to Opus 4.8 (CLI v2.1.165) and Codex defaults to GPT-5.5 (CLI v0.137.0). The biggest change since the spring: Codex shipped a real lifecycle-hook system, narrowing what used to be Claude Code’s clearest lead. See Where Each Tool Wins for what that does and doesn’t change.

TL;DR

Claude Code and Codex CLI solve the same problem (AI-assisted development) with fundamentally different architectures. Claude Code governs primarily through hooks: more than two dozen lifecycle event types enforcing policy deterministically at the application layer ². Codex governs primarily through sandboxing: OS-level kernel restrictions below the application layer ³, now paired with its own lifecycle hooks. Neither approach is strictly superior.

Claude Code consistently outperformed Codex in code review and security verification in blind testing. Codex offers genuine advantages in sandboxing, cross-tool portability via AGENTS.md, and cloud task delegation.

Quick decision: Need kernel-level sandboxing or cross-tool AGENTS.md? → Codex. Need the most mature programmable governance hooks or deep refactoring? → Claude Code. Need both safety models? → Run both.

New to both? Start with the Claude Code guide or the Codex guide first. This post assumes familiarity with at least one.

Two Mental Models

Both tools are three-layer architectures, but the layers serve different purposes.

Claude Code:

Reasoning. Claude Code runs the selected Claude model. As of CLI v2.1.154 (May 28, 2026), Opus 4.8 is the default, with high effort by default and an /effort xhigh level for the hardest tasks; Sonnet 4.6 and Haiku 4.5 remain selectable for lighter work ²
Execution. Bash, file operations, git commands, MCP tool calls
Governance. Hooks intercept actions at more than two dozen lifecycle points ²; permissions gate scope

Codex:

Model. GPT-5.5 (launched April 23, 2026) is the default: 400K context in Codex, 1M in the API, $5 / $30 per MTok, 82.7% on Terminal-Bench 2.0 (state of the art at release). GPT-5.5-pro covers the highest-effort tier; the smaller GPT-5.4 mini still handles lower-latency subagent work ⁴
Sandbox. OS-level kernel enforcement (Seatbelt on macOS, Landlock + seccomp on Linux) ³
Approval. Three policies (untrusted, on-request, never) gate mutations before execution ⁵

The critical difference is where governance primarily lives. Claude Code’s center of gravity is the application layer; hooks are programs you write that intercept specific events. Codex’s center of gravity is the kernel layer; the operating system prevents disallowed operations regardless of what the model attempts. Both tools now have hooks, but the architectures still lead with different defaults.

Why this distinction matters: Application-layer governance is programmable. You can encode business logic, run linters, validate schemas, anything expressible in code. Kernel-layer governance is escape-proof. The model cannot circumvent restrictions because the OS denies the syscall before it reaches the application. Every safety architecture trades expressiveness for strength, and these two tools sit at opposite ends of that spectrum.

Configuration Philosophy

Claude Code uses JSON. Codex uses TOML. Both support hierarchical scoping. The philosophies differ in how they think about context-switching.

Claude Code: Layered configuration

// ~/.claude/settings.json (user-level)
{
  "permissions": {
    "allow": ["Bash(git *)"],
    "deny": ["Bash(rm -rf *)"]
  }
}

// .claude/settings.json (project-level, inherits user)
{
  "permissions": {
    "allow": ["Bash(npm test)"]
  }
}

Claude Code resolves settings from multiple layers: managed settings (highest priority) → command line → local project → shared project → user defaults ⁶. Memory files (CLAUDE.md) follow their own scoping: user → project → local. Skills and hooks add additional layers. The flexibility is powerful but the active configuration isn’t visible from any single file; you piece it together by reading the hierarchy.
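To make the "piece it together by reading the hierarchy" point concrete, here is a minimal Python sketch of first-match resolution across the five layers listed above. It is a simplification under stated assumptions: the layer names are labels I chose, the sample keys are illustrative, and real settings such as permission lists may be merged across layers rather than overridden.

```python
from typing import Any, Dict, Optional, Tuple

# Highest priority first, following the order given in the text.
# The layer names are hypothetical labels, not identifiers from Claude Code.
LAYER_ORDER = ["managed", "command_line", "local_project", "shared_project", "user"]


def resolve_setting(
    layers: Dict[str, Dict[str, Any]], key: str
) -> Optional[Tuple[str, Any]]:
    """Return (layer_name, value) from the highest-priority layer defining key."""
    for name in LAYER_ORDER:
        settings = layers.get(name, {})
        if key in settings:
            return name, settings[key]
    return None  # no layer defines this key


# Illustrative values only; assumed for the example.
layers = {
    "user": {"model": "sonnet", "theme": "dark"},
    "shared_project": {"model": "opus"},
}

print(resolve_setting(layers, "model"))
print(resolve_setting(layers, "theme"))
```

Returning the layer name alongside the value is the useful part: it answers "where did this setting come from?", which no single file can tell you.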

Codex: Profiles with explicit switching

# ~/.codex/config.toml
model="gpt-5.5"
approval_policy="on-request"

[profiles.deep-review]
model="gpt-5.5-pro"
approval_policy="never"

[profiles.careful]
approval_policy="untrusted"

codex --profile careful "Review this PR"
codex --profile deep-review "Audit this module"

Codex profiles let you switch between configurations with a flag ⁷. No layer resolution to reason about; the active config is always explicit. For teams standardizing on approval policies, this is simpler to audit. Profiles have graduated from experimental to a first-class managed surface: --profile is now the primary selector across the CLI, TUI permissions, and sandbox flows, and named permission profiles support inheritance, list APIs, and a managed requirements.toml for org policy ⁷.

Safety Models

Safety is the deepest architectural divergence between the tools.

Claude Code: Deterministic hooks at the application layer

Hooks intercept actions before they execute. A PreToolUse hook on Bash can inspect every command and block dangerous patterns ²:

# Hook: git-safety-guardian (PreToolUse:Bash)
# Assumes $tool_input already holds the command string taken from the hook's input.
if echo "$tool_input" | grep -q "push.*--force.*main"; then
  echo '{"decision": "block", "reason": "Force push to main blocked"}'
fi

The strength: hooks are programs. You can encode arbitrarily complex safety logic: check file paths, validate JSON, enforce naming conventions, run linters. I run 95 hooks covering everything from credential detection to quality gates.
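Because hooks are ordinary programs, the same guard can be written in any language. The sketch below is a Python version of the force-push check, written under assumptions: it expects the hook input as a JSON object with `tool_name` and `tool_input.command` fields, and it reuses the decision shape from the shell snippet above. Check the hooks reference for the exact input and output contract before relying on it.

```python
import json
import re

# Same pattern as the shell example: a force push that mentions main.
FORCE_PUSH_TO_MAIN = re.compile(r"push.*--force.*main")


def evaluate(payload):
    """Return a block decision for a dangerous Bash command, else None."""
    if payload.get("tool_name") != "Bash":
        return None  # this guard only inspects Bash tool calls
    command = payload.get("tool_input", {}).get("command", "")
    if FORCE_PUSH_TO_MAIN.search(command):
        return {"decision": "block", "reason": "Force push to main blocked"}
    return None


def handle(raw_json):
    """Turn raw hook input text into the text the hook would print."""
    decision = evaluate(json.loads(raw_json))
    return json.dumps(decision) if decision else ""


sample = '{"tool_name": "Bash", "tool_input": {"command": "git push --force origin main"}}'
print(handle(sample))
```

To run it as a real hook you would wire `handle` to standard input and output, for example `print(handle(sys.stdin.read()))`. Keeping `evaluate` as a pure function makes the policy itself unit-testable, which matters once a setup grows to dozens of hooks.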

The weakness: hooks operate at the application layer. In 2025, Check Point Research disclosed CVE-2025-59536, demonstrating that malicious hooks in project configuration files could execute shell commands during Claude Code initialization, before the user saw a consent dialog ¹⁹. Anthropic patched the vulnerability within weeks, but the disclosure validates the architectural concern: application-layer enforcement shares a process boundary with the agent. NVIDIA’s AI Red Team guidance reaches the same conclusion: “hooks and MCP initialization functions often run outside of a sandbox environment, offering an opportunity to escape sandbox controls” ²⁰.

Codex: Kernel-level sandboxing

Codex restricts the agent at the OS level. On macOS, Seatbelt profiles limit filesystem access, network connectivity, and process spawning ³. On Linux, Landlock + seccomp provide equivalent restrictions, with an optional Bubblewrap (bwrap) pipeline available via configuration ³.

# Three sandbox modes
codex --sandbox read-only            # Agent can read but not write
codex --sandbox workspace-write      # Agent writes only in project directory (default)
codex --sandbox danger-full-access   # No restrictions (named to signal risk)

The strength: kernel-level enforcement is below the application. The model cannot escape restrictions by crafting clever commands; the operating system denies the syscall before it executes ³. The danger- prefix on full access mode reflects that removing sandbox restrictions is an exceptional action, not a routine setting.

The weakness: kernel restrictions are coarse. You can allow or deny filesystem writes for a whole scope (nothing, the workspace, or everything), but you can’t say “allow writes to src/ but block writes to config/ unless the change passes a linter.” That fine-grained governance requires application-level logic.
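For contrast, here is what the quoted rule might look like as application-level logic. This is a hypothetical Python sketch, not code from either tool: `allow_write` and the `lint_ok` callback are names I made up, the linter is a stand-in, and denying every path outside src/ and config/ is an assumption of the example.

```python
from pathlib import PurePosixPath
from typing import Callable


def allow_write(path: str, new_content: str, lint_ok: Callable[[str], bool]) -> bool:
    """Allow writes under src/, gate writes under config/ on a linter."""
    parts = PurePosixPath(path).parts
    # Reject empty, absolute, or parent-escaping paths outright.
    if not parts or parts[0] == "/" or ".." in parts:
        return False
    top = parts[0]
    if top == "src":
        return True
    if top == "config":
        return lint_ok(new_content)
    return False  # assumption: everything else is denied by default


def looks_like_json_object(text: str) -> bool:
    """Stand-in linter for the example; a real gate would call an actual tool."""
    stripped = text.strip()
    return stripped.startswith("{") and stripped.endswith("}")


print(allow_write("src/app.py", "print('hi')", looks_like_json_object))
print(allow_write("config/app.json", "not json", looks_like_json_object))
```

A sandbox cannot express the `lint_ok` branch because the decision depends on the content of the change, not only on where it lands.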

The tradeoff is real. Hooks provide granular, programmable safety but weaker boundaries. Sandboxing provides stronger boundaries but coarser control. A quick decision heuristic:

Internal trust, external code: Use Codex with read-only sandboxing when reviewing PRs from unknown contributors. The kernel prevents file modification regardless of what the model attempts.
Trusted code, policy enforcement: Use Claude Code hooks when you trust the codebase but need to enforce organizational standards: commit message formats, credential scanning, linting gates.
Both concerns: Run both. Use Codex for the initial safety boundary, then switch to Claude Code for governance-heavy review.
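The heuristic above reduces to two yes/no questions. A small Python sketch, with the function name and return strings chosen only for illustration:

```python
def recommend(untrusted_code: bool, needs_policy_enforcement: bool) -> str:
    """Map the two questions from the heuristic to a starting point."""
    if untrusted_code and needs_policy_enforcement:
        return "both: Codex sandbox first, then Claude Code hooks"
    if untrusted_code:
        return "Codex with --sandbox read-only"
    if needs_policy_enforcement:
        return "Claude Code hooks"
    return "either tool"


print(recommend(untrusted_code=True, needs_policy_enforcement=False))
```

The fallthrough case is my addition: when neither concern applies, the safety model is not the deciding factor and the other sections of this comparison matter more.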

Extensibility

Both tools support customization, but maturity varies by mechanism.

Mechanism	Claude Code	Codex
Project instructions	CLAUDE.md (Claude-only)	AGENTS.md (cross-tool standard, 60K+ projects) ⁸
Lifecycle hooks	More than two dozen event types, deepest ecosystem ²	Real lifecycle hooks (`AfterAgent`, `AfterToolUse`) with a `/hooks` TUI browser; extensions observe subagent/tool/turn lifecycle ⁹
Skills/commands	Skills + slash commands	Skills + slash commands
Subagent delegation	Explicit Task tool plus dynamic workflows orchestrating tens to hundreds of agents via `/workflows` ¹⁰	Multi-agent tools (v2 runtime), max 6 concurrent by default ²¹
MCP integrations	STDIO + HTTP (10,000+ public servers) ¹¹	STDIO + HTTP, OAuth for streamable HTTP servers
Cloud delegation	None native	Cloud tasks (experimental: `codex cloud exec`) ¹²
Surfaces	CLI, VS Code, JetBrains	CLI, desktop app, IDE extension, cloud, Chrome extension ¹⁶

Where Claude Code leads: Hook depth. The lifecycle system spans PreToolUse, PostToolUse, UserPromptSubmit, SessionStart, SessionEnd, Stop, StopFailure, SubagentStart, SubagentStop, PreCompact, PermissionRequest, PermissionDenied, TaskCreated, TaskCompleted, CwdChanged, FileChanged, MessageDisplay, and more. That is more than two dozen events and still growing ². Codex now has hooks too, but Claude Code’s catalog is wider and more battle-tested, and it pairs with Stop-hook additionalContext for steering and dynamic /workflows for large agent fan-outs. If you need to enforce quality gates, detect credential leaks before commits, or inject context automatically across many event types, Claude Code’s hook architecture is the more mature option.

Where Codex closed a gap: Hooks are no longer exclusive to Claude Code. The community had been asking for expanded hook events for most of 2025 ¹⁸, and Codex delivered a real lifecycle-hook system: AfterAgent and AfterToolUse events, a /hooks TUI to discover and toggle them mid-session, and an extension API where extensions observe subagent start/stop, tool execution, and turn metadata with async approval ⁹. The old framing (Claude Code has hooks, Codex has a single fire-after-the-fact notification) is out of date. The honest 2026 statement: both tools have programmable governance hooks; Claude Code’s are broader and more mature, while Codex’s run alongside the strongest sandbox in the category.

Where Codex leads: Cross-tool portability and surfaces. AGENTS.md is an open standard governed by the Agentic AI Foundation under the Linux Foundation ¹³, adopted by 60,000+ projects ⁸. The same instruction file works in Codex, Cursor, GitHub Copilot, Amp, Windsurf, and Gemini CLI (with configuration) ¹⁴. CLAUDE.md is powerful but locked to Claude Code. Codex also runs across five surfaces (CLI, desktop app, IDE extension, cloud, and a Chrome extension that rides alongside normal browsing ¹⁶), and codex cloud exec offloads long-running work to OpenAI infrastructure and returns diffs ¹², a workflow Claude Code doesn’t offer natively.

Where Each Tool Wins

Based on 36 blind duels, where I sent identical prompts to both tools and scored outputs blind, and on daily production use:

Category	Claude Code	Codex	Ties
Code review & security	8	4	0
Feature implementation	5	5	2
Refactoring	4	3	1
DevOps & CI/CD	1	3	0

The full methodology and per-duel scoring are in The Blind Judge. These results predate Opus 4.8 and GPT-5.5, so treat them as directional rather than as a current scoreboard. They capture each tool’s shape (Claude Code stronger on review and reasoning, Codex stronger on DevOps and isolation), which has held across model upgrades, not the exact margins on today’s models. I’ll re-run the duels on the current defaults; until then, the category tendencies are the durable signal.
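As a quick sanity check on the table, the rows can be tallied in a few lines of Python. The numbers are copied from the table above; nothing else is assumed.

```python
# (Claude Code wins, Codex wins, ties) per category, copied from the table.
RESULTS = {
    "Code review & security": (8, 4, 0),
    "Feature implementation": (5, 5, 2),
    "Refactoring": (4, 3, 1),
    "DevOps & CI/CD": (1, 3, 0),
}

claude_wins = sum(row[0] for row in RESULTS.values())
codex_wins = sum(row[1] for row in RESULTS.values())
ties = sum(row[2] for row in RESULTS.values())
total = claude_wins + codex_wins + ties

print(total)  # 36
print(claude_wins, codex_wins, ties)  # 18 15 3

for category, (claude, codex, tied) in RESULTS.items():
    decided = claude + codex
    print(f"{category}: Claude Code won {claude} of {decided} decided duels")
```

The rows sum to the 36 duels cited throughout, and the overall split (18 to 15 with 3 ties) is much closer than the review category alone suggests.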

Claude Code wins

Code review and security verification. Claude Code won 8 of 12 decided duels in review tasks ¹. The quality philosophy system and evidence gates catch issues that slip through Codex’s more procedural approach.
Governance-heavy workflows. If your workflow requires pre-commit checks, credential scanning, output validation, or quality gates that block before execution, Claude Code’s PreToolUse hook is the mechanism. Codex now has its own lifecycle hooks (AfterAgent, AfterToolUse) ⁹, but they observe after the fact; for pre-execution blocking Codex leans on its sandbox and approval policy rather than a programmable pre-hook. For breadth of event types and inline blocking logic, Claude Code’s catalog is still the more complete governance toolkit.
Complex multi-agent orchestration. Explicit subagent delegation via the Task tool ¹⁰, combined with dynamic /workflows that fan out tens to hundreds of agents in the background and deliberation systems, enables workflows where many specialized agents collaborate with isolated context.
Deep codebase refactoring. Opus excels at holding architectural context across long sessions. The context engineering patterns that govern Claude Code’s hook/skill/rules hierarchy translate directly to how the model reasons about large codebases.

Codex wins

Sandbox-critical environments. If you’re running an AI agent against untrusted code, processing external PRs, or operating in a CI/CD pipeline where you need hard guarantees about filesystem and network access, Codex’s kernel-level sandboxing is the right tool ³. Application-level hooks cannot provide the same guarantee.
Cross-tool teams. If your team uses multiple AI coding tools, AGENTS.md gives you one instruction file that works in Codex, Cursor, Copilot, Amp, Windsurf, and more ¹⁴. No duplicate maintenance across CLAUDE.md, .cursor/rules, and Copilot instructions.
Cloud async workflows. codex cloud exec delegates tasks to cloud infrastructure and returns diffs ¹². For CI/CD integration or batch processing, this is a workflow Claude Code doesn’t offer natively.
Real-time steering. Codex’s steer mode lets you inject instructions mid-task with Enter (immediate) or queue follow-ups with Tab (next turn) ¹⁵. Claude Code supports follow-up messages but not mid-turn injection.
Surface coverage. Codex spans five surfaces: CLI, desktop app (macOS multi-tasking across parallel worktrees and floating windows), IDE extension (VS Code, Cursor, Windsurf), cloud tasks, and a Chrome extension that works alongside your browsing without taking it over ¹⁶. Claude Code integrates with VS Code and JetBrains ¹⁷ but is CLI-first. If you want one agent that follows you from terminal to editor to browser to cloud, Codex covers more ground.

Running Both

The tools don’t conflict. CLAUDE.md and AGENTS.md coexist in the same repository. Here’s my setup:

my-project/
├── .claude/
│ └── settings.json # Claude Code project config
├── CLAUDE.md # Claude Code instructions
├── AGENTS.md # Codex + Cursor + Copilot instructions
└── codex.md # Codex project config (optional)

A concrete dual-tool workflow: I use Claude Code for daily development (feature implementation, code review, and multi-file refactors where hooks enforce quality gates at every step). When an external contributor opens a PR, I switch to Codex with --sandbox read-only so the untrusted changes are reviewed without write access. When I need a second opinion on an architecture decision, I send the same prompt to both tools and compare the outputs blind, using the blind judge approach.
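The blind comparison step can be as simple as shuffling the two outputs and hiding the tool names behind labels until scoring is done. The Python sketch below shows one way to do that; it is an illustration, not the script used for the duels, and `blind_pair` is a hypothetical helper.

```python
import random
import string


def blind_pair(outputs, rng):
    """Shuffle tool outputs and return (anonymous, key).

    anonymous maps labels like "A" and "B" to output text for scoring;
    key maps the same labels back to tool names for unblinding afterwards.
    """
    tools = list(outputs)
    rng.shuffle(tools)
    labels = string.ascii_uppercase[: len(tools)]
    anonymous = {label: outputs[tool] for label, tool in zip(labels, tools)}
    key = dict(zip(labels, tools))
    return anonymous, key


outputs = {
    "claude-code": "Proposal one: split the module by domain.",
    "codex": "Proposal two: keep one module, extract interfaces.",
}
anonymous, key = blind_pair(outputs, random.Random())

for label, text in anonymous.items():
    print(label, text)  # score these before looking at key
```

Keeping `key` separate from `anonymous` is the whole point: write down the scores first, then unblind.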

The dual-tool approach has empirical support beyond my own testing. Research by Milvus found that adversarial review between multiple AI models increased bug detection from 53% to 80% ²³. A separate study found that iterative Claude-Codex review loops caught 14 issues across 3 rounds that neither tool found alone ²⁴. Neither tool replaces the other; they cover different threat models and task profiles.

Key Takeaways

If you’re choosing a tool:

Start with your safety requirements. Need kernel-level sandboxing? Codex. Need programmable governance hooks? Claude Code.
Consider your team. Multiple AI tools in use? AGENTS.md avoids duplicate instruction maintenance across tools ¹⁴.
Try both on a real task before deciding. The blind judge methodology works for personal evaluation too.

If you’re already invested:

Claude Code users: write an AGENTS.md anyway. It takes 20 minutes and makes your project accessible to Codex, Cursor, and Copilot users.
Codex users: the hooks system has arrived. Browse it with /hooks, wire up AfterAgent/AfterToolUse, and lean on permission profiles plus the sandbox for pre-execution control ⁹. The “Codex has no hooks” assumption you may be carrying from earlier in 2026 is out of date.
Both tools are improving fast. The comparison in this post has a shelf life measured in weeks, not years, which is exactly why it carries a dated revision line.

FAQ

Can I use both tools in the same project?

Yes. CLAUDE.md and AGENTS.md are separate files with no conflicts. Each tool reads its own instruction file and ignores the other. I maintain both in my active projects.

Which tool is better for beginners?

Codex has a lower configuration barrier: three sandbox modes and three approval policies cover most use cases ⁵. Claude Code’s power comes from hooks and skills, which require investment to set up. Start with whichever model (Claude or GPT) you’re already comfortable with.

How do costs compare?

Both use token-based pricing through their respective APIs. Claude Code runs on Anthropic’s pricing; Codex runs on OpenAI’s credit system. Independent benchmarking by Composio found that Codex needed roughly one-half to one-quarter as many tokens for comparable results; on a Figma plugin task, Claude Code used 6.2M tokens versus Codex’s 1.5M ²². Token efficiency doesn’t translate directly to cost (different per-token pricing), but Codex’s lower token consumption is a measurable advantage for budget-constrained workflows.
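To see how token counts and per-token prices interact, here is a small worked calculation in Python. The token counts are the Figma plugin figures quoted above; the prices passed in at the bottom are made-up blended rates used only to show the break-even logic, not real price points for either product.

```python
CLAUDE_TOKENS = 6.2e6  # from the Figma plugin task cited above
CODEX_TOKENS = 1.5e6

# How many times more tokens Claude Code used on that task (about 4.13).
token_ratio = CLAUDE_TOKENS / CODEX_TOKENS


def task_cost(tokens, price_per_mtok):
    """Cost of a task given a blended price per million tokens."""
    return tokens / 1e6 * price_per_mtok


def codex_is_cheaper(claude_price_per_mtok, codex_price_per_mtok):
    """True when Codex's total cost for the task is lower."""
    return task_cost(CODEX_TOKENS, codex_price_per_mtok) < task_cost(
        CLAUDE_TOKENS, claude_price_per_mtok
    )


print(round(token_ratio, 2))
# Hypothetical blended prices, chosen only to illustrate both outcomes.
print(codex_is_cheaper(10.0, 20.0))
print(codex_is_cheaper(10.0, 50.0))
```

On this one task, Codex stays cheaper as long as its blended per-token price is less than about 4.13 times Claude Code's. Above that multiple, the token savings no longer cover the price difference.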

Will AGENTS.md work with Claude Code?

Not currently. Claude Code reads CLAUDE.md; Codex reads AGENTS.md. The formats are similar enough that content translates easily between them, but there’s no automatic cross-reading. Writing both takes minimal effort since the content overlaps.

Which has better IDE integration?

Codex has the wider surface area: a macOS desktop app with multi-tasking and floating windows, an IDE extension for VS Code, Cursor, and Windsurf, and a Chrome extension, all sharing one session model ¹⁶. Claude Code integrates with VS Code via extension and JetBrains via plugin (beta) ¹⁷. Both work well; the choice depends on whether you prefer CLI-first (Claude Code) or a GUI/multi-surface footprint (Codex).

References

1. The Blind Judge: Claude vs Codex in 12 Tasks. Blind evaluation methodology and results.
2. Claude Code Hooks Reference and the Claude Code Changelog. More than two dozen lifecycle event types (and still growing) as of CLI v2.1.165 (June 5, 2026), including PreToolUse, PostToolUse, PostToolUseFailure, UserPromptSubmit, SessionStart, SessionEnd, Stop, StopFailure, SubagentStart, SubagentStop, PreCompact, PermissionRequest, PermissionDenied, TaskCreated, TaskCompleted, CwdChanged, FileChanged, and MessageDisplay. Opus 4.8 became the default model in v2.1.154 (May 28, 2026) with high effort by default and an /effort xhigh level.
3. Codex Security Documentation. Seatbelt (macOS), Landlock + seccomp (Linux), three sandbox modes.
4. Codex Changelog and OpenAI model docs. GPT-5.5 (launched April 23, 2026) is Codex’s default: 400K context in Codex, 1M in the API, $5 input / $30 output per MTok, 82.7% on Terminal-Bench 2.0 (state of the art at release). GPT-5.5-pro (1M/1M, high effort) covers the highest-effort tier, and the smaller GPT-5.4 mini provides 400K context for lower-latency subagent work. Verified against the Codex CLI guide and OpenAI docs, current to June 5, 2026.
5. Codex Configuration Reference. Approval policies: untrusted, on-request, never.
6. Claude Code Settings. Five-layer configuration cascade.
7. Codex Advanced Configuration. Profiles (experimental).
8. Linux Foundation AAIF Announcement. AGENTS.md adopted by 60,000+ projects.
9. Codex Changelog and Codex Advanced Configuration. Codex shipped a lifecycle-hook system: AfterAgent and AfterToolUse hook events (existing since v0.99.0+), a /hooks TUI to browse and toggle active hooks without leaving the session (v0.129.0+), and an extension API where extensions observe subagent start/stop, tool execution, turn metadata, and async approval/turn processing (v0.133.0+). The earlier notify / agent-turn-complete notification remains available. Codex hooks observe after the fact; pre-execution blocking is handled by the sandbox and approval policy. Verified against the Codex CLI guide, current to June 5, 2026.
10. Claude Code Subagents. Task tool for explicit subagent spawning.
11. Anthropic MCP Foundation Announcement. 10,000+ active public MCP servers.
12. Codex CLI Reference: Cloud Tasks. codex cloud exec for delegating to cloud infrastructure.
13. OpenAI Co-founds the Agentic AI Foundation. AGENTS.md donated to AAIF under the Linux Foundation.
14. AGENTS.md. Cross-tool compatibility: Codex, Cursor, Copilot, Amp, Windsurf, Gemini CLI.
15. Codex CLI Features: Steer Mode. Enter for immediate steering, Tab for next-turn follow-up.
16. Introducing the Codex App and the Codex Changelog. Codex spans five surfaces as of June 2026: CLI, macOS desktop app (multi-tasking across parallel worktrees, floating windows), IDE extension (VS Code, Cursor, Windsurf), cloud tasks, and a Chrome extension that runs alongside normal browsing.
17. Claude Code IDE Integrations. VS Code extension and JetBrains plugin (beta).
18. Codex GitHub Issue #2109. Community request for expanded hook events.
19. Check Point Research, Caught in the Hook: RCE and API Token Exfiltration Through Claude Code Project Files. CVE-2025-59536: malicious hooks executing before user consent.
20. NVIDIA AI Red Team, Practical Security Guidance for Sandboxing Agentic Workflows. Five residual vulnerabilities in agentic coding tools.
21. Codex Sample Configuration. agents.max_threads = 6 default, configurable.
22. Morph/Composio, Codex vs Claude Code: Benchmarks, Agent Teams & Limits Compared. Token consumption benchmarks across identical tasks.
23. Milvus/Zilliz, AI Code Review Gets Better When Models Debate. 53% to 80% bug detection via adversarial debate.
24. Aseem Shrey, I Made Claude and Codex Argue Until My Code Plan Was Perfect. 14 issues caught in 3 rounds of iterative review.

URL: https://blakecrosley.com/blog/claude-code-vs-codex

⇱ Claude Code vs Codex CLI 2026: Decision Reference