Nate Herk | AI Automation · YouTube · 26:34

100 Hours Testing Claude Code vs ChatGPT Codex (honest results)

A 26-minute live benchmark that runs three real builds side-by-side and reads the session logs to settle the Claude Code vs Codex debate with actual numbers.

Posted

May 26th 2026


Duration

26:34

Format

Tutorial

educational

Channel


Nate Herk | AI Automation

§ 01 · The Hook

The bait, then the rug-pull.

For months, Claude Code was the only coding agent worth talking about. Then OpenAI shipped Codex -- and the comparison videos started. This one actually runs the tests.

§ · Chapters

Where the time goes.

00:00 – 01:50

01 · Hook and thesis

OpenAI comeback framing, promise of honest head-to-head across features, price, and three specific use cases.

01:50 – 04:00

02 · Claude Code overview

Task delegation, file editing, customization via hooks/skills/sub-agents. Desktop, terminal, web versions. Opus/Sonnet/Haiku models.

04:00 – 08:00

03 · Codex overview

GPT family models, gpt-codex-spark in preview. WorkTrees as the defining architectural choice. Included in every ChatGPT paid plan.

08:00 – 11:19

04 · Shared features

Both tools: local code editing, desktop app, VS Code extension, CLI, MCP, skills format, plugin marketplace, cloud delegation, hooks, sub-agents.

11:19 – 15:00

05 · Claude Code advantages

30 hook events vs 6. Auto-delegating sub-agents. /ultra-plan, /ultra-review, /loop. Channels integration. Agent SDK. Enterprise auth (Bedrock, Vertex, Foundry).

15:00 – 18:00

06 · Codex advantages

Native WorkTrees per thread. In-app browser. Computer-use QA. @Codex GitHub PR integration. /goal. GPT image generation. OpenClaw/Hermes compatibility.

18:00 – 22:00

07 · Pricing and context windows

Claude: Pro $20, Max 5x $100, Max 20x $200. Codex: included in ChatGPT free through Pro $200. 1M token context (Claude) vs 256K (Codex).

22:00 – 25:12

08 · Live benchmark intro and results

Three identical prompts: research report PDF, landing page (Glaido), marketing analytics dashboard. Claude wins landing page and dashboard design; Codex wins PDF efficiency.

25:12 – 28:34

09 · Benchmark metrics deep-dive

Raw numbers from JSONL logs. Codex: 25:52, 6.19M tokens, $7.11. Claude: 14:51, 5.8M tokens, $11.05. Output tokens always higher for Claude. Efficiency scatter plot.

28:34 – 31:40

10 · Analysis and decision framework

Use Claude for front-end, deep planning, custom workflows, enterprise auth. Use Codex for research tasks, structured documents, /goal, GitHub PRs, image generation. Split workflow is valid.

31:40 – 26:34

11 · Portability and closing

Projects are files in folders -- not locked to either tool. CLAUDE.md becomes AGENTS.md. Closing thesis: which tool is best for this specific task.
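
The portability point above can be sketched in a few lines of Python. This is a minimal sketch, assuming the project keeps its agent instructions in a CLAUDE.md file at the project root and that a plain copy named AGENTS.md is a workable starting point. The function name mirror_instructions is hypothetical, and the copied instructions may still need hand editing for the other tool.

```python
import shutil
from pathlib import Path


def mirror_instructions(project_dir: str,
                        src: str = "CLAUDE.md",
                        dst: str = "AGENTS.md") -> Path:
    """Copy the project's instruction file under the name the other tool reads.

    Hypothetical helper: it only copies the file; it does not translate
    any tool-specific syntax inside it.
    """
    root = Path(project_dir)
    source = root / src
    target = root / dst
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found")
    if target.exists():
        # Refuse to overwrite instructions that already exist.
        raise FileExistsError(f"{target} already exists")
    shutil.copyfile(source, target)
    return target
```

Calling mirror_instructions(".") from the project root would leave both files side by side, so either tool can be pointed at the same folder.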

§ · Storyboard

Visual structure at a glance.

00:00 · hook · hook -- growth chart

00:46 · promise · host intro

01:40 · value · Claude Code UI card

06:45 · value · Claude vs Codex cards

11:19 · value · Claude strengths section

15:00 · value · Codex strengths section

22:00 · value · Pricing section

25:12 · value · Codex vs Claude lab header

28:34 · value · benchmark aggregate totals

29:40 · value · Who finished first chart

30:45 · value · efficiency scatter plot

31:40 · cta · final scorecard

§ · Frameworks

Named ideas worth stealing.

31:40 list

Task-Fit Decision Matrix

Claude Code: complex front-end, visual design, deep planning, auto-delegation, hooks/skills/channels, Agent SDK, enterprise auth
Codex: research-heavy tasks, structured PDFs/reports, WorkTree-native shipping, /goal for long-running work, GitHub PR integration, image generation

A task-type decision rule rather than a blanket preference for one tool.

Steal for Any framework for choosing between AI coding tools on a per-task basis
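
The matrix above can be written down as a lookup table. This is a minimal sketch: the task labels are paraphrased from the two lists above, and the names TASK_FIT and pick_tool plus the "either" fallback are illustrative assumptions, not something shown in the video.

```python
# Task-type -> tool mapping, paraphrased from the decision matrix.
TASK_FIT = {
    "front-end": "Claude Code",
    "visual design": "Claude Code",
    "deep planning": "Claude Code",
    "custom workflows": "Claude Code",
    "enterprise auth": "Claude Code",
    "research": "Codex",
    "structured report": "Codex",
    "long-running goal": "Codex",
    "github pr": "Codex",
    "image generation": "Codex",
}


def pick_tool(task_type: str) -> str:
    """Return the suggested tool for a task type, or "either" if unlisted."""
    return TASK_FIT.get(task_type.strip().lower(), "either")


if __name__ == "__main__":
    for task in ("Front-end", "research", "database migration"):
        print(f"{task}: {pick_tool(task)}")
```

The fallback matters as much as the table: for any task the matrix does not name, the rule gives no preference.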

18:20 concept

Output Token Efficiency as Session-Longevity Proxy

Output tokens cost more and burn session limits faster. Codex writes 2-5x fewer output tokens than Claude per equivalent task. This explains why Claude users report hitting limits faster -- and it is measurable from JSONL logs.

Steal for Any explanation of why AI session limits feel inconsistent across tools
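
The arithmetic behind this claim can be checked with a short Python sketch. It uses only the rounded figures quoted in the video: the per-run output-token counts as read aloud, and the aggregate totals from chapter 09. The run labels are placeholders, because the video does not say which count belongs to which build, and the names RUNS, TOTALS, output_ratios, and usd_per_million_tokens are hypothetical.

```python
# (label, Claude output tokens, Codex output tokens), in the order quoted.
# Labels are placeholders: the mapping to specific builds is an assumption.
RUNS = [
    ("run 1", 84_000, 18_000),
    ("run 2", 80_000, 20_000),
    ("run 3", 41_000, 16_000),
]

# Aggregate totals across all three runs, as listed in chapter 09.
TOTALS = {
    "Codex": {"seconds": 25 * 60 + 52, "tokens": 6.19e6, "usd": 7.11},
    "Claude Code": {"seconds": 14 * 60 + 51, "tokens": 5.8e6, "usd": 11.05},
}


def output_ratios(runs):
    """Claude-to-Codex output-token ratio for each run."""
    return {label: claude / codex for label, claude, codex in runs}


def usd_per_million_tokens(total):
    """Blended API-equivalent cost per million tokens (all token types)."""
    return total["usd"] / (total["tokens"] / 1e6)


if __name__ == "__main__":
    for label, ratio in output_ratios(RUNS).items():
        print(f"{label}: Claude wrote {ratio:.2f}x the output tokens")
    for tool, total in TOTALS.items():
        print(f"{tool}: ${usd_per_million_tokens(total):.2f} per 1M tokens, "
              f"{total['seconds']} s total")
```

On these figures the per-run ratio falls between roughly 2.5x and 4.7x, and Claude Code's blended cost per million tokens is higher even though its total token count is slightly lower.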

§ · Quotables

Lines you could clip.

31:55

"It is not a matter of which tool is best, it is a matter of which tool is best for the specific use case in front of you."

Clean thesis statement, standalone quotable → TikTok hook

11:19

"Claude Code right now has 30 different hook events. Codex right now has about six. If you want to fire automated behavior into every part of the workflow, Claude Code gives you about 5x the granularity."

Concrete number comparison, instantly shareable → IG reel cold open

28:35

"Claude has this way of planning the task tightly before it executes. And Codex tends to just grind through more iterations, which is why the input tokens stack up on its side."

Explains the data in plain English -- no setup needed → newsletter pull-quote

§ · CTA Breakdown

How they asked for the click.

26:00 link

"I broke all of this down into a resource guide that you can access for completely free, and you can find that in my free Skool community."

Verbal mention only, no overlay shown. Low-friction -- no product pitch, just a free community link.

§ 04 · The Script

Word for word.

HOOK opening / re-engagement · CTA the pitch · metaphor · story

00:00HOOKThis could be one of the biggest comebacks in the AI space. Over the past years, OpenAI went from being the biggest AI company to becoming something kinda mid, and people who used AI to code basically forgot OpenAI existed, thanks to tools like Claude Code. But over the past few weeks, I've seen a lot of videos saying that OpenAI Codex is actually better than Claude Code.

00:15HOOKSo I've been trying Codex for the past month, and honestly, the results have been really impressive. But is Codex actually better than Claude Code? Today, we're gonna answer that question by comparing them on features, price, and three specific use cases to see which one is better.

00:26HOOKAnd at the end, I'm gonna give you my honest opinion on which tool you should be using right now. So let's get into it. So real quick.

00:31HOOKIf you've never used Claude Code before, here's the gist. Claude Code is Anthropic's coding agent, Anthropic being the company behind Claude. The way it works is pretty simple.

00:38HOOKYou give it a task, like fix this bug or build me a new feature or review this pull request. Claude Code goes off. It plans the work.

00:44HOOKIt opens up your project. It edits your files, runs the commands, and it asks you for permission along the way based on your settings. And you can use it pretty much anywhere.

00:50HOOKThere's a terminal version. There's a VS Code extension, and there's a full desktop app for Mac and Windows. And they've also got a web version in research preview where you can just run sessions from any browser or even your phone.

00:59HOOKUnder the hood, it's running Opus, which is Anthropic's currently smartest model, or it can run Sonnet or Haiku as well. Opus and Sonnet are top tier for coding work. Now the part I really like about Claude Code is how customizable it is.

01:09It's less of a tool, and it's more of a workflow system that you shape into your own engineering rituals and automations. You've got skills that you can drop in. There's hooks, which are basically automated triggers that fire whenever something happens in your session.

01:20And then you've got things like sub agents, which are specialist agents that Claude can spin up on its own to handle specific kinds of work. And we're gonna dive deeper on all of this in just a sec. And now Codex is OpenAI's coding agent.

01:30And quick clarification, this is not the old Codex model from 2021 that retired. The new Codex is a full agentic system, very similar shape to Claude Code, but with a few different opinions on how the work should flow. You can use Codex in, once again, terminal, desktop app for Mac and Windows, and a VS Code extension that works also with other IDEs like Cursor and in their cloud version at chatgpt.com/codex.

01:50The models behind Codex are the GPT family of models and a gpt-codex for coding-specific work, and a faster, smaller one called gpt-codex-spark. And that one is still in research preview for Pro users at the moment. And the thing that stands out about Codex is sort of like the unified shipping vibe, where Claude Code feels like a workflow system that you're building out.

02:08Codex feels more like an opinionated machine designed to take you from agent is done all the way to the code is shipped to production. A good example here is the built-in Git worktrees. Those are basically just like separate working copies of your project so that multiple tasks can run in parallel without overriding each other.

02:23So the whole shape is tighter and more end to end out of the box. So we'll get into the specifics of what each tool actually does best in just a minute. And by the way, Codex is also included in every paid and free ChatGPT plan right now.

02:33So Free, Plus, Pro, Business, Enterprise. If you're using ChatGPT, you've also got access to Codex. Whereas Claude Code, you wouldn't be able to use for free.

02:40Now before we get into where they're different, I want to plant the thesis of this video early because it's very important. It's which tool is best for the specific use case that is currently sitting in front of you. So that's what I'm going to be discussing today.

02:51And one more thing I wanna plant on top of that, after spending a lot of time with both of these tools, I've noticed that they each have kind of a different feel. So Claude Code to me feels more creative. It feels like it's better at brainstorming.

03:00It's better at like pushing back when I'm going down the wrong path. Whereas Codex feels really good at just, like, following my instructions and doing what I want. And honestly, it's also been sharper at, like, reviewing code and reviewing my plan and, like, finding bugs or gaps.

03:12So none of that is backed by, like, hard specific metrics or KPIs. It's just the gut feeling that I get after spending lots of hours in both tools. But I do think it matters, and I'm gonna come back to that at the end.

03:22With that out of the way, let's talk about how much these two have in common. Because honestly, after using them both heavily, the overlap is way bigger than most comparison videos admit. Both of them edit code on your local machine.

03:31Both have desktop apps. Both have VS Code extensions. They both run in the command line. They both support MCP, which is the open protocol for hooking up external tools to your AI.

03:38They both support CLIs as well. They've got the same skills format where you drop a markdown file with YAML front matter into a folder, and agents can read through those, pick them up, and invoke them. Both tools have a plugin marketplace where you can browse and install community tools.

03:50They both have a cloud delegation option where you can fire off a task and walk away, and they also both have hooks and sub agents. So the question stops being, does my tool have feature x? The real question becomes, which one gives me the better workflow for the way that I actually want to work?

04:02And that's where they start to diverge, which is what we're going to break down next. Let's talk about what each of these tools is uniquely better at. We'll start with Claude Code.

04:09The thing that sets it apart, in my opinion, is the depth of the customization. Claude Code right now has 30 different hook events. Hooks, again, are automated triggers that fire when something happens, like when you submit a prompt or when a tool runs or when a session starts or when a task gets created.

04:21Codex right now has about six hook events. So if you wanna fire automated behavior into every part of the agent's workflow, Claude Code gives you about 5x the granularity there. The next one is auto delegating sub agents.

04:31Both tools have sub agents, but Claude Code can spawn them on its own when a task needs it. Codex's docs specifically say that Codex won't spawn sub agents unless you explicitly ask. So with Claude, you can just give it a complex task, it'll decide on its own to spin up a planner agent and maybe an explorer agent and a code reviewer agent, whatever's needed for that task.

04:47And that's really powerful by default. And then there's two of my favorite slash commands, both still in research preview, but we have slash ultra plan and slash ultra review. Slash ultra plan takes the planning phase, and it ships it to a cloud Claude Code session, and it lets you review the plan in your browser with inline comments, and then you can send it back to your terminal for the actual execution.

05:05Ultra review spins up, once again, kind of like a cloud instance with multiple reviewer agents, and it gives you a deep multi agent code review with reproduced findings. You get three free runs of that on Pro and Max, and then after that, it's billed per run. And they're both insanely powerful for higher stakes work.

05:19Slash loop is another big one that I love. You can give Claude Code a recurring prompt that runs on a schedule, or you can run it without a prompt, and Claude will go into maintenance mode and just keep your project tidy. So you could set up a loop to run a certain skill every single, like, twenty minutes, and it will just loop through.

05:32It handles unfinished tasks, addresses comments on your PRs, fixes merge conflicts, stuff like that. It's super, super useful. A couple more that don't get talked about enough.

05:39The first one is channels. That's an MCP server that pushes external events from Telegram or Discord or even iMessage into a running Claude Code session. So you can literally text your agent from your phone.

05:48And then you've also got, like, dispatch or remote control. Then we have the Claude Agent SDK, which is the same engine that powers Claude Code exposed as a Python and TypeScript SDK so you can build your own agents on top of it. And we have enterprise auth, which probably doesn't matter to you if you're solo, but it is a big deal for teams.

06:04Claude Code supports Bedrock, Vertex AI, and Microsoft Foundry, which are the enterprise cloud platforms that big companies use to host their AI. Codex just doesn't have that level of auth flexibility at the moment. So if you want a customizable coding system that you can shape into your own workflow, Claude Code is in a class of its own right now.

06:18Okay. So flipping the script, what does Codex actually do better than Claude Code? The first thing is the whole unified workflow shape.

06:23Codex is built around WorkTrees from the ground up. Every thread you spin up can run in its own WorkTree without bumping into the main version of your project. Combine that with the fact that you can review, stage, commit, and push from the same desktop app, and you've basically got a full shipping pipeline in one tool.

06:36Obviously, Claude Code allows you to work with WorkTrees as well. Codex just does a really good job of making that feel more native. The second thing is the in-app browser.

06:43So Codex inside the desktop app has a built-in browser that you can use to actually, like, look at the work that your agent just shipped. You can leave visual comments right on the page. If you've ever finished a feature and then you had to switch over to Chrome to check it out, this is just a much cleaner, universal experience.

06:56Now to be fair, Claude Code also has a feature called Claude in Chrome that gives you another type of functionality, but it just works differently. And Claude in Chrome is a browser extension that runs inside of Chrome itself, whereas Codex put the browser right inside the desktop app. So the capability is there on both sides.

07:08Codex just keeps everything in one clean window. And it just does it a little bit better when you use the desktop app than the way that Claude Code does it. But I think both of these platforms are improving their desktop app experience every day.

07:18Now the other one that's pretty big is computer use, which both tools once again have, but Codex's is really sharp. They've got this whole product QA use case where you tell Codex, QA the app I just built, and Codex will open it up in the app. It will click around.

07:30It will find bugs, and it will log them with, like, you know, severity ratings, expected versus actual behavior, the steps to reproduce, and a triage summary. And that's a really polished way to use computer use, and it's something that I haven't seen Claude Code build out as a first party flow yet. But especially when you realize that you can connect Codex and Claude Code to any of these, like, external tools, you can do a lot of the same functionality with both tools.

07:48Codex also has a GitHub integration, which is pretty interesting. I mean, obviously, both tools can review pull requests and stuff, but Codex has, like, an @Codex mention model, and it's pretty smooth. You tag @Codex in a PR comment or an issue, and Codex spins up a cloud sandbox to handle that.

08:01There's basically zero setup involved. You just tag it, and it runs. Now this fifth thing in Codex is called slash goal, which is experimental and gated behind a feature flag, but anyone can actually go turn on that flag and use slash goal.

08:11This is for the work that's too big for a single prompt, but smaller than an open ended backlog. You define a goal with a verifiable stopping condition, and Codex will just grind away until it's actually finished. And this could be, like, multiple, multiple hours.

08:22And, of course, as with pretty much all these features I'm talking about, you can do the same thing in Claude Code. You could maybe use the slash loop or you could use something like the Ralph Wiggum loop or maybe like Karpathy's auto research. So the capability is there on both sides, but Codex has just packaged this into one clean native slash command where in Claude Code, you're stitching together a few different tools.

08:40Alright. So you literally can't make this stuff up. As soon as I finished recording that video, Claude Code just released slash goal.

08:46So now we have slash goal natively within Codex and Claude Code. So just wanna give you guys a quick update. Back to the video.

08:52And then the last one, because Codex is built by OpenAI, you get access right inside of Codex to GPT Image 2. And GPT Image 2 is one of the strongest image generation models out there right now. So if you're building a project that needs image generation, whether that's a game or a product mockup or maybe even a website, Codex can actually just generate those images for you right inside the app, whereas Anthropic doesn't actually have an image generation model at all.

09:11You would have to hook it up into some sort of third party tool. Okay. This next one is interesting because it's where the two companies really diverge philosophically.

09:18So a lot of you have probably seen third party tools popping up like OpenClaw or Hermes agent, which is the open source agent that lets you rep coding agents. They kinda blew up because they felt proactive. They have native crons.

09:28They have heartbeats. They can still use skills and stuff like that. The cool thing about OpenClaw is that you can actually sign in with your ChatGPT subscription and just route your codex usage through it.

09:36So you don't have to pay separately for an OpenAI API key, which would be way more expensive. You can also do this with Hermes Agent. Sam Altman himself put out a tweet on May 2 saying that you can now sign in to OpenClaw with your ChatGPT account and use your subscription there.

09:47So OpenAI's CEO is publicly endorsing this, and that's a really permissive stance from OpenAI, and I bet they saw a massive spike in ChatGPT subscriptions after that announcement. Anthropic's stance is basically the opposite. The Claude Agent SDK page on their docs literally says, unless previously approved, Anthropic does not allow third party developers to offer claude.ai login or rate limits for their products, including agents built on top of the Agent SDK.

10:07So in plain English, using your Claude subscription inside of a third party tool like OpenClaw or Hermes isn't allowed unless Anthropic specifically approves you. And that's one important thing to keep in mind because it changes the economics of your decision. So if you live inside of these third party agent tools a lot, then you're probably gonna wanna go with ChatGPT Codex.

10:22Alright. So let's talk about pricing real quick because this is actually a big part of the decision. Both tools are included with their parent subscription, which means you don't need to mess with a separate API key to start using either one.

10:32So for Claude, you've got Claude Pro at $20 a month, which includes Claude Code and the rest of Claude. Then you've got Claude Max 5x at $100 a month, which gives you 5x the Pro usage, and then Claude Max 20x at $200 a month for 20x usage. Pro is definitely enough to play around with Claude Code, but if you're using it seriously every day, you're going to want at least one of the Max plans.

10:49For Codex, it's included with ChatGPT Free and then also Plus at $20 a month all the way up to ChatGPT Pro at $200 a month for basically unlimited use. Not really, but it feels like it. But right now, OpenAI has a promo running where the $100 tier on OpenAI's side gets you 2x Codex usage through May 31.

11:04If you're going to test out Codex heavily, that $100 tier is one of the best values in the AI coding agent market right now. Now on context windows. Opus and Sonnet can run in Claude Code with 1,000,000 tokens of context window.

11:15The latest GPT model in Codex runs at about 256,000 tokens of context window. Now the part that I wanna flag that's more important than, like, just the raw price of your subscription is that a lot of people right now are complaining that they're hitting their Claude Code limits, whether that be session or weekly, way faster than they used to.

11:31And I've been hearing this from my community for weeks and on X for weeks. So one of the things I tracked in the live test coming up is the actual token usage on each side. And honestly, the results didn't surprise me because as I've been playing around with these two tools, I have noticed that it seems like I'm able to do a lot more work in Codex before I'm hitting that limit compared to Claude Code.

11:47So we're gonna go through those numbers together live after we run some of those experiments. So the takeaway is if you're already paying for one of them, you've already got a top tier coding agent. But I do think there's a lot of value in subscribing to both, playing around with them, and seeing which one you like better or if you like having both subscriptions for different types of work.

12:02To quickly recap what we've covered, Claude Code is a more customizable shape. Deeper hooks, auto delegating sub agents, ultra plan, ultra review, slash loop, Agent SDK. Codex is a more unified shipping shape.

12:12WorkTrees, in-app browser, it seems to follow directions better, sharper computer use, GPT Image 2 access. Both tools have subscriptions. Both tools have kind of different context windows, and third party harnesses currently favor OpenAI, ChatGPT.

12:23But this is where most comparison videos stop, just listing features and calling it a day. So here's what we're gonna do. I'm gonna give Claude Code and Codex the exact same three prompts.

12:30A research report PDF with branding, a full landing page, and an interactive dashboard with real feeling data. Same prompt. I'm gonna put both tools side by side, so let's see what happens.

12:38Alright. So here are the final results of Codex versus Claude, and we're gonna come back to this and look at all of the actual breakdown in just a sec here. So let's actually look at the outputs of all of these three different prompts.

12:48So in this experiment, I did both of these, or I used Claude Code and Codex in their respective desktop apps. The first thing that we did was the research report. This was something that we could turn into a skill, and it would give us an automation report for SMBs on, like, different automation tools.

13:02So this is the prompt that I shot off to both Codex and Claude Code. As you can see, this is the prompt inside of Codex with the logo, and this was the one inside of Claude Code. So let's take a look at the outputs.

13:11If I scroll down a little bit here, we should be able to see PDF. And if I click on that, we get to open this up in Claude Code's desktop app, sort of like browser viewer. So I'll just do it in here for now.

13:20You can see right off the bat, you know, the logo's up top, but this is a major issue. Like, that is hard to read, and then the spacing right here is not great either. But this one's 15 pages, and as you scroll down, it gets better.

13:29I think the header looks really clean. The table of contents looks nice. I'm not gonna read and verify all of these facts.

13:34I just don't really feel like doing that right now. They're both pretty solid when it comes to doing research. And by the way, I didn't give it any API keys.

13:39So they're doing research using their native, like, web fetch and web search tools, whatever those are. So it goes through executive summary. It goes through market overview, and you can see that this one is very, like, wordy.

13:48It's structured almost like it's trying to sort of, like, tell a story, and it's going over these different tools. We have a side by side comparison here, top three picks, Zapier, Lindy, Make.com, and then at the end, we have where the market is heading in the next twelve months with all the sources at the bottom here. And all of these are clickable links that I could go to, but not when I'm in the localhost here.

14:04If I was to open this up in my browser, like you could see right here in my browser, I could actually then go ahead and click on these links, and it would take me to that actual source. Now here is Codex in the desktop app. Interestingly enough, you can't actually open PDFs right in here in the preview.

14:16So we have to open this up on our browser. And this is Codex's version. So right off the bat, it already looks better because we don't have some weird spacing on the title.

14:23The logo's there, but it kind of has this weird, like, you can tell it's a square image. So the header, nice. I thought the header was better with Claude Code.

14:30Table of contents looks perfectly fine. We've got an executive snapshot, and some of this spacing feels a little bit almost rushed, like it feels a bit squished together. Market overview.

14:37And then as we go into the platforms here, we basically just get a table for each tool. Also, the footer on this version isn't as cool either. So Claude Code went for more of like a, I'm gonna tell you a story, and I'm gonna break it down with bullets.

14:47OpenAI Codex went for more of like a, I'm just gonna give you a table, like a consistent table breakdown for each of these different tools. You'll also notice that this research report is nine pages, whereas the other one was 15. We get our side by side comparison.

14:58We have our top three picks, which are Zapier, Lindy, and Relay. And Claude Code's top three picks were Zapier, Lindy, and Make.com, so kinda similar. And then we have where the market is heading over the next twelve months and a practical buying guide.

15:08And then we have all of our sources at the bottom, which once again, these are clickable links that work. Okay. So number two was our website.

15:13And we gave it the same exact prompts here with the Glaido logo, and we told it to build us a landing page. We gave it the actual Glaido site so it could go and look at it and maybe get some, like, you know, inspiration, and then it comes back with an actual landing page here. And then, of course, in Codex, we gave it the exact same prompt with the same logo.

15:29Now, here are the actual two landing pages. Which one do you think was which? This one on the left was Claude Code.

15:36So right off the bat, they have similar feels. Right? They have similar colors.

15:38You'll notice that OpenAI was able to put the logo up here, whereas for some reason, I don't know why Claude Code didn't. That would be a very easy fix. But as we scroll down, we can see that we've got sort of like an animation right here, which we have like the kind of like dictation looking thing.

15:51I like how this is a microphone that's pulsing rather than just this being like a g. I also like how this kind of like text cursor thing is blinking as well. And overall, as we start to scroll down here, I generally like Claude Code's version better.

16:03Like, even the font, it just feels a little bit less vibe coded. These logos are obviously wrong except for GitHub looks correct. Gmail looks sort of correct, but not really.

16:10That would be an easy fix. But I like the sliding banner compared to just having these six boxes here. This next section, I once again think that this looks better.

16:17We have some glow. We have some icons rather than just like these random letters. So overall, I am liking Claude Code's version here pretty much a lot better.

16:24Here's the pricing page, the differences here. So I think that Claude takes the cake here. The logo thing would be a very, very easy fix.

16:30And as far as like a base, I think Claude Code wins here. Okay. And the final one was a marketing analytics dashboard.

16:36I told it to make up all the data, but I pretty much gave it the same, like, required elements. So let me pull up both of these side by side, and just for proof, here is the same exact prompt inside of Codex. Alright.

16:44So here are the two dashboards. Once again, I put Claude Code on the left, and right off the bat, I already think that the Claude Code version just looks a lot better from a design perspective. Both of them are still functional.

16:53If I click on the different buttons, you can see the data will shift. And as the data moves and the numbers move and the charts move, we can still use our mouse to see the actual, like, numbers. So that is all working well.

17:02You'll notice here that we have orders and average order value, but here we just have revenue. We can come down here to channel breakdown, and we can hover over the different elements, and we get the data there. And even here, like, the conversion funnel, right, the purchase funnel, this just looks way more generic and bland, but this one has almost, like, sort of a a gradient that goes across.

17:18And I just think in general, the fonts and the vibe, everything about Claude Code's version just looks better, even though from like a functional perspective, I think that they're the exact same. Alright. And now the part that you guys probably care more about, which is like the actual metrics of cost, speed, tokens, stuff like that.

17:33So we were using Codex with GPT-5.5 on high, and we're using Claude Code with Opus 4.7 on high. So, yes, this was like a Codex versus Claude Code video, but keep in mind that a lot of the actual performance is going to be determined by the underlying model that is powering the harness. So when Opus 4.8 or 5 drops and GPT-6 drops, these numbers would obviously look a little bit different.

17:53HOOKSo let's look at some of the totals and the numbers. Kinda surprising. So Codex, total time across three runs was almost 26 minutes, and Claude, total time across three runs was about fifteen minutes.

18:02HOOKTotal tokens were very similar. We had about 6,000,000, and you can see the breakdown. We're gonna break it down by experiment in just a sec, but about 6,000,000 tokens.

18:08HOOKWhat was interesting is that it cost more with Claude Code, and we'll break down why once we look at the experiment level breakdown, but keep that in mind. And then the average run, once again, Claude Code was faster here. And keep in mind, we had one Claude Code experiment that was like two minutes, and the Codex one was like eight, so that was like an outlier which kinda skewed the data.

18:24HOOKBut typically, I will say that I found that Codex is actually faster. And keep in mind with the token thing here, if you look at these two models side by side, GPT 5.5, Opus 4.7, they have similar input pricing, $5 for a million input tokens, but for their output tokens, GPT 5.5 is $5 more expensive.

18:40HOOKBut GPT 5.5 seems to be super efficient with output tokens, which is why in this experiment, Claude Code cost us more. Now this is API billing. I'm on a subscription for both of these, so I'm not actually getting charged $11 and $7, but this would actually factor in basically to, like, how fast your session limit is hit.
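
The pricing point above can be checked with a little arithmetic. This is a minimal sketch, not the video's calculation: the only pricing detail taken from the video is the $5-per-million gap on output tokens, while `BASE_OUTPUT_PRICE` is a hypothetical placeholder (the video does not state absolute output prices). The 83K and 18K output counts are the ones quoted later in the video for one run.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of output tokens at a given per-million-token price."""
    return output_tokens * price_per_million_usd / 1_000_000


# HYPOTHETICAL placeholder price: the video gives only the $5 gap, not this value.
BASE_OUTPUT_PRICE = 25.0
OUTPUT_PRICE_GAP = 5.0  # from the video: GPT 5.5 output is $5 more per million

# Output token counts quoted in the video for one of the three runs.
claude_output_cost = output_cost_usd(83_000, BASE_OUTPUT_PRICE)
codex_output_cost = output_cost_usd(18_000, BASE_OUTPUT_PRICE + OUTPUT_PRICE_GAP)

print(f"Claude output cost: ${claude_output_cost:.3f}")
print(f"Codex output cost:  ${codex_output_cost:.3f}")
```

Under these assumptions the model with the higher per-token output price still produces the smaller output bill, because it writes far fewer output tokens.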

18:55HOOKSo let's keep scrolling down here. With the speed thing, we can obviously see that, um, this was the main outlier where Claude Code finished really quick, almost, you know, two minutes, and then this one took Codex eight minutes. But I guess I stand corrected.

19:05I mean, in all of the results here, Claude Code was pretty much faster in all of them. For the input versus output tokens, we can see these charts might be a little bit hard to read because we have, like, input, we have cache, we have all this kind of stuff. But basically, what happened was Claude Code was spending more output tokens than Codex in all of them, which is like the little highlighted sliver at the top.

19:22You can see Claude Code's output here was 83K, almost 84K, and Codex's output was 18K. Over here, Codex's output was 20K, and Claude's output was 80K. And over here, Codex's output was 16K, and Claude's output was 41K.

19:35So Claude's output tokens are always higher than Codex's, at least in these three examples and based on other testing I've done. That's not like a definitive every single time rule, but it is a consistent pattern. So I think that, you know, we could look at the cost, obviously, but I think that this one chart is very interesting, if I can somehow make this one, like, full screen.

19:50This chart. This is efficiency and time. So the best place to be here would be bottom left.

19:54That means that you're very fast and you're very lean, and the worst place to be would be top right because you're slow and heavy. So on the x axis, we have total tokens, so more expensive as you go this way. And on the y axis, we have seconds, so slower as you go up.

20:06And it's really interesting because you can see here that we have two really great data points from Claude Code, which were experiments two and three, and then we also have this one, which is a clear outlier in the good direction, which was experiment one, which is our research report from Claude Code. And then we have kind of this accurate little bundle of Codex points, which is pretty consistent.

20:23Like, they're all kind of in this general area. They're all kind of in the middle of this scatter plot. So I thought that this was an interesting one to look at, and I would love to see what would happen if we had run, like, 100 experiments, where we would see, like, sort of the standard deviation and where we'd see the lines start to form for each of these tools.

20:38And I'm not gonna read these out because I think that it would be boring, but here are the raw numbers. If you wanna pause and take a look, you can certainly take a look through that. So the way that we were actually able to get this data is we just asked either Claude Code or Codex to read its JSONL, which is like a session log, and it can pull the time, the tokens, the cache reads, all that kind of stuff.

20:55So that's how I pulled the data. If you guys are ever curious about a session, just ask it to read the JSONL and pull that data for you. Alright.
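
The same numbers could, in principle, be pulled with a short script instead of asking the agent. The sketch below assumes a log shape that the video does not show: one JSON object per line, an ISO 8601 `timestamp` field, and token counts under `message.usage` with the keys `input_tokens`, `output_tokens`, and `cache_read_input_tokens`. Those field names are assumptions for illustration; check them against a real session file before relying on them.

```python
import json
from datetime import datetime


def summarize_session(lines):
    """Sum token usage and elapsed seconds from JSONL session-log lines.

    The field names used here are ASSUMED, not taken from the video.
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "cache_read_input_tokens": 0}
    stamps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupted lines
        if not isinstance(record, dict):
            continue
        ts = record.get("timestamp")
        if isinstance(ts, str):
            try:
                # Normalize a trailing "Z" so older Python versions can parse it.
                stamps.append(datetime.fromisoformat(ts.replace("Z", "+00:00")))
            except ValueError:
                pass
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if isinstance(usage, dict):
            for key in totals:
                value = usage.get(key, 0)
                if isinstance(value, int):
                    totals[key] += value
    seconds = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return {**totals, "seconds": seconds}


# Two made-up log lines in the assumed shape.
sample = [
    '{"timestamp": "2026-01-01T10:00:00Z", "message": {"usage": {"input_tokens": 1200, "output_tokens": 300}}}',
    '{"timestamp": "2026-01-01T10:01:30Z", "message": {"usage": {"input_tokens": 800, "output_tokens": 500, "cache_read_input_tokens": 4000}}}',
]
summary = summarize_session(sample)
print(summary)
```

For a real session, the call would look like `summarize_session(open(path, encoding="utf-8"))`, where `path` points at the session's `.jsonl` file.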

21:00So we just ran Claude Code and Codex through these three live builds. Same prompt, both tools, three completely different kinds of work. And the honest takeaway before I dig into specifics is that this was not a clean sweep in either direction.

21:10I feel like Codex won at certain things and Claude won at others. So starting with Claude Code, the biggest standout for me was the dashboard test. Claude finished that build in just under two minutes.

21:18Codex took almost eight minutes for the same exact prompt. So Claude was roughly four times faster on the most complex of the three tasks. The token side was even more surprising.

21:26On that same dashboard build, Claude used about 282,000 tokens total, where Codex used about 1,640,000. So almost six times more tokens on the Codex side for one build.
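
The "four times" and "almost six times" figures follow directly from the numbers quoted for the dashboard build. A quick check, rounding the durations to the two and eight minutes mentioned in the video:

```python
# Dashboard build figures as quoted in the video (durations rounded).
claude_seconds = 2 * 60   # "just under two minutes"
codex_seconds = 8 * 60    # "almost eight minutes"
claude_tokens = 282_000
codex_tokens = 1_640_000

speed_ratio = codex_seconds / claude_seconds
token_ratio = codex_tokens / claude_tokens

print(f"{speed_ratio:.1f}x slower, {token_ratio:.1f}x more tokens")
# prints "4.0x slower, 5.8x more tokens"
```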

21:35On the visual side, Claude also won the dashboard in my opinion and the landing page. The dashboard came back in dark mode and all the date filters worked, the hover states, the revenue chart just felt cleaner and more polished. Whereas Codex's dashboard was functionally the same, but it just felt cheaper to look at.

21:48And the landing page was the same story. Yes. Claude actually did forget to drop in the logo on that landing page, and the scrolling banner had, like, wrong logos and icons, but those are just mistakes that we could fix with one prompt.

21:58But the underlying design, the base that I wanted to start from, I think I liked Claude Code's better. The pattern I noticed is that Claude has this way of planning the task tightly before it executes. And Codex tends to just grind through more iterations, which is why the input tokens stack up on its side for the more, you know, complex builds.

22:13So for front end work, especially anything with real interactivity and design polish, I think that Claude was the clear winner in that. Now flipping over to Codex, the research report that it built was kind of a standout in my opinion. So Codex finished in about eight minutes and Claude took eight minutes and fifteen seconds, and Codex used about 2,800,000 tokens versus Claude's 4,700,000.

22:32So on the most research heavy task of the three, Codex was both faster and more efficient on tokens. Codex was also significantly faster on the landing page build, three minutes flat versus Claude's four minutes and thirty nine seconds. So if you're looking at pure speed, Codex typically tends to be faster.

22:46The other thing I noticed across all three tests is that Codex's output tokens are way leaner. Output tokens cost more than input tokens, so that is something important to keep in mind. And that's probably why on Codex, I'm not hitting my session limit as quick as with Claude Code.

22:56On every single build, Codex wrote about two to five x fewer output tokens than Claude. So Codex tends to just be more concise in what it writes back. It seems to be more efficient.
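
The "two to five x" range can be reproduced from the output-token counts read off the charts earlier in the video. The video does not say which chart belongs to which build, so the labels below are just positional:

```python
# (Claude output tokens, Codex output tokens) as quoted from the three charts.
output_tokens = {
    "first chart": (83_000, 18_000),
    "second chart": (80_000, 20_000),
    "third chart": (41_000, 16_000),
}

ratios = [round(claude / codex, 1) for claude, codex in output_tokens.values()]
print(ratios)  # prints [4.6, 4.0, 2.6]
```

All three ratios fall between 2x and 5x, which matches the range stated above.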

23:04On the visual side for the PDF, I liked Codex's a little bit better. It felt like it had better spacing even though I thought that Claude Code had a better header and a footer. And it's honestly just a toss-up, but if I had to send one to a client, I probably would have gone with Codex's version by a small margin here.

23:17And, obviously, I didn't read through every single sentence of the actual data in the research report, but that was my quick analysis. Alright. So given all that, let me give you my honest take on when to use each.

23:26I would say reach for Claude Code when you're working on complex front end, when visual design quality matters, when the task requires deep planning, when you want auto delegation, when you're building custom workflows with hooks and skills and channels, and when you need the Claude Agent SDK to embed agents in your own product, or when you're in an enterprise environment that needs Bedrock or Vertex auth.

23:44Then I'd say to reach for Codex when the task is research heavy and pulling from the web, when you're, you know, producing structured documents like PDFs or reports, when you want a single desktop app that handles worktrees and review and shipping, when you need to use /goal for, like, long running objectives, when you want to use @codex on GitHub PRs, or when your project needs image generation built into the workflow.

24:02On top of those buckets, I wanna come back to my observation from earlier because this is where it actually shapes my decision in practice. Like I said at the beginning of the video, Claude Code in my experience just feels more creative. It pushes back.

24:12I prefer it as my brainstorming partner. It catches things that I might not have thought of. So when I'm in a planning phase or wrestling with a hard problem, that's usually when I will reach for Claude Code.

24:19But Codex now just feels really good at executing. It just feels like it obeys me better. It follows instructions, especially as you're working on a project that starts to run a little bit longer.

24:27You tell it what to do, and it feels like it just does it. And, of course, it's been sharper on, like, catching things in the code and reviewing it and plugging holes. And that's why I say it's never like which tool is better.

24:36It's a matter of which tool is better for this specific task. A lot of people have been finding a ton of success with doing planning and brainstorming and strategy with Claude Code and then bringing in Codex to actually, like, just review the code or maybe even execute on that plan. And one more mindset piece I wanna leave you with on top of all of this.

24:50Because you're working with coding agents, all you're really doing is you're making files that live inside of folders that live inside of more folders, you know, markdown files or JSON files or Python scripts or whatever it is, which means you're gonna be pushing all of this stuff to GitHub. You can pull that exact same project into Claude Code or Codex or OpenClaw or Hermes or whatever the next new tool is.

25:07You know, you're not locked into one environment just because you've been building on Claude Code for the past six months. And if you ever wanna move between tools, it's really not that hard. You know, you open the project in another agent and you say, hey, I built this project in Claude Code and you are Codex.

25:19Just walk through it, understand it, and then just update anything that needs to change. Or, you know, you could clone it and then have like a Claude Code version of your project and a Codex version of your project or whatever it is. There's just a few small things that you're gonna have to swap.

25:30Like, the CLAUDE.md will now be an AGENTS.md. But the agent will figure out pretty much all of that for you.
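
For anyone who would rather do that swap by hand, here is a minimal sketch. It copies rather than renames, on the assumption that you may want the project to keep working in both tools; the function name `add_agents_md` is hypothetical, and the demo runs in a throwaway directory rather than a real project.

```python
import shutil
import tempfile
from pathlib import Path


def add_agents_md(project_dir):
    """Copy CLAUDE.md to AGENTS.md in project_dir without overwriting anything.

    Returns the AGENTS.md path, or None if the project has no CLAUDE.md.
    """
    root = Path(project_dir)
    source = root / "CLAUDE.md"
    target = root / "AGENTS.md"
    if not source.is_file():
        return None
    if target.exists():
        return target  # leave an existing AGENTS.md untouched
    shutil.copyfile(source, target)
    return target


# Demo in a temporary directory so no real project is modified.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "CLAUDE.md").write_text("# Project notes\n", encoding="utf-8")
    created = add_agents_md(tmp)
    demo_name = created.name
    demo_text = created.read_text(encoding="utf-8")

print(demo_name)  # prints AGENTS.md
```

Any tool-specific instructions inside the copied file would still need a review, which is the part the video suggests handing to the agent.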

25:34CTASo the real mindset is just keep an open mind. You're building portable skills inside portable folders. Whatever tool gives you the best workflow right now, just use that one.

25:41CTAAnd that brings me back to the thesis I started this video with, which is it's not a matter of which tool is best, it's a matter of which tool is best for the specific use case in front of you. And some people also might disagree with that. It's just kind of like how do you like to work and what features do you need.

25:53CTAAnd one last thing before I wrap, everything that I just walked through is accurate as of right now, May 2026. Both of these tools have been shipping at really incredible speeds. You know, new models will drop.

26:02CTAPricing tiers will shift. Features that are in research preview will graduate or they will be, you know, redacted. So if you're watching this video three months from now, just double check some of the specifics on the actual docs that I mentioned today.

26:12CTAYou know, the architectural differences that I walked through are likely to hold up, but some of those exact numbers or stats might not. And I know that we just covered a ton of information in this video, so I broke all of this down into a resource guide that you can access for completely free, and you can find that in my free Skool community.

26:24CTAThe link for that is down in the description. But that is gonna do it for today. So if you enjoyed the video or you learned something new, please give it a like.

26:29CTAHelps me out a ton. And as always, I appreciate you guys making it to the end of the video, and I'll see you on the next one. Thanks, everyone.


§ 05 · For Joe

Which coding agent to reach for, and when.

WHAT TO LEARN

The benchmark was not a clean sweep in either direction: Claude Code wins on front-end quality and planning depth; Codex wins on output-token efficiency and research-heavy output -- and both tools are portable enough that you do not have to commit to just one.

Output tokens are priced higher than input tokens, and in these three runs Claude Code wrote roughly 2-5x more output tokens per task than Codex -- which is the likely reason Claude session limits are hit faster.
Claude Code finished a marketing analytics dashboard in just under 2 minutes using about 282K tokens; Codex took almost 8 minutes and used about 1.64M tokens on the same prompt -- roughly a 4x speed gap and an almost 6x token gap for front-end work.
Codex won the research report task, finishing slightly faster and using 1.9M fewer tokens than Claude, which suggests Codex is more efficient when the task is document generation rather than UI construction.
Claude Code has 30 hook events for automated workflow triggers; Codex has about 6 -- if you need fine-grained automation that fires on specific agent behaviors, Claude Code is the only current option at that scale.
Claude Code auto-spawns sub-agents when task complexity warrants it; Codex only does so when explicitly asked -- which means complex multi-step tasks route differently through each tool even on identical prompts.
Projects built in either tool are portable because they are just files in folders that you push to GitHub; the main swap is that CLAUDE.md becomes AGENTS.md when moving a project into Codex, and the agent can handle most of the remaining changes.
A practical split workflow -- use Claude Code for planning and brainstorming, then hand the plan to Codex for review or execution -- is consistent with how each tool's token behavior maps to planning-heavy vs execution-heavy phases.

§ 06 · Frame Gallery

Visual moments.

02:05

09:18

10:57

17:27

20:45

22:24