Mansel Scheffel · YouTube · 38:56

I Made Claude Code and Codex Build the Same App (RAW RESULTS)

A 39-minute unedited head-to-head where Claude Code ships in an hour and Codex never finishes.

Posted
February 14th 2026
Duration
38:56
Format
Tutorial
educational
Channel
Mansel Scheffel
§ 01 · The Hook

The bait, then the rug-pull.

The same prompt. The same stack. Two AI coding tools, one clock running. Mansel Scheffel handed Claude Code (Opus 4.6) and Codex (GPT 5.3) identical instructions to build a live full-stack competitive intelligence app, then let Gemini Pro 3 judge the codebases. What followed was 39 unedited minutes that settled the argument more cleanly than any benchmark chart.

§ · Chapters

Where the time goes.

00:00 – 00:47

01 · Cold open

Side-by-side reveal of both apps' UIs. Immediate visual verdict: Claude's looks better.

00:47 – 02:53

02 · The challenge setup

Identical prompt introduced: build Rival (competitive intelligence app) using Supabase, Firecrawl, Vercel, ATLAS framework, GOTCHA system handbook.

02:53 – 05:50

03 · Both AIs start planning

Codex asks 9 clarifying questions (target user, LLM choice, auth style, visual direction). Claude asks one: API key for the edge function.

05:50 – 08:11

04 · Plan review

Codex's ATLAS-aligned plan is detailed. Claude's plan uses a table with stated reasoning for each tech choice.

08:11 – 11:09

05 · GOTCHA and ATLAS explained

Quick walkthrough of the 6-layer GOTCHA system handbook and the 5-7 step ATLAS build framework that both tools were using.

11:09 – 20:00

06 · Build in progress

Claude finishes in about an hour. Codex stalls on the Supabase free-tier project limit, then asks for permission at every deployment step.

20:00 – 24:04

07 · Both apps deployed and tested

Both fail on first run. Codex magic link auth is broken. Neither app runs an analysis on the first attempt.

24:04 – 26:58

08 · Gemini judges the backend

Gemini Pro 3 reviews both codebases. Codex wins on security (RLS, immutable audit logging). Claude wins on frontend completeness (10 pages vs 6) and dependency count.

26:58 – 37:03

09 · Claude self-heals

Claude fixes two Supabase edge function bugs (JWT auth, max tokens) in real time by reading live logs. Codex spends 19 minutes finding the same auth error class.

37:03 – 38:56

10 · Verdict and CTA

Claude is the clear winner on speed, UX, and self-healing. Codex is abandoned. Creator plugs Skool community and vibe coding course.

§ · Storyboard

Visual structure at a glance.

cold open
Codex planning
ATLAS walkthrough
Gemini judge
verdict
§ · Frameworks

Named ideas worth stealing.

08:11 acronym

ATLAS

  1. Architect
  2. Trace
  3. Link
  4. Assemble
  5. Stress-test
  6. Validate
  7. Monitor

5-step (MVP) or 7-step (production) AI build framework. Forces planning before coding, layer-by-layer assembly, and security baked in before ship.

Steal for Any AI-assisted MVP build where you want to prevent one-shot randomness and reduce debugging time
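The 5-step versus 7-step split can be sketched as plain data. This is a minimal illustration, not the creator's own file: the step names come from the list above, while the assumption that the two production-only steps are Validate and Monitor (the ones outside the ATLAS acronym) and the function name `atlas_steps` are ours.

```python
# Step names are taken from the ATLAS list above.
# Assumption: the MVP flow is the five acronym steps, and production
# appends Validate and Monitor. The video does not spell this out.
MVP_STEPS = ["Architect", "Trace", "Link", "Assemble", "Stress-test"]
PRODUCTION_EXTRAS = ["Validate", "Monitor"]


def atlas_steps(production: bool = False) -> list:
    """Return the ordered build steps for an MVP or a production build."""
    if production:
        return MVP_STEPS + PRODUCTION_EXTRAS
    return list(MVP_STEPS)


for number, step in enumerate(atlas_steps(production=True), start=1):
    print(f"{number}. {step}")
```

Keeping the steps as an ordered list makes it easy to check that a generated plan covers each phase in order before any code is written.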
11:09 acronym

GOTCHA

  1. Goals
  2. Orchestration
  3. Tools
  4. Context
  5. Hard prompts
  6. Arguments

6-layer system handbook stored in claude.md or agents.md that governs an AI coding environment. Combines deterministic tools with probabilistic AI to reduce build variance.

Steal for Setting up a repeatable Claude Code or Codex workspace that behaves consistently across sessions
§ · Quotables

Lines you could clip.

31:47
"We fixed four errors in the time it took to do whatever the hell is going on here."
Live comparison moment, no setup needed, visceral delivery → TikTok hook
34:00
"Even when this thing is wrong, it's so confident about it that it just makes me love it."
Authentic, slightly unhinged, quotable on its own → newsletter pull-quote
36:56
"Codex failed to build an MVP, which the majority of platforms out there can do for $20 or less."
Blunt closing indictment, standalone → IG reel cold open
§ · Resources Mentioned

Things they pointed at.

00:00 · tool · Supabase
00:00 · tool · Firecrawl
00:00 · tool · Vercel
24:04 · tool · Gemini Pro 3 (code judge)
§ · CTA Breakdown

How they asked for the click.

37:50 product
"Check out the videos on the screen right now. You can also look at my community where we've just launched the vibe coding course as well as a whole consulting path."

End-card with video suggestions + Skool community link in description. Soft sell, no hard push.

§ 04 · The Script

Word for word.

HOOK opening / re-engagement · CTA the pitch · analogy · story
00:00HOOKI gave the exact same prompt to Claude and Codex. Build me a full-stack app from scratch: Supabase back end, Firecrawl scraping, deployed live on Vercel. Just, here's what I want.
00:08HOOKNow go and build it. Then I had Gemini judge both codebases to decide who won. This is Claude versus Codex, head to head, and the results are pretty hilarious.
00:17So let's get into it. Seriously though, this thing on the right is disgusting. I'm not sure how something
00:24can produce something with a font and spacing that looks like this in 2026. I mean, obviously, this one on the left isn't production ready or gonna win any awards for beauty, but it's certainly a lot better than what's on the right here.
00:36Alrighty. So I think the best way to do this test is to give them the exact same prompt, but give a little room for interpretation so these things can think a little bit for themselves.
00:45There are a few more constraints though. One being that they're both gonna be running off of the same framework, which is inside our claude.md, and this is my GOTCHA framework, which we'll get into shortly. And then inside our goals folder over here, we also have a build app .md file, which has the Atlas framework inside, and that is a very specific framework for building apps.
01:03And we'll go into those while these things are building, but for now, just know that they are gonna be running in a completely fresh environment, each of them. They've both just initialized it based on their respective files, and they have the same prompt locked in.
01:15So we're gonna run this. I'm gonna leave most of the things up to interpretation, and we're gonna see the result that we get and just how well these things can function without any of my inputs. And, obviously, if they need my input, we will go and inspect that to see when and where they go wrong.
01:28So a quick overview of what we are building. We're gonna be building a competitive intelligence app called Rival. A user pastes in a company URL,
01:35and the app scrapes that company plus its top competitors, then displays a full competitive teardown dashboard. This kind of analysis would normally cost 10 k from a consultant.
01:45Then I just tell it that it needs to use my Atlas framework, so read goals forward slash build app. Inside there, we talk about Atlas. Then I give it some clues on the required stack that I want, and I'm mentioning this specifically because I wanna see how well it integrates with the MCP servers and does everything that it needs to do.
02:00So we're using Supabase for the database and authentication. We're gonna be doing web scraping with Firecrawl, and the deployment is straight into Vercel, and all of this is gonna be via MCP.
02:09The front end framework, this is where I'm giving them a little bit of freedom so they can pick whatever will produce the best result. We talk about some core features over here. I'm not gonna go too much into that.
02:19And we explicitly tell them that this thing needs to have AI in there, obviously, for the analysis and things like that. Then I give them just a very brief list of requirements.
02:26So the app must be deployed and live on Vercel when you're done. Supabase must handle all data persistence and authentication, and a few other things in here.
02:34One thing that I thought would be really important is that it must create its own instance of the Supabase project and the Vercel project. And I did that because I figured maybe one of these might finish a step sooner than the other one.
02:46And if it did that, then the other one might think, oh, hey, there's something in here. I can just go ahead and use this. And that might screw things up for this competition.
02:53So other than that, I think we're ready to go here. I was gonna measure time on this, but I think they're probably gonna run into a little mistake here and there.
03:01But we'll find out. So straight off the bat here, we can see them in their planning process. This is obviously Codex 5.3 on the right, and on the left, we have Opus 4.6.
03:10So this thing is already asking us questions. Who's the primary end user for Rival in the first version? Let's just say founders.
03:18Which LLM source should power the analysis pipeline? So there we go. It asked.
03:23And, you know, I was talking to one of the AIs earlier, it was like, I bet OpenAI is going to recommend an OpenAI product. So, anyway, I'll go with this. I'm happy to go with its recommendations.
03:34For free tier operation, which run profile should be default? Let's just use balanced. How should competitor selection work in v one?
03:42It's recommending fully automatic. So again, I'll go with its recommendation here because I want this thing to supposedly tell me the best version of this app that we can possibly get out of it. What authentication
03:52UX should we launch with? Email magic link as the recommended approach. Okay.
03:57We can try that. I would have maybe gone with Google auth depending on how big the app is gonna be. How should saved analysis be accessible?
04:04Private per user, definitely. So it's interesting that it's asked so many clarifying questions upfront, whereas Claude hasn't asked a single thing yet, but we'll see what its planning is like.
04:15What default usage cap should we enforce per user to stay in free tiers? Okay.
04:21If that's what you recommend, which visual direction should the dashboard follow? Bold editorial,
04:27enterprise clean, or playful startup? Let's go with playful. Which seed company should we use for smoke testing and first demo run?
04:38Fine. Okay. We'll use Linear.
04:39I don't really care. Wow. This thing is still just kind of planning.
04:45And it's looking at the Vercel documentation, which is interesting. So I'm guessing it's trying to figure out how to actually deploy directly to Vercel.
04:52Which front end framework should I lock for implementation? I'll take your recommendation. How should we handle repeated analysis
04:58for the same target domain? Twenty four hour cache. That makes sense.
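A twenty-four hour cache for repeat analyses of the same domain could look something like the sketch below. It only illustrates the idea; the function name `is_cache_fresh` and the timestamp handling are assumptions, and the video does not show how Codex actually implemented it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 24-hour window is the option picked in the video.
CACHE_TTL = timedelta(hours=24)


def is_cache_fresh(last_run: Optional[datetime], now: datetime) -> bool:
    """Return True when a stored analysis is recent enough to reuse."""
    if last_run is None:
        # The domain has never been analyzed, so there is nothing to reuse.
        return False
    return now - last_run < CACHE_TTL


now = datetime(2026, 2, 14, 12, 0, tzinfo=timezone.utc)
print(is_cache_fresh(now - timedelta(hours=3), now))   # recent run: reuse it
print(is_cache_fresh(now - timedelta(hours=30), now))  # stale run: re-scrape
```

Reusing a fresh result skips both the scraping calls and the model call, which is presumably why this question came up next to the free-tier questions.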
05:03Which OpenAI model profile should be the default analyzer? 4.1 mini?
05:08Really? I mean, maybe. But jeez, that's gonna be retired soon, surely.
05:15So, I mean, if it's recommending that again, I'm gonna choose what this thing recommends because otherwise it's kind of cheating. I don't wanna steer it too much. I guess it's not overly complex to just kind of sort through information,
05:27but one would think that if it's doing some kind of sentiment analysis or any deeper research, you might want a different model. But maybe it's trying to balance costs. And I think if we actually read over here: one final cost/quality decision, model tier for the analysis engine. Again, we'll see if it gets away with it.
05:43And coming in first place here, we have Codex, which has its full plan and actually asked a ton of questions upfront that were pretty good to get more of a user experience or workflow out of this thing. So let's expand the plan a little bit.
05:56I'm not gonna read through the whole thing because it's pretty in-depth. So v one implementation plan summary.
06:02Rival will be a Next.js 15 app deployed on Vercel, using Supabase for auth, data, and real time, and a Supabase Edge Function for orchestration. Awesome.
06:11So that's the choice I would have made. A user submits one URL. Rival auto discovers three direct competitors,
06:19configurable up to five, scrapes target and competitors with Firecrawl, runs structured analysis using GPT 4.1 mini, and renders a live updating dashboard with history. That's awesome.
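The "three competitors, configurable up to five" rule from the plan can be sketched as a small clamp. The default of three and the ceiling of five come from the plan as read out; the floor of one and the name `competitor_count` are assumptions made for illustration.

```python
from typing import Optional

DEFAULT_COMPETITORS = 3  # the plan auto-discovers three direct competitors
MAX_COMPETITORS = 5      # "configurable up to five"
MIN_COMPETITORS = 1      # assumption: a teardown needs at least one rival


def competitor_count(requested: Optional[int] = None) -> int:
    """Clamp a requested competitor count into the supported range."""
    if requested is None:
        return DEFAULT_COMPETITORS
    return max(MIN_COMPETITORS, min(requested, MAX_COMPETITORS))


print(competitor_count())    # falls back to the default
print(competitor_count(12))  # capped at the maximum
```

Capping the count also bounds how many scrape and model calls one run can trigger, which matters on free tiers.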
06:31So then it talks about the plan here, and you can see this is where it's starting to go through my Atlas framework, which we will go through very soon while these things are building. So the first phase is to architect, and it gives us our full app brief problem, users, success, constraints.
06:45Although, this isn't as in-depth as most of my builds are with Claude, but we'll see what it does this time around. Next up, we have Trace and this is where it goes through the data schema and figures out all the moving parts from that end. So you can see here, Supabase, and then it lists all the different schemas in here.
07:00We have an integrations map: Supabase, Firecrawl, OpenAI. Okay.
07:05That all looks pretty good and we've got some free tier controls in here. Then next up, we have link and this is where we check that everything is actually working before we go ahead and build so we can see all of the moving parts over here. And we'll go and check this before it just randomly builds the entire system and then has to figure this out after it's built something.
07:23Then we have the assemble part and this should be layer by layer. So hopefully, it's done that. Initialize app structure, implement Supabase schema, implement edge orchestration.
07:31Yes. So it's doing it piece by piece rather than just trying to one shot something super amazing. It's gonna take our layered approach.
07:37And then finally, stress test. So it's gonna be doing functional tests and integration tests
07:43and any edge case tests. Great. Then we have validate, which is the security and the correctness.
07:47So we obviously have to bake this in because if we don't have any security protocols on there, if we don't do any security checks as a part of our app development, it's not a very good app, is it? Now, realistically, there's only so many tests that this thing can run and obviously, if we have real users, they'll be doing user acceptance testing and a whole much deeper form of testing.
08:05But this thing just does what it can upfront. Then last up, monitor. We have some operational telemetry.
08:10So persist stage timings and failure reasons in analysis events, track run duration. There are lots of paths it could have taken here to monitor this stuff because both Supabase and Vercel have some form of observability in them. We could also have used Helicone or something like that for the OpenAI agent and various other things we could have done locally as well, but it didn't suggest any of those.
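Persisting stage timings and failure reasons can be sketched with a small context manager. This is a hypothetical illustration: the in-memory list stands in for the analysis events table, and the field names are assumptions, not Codex's actual schema.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name: str, log: list):
    """Record how long a pipeline stage took and why it failed, if it did."""
    start = time.perf_counter()
    failure = None
    try:
        yield
    except Exception as exc:
        failure = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the original error
    finally:
        # Assumption: one event row per stage, written on success or failure.
        log.append({
            "stage": name,
            "duration_s": time.perf_counter() - start,
            "failure_reason": failure,
        })


events = []  # stand-in for an analysis events table
with timed_stage("scrape", events):
    time.sleep(0.01)  # placeholder for real work
print(events[0]["stage"], events[0]["failure_reason"])
```

Because the event is written in the `finally` branch, a failed stage still leaves a row behind with its reason, which is the part that helps when an edge function returns a generic error.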
08:29To be honest, we weren't really that specific on that front. It's just pulling this out of my framework because my Atlas framework basically cements this in that we need to do this sort of thing.
08:38So that's a pretty cool plan. We're gonna go ahead and implement this very soon. I just wanna have a look at the questions that Claude is asking me over here.
08:44And it looks like the first question this thing is asking me is if I have an API key for Anthropic because it needs it for the edge function. So it looks like both of them have chosen a Supabase edge function for the AI, which would probably be the right move.
08:56Edge functions are probably the best way to do this kind of thing. So I'm gonna say, yes, I have one, and I'll provide this off screen to the AI when it asks for it. It did also offer GPT four o as a secondary option.
09:09But more interestingly here is that this thing didn't ask me a single question apart from the API key. So it's taken the entire build into its own hands without asking any clarifying questions and things like that, which is a very interesting approach. Let's have a look at what its plan looks like.
09:23So problem getting competitive intelligence today requires hours of manual research across competitor websites or paying 10 k for a consultant. Rival automates this. Paste the URL, get a full competitive teardown in under ninety seconds.
09:35And then it goes straight into Atlas, so architect talks about the problem, the users, the success, and constraints. So again, fairly similar to OpenAI's if we come and have a look over here.
09:45Same four bullet points, so it's kind of aligned with mine. It put this in a table for trace, which is just easier to read, I suppose, for us.
09:52So: layer, the choice that it made, and why it made it. This one didn't do that. So that's also really interesting.
09:58So the front end, it chose Next.js 14 plus, and it chose that because it's a native Vercel deploy. And styling, Tailwind CSS. Okay.
10:07Okay. Okay. The AI analysis.
10:10Anthropic Claude Haiku 4.5, cheapest Claude model. Yeah.
10:14Sure. That makes sense, and it's fast. And I'm pretty sure this thing is gonna outperform what OpenAI recommended,
10:20but we'll find out very soon. Long running pipeline: Supabase edge function, 150-second wait time versus Vercel's ten-second limit.
10:29I wonder if it actually did research into that to make sure that that's accurate. Key architecture decision: Supabase edge functions for orchestration. And I chose that because the Vercel hobby plan has a ten second function timeout.
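The reasoning behind that architecture decision can be written down as a simple budget check. The two limits below are the figures quoted from Claude's plan in the video and are not verified here; the ninety-second estimate is the plan's own target, and the dictionary keys are illustrative names.

```python
# Time limits in seconds, as quoted from Claude's plan in the video.
# Assumption: treated as given; check current platform docs before relying on them.
LIMITS_S = {
    "vercel_hobby_function": 10,
    "supabase_edge_function": 150,
}


def runtimes_that_fit(estimated_seconds: float) -> list:
    """List the runtimes whose time limit covers the estimated pipeline duration."""
    return [name for name, limit in LIMITS_S.items() if estimated_seconds <= limit]


# The plan targets a full teardown in under ninety seconds.
print(runtimes_that_fit(90))
```

With a ninety-second pipeline only the edge function fits, which matches the choice both tools made.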
10:40Far too short. Okay. Cool.
10:42And then we got to database schema. So migration one, two, and three, and four. Integrations map.
10:50All looking good. Atlas phase link connection validation, assembling everything.
10:55So it printed a very different plan for assemble. If we look at assemble over here, just the structure is very different.
11:02This one didn't give us a breakdown of everything. Cool. So I'm gonna approve both of these plans and I'm also gonna add the API key off screen so that this thing doesn't run into any problems.
11:11And while those things are building, I'm gonna give you guys a very quick overview of the system. I've gone massively in-depth in other videos, so you can check that out if you want long form. I'm just gonna give you some context as to how the system handbook works, how our build app framework works, and then I'll show you how to set all of this up manually and how you can integrate in your environment so you can then just go away and get this thing to build anything that you want, really.
11:31So firstly, we have claude.md. You can change this to any filename you want. Both Claude and Codex are using this right now.
11:38If you're switching to Codex, all you do is change this to agents.md and it will do the exact same thing. So we have the GOTCHA framework here. It's a six layer architecture
11:47and this just governs everything inside VS Code or Antigravity, whatever you're using. We have our goals and this is what needs to happen. So you can see here one of our goals that comes standard is to build an app.
11:58We'll get into those very shortly. Then we have the orchestration layer, which is the manager, and this is the thing that governs everything inside our environment. Then we have tools, and these are deterministic scripts that will do the job each time in the exact same way.
12:10The reason that we have this three letter acronym over here is because AI is probabilistic. So one day you might get something like this out of it, and the other day it might do something completely different based on the exact same input. So we stop that happening by giving it deterministic tools, and those two things work together in order to have the probabilistic nature of AI, but we have that determinism for business logic and systems that we're building.
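A deterministic tool in this sense is just a plain function that returns the same output for the same input, so the model can call it instead of improvising. The sketch below is a hypothetical example for this app (normalizing a pasted company URL to a bare domain); it is not one of the tools shipped with the framework.

```python
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Reduce a pasted company URL to a bare lowercase domain."""
    candidate = url.strip().lower()
    if "://" not in candidate:
        # urlparse only fills in the hostname when a scheme is present.
        candidate = "https://" + candidate
    host = urlparse(candidate).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(normalize_domain("https://www.ClickUp.com/pricing"))
print(normalize_domain("clickup.com"))
```

Both calls resolve to the same domain, so anything keyed on it (a cache entry, a saved analysis) behaves the same way on every run regardless of how the model phrases the request.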
12:33Cool. But we can take it much further than that. So then we have context because, obviously, you have a business or system or anything that you're building, it has context behind it, and the AI can use that context as part of whatever it is that it's building.
12:46So for instance, inside my folder, I have everything to do with my life, my business, my content, and this environment here can just pull that out for me. It's really useful if you're building specific things where you need that context. Then we have prompts, and all that is is literally a prompts folder with reusable prompts that we stash there so we can use them for regression testing outside of a Python script,
13:06or we can use them in other systems in a whole bunch of different scripts and things like that. And then finally, we have arguments which are just variables, and they go in at runtime for specific things that change during runtime as opposed to having to change them from a script every single time we run them. So again, just know that this thing kind of governs our entire system and there's a whole bunch more that goes in here.
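Arguments as runtime variables can be sketched with Python's standard `argparse` module. The flag names and defaults here are hypothetical, chosen to echo choices made earlier in the video; the framework's real argument files are not shown on screen.

```python
import argparse


def parse_args(argv=None):
    """Read per-run settings from the command line instead of editing the script."""
    parser = argparse.ArgumentParser(description="Run one competitive analysis.")
    parser.add_argument("url", help="company URL to analyze")
    # Assumption: illustrative flags only, mirroring decisions from the planning step.
    parser.add_argument("--competitors", type=int, default=3)
    parser.add_argument("--cache-hours", type=int, default=24)
    return parser.parse_args(argv)


args = parse_args(["https://clickup.com", "--competitors", "5"])
print(args.url, args.competitors, args.cache_hours)
```

The script stays identical between runs; only the values passed in change.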
13:26We've got guardrails baked in and again, you can check out the full length video if you really wanna understand how this file works. Just know that it governs everything within our environment and that's the important part here. But then more importantly, when we're trying to build, we don't just wanna YOLO our way into every single build.
13:40We wanna give it some structure, some best practices structure. And that's where Atlas comes in. So this is a five step process for when I'm building MVPs, and it becomes a seven step process if I'm building something closer to production.
13:51So the first step that we have here is architect. This is where we define the problem, the users, what success looks like. Because in context engineering, you need to have an understanding of what good looks like.
14:02If you don't have that, the system couldn't possibly know where to take you. This is arguably the most important step in the entire process because the more specific you are and the more planning you do upfront, the less work you're gonna have to do later on. Next up, we have trace, and this is where we define the data schema, the integrations map, your tech stack, and things like that.
14:19You would wanna feed as many ideas of the tools that you're gonna be using upfront. And again, you can work with the AI beforehand to figure this stuff out. It's not like I walk in there for every single build that I'm doing, like, yes, I want to do this.
14:30I know exactly what I'm doing. No. You can think with the AI to fill all of this stuff in for you.
14:36Then next up, we have link, and that's just where we validate all the connections. And then we have assemble, and that's where we build our layered architecture step by step. Start with basic functionality and you move upward rather than trying to just shoot for the moon and have to fix a bunch of shit later on.
14:50Then we have stress test. It's very important to test the functionality, the error handling before we actually do anything. And finally, bake in security and some monitoring because those are vital for any real app out there.
15:01So again, that's just these files at a very high level. Check out the deep dive on the screen now. But what I'm gonna do is you can grab these files in the description below and you can use them as your framework to build whatever you want.
15:11I'm gonna show you how to set those up. Cool. So all you would have to do to get started is drag these into VS Code.
15:16This is all part of the full tutorial video that I just mentioned a few seconds ago. So I'm not gonna go fully into the weeds with this because again, that will show you how to do all of this. So I got my little Claude chat here and I'm just gonna say initialize this environment.
15:29And then what it's gonna do is it's gonna go ahead and read its claude.md. It's gonna understand what the system handbook is and it should build out our structure exactly as we want it. So I'm gonna bypass permissions.
15:38We can see there it's already gotten started and you can see how it's starting to build out our tools. We've got our context folder. We've got our goals, which was already there, but it should add a manifest very, very soon.
15:49We've also baked in memory, so that was an addition that I made in that other video that I keep referencing, where we're trying to bake in memory of this just like Claude Bot had, but this is obviously way more secure. While our other environment is initializing, let's check in on our builds.
16:02So this thing is just over halfway. That's Claude. And over here, it doesn't really have a to do list.
16:08It just seems to be kind of working away in the background. We'll flip back to this in a second. And there we go.
16:13This thing's done a few seconds later. So you can see it's now set up our entire environment over here with our memory system and the manifests that we need for our apps. And look at that.
16:23One of them is finished and one of them is asking me a question and kind of failed but not really a failure, I guess, at the end of the day. So we'll go into what it built very shortly. But on the right hand side here, Supabase reports that creating a new project will cost $0 monthly.
16:39Please confirm you understand that and want me to proceed. I absolutely do. Continue.
16:45That's fine. $0 $0. Meanwhile, we can come over here and have a look.
16:51Rival is live. What was built: the database, Supabase project Rival with three tables. The back end, Supabase Edge Function analyze company v2 with full scrape and AI pipeline.
17:01The front end, Next.js 16 plus Tailwind.
17:05Authentication is email/password via Supabase. So Codex recommended passwordless via a magic link, and this one recommended the traditional email password.
17:16At the very least, I would have added Gmail. I mean, come on. Deployment: Vercel production. Yes, it's exactly what we told it.
17:25And we'll come and open this up very, very soon. I just wanna see what this thing is saying on the right hand side here. Supabase project creation is blocked by your account's free tier limit.
17:34Ah. That's interesting. So it's blocked because I've reached the max active free projects, which is two.
17:41Alrighty. I just deleted one, so let's tell it. I just deleted one of the older projects, so you can go ahead and create it now.
17:48And we'll let that thing go off to the races. We'll come back and reveal these when the other one is done. Okay.
17:54And this thing is still chugging along here a little bit. Having a bit of trouble with Supabase or doing something very, very in-depth compared to whatever Claude did. Another question.
18:03Do you want me to deploy Rival to Vercel production now with required Supabase environment variables? I absolutely do. I wonder if this thing's gonna run at all on the left because it literally asked me nothing.
18:15So far, already off the bat, I think Codex is superior in terms of asking user clarification and functionality upfront.
18:24Claude just kind of says, I know what's best and nobody cares about what the user wants. And then literally built the entire thing without asking me a single question about it apart from the API key. So it'll be interesting to see if its intelligence slash arrogance
18:38actually amounts to something over Codex's ability. Alrighty. We got some more things going on here.
18:44Production deploy started successfully. The URL's been issued. Vercel blocked production because Next.js 15.2.4 is vulnerable.
18:53I'm upgrading to a patched Next.js line now. I wonder if this thing ran into the same version or if it automatically just went for Next.js 16.
19:02That'd be interesting to figure out. But this thing seems to be healing itself and just carrying on with life. And more questions here.
19:09Do you want me to install Next.js 16.1.6? Oh my god. Yes.
19:12Just do it. And this thing is still chugging away. It asks me another question
19:17about security, whether it should expose the public URL. And another one, do you want me to verify the deployed API health thing?
19:24Yes. Just do everything. So this thing is very needy.
19:29But I think they're doing it to bake in more security for people who are less security conscious. So they might stop, think about this, and then maybe ask this thing or, you know, Google it to see what the consequences might be. I'm not sure if that's intentional.
19:41Do you want me to verify the deployed run analysis? Yes. Do it.
19:45So that's quite frustrating, and I'm sure you can bypass it. You probably just need to update your settings
19:51somewhere inside here to completely disable any form of it asking you anything. Claude just made that a lot easier by having a bypass permissions button over here. Oh my god.
20:00I think it's finished and I have aged significantly. Implemented end to end rival v one in this workspace and deployed it live. Cloud resources created.
20:11Okay. That's fantastic. So it looks like it built a bunch of stuff.
20:14Cool. So let's get this thing going then. We need one manual step.
20:18Both of them recognized it, but Codex had a difference here in the sense that it needs to add the Supabase edge function secrets in the project. Wants me to do that manually. This thing does not mention that.
20:28That's really interesting. But we can easily go and do that. And then both of them identified that we need to change the URL
20:35inside the auth. So I'm gonna do that off screen and then we'll jump back when it's ready. Okay.
20:39So Codex is still stuck in some kind of test verification loop. And as you can see, I'm just asking it politely why it's taking so long to do all of this stuff when Claude finished ages ago.
20:51It's now literally been two hours, and I'm still sitting here. I've been sitting silently waiting, hoping, and even praying, which goes against every part of my being.
21:01And we still haven't finished. I think it is broken. I think I've officially broken Codex
21:06or run out of something. We still have context, quite a few tokens, but obviously, it can compact it if it ran into any kind of problem.
21:13I definitely still have credits. I checked that as well. So I'm really not sure what's going on at this point.
21:19But we're gonna check Vercel and let's have a look at what's going on inside our console. So
21:25Codex Rival, this is the Codex one up. It's rebuilding. Okay.
21:31So at least it is doing something. It's been building for seven seconds. So let's leave that one alone then.
21:39Okay. So Codex finally deployed it, but as you can see, there's an error.
21:44So that's not a good start. So this is actually terrible and we're gonna reveal which one is which very soon.
21:51So what we did for testing here is we have a third party involved because it's not just me. I'm not just gonna judge this aesthetically and presume that the back end is all hunky dory. What we're actually doing here is we're gonna judge this, we're gonna test this ourselves,
22:05but then I've also been using Gemini Pro 3, and it's busy looking at the back end and at our front end code to do an assessment for both of those to give its own judgment on who it thinks is superior at what. So let's do some basic tests and then we're gonna flip over to Gemini and see what it thinks of this thing's building capabilities.
22:24Seriously though, this thing on the right is disgusting. I'm not sure how something can produce something with a font and spacing that looks like this in 2026.
22:34I mean, obviously, this one on the left isn't production ready or gonna win any awards for beauty but it's certainly a lot better than what's on the right here. But I'm not gonna be that negative.
22:44Let's test functionality. So why don't we just do let's do clickup.clickup.com.
22:51Maybe if I could spell clickup.com. And then we'll just copy and paste this in here. So if either of these work first shot, I will be massively impressed.
23:02Because remember this was a one shot. Ah, okay. So this one wants me to sign up first before we're allowed to do anything.
23:08So I've signed up to the left hand one. Let's see if this has the same functionality. Run teardown.
23:13Sign in. Yes. So this won't let me do anything without signing in either.
23:16But this one has a magic link. So let's test the magic link and see if that works. Oh, you're losing points there, buddy.
23:23Doesn't work. Let's try refreshing the page and trying again.
23:28Nope. That's still not working. So straight off the bat, it failed immediately for the second time trying to do its thing
23:36which obviously isn't very good. So I'm going to fix this off screen and we'll try log in again. Okay.
23:41So it fixed that problem. I'm gonna enter my email in here and sign up off screen again. Okay.
23:46So one of them is running on the right and let's do the other one on the left. If these work, honestly, I'll be amazed. If I had to guess which one is gonna work, I would probably say none of them but maybe the one on the left hand side.
23:59Let's find out. There we go. We already failed on the right hand side.
24:04Edge function returned a non-2xx status code. So this is super generic. It's just saying a non-2xx.
24:12So it's looking like 200 is normally okay. It doesn't even give us a proper error code. So we're gonna have to go and scrape through the back to see if we can find anything here.
24:21And at this point, it looks like they're both broken just as I thought neither of them would work. So left hand side, we've got Claude which at least looks better and in the right hand side, have Codex which took hours and hours to do absolutely nothing
24:34terrible and broken. But we'll fix this in the background. For now, let's flick on over to our Antigravity and have a look at who this thing thinks is the winner.
24:43So first up, I reviewed the security and architecture for the back end because obviously, that's completely separate. So the clear winner here was actually Codex, if you can believe that.
24:54So the winner is Rival twenty twenty six, which is our Codex build. The newer project demonstrates a significantly more mature and robust security architecture. It implements comprehensive row level security, which is pretty basic and also very important.
25:07CRUD operations includes immutable audit logging and utilizes performance optimized authorization patterns. Lots and lots of words. You can see this is a tie.
25:17This is probably the most basic thing that your database should have enabled for security. So it was a tie. No brainer there.
25:24But this thing says that Codex is the clear winner here, which is interesting, but I suppose also not surprising because it took significantly longer at planning,
25:35but also implementing and it asked me a bunch of questions. So I'm sure that that influenced the security as well.
25:41And if you look at the update for what Codex 5.3 was supposed to bring is that it is more robust in its security. So that's clearly showing on the back end here. However, on the front end,
25:53for the code base itself, things are a little bit different and you can see here the clear winner is Claude. So
26:00completeness, full auth flow, login sign up confirm, minimalist callback only. So it probably picked up that it didn't do the login first the first time when I scanned this because I scanned this before I had actually logged into either app
26:14and that's probably why it picked up that there was a problem. Core logic, dedicated API analyze endpoint, placeholder. Okay.
26:22Not sure what this thing is analyzing. UI complexity, 10 static pages, six architecture,
26:29237 dependencies, 139.
26:33Stability, how is it getting this? That's interesting.
26:39Alright. So it's literally just from a build. I wouldn't say that that is a stability of the platform then.
26:45HOOKSo perhaps we'll rerun the codebase part of this after we get everything working. So I've got to say that Claude is still the clear winner for me like for every single thing that you do, Claude Code is just better than Codex. I hate Codex's user journey.
26:59HOOKI hate its user experience. It is vastly slow in comparison to Claude. Even if I use Sonnet,
27:06HOOKCodex is slower than Sonnet. I don't like the way it thinks. It does random things very slowly as well.
27:14HOOKEverything in here is completely predictable. It lets me know what it's doing, and I realize you can click on here and see anyway. But something to do with the user journey of how Claude approaches it is just far superior to me,
27:28HOOKspecifically in its troubleshooting as well. There is no way you could convince me to ever use Codex over Claude. I just wouldn't do it.
27:35HOOKI realized both of them failed, but realistically, you have to look at this. We built a somewhat complex, obviously not really, but
27:43HOOKfor a one shot, every single part of these functionalities to work flawlessly. It's not really realistic.
27:49HOOKThere's always gonna be a little bit of iterating especially when I gave it such vague instructions. If I gave it better planning upfront and I was very specific in what I wanted and how it worked and all of that, we would have got a much better one shot from the beginning.
28:01But the point of this test was to prove this kind of thing. I wanted to see what it could get away with if it had to think a little bit for itself and then just go and build.
28:09And this is where we got to. And Claude is definitely far superior on the sense of the aesthetics, but also troubleshooting.
28:16I mean, look how much faster this thing is working than this thing on the right. And Codex on the right had much more of a head start because I started troubleshooting this thing long before I started Claude. But we have to take into account Antigravity's critique
28:28with the back end. Obviously, I'm not going to go through the database and the back end and verify any of this information. I'm going to take it at its word.
28:36It used the MCP server. It logged straight into the projects and read both of them. So we have to take it on its word for that much.
28:43But this one I will rerun if we get these apps working. Otherwise, again, we're just going to take it on its word because from what we can see on the front end, it's vastly superior.
28:52It's a win for Claude. We don't really need Gemini's perspective on this. We can clearly see who the winner is on this side.
28:59In terms of front end functionality, yes, they are both broken and I'm gonna give this about five to ten more minutes to see if I can get either of them working properly. And if that doesn't work, then I'm just gonna cut this off because I've been recording for two hours and thirty four minutes and for the most part that is because of Codex and its literal
29:17slow time. Claude finished maybe an hour at least before Codex did.
29:23It just seemed to keep iterating through things and MCP servers and doing random things. And again, they both had the exact same prompt to start, so it's not like I'm doing something funny in the background here.
29:37So Claude has fixed its problem allegedly. The edge function analyze company was deployed with verify_jwt true, which requires a valid JWT in the authorization header. Cool.
29:47So did you fix it? No Vercel redeployment needed. The fix was entirely on Supabase side.
29:53Alrighty. You should now be able to go here and log in. Okay.
30:03K. And we're off on a second test here using our competitive analysis while this thing is still trying to troubleshoot itself and just figure out life. So while this thing's running, I'm gonna see if we can tail any logs
30:14to see if this thing's actually running or if it can pick something up in real time if it is failing. It's working. Your new ClickUp analysis is actively running and is currently on the analyzing step.
30:25The final stage, here's the progress so far. Okay. So then it's a front end problem where it's just not updating these little bubbles.
30:32But let's see if it actually finishes. Codex is still trying to fix itself.
30:38Let's have a look at our app. Have you finished? Nope.
30:43Still hasn't finished. What we can do is we can actually check Firecrawl to see if it's done anything. So the time is wrong here.
30:50It thinks I'm in a different time zone. But for the most part, it looks like it has been scraping competitors of ClickUp.
30:56So we got Monday, we got Notion, we've got Atlassian, Asana. So one of these has been scraping.
31:04I'm guessing it's obviously the Claude. Let's have a look. So they had scraped discovered competitors.
31:10Yes. Okay. So the Claude is winning so far.
31:12AI analysis failed to parse competitive analysis results. The problem is max tokens. It's too small for a full competitive analysis.
31:21We can see here on the right that Codex is just getting nowhere. So, from a troubleshooting perspective, I mean, we're doing it live right now.
31:29This thing's already fixed two problems in the time that this thing's trying to figure out where left and right is. It's massively different. And this is what I'm talking about with the user journey.
31:38I don't know how to explain it or articulate it, but even the even when this thing is wrong, it's so confident about it that it just makes me love it. So that's why I've been using it.
31:49That's why the only thing I pay for the max plan for is Claude. I promise I'm not sponsored by them. So edge function v4 deployed, three key fixes, max tokens 8,000,
32:00try parse JSON, cool. Now try again. I'll do that.
32:04Okay. So it had a retry button, which I also like. I mean, I didn't ask for that functionality, but it had it and I clicked it.
32:10And here we go. So we can flick back over to Firecrawl to see if this thing's doing anything. Let me refresh this.
32:20We should see some in progress unless this thing is so super fast that it's just that it's just running through it or maybe it will use a cached version. Who knows what it's doing?
32:30Nothing's been kicked off yet. Let's ask it to tail the logs and see if it can do that. On the right hand side here, we're still exploring
32:37life. Six files, two searches. What is it even doing?
32:46Pulling Supabase edge function logs now to see the exact HTTP status body behind. I mean, it's mental that it takes this long to figure out something.
32:56Claude destroys it. It's running great on v four. Here's the live status edge function of post.
33:02No more 401s. So if we come back here, do we see anything? Not yet.
33:08I hit the retry button. So how are you saying that one is still pending? Or what is happening with the one that was running now?
33:15The newest one completed successfully. Fully worked, but it didn't populate the front. I mean, let's try click back.
33:22So, this completed. Ah, there we go. So, there's still some kind of functionality problems here
33:28in terms of how it presents it to the front end because we had to refresh the page. But realistically, we're gonna go through this. I'm just gonna stop bothering with Codex now because this is ridiculous.
33:38We fixed four errors in the time it took to do whatever the hell is going on here. But it found a bug. So it's gonna go fix that bug while it fixes that bug.
33:47I'm just gonna come and take a look at this. Let's zoom in. ClickUp positions itself as an all-in-one replacement for 100-plus fragmented tools, competing against established players like Asana, Monday, and Notion.
34:00So this is pretty cool. I mean, we have features, pricing, positioning, content, gaps, and each of them can be clicked on.
34:07So obviously, from a UI perspective, I mean, it's not entirely beautiful, but it's functional and it far surpasses anything that Codex even bothered to get out. Even though it failed, this thing fixed itself.
34:18Codex couldn't even do that properly. So it's just a very clear table. We've got our pricing over here.
34:25ClickUp freemium, all of these have freemium, I guess, but so that's just a feature comparison and then a pricing comparison for the features that you'd get. Positioning, ClickUp software to replace all software strengths,
34:38clear memorable headline with strong consolidation value prop. Weaknesses, replace all software. Claim is hyperbolic and unsubstantiated.
34:46May alienate enterprise buyers. Alrighty. Asana,
34:51strengths Fortune 100 adoption creates significant credibility moat. Weaknesses, Fortune 100 focused may alienate SMBs.
35:01Notion is your AI everything app. I'm pretty sure a lot of apps are saying that now. Monday, outpace everyone with the best AI work platform.
35:10So these are pretty straightforward, but again, at least it's working. Content. So this is the content that each of them are putting out and you can see it's giving us a little summary over here.
35:18ClickUp and monday.com minimize blog investment instead embedding content through our product and demo pages. Asana and Notion prioritize blog and case studies for thought leadership and SEO. So again, that's pretty cool.
35:31And then gaps. So this is where it would be helpful to us if we were trying to get into this market ourselves and see how we could be different. Enterprise customer success and managed services narrative.
35:41ClickUp mentions CSM and managed services on enterprise tier, but no competitors prominently feature customer success stories, case studies or ROI testimonials. Asana leads here with Fortune 100 adoption claim, but lacks published customer success stories or quantified outcomes. So you can see this is pretty good.
35:59I mean, it did exactly what we wanted based on our original prompt. In every single way, Claude delivered on everything that we wanted.
36:07Yes. It didn't work the first time around, but again, we were not hyper specific and it's normal to have some kind of bug with minor iterations.
36:13These were not giant things. It solved it in less than five minutes and if I didn't spend time waiting for Codex, we would have been done ages ago. But now let's flip back over to Codex and see what's cracking.
36:26So the error was coming from Supabase returning 401 on run analysis. So a very similar error to what we had on the other side except this thing took nineteen minutes and ten seconds
36:38to find the exact same problem that this thing took I don't know what was it like thirty seconds to a minute to figure out and then solve in under four minutes. So that is an immense difference in time. I mean, it's ridiculous.
36:52CTAI have no idea why the hell this thing takes so long. And for those of you wondering, yes, I am running this on the highest possible thought patterns that Codex has as well. So it's benchmark
37:03CTAto benchmark here.
37:07CTAAnd then we rerun it again and it says invalid JWT. And that's it. That is how I'm going to end this video.
37:13CTAI'm not wasting any more time on Codex. There is no way you could convince me or pay me to use Codex. And that's not me being a dick.
37:20CTAThat is because the user experience I have had from the get go of this has been horrible. Both from the way that it works inside VS Code all the way through to building and iterating with this thing, it is not a pleasant journey.
37:33CTAIt might have much better security on the back end, but if it can't get the front end functioning at all, then we have massive disconnect between security and functionality. And realistically, when you're building anything, you need to have that perfect little triad or at least put it in favor without destroying the ability to use the thing properly.
37:53CTAAnd again, that's why I think Claude is massively superior here because they seem to be getting better and better at that balance while also riding this massive hype train that they have at the moment without destroying their own product. So I think they're doing amazing things and Gemini backs up obviously their front end.
38:09CTASure. Maybe their back end security and things like that could be better. But realistically,
38:14CTAif you're gonna be sending this kind of thing to clients, you would be doing pen testing, you would be getting proper developers involved to make sure that this is getting done the way that it should be so that you're selling a real product. This was just MVP,
38:26CTAand Codex failed to build an MVP, which the majority of platforms out there can do for $20 or less. And Claude Sonnet could easily have done if I didn't bother using the Opus model here.
38:36CTAAnyway, that's where I'm gonna leave this thing. Let me know any comments you have down below. I will get back to all of them.
38:42CTAOtherwise, check out the videos on the screen right now. You can also look at my community where we've just launched the vibe coding course as well as a whole consulting path based on everything I've done over the last twelve years. So if that resonates with you, feel free to check it out.
38:53CTAOtherwise, I'll see you guys in the next one. Thanks for watching.
— full transcript
§ 05 · For Joe

Codex asks more; Claude ships faster and heals itself.

AI CODING BENCHMARK

When both tools encounter the same bug, the one with the tighter error-recovery loop wins -- and that gap showed up clearly over 39 minutes of unedited live build.

  • Upfront clarifying questions do not predict build quality -- both tools failed on first deployment regardless of how much planning either did.
  • The real benchmark for an AI coding tool is not the happy path; it is how fast it reads live logs, locates the root cause, and ships a patched version.
  • Claude Code fixed two Supabase edge function bugs in under five minutes by reading real-time logs; Codex spent 19 minutes reaching the same error class.
  • Vague prompts expose error-recovery loops, not raw intelligence -- give both tools the same underspecified task if you want an honest comparison.
  • Codex's security-first posture (stopping to ask permission for every deployment step) is a feature for enterprise teams and a tax for solo builders who just want iteration speed.
  • A third-party model reviewing both codebases is itself a repeatable workflow for adversarial quality checks on AI-generated code.
  • Backend security wins like RLS and audit logging are invisible to users until something breaks; frontend speed and self-healing are visible on every iteration cycle.
  • An ATLAS-style layered build framework forces both humans and AI tools toward fewer catastrophic surprises at deploy time.
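
One of the fixes read out in the transcript (around 31:49) was raising the edge function's max tokens and adding a "try parse JSON" step after the analysis failed to parse. The video never shows that code, so what follows is only a minimal Python sketch of the general idea, assuming the model is asked to reply with a single JSON object. The function name `parse_model_json` and the sample strings are hypothetical, and the real app presumably does this in TypeScript inside a Supabase edge function.

```python
import json


def parse_model_json(raw: str) -> dict:
    """Best-effort extraction of one JSON object from a model reply.

    Hypothetical helper for illustration; not the code from the video.
    """
    text = raw.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # The reply may be wrapped in prose or a code fence, so fall back
        # to the span between the first "{" and the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            # No closing brace at all usually means the reply was cut off,
            # for example by a max-tokens limit that is too small.
            raise ValueError("no complete JSON object found; reply may be truncated")
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError as exc:
            raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


if __name__ == "__main__":
    wrapped = 'Here is the analysis:\n{"competitors": ["Asana", "Notion"]}\nDone.'
    print(parse_model_json(wrapped))
    try:
        parse_model_json('{"competitors": ["Asana", "Mon')
    except ValueError as err:
        print("caught:", err)
```

Raising a specific error for the truncated case matters here: a generic failure is what made the first "non-2xx" error in the video hard to diagnose, whereas a message that points at truncation leads straight to the max-tokens setting.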
§ 06 · Frame Gallery

Visual moments.