WEBVTT

00:00:00.000 --> 00:00:30.070
So Claude Opus 4.8 is finally here. And as always, the benchmarks look amazing. In a lot of the major categories, 4.8 is better than 4.7 and even better than GPT 5.5 as well. But the question is, is it really a better model? So today I wanna talk about what is new to Claude Code because of Opus 4.8. I wanna talk about some of the issues that you guys have been having with 4.7 and some of the struggles and how 4.8 is supposedly going to address those issues. I'm gonna go over some key takeaways because it seems like this model is going to behave a little bit differently than 4.7,

00:00:30.070 --> 00:00:36.630
and you're gonna have to change the way that you work with it a little bit. So let's not waste any time and just get straight into this one. Okay. So it is May 28, 2026,

00:00:36.630 --> 00:01:47.745
and Opus 4.8 has dropped, and it is apparently built on top of Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. And important to note, it is priced the exact same as Opus 4.7 on input and output tokens. But what's interesting right here is that they have increased rate limits in Claude Code to accommodate the higher token usage of the effort levels. So that's rate limits. That is not your, you know, five-hour rolling window or your weekly session limits. Those remain untouched, but rate limits, if you're using Claude Code via API, have been increased. Alright. So this is the blog post. I'll link this in the description, but I'm just gonna go over a few of the key findings here. Okay. So Opus 4.8 launches alongside several new features. Users on claude.ai now have control over the amount of effort Claude puts into tasks. And in Claude Code, we have a new feature called dynamic workflows that allows it to tackle very large scale problems. So I'm not gonna dive into Claude workflows today, but this is a new feature that I will be making a video about shortly. But now in Claude Code, obviously, you can see Opus 4.8 is here. It defaults to high effort. You can also, of course, switch the effort. But in here, you can type in workflows, and that's how you could start using that dynamic workflow feature. But what I wanna show you guys in here, which is pretty cool, is in the terminal or the CLI version, if you do effort, you can see we have the slider.

00:01:47.985 --> 00:02:02.410
Like I said, it's going to default to using high, but you can also do low or medium, and you can come up here to xhigh, max, or ultra code, which is xhigh plus workflows. So it's very, very smart over here. But, of course, it's gonna cost more from the token perspective.

00:02:02.490 --> 00:02:18.755
And then the further left you go on this slider, the faster your outputs will be. Of course, we can dig into the actual benchmarks, which I like to look at, but the thing about the benchmarks is every single time you see a new model, the benchmarks are amazing. Right? It's always better than the other ones, and you've always got these other comparisons.

00:02:18.915 --> 00:02:23.190
So, obviously, that's what they have to do from a marketing perspective. So it's really important for you to understand

00:02:23.590 --> 00:02:32.470
what model is actually the best for your use case. Like, maybe the case is, yeah, Opus 4.8 really is better at agentic coding than something like Codex with GPT 5.5.

00:02:32.470 --> 00:02:41.585
But maybe for your very specific use case, Codex is just performing way better even if the explicit benchmarks don't say that it should. Like right here, for example, I think that Codex with GPT 5.5

00:02:41.585 --> 00:02:45.985
is much, much better at agentic computer use than Opus 4.7 and Opus 4.8,

00:02:46.225 --> 00:02:48.990
even though these two Opus models apparently,

00:02:49.150 --> 00:03:09.185
objectively, would be better at agentic computer use than Codex. So always take these benchmarks with a grain of salt. Anthropic took a whole section of this blog to call out that one of the most prominent improvements is Opus 4.8's honesty, which I think is interesting because that's definitely something that I noticed with Opus 4.7, as we're gonna dig into over here as far as, like, problems that people have reported with Opus 4.7.

00:03:09.265 --> 00:03:31.705
But they took time to call out the honesty here. We train all our models to be honest, to avoid making claims they can't support, like saying, hey, this is gonna take me four hours, and then it takes twenty minutes. Or saying, hey, I finished, I pushed all 50 into blah blah blah, but it only actually pushed 15. So if you guys have ever felt that, you're not alone. And apparently, Opus 4.8 is much better at that. And so they actually have evaluations to test this sort of stuff, which is about misaligned behavior.

00:03:31.865 --> 00:03:43.480
And you can see here that in this case, a lower score is actually better. So right here, we've got, like, Mythos Preview coming in pretty low. Opus 4.8 comes in at almost half of what Opus 4.7 and Sonnet 4.6

00:03:43.480 --> 00:03:50.840
come in at. But take a look at this. Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor 4.7.

00:03:50.920 --> 00:03:57.915
Obviously, there's still more work to be done. But what they say here is that they plan to release a new class of model with even higher intelligence than Opus,

00:03:58.075 --> 00:04:05.275
which is Mythos. You can see a small number of organizations are currently using it for cybersecurity work, but models of this capability

00:04:05.435 --> 00:04:24.595
require stronger cyber safeguards before they can be generally released to the public because we don't want some random kid in their basement hacking into your bank account. But, anyways, Opus 4.8 is available everywhere today. So wherever you're using Claude Code, you should be able to access it. Open up a new terminal, open up a new extension tab, whatever it is, and you can see right here we have Opus 4.8.

00:04:24.595 --> 00:04:36.720
And you will notice right away that we still have our 1 million context window with Opus. I could come in here, and I could type in, I just did that twice in a row, /model, and we can choose between default, Sonnet, or our Opus 4.8,

00:04:36.800 --> 00:05:18.235
which is most capable for most work. But, anyways, Opus 4.7 was released April 16. So basically about a month and a half ago. They're moving really, really quick here. And when Opus 4.7 came out, they added the xhigh effort level, which now has been dwarfed by max and ultra code. But what's interesting is a lot of people actually weren't happy about that model release because they felt like it was actually worse than Opus 4.6. So some of those main problems were it felt lazy. It was just basically giving up on the goal, on the task, too early. So, you know, Codex had /goal, and now a bunch of other different AI tools have /goal. Claude Code has /goal. And that was kind of like a Band-Aid fix to put on top of the model to help it work a little bit longer towards some sort of specified goal.

00:05:18.475 --> 00:05:33.170
But now that is just a core, fundamental piece of the model. Not exactly the /goal, but just the idea that it's going to be less lazy and it's going to be better at working for a longer time. 4.7 was also said to be overly rigid with safety overreach. There was a ton of community feedback

00:05:33.170 --> 00:06:21.495
on the token burn and how much more expensive this model seemed to be. And the one that I think is the funniest is saying that it had an attitude, which honestly is true. If you've ever had it sort of get a little sassy with you or push back on your own ideas, it's good to have that sort of, like, brainstorming thought partner. But I have noticed that sometimes it did sort of come off, like, very short and almost, like, stubborn. So those were some of the main problems that I felt, but also that the community had felt with 4.7. Now there is a big difference here between the model having problems and also, like, you using the model wrong. It's not always a model problem. Sometimes it truly is a skills problem, and the answer isn't just, oh, well, 4.7 can't do this, let me wait for 4.8. Sometimes it is a user error thing. So I just did wanna call that out as well. So, anyways, 4.8 obviously comes out today, and it was built to fix this stuff. Right? It was built and said to have more honesty and self correction,

00:06:21.735 --> 00:06:33.030
more sustained autonomy on long running tasks, a warmer and a more collaborative vibe, and just efficiency and quality of life, meaning better tool calling, better reasoning, better question asking, better token efficiency,

00:06:33.190 --> 00:07:05.330
stuff like that. And so what I did is I read through a lot of the stuff that the community was talking about. I tested out Opus 4.8 a little bit. Obviously, it came out an hour ago, so I haven't been able to deep, deep dive into it yet. But I've tested it out. I also read through this documentation here about the prompting best practices, which is a pretty long article from the Claude API docs, which I will also link in the description if you wanna check it out. But after reading through all this stuff, there's a few takeaways that I wrote down and that I wanted to share with you guys. The first one is that effort is the number one lever now. And when I look back at one of these problems right here, like maybe the laziness or the safety overreach,

00:07:05.410 --> 00:07:07.490
maybe that was an effort issue. Because,

00:07:07.970 --> 00:07:30.350
basically, if you were doing something that takes a lot of effort, but you've got the model set to low or, you know, medium or even just high, sometimes you just need more effort. And on the other side of the spectrum, if you're doing something that's really simple and you have that on high or extra high, then maybe you're also, like, dedicating more resources than you need, and the model is gonna overreason and overengineer. And that's where you're like, okay. This is so easy. Why can't it do it? It's simple.

00:07:30.670 --> 00:07:46.455
But maybe you just needed to turn down the effort. And so it really is a balance here between, you know, Claude's intelligence and the token spend and the speed and all that kind of stuff you're looking at. But the point I'm trying to make here is if you are one of those people that opens up Claude Code and you just start typing and doing your work and building and you never tweak the model,

00:07:46.775 --> 00:07:57.340
start trying. Because the difference between Opus 4.8 on low and Opus 4.8 on extra high is a significant difference, like almost to the point where it feels like a different version, like an Opus 4.9.

00:07:57.420 --> 00:08:08.485
So it's definitely worth starting to pull on that lever a little bit if you never have. So the next one I've got is tell it what to do, not what not to do. And really the way I got there is if you go through this documentation,

00:08:08.805 --> 00:08:28.520
it always shows you good example prompts, right, that you could copy for specific scenarios. And when I looked through a lot of these example prompts, I realized that it wasn't really saying a lot of what not to do, or, I mean, that's horrible timing because right here it says do not do this. But it always tells more explicitly what to do, and what I thought was cool is it gives background.

00:08:28.520 --> 00:08:29.400
It gives context.

00:08:29.855 --> 00:08:44.015
Almost as if the model is sort of, like, curious, and it's gonna say, hey, you told me not to do x, y, and z, but why? And the more that you can contextualize that stuff, the better it's gonna be able to follow those instructions. And that leads me to the next one right here, which is give the why behind an instruction.

00:08:44.440 --> 00:08:56.440
So rather than saying don't use em dashes, say something like, I want this to come off like I'm really writing it. And this is my writing style, and I never use em dashes, so make sure you're following my writing style. And that is going to have a little bit better

00:08:57.695 --> 00:09:04.335
feeling of Opus actually following your instructions. If you guys have seen my comparisons between Opus and GPT 5.5,

00:09:04.655 --> 00:09:20.730
one of the things that I've said is that I love how creative Opus is, but sometimes I want it to just do the thing and I want it to do it exactly how I want it. And so maybe that's an issue of effort and also giving it too many negative prompts. You know what I mean? So it's a mix of the model, but also take accountability on yourself a little bit and think, how can I actually

00:09:21.530 --> 00:09:47.040
use the model the way that the engineers of the model have actually told me to? Anyways, a few other ones. It's going to default to reasoning before calling tools. So it's gonna try to figure out, you know, the questions to ask and the approach to take on its own with what it has right now before it looks to spawn a subagent, for example, or to go read that database or to go do this. And sometimes that's really good. Right? Sometimes you want that reasoning before it starts doing things, but sometimes

00:09:47.200 --> 00:09:56.355
you want that extra context to be pulled in before the reasoning starts. So that's why it's really important to play with your prompting, obviously, to play with your effort level,

00:09:56.595 --> 00:10:24.835
and especially when you're switching over all these workflows from 4.7 to 4.8, you don't just switch over and say go and blindly trust that everything's gonna, you know, stay the same. You kind of wanna watch it a little bit to get a feel for how the model behaves. And then when you look at things like response length and verbosity, you can see here that I said it calibrates its own length. And basically what I meant by that is that it's going to judge how much it should do and how it should respond based on the complexity of the task rather than defaulting to some sort of fixed verbosity.

00:10:24.995 --> 00:10:59.035
So this usually means shorter answers on simple lookups, and you'll get longer ones on more of an open-ended analysis that takes more reasoning. So, anyways, those are some of my main takeaways. Obviously, like I said, I've only played around with this model for about half an hour. I wanted to get this video out quick, but as I find more stuff out, I will continue to update you guys. So last two pieces here. What are people saying right now? So, obviously, there's a lot of different feelings. Right? A lot of positive and excited stuff. Right? People are saying, oh, this already one shotted my GPT 5.5 right here. Strongest coding model yet. I'm hooked. This is super warm, super collaborative, big jumps in, you know, benchmarks. But once again,

00:10:59.620 --> 00:11:03.780
a lot of these people have an incentive to post stuff like that or say stuff like this because,

00:11:04.020 --> 00:11:06.820
I don't know, they want engagement or they're marketing something.

00:11:06.980 --> 00:11:37.950
And so, obviously, it's great to look at the full spectrum, which is why I also pulled in some mixed and cautious reports, like some early reports of bugs already in Opus 4.8. Maybe just because of the rollout, whatever it is, people are still testing it. So there's a lot of stuff to still be cautious about. But the overall vibe, which I think is pretty cool, is that it's almost like we have, like, these, you know, four or five main bullets of 4.7 problems, and most of these improvements that we're reading about from 4.8 are directly hitting those 4.7 problems and pitfalls. So at least we're getting that sense of

00:11:38.590 --> 00:12:19.245
Claude using that data to make it better. And if you really think about it, think about the way that you use Claude Code. Right? You ask something. Claude Code responds. You correct it. And you have this back and forth of, like, I don't like that. Do this better. Blah blah blah. I'm your master. And then what happens is because, obviously, Anthropic can read those logs and, you know, train their models on that data, it's able to say, okay, what are people not happy with Opus 4.7 about? What are they constantly saying? And let's just bake that into the model. So really, it would concern me if a lot of these key problems weren't being addressed, like, head on. Anyways, the key thing that I want you guys to always think about is that benchmarks look great, and they always will.

00:12:19.565 --> 00:12:20.045
And

00:12:20.445 --> 00:12:29.645
someone else's use case is not your use case. So figure out right now in your workflow, in your Opus 4.7 workflows, what are your problems? What do you typically get frustrated by?

00:12:30.340 --> 00:12:39.460
And maybe Opus 4.8 fixes those problems, but maybe it won't. Even though the model is better, that doesn't mean it's better for that specific problem. So just always be thinking about

00:12:39.860 --> 00:13:01.670
how you can work in different models or different context strategies or different effort levels to directly address the actual constraints and pain points that you're having right now. So look for things like the vibe upgrade. Look for things like how often you're correcting this thing and giving it the same instruction over and over. Obviously, you should be working in, like, memory and different skill files and things like that to address that repetition, but still.

00:13:01.990 --> 00:13:11.350
And then, of course, the whole token and workflow efficiency feeling. You typically get a sense of when you're getting near the end of your session limit and when you need to pull back a little bit. Apparently, based on the documentation,

00:13:12.145 --> 00:13:43.848
this model is more efficient with tokens, but we don't actually know yet. And one great way to test that kind of stuff out is you can use my token tracker, my token dashboard tracker, which is completely free. It's just an open source GitHub repo. I will leave that in my free school community linked in the description. Just give Claude Code the GitHub repo, tell it to set it up, and it will pull in all of your historical data with Claude Code, and you can see where your tokens are actually going. But, anyways, that is gonna do it for today. I hope you guys enjoyed this one or learned something new. And if you did, please give it a like. It helps me out a ton. And as always, I appreciate that you guys made it to the end of the video, and I'll see you on the next one. Thanks, guys.
