WEBVTT

00:00:00.000 --> 00:00:43.945
So look at this. On this day, I saved 91,000,000 tokens because of cache read, and in the past week, I've saved over 300,000,000 tokens because of it. Now don't freak out. This isn't anything that you have to go change. This is happening automatically if you are using Claude Code or Claude. And I know that the concept of prompt caching might seem a little bit overwhelming, but today, I'm gonna make it as simple as possible and only really tell you what you need to know in order to make sure that you are saving your session limits and saving tokens. I'll also give you guys this entire token dashboard for free so you can actually start tracking your tokens a little bit better. Anyway, so let's talk about prompt caching, why your sessions burn out, and how to stop it. So what does caching actually cost you? Well, cached tokens only cost you 10% of normal input. So all the tokens that are getting cached are saving you a ton of money. So if we go back to this example, on this day when I had 91,000,000

00:00:43.945 --> 00:00:44.985
tokens cached,

00:00:45.305 --> 00:01:01.580
that cost me only as if I was processing about 9,000,000 of those tokens. The cache window on a Claude subscription is an hour. Meaning, if you're working with Claude Code and you don't touch it for an hour and then you send another message, everything in that session gets uncached. So if you leave a session sitting for an hour or longer,

00:01:01.900 --> 00:01:25.590
then you're gonna pay more for it. And if you're using Claude via API or subagents, then the TTL or the time to live is only five minutes. You can change that, but it's just a little bit more expensive. You could bump it up to an hour if you want. But for Claude Code inside of your terminal or your extension, whatever it is, that's an hour. And now here's a quote from Thariq from Anthropic. He said that we actually run alerts on our prompt cache hit rate and declare SEVs if they're too low. So, basically, them saying we take this stuff really, really seriously.
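
NOTE
A minimal sketch of the time-to-live rule just described, assuming the TTL values quoted in the video (one hour for Claude Code on a subscription, five minutes by default for the API and subagents). The dictionary keys and the function name are hypothetical labels, not part of any real API.
```python
from datetime import datetime, timedelta
# TTL values as quoted in the video; the keys are hypothetical labels.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "subagent": timedelta(minutes=5),
}
def cache_is_warm(last_message_at: datetime, now: datetime, mode: str) -> bool:
    """Return True if a message sent at `now` still lands inside the cache window."""
    return now - last_message_at < CACHE_TTL[mode]
last = datetime(2025, 1, 1, 12, 0)
twenty_minutes_later = last + timedelta(minutes=20)
print(cache_is_warm(last, twenty_minutes_later, "claude_code_subscription"))  # True
print(cache_is_warm(last, twenty_minutes_later, "api_default"))  # False
```
The same 20-minute pause is harmless under the one-hour window and a full recache under the five-minute one.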

00:01:25.830 --> 00:01:28.550
And if we see that the hit rate isn't very high for

00:01:28.790 --> 00:02:19.805
users' Claude Code caching, then we do something about it immediately. And that's very nice of them, but also, of course, it benefits themselves because with a high cache hit rate, Claude Code feels faster, their serving cost is lower, subscription limits feel more generous, you know, because you're using less, and long coding sessions stay practical. And then if you have low cache hit rate, this is what happens. And, obviously, it's just a lose-lose for everybody. And that's why I said, like, prompt caching can get very, very complex. And if you wanna check out more, then I'll link this article right here, where Thariq really goes into some depth. But if you read this, at least when I did, I was like, okay. This is a little bit overwhelming. I have a feeling I don't actually need to know all of this, but I do need to know at least a little bit, at least, you know, the 80/20 of prompt caching so that I can get the most out of my session limits, and that's what I'm gonna break down today. So let's take a look at an example of how this actually grows. So by default,

00:02:20.260 --> 00:02:42.095
when you shoot off a message to Claude, there's going to be some information that needs to be cached right away. And, actually, let me just switch back to one of Thariq's graphics real quick. You can see here that we have the base system instructions get globally cached. We have tools like read, write, bash, grep, glob globally cached. We have per memory or sorry, per project things like CLAUDE.md and memory, and that gets cached per project. We've got session state,

00:02:42.255 --> 00:02:47.375
and then we have user messages which grow each turn. So now that we take this into

00:02:47.930 --> 00:03:06.425
context, when we flip back over here, this is what it looks like. This is an example where we have four turns. So on turn one, there's no cache. Basically, we're matching on a prefix. So you don't really have to worry about what that means, but I might mention that later. So, anyways, on turn one, there's nothing. Right? We're opening up a fresh session. We load in the system prompt, the project context,

00:03:06.425 --> 00:03:10.505
and we shoot off our first message. And all of this is kind of in this, like, brown

00:03:10.745 --> 00:03:33.065
highlight border, which means that this is new, and it has to be fully processed, and it's being written to the cache here. So before I continue down this graphic, in this dashboard, you can see that we have the difference between cache create and cache read. So on these days, you can see what are my input tokens, my output tokens, and my cache create. And then over here, you can see my daily cache reads. And just a quick explanation,

00:03:33.225 --> 00:03:48.460
a cache create is writing something into cache for the first time. It's a one-time cost, and it pays off the next turn, unless, of course, everything gets uncached. And the cache read is tokens that Claude reused from a cache, like your CLAUDE.md or some of the files or some of the global system instructions.
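
NOTE
A rough sketch of the cache-read arithmetic, assuming only the 10% rate quoted in the video. It ignores the one-time cache create cost, and the function names are hypothetical. Integer math is used so the numbers come out exact.
```python
CACHE_READ_PERCENT = 10  # cache reads cost 10% of normal input, as quoted in the video
def input_equivalent(cache_read_tokens: int, percent: int = CACHE_READ_PERCENT) -> int:
    """Cost of cached tokens, expressed as an equivalent count of full-price input tokens."""
    return cache_read_tokens * percent // 100
def tokens_saved(cache_read_tokens: int, percent: int = CACHE_READ_PERCENT) -> int:
    """Full-price tokens avoided by reading from the cache instead of reprocessing."""
    return cache_read_tokens - input_equivalent(cache_read_tokens, percent)
print(input_equivalent(91_000_000))  # 9100000
print(tokens_saved(91_000_000))  # 81900000
```
So the 91,000,000 cached tokens are billed like roughly 9,000,000 fresh ones.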

00:03:48.700 --> 00:03:53.820
And these are the things that are 10 times cheaper than fresh input. So anyways, on turn two,

00:03:54.060 --> 00:03:56.700
given that we're within that one hour TTL window,

00:03:57.265 --> 00:04:30.135
everything here is already in context, so it's cached. And then all that Claude actually has to process for the first time is reply one and message two, and it caches that. So then down here in turn three, all of that's cached, and we are bumping up a reply and a message, and those are the only things that get processed each time. But if we waited an hour and then we sent another message, or if we changed the system prompt, then everything from the very beginning has to get fully recached. So imagine if you were on message, like, you know, 16 and you're way, way, way over here on the right and you change the system prompt or you wait an hour,

00:04:30.455 --> 00:04:32.295
then everything getting recached

00:04:32.535 --> 00:04:44.010
is going to be a pretty expensive move that you just made. So, anyways, once again, we have the system layer, the project layer, and the conversation layer. The system layer has instructions, tool definitions, output style, and here's where it might break.
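
NOTE
A toy simulation of the four-turn walkthrough, assuming the cache matches on the longest shared prefix of the previous request. Block names and both functions are hypothetical; real caching works on tokens and is more involved than this.
```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Count how many leading blocks are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
def simulate(requests: list[list[str]]) -> list[tuple[int, int]]:
    """For each request, return (blocks read from cache, blocks processed fresh)."""
    cached: list[str] = []
    results = []
    for blocks in requests:
        hit = common_prefix_len(cached, blocks)
        results.append((hit, len(blocks) - hit))
        cached = blocks  # everything just sent is now written to the cache
    return results
turn1 = ["system-v1", "CLAUDE.md", "msg1"]
turn2 = turn1 + ["reply1", "msg2"]
turn3 = turn2 + ["reply2", "msg3"]
# Turn 4 changes the system prompt, so nothing after it can match.
turn4 = ["system-v2"] + turn3[1:] + ["reply3", "msg4"]
print(simulate([turn1, turn2, turn3, turn4]))  # [(0, 3), (3, 2), (5, 2), (0, 9)]
```
Turns two and three only process the new reply and message, while the changed system prompt on turn four forces all nine blocks to be processed fresh.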

00:04:44.330 --> 00:05:05.845
The project level or the project layer has CLAUDE.md, memory, and rules, and then here's when that might break. And then we have, of course, the conversation, which is just like the replies and the messages, which gets recached every time, but that's how it should be. So here's where there's been some confusion among the community. So how long does the cache snapshot live, which is kind of called the TTL, the time to live?

00:05:06.245 --> 00:05:07.605
So on your Claude subscription,

00:05:08.300 --> 00:05:17.820
you have an hour by default because it uses your subscription. But if you go over that weekly limit and you are now playing in your extra usage territory where you are paying per token at API rates,

00:05:18.220 --> 00:05:34.385
then by default, that will be five minutes, which is very dangerous if you're managing multiple sessions and you're constantly recaching everything because five minutes is passing. You gotta be careful about that. And people were kinda suspicious. I don't know if you remember, like, a month or so ago when everyone was complaining about their Claude subscriptions,

00:05:34.385 --> 00:05:38.880
how quick they were eating it up. People thought maybe that they switched the cache TTL

00:05:38.880 --> 00:06:05.410
from an hour to five minutes without, like, saying anything to anybody. It turns out they didn't. So it is an hour, but that's just like you know, there was a lot of confusion around that. And I get why because, honestly, it's not super clear. Like, if you're on an API, you have five minutes by default, but you can increase the cost and you can do an hour, and then your subagents on any plan are gonna be five minutes. And for some reason, all of this is documented about Claude Code and the API, which are two very different things. But for claude.ai,

00:06:05.410 --> 00:06:12.930
like, the web, we don't know exactly how that works. At least, I haven't found documentation on that exactly. I'm assuming it's the same as your subscription,

00:06:13.090 --> 00:06:22.985
but I don't know 100% for a fact. Anyways, three habits that cover 95% of people. Don't pause too long. So if you've gone over an hour on a session,

00:06:23.385 --> 00:06:47.370
just hand it off to a new session. Obviously, start fresh when you switch tasks. So do a slash compact, which will break the cache, or do a slash clear. Or you can also use my session handoff skill, which I will include as well for free. So both the token dashboard GitHub repo and the skill will be in my free Skool community. The link for that's down in the description. But, basically, what that means is let's say right here, I've got this project which helps me build this HTML file you guys are looking at. It's got 205,000

00:06:47.370 --> 00:06:59.465
tokens in here. And if I come in here and just do a session handoff, this basically summarizes everything we've done, all the important files that we've built, all of the open decisions, exactly where to pick back up. And then I basically am able to just copy that summary,

00:06:59.705 --> 00:07:23.705
do a slash clear, and then keep going. And it feels like I haven't actually lost anything. So that has been basically my replacement for doing slash compact. I've just enjoyed doing this better. And sometimes the compact takes a long time. This typically doesn't take anywhere over a minute. There you go. So that is my session handoff. I do a slash copy, and then I just go ahead and clear that, paste it in, hit enter, and now I'm basically right back where I was. And then this last one is for if you're using Claude Chat specifically.

00:07:23.785 --> 00:07:31.310
If you're gonna be pasting big documents in there, you're probably better off doing a project because like I said, I don't know exactly how the caching works in Claude chat,

00:07:31.550 --> 00:07:34.910
but we do have some confidence in saying that projects,

00:07:35.070 --> 00:08:13.085
those files are cached a little bit differently and probably more optimized for storing a bunch of documents compared to just dropping them into your Claude chat. So keep it alive, keep it focused, and start fresh when you switch. Now there's a few other things that were a little bit confusing to me as far as, like, what breaks the cache. So the first one is if you switch the model. So, you know, if you're in here and you're talking to Claude, hello, hello, hello, and then you go in here and you do a slash model and you actually switch the model, that's going to recache everything. Because if you remember earlier, I said it's prefix matching, which I'm not gonna dive into right now. But if you switch the model, then you are switching essentially the prefix, and it can't match on that same cache. So if you switch the model,

00:08:13.405 --> 00:08:18.445
you are recaching everything. Now I do wanna apologize for something here because if you do /model

00:08:18.765 --> 00:08:23.245
opusplan, which is something I've shown before in, like, token hacks videos,

00:08:23.325 --> 00:08:26.765
this basically means it uses Opus for plan mode and then it switches to Sonnet

00:08:27.140 --> 00:08:28.420
for the execution.
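
NOTE
A toy sketch of why a model switch misses the cache, assuming each model keeps its own cached prefix as described for slash model. The class, its method, and the "opus" and "sonnet" labels are hypothetical stand-ins, not a real API.
```python
class PromptCache:
    """Toy cache keyed by model name; real caches are more involved."""
    def __init__(self) -> None:
        self._prefix_by_model: dict[str, list[str]] = {}
    def send(self, model: str, blocks: list[str]) -> int:
        """Return how many leading blocks were cache hits for this model."""
        cached = self._prefix_by_model.get(model, [])
        hits = 0
        for old, new in zip(cached, blocks):
            if old != new:
                break
            hits += 1
        self._prefix_by_model[model] = list(blocks)  # store a copy for next turn
        return hits
cache = PromptCache()
history = ["system", "CLAUDE.md", "msg1"]
print(cache.send("opus", history))  # 0, fresh session
history += ["reply1", "msg2"]
print(cache.send("opus", history))  # 3, same model reuses the prefix
history += ["reply2", "msg3"]
print(cache.send("sonnet", history))  # 0, identical context but a different model
```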

00:08:28.580 --> 00:08:51.645
But if you do that, just keep in mind, that's actually gonna break the cache because you're switching model halfway through. So right here, you can see each model has its own cache. Switching with /model means the next request reads the entire conversation history with no cache hits, even though the context is identical. The opusplan model setting resolves to Opus during plan mode and Sonnet during execution, so each plan toggle is a model switch and starts a fresh cache. So it's very interesting because

00:08:51.885 --> 00:09:02.630
typically the point of that is to save your session limit, and I think ultimately in the long run it does, but it is important to understand that doing that does reset the cache. Now what you can do is you can edit your CLAUDE.md,

00:09:02.710 --> 00:09:55.490
and you can do that mid session because the edit actually doesn't apply until you restart that session, so the cache stays safe. And then, of course, the claude.ai projects caching. It's not exactly documented, but I'm pretty confident that it does help to drop docs in projects rather than in the chat. But, anyways, this token dashboard, like I said, is very helpful to just be able to understand, get a little bit more visibility into your tokens. This does track your tokens on a local device. So if you switch over to a laptop, then your dashboard is gonna look different than on your main, like, PC or whatever you use. But it's very, very simple. It is a GitHub repo. You will go to my free Skool community. The link is in the description. You'll click on classroom. You'll click on all YouTube resources, and then you'll be able to find it right in there. And once you get that GitHub repo, all you have to do is give the link to Claude Code and say, hey. This is a token dashboard. Set this up on localhost. Boom. You've got it open. And it will pull in all of your past sessions. So it's not like you're gonna start fresh as soon as you,

00:09:56.290 --> 00:10:16.115
you know, link in this repo. It will read your past files, it will pull in your tokens. And then, of course, I will also include that session handoff skill that I just mentioned to you guys. So I know this one was super quick. Hopefully, this one was helpful, though. It's just important. Like I said, when I hear about stuff like this, I love to understand it to the point where I know how to use it and I know what's going on under the hood. But truthfully,

00:10:16.115 --> 00:10:42.545
if I looked at some of these other articles, like how in-depth they go and how much nuance there is, most of the stuff right now, I just don't need to know because I'm not using the API in this way super heavily. So the reason I wanted to throw that out there is because it's important to stay updated and follow things, but just understand what do you really need to know at its core. So if you guys enjoyed the video or you learned something new, please give a like. Helps me out a ton. And as always, I appreciate you guys making it to the end of the video, and I'll see you on the next one. Thanks, guys.