WEBVTT

00:00:00.000 --> 00:00:06.480
When you give Claude Code the ability to instantly watch any video on the Internet for free, it becomes genuinely

00:00:06.480 --> 00:00:07.520
unstoppable.

00:00:07.520 --> 00:00:16.495
With this Claude skill, Claude can understand video as well as it reads PDFs. Hours-long YouTube videos, Instagram Reels, Looms, local files,

00:00:16.735 --> 00:00:39.590
anything. Before, Claude was just guessing. Now it can watch the whole thing frame by frame instantly. It's like Neo plugging into the Matrix. By the time you've hit play, Claude's already watched the whole thing and become an expert. I tried a bunch of transcript tools before developing this one, and they all let me down. They either cost way too much or they only ever read the transcript and missed half the video. This skill gives Claude the frames and the audio together,

00:00:39.830 --> 00:01:12.510
so it actually sees what's happening on screen. Right now, I'll walk you through exactly how it all works, the use case that completely changed how I consume content, and how to set this up in your own Claude Code in under five minutes. Here's what it actually looks like on a forty-five-minute video, done in just a few minutes. On the left, I have a YC lecture from Sam Altman about how to start a startup. I'm gonna press play on that now and then grab the URL. All I have to do is go over to Claude, type /watch, and then paste the URL here. Then Claude gets to work grabbing the subtitles from YouTube,

00:01:12.105 --> 00:02:14.815
extracting the frames, and analyzing them all together. So the reason this is better than just pulling the transcript is that Claude can actually grab the frames from this video. In this lecture, Sam goes through and shows a bunch of really great graphs. And this is important context for Claude because if you're only getting the transcript, you're only getting half of the information. Now here's where most of the existing video tools fall short, because they base everything around the transcript. When something happens on screen and it's not explicitly referenced in the transcript, Claude doesn't know about it, and you miss out on key context, which matters because half of the interesting stuff in a video isn't said out loud. It happens on screen. So this skill actually watches. It pulls frame-by-frame screenshots and puts them together with a per-second timestamped transcript to give Claude the full picture and full context. And just like that, we're only two minutes into the lecture. Sam is still introducing what he's gonna talk about today, and Claude has already ingested the entire thing. I have a structured summary of all of the speakers. I can see exactly what they talk about, and now I can actually query Claude on anything about this context

00:02:14.895 --> 00:02:23.050
and then start to put it to work instantly right here in the terminal. That's a forty-five-minute video done in less than two minutes: watched,

00:02:23.050 --> 00:03:41.950
analyzed, and applied. That's the Matrix moment. You're not watching content anymore. You're actually downloading context automatically and putting it to work straight away. And you're probably thinking at the moment, there's some expensive API doing the heavy lifting here, but there isn't. But before we get into that, let's get into the setup. By the way, I'm giving this whole skill away for free on GitHub. The link is in the description below. Just run these install commands and the setup takes care of the rest. Once the skill is installed, Claude runs the setup script and installs any dependencies that you don't have already. It authenticates with the transcription API. Don't worry. This one is pretty much free and we'll get to it in a second. But under the hood, the pipeline is actually surprisingly simple. Now here's the part that nobody really talks about. Claude can't actually watch video because Anthropic doesn't have a video model yet. There are some other providers that can, like Google's Gemini model, but they're pretty expensive and they don't integrate nicely with Claude. So if you're watching a lot of content, that bill stacks up pretty fast. Luckily, there's a smarter way to do this, because if you really break it down, a video is just two things. It's a bunch of frames and a transcript. That's it. So instead of paying for another expensive model, I can just split the video into those two pieces and hand it to Claude in a format that it already knows how to read: pictures and text. Now this is the part I love, because the skill is doing this with two of the oldest, most battle-tested command-line tools on the Internet, yt-dlp and FFmpeg.
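
NOTE
A minimal sketch of the download half of the pipeline described in the cue above, not the skill's actual code.
It only builds yt-dlp command lines (one for existing captions, one for the video file) and does not run them.
The function names, the "work" output folder, and the choice of English captions are assumptions made for illustration.
```python
# Sketch only: builds argument lists for yt-dlp, runs nothing.
# captions_cmd, download_cmd, and the "work" folder are hypothetical names.
from typing import List
def captions_cmd(url: str, out_dir: str = "work") -> List[str]:
    """yt-dlp call that fetches uploaded or auto-generated English captions, without the video."""
    return [
        "yt-dlp", "--skip-download",
        "--write-subs", "--write-auto-subs",
        "--sub-langs", "en",
        "-o", f"{out_dir}/video.%(ext)s", url,
    ]
def download_cmd(url: str, out_dir: str = "work") -> List[str]:
    """yt-dlp call that saves the video file so FFmpeg can pull frames from it."""
    return ["yt-dlp", "-o", f"{out_dir}/video.%(ext)s", url]
```
Either list could be passed to subprocess.run(cmd, check=True) once yt-dlp is installed.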

00:03:41.950 --> 00:03:58.035
These aren't MCPs. They're not some new wrapper. There's no third-party service involved in the middle here. They install once, right on your machine. Millions of developers have used them for over a decade now. They're rock solid and completely free. And they're what every video tool you've ever touched is probably using under the hood. yt-dlp

00:03:58.035 --> 00:04:04.355
is the downloader. You can think of it as a right-click, save video, but it works on basically the whole Internet. FFmpeg

00:04:04.355 --> 00:05:30.195
is the video engine. It takes the video and turns it into two things that Claude actually wants. First, screenshots, which are taken every few seconds all the way through the video, and then second, the audio file, which is pulled out as a clean little file ready to be transcribed into text using Whisper. Now Claude has the full picture when we put these two together. It's flipping through the screenshots like a flipbook, reading the transcript like a script, and the timestamps line up exactly, so it knows what's on screen when something is being said. So that's the whole pipeline. yt-dlp and FFmpeg doing all the heavy lifting locally on your laptop for free. The only thing we actually have to pay for here is the transcription and Claude usage. Transcription is pretty much free. If the video already has captions, the skill just pulls them. And if it doesn't, it transcribes the audio using Whisper hosted on Groq or OpenAI. I prefer Groq because it's extremely fast, and their free tier covers basically anything you throw at it. So most videos cost you literally nothing to transcribe. I even used this exact skill to grow a universal context layer for content research, and I'll show you exactly how it works in a minute. And I can literally hear the keyboards clattering right now. "Brad, this is gonna torch your token budget." But this actually surprised me, so let's do the math. The skill scales frame count to video length, and it actually caps anything over thirty minutes at 100 frames. So a thirty-minute video and a one-hour video pretty much cost the same amount in dollar terms, and that's about $1 per run. I ran every test in this video three times in parallel and burned less than 10% of my session, and that's over five hours of video watched live by Claude with transcription.
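
NOTE
A rough sketch of the frame math and the FFmpeg step from the cue above, under stated assumptions.
The 100-frame cap and the thirty-minute threshold are from the video; the linear scaling below the cap, the 16 kHz mono audio format, and every name here are assumptions, not the skill's actual code.
```python
# Sketch only: builds FFmpeg argument lists, runs nothing.
# The cap and threshold are quoted in the video; the scaling rule is an assumption.
import math
from typing import List
FRAME_CAP = 100           # cap quoted in the video
CAP_AT_SECONDS = 30 * 60  # length at which the cap is reached
def frame_budget(duration_s: float) -> int:
    """Frames to extract: grows with video length, never more than FRAME_CAP."""
    seconds_per_frame = CAP_AT_SECONDS / FRAME_CAP  # 18 s between frames below the cap
    return max(1, min(FRAME_CAP, math.ceil(duration_s / seconds_per_frame)))
def frames_cmd(video: str, duration_s: float, out_dir: str = "work") -> List[str]:
    """FFmpeg call that spreads the frame budget evenly across the whole video."""
    interval = duration_s / frame_budget(duration_s)  # seconds between screenshots
    return ["ffmpeg", "-i", video, "-vf", f"fps=1/{interval:.3f}", f"{out_dir}/frame_%04d.jpg"]
def audio_cmd(video: str, out_dir: str = "work") -> List[str]:
    """FFmpeg call that drops the video stream and keeps mono 16 kHz audio for transcription."""
    return ["ffmpeg", "-i", video, "-vn", "-ac", "1", "-ar", "16000", f"{out_dir}/audio.wav"]
```
Under this assumed rule a thirty-minute and a one-hour video both get 100 frames, which matches the claim that they cost about the same.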

00:05:30.195 --> 00:06:03.385
And the transcription part's where it gets ridiculous. Every YouTube video comes with a free transcript. The skill just pulls it. There's no Whisper, no API call. It's totally free. And that goes for a bunch of other sites too. Whisper only kicks in for the stuff without captions, like a raw MP4, a Loom, or an Instagram Reel. Groq's free tier actually gives you two hours of transcription per hour, which covers more than you'll realistically throw at it. I've used this skill every day for two weeks, and I'm still on the free tier. It's crazy. Look. I'm not saying this is perfect, and there are probably optimizations I haven't thought of, but for most people watching, this is essentially

00:06:03.180 --> 00:06:11.340
free. If you've got ideas to make it cheaper or quicker, drop them in the comments below. Once I realized this was basically free, I started running it on everything,

00:06:11.340 --> 00:06:31.225
which is how I ended up building the system I'll show you at the end, and it's one that's genuinely changed how I consume content. Here's the part that actually makes this skill a must-have. It works on any URL yt-dlp supports, which is over a thousand sites. So this isn't just limited to the big social media companies or YouTube. And it even works if you have the files locally

00:06:30.820 --> 00:08:35.815
downloaded. So that opens up a bunch of use cases that you probably wouldn't expect. So this is what I'm doing for content research now. I take a winning video from the Internet, and I ask Claude to break down the hook. Claude tells me the visual setup, the exact words, where the pattern interrupt lands, and what's on screen at the moment of the hook. That used to take me ten minutes per video, pausing and scrubbing. Now it's just a paste. And for developers, there's another use case: debugging screen recordings. If you're a developer and a UI bug shows up, you record a thirty-second screen recording, drop it into Claude, and ask what happens right before the crash. Claude reads the frames around that moment, finds a state change, and tells you the exact frame the issue starts with. That alone has saved me hours. The skill also has a zoom flag, with a start time and an end time. So you can drop those in and focus frame-by-frame extraction on a specific window of a video. So you can ask about a ten-second segment of a two-hour video without burning your entire context window. Whatever you're using video for, you can probably stop watching it manually because of this skill. So earlier, I told you that once you start using this thing, it seriously starts to change how you consume content. Now I wanna show you my personal favorite use case for it, which is feeding my second brain. I keep a knowledge base in Obsidian with notes, snippets, ideas for content. And the bottleneck for me has always been throughput, because there's just so much good content out there by creators at the moment. There's not enough time to watch it all and write it all down. So I let Claude do both. I give it every single competitor that I think makes great content, and then from there, Claude uses the watch skill to automatically watch it and feed it straight into my second brain.
So Claude watches each of these videos, frames, audio, everything, and then comes back with a clean structure and notes about what made the video work. It fills that straight into the second brain. And this is where things start to compound, because the skill and your second brain are watching more and more videos, getting more and more context, and getting smarter automatically over time. The second brain side of this whole thing is a video on its own, and I walk through exactly how I run mine: content research, competitor intel, every podcast video I've ever listened to, all in one searchable layer in Obsidian. If that's where you wanna take this, that's the next video to watch. It's linked up here. If this was useful, hit subscribe. Thanks for watching, and I'll see you in the next one.
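
NOTE
A minimal sketch of the start/end zoom window mentioned in the cue above, assuming it maps to FFmpeg's -ss and -to seek options.
The skill's real flag names are not given in the video, so the function names and the default of 20 frames per window are hypothetical.
```python
# Sketch only: builds an FFmpeg argument list for one time window, runs nothing.
# to_seconds, window_frames_cmd, and the 20-frame default are hypothetical.
from typing import List
def to_seconds(stamp: str) -> int:
    """Convert 'SS', 'MM:SS', or 'HH:MM:SS' into whole seconds."""
    total = 0
    for part in stamp.split(":"):
        total = total * 60 + int(part)
    return total
def window_frames_cmd(video: str, start: str, end: str, frames: int = 20, out_dir: str = "work") -> List[str]:
    """FFmpeg call that extracts `frames` evenly spaced stills between start and end."""
    span = to_seconds(end) - to_seconds(start)
    if span <= 0:
        raise ValueError("end must be after start")
    rate = frames / span  # frames per second inside the window
    return ["ffmpeg", "-ss", start, "-to", end, "-i", video, "-vf", f"fps={rate:.3f}", f"{out_dir}/zoom_%04d.jpg"]
```
A ten-second window with the default budget works out to two frames per second, far denser than a whole-video pass.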
