Transcript of WAN 2.2 Fun with Video Control and First/Last Frames in ComfyUI
Video Transcript:
Last week, we looked at some ways to control image generation with the open-source Qwen model, and this week it's video control for WAN. We've got text to video, video to video, image to video, control video to video, and so much more. I've neatly arranged things into their little groups, so you've got less spaghetti and even more ease of understanding about how all the bits fit together. Once again, we're going to be using nodes from the awesome Kijai, and if you're looking to make this workflow yourself at home for free, then check out all the example workflows provided with that custom node pack, as that's where I started from too. Alternatively, if you enjoy these workflows and find them helpful, you can support the channel via Patreon. That helps me create even more workflows for you and to share these instructional videos with everyone. Of course, the choice is yours, and a massive thank you to all those who are able to contribute.
Now, rather than repeat myself with every tiny detail of this workflow again, if you start with the WAN 2.1 refresher video and then the WAN 2.2 update, you'll be up to speed for this video in no time at all. So, let's get into it. I'm focusing on the WAN Fun 14B model here, which works best with a decent amount of VRAM, such as 24 gig, but the block swap option can help reduce VRAM requirements too, and we'll look at that in just a second. Do check out the 5B examples as well if you're running with less; it's a slightly different set of models there, but the general principles remain the same. As an alternative, if you're looking for a super simple 5B workflow, then browse templates is your friend. There we have WAN 2.2 5B Fun Control. And if that isn't enough, then smaller GGUF files are also available for 14B as well. Far too many options to cover in just a single video, but just to make you aware of some of the other files and workflows which are available.
Okay, back to focusing on what we've got here before all those different choices get completely overwhelming. Now, one great thing about WAN Fun Control is we no longer need to pick a text or image to video model like in the previous videos, meaning this time around there's just one loader group. It still has the high and low models, but they can be used for everything from text to video right up to control videos with images. Very nice indeed. All the files shown here are available from Kijai's file stores, and there I've got the WAN Fun Control A14B high and the WAN Fun Control A14B low. Both of these are FP8, and of course I've got the quantization enabled. To speed things up, we've got some WAN Lightning LoRAs, and these from LightX2V are great because it means you can do things in just four steps, although I'm actually going up to six in this workflow. Here they are then, those LoRAs: one for the high model and one for the low model. Personally, I like to have quite a high strength for the high model, but do play around and see what is best for you. Here I've got it set to just a little bit under two. A value of one will work just fine, although I felt it didn't quite follow the controls as well on the lower values, so that's why I've put them higher. Could be my imagination, but if it works, then it works. If you want to use more than one LoRA per model, then Kijai also has this WAN Video Lora Select Multi node, so wire that one up instead if you want to use loads of LoRAs at once. Keeping it simple for now, of course, so just the one LoRA per model for today.
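If you're wondering what those LoRA strength values actually do to the model, here is a minimal sketch of a standard LoRA merge in PyTorch. To be clear, this is not Kijai's loader code; the function name, the lora_down/lora_up naming and the alpha-over-rank scaling are just the common convention, shown for intuition.

```python
import torch

def apply_lora(weight: torch.Tensor, lora_down: torch.Tensor,
               lora_up: torch.Tensor, strength: float, alpha: float) -> torch.Tensor:
    # weight:    [out_features, in_features] base layer weight
    # lora_down: [rank, in_features]  (the "A" matrix)
    # lora_up:   [out_features, rank] (the "B" matrix)
    # strength:  the slider in the LoRA select node (roughly 1.0 for the low
    #            model, a bit under 2.0 for the high model, as in the video)
    rank = lora_down.shape[0]
    delta = (lora_up @ lora_down) * (alpha / rank)  # low-rank update B @ A
    return weight + strength * delta

# Toy usage: a 512x512 layer with a rank-16 LoRA merged at strength 1.8
w = torch.randn(512, 512)
down, up = torch.randn(16, 512), torch.randn(512, 16)
w_high = apply_lora(w, down, up, strength=1.8, alpha=16.0)
```

Stacking several LoRAs, as the multi select node allows, is conceptually just repeating that addition once per LoRA.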
And the final couple of model loaders here are for the T5 text encoder and the VAE, just the same as usual. A few optional groups next, starting with torch compile. Again, the same as in previous videos, as a way to help speed things up. As mentioned earlier, the block swap option can help to save a bit of VRAM. Native ComfyUI does this automatically, but here you get a little bit of manual control. I've also included some experimental options, and typically things like skip layer guidance would improve video quality whilst potentially sacrificing a bit of performance, though for WAN 2.2 it seems to be just fine without these options, so they're just there for your experimentation. The block swap and LoRAs group there basically just applies the same block swap options to both models, the high and the low, as well as those Lightning LoRAs, just like in my previous WAN 2.2 video.
The prompt is the same for all examples you're going to see today, and of course, the best prompts always include some type of rodent. So, I've got mine as the pilot of an intergalactic spaceship, there's a cheese sandwich on deck two, and I've opted for a realistic style. There are bucketloads of options, and I suggest not running all of those at once unless you've got a beast of a system. Typically, I just enable a single option at a time. So, we can just turn those off. There we go, and then you're just doing text to video. Or maybe you want to do video to video. Either way, you've got them all there: text to video, video to video, image to video, control video such as with depth or pose, control video with first frame and last frame, or a control video with a reference image as well. There's loads more you can do, but I figured that should be enough to get you started on the journey.
Okay, starting with option one then: text to video. This is the same as my previous WAN 2.2 workflow video, apart from we've got a new node here, WAN Video Add Control Embeds, and that's essentially what makes this new model work. Standard empty embeds going into the new node, and then that goes into the image embeds. Zooming out a little so you can see everything there. It is of course pretty much the same as last time: we've got the high model, the low model, and then of course we render the video. Because we're using those Lightning LoRAs, we can get away with CFG 1. And I've got a shift value; it's the same in each one, but you can vary them. A shift value of six is fairly decent, it gives you a good amount of change. The lower the shift value, I found, the less it tends to change, and the higher the shift value, the more it's going to change. The generated video, I think, is pretty decent. I've got my rodent dude. He's got a rather curious set of armour on there or something, he's got his holographic interface, and the cheese sandwich looks a bit strange, more like a sort of croissant on a napkin, but you know, overall, I think that's pretty decent.
Option two then: video to video. So, we're not using a control video as yet; this is that raw video going in. We've got the WAN Video Encode node, which is changing it into samples, and there, as you can see, going into the sampler. You can change some options on this WAN Video Encode node, like I've got noted there. The noise aug strength is pretty powerful, so even if you just add 0.001, that's a fair amount of change. The latent strength as well: if you go less than one, it's going to change more towards what you prompt for.
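As a rough mental model of those two encode-stage options, here is a hypothetical sketch of what a noise aug strength and a latent strength could be doing to the encoded samples. The exact maths inside the WAN Video Encode node may well differ; treat the names and the formula below as assumptions for intuition only.

```python
import torch

def tweak_encoded_samples(latents: torch.Tensor,
                          noise_aug_strength: float = 0.0,
                          latent_strength: float = 1.0) -> torch.Tensor:
    # latents: [C, T, H, W] video latents from the VAE encode
    # noise_aug_strength: even tiny values (e.g. 0.001) add visible change
    # latent_strength: below 1.0 weakens the input video's influence,
    #                  so the result drifts more towards the text prompt
    latents = latents * latent_strength
    if noise_aug_strength > 0:
        latents = latents + noise_aug_strength * torch.randn_like(latents)
    return latents

# Example: the mild setting used for plain video-to-video above
samples = torch.randn(16, 21, 60, 104)          # dummy latent video
samples = tweak_encoded_samples(samples, noise_aug_strength=0.001)
```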
As in my previous video, I'm using the Video Helper Suite video info node there to get the FPS, and we're still using empty latents here. So here we've got the image embeds that go up into that add control embeds node, and for this one, I probably wouldn't change the start or end percentage, otherwise you might get weird results. As for the sampler, this is the one where you have to change the denoise strength; it's the only option here where you wouldn't really use a denoise strength of one to start with. I've got some notes down here on that denoise strength. Anything above 0.85 and it's probably going to change the video entirely and not look like the original one at all, so try and keep it below 0.85 unless you really do want masses of change. Once again, the shift value will allow it to change more. And there's this other option down the bottom here, add noise to samples, because we are sending those samples in. You don't have to set that to true, but if you do, it can add some more noise and help with the change. The resulting video in this case is this one. As you can see, the arm motions are definitely following that input video. With the denoise strength of 0.79, it's following it quite closely, but it has been enough to change the face. So, you can see it's definitely following the motions, but it's no longer that original woman; I've got my rodent hero instead.
On to option three then: image to video. Again, like my last WAN 2.2 video, we've got the WAN image to video embeds where you can have the start and the end image. This time we've got the start image of one particular rodent dude and the end image of a slightly different rodent dude. The resulting video is nothing like the others and not really following the prompt at all, because those input images are really, really strong and I've got the same prompt for everything. I think when you're doing image to video, you'll really want to go along with what your images actually contain. So, here it's sort of automatically doing a morph between the two.
With option four, this is where we start to use the power of the WAN Fun model with our control video. We've got the same input video, but this time there's a Depth Anything node in there instead. So rather than sending that raw video in, we've got the Depth Anything V2 node there, which changes it into a depth map. This time the noise aug strength on the WAN Video Encode node can go a tiny bit higher, like you see there; I've got it at 0.015. Now, one thing to note when you're using control videos like the depth one here is that the more early steps you do, the more it's going to be like the control versus being more like your prompt. So, it's quite good keeping that down at just two so it follows your prompt more, but I've increased the total number of steps just because I thought it looked better. For the samplers, I've gone with different shift values again: nine to start with, so it does quite a lot of change, and then a shift of five on the second one. The result in this case is very funky indeed. You can tell it's definitely following that video, even just from the depth input, and her face is interesting. Yeah, she's almost like a rodent. Again, it hasn't quite got the chair, there's a little bit at the bottom there, or the holographic controls, but yeah, that's very nice.
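Since shift values and the split between early (high-noise) and later (low-noise) steps keep coming up, here is a small sketch of the flow-matching style time shift commonly used for these schedules, plus the two-then-four step split used above. This is an illustration of the general idea, not the wrapper's actual scheduler code, so treat the formula as an assumption.

```python
import torch

def shifted_sigmas(steps: int, shift: float) -> torch.Tensor:
    # Start from a simple schedule running from 1.0 (pure noise) down
    # towards 0.0 (clean), then apply the usual time shift: a higher shift
    # keeps sigmas high for longer, so the sampler changes more overall.
    t = torch.linspace(1.0, 1.0 / steps, steps)
    return shift * t / (1.0 + (shift - 1.0) * t)

total_steps, high_steps = 6, 2                     # 2 early steps, 4 later steps
sigmas = shifted_sigmas(total_steps, shift=5.0)
high_part, low_part = sigmas[:high_steps], sigmas[high_steps:]
print(high_part)   # handled by the high-noise model (follows the control more)
print(low_part)    # handled by the low-noise model (refines the details)
```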
Another control video example then, but this time, instead of using that depth node, I'm using the DW pose estimator. The result with using pose is, of course, that the image will change a lot more, because you don't have all that extra information from the depth model; it's just got the pose. So now we've got a very interesting background, we've got a huge leather chair, a nice cheese sandwich there, and a little bit of holographic control as well. And definitely nothing weird about that character at all whatsoever. Definitely not his hairstyle, and certainly not the nose either.
On to option five then. For this one, we've got an input video which we're turning into a depth map, but we've also got a start frame and an end frame as well. Let's have a little zoom in and see what we've got going on here. So, we've got our video going in, and we've got the Depth Anything node like before. Then we've got the two input images: the first image there, which we're resizing and sending into the start image, and this second image down here as well, again resized, which goes into the end image. Similar settings for the samplers: two steps to start with, a total of six, and the shift value on both of those set to five. The result in this case is pretty decent. There you go. You've got that woman, and we've essentially turned her into an anime version based on the start and end image. Now, you can see, of course, at the end there it sort of zoomed in a little bit. That's due to the imperfect end image, so if you've got a decent start image and end image, then you'll get much better results than this example here.
And then finally, the control video plus the reference image. So rather than using those very static first frame, last frame things, we're just sort of passing it a reference image so it gets an idea of what it wants to do. Here we have the input image, and of course we're resizing it. That goes into the WAN Video Encode node, with the samples going out there. And here I'm using the WAN Video Control Embeds node rather than the add node, WAN Video Add Control Embeds. That one has an extra option at the top; as you can see, the WAN Video Control Embeds node doesn't have that extra option, so we don't need to pass it in. All it needs is the latents, and we've got the optional fun reference image. The result in this case is pretty decent. Now, the face has changed shape a little bit because, of course, it's a reference image, so it won't follow it exactly; it's going to go more towards the control video that you send in, but I think the result is still pretty decent.
There you have it then. Loads of preset building blocks for you to start with, from text to video down to control video, control with first frame, last frame, and reference image. I think that one Fun video model is absolutely awesome. So much you can do with it. Lots of presets there, but do feel free to share other combinations you've discovered down in those comments.
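To give a feel for how the first frame / last frame option can steer generation, here is a rough, hypothetical sketch of frame-pinning via a conditioning tensor and mask. It is not the actual embed node's implementation; the tensor layout and names are assumptions made purely to illustrate the idea.

```python
import torch

def first_last_frame_cond(start_latent: torch.Tensor,
                          end_latent: torch.Tensor,
                          num_latent_frames: int):
    # start_latent / end_latent: [C, H, W] VAE-encoded start and end images
    # Returns a conditioning video plus a mask saying which frames are pinned.
    c, h, w = start_latent.shape
    cond = torch.zeros(c, num_latent_frames, h, w)
    mask = torch.zeros(1, num_latent_frames, h, w)
    cond[:, 0], mask[:, 0] = start_latent, 1.0    # pin the first frame
    cond[:, -1], mask[:, -1] = end_latent, 1.0    # pin the last frame
    return cond, mask                             # frames in between stay free

start = torch.randn(16, 60, 104)                  # dummy encoded rodent dude #1
end = torch.randn(16, 60, 104)                    # dummy encoded rodent dude #2
cond, mask = first_last_frame_cond(start, end, num_latent_frames=21)
```

A reference image, by contrast, is passed in alongside the control latents rather than pinned to a specific frame, which is why that option follows the control video's motion but only loosely borrows the reference's look.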
Rodent out for now, but above all, do have fun. Nerdy Rodent, he really makes my day, showing us AI in a really British way.
Channel: Nerdy Rodent