Transcript of Wan 2.2 Setup tutorial with GGUF and Lightx2v in ComfyUI | 8GB VRAM
Video Transcript:
Using the GGUF model with Sage Attention and the Lightx2v LoRA reduced the video generation time from 12 minutes to just 189 seconds. In this video, I'll show you how to set up Wan 2.2 in ComfyUI. So, let's get started.

Wan 2.2 video generation is fully upgraded: cinematic-level aesthetic control, more stable and smooth dynamic generation, and more realistic instruction-following capability. There's a cinematic video generation prompt guide to help everyone improve aesthetic control, motion description, and stylization. For common aesthetic descriptions, we can fill in light source types such as daylight, artificial light, moonlight, practical light, firelight, fluorescent light, overcast light, mixed light, sunny light, and so on. Light quality can be described as soft light, hard light, top light, side light, back light, bottom light, rim light, silhouette, low contrast, or high contrast. Time of day can be daytime, night, dusk, sunset, dawn, or sunrise. Shot sizes can be close-up, extreme close-up, medium shot, medium close-up, medium long shot, long shot, or wide angle. For composition types, we can describe center composition, balanced composition, biased composition, symmetrical composition, or short-side composition. For camera descriptions, focal lengths can be written as medium focal length, wide angle, telephoto, ultra wide angle, or fisheye. Camera angles can be written as over-the-shoulder, high angle, low angle, tilted angle, aerial shot, or top-down angle. Camera shot types can include clean single-person shot, two-person shot, single-person shot, tracking shot, or establishing shot. Color tones can be written as warm tone, cool tone, high saturation, low saturation, and so on. For common motion descriptions, we can write motion types such as break dancing, running, skateboarding, playing soccer, tennis, ping pong, skiing, basketball, rugby, bowling, dancing, cartwheels, and so on. Character emotions can be described as angry, fearful, happy, sad, or surprised. Basic camera movements can be described as camera push-in, camera pull-back, camera move right, camera move left, or camera tilt up. For advanced camera movements, we can also emphasize handheld camera, compound movement, follow shot, or orbiting shot. For common stylization, we can describe visual styles like felt style, 3D cartoon, pixel style, puppet animation, 3D game, clay style, anime, watercolor, black and white animation, or oil painting style. If you want special effects shots, you can also add tilt-shift photography or time-lapse.
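To make the prompt guide above a bit more concrete, here is a minimal sketch, not from the video, of how those descriptor categories could be assembled into a single prompt string. Only the descriptor values are taken from the guide; the helper function and the example subject are purely illustrative.

```python
# Minimal sketch (not from the video): assemble a Wan 2.2 prompt from the
# descriptor categories listed in the guide above. Only the example values
# come from the guide; the helper itself is illustrative.

def build_prompt(subject: str, motion: str, **descriptors: str) -> str:
    """Join a subject, a motion description, and any aesthetic descriptors
    (light source, light quality, shot size, camera movement, style, ...)
    into one comma-separated prompt string."""
    parts = [subject, motion] + [value for value in descriptors.values() if value]
    return ", ".join(parts)

prompt = build_prompt(
    "a dancer on a rooftop",                 # hypothetical example subject
    "break dancing, camera push-in",         # motion + basic camera movement
    light_source="moonlight",
    light_quality="rim light, high contrast",
    time_of_day="night",
    shot_size="medium long shot",
    composition="center composition",
    color_tone="cool tone, low saturation",
    style="anime",
)
print(prompt)
```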
Okay, now let's get all the files that we need for Wan 2.2, plus the workflow. The first thing we need, obviously, is Wan 2.2 itself. Again, I'm assuming you have an 8 GB VRAM laptop GPU, so we want the GGUF versions, and the community has already created them within one day, which is pretty impressive. You'll find the full set here, and I'll share the link in the video description down below. On this Hugging Face page, you'll find a whole bunch of different quant values. Again, I have 8 GB, so I've been using the Q4 just to make sure that I can run it and do proper testing. I think I could probably get by with the Q5 as well. Obviously, if you have more VRAM, you can go up to Q6 or Q8, and if you have less, you can try the Q2 or Q3. I've seen some people use the Q3 and the results have still been pretty good. So really, just grab the one that will fit your VRAM.

The other big difference between Wan 2.2 and 2.1, at least for the image-to-video model, is that you need two of these GGUFs: the high-noise one and the low-noise one, each a 14-billion-parameter model. For me, for example, I would grab the Q4 high-noise one, and in addition I would need the Q4 low-noise one. These get paired up, and I'll show you later in the workflow how they're paired together to produce one video generation. Grab the quants that you need and put them in the unet folder, just like all the other GGUFs before.

Next, we'll move on to the other important bit, which is the Lightx2v LoRA. This is the same LoRA that was used with Wan 2.1, and we can continue to use it with Wan 2.2, which greatly improves the speed of the generation. By default, Wan 2.2 generations are 20 steps, so they take a little bit longer, but with Lightx2v you can get it down to six steps for a really good generation. All the videos in the intro were done in six steps. From Kijai's Hugging Face page, you want to grab the correct one, which is the I2V one. There are a whole bunch of different ranks here; the one I've been using in my testing is rank 32, but you can also go up to rank 64 or rank 128. So yeah, just get one from rank 32 and up and then experiment however you like.

The last set of files comes from the ComfyUI page for Wan 2.2. They've updated one of their guides here. Actually, there's nothing new for Wan 2.2 unless you're using the 5B model. Because we're using the 14B model, you can just use the same text encoder that you previously used for Wan 2.1; you can grab it here if you don't have it. You also have to use the Wan 2.1 VAE for the 14B model, so grab that here or just reuse the one you already have. The Wan 2.2 VAE is only used for the 5B model, and it's actually quite a bit slower, so right now you don't need it. Put these two files in the correct folders and that's it.
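If you'd rather script the downloads than click through the browser, here is a minimal sketch using huggingface_hub. The repo IDs and filenames below are placeholders, since the actual repos are the ones linked in the video description, and the target folders assume a standard ComfyUI install.

```python
# Minimal sketch: fetch the paired GGUFs and the Lightx2v LoRA with
# huggingface_hub. The repo IDs and filenames below are PLACEHOLDERS --
# replace them with the repos linked in the video description. The target
# folders assume a standard ComfyUI directory layout.
from huggingface_hub import hf_hub_download

COMFY_DIR = "/path/to/ComfyUI"  # adjust to your install

downloads = [
    # (repo_id, filename, ComfyUI subfolder) -- all hypothetical examples
    ("<gguf-repo>", "wan2.2_i2v_high_noise_14B_Q4_K_S.gguf", "models/unet"),
    ("<gguf-repo>", "wan2.2_i2v_low_noise_14B_Q4_K_S.gguf", "models/unet"),
    ("<lora-repo>", "lightx2v_i2v_14B_rank32.safetensors", "models/loras"),
]

for repo_id, filename, subfolder in downloads:
    path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir=f"{COMFY_DIR}/{subfolder}",
    )
    print("saved:", path)
```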
Okay, now we're back in ComfyUI. Here's just an overall view of the workflow itself. It's not super complicated, although it looks quite big, and I'll go through each of these nodes as I walk through the whole workflow. One of the first things you have to do before you start is update your ComfyUI. Once you bring up the Manager, you can update ComfyUI by clicking Update All or Update ComfyUI; Update All, I think, will also update all the custom nodes you have. Just make sure you run this, then do a restart and then refresh the browser window. When it's all done, you want to be on 0.3.46, and you can check this in a couple of ways. The reason I showed you my ComfyUI version is that even though I was up to date, the workflow template wasn't there for me. If you have the same issue, just go to the ComfyUI official website, click to download the JSON file, or Ctrl+A to select all the code, copy it, and then paste it into your ComfyUI. Now, I've used the Wan 2.2 14B image-to-video workflow as a base and then expanded it with all the necessary extra nodes.

Okay, let's go over the workflow itself. First, we'll start on the very left with all the different loaders. The default one is the Load Diffusion Model node. I've bypassed those and then added two loaders from the GGUF custom node. What you'll notice is that there are actually two sets of models being loaded for this workflow: in one of them you put the high-noise model, and in the other you put the low-noise model. What happens is that during sampling, the video generation actually splits your steps into two, doing the first half on the high-noise model and the second half on the low-noise model. So if you had six steps, for instance, you would set it up so that three steps are done on the high-noise model, which is supposed to give you the dynamic motion, the composition, all that kind of stuff, and the last three steps use the low-noise model, which then fills in all the fine details. This part is different from Wan 2.1.

From here, the two models go through the usual speedups that I add. Again, if you don't have Sage Attention, just look up how to install Sage Attention and the torch compile settings, and then you can add these two nodes. If you can't get it working, or if you don't have it installed, you can always just bypass all four of these nodes and still run the model; it will still work, just a little bit slower.

Okay, next, continuing along the model chain, you have the LoRA Loader (Model Only) nodes. Into these two you put the Lightx2v LoRA, the same LoRA in both. For the top one, the high-noise path, you set the strength to 3, and for the low-noise path you set it to 1.5. These are the numbers people have been experimenting with, and they seem to give good results; they work well for me, so I've stuck with them. Of course, you can always experiment and see if other strengths give you better results. Both of these go into the shift nodes, which default to 8. Each one of these models then goes into a KSampler of its own: the high-noise pass goes into the first KSampler, and the low-noise pass goes into the second. These two KSamplers are connected by the latent. A couple of other things plug in here as well: the positive and negative prompts, the same ones into both, and the latent image from down below, which I'll explain now.

If we take a look just underneath, we have the CLIP loader for the text encoder and the VAE loader. These are just the general ones from ComfyUI. You put your text encoder in here, make sure you have one selected, and set the device to CPU if your CPU can handle it. For the VAE, make sure the Wan 2.1 VAE is selected. For the prompt connections, the text encoder gets hooked into the positive and negative prompts. In this case, because we're using Lightx2v, the negative prompt isn't actually being used, but it is used in the default workflow, which is why it's there. I also haven't had time to see whether the Wan video negative prompt setup has been updated and works for Wan 2.2 yet, so I've just left it without a negative prompt, and so far the generations seem okay.

The positive and negative prompts actually get fed into the WanImageToVideo node, and from there into the KSamplers. The other thing connecting to WanImageToVideo is obviously your Load Image node; for an I2V model, you have to load your image here. The image goes through an image resize node, which I've added just to make it a little easier to set the resolution of the output. You set the resolution on the image resize node, that gets passed to WanImageToVideo, and from there it gets fed into the KSampler. The 14B model for Wan 2.2 still uses 16 frames per second, so the length you set in this node has to be a multiple of 16 plus 1; 81 frames is the usual 5 seconds for Wan. Obviously, if your GPU is more powerful or you have more VRAM, you can also up the resolution of your generation. You can definitely do widescreen at 832x480, or portrait at 480x832, or you can generate 720p video at 1280x720. That covers the resolutions for Wan 2.2, but you don't need to go higher than 720p; I think the models are only trained on 480p and 720p as the maximum.
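As a quick sanity check on the length and resolution settings just described, here is a small sketch, mine rather than the video's, that turns a target duration into a valid frame count at 16 fps and lists the resolution presets mentioned.

```python
# Small sketch reflecting the settings above: the 14B model runs at 16 fps,
# and the frame length should be a multiple of 16 plus 1 (81 frames ~ 5 s).
# The resolution presets are the ones mentioned in the video.

FPS = 16

def frame_count(seconds: float) -> int:
    """Return the nearest valid frame length (16 * n + 1) for a duration."""
    n = max(1, round(seconds * FPS / 16))
    return 16 * n + 1

RESOLUTIONS = {
    "widescreen_480p": (832, 480),
    "portrait_480p": (480, 832),
    "widescreen_720p": (1280, 720),
}

print(frame_count(5))                  # 81 frames, the usual 5-second clip
print(RESOLUTIONS["widescreen_480p"])  # (832, 480)
```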
Now we move into the KSamplers. Again, the first KSampler is for the high-noise pass and the second one is for the low-noise pass. You'll see that the high-noise one has a seed, obviously, since it's the one adding the noise. This one looks more like the usual KSampler settings you would expect: you have a seed, which you can either randomize or fix if you're trying to keep the same seed, and steps is six. Because you're using Lightx2v, you can also do eight or ten steps if you want better quality, but just keep it a multiple of two. CFG is 1. For the sampler and scheduler, I've been using Euler and Simple; some people have been using UniPC, so you can experiment a little to see which ones you prefer. The last three settings are slightly different from Wan 2.1: you have the start step at 0 and the end step at 3, which means that out of the six steps, you only do the first three steps here. You then return with the leftover noise as well. All of this gets fed into the second KSampler. You'll notice you don't need a seed for this one because it's taking its input from the first one. You match the steps, so 6; CFG is the same; same sampler, same scheduler. In the bottom part, instead of starting at 0, you pick up at step 3 and end at 10,000. You could also put 6, but you can just leave 10,000 there; then set return with leftover noise to disable. When the combined six steps finish, it goes through the VAE Decode and then into the Video Combine node, which is all quite normal.

So, here's a sample test video I ran. As you can see, I'm using the Q4_K_S GGUF model with Sage Attention and the Lightx2v LoRA, and it takes around 150 seconds to generate one video. I also tested the Q6_K GGUF model, and guess what? No significant improvement in result quality, but the generation time jumped to 339.69 seconds. So yeah, feel free to stick with the Q4 if you're on a lower-VRAM GPU.
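To recap the two-pass sampler setup from the walkthrough, here is a minimal sketch of the two KSampler (Advanced) configurations written as plain Python dicts. The keys are named after that node's widgets as I understand them, and the values are the ones used in the video (6 total steps split 3 + 3, Euler/Simple, CFG 1).

```python
# Recap sketch of the two-pass sampling described in the walkthrough, as
# plain dicts whose keys are named after the KSampler (Advanced) widgets.
# Values are the ones used in the video: 6 total steps split 3 + 3.

TOTAL_STEPS = 6

high_noise_pass = {
    "add_noise": "enable",              # this pass creates the initial noise
    "noise_seed": 0,                    # randomize, or fix it to reproduce a run
    "steps": TOTAL_STEPS,
    "cfg": 1.0,                         # Lightx2v runs at CFG 1
    "sampler_name": "euler",
    "scheduler": "simple",
    "start_at_step": 0,
    "end_at_step": 3,                   # first half: motion and composition
    "return_with_leftover_noise": "enable",
}

low_noise_pass = {
    "add_noise": "disable",             # continues from the first pass's latent
    "steps": TOTAL_STEPS,
    "cfg": 1.0,
    "sampler_name": "euler",
    "scheduler": "simple",
    "start_at_step": 3,                 # second half: fine details
    "end_at_step": 10000,               # effectively "run to the end"
    "return_with_leftover_noise": "disable",
}
```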
Wan 2.2 Setup tutorial with GGUF and Lightx2v in ComfyUI | 8GB VRAM
Channel: Atelier Darren