
Transcript of Wan 2.2 Setup tutorial with GGUF and Lightx2v in ComfyUI | 8GB VRAM

Video Transcript:

Using the GGUF model with Sage Attention and the Lightx2v LoRA reduced the video generation time from 12 minutes to just 189 seconds. In this video, I'll show you how to set up Wan 2.2 in ComfyUI. So, let's get started. Wan 2.2 video generation is fully upgraded: cinematic-level aesthetic control, more stable and smooth dynamic generation, and more realistic instruction-following capability. We have prepared a cinematic video generation prompt guide to help everyone improve texture in aesthetic control, motion description, and stylization.

In common aesthetic descriptions, we can try to fill in light source types such as daylight, artificial light, moonlight, practical light, firelight, fluorescent light, overcast light, mixed light, sunny light, etc. Light qualities can be described as soft light, hard light, top light, side light, back light, bottom light, rim light, silhouette, low contrast, high contrast. Time of day can be described as daytime, night, dusk, sunset, dawn, sunrise. Shot sizes can be described as close-up, extreme close-up, medium shot, medium close-up, medium long shot, long shot, wide angle. For composition types, we can describe center composition, balanced composition, biased composition, symmetrical composition, short-side composition.

For camera descriptions, focal lengths can be written as medium focal length, wide angle, long focus, ultra wide angle, fisheye. Camera angles can be written as over-the-shoulder, high angle, low angle, tilted angle, aerial shot, top-down angle. Camera shot types can include clean single-person shot, two-person shot, single-person shot, tracking shot, establishing shot. Color tones can be written as warm tone, cool tone, high saturation, low saturation, etc.

In common motion descriptions, we can write motion types such as break dancing, running, skateboarding, playing soccer, tennis, ping pong, skiing, basketball, rugby, bowling, dance, cartwheel, etc. Character emotions can be described as angry, fearful, happy, sad, surprised. Basic camera movements can be described as camera push-in, camera pull-back, camera move right, camera move left, camera tilt up. For advanced camera movements, we can also emphasize handheld camera, complex movement, follow shot, orbiting shot.

In common stylization, we can try to describe visual styles like felt style, 3D cartoon, pixel style, puppet animation, 3D game, clay style, anime, watercolor, black-and-white animation, oil painting style. If you want special effects shots, you can also add tilt-shift photography or time-lapse.

Okay, now let's get all the files that we need for Wan 2.2 and the workflows. The first one that we need, obviously, is Wan 2.2 itself. Again, I'm assuming you have an 8 GB VRAM laptop GPU. The community has already created GGUF quants within one day, which is pretty impressive. You'll find the full set here, and I'll share the link in the video description down below. On this Hugging Face page, you'll find a whole bunch of different quant values. Again, I have 8 GB, so I've been using the Q4 just to make sure that I can run it and do the proper testing. I think I can probably get by with the Q5 as well. And obviously, if you have more VRAM, then you can go up to Q6 or Q8. If you have less, you can try out the Q2 or Q3. I think I've seen some people use the Q3 and the results have been pretty good still. So really, just grab the one that will fit your VRAM.
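A minimal sketch, not from the video itself, of how the prompt-guide categories above might be combined into a single prompt. The subject line and exact phrasing are my own placeholders; swap in whatever scene and descriptors you like.

```python
# Illustrative only: combining the prompt-guide categories into one Wan 2.2 prompt.
# The subject is a placeholder; the other phrases come from the category lists
# above (light source, light quality, time of day, shot size, camera angle,
# shot type, camera movement, color tone). A stylization such as "watercolor"
# could be appended for a stylized look.
prompt_parts = [
    "A woman in a red coat walks down a rain-soaked street",  # subject + motion
    "practical light, mixed light",                           # light source
    "soft light, rim light, low contrast",                    # light quality
    "night",                                                  # time of day
    "medium long shot",                                       # shot size
    "low angle",                                              # camera angle
    "tracking shot, handheld camera",                         # shot type + movement
    "cool tone, low saturation",                              # color tone
]
print(", ".join(prompt_parts))
```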
The other big difference between Wan 2.2 and 2.1, at least for the image-to-video model, is that you need two of these GGUFs: the high-noise one and the low-noise one, both 14-billion-parameter models. For example, I would grab the Q4 high-noise one, and in addition I would need the Q4 low-noise one. These get paired up together to produce one video generation, and I'll show you how later in the workflow. Grab the quants that you need and put them in the unet folder, just like all the other GGUFs before.

Next, we'll move on to the other important bit, which is the Lightx2v LoRA. This is the same LoRA that was used in Wan 2.1, and we can continue to use it in Wan 2.2, which greatly improves the speed of the generation. By default, Wan 2.2 generations are 20 steps, so it takes a little bit longer, but with the Lightx2v LoRA you can get it down to six steps for a really good generation. All the videos in the intro were done on six steps. From Kijai's Hugging Face page, you want to grab the correct one, which is the I2V one. There's a whole bunch of different ranks here. The one I've been using in my testing is just rank 32. You can also go up to rank 64 or rank 128. So yeah, just get one from rank 32 and up and then experiment however you like.

The last set of files comes from the ComfyUI page for Wan 2.2. They've updated one of their guides here. Actually, there's nothing new for Wan 2.2 unless you're using the 5B model. Because we're using the 14B model, you can just use the same text encoder that you've previously used for Wan 2.1; you can grab it here if you don't have it. You also have to use the Wan 2.1 VAE for the 14B model, so you can grab that here or just reuse the one that you previously have. The Wan 2.2 VAE is only being used for the 5B model, and it's actually quite a bit slower, so right now you don't need it. Put these two files in the correct folders and that's it.

Okay, now we're back in ComfyUI. Here's just an overall view of the workflow itself. It's not super complicated, although it looks quite big, and I'll go through each of the nodes as I go through the whole workflow. One of the first things you have to do before you start is to update your ComfyUI. Once you bring up the Manager, you can update ComfyUI by clicking Update All or Update ComfyUI; Update All, I think, will also update all the custom nodes you have. Just make sure you run this, then do a restart and then refresh the browser window. When it's all done, you want to be on 0.3.46, and you can check this in a couple of ways. The reason I'm showing you my ComfyUI version is that even though it's up to date, the workflow template isn't there for me. If you have the same issue, just go to the ComfyUI official website, click to download the JSON file, Ctrl+A to select all the code, copy it, and then paste it into your ComfyUI. Now, I've used the Wan 2.2 14B image-to-video workflow as a base and then expanded it with all the necessary extra nodes.

Okay, let's go over the workflow itself. First, we'll start on the very left with all the different loaders. The default one is the Load Diffusion Model node. I've bypassed those and then added the two loaders from the GGUF custom node. What you'll notice is that there are actually two sets of models being loaded for this workflow: in one of them you put the high noise, and in the other you put the low noise.
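Before going further, a quick sanity-check sketch (my own, not from the video) of where the files from the download steps above are expected to land in a standard ComfyUI install. The folder names follow the usual ComfyUI layout; the file names are placeholders for whichever quant and LoRA rank you actually grabbed.

```python
# Check that the Wan 2.2 files sit in the expected ComfyUI model folders.
# File names are placeholders; substitute the exact quant/rank you downloaded.
from pathlib import Path

MODELS = Path("ComfyUI/models")  # adjust to your install location

expected = {
    "unet": [                                      # GGUF unet models
        "wan2.2_i2v_high_noise_14B_Q4.gguf",       # high-noise GGUF
        "wan2.2_i2v_low_noise_14B_Q4.gguf",        # low-noise GGUF
    ],
    "loras": [
        "lightx2v_i2v_14B_rank32.safetensors",     # Lightx2v I2V LoRA
    ],
    "text_encoders": [                             # "clip" on older installs
        "umt5_xxl_fp8_e4m3fn_scaled.safetensors",  # same text encoder as Wan 2.1
    ],
    "vae": [
        "wan_2.1_vae.safetensors",                 # Wan 2.1 VAE, used by the 14B model
    ],
}

for folder, files in expected.items():
    for name in files:
        path = MODELS / folder / name
        print(f"{'OK     ' if path.exists() else 'MISSING'} {path}")
```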
What happens with these two models is that the video-generation sampling actually splits your steps into two: the first half is done on the high-noise model and the second half on the low-noise model. So if you had six steps, for instance, you would set it up so that three steps are done on the high noise. The high noise is supposed to give you the dynamic motion, composition, all that kind of stuff, and the last three steps use the low noise, which then fills in all the fine details. This part is different from Wan 2.1.

From here, the two models go through the usual speedups that I add. Again, if you don't have Sage Attention, just look up how to install Sage Attention and the torch compile settings, and then you can add these two nodes. If you can't get it working or don't have it installed, you can always just bypass all four of these nodes and still run the model. The model will still work, it'll just be a little bit slower.

Next, continuing on the model chain, you have the LoRA Loader (Model Only) nodes. Into these two you put the Lightx2v LoRA, the same LoRA added to both. For the top one, the high-noise path, you set the strength at 3, and for the low-noise path you set it at 1.5. These are the numbers people have been experimenting with, and they seem to give good results; for me they give pretty good results, so I've stuck with them. Of course, you can always experiment and see if you get better results with other LoRA strengths. Both of these go into the shift nodes, which are at the default of 8. Each of these models then goes into a KSampler of its own: the top half with the high-noise pass goes into the first KSampler, and the low-noise one goes into the second KSampler. The two KSamplers are connected by the latent. You also plug a couple of other things in here: the positive and negative prompts, the same ones into both, and the latent image from down below, which I'll explain now.

If we take a look just underneath, we have the CLIP loader for the text encoder and the VAE loader. These are just the standard ComfyUI ones. You put your text encoder in here, make sure you have one selected, and set the device to CPU if your CPU can run it. For the VAE, make sure the Wan 2.1 VAE is selected. For the prompt connections, the text encoder gets hooked into the positive and negative prompts. In this case, because we're using Lightx2v, the negative prompt isn't actually being used, but it is used in the default workflow, which is why it's there. I also haven't had time to see whether the usual Wan video negative prompt has been updated and whether it works for Wan 2.2 yet, so I've just left it without a negative prompt, and so far the generations seem okay.

The positive and negative prompts get fed into the Wan image-to-video node, and from the image-to-video node everything gets fed into the KSamplers. The other thing connected to the Wan image-to-video node is obviously your Load Image node; for an I2V model you have to load your image here. The image goes through an image resize node, which I've added just to make it a little easier to set the output resolution. You set the resolution on the image resize node, that gets passed to the Wan image-to-video node, and that resolution then gets fed into the KSampler. The 14B model for Wan 2.2 still uses 16 frames per second, so the length you set in this node has to be a multiple of 16 plus 1.
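As a quick check of that frame-length rule, here is a small sketch of my own, using the 16 fps figure from the video:

```python
# At 16 frames per second, a length of 16 * seconds + 1 frames gives whole
# seconds of video, which is the "multiple of 16 plus 1" rule mentioned above.
def wan_frame_count(seconds: int, fps: int = 16) -> int:
    return fps * seconds + 1

for s in range(1, 6):
    print(f"{s} s -> {wan_frame_count(s)} frames")
# 5 s -> 81 frames, the default length used throughout this tutorial.
```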
81 frames is the usual 5 seconds for Wan. Obviously, if your GPU is more powerful or you have more VRAM, you can also up the resolution of your generation. You can definitely do widescreen 832x480, or 480x832 if you want portrait. Or you can generate 720p video by putting 1280x720; I think that works on Wan 2.2, but you don't need to go higher than 720p, since I think the models are only trained on 480p or 720p as the maximum.

Now we move into the KSamplers. Again, the first KSampler is for the high-noise pass and the second one is for the low-noise pass. You'll see that the high-noise one has a seed, obviously, to denoise from; this one looks more like the usual KSampler settings. You have a seed, which you can either set to randomize or fix if you're trying to keep the same seed. Steps: 6. Because you're using Lightx2v, if you want better quality you can also do 8 or 10 steps, just keep it a multiple of two. CFG is 1. For sampler and scheduler I've been using Euler and Simple; some people have been using UniPC. You can experiment a little with these to see which ones you prefer. The last three settings are slightly different from Wan 2.1: you have a start step of 0 and an end step of 3, which means out of the six steps you only do the first three here, and then you return the leftover noise as well. All of this gets fed into the second KSampler. You'll notice you don't need a seed for this one because it's taking its input from the first one. You match the steps, so 6, CFG is the same, same sampler, same scheduler. The bottom part, instead of starting at zero, picks up and starts at step 3 and ends at 10,000. You could also put 6, but you can just leave 10,000 there, and then set the leftover-noise option on the second one to disable. When it finishes the combined six steps, it goes through the VAE Decode and then into the Video Combine node, which is all quite normal.

So, here's a sample test video I ran. As you can see, I'm using the Q4_K_S GGUF model with Sage Attention and the Lightx2v LoRA, and it takes around 150 seconds to generate one video. I also tested the Q6_K GGUF model, and guess what? No significant improvement in result quality, but the generation time jumped to 339.69 seconds. So yeah, feel free to use the Q4 if you're on a lower VRAM GPU.
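To recap the two-pass sampler wiring described above, here is a small summary sketch. The field names follow ComfyUI's KSampler (Advanced) widgets, and the values are the ones given in the video; treat anything beyond that as an assumption rather than the workflow's exact contents.

```python
# Summary of the two KSampler (Advanced) passes for Wan 2.2 I2V with Lightx2v.
high_noise_pass = {
    "model": "high-noise 14B GGUF + Lightx2v LoRA @ 3.0",
    "add_noise": "enable",
    "noise_seed": "randomize or fixed",
    "steps": 6,                      # 8 or 10 also work; keep it a multiple of two
    "cfg": 1.0,
    "sampler_name": "euler",         # UniPC is another popular choice
    "scheduler": "simple",
    "start_at_step": 0,
    "end_at_step": 3,                # first half of the steps
    "return_with_leftover_noise": "enable",
}
low_noise_pass = {
    "model": "low-noise 14B GGUF + Lightx2v LoRA @ 1.5",
    "add_noise": "disable",          # continues from the first pass's latent
    "steps": 6,                      # must match the first pass
    "cfg": 1.0,
    "sampler_name": "euler",
    "scheduler": "simple",
    "start_at_step": 3,
    "end_at_step": 10_000,           # or 6; anything past the last step is fine
    "return_with_leftover_noise": "disable",
}
```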

Channel: Atelier Darren
