Transcript of ComfyUI With Wan 2.1 Fun Control - Video2Video With ControlNet Finally For AI Video!

Video Transcript:

Hello everyone. We have another exciting update in AI video using Wan 2.1. Recently Alibaba PAI released the Wan 2.1 Fun Control and Inpaint models, which are modified models based on the Wan 2.1 AI video model. Both are available on Hugging Face and you can download them directly from there.

Here's a chart showing the difference between them. The Inpaint model is used for modifying your existing video frames, doing frame-by-frame inpainting, while the Wan 2.1 Fun Control models (the bottom two rows of the table) support ControlNet, so you can drive motion the way you used to with images. Using ControlNet you're able to use Canny, depth, pose, and also line art to guide the motion. In AI video it handles multiple image frames in one generation batch, and when we stitch them back together into a sequence it becomes a video.

So in this video we're going to check out how to use the Control models in Alibaba PAI's Wan 2.1 Fun series of model weights. This time I'm going to use the Fun 1.3B (1.3 billion parameter) Control model, which is more consumer-PC friendly. Of course, if you have more VRAM or memory you can try the 14B model weights.

Here's what you're going to do. First, come to this Hugging Face page. You'll see Wan2.1-Fun-1.3B-Control, which is what we're going to use. Go to the Files and versions tab and you'll see a lot of files. If you've already used Wan 2.1, you don't need to download anything except the diffusion_pytorch_model.safetensors file. This is the diffusion model file we need specifically for the Wan 2.1 Fun Control model, and it can be loaded by ComfyUI as a diffusion model directly. The other files, such as the Wan 2.1 VAE and the UMT5-XXL text encoder, we already have from when ComfyUI integrated Wan 2.1 on day one, so we don't need those. The only thing we need is the safetensors file for the diffusion model, the core piece for doing ControlNet AI video.

Then go to the folder where you downloaded the model. You're going to save it in the models folder in ComfyUI, inside the diffusion_models subfolder. I keep a separate subfolder for each AI model, so I put it in a wan2.1 folder, and you can see I've renamed the file to something like alibaba-pai-wan2.1-fun-1.3b-control. That's the one we'll select later in ComfyUI. The rest of the models, as you can see, I already downloaded previously for Wan 2.1 image-to-video and text-to-video; we don't need those at the moment. As for the VAE, you should already have it if you've played around with Wan 2.1; it's linked from the ComfyUI official website, where a blog post links all of those, along with the UMT5-XXL text encoder. We're going to use those, and they apply to the Wan 2.1 Fun Control model as well.
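If you prefer to script the download instead of clicking through the website, here is a minimal sketch using the huggingface_hub package. The repo id, filename, and destination folder below are assumptions based on what is shown in the video, so check them against the actual Hugging Face page and your own ComfyUI folder layout.

# Sketch: download the Wan 2.1 Fun 1.3B Control diffusion model into ComfyUI.
# Repo id, filename, and local_dir are assumptions -- adjust to your setup.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="alibaba-pai/Wan2.1-Fun-1.3B-Control",            # hypothetical repo id
    filename="diffusion_pytorch_model.safetensors",
    local_dir="ComfyUI/models/diffusion_models/wan2.1",        # adjust to your install
)
print("Saved to:", model_path)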
Now come to ComfyUI. This footage was recorded earlier, when I updated the ComfyUI install that I keep dedicated to running this Wan 2.1 Fun update. I've screenshotted the part of the release notes where the latest ComfyUI update adds support for Wan Fun Control, which is what we're going to use, and also makes skip layer guidance work with diffusion transformer models like Wan 2.1. There's a new native node in ComfyUI in this update with the skip layer guidance dedicated to Wan 2.1, plus other new features you can check out at the bottom of the update notice. That's what we're going to focus on in ComfyUI.

Once again, a lot of people haven't realized that the ComfyUI Manager has some problems updating core features and core code. I don't really recommend clicking the Update ComfyUI button in the Manager. Instead, I open a command prompt in my ComfyUI folder and just type git pull, which pulls the latest files from the main branch of the ComfyUI GitHub repository. It merges the new files, deletes some old unnecessary ones, and we get the updated version of ComfyUI. The process is really fast and won't take much time. After that, run python main.py to start ComfyUI. This is again my recorded footage, and I didn't want to skip this step; as you can see, it picks up my GPU and starts ComfyUI.
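As a rough sketch of that update-and-launch step, here is the same thing scripted in Python. The folder path is an assumption; point it at wherever your ComfyUI clone lives. This is just a convenience wrapper around the git pull and python main.py commands described above.

# Sketch: update a ComfyUI clone and launch it.
# COMFYUI_DIR is a hypothetical path -- change it to your own install location.
import subprocess

COMFYUI_DIR = "C:/ComfyUI"

# Pull the latest commits from the main branch (same as typing "git pull").
subprocess.run(["git", "pull"], cwd=COMFYUI_DIR, check=True)

# Start the ComfyUI server (same as typing "python main.py").
subprocess.run(["python", "main.py"], cwd=COMFYUI_DIR, check=True)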
Okay, so here's the workflow I put together last night when I noticed this ComfyUI update and played around with it. The most important thing to know is the Wan Fun Control to Video node. It's a native node in ComfyUI, so you don't need to install any extra custom nodes to get it; just follow the update method I showed and, once you're on the latest version, you'll have Wan Fun Control to Video. Another node I'm going to try out is Wan Fun Inpaint to Video, which uses the other model from Alibaba PAI for video inpainting. Those nodes will be covered in other video tutorials, so we'll skip them for now and focus on Fun Control to Video.

First, notice the inputs: positive, negative, VAE, and CLIP vision, plus the start image and the control video. Those last two are the important inputs we have to customize ourselves. The start image is like the feature you saw in the Runway update: in RunwayML's video-to-video, you can give a start image as a style reference for your restyling. In Comfy, of course, we can customize this even further, but it can do something very similar to Runway's start-frame restyling. It's also like what we had previously with AnimateDiff. Remember last year, when we started playing with AnimateDiff, we had motion modules and lots of settings to play with in that whole framework. We can now do the same kind of thing in AI video with Wan 2.1, the trending thing in ComfyUI right now, because Alibaba PAI created this add-on Fun Control model and we can use ControlNet here.

ControlNet, as we've used it before, mimics the composition and motion of your reference image in image generation; for video it mimics the motion of the reference video. For example, I have my input video here, a TikTok dance video as a demo, and this is what I generated last night. I tested it and it's pretty cool: I combined line art and DW pose together, which let me also mimic the facial motion of the character and transfer it onto my generated character, thanks to the DW pose. Another example I did used purely DW pose (you can use OpenPose as well, it's your call), restyling the first frame into a red demon standing on a rooftop, with some lightning in the first second, and then it does the same dance mimicking the reference video. That looks pretty fun.

So we can do all of that in this workflow. It's a very similar concept to the AnimateDiff setups we had last year, putting your reference image into IPAdapter for styling and using ControlNet to mimic the motion of your reference video, except that AI video right now feels easier to manage and control, rather than dealing with the complexity of all those connections and modules. What I did here is capture the reference image: once I capture the first frame of, say, this TikTok dancer or this stock-video dancer, I pass it to the other group, where I'm using Flux to generate the restyled image. I'm using ControlNet for Flux, as I've covered in previous videos, with ControlNet Union Pro, which lets you use different ControlNet types within one model; in the latest ComfyUI update there's also a new ControlNet type selector where you can pick whichever type you want to use with Union Pro. I'm also using the AIO Aux Preprocessor node, because there are different ControlNet types you can use here. I use line art to make it easier to replicate the image, so the generated image keeps similar elements, poses, and motion.

For the input, I use the first video frame, which I get with Get Image From Batch. There are actually two nodes with the same feature: the KJNodes Get Image From Batch, or the native ComfyUI Image From Batch; either one works to connect here. Let's say I just bypass this one and use the ControlNet here. On top I have my LoRA models, because I'm doing a quick, fast first-frame generation for restyling the video, so I use the Flux Turbo LoRA to get a fast preview of how it looks. If you want more detail in your image, skip this LoRA node and use the full potential of Flux to generate the image.

One very important thing: if you want a consistent character style and you have a Flux LoRA for that character (in my case I have my character LoRA connected), that will really help keep the character consistent in your restyled video-to-video result. You can use other nodes here, like a Load Image node, but then the pose of the character, for example this tennis player, will be different, and that makes the next steps hard: Wan 2.1 Fun Control captures these frames to mimic the control pose motion, and a first image with a different pose fights against that. So it's recommended to use your existing reference video, capture the first frame, take that frame's pose, line art, or depth map (whatever you want to use in ControlNet), and generate a new image based on it.
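To make the "grab the first frame" step concrete, here is a minimal sketch of what the Get Image From Batch / Image From Batch nodes are doing, assuming the loaded video arrives as a batch tensor of frames in the usual ComfyUI layout of batch, height, width, channels. The tensor contents below are hypothetical placeholders.

# Sketch: pull the first frame out of a batch of video frames.
# ComfyUI images are typically float tensors shaped [batch, height, width, channels].
import torch

frames = torch.rand(81, 480, 832, 3)   # hypothetical loaded video batch
first_frame = frames[0:1]              # keep the batch dimension: [1, 480, 832, 3]
print(first_frame.shape)               # torch.Size([1, 480, 832, 3])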
That is how we pass the restyled image from Flux. One more very important thing: I clean some cache after the Flux generation, because when you use Flux and Wan 2.1 together in one workflow, I believe a lot of computers won't make it to the end, and somewhere in the middle of the Wan 2.1 generation you'll get an out-of-memory error. So I purge the cache and also clear the memory that's holding the models. This is really worth knowing: running diffusion transformer models locally, especially on consumer PCs, requires us to manage memory usage inside the workflow, so clean it up after generating the image.

Next I have a preprocessor for the video. As you can see, I tried DW pose and line art. This is like what we used to do in AnimateDiff, chaining two or three ControlNets in the conditioning; it's the same concept here. When I generate this dance video, we have the line art of the character and the DW pose of the character as well, and combining them gives more accuracy both in how the style looks and in how the motion of the mimicking character looks in the generated video. Even across different camera angles, a close-up shot like this and then a far shot like this, the character keeps the same outfit in my generated video. That's because it isn't using different images for the restyle; the diffusion transformer carries one style all the way through the whole video generation. So within these 8 seconds I get consistent styling, with my character LoRA generating the first image, and I still get different camera views. This example is a TikTok dance, but imagine using it for AI movies and film production: you have different camera angles, far shots and close shots, and the outfit of your character won't drift if you reference your first image correctly.

So here we handle the pose and the line art. You can also do a depth map, which I tried in a previous video using Depth Anything V2 as the preprocessor. The way I used two ControlNet preprocessors together is by blending those images into one image. There's another node with the same feature in the WAS Node Suite custom nodes, where you can connect image A and image B (multiple images) and output one combined image; that gives you the same result as what I have here, combining two ControlNet hints into one image frame. So if you want to use two ControlNets, you combine the images; if you're using only one ControlNet, you can bypass the other preprocessors and maybe use just DW pose, like this example. Using DW pose alone gives you more freedom in how the character looks: as you can see, the demon in my generated result has large wings that the reference video doesn't have. If I applied line art, I couldn't add those wings, or the different city-view background behind the red demon in my generated video. So it depends on what kind of ControlNet guidance you want; there are a lot of combinations and no textbook answer. I showed all these ControlNet preprocessors just to let you know what this Wan 2.1 Fun model supports according to its model page, which of course is in Chinese.
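Here is a minimal sketch of the "combine two control hints into one frame" idea, assuming both preprocessor outputs arrive as ComfyUI-style image batches of the same size. A simple equal-weight blend is only one possible way to do it; the actual blend node in the workflow may weight or composite the images differently.

# Sketch: merge a DW pose render and a line art render into one control frame per image.
# Both inputs are assumed to be float tensors in [0, 1], shaped [batch, height, width, channels].
import torch

pose_frames = torch.rand(81, 480, 832, 3)      # hypothetical DW pose preprocessor output
lineart_frames = torch.rand(81, 480, 832, 3)   # hypothetical line art preprocessor output

# Equal-weight blend, clamped back into the valid image range.
combined = torch.clamp(0.5 * pose_frames + 0.5 * lineart_frames, 0.0, 1.0)
print(combined.shape)                          # torch.Size([81, 480, 832, 3])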
From what they mention, it supports Canny, depth, pose, and line art. I rewrote that line in English to make it easier and more convenient for people to understand. So that's basically how we prepare the ControlNet video, image frame by frame, and pass it to the Wan 2.1 video generation.

This group feeds into here. We have the restyled frame from Flux, and we have the video control preprocessors I just talked about; those image frames go into the control video input. This input is not a single image, it's a batch of image frames coming from the reference video. The start image is where we feed the Flux image generated from the first frame of the reference video and restyled into another look. The nice thing about Wan 2.1 here is that I'm using the 1.3 billion parameter size, and most people will assume this must be text-to-video only, because the original Wan 2.1 1.3B model is only used for text-to-video. That's not the case for Wan 2.1 Fun by Alibaba PAI: the hint is right here, the start image and CLIP vision inputs. Even at 1.3B parameters we can use an image as an input. The way we do it is with CLIP Vision Encode, the same method we use for image-to-video in Wan 2.1. We pass the CLIP vision output, and the rest of the input data stays the same: the CLIP text encoder takes the text prompts I enter at the top, so all the processing groups share the same synchronized text prompts. The last part is the sampling with the KSampler.

On top of that I have the Wan video Sage Attention and TeaCache nodes, as well as Patch Model Patcher Order; these require Triton and Sage Attention to be installed for your ComfyUI setup. If you're using a virtual environment like I do, you can install Triton and Sage Attention there and try this out; it runs faster. If you haven't installed them, you can bypass these groups and it will just use the skip layer guidance and the model sampling. The skip layer guidance, which I talked about in previous videos, adds more detail to your generated video, and now there's a new native ComfyUI node for it, so I swapped that in instead of the KJNodes skip layer guidance I used previously; I'll give the native node a try. There's also CFG Zero Star, which is new in this ComfyUI update and was previously available in the KJNodes as well; you can check it out, it's also used for video enhancement. And there's one more optional node, the enhance video node, which can also add detail to your generated video. Either way, I've connected it at the end of this chain, so it works, and I've included it here as well.

Then we generate the first batch of video results in the first Wan 2.1 group, and after that I clean up some cached data.
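For readers who want to see what that cleanup step amounts to, here is a minimal sketch of the kind of thing done between the Flux and Wan 2.1 stages: drop your references to models you are finished with, run Python's garbage collector, and release cached GPU memory. This illustrates the idea only; it is not the exact code inside the cleanup nodes used in the workflow.

# Sketch: free memory between two heavy model stages in one workflow.
# Call this after dropping your own references to the finished model
# (for example: del flux_pipe), otherwise the weights stay reachable.
import gc
import torch

def free_memory():
    gc.collect()                      # reclaim unreferenced Python objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached GPU blocks to the driver
        torch.cuda.ipc_collect()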
We also have the other Wan 2.1 groups for refinement, as I talked about in previous videos, using the tile LoRA models. I've taken what we discovered earlier in AI video and applied it in a useful way for different scenarios, reusing those workflow groups and integrating them into this new video-to-video workflow, and it works. It does the refinement with the tile LoRA, and I also do a bit of video upscaling at the front, when I receive the first batch of video generation, enlarging the video frames before regenerating them with the tile LoRA model.

So let's try that out with this TikTok dance style. It looks like this; I stitched all the parts together into 16 seconds here. As you can see, the background differs between the stitched sections; there's still room to improve this workflow so the background style stays coherent and consistent as well. Right now it's focused on keeping the same outfit style, but since I used the character LoRA in Flux, I get the same face and the same body shape for the character.

So let's run this one and show you how it looks when generated. Here's the generated video. This time I'm using only DW pose, which is a lot simpler to handle for the ControlNet than combining multiple ControlNets into one image frame in the control section. In the generated result, the dress on the mimicking character has more motion than in the source video, because we're only using DW pose for the character skeleton, so the outfit is creatively generated by the AI video model. There's more dynamic movement in the dress compared with my previous generation, where the movement of the clothing was guided by line art for both the outfit and the body shape. So the more ControlNet guidance we add to the video generation, the less creativity we get; generation guided by line art in particular will be heavily influenced by the reference video. It's a give and take: if you want more creativity and more dynamics than your reference provides, you're better off referencing just the DW pose for the motion.

And as you can see, across the whole video we keep the same background and the same character. Even moving between camera views, a close-up shot of the character and then a far shot, we get the same outfit and the same background, because this is what the diffusion transformer is good at compared with AnimateDiff's UNet-based architecture. Now we have consistent styling for the character's outfit and even for the background of the video. In this case I'm using the dance motion from the source video with this text prompt: a blue and yellow cheerleader uniform, dancing in front of the camera, smiling, and she is in a football field. So we get a background matching what we generated in Flux, and that style is transferred into our AI video using the Wan 2.1 Fun Control model.

Also, if you have more time to refine your video: you can see the face looks kind of blurry here, because the resolution is quite low when using Wan 2.1 this way.
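As a small aside on the 16-second stitched example above, here is a minimal sketch of joining several generated clips into one longer sequence when the clips are frame batches of identical resolution. The tensor names are hypothetical, and this ignores the harder problem mentioned in the video, keeping the background consistent across separately generated parts.

# Sketch: stitch several generated clips (frame batches) into one longer sequence.
# Each clip is assumed to share the same height, width, and channel count.
import torch

clip_a = torch.rand(81, 480, 832, 3)   # hypothetical first generated segment
clip_b = torch.rand(81, 480, 832, 3)   # hypothetical second generated segment

full_sequence = torch.cat([clip_a, clip_b], dim=0)   # concatenate along the frame axis
print(full_sequence.shape)                           # torch.Size([162, 480, 832, 3])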
If you want higher resolution and more detail, go for another pass using the tile control LoRA to do the refinement, along with CFG Zero Star and skip layer guidance to add more detail to your video. For the skip layer guidance I'd suggest using 10 for the layers; for the double layers and single layers I mostly use the same values at the moment. We'll see how further testing goes, but these are the values I usually use in the layer settings. And that is the whole structure of this workflow.

Now let's try the other video I played around with last night, this male dancer on the rooftop. His movement is kind of creepy, and in past testing of this workflow I turned it into a red demon inspired by that movement, so let's try some other things with this video as well.

First we do the input referencing. I have the Load Video (Upload) node with a video I pre-uploaded to my ComfyUI input folder. I connect this first to the resize image node, and we set the length, width, and height as well, like in the previous video I talked about; it's the same structure I use for width and height in all my other AI video workflows. We connect it to this node, and those input values are passed into the resize image node. Because this video has landscape dimensions, we have to change the numbers here too; by default Wan 2.1 uses the 480p dimensions. With that, the video input for referencing is done.

I've laid this out in multiple groups, and I suggest you process the Flux restyle first-frame group as the first step, so I disable the other groups and focus only on the Flux restyle group. If you're a beginner and don't know how to do this in Flux, or you have a limited GPU, I suggest you use the beginner version of this workflow, which I'll upload at the link in the description below. It's easier to handle because it just takes an image through a Load Image node.

Moving on to this group: let's say I have this video and I want to restyle it. What does that look like? Let's set up the text prompts to do so. First we create the first frame. We get this image; it won't be very high resolution, because we're working at 832 by 480 pixels, but we only need it for the restyle. In the text prompts I've defined the outfit style: a hip-hop style girl wearing an MA-1 jacket with blue jeans, a purple tank top as an inner layer, and white sneakers, dancing on the street. And we get this as an example. Based on this image we also get the pose, because I used OpenPose this time, and we move on to the next step. Again, for an easier, beginner-style path, there's a Load Image node as the input for the first frame in the beginner version of this workflow, which will also be attached and runs more easily.

Moving on to the next step, we have the ControlNet preprocessor for the video, the Wan video Sage Attention, the outputs enabled (of course we need the output), and the Wan video control sampling.
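As a quick illustration of the resize step for a landscape source, here is a sketch that scales a clip's dimensions down to roughly Wan 2.1's 480p working size while keeping the aspect ratio and snapping to multiples of 16. The 832x480 target and the multiple-of-16 snapping are assumptions based on what is shown in the video, not requirements quoted from the model documentation.

# Sketch: fit an arbitrary landscape resolution into a ~480p working size.
def fit_to_480p(width: int, height: int, target_height: int = 480, multiple: int = 16):
    scale = target_height / height
    new_w = int(round(width * scale / multiple)) * multiple    # snap width to a multiple of 16
    new_h = int(round(target_height / multiple)) * multiple    # snap height as well
    return new_w, new_h

print(fit_to_480p(1920, 1080))   # (848, 480) for a 16:9 source
print(fit_to_480p(1664, 960))    # (832, 480)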
Let's also refine the video as a second step. For the ControlNet preprocessor for video, we have to configure this. I'm using just DW pose this time to keep it simpler; we don't need to blend two ControlNet motion images together. The easiest way is the AIO Aux Preprocessor node, which gives you many preprocessor options: Realistic Lineart, DW pose, and so on. I'm showing these for demo purposes, to make it easier to understand which ControlNet variants can be used. In practical use of this workflow I would probably delete all of these extra nodes if I'm only running one preprocessor; but if you want to combine line art with DW pose, or line art with a depth map, you can combine those two as well. For now I just need DW pose for this video, so I bypass the other preprocessors and generate right here. The Wan 2.1 part is all preconfigured, so it's ready to go, and we'll try it out in a moment. And the last part, the Wan 2.1 video refinement, also sends its output to the final video combine display.

So let's run it once and wait for the result. After it finishes creating the first frame, which uses the hip-hop dance style and my character LoRA, the body shape follows my character LoRA. Since this is a far shot it can't produce a clear face, but in case you have a close-up shot, I've also added the restore face node from ReActor, the custom node that was taken down on GitHub. If you don't have it installed, you can bypass or simply delete this node. The next step creates the video preprocessor output, the pose skeleton for this little popping-and-locking dance motion. Then we move on to Wan Fun Control, the core part of this tutorial, using ControlNet to mimic what's in the reference video.

And here's the generated result. Even as the camera moves through different angles, we get very consistent styling of the background as well as the character's outfit; the buildings, especially along the street, stay coherent without shifting or breaking in the next frame when the camera moves to another angle. That is thanks to the diffusion transformer architecture, unlike what we had with AnimateDiff or MimicMotion. Using DW pose or OpenPose with our reference first frame is similar to using AnimateDiff with IPAdapter, but if you've followed my earlier videos, you know IPAdapter doesn't reproduce exactly what's in your reference image. Here, for example, the buildings and everything in the background are exactly the same in the generated video, whereas with IPAdapter you'd get something quite different, with morphing or flickering backgrounds even with very good ControlNet settings. So that is totally different.
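To ground the per-frame preprocessing step described above, here is a minimal sketch of running a pose detector over every frame of a clip with the controlnet_aux package, the same family of annotators the AIO Aux Preprocessor node exposes in ComfyUI. I'm using the OpenPose detector here because I'm more certain of its API; the workflow in the video uses DW pose, which plays the same role. The frame filenames are hypothetical.

# Sketch: turn every frame of a clip into a pose control image.
# Assumes a list of PIL.Image frames already extracted from the source video.
from PIL import Image
from controlnet_aux import OpenposeDetector

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

frames = [Image.open(f"frame_{i:04d}.png") for i in range(81)]   # hypothetical frame files
pose_frames = [detector(frame) for frame in frames]              # one skeleton image per frame

pose_frames[0].save("pose_0000.png")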
The next step I did was the tile control LoRA pass, where we upscale the image frames from the first generated video. I upscaled 1.5 times here to make the frames a little larger for the tile pass, and then added another 30 steps of sampling. It just makes everything in the video sharper and clearer overall. This is how the output looks. Of course it's not final production quality; if this were the video you wanted to ship and, for example, you wanted the face to be sharper and clearer, that would be the next step, or further steps, in post-production of the video. And this is only the first version of how I'm doing this with the Wan 2.1 Fun Control AI video model; it already goes way beyond the MimicMotion workflow I showed in previous tutorials and the AnimateDiff video-to-video workflow. With a little more enhancement it will perform even better. That's what I have so far, and I will see you guys in the next video. Have a nice day. See you.
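As a last sketch, here is what the 1.5x frame upscale before the tile refinement pass can look like in plain PyTorch, assuming the frames are a ComfyUI-style batch tensor. This only handles the resize itself; the extra 30 sampling steps with the tile LoRA happen afterwards in the workflow and are not shown here.

# Sketch: enlarge every frame by 1.5x before the tile LoRA refinement pass.
import torch
import torch.nn.functional as F

frames = torch.rand(81, 480, 832, 3)                 # hypothetical first-pass result [B, H, W, C]

x = frames.permute(0, 3, 1, 2)                       # interpolate expects [B, C, H, W]
x = F.interpolate(x, scale_factor=1.5, mode="bilinear", align_corners=False)
upscaled = x.permute(0, 2, 3, 1)                     # back to [B, H, W, C]

print(upscaled.shape)                                # torch.Size([81, 720, 1248, 3])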


Channel: Benji’s AI Playground
