
Transcript of Wan Video With Echoshot In ComfyUI - Create Multiple Shots Of Your Actor

Video Transcript:

Hello everyone. Before Wan Video 2.2 arrives, we've got something cool for Wan Video 2.1: a new fine-tuned model based on the Wan 2.1 text-to-video model. It lets you generate multiple shots of a character from just one single text prompt, and it's called EchoShot.

EchoShot is essentially multi-shot portrait video generation. That means you can take one actor, as the authors put it, and create multiple shots within a single video generation. As you can see in these examples, there's a consistent character across multiple shots, each around one or two seconds long, showing the character from different angles. You can let your imagination run wild on how you arrange those shots within one text prompt. In short, this is a text-to-video model based on Wan 2.1 that generates multiple shots within one video generation result. All the video demonstrations here are very coherent, and the characters stay consistent across the shots. Again, it's not generating multiple separate videos; it does everything in one single generation pass.

The architecture diagrams here explain how it works. First, you have a consistent character: you use your text prompt to describe things like hairstyle, how the character is dressed, and so on. Then you define several different shots. In the typical setup, there are three shots in one video generation, and you describe each shot's background, actions, and the actor's behavior. As you can see in these examples, the character appears with different camera shots and different behaviors. For instance, in the first shot the character points a finger forward; the second shot transitions to the character outdoors, smiling, with trees and mountains in the background; the third shot shows her in her bedroom holding a phone and swiping on it, and so on. All of those shots come out of one single video generation.

In the official EchoShot Hugging Face repo, there are files you can play around with. It's based on the Wan 2.1 text-to-video model with 1.3 billion parameters, published as .pth files, which you can also use in ComfyUI. Another way to use this model is through the WanVideoWrapper: you can find the WanVideo comfy repo and run EchoShot with the wrapper. There are some updates you need to pull for the WanVideoWrapper GitHub repo; in this repo, the latest update was made yesterday. After you update it, you'll be able to use the EchoShot models. There's a new folder called EchoShot in the Hugging Face repo. Click into it and you'll see two files. Currently, the model is published as a safetensors file of about 2 GB, which is the 1.3B text-to-video model file. That one goes into your diffusion models folder.

The other, more flexible way, and my preferred method, is using the LoRA. As you can see in the second row here, there's the Wan 2.1 EchoShot 1.3B LoRA model. This goes into your ComfyUI models folder, inside the loras subfolder, to play around with. Today I'm going to run EchoShot the LoRA way rather than using the full text-to-video diffusion model. Why? First of all, it's only about a 300 MB file. Plus, I already have the Wan 2.1 text-to-video 1.3B model.
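As a side note, if you prefer scripting the download instead of grabbing the files in a browser, here's a minimal sketch using the huggingface_hub client. The repo ID and filename below are placeholders I'm assuming for illustration; verify the exact names in the wrapper's Hugging Face repo (the one with the new EchoShot folder) before running it.

```python
# Minimal sketch: download the EchoShot LoRA straight into the ComfyUI loras folder.
# Repo ID and filename are assumed placeholders -- check the wrapper's Hugging Face
# repo for the real names before use.
from huggingface_hub import hf_hub_download

lora_path = hf_hub_download(
    repo_id="Kijai/WanVideo_comfy",                              # assumed repo ID
    filename="EchoShot/Wan2.1_EchoShot_1.3B_lora.safetensors",   # placeholder filename
    local_dir="ComfyUI/models/loras",                            # LoRA files live here
)
print("LoRA saved to:", lora_path)
```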
So I don't need an additional Wan 2.1 1.3B model to run this. Currently, EchoShot is only trained on the 1.3B model. Hopefully, sooner or later, there will be a 14B release. Or maybe they'll wait for Wan 2.2 to come out and train a new, updated version. Who knows, right? So let's check out how we can use it at this moment.

Once again, the WanVideoWrapper is mostly for experimental purposes, as the author notes in the GitHub repo description. A lot of times, people misunderstand or misuse it by trying to put some of the latest model updates into a production-grade workflow. Some of these models, like EchoShot, are just for experiments and research progress.

Let's jump into the folder here. When you update your ComfyUI and the wrapper, you'll get some new custom nodes and example workflows. Go to the example workflows and you'll see the latest 1.3B EchoShot example. Load it into your ComfyUI workflow. In my ComfyUI here, I've already updated the WanVideoWrapper for this demo and loaded the EchoShot example workflow. You can see the reference for EchoShot, and there are some LoRAs you can download. As I just mentioned, I'm not going to use the EchoShot checkpoint as the diffusion model; I'm going to load it as a LoRA, which means adding an extra LoRA loader. So I add another loader and select the EchoShot LoRA in it. Right here, as you can see, I've selected the LoRA. What you need to do is connect the LoRA pipeline here and then go to the diffusion model loader. I set the EchoShot LoRA strength to one. The rest of the LoRAs here you need to download too, but they're optional; they just optimize the video's performance. I'm going to use them as well in my demo here.

Speaking of LoRAs and diffusion models, I already have the 1.3B base model: the Wan 2.1 text-to-video 1.3B FP32 safetensors file. Therefore, I set the base precision to FP32. In these options, I select the text encoder I already have, as well as my VAE. Remember, the precision you select needs to match the VAE and text encoder you downloaded.

Next, let's check out the sampler here. As you can see, the empty latent, which is usually the native way, has a higher number of frames by default in this example workflow: 149 frames for these three shots of video generation. As for the text prompt, you can see that three shots are defined using this format: square brackets with one, two, and three. You can set two or four shots, as long as you have enough frames for the AI to generate in the WanVideo sampler. This is the typical way of setting it up because it uses CausVid and the self-forcing DMD LoRA. It's similar to LightX2V, but LightX2V is used for the 14B models; self-forcing is what we use for the 1.3B model, which allows very low sampling steps in this case. One special thing here is the scheduler. We're seeing the dpm++ scheduler being used, whereas traditionally we use flowmatch_causvid or UniPC for Wan videos. But this time we're using this method to generate, which is also what the EchoShot authors use for this model. By default, the text prompt features a red panda, with three shots of the red panda doing different things in the video.
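To make that prompt format concrete, here's a small sketch of how a three-shot prompt can be assembled, plus a rough frame-budget check. The [1]/[2]/[3] shot markers follow the example workflow described above; the 16 fps output rate and the 4n + 1 frame-count convention are my assumptions about Wan 2.1, so treat the numbers as illustrative.

```python
# Sketch of a three-shot EchoShot-style prompt and a rough frame budget.
# Assumptions: Wan 2.1 renders at ~16 fps and expects frame counts of the form 4n + 1.
shots = [
    "[1] A red panda points a paw toward the camera in a cozy room.",
    "[2] The same red panda smiles outdoors with trees and mountains behind it.",
    "[3] The red panda sits in a bedroom, swiping on a phone.",
]
prompt = " ".join(shots)

total_frames = 149                      # default in the EchoShot example workflow
fps = 16                                # assumed output frame rate
assert (total_frames - 1) % 4 == 0, "Wan frame counts are typically 4n + 1"
seconds_per_shot = total_frames / fps / len(shots)
print(prompt)
print(f"~{seconds_per_shot:.1f} s per shot at {fps} fps")
```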
Let's check out how this generates using all default settings. The only thing I'm changing is that instead of the diffusion model, I'm using the LoRA for EchoShot. I don't need Torch compile here because it's not really necessary for my computer setup right now. Let's check it out and see how it runs. Oh, one more thing: make sure all the optional LoRAs point to the correct files on your system, because sometimes the file names are different when you save them. Just make sure everything is correct and select the files you actually have. Here I've downloaded the three models, chosen the right ones, and run it again.

As you can see, it's running successfully. Now it's going through the model loader and the text encoder, and we're going to check how much GPU VRAM this consumes. Because this is a 1.3B model, it shouldn't consume too much VRAM, even though I'm using the full model, not a GGUF quantized one, and it's in FP32. Most consumer PCs should still be able to run a model of this size. I've barely finished talking and it's already generated. It took about 22 seconds on my computer setup and consumed 10 GB of VRAM, with a peak of 11 GB, to generate this video.

So here's the red panda: three shots with different views of the same character doing different things in the video. This could be a really practical way to generate multiple shots of a character, such as in a music video or cinematic scenes where you want that specific character in several very short shots, doing multiple actions. You can apply something like EchoShot in that situation. In other examples I tried, I changed the character to a different one, not the red panda, and kept the same actions across the three shots. I was able to use the text prompt to describe what kind of outfit the character is wearing and get a consistent character within those three shots of video.

Here's another example I'm generating right now, using a different text prompt. This one has even more frames: I've set it to 201 frames for a longer duration. As you can see, it just finished generating while I was talking, and it's using 14 GB of VRAM even though I bumped the frame count up a bit. It's generating a more futuristic style of video: multiple shots again, and the character is moving differently in each one. In this scenario it's more like a timeline, where the background stays the same across the text prompts but the character performs different actions in the video.

One thing to realize is that this is a text-to-video model. So if I have a specific robot I want to use, not this generated one, for example my own reference robot like this one, and I want to put it into my video, the normal EchoShot example workflow won't work, because we're using the empty latent embed. We have to do something else to work around the text-to-video 1.3B model. One thing we can use is WanVideo VACE. The VACE 1.3B module from the WanVideoWrapper can also be downloaded from the Hugging Face repo under WanVideo comfy; scroll down to the bottom and you'll see the modules for 1.3B and 14B. Let's go this way to work around it, because WanVideo VACE is based on text-to-video.
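Coming back to the VRAM numbers from a moment ago, here's a quick back-of-the-envelope on why a 1.3B model stays manageable even at FP32. This only counts the diffusion model weights, not the text encoder, VAE, activations, or any VACE module, so the real peaks reported above (10 to 14 GB, and more once VACE is added) are naturally higher.

```python
# Rough weight-memory estimate for a 1.3B-parameter model at different precisions.
# Weights only; activations, text encoder, VAE, and any VACE module come on top.
params = 1.3e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("fp8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name:>9}: ~{gib:.1f} GiB of weights")
```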
That means the text-to-video model used as the base for EchoShot is also compatible. I've tested it this way and it works: I'm able to reference a specific character for the video generation rather than relying on the text prompt and hoping for a good-looking character in my video.

So let's say I connect the WanVideo VACE module as the module model. We don't need the WanVideo empty latent at the moment; instead, I'm using the WanVideo VACE encode. I've tried this already, experimented around, and it works with VACE as well. The only thing we need is a reference image. In this case, the reference image is here: the 3D-style lady walking on the street in this screenshot. But instead of that character, I'm going to drag and drop my example white robotic mecha here and reference it with the same text prompt for another video generation. Before that, we need to connect the VAE to our WanVideo VACE encode and set the video dimensions: width, height, and number of frames. By default, the number of frames is 149 in the example workflow. But since this is a 1.3B model, which is relatively small, I can run even longer videos. Let's say I set it to 201 frames; it's able to generate pretty well here. Once again, this time I'm not using the empty latent, so just make sure it isn't connected. And let's see how that works.

Once I've defined the character, just like we usually do for reference-to-video with VACE, we remove the background and put a white background behind the character (there's a small scripted sketch of this prep step below). This makes the character the main focus object for the video generation. Then we pass this information into the sampler and let it run.

As you can see, WanVideo VACE takes a little longer to generate because we've got the reference image as well as the VACE module model. Plus, I'm setting this to 201 frames, which adds up to 205 frames in the final sampling result right here. It consumes about 24 GB of VRAM this time, which is pretty crazy: I only added a few more add-ons and I'm using that much with a 1.3B model.

Let's check out the video result. Obviously, this is different from what we had for the generated robot character. Even though I'm using the reference image, the result is a little different: the armor is very close to the same, but the helmet differs a bit from my reference image. Well, with a 1.3B model there are limitations, and it can't create too much detail. Still, the overall armor of this character looks very similar in the video result. If you're using a simpler character, like the lady character or the red panda swap, it works better, for example using a reference image and keeping the same character throughout the entire video scene. So let's take the easier route that the 1.3B model can handle correctly. Here, I did two examples using the same character reference image. I've got a pretty long prompt, referencing this example from the EchoShot demo page where you see this series of prompts. You can tweak it a little, change the character, the face, the look of the character in the text prompt. Then I've got this example showing three shots of the character; in the last one, she's obviously talking on the phone.
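A quick aside on the reference-image prep mentioned above: the cut-out-and-white-background step can also be scripted outside ComfyUI with a few lines of Python. Here's a minimal sketch using the rembg library; the file names are placeholders.

```python
# Sketch: remove the background from a reference image and composite the subject
# onto plain white, so the character becomes the main focus for VACE.
# File names are placeholders.
from PIL import Image
from rembg import remove

subject = remove(Image.open("robot_reference.png"))             # RGBA, transparent background
canvas = Image.new("RGBA", subject.size, (255, 255, 255, 255))  # plain white canvas
canvas.alpha_composite(subject)                                 # paste subject over white
canvas.convert("RGB").save("robot_reference_white.png")         # drop alpha for the workflow
```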
The previous generation I did was a short test with this character text prompt, where it's the same room but three different actions. Either way, it works. You can use one environment and do three different actions, like this example, or you can use EchoShot's text prompts to control different environments across the three shots while keeping the same character. It all comes down to creativity: how you want to play with the text prompts and the combinations of different LoRA models and VACE. I think this is a pretty cool thing to experiment with, but it's not really ready for production-grade use yet. Since this is still a 1.3B model, there are lots of things it can't do, like in the previous example with the futuristic robot, where it couldn't create much detail for the helmet and so on. So yeah, this is an experimental, fun toy to play with. We're looking forward to Wan 2.2 coming out, and maybe in the next video I'll be talking about Wan 2.2 and whether all these existing LoRA models, VACE, and other fine-tuned models are compatible with it as well. I'll see you guys in the next one. Have a nice day. See ya.


Channel: Benji’s AI Playground
