
Transcript of Wan 2.1 Base MiniMax-Remover And NAG for Video Object Removal - Handy VFX AI Tool!

Video Transcript:

Hello everyone, we've got some really cool fine-tuned AI models for Wan 2.1 again. Yes, the Wan 2.1 AI models have been absolutely rocking the open-source AI video market right now; you can see a lot of fine-tuned, merged, and quantized models based on this one. Today we've got a very cool add-on feature for this AI model called the MiniMax Remover. This feature is for trimming out "bad noise," meaning unwanted objects in your video. You can select any object in your video and mark it as bad noise, and it basically just removes that noise in the latent space; afterward the VAE decodes and regenerates the video for us (there's a small toy sketch of this latent-masking idea at the end of this part of the transcript). As you can see here, there are a lot of examples showing the before-and-after of how the videos look.

Moving down to the architecture diagrams, you can see it's using diffusion transformer models, which as you know are pretty standard for AI video processing these days. In stage two it searches for what they call bad noise, and in the last step it cuts those bad noises out of the process. This way it removes objects appearing in the video much more cleanly, and the results feel a lot more natural after you remove the object compared to other AI removal software or models. A lot of times with those you'll see blocky, blurry areas or other low-quality artifacts generated by the AI after removing an object, like a blur where the object used to be.

But that's just what they show in the demo. I have actually tested this removal AI model myself, and I have to say it still needs some improvement, especially in terms of the model size, which brings some limitations. For example, if you check the Hugging Face page, you can see in the model card that this is based on Wan 2.1 with 1.3 billion parameters, and when you go into the files and check the transformer subfolder, you'll notice this is only about a 2.5 GB mini-sized AI model.

I've tested some examples, like removing a character walking on the street. It was able to successfully remove that guy, but it left the camera behind because I hadn't segmented the camera, so it ended up looking like a ghost was holding a camera, which is kind of funny, but that's just how it is. Still, you can see that it removed things pretty smoothly in some cases. In some other video scenes it doesn't fully remove certain objects. For example, take this guy here in the middle of the street: after the effect, it still leaves some shadows even after removing the person.

Another example: this is source footage of a group of friends dancing. I tried removing this person in the front and also this lady in the middle, so I did two examples like that. The first one I tried was removing the person in the front, and it generated something like this. It's kind of pixelated because I was testing at a very low resolution at first, just 480p, so it's not super clear and the details are gone, but it was still able to remove that person. Then I had to do another sampling pass to recover those dark areas, just like here; if you don't clean up those dark areas after the person is removed, you'll see a mark where the person used to be, like an empty region. In the other example I removed the lady in the middle of the group of friends. This time I did a little better, generating at 720p, and it did fully remove that person. So yes, it can do it, but you need to refine things a bit after the first removal. Just like the examples shown on the official page, it's not 100% perfect for every kind of video clip.
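Here is the toy sketch mentioned above of the "remove it in the latent space" idea. It is purely illustrative: the tensor shapes and the blending step are my assumptions, not the actual MiniMax-Remover code, but it shows roughly what masking an object out of the latents before denoising means.

```python
import torch

# Toy illustration only (not the MiniMax-Remover source): "removing in the latent
# space" means the masked region's original latent content is discarded and
# replaced with noise before the diffusion transformer denoises, so the model
# inpaints that area and the VAE never decodes the object again.
latents = torch.randn(1, 16, 8, 60, 104)   # hypothetical (B, C, T, H/8, W/8) video latents
mask = torch.zeros(1, 1, 8, 60, 104)       # 1 = region marked as "bad noise" to remove
mask[..., 20:40, 40:70] = 1.0

noise = torch.randn_like(latents)
conditioned = latents * (1 - mask) + noise * mask   # keep the background, blank out the object
```

The diffusion transformer then inpaints the blanked region over its denoising steps, and the VAE decodes a video in which the object is simply no longer there.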
As you can see, it works well when the object is a small proportion of the video scene. If the object is really large in the scene, like this guy here, you'll still see some shadow remaining, even in the official demos. So there are definitely some limitations to this AI model. I think it comes down to the fact that this is a 1.3 billion parameter model; if they had, say, a 14 billion parameter model, the quality would be way better. Still, it's fun to play around with and it's a very good tool. I've talked to some VFX friends who do video effects, and they use a lot of object removal in their work.

So I put together a workflow here just to test this AI model. It's based on the WanVideoWrapper, and you can check out the WanVideoWrapper Hugging Face repo to download this AI model. The MiniMax Remover, as you can see, is 1.3 billion parameters in FP16, and because this model is really small, FP16 is no problem at all for most computers; the whole package runs at a very small size, just 2.25 GB. Once you download it into the diffusion models subfolder, you can load it up in the model drop-down menu, just like all the other diffusion model files I've stored in that folder, and select this one. Basically this is just the normal way of connecting the model loader in the WanVideoWrapper. I used torch compile and then the LoRA select, and I also used the Realism Boost LoRA to make some video footage look more natural and realistic.

On the front end here I've got a video loader to load the video; of course we need to work on the footage we want to edit. For instance, in this video maybe I want to remove this person right in the middle of the bridge, so I use the point editor to help visually locate whatever object I want to remove. Once I've selected the region and identified the object I want to remove, we move to the segmentation section here that I've created. Using the same tool, we set the conditions and use the video as the segmentation input. This part is important, because if you choose the wrong settings and use a single image it will run a lot slower, so using the video is just more suitable for what we're doing right now. Next is the mask: usually I'll set the expand mask to 30 or 20 depending on the range. It depends on what kind of object you want to remove and its shape or form; for humans I usually set it to 30 (there's a short sketch of what that expansion does just after this part).

Then we move on to the sampling steps, but before we get there we need to take a look at this thing. This isn't the usual text encoder we load when working with video, where you edit and encode your positive and negative prompts. This is something new called NAG, or Normalized Attention Guidance. It's another guidance algorithm for your text prompts. Basically, as you can see, we're using fewer sampling steps now; a lot of times when you're working with Flux for images or Wan 2.1, you'll be using FusionX or CausVid, where you run very low sampling steps and sometimes even a lower CFG. When applying NAG here, the text prompts take more effective control over generating your videos. One thing that inspired me to use NAG together with the MiniMax Remover is that sometimes when we remove objects with the MiniMax Remover model, we still see remnants of the removed object, like shadows, and of course in ComfyUI you'll see that it's not completely removed. For example, this footage here is a good example where you can still see the human shape and shadow remaining.
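A quick aside on that expand-mask value: the snippet below is a rough sketch of what expanding a mask by 20 or 30 pixels generally amounts to (a morphological dilation). It is not the workflow node's actual code, and the frame size and blob are made up.

```python
import numpy as np
import cv2

def expand_mask(mask: np.ndarray, pixels: int = 30) -> np.ndarray:
    """Dilate a binary mask outward by roughly `pixels`, similar in spirit to the
    workflow's 'expand mask' value (30 for people, 20 for smaller objects)."""
    kernel = np.ones((2 * pixels + 1, 2 * pixels + 1), dtype=np.uint8)
    return cv2.dilate(mask.astype(np.uint8), kernel, iterations=1)

# Example: a 480p-ish frame mask with a rough person-shaped blob
mask = np.zeros((480, 832), dtype=np.uint8)
mask[180:420, 360:430] = 1
expanded = expand_mask(mask, pixels=30)
```

Growing the mask a little past the segmented outline helps the remover also cover soft edges, motion blur, and fringe pixels around the object.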
Going back to the NAG project page, as you can see, the concept works even for Flux, and the same goes for Wan 2.1. For example, you've got a text prompt like "a dress floating underwater," and then you put "woman" as the negative prompt. That means you generate an output where the person wearing the white dress under the sea disappears, or simply isn't there. This approach makes prompt adherence stronger than a normal text encoder setup. So what I did here was apply NAG with my text prompts: I fed in both the original text prompts and the NAG text prompts from the same output. From my testing it doesn't matter whether you connect separate text encoder nodes or use the same one for both text embeds; then of course we connect the text embed output to the sampler. This way we're doing something different with the text prompt logic. What I did was describe a natural landscape view of the park with a bridge in the middle of the frame, and put "human" and "shadows" as the negative prompts, using the same concept they showed on the NAG demo page.

So I ran this with the masked person, but I expected there would still be some problems with the removal. Don't expect it to 100% remove shadows, or the objects that sometimes remain. Again, this is a very small AI model, so sometimes it can't do perfect work; some video scenes I tested before recording this video couldn't be cleaned up at all. Basically we have to test, and that's the main purpose of trying out new technologies. Let's run this and see how it looks.

One more thing I have to mention: before you start this workflow, or any similar workflow, be careful when you load the point editor not to run the sampler at the same time. Because I've already pinpointed the areas I want to include in the segmentation, I can enable the sampler and run everything together. Otherwise, if you're working with new footage, bypass the sampling group and run the video loading first, select the region with the point editor, and run the segmentation again; make sure you've got the right objects masked before enabling the sampling and running it again.

Here's the result. As you can see, you can still see the shadow of the character; honestly, it's not great. In some footage I've tested, it's unable to fully remove anything, like this one, even though I followed all the sampling steps, CFG, shift, and everything mentioned in the code. Here they're using 12 inference steps, and they don't mention any CFG because this model doesn't need CFG; as you can see, not using CFG makes it highly efficient. Since there's no CFG at all, we can set the CFG to 1 in our sampler, and a very low shift number works well. You can use 6 steps or 12 steps, but since this is a repackaged model, sometimes the safetensors files won't work perfectly, so I'll just use 12 steps to make sure there's enough inference to run. That's what it is for this one, and it's a bad example, a bad case, to show you that this removal AI model isn't always able to remove everything from every video scene.

Let's try another one that I removed successfully. Okay, for example this one: I've tested this footage and I'm able to remove the characters, either the character in the middle or the character in the front; those two ladies can both be removed. In this example I'll remove the character in the middle to show you guys how that looks. In the text prompt here, because it's all humans, I don't need to put "human" as the negative prompt.
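For anyone curious what "normalized attention guidance" roughly does under the hood, here's a conceptual sketch. The function name, parameter values, and exact normalization are my assumptions rather than the paper's code, but the gist is: the negative prompt pushes the attention features away from what you want gone, and the normalization keeps that push bounded, which is why the sampler's CFG can stay at 1.

```python
import torch

def nag_attention_guidance(z_pos: torch.Tensor, z_neg: torch.Tensor,
                           scale: float = 5.0, tau: float = 2.5,
                           alpha: float = 0.25) -> torch.Tensor:
    """Conceptual NAG-style guidance on attention features (assumed values, not the
    paper's defaults): extrapolate positive vs. negative features, clip the norm so
    the result stays close to the positive branch, then blend back in."""
    z_ext = z_pos + scale * (z_pos - z_neg)                  # push away from the negative prompt
    ratio = z_ext.norm(p=1, dim=-1, keepdim=True) / (
        z_pos.norm(p=1, dim=-1, keepdim=True) + 1e-6)
    z_ext = torch.where(ratio > tau, z_ext * (tau / ratio), z_ext)   # normalization / clipping step
    return alpha * z_ext + (1.0 - alpha) * z_pos             # blend toward the positive features
```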
I'll just keep the negative prompt template and add more detail about what's in the video scene to my positive prompt; that's all I need. Once you see this, you'll get the idea: the character here is going to be masked and removed from all of this video footage. So far through testing, I've found that if an object is a small proportion of the whole screen, it's easier for these AI models to remove it, but if you've got a larger object, something that takes up a big mass in the video scene, it's harder for the AI to remove it completely. That's what I've found from my experience so far. Because in this video scene the lady in the middle of the group of friends is just a very small proportion of the whole picture, she's easier for this AI model to remove. That's the conclusion I've come to from the handful of video clips I've tested with this so far.

Let's check how far we've come. Just as I finish talking, we've got the generated result, and it took 54 seconds to process this video footage. As you can see, even though I didn't do any enhancement or a second sampling pass, the person is completely removed from the middle here; they're gone. But even though they're gone, you can still see the shadow of the character remaining. That's just a limitation of this AI model, I think. It's just like this footage from their official page, where you can see there's no reflection of the animals, and also this one, where you can see the character's shadow remaining as well as this blurry area right on the door. So this is still a limitation of the removal AI feature; we're still seeing these issues even in other AI models. We'll need another sampling step, or another group of samplers, to fix this if we want to use the footage after removal. Still, in this example I'd say this is already a pretty good result for the MiniMax Remover model.

The next step is something I tested before this video: I'll use another group of Wan 2.1, and this time I'll use the VACE model, because it gives us more ControlNet-style conditioning to leverage. I also used two text encoders here just to show you that it works either way: you can connect the NAG node using separate text encoders, or you can use the same text encoder node with the same text prompts, just like the first group where I connected it to the text embed input on the NAG node. Whichever way you connect it, it still works; there's no single right answer for how you connect it back to the second sampler. I call this the Wan VACE refiner, just another sampling pass; that's usually what we call it. We'll be using DWPose, because even after removing the person in the middle we still have her friends dancing here, and also Depth Anything V2. This time I'll set the strength a little lower, about 50%, so it won't pick up the shadows; if you use 100% strength for Depth Anything you'll almost certainly include the shadows, like this, but at the lower setting there will be no shadow at all after you generate. Let's check it out and see how it works. After the refiner, we connect the output from the first group and pass it to the second refiner group. I also added the Realism Boost LoRA, which comes from the FusionX Hugging Face repo; you can check that out, there's a folder of other LoRAs there that I've also downloaded. Let's run it. It should generate pretty quickly now, since we're using FusionX text-to-video and CausVid, which lets us get away with very low sampling steps.
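To make that depth-strength point concrete, here's a toy sketch of weighting control signals. The real VACE conditioning inside the WanVideoWrapper is more involved than a weighted sum, and the function and tensor shapes here are made up, but it illustrates why dropping Depth Anything V2 to about 0.5 keeps the ground shadow baked into the depth map from steering the refiner.

```python
import torch

# Toy sketch (not the WanVideoWrapper node code): DWPose kept at full strength to
# preserve the dancers' motion, Depth Anything V2 weighted at ~0.5 so shadows in
# the depth map barely influence the refined result.
def weight_controls(controls: dict, strengths: dict) -> torch.Tensor:
    weighted = [controls[name] * strengths.get(name, 1.0) for name in controls]
    return torch.stack(weighted).sum(dim=0)

pose = torch.rand(1, 3, 17, 90, 160)    # hypothetical DWPose render of the clip
depth = torch.rand(1, 3, 17, 90, 160)   # hypothetical Depth Anything V2 output
cond = weight_controls({"pose": pose, "depth": depth}, {"pose": 1.0, "depth": 0.5})
```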
Okay, so we've got the generated result here. As you can see, the first pass is kind of okay but not really high quality: you can still see a little blurriness, and because the character was masked and removed, the lady behind that position has a hand that looks broken and then suddenly pops in right there. We'll refine that using the Wan VACE; I call this just a simple refinement group, so I labeled it "Wan VACE refine." As you can see, the hands of the character in the back are now completely visible and move smoothly throughout the entire video, and we don't have any leftover shadows on the ground anymore, because we refined the whole image like this. I was also generating at 720p this time, so the resolution is higher for the entire video; even though this is still fairly low res, it's 720p. In the second sampling pass I also used video upscaling with an upscale model and upscaled it two times. In the full view like this, you can see the hand of that character appears throughout the entire video, the character in the middle is completely gone with no shadow at all, and the character in the back has a complete leg with no morphing or blurry areas. This is how we can really polish the video output after object removal.

Also, because I used a higher resolution this time with the FusionX VACE FP16 model, I get a better output, but there's a drawback: if you're using FP16 with heavier model weights, you'll consume a lot more VRAM. Because I've got higher hardware specs now, I'm able to process this; previously even an NVIDIA 4090 couldn't handle this memory reservation size. So it's a give-and-take for whatever you're trying to generate. One more thing: you can use a MultiGPU node to load another CUDA device with the wrapper nodes. There's a node for this; you can drag it up here from the ComfyUI-MultiGPU custom node pack and load the WanVideo model loader MultiGPU, which is made for the WanVideoWrapper, to connect the model data. Normally, if you're using the native diffusion model loader node, it's a purple output dot and you connect it to the native KSampler like this; this MultiGPU pack also provides a custom node for connecting with the WanVideoWrapper, so you can use MultiGPU this way to connect your model and allocate devices. If you've got multiple GPUs, you can allocate the first CUDA device for this model and the second CUDA device for other tasks like the text encoder or the VAE.

So yes, this is one method you can use for video refinement, but it's not always necessary; there's no textbook answer here. If you just want to do normal video-to-video without the Wan 2.1 VACE, for example, I can delete this whole group and use only the video output here, run it through the VAE encode (we've got the WanVideo encode node from the WanVideoWrapper), and just do V2V like this: plug the samples in up here and there you go, it will simply re-render whatever is playing. Play around with the denoise strength, higher or lower, to get another sampling pass, but if you use a higher denoise, the output video will of course look totally different, so that's another drawback.

Check out both the MiniMax Remover and NAG for guidance; these are pretty cool tools, released at a good time and coming together to offer this video object removal solution. I'll see you guys in the next video. Have a nice day, see you!
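A small footnote on the denoise-strength point above: the sketch below is only a conceptual illustration with made-up shapes, and real schedulers re-noise along their own sigma/timestep schedule rather than this plain linear mix, but it shows why a higher denoise value makes the V2V output drift further from the source footage.

```python
import torch

def v2v_start_latents(video_latents: torch.Tensor, denoise: float) -> torch.Tensor:
    """Toy illustration of video-to-video denoise strength: the encoded source video
    is only partially re-noised before sampling resumes, so a higher value starts
    closer to pure noise and the result diverges more from the original clip."""
    noise = torch.randn_like(video_latents)
    return (1.0 - denoise) * video_latents + denoise * noise

latents = torch.randn(1, 16, 21, 90, 160)               # hypothetical encoded source video
start_gentle = v2v_start_latents(latents, denoise=0.3)  # stays close to the source footage
start_heavy = v2v_start_latents(latents, denoise=0.8)   # output will look "totally different"
```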


Channel: Benji’s AI Playground
