Transcript of Unlock wan2.2's Secret Power! Create Complex Animations From Just 2 Images
Video Transcript:
Hello everyone. Today we're looking at a hidden skill of Wan 2.2: first-last-frame interpolation. This feature is incredibly powerful. As you can see, it can not only handle very simple first-last-frame interpolation but also very complex scenarios. For example, look at this one: it involves quite a few scene and action changes. The character first lies on the ground, then stands up, walks to the table, and then sits down. There are a lot of actions, yet Wan 2.2 displays all of them excellently. Next, I'll reveal how this workflow is implemented.

First, let's look at how I discovered this feature. I initially found a post on Twitter published by ComfyUI Wiki. It mentioned that Wan 2.2 could achieve a first-last-frame function and included a screenshot. We can click to view it; it might not be very clear, so I'll open it in a new window and zoom in so everyone can see it. As you can see, it mainly uses a node called WanFirstLastFrameToVideo. This node was previously used with the dedicated Wan 2.1 first-last-frame model. Two images are connected to its start_image and end_image inputs, and it generates new positive conditioning, negative conditioning, and a latent, followed by Wan 2.2 sampling.

I thought it looked simple, so I tried it. Let's look at my first workflow, built exactly as described in the post, and the generated result. It did not produce a good first-last-frame effect; it performed a scene switch instead, which was very disappointing. Still, I felt it should be possible, so I made some improvements. What was the problem? Mainly that clip_vision_start_image and clip_vision_end_image had not been connected. As we mentioned before when discussing first-last-frame models, these two inputs are quite important, so I connected them.

Alright, let's look at my improved version. This is the first-last-frame result, and it can generate effects like this. Through testing I found it is very stable in this situation; this effect is very easy to achieve. Also, pay attention to the frame count: my current setting is 81 frames, but you should preferably change it to 121 frames, which makes it more stable. Good, this is our first usable workflow.

Let's briefly summarize it. We loaded the original Wan 2.2 high-noise and low-noise models; for the weight type I chose fp8. Alongside them are the UMT5 text encoder and the Wan 2.1 VAE encoder-decoder, which Wan 2.2 is compatible with. Additionally, I loaded CLIP Vision H, which, as we previously mentioned, is a visual encoder officially released by Comfy. The first and last frame images are two image inputs that feed the crucial WanFirstLastFrameToVideo node as the first and last frames. After being encoded by the CLIP Vision encoder we just mentioned, they also produce clip_vision outputs, which are connected to clip_vision_start_image and clip_vision_end_image. For the prompt I kept it simple, "a girl sitting on a bench playing with her phone"; there's nothing much to say.

For the sampling part, I applied acceleration before sampling: I used the LightX2V LoRAs, which we have used many times, at rank 128, added to both the high-noise and low-noise models. The subsequent steps are basically consistent with the official workflow, except that my sampler uses LCM, the CFG is 1.0, and there are 8 sampling steps: the first four steps denoise with the high-noise model and the last four with the low-noise model. Note the difference here: for the high-noise model, add_noise is enabled and return_with_leftover_noise is set to enable, whereas in the official workflow it is set to disable. The other settings are the same.
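To make the wiring easier to follow, here is a rough pseudo-Python sketch of the graph described above. This is not runnable ComfyUI code and not the exact workflow from the video: the function names simply mirror the node titles, the file names and parameter names are assumptions, and the values reflect what is mentioned in the transcript (121 frames, LCM, CFG 1.0, a 4/4 step split).

```python
# Pseudo-Python sketch of the node graph described above; NOT runnable ComfyUI code.
# Function names mirror node titles; file names and parameter names are assumptions.

# --- Model loading ---
high = load_diffusion_model("wan2.2_i2v_high_noise_14B_fp8.safetensors")  # assumed filename
low  = load_diffusion_model("wan2.2_i2v_low_noise_14B_fp8.safetensors")   # assumed filename
clip = load_text_encoder("umt5_xxl_fp8.safetensors")                      # UMT5 text encoder
vae  = load_vae("wan_2.1_vae.safetensors")                                # Wan 2.1 VAE
clip_vision = load_clip_vision("clip_vision_h.safetensors")               # CLIP Vision H

# LightX2V rank-128 acceleration LoRAs on BOTH models (strengths not stated; 1.0 assumed)
high = load_lora_model_only(high, "lightx2v_i2v_rank128.safetensors", strength=1.0)
low  = load_lora_model_only(low,  "lightx2v_i2v_rank128.safetensors", strength=1.0)

# --- Conditioning and frames ---
positive = clip_text_encode(clip, "a girl sitting on a bench playing with her phone")
negative = clip_text_encode(clip, "")   # negative text prompt not specified in the video

first_frame = load_image("first.png")
last_frame  = load_image("last.png")

# CLIP Vision encodes of BOTH frames: the missing piece in the first attempt
cv_start = clip_vision_encode(clip_vision, first_frame)
cv_end   = clip_vision_encode(clip_vision, last_frame)

# --- WanFirstLastFrameToVideo ---
positive, negative, latent = wan_first_last_frame_to_video(
    positive=positive, negative=negative, vae=vae,
    width=480, height=832, length=121,   # vertical 480x832; 121 frames is more stable than 81
    start_image=first_frame, end_image=last_frame,
    clip_vision_start_image=cv_start, clip_vision_end_image=cv_end,
)

# --- Two-stage sampling: 8 steps total, LCM, CFG 1.0 ---
latent = ksampler_advanced(
    model=high, positive=positive, negative=negative, latent=latent,
    sampler="lcm", cfg=1.0, steps=8, start_at_step=0, end_at_step=4,
    add_noise="enable", return_with_leftover_noise="enable",  # per the video, differs from the official template
)
latent = ksampler_advanced(
    model=low, positive=positive, negative=negative, latent=latent,
    sampler="lcm", cfg=1.0, steps=8, start_at_step=4, end_at_step=8,
    add_noise="disable", return_with_leftover_noise="disable",
)

video = vae_decode(vae, latent)
```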
After decoding, this is the result; it's quite good, isn't it? After this test I wondered whether it could interpolate more complex actions, so I ran a second test. For this second test the first and last frames became more complex: the first image shows the same character in a similar scene, lying on the ground playing with a phone, and the second shows her sitting in front of a computer, working diligently. I also changed the prompt accordingly: a girl stands up, walks to the computer, sits down and works hard. Other settings were basically the same. Looking at the final result, the first scene simply switched directly to the second. This wasn't what I wanted; perhaps it couldn't handle complex action interpolation. But I didn't give up and tried another approach.

That gave us the third video. Take a look: in the third video we achieved a very good effect. What improvements were made? Mainly these.

First, I made the prompt more precise. It now states "a girl stands up, walks to the computer, sits down and works hard", so all three key actions are included. I hoped the prompt could better guide the video.

Second, I changed the frame length to 121 frames, because I believe we need to match the main model for the best effect: for Wan 2.2 the ideal frame count is 121, so I changed it from 81 to 121. I had also previously found that at 81 frames the first and last frames are not fully reproduced, so I changed it for both of these reasons. The resolution was not changed; it's a vertical 480 by 832.

Third, I added the WanVideoNAG node. This is a negative-prompt node developed by Kijai and included in the KJNodes extension package. How do you use it? I mainly apply it when denoising with the high-noise model: connect the negative prompt to its conditioning input, connect the model, and the patched model is then sent to the sampler (a rough sketch of this hookup appears below). This way the negative prompt becomes quite effective. In the negative prompt I added "scene transition", meaning I don't want a scene-transition effect to appear. This effectively ensures we get the desired motion within the same scene.

So we achieved a fantastic effect, didn't we? The final workflow therefore looks like this. Some might ask whether this workflow generalizes, so let's test a more complex example. Take a look at these two images: the characters are related but don't look exactly the same; the basic composition is similar, but the backgrounds are different and so are the clothes. We want a costume-change effect. I still used this workflow and changed the prompt to "a girl walking, a golden light flashes, and the girl changes her costume". Let's look at the result: the girl has a walking motion (she didn't actually move forward, her body just twisted, which is fine), then we see a golden light flash and the girl changes her costume. This is a very impressive effect. So you'll find the interpolation capability is very strong: even if the first and last frames have significant scene changes, it can still complete the task very well. So this is the workflow we discussed today.
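Here is the same kind of rough pseudo-Python sketch for the third-test changes. Again, this is not runnable ComfyUI code; the WanVideoNAG parameter names are assumptions, and the point is only the wiring: the NAG node with the "scene transition" prompt patches the high-noise model alone, before that model goes into the first sampling stage.

```python
# Pseudo-Python sketch of the third-test improvements; NOT runnable ComfyUI code.
# Parameter names are assumptions; only the wiring and values matter.

# WanVideoNAG (KJNodes, by Kijai): patch ONLY the high-noise model with the
# unwanted behaviour expressed as a negative prompt.
nag_negative = clip_text_encode(clip, "scene transition")
high_nag = wan_video_nag(model=high, conditioning=nag_negative)

# More precise prompt that names all three key actions.
positive = clip_text_encode(
    clip, "a girl stands up, walks to the computer, sits down and works hard")

# 121 frames, vertical 480x832 (resolution unchanged).
positive, negative, latent = wan_first_last_frame_to_video(
    positive=positive, negative=negative, vae=vae,
    width=480, height=832, length=121,
    start_image=first_frame, end_image=last_frame,
    clip_vision_start_image=cv_start, clip_vision_end_image=cv_end,
)

# High-noise stage uses the NAG-patched model; the low-noise stage is unchanged.
latent = ksampler_advanced(model=high_nag, positive=positive, negative=negative,
                           latent=latent, sampler="lcm", cfg=1.0, steps=8,
                           start_at_step=0, end_at_step=4,
                           add_noise="enable", return_with_leftover_noise="enable")
latent = ksampler_advanced(model=low, positive=positive, negative=negative,
                           latent=latent, sampler="lcm", cfg=1.0, steps=8,
                           start_at_step=4, end_at_step=8,
                           add_noise="disable", return_with_leftover_noise="disable")
video = vae_decode(vae, latent)
```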
Let's briefly summarize the key points. The main model must be I2V, not T2V. I did not use VACE; if you use VACE it may be a different scenario. Always remember to connect clip_vision_start_image, clip_vision_end_image, start_image, and end_image. The frame count must be 121. The prompt must be detailed, making it easier for the model to achieve a smooth transition; this is very crucial. Remember to add the NAG node, and write "scene transition" in the negative prompt before using it.

Additionally, many people might wonder why NAG is only added to the high-noise model and not the low-noise model. This is fundamentally related to how Wan 2.2 works. We all know that Wan 2.2 is a mixture-of-experts (MoE) model: it switches between the high-noise model and the low-noise model, and the switching criterion is the signal-to-noise ratio (SNR). When the ratio of signal to noise reaches 1:1, it switches; this can be seen in the original paper. However, note that ComfyUI cannot perform this switch automatically, so it uses a crude method: the first 4 steps on one model, then the last 4 steps on the other. With a fixed first-4/last-4 split we cannot guarantee that the SNR at the handover point is exactly 1:1, so it can only be described as a crude substitute (a small sketch contrasting the two rules appears after the transcript). As we've mentioned when discussing many other models, implementing MoE models in ComfyUI is quite difficult. But you should understand the underlying principle: the high-noise model constructs the outline, and the low-noise model does the refinement. Once the basic outline is constructed, the general actions and basic animation are already determined. Therefore I only need to ensure that the high-noise model constructs the desired scene during composition; during refinement, the low-noise model is constrained by the high-noise model, since the composition is already set, and it can only refine within that framework, not deviate from it. So NAG is not needed there.

Here I want to remind everyone not to overlook the technology itself and treat it merely as a tool; that will limit your development. Alright, that's all for today. I hope this way of thinking gives everyone some basic inspiration. What are you waiting for? Go and try it yourself, and follow me to become someone who understands AI.
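As a footnote to the step-split discussion above, here is a tiny runnable sketch contrasting the ideal SNR-based handover with the fixed first-4/last-4 split. It is purely illustrative: the SNR formula and the way the 1:1 threshold is applied are assumptions for the sake of the example, not code taken from Wan 2.2 or ComfyUI.

```python
# Illustrative sketch: two ways to decide which Wan 2.2 expert denoises a step.
# The sigma-to-SNR mapping below is an assumption for illustration only.

def pick_expert_by_snr(sigma: float) -> str:
    """Ideal rule described in the video: hand over to the low-noise expert
    once the signal-to-noise ratio reaches 1:1."""
    snr = 1.0 / (sigma * sigma)  # assumed mapping from noise level to SNR
    return "high_noise" if snr < 1.0 else "low_noise"

def pick_expert_by_step(step: int, total_steps: int = 8) -> str:
    """ComfyUI's crude substitute: first half of the steps on the high-noise
    expert, second half on the low-noise expert, regardless of the actual SNR."""
    return "high_noise" if step < total_steps // 2 else "low_noise"

if __name__ == "__main__":
    # With 8 steps the fixed split always hands over after step 3,
    # whether or not the SNR happens to be exactly 1:1 at that point.
    for step in range(8):
        print(step, pick_expert_by_step(step))
```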
Channel: Veteran AI