Transcript of ComfyUI Tutorial Series: Ep11 - LLM, Prompt Generation, img2txt, txt2txt Overview
Video Transcript:
Welcome to episode 11 of our ComfyUI tutorial series. Today I want to talk about prompt generation, specifically using large language models (LLMs) to create prompts from instructions, in a way that can be integrated into your workflow. I'll also cover how to generate a prompt from an image.

Let's begin with how to use a vision language model to generate a prompt from an image. Go to the Manager, then to the Custom Nodes Manager, search for the word "Florence", and choose the node with ID 269, made by kijai. Click Install. This installs a node pack that lets us use Florence-2, a vision language model developed by Microsoft. It's specifically designed for tasks involving images and associated text, such as image captioning and object detection.

You usually need to restart, but first let's update ComfyUI, since there are some new updates available. You can click on the link to read more about each release on their page, where you'll find detailed information about the update. If you're on the GitHub page where you installed ComfyUI, look for the Releases section on the right; click on it to see what's new, and make sure to check it from time to time. Click Update All, wait for the update to finish, then click the Restart button and confirm. This will install the node and update the interface.

Always check the nodes after installation, as some may need additional settings or dependencies to work correctly. If you look at the installed nodes, you should see the one you added. Click on its name to open the node page, then scroll down to the installation section to see any dependencies you need to install; without these, you might encounter errors that are hard to diagnose.

Since I'm using the portable version in this series (and if you've followed along from episode one, you are too), we need to run a command in that folder. Copy the long command shown there, which installs the components listed in the requirements.txt file. Go to the ComfyUI_windows_portable folder, type cmd in the address bar, and press Enter; this opens a command window with the path already set to that folder. Paste the command, press Enter, and let it install the necessary components, then close the window. Also close the ComfyUI window so you can restart ComfyUI.

When you run ComfyUI, you might see an error about not finding the path of the LLM folder. Don't worry: after you run your first workflow, the models will automatically download into that folder and the error will disappear. The LLM folder should appear after the first run.

Clear your workflow and start fresh so I can explain the process step by step. Double-click on the canvas, search for "Florence", and add the Florence2Run node. This node has inputs and outputs. On the left it needs an image, so add a Load Image node and connect its image output to the image input of the Florence node. It also needs a model, so search for Florence again and add the DownloadAndLoadFlorence2Model node; the first time you run it, it will automatically download the model for you. Connect the nodes.

Now we need to output the prompt, but what should we choose from the outputs? We need the caption output. A caption is a descriptive sentence or phrase that explains or summarizes the content of an image. To display the text, use the Show Any node, or any node that displays text; this one is from the ComfyUI Easy Use pack, which we installed a few episodes back. The first time you run it, it will download the model and the necessary files, so be patient; this only happens once. When it's ready, you'll see the LLM folder with the Florence-2-base model folder inside. This is because the default model is selected.
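If you're curious what those two Florence nodes are doing under the hood, here is a minimal sketch of the same caption step using the Hugging Face transformers library directly. The model ID and the task token come from Microsoft's public Florence-2 release; the image path, device handling, and token count are illustrative assumptions, and the node pack's internal code may differ.

```python
# Minimal Florence-2 captioning sketch (assumes transformers, torch, and
# pillow are installed; "example.jpg" is a placeholder test image).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-base"

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
task = "<CAPTION>"  # the plain caption task token

inputs = processor(text=task, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Florence-2 wraps results in task-specific formatting; this call strips
# it and returns a dict keyed by the task token.
caption = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)[task]
print(caption)
```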
If the output doesn't look like a prompt, it might be set to region caption; click on that and choose caption instead. Running it will give you the caption. If the caption is too short, you can use detailed caption to get a longer prompt; you can choose the level of detail you need. I prefer more detailed caption for a nice prompt.

There are different versions of the Florence models, similar to Stable Diffusion: the base model is smaller and faster, so if you don't have a powerful video card, stick with it. You can also try versions like the SD3-Captioner, which will also download into a new folder the first time it runs. The node page includes a link for downloading the model folders manually if you encounter any download errors.

Now everything should be ready. The first run takes longer, but after that, getting a caption from an image should be quick, about half a second. I tried a few of these models, and some are faster and more accurate than others, but it also depends on the image and on the Stable Diffusion model you're using to generate from it. None work perfectly, but they provide a good starting point for a prompt. CogFlorence-2.2 also gives some nice results. My favorites are the SD3-Captioner, the CogFlorence version, and the PromptGen version. You can see all the folders I have here, and if you look inside, each one has a different model size: CogFlorence is the largest, followed by SD3-Captioner and PromptGen, with the base model being the smallest.
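Since the caption level is just a different task token passed to the same model, you can compare all three outside ComfyUI. Here is a small self-contained sketch under the same assumptions as above; the three tokens are the ones Microsoft's release documents, which I believe the node's task dropdown maps to:

```python
# Compare Florence-2 caption detail levels on one image.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
image = Image.open("example.jpg").convert("RGB")  # placeholder image

for task in ("<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>"):
    inputs = processor(text=task, images=image, return_tensors="pt")
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)
    )
    print(f"{task}:\n{parsed[task]}\n")
```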
Let's see how we can integrate this into any workflow. I'll load a basic SDXL workflow, just like in episode 3, and add those nodes really quick. First I add the Florence2Run node; I like to add that first because I think of it as needing to run Florence, so it makes sense to choose the run node first. Next it needs the Load Image node, so it has an image to analyze and get a caption from. Then it needs the model, as that's like the brain that knows how to extract the caption from the image.

Now all that's left is to connect it to our workflow, specifically to the CLIP Text Encode node where we usually put our prompt. But how do we connect it, since we only have a clip input? We need to create another input. To do this, right-click the node, go to "convert widget to input", and choose "convert text to input". Now we have a text input ready, and we can connect our caption output to the text input. Will it run? Let's give it a try.

So now, instead of taking the prompt from the text encoder as we usually do, we're using the Florence model to get a description of the image as text. This text is then sent to the text encoder, and the process continues normally with diffusion and the other steps, just like we explained in other episodes with the text-to-image workflow. And the result is... a chair. Haha, all right pixaroma, take a seat on your newly generated chair and reflect on your mistakes. Change the task to more detailed caption instead of region caption. I'll also add a Show Any node so we can check what's going on inside that node and see the caption we get. Yes, that works better. So we interpreted an image, got the prompt, and ran that prompt to generate a new image that is somewhat similar to the original; it's text to image, not image to image.

Here's one generated with the SD3-Captioner version of Florence; looks nice. Let's try the CogFlorence version as well. The first time it loads the model it's a bit slower, but it will be faster the second time. Test different models to see which one works best for your system in terms of speed and quality.

Let's choose a different image, like this anime girl. The model I'm using isn't specifically fine-tuned for anime, so even if I test different Florence models, it captures the colors, subject, and pose correctly but doesn't always get the style right; and even when it does, my SDXL model might not interpret it accurately. Of course, we can adjust the prompt to improve understanding, or use a model fine-tuned for that style, but let's see if Flux can interpret that prompt better. I selected those four nodes and copied them, then loaded a Flux Dev workflow from episode 10 and pasted the nodes. Remember, we just need to convert the widget to an input first to be able to connect it to any workflow. Now I can run it; let me move the nodes so they fit on the screen for you to see. Flux is much more accurate. Sometimes I get blurry images with the Dev version, and the white background can trigger it occasionally, along with some random seeds, but as you can see, Flux is quite good at interpreting prompts.

Let's clear the workflow and run more tests. I'll add another image with a portrait of a woman, change the "control after generation" setting, and use a fixed seed. Then I'll duplicate it a few times and load a different Florence model into each of these tiny workflows. Now when I run it, all those workflows will execute, and you can see how each one interprets the same scene differently, though the prompts are generally similar, with just different wording.

Sometimes when you have multiple workflows running, you might want to run only one without deleting, muting, or bypassing the others. A simple way to do this is to go to the final node of the workflow you want to run, right-click, and select "Queue Selected Output Nodes". This option is part of the rgthree custom node pack. Even with multiple workflows, it will only run the selected one; this can be quite useful at times.

Let me delete the other workflows and use just these four nodes. Select all the nodes you want, right-click on the canvas, and choose "Add Group To Selected Nodes". This creates a group that fits all those nodes; when you move this new group, it moves all the nodes it touches. By right-clicking and going to "Edit Group", you can change the title to something easier to remember, like "Florence2 base" if you want. Now I'm copying and pasting those nodes, changing the Florence model to the CogFlorence version, and repeating the steps by adding a group and giving it a new name. So now we have two different groups, each containing similar but distinct workflows. If I hit Queue Prompt, it will run both workflows, and you can see the flow from one node to another.

But what if I want to deactivate a group quickly and easily? Right-click on the canvas, go to rgthree-comfy, and find the settings for that node pack. Scroll down, enable the "show fast toggles in group headers" setting, set it to always, and click save. Now you'll see two small icons at the top right of each group. The first icon, with an arrow through a node, is bypass: this skips processing inside the group while keeping the data flow intact. The second icon, mute, completely disables the group, stopping both processing and data flow. If you prefer to have multiple workflows in one place, you can organize them this way. If I mute the first workflow and hit Queue Prompt, you'll see that only the second workflow runs while the first one is stopped.
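As an aside, mute and bypass are just a per-node flag in the saved workflow file, and a group is only a bounding box that nodes happen to sit inside. Here is a hedged sketch of flipping a whole group from a script; the format details (a "mode" field where 0 is active, 2 is muted, 4 is bypassed, plus a "bounding" box per group) reflect ComfyUI's saved workflow JSON as I understand it, so treat them as assumptions and keep a backup of the file:

```python
# Toggle every node inside a named group in a saved workflow JSON.
import json

def set_group_mode(workflow_path: str, group_title: str, mode: int = 2) -> None:
    """mode: 0 = active, 2 = mute, 4 = bypass (assumed ComfyUI values)."""
    with open(workflow_path, "r", encoding="utf-8") as f:
        wf = json.load(f)

    group = next(g for g in wf.get("groups", []) if g["title"] == group_title)
    gx, gy, gw, gh = group["bounding"]

    for node in wf["nodes"]:
        nx, ny = node["pos"][0], node["pos"][1]
        # Group membership is positional: a node belongs to the group
        # if it sits inside the group's bounding box.
        if gx <= nx <= gx + gw and gy <= ny <= gy + gh:
            node["mode"] = mode

    with open(workflow_path, "w", encoding="utf-8") as f:
        json.dump(wf, f, indent=2)

# e.g. set_group_mode("my_workflow.json", "Florence2 base", mode=2)
```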
But we can do more. Double-click on the canvas and search for "fast groups"; in this case I want the Fast Groups Muter node. Add that, and you'll see it automatically lists the names of each group on the canvas with a switch next to each name. You can click on a switch to turn that group on or off, providing a very easy way to mute groups with a single button, from one place. Pretty cool, right? You also have an arrow next to each name that will take you directly to that group, helping you focus on it quickly.

Let's right-click on the second group, go to "Edit Group", and change the color to yellow. Now, if we right-click on the Fast Groups Muter node and select "Properties Panel", we have extra settings. In the match colors field, type the color name, yellow in this case; this makes the Fast Groups Muter node show only the yellow groups. If you're unsure of the available color names, you can check the color input, which lists all of them. For example, if you put pale blue in the match colors field, it will only show groups of that color. If you leave the match colors field empty, all groups will be visible, without filtering.

Now let's delete the second workflow. To remove a group, right-click on the group, choose "Edit Group", and then select "Remove".

Let's test with a practical example. Edit the group title and change it to something like "prompt from image". Now the switch is easier to understand, as it clearly shows: enable prompt from image, yes or no. I'll add a positive prompt node, just like we used before in our workflows, and give it a text like "my prompt". You probably guessed that I want to create a switch so I can choose between using the prompt from the image or my own custom prompt instead. I'll add a node called Any Switch, from rgthree. Now I'll connect the caption output to the any1 input and the positive output to the any2 input. For the string output, which is essentially just a text output, I'll add a Show Any node so we can see the result from the switch.

If I run the workflow, I get the prompt from the image, as you can see here. But if I mute the group (notice how it's connected; you can mute it from the group itself or from the Fast Groups Muter) and run it again, I get my custom prompt instead. Let me add another positive node, connect it to the any3 input, and give it the text "prompt number three". If I run it, I still get my prompt. The way the Any Switch works is by taking the first available option from the list: since any1 is disabled, it uses any2, which has my prompt. If I bypass the any2 entry, you'll see that it now uses any3, which has "prompt number three". If all are disabled, I get a null value. So an easy way to use it is to keep all entries disabled and just enable the one you want to use.
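That selection rule, take the first input that actually produced a value, is easy to picture in code. Here is a tiny sketch of the behavior described above; the function name is mine, and muted or bypassed sources are represented as None:

```python
from typing import Optional

def any_switch(*inputs: Optional[str]) -> Optional[str]:
    """Return the first input that is not None, or None if all are off."""
    return next((v for v in inputs if v is not None), None)

caption = None                  # any1: prompt-from-image group is muted
my_prompt = "my prompt"         # any2: positive node text
third = "prompt number three"   # any3

print(any_switch(caption, my_prompt, third))  # -> "my prompt"
print(any_switch(None, None, None))           # -> None (the null case)
```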
I will keep only the group that generates a prompt from the image, and one positive node that I will use for my own prompt. One way this can be useful: run the prompt generator first, then copy the resulting prompt; after disabling the prompt generator, paste the prompt into the positive node, which lets me edit it before it goes to the output. This setup gives us a simple switch to either use the prompt generated from the image or my own prompt.

Let's see how we can implement this in our workflow. We'll load a Flux Dev workflow from episode 10. I'll convert the widget to an input on the text encoder so we can connect our new nodes there, and paste these nodes to avoid recreating them. Now, from the Any Switch, I can connect to the text input where we used to have our prompt, on the text encoder node. I'll move the nodes closer together and then recreate the group around them; for some reason I wasn't able to copy and paste the group as well. I'll name it "prompt from image", and you can see the title has been updated in the switch. Now we can test the workflow. If we want to change the prompt, we can copy it, disable the prompt generation in the switch, and paste the prompt into my positive node. Now I can make some changes, like giving the woman purple hair and a red coat instead. When I run it, we get a woman with purple hair, which is what I wanted. Once the workflow is working as expected, you can experiment with different Florence models to see if you can improve the results; the outcome is quite nice.

I've created a more compact and organized workflow for you, so let's zoom in. First you upload an image; I'll choose this building image. Next, select the model and the task. I'll stick with the Florence base model since it's faster. For the task, depending on how detailed you want the prompt, I prefer the more detailed caption option. When you run the workflow, you'll get a description of the image as a preview. Since the prompt from image option is enabled (you can see it's set to yes), it will use that prompt and process it as usual through the KSampler; I created a group here to simplify things. Once it's finished, you'll get this nice image of a building. If I want to use my own prompt, I can simply disable prompt from image. I aimed to make the workflow logical, so I hope it makes sense: when prompt from image is set to no, it will use my prompt instead, which could be something like "a cat" if I want. However, I can also copy the previously generated prompt for the building and modify it, changing it from a 3D rendering to a digital painting, for example, and adding some fantasy and futuristic elements. Just make sure prompt from image is disabled so it uses my prompt. As a result, I get this nice illustration of a building; pretty cool, I'd say.

Now let's see how we can add a large language model to help generate prompts from text instructions rather than from an image; this is similar to how we ask ChatGPT for prompts. Go to the Manager, then to the Custom Nodes Manager, search for "Searge", and find the node with ID 97, named Searge LLM. Click Install, wait for the installation to finish, then click Restart and OK. After ComfyUI has restarted, go back to the Custom Nodes Manager, sort by installed, and locate the node. Remember to check if there are any additional requirements or dependencies; if you click on the title, you'll be taken to the GitHub page for that node. According to the installation instructions, we need a folder named llm_gguf in the models folder, so navigate to the models folder and create a new folder named llm_gguf. Now we need to download a model called Mistral and place it in that folder; you can download it from the provided link, and I'll include the links for you. There are several versions of the Mistral model available; generally, the larger the model, the better the performance, but also the slower it runs. They recommend the Q4 version, but I've also downloaded the Q8 version. If you encounter any issues, such as an error about a missing llama-cpp, you can try running the commands they provide in the respective folder to resolve the problem; for me it worked without problems, so I hope it works for you as well. I will restart ComfyUI, and when it starts, you can check the command window to see that the nodes are loading, including the Searge node, which runs without any issues; this means no additional steps are needed.

Now let's clear the canvas and start fresh so I can explain how it works. Double-click on the canvas, search for "llm", and add the Searge LLM node; then, to view the generated text, add a Show Any node. In the text field, enter something like "a cute cartoon cat". When you run the workflow, you might encounter an error. Don't be alarmed by errors; they often have simple fixes. Even if you don't understand everything the error message says, look for keywords that make sense. Here, the error indicates that a model is missing, but we've already downloaded a model into the folder we created; the issue seems to be that the model is undefined. Simply hit refresh and you should see the model. After refreshing, you should get a nice long prompt, and if you try something else, like "a cool car", you'll get a different prompt. Note that if you run it again without making any changes, it won't produce a new result, because it's using the same seed; to get a new result, either change the seed or modify the text.
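The Searge node is essentially a wrapper around llama-cpp-python running the GGUF file you downloaded. Here is a rough sketch of the same text-to-prompt step outside ComfyUI; the file name, context size, and instruction wording are my assumptions, and the node's internal prompt template likely differs:

```python
# Generate an image prompt from an instruction with a local Mistral GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="ComfyUI/models/llm_gguf/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    verbose=False,
)

instruction = "Generate a prompt from: a cute cartoon cat"
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": instruction}],
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```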
Let's load the Flux workflow again and see how we can integrate it. I've added the Searge LLM node. If we try to connect the generated output to the clip node, it won't work, because we need a text input; so right-click, select "convert widget to input", then choose "convert text to input". Now you can connect the generated output to the text input. I'll also add a node on the generated output to monitor what's happening, such as a Show Any node, to see how the prompt looks. For the text, I'll enter "a bunny in the forest", then run the workflow; while it's running, let's arrange the nodes a bit so they're clearer and easier to view. The results look nice.

Let's take a closer look at what's happening. Here we have the text field where we put our prompt, and we also have the instructions, which say "generate a prompt from prompt using the given format". This means the instructions are set to generate a prompt from whatever is in that text box. However, if I want, I can include everything in the instructions directly: for example, I could write "generate a prompt from a cute cartoon bunny", and it will still work because the necessary details are included in the instructions.

To make this more user-friendly and give myself better control, I'll right-click on the node, select "convert widget to input", and choose "convert instructions to input". This allows me to add a positive prompt node with ample space to write my text, making it easier to see and edit. I can write my prompt here, but remember we need instructions, so use something like "generate a prompt from". Now it generates what I want using those instructions. You can add another positive node if needed, but keep in mind that only one positive node can be connected at a time. To handle multiple pieces of text, I'll add a Text Concatenate node, which lets me connect both the first and second texts. For the first input, I'll enter the instruction, like "generate a prompt for", and for the delimiter I'll use a space instead of a comma, to keep the text continuous. In the second positive node, I'll put the subject and details of what I want the prompt to be about. Now we can connect the concatenated string to the instructions input. To see how these text parts are combined, I'll add a Show Any node. When I run the workflow, you can see how the instruction is combined, "generate a prompt for a cute cartoon bunny": it uses the first part, then the second part, separated by a space. The result is a cute bunny.
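That Text Concatenate step is nothing more than joining two strings with the delimiter you picked. In plain Python terms (the names are mine):

```python
def concatenate(text_a: str, text_b: str, delimiter: str = " ") -> str:
    """Join two text parts the way the Text Concatenate node does."""
    return delimiter.join((text_a, text_b))

instructions = concatenate("Generate a prompt for", "a cute cartoon bunny")
print(instructions)  # Generate a prompt for a cute cartoon bunny
```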
I also tried pushing the limits of the models to see what else we can do. For instance, I have this image of a building. If you're using Windows 11, you can right-click on the image and select "Copy as path" to get the file path; if you're using Windows 10, you'll need to click on Home at the top, where you can find the copy path option. Since we don't have an image input and this works with text, I thought of pasting the image path here. I tried that, and as you can see, it generated a prompt from the instruction that included the file path. However, it also included extra information about the image that I didn't want in the prompt, so I changed the instructions to exclude the path and file name. When I ran it again, I got an image description; although it described a building, it wasn't similar to my actual building, indicating that the Florence model did a better job initially. It can still be interesting to use and might produce some unexpected results, but it seems this needs a vision model optimized for describing images, similar to what we used with Florence at the start. The settings might not be ideal, leading to overly creative results, or we might need a different model. If anyone knows of a model suitable for image descriptions that comes in GGUF format, please let me know; I'm aware that LLaVA is another model that can handle image descriptions, but the GGUF version I tried didn't work as expected.

Let's try copying the path of this portrait of a woman to see what we get. I pasted the path and ran the workflow, resulting in this illustration; clearly, it's very creative. If you want results similar to your image, you should use the Florence workflow. If you're looking for something more experimental or fun, you can use this method, or perhaps search for a better model specialized in image descriptions.

I generated this robot using the instructions, and I just want to point out something important: if you try to generate again, you'll notice that the prompt doesn't change. This is because the seed is fixed, even though the widget says random seed, which can be confusing. To get a different result, click on the arrow to change the seed, or type a different number; each time you run it with a new seed, you'll get a new generation. Just remember to change the seed each time you want a different result. I think this is set up so that the model doesn't reload each time you run the prompt, which speeds up generation; when you change the seed, it starts from the beginning, resulting in slower generation.
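A similar trade-off exists in llama-cpp-python: the seed can be supplied when the model is constructed, so forcing a genuinely new generation this way means paying the model-load cost again, which matches the slower restart you see in the node. A hedged sketch, reusing the assumed paths from earlier:

```python
# Pick a fresh seed per run so each generation differs.
import random
from llama_cpp import Llama

seed = random.randint(0, 2**31 - 1)
llm = Llama(
    model_path="ComfyUI/models/llm_gguf/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    seed=seed,   # fixed seed -> same output; new seed -> new output
    verbose=False,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Generate a prompt for: a robot"}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```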
I have created a few workflows for you to experiment with. For example, I have one that requires all the instructions in a single field: you start with the instruction, followed by the description. I also have another where the instructions are in two parts: you can put the instructions in the first part and just modify the second part, making it easier to adjust. Additionally, I created a more compact version that is better organized, allowing you to add the description in one place while also having the option to disable the generator if you prefer to use your own prompt. I also made a workflow that handles only text, similar to ChatGPT, but split into two parts for convenience; this one includes more advanced options for the responses, and I've noted what each setting does. There is also a simpler version with default values and no advanced options.

Let's test it out. The first run will be slower, depending on the model you're using, and the generation time likely depends on the length of the text as well. You can also request shorter prompts, or even ask for a poem or a product description for a digital product. I haven't tested all the limits, but feel free to explore and see what it can do; it's free, local, and useful.

I've also created a formula for ChatGPT that can help you generate detailed prompts for Flux. On Discord, in the pixaroma workflows channel, look for this formula, copy the provided part, and then go to ChatGPT; I use the mini version since it doesn't have generation limits. Paste and run the formula there, and it will generate a prompt for you. Wait for the dot to disappear, then simply click "copy code". After that, go to your Flux workflow and paste the prompt there. When I run it, I get this image, for example; and if you ask for something like a cartoon bunny ninja in a bamboo forest, you'll get this result. As you can see, you can obtain longer, more detailed prompts, which speeds up the prompt creation process, and you can edit the prompt afterward if needed.

I've tried to cover as many options as possible for generating prompts, but there are more possibilities. I've already spent a lot of time on this episode, so I'll stop here. Thank you all for your support. If you'd like to support the channel, you can join the membership, but a simple like or comment would also be greatly appreciated. Thank you for watching, and have a great day.
Channel: pixaroma