Achieving Consistent Characters in Image Generative AI

Over the last couple of years, the field of image generative AI has witnessed significant advancements, making this technology increasingly accessible to a wide range of users. Key players in this space, such as Midjourney, DALL-E by OpenAI, and Stable Diffusion by Stability AI, have been instrumental in democratizing the use of AI for creating complex, creative visual content with speed and ease. Every month the technology improves, and looking back, we cannot help but be amazed at the rapid progress this field is experiencing (an example of Midjourney’s model improvement can be seen in the image below).

OpenAI and Midjourney focus on accessibility and ease of use, while Stability AI creates open-source models in its Stable Diffusion family and counts on the community to further tweak and develop its base models. It is thanks to the widespread adoption and customization of the Stable Diffusion models by hobbyists, artists, and developers across the globe that so many new free tools and methods are available for anyone to use and develop further.

While image generation is already at an unprecedented level, some problems are common to all image generative AI models. One of the most prominent is character consistency, which is the topic of this blog post. We will look at the current capabilities of the three mainstream image generative AI solutions and present one possible approach (with many examples) for creating consistent characters: Low-Rank Adaptation (LoRA). The aim of this blog post is to present the approach from a high-level perspective while providing basic guidelines on how to use this method yourself.

 

Non-identical twins

Character consistency is often viewed as the missing link to broad use of image generative AI, as this feature would unlock possibilities such as automated content creation for graphic novels, comic books, animated series and video games, personalized marketing campaigns, and more. It would allow artists to focus on the creative process and spend less time producing illustrations or concept art, speeding up production and increasing the overall quality of the product.

Midjourney’s latest feature release at the time of writing, Character Reference, tells us that the main players in the industry have also identified this problem as a major point of improvement. To illustrate the current state of the technology, we look at a sample of images created by DALL-E 3, Midjourney (with the new Character Reference feature), and Stable Diffusion’s image2image workflow (using the JuggernautXL model), with Pareto’s founder Miha as the test subject. The goal was to get an image of Miha wearing a Christmas hat, with the best of four images selected (main prompt: A man wearing a Christmas hat).

No model does a sufficiently good job with this simple task: either the facial features change quite significantly or the Christmas hat is missing. While Midjourney and DALL-E 3 do not offer additional tools for this (apart from Character Reference), we do have more options when using Stable Diffusion or other open-source models. There are commonly used workflows built on methods like textual inversion, ControlNet, and full base-model fine-tuning (colloquially known as DreamBooth) that can give good results. However, the fastest and most widely used method is fine-tuning with the LoRA technique. It quickly produces good results while not being too computationally demanding or requiring a large training dataset.

 

Who is this LoRA you speak of?

Low-Rank Adaptation (LoRA) is a technique originally developed for fine-tuning large language models efficiently by updating only a small subset of the model’s parameters (source). This approach allows for substantial changes in the model’s behaviour with minimal adjustments. We can think of it as “teaching” a new concept to our original model.
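To give a little more intuition (following the notation of the original LoRA paper): instead of updating a full weight matrix W of size d×k, LoRA freezes W and learns two much smaller matrices whose product forms the update:

    W' = W + B·A,   where B is d×r, A is r×k, and the rank r ≪ min(d, k)

Because r is small, the number of trainable parameters, and with it the size of the resulting LoRA file, is only a tiny fraction of the full model.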

Extending this method to the realm of image generation, for example using the Stable Diffusion model, LoRA allows for the fine-tuning of the main model, in order to produce images that meet specific character, style, theme, or content requirements. 

While this approach is very powerful, it does not come without drawbacks. Although it is less computationally intensive than full model fine-tuning, it is harder to find optimal training parameters. The method also does not work well with multi-subject compositions, as in most cases all of the characters will end up resembling the LoRA-trained subject.

The dataset should ideally consist of at least 15 good-quality images of varied composition (portrait, full body, side view, different poses…). These images also need to be tagged and organised in a way that is compatible with the training script used.

One of the biggest advantages of LoRA is its accessibility. You don’t need a supercomputer to train LoRA models; mid-range consumer GPUs are sufficient, allowing you to use the methods discussed here on your own equipment.

 

Do you really not need a supercomputer?

It is entirely possible to train LoRA models on a mid-range consumer GPU, meaning that you can use the methods presented here yourself. In the example workflow presented in this post, we used an EC2 instance of type g5.2xlarge with the following hardware specifications:

  • VRAM: 24 GiB (NVIDIA A10G Tensor Core GPU)
  • RAM: 32 GiB
  • STORAGE: 500 GiB EBS Volume

It is mounted with the following AMI, which supports all the necessary libraries and functionality:

  • Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)

Software used:

  • Kohya_ss and dependencies (for training the LoRA)
  • Automatic1111 (for testing the LoRA and generating images)

It is, of course, possible to use different configurations than the one listed, but with less VRAM the training takes longer and becomes impossible for some newer models (SDXL-based models require at least 12 GiB of VRAM). There are other software options available (both tools listed above are open-source, so you can even run the code directly), but these two are the most mainstream and user-friendly. There are plenty of good tutorials on how to set up the tools and install the necessary dependencies on your local machine or with a specific cloud provider, so you can try it out for yourself if you wish.
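As a quick sanity check before starting a training run, you can ask PyTorch (which the AMI above ships with) how much VRAM your GPU actually has; a minimal sketch:

    import torch

    # Print the name and total VRAM of the first CUDA device, if one is present.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")
    else:
        print("No CUDA-capable GPU detected")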

 

Sending the model to the gym (training)

We will use the DreamBooth LoRA script for Stable Diffusion, which is implemented in the Kohya_ss scripts. Roughly speaking, LoRA learns the character/object/style that is constant across your training dataset (and is not specifically tagged) and assigns it to a specified activation tag. This simplified view of LoRA training can be imagined as an equation with a single unknown (the target character/style/concept), which the model then extracts (learns) during training.

The activation tag (in the elephant example below, “example_eleph”) should be a tag the model does not know yet.

As stated above, the model will learn whatever is constant across the images in the dataset. That means that if all the training images show the elephant standing still, the model will fold that pose into the activation tag and is unlikely to generate good results of the elephant in different positions (running, lying down, sitting…). To mitigate this, we can introduce more compositional variety into the training dataset, or (less effective, but sometimes the only option) add a tag for the elephant’s pose, for example a caption like “example_eleph, standing, grass”, so the pose is described explicitly instead of being absorbed into the activation tag.

 

Dataset preparation

The LoRA training dataset should consist of at least 15 high-quality images of your character (the training will work with fewer images, but it will be difficult to obtain good results). In our example we had 22 of them: 5 face close-up shots and 17 portraits. If you wish to generate whole-body and side shots, you should also include those in the dataset to get better results. Each of these images should then be named with a consecutive integer, converted to .jpg and tagged with a .txt file of the same name. Below is an example image from the dataset with its corresponding tags. The first tag is the activation tag and should be identical for all of the images, as it will serve as a trigger for the final LoRA (in the .txt file, the tags are just a comma-separated list). It is not necessary to have all of the images in the same dimensions, as Kohya_ss takes care of this during training using bucketing.

We used so-called booru tags (a comma-separated list of short tags), but you can also use descriptive captions (for the image above: mihamlakar, portrait of a smiling man in a beige long sleeved shirt with a simple single-color background). The choice should be aligned with the way the original model you are fine-tuning was trained, but for simple compositions, and if you use simple prompts when generating, there is not much difference.
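To make the renaming, conversion, and captioning step concrete, here is a minimal Python sketch. The folder names, the activation tag, and the example tags are illustrative (in practice each image gets its own tags, written by hand or by an automatic tagger), and the output folder matches the structure shown below:

    from pathlib import Path
    from PIL import Image  # Pillow

    src_dir = Path("raw_images")        # your collected photos, any format
    dst_dir = Path("15_mihamlakar")     # destination folder (see the structure below)
    dst_dir.mkdir(parents=True, exist_ok=True)

    activation_tag = "mihamlakar"

    image_files = sorted(p for p in src_dir.iterdir()
                         if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"})

    for i, src in enumerate(image_files, start=1):
        # Name each image by a consecutive integer and convert it to .jpg
        Image.open(src).convert("RGB").save(dst_dir / f"{i}.jpg", "JPEG", quality=95)
        # Write the matching caption file: the activation tag first, then booru-style tags
        tags = [activation_tag, "1boy", "portrait", "smile", "simple background"]
        (dst_dir / f"{i}.txt").write_text(", ".join(tags))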

The final step of dataset preparation is organising the images and caption files in the correct folder structure compatible with the Kohya_ss scripts, which in our case looks like this:

.
└── lora_miha/
    ├── Images/
    │   └── 15_mihamlakar/
    │       ├── 1.jpg
    │       ├── 1.txt
    │       ├── 2.jpg
    │       ├── 2.txt
    │       └── …
    ├── Model/
    └── Logs/
The number 15 in the name of the folder containing the images determines the number of repeats for each image during training, which couples with the number of epochs, one of the training parameters discussed in the following subchapter.
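To get a feel for how repeats and epochs couple, here is the rough step arithmetic for our dataset (assuming a batch size of 1 and no gradient accumulation):

    num_images = 22   # images in 15_mihamlakar/
    repeats = 15      # taken from the folder name prefix
    epochs = 22       # the epoch we end up selecting later on

    steps_per_epoch = num_images * repeats   # 330
    total_steps = steps_per_epoch * epochs   # 7260
    print(steps_per_epoch, total_steps)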

 

Fine-tuning process

Using the Kohya_ss scripts for training is very straightforward (especially when working in their GUI). The quality of the resulting LoRA does, however, greatly depend on the parameters chosen and, of course, on the dataset, as mentioned above. Because of that, finding optimal parameters is crucial if you wish to maximise the LoRA’s effectiveness.

There is no truly generic set of parameters that is the best choice, as this greatly depends on the model you are fine-tuning as well as on the dataset. The default values set by Kohya_ss are a good starting point and, coupled with the recommendations of the model’s developer, can already give solid results, but some experimentation is always helpful. Understanding the parameters allows us to improve our models further, and there are plenty of good guides available on this topic.
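For orientation, launching a training run with the sd-scripts that power Kohya_ss could look roughly like the sketch below. Every flag value here is illustrative (and the base-model path is an assumption); check the flag names against the sd-scripts documentation for your version and against the recommendations of the model’s developer:

    import subprocess

    # Rough sketch of a Kohya_ss (sd-scripts) LoRA training run for an SDXL-based model.
    subprocess.run([
        "accelerate", "launch", "sdxl_train_network.py",
        "--pretrained_model_name_or_path", "models/juggernautXL.safetensors",  # assumed path
        "--train_data_dir", "lora_miha/Images",
        "--output_dir", "lora_miha/Model",
        "--logging_dir", "lora_miha/Logs",
        "--output_name", "juggxl2_mihamlakar",
        "--network_module", "networks.lora",
        "--network_dim", "32",
        "--network_alpha", "16",
        "--resolution", "1024,1024",
        "--enable_bucket",                     # handles mixed image dimensions
        "--train_batch_size", "1",
        "--max_train_epochs", "24",
        "--save_every_n_epochs", "1",          # keep intermediate epochs for comparison
        "--learning_rate", "1e-4",
        "--optimizer_type", "AdamW8bit",
        "--mixed_precision", "fp16",
    ], check=True)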

We found that the best approach is to experiment with different values and then use something like the X/Y/Z plot to find the best LoRA model. Saving intermediate epochs during training also allows us to test the model at multiple points throughout the learning process.

 

Finding the best epoch and LoRA strength

As stated above, finding the best LoRA model is an experimental process in which multiple epochs and parameter configurations need to be tested. For that, we can leverage the powerful X/Y/Z plot script in the AUTOMATIC1111 interface to programmatically vary the models and compare their outputs.

To keep things shorter, we will focus only on finding the best epoch and strength of the LoRA, but the same principle can be applied to other parameters we varied in the fine-tuning process.

In the AUTOMATIC1111 interface, the LoRA model is triggered within the prompt using the syntax <lora:model_filename:strength> (the exact filename can differ depending on your file naming settings), where the trailing number controls the LoRA strength.

If we use the plot’s prompt search-and-replace option with a generic prompt, we can observe the effect of the strength parameter and of the number of training steps (higher epochs underwent more training steps) on the generated images.
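If you prefer scripting over the GUI, the same epoch-versus-strength sweep can also be driven through AUTOMATIC1111’s web API (available when the UI is started with the --api flag). The epoch suffixes and strength values below are illustrative:

    import base64
    import requests

    URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"
    epochs = ["000006", "000012", "000018", "000022"]   # illustrative epoch suffixes
    strengths = [0.5, 0.7, 0.85, 1.0]

    for ep in epochs:
        for s in strengths:
            prompt = f"mihamlakar, portrait of a happy man <lora:juggxl2_mihamlakar-{ep}:{s}>"
            payload = {"prompt": prompt, "steps": 30, "width": 1024, "height": 1024, "seed": 42}
            r = requests.post(URL, json=payload)
            r.raise_for_status()
            # The API returns base64-encoded images; save each one for side-by-side comparison
            with open(f"sweep_{ep}_{s}.png", "wb") as f:
                f.write(base64.b64decode(r.json()["images"][0]))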

First, we will look at the simple prompt: portrait of a happy man.

We can clearly see how the epoch and the strength parameter affect the resulting image. At lower epochs, we see a different style (more akin to a painting) and a lot less likeness to our subject. This is even more obvious at lower values of the strength parameter, in the columns on the left of the image. Looking at the first column from the left, we see that even at the 18th epoch the effect of the LoRA model is insufficient to produce the desired degree of likeness.

In the photos above, we can also see that at higher strength values we get a different composition, which happens to be very similar to one of our training images. This brings us to one of the most general problems in all of machine learning: overfitting. Ideally, we wish to achieve the desired degree of likeness to our subject while keeping the flexibility to add, change, or vary certain features of the character, such as wearing a Christmas hat. We can again use the X/Y plot with a different prompt (portrait of a man wearing a Christmas hat) to look into selecting the model epoch and strength parameter that will keep some flexibility in the generated images.

Since we had a clean and varied dataset, we only ran into obvious troubles with overfitting at very high epochs and strengths. We can see that the Christmas hat looks odd and is not generated as nicely at higher epochs and strengths.

Using the two experiments above, as well as some additional tests at a finer resolution over these two parameters, we find that the model trained for 22 epochs with a strength parameter of 0.85 reliably gives us the desired results.

 

What comes out of the oven?

Using the model we chose in the previous chapter, we obtain the desired level of likeness to our subject while keeping the flexibility we need to generate varied compositions with this subject. Below is a compilation of the best results obtained using this LoRA model (and a minigame in the last image). All the prompts of course included the LoRA trigger (mihamlakar, <lora:juggxl2_mihamlakar-000022:0.85>). We abstain from comments and let you judge the quality and drawbacks of the results yourself.

prompt: photo of a man running in a marathon

prompt: happy man wearing white tennis clothes and holding a Wimbledon trophy standing on a grass tennis court

prompt: close up dynamic photo of a man surfing

prompt: photo of a man on a red carpet holding an oscar award

prompt: photo of a cowboy in full gear with a lasso riding a horse

test: which Miha is “fake”? (answer is at the end of the conclusion)

Conclusion

Helping artists on projects using image generative AI has shown us that the main feature missing for widespread adoption of AI tools in the production of graphic novels, comic books, children’s books, etc. is in fact character consistency. Our hope is that this blog post has given you an idea of how to apply the LoRA fine-tuning method to tackle this problem today.

While the LoRA approach to obtaining consistent characters might not be the final solution to this issue, it is certainly an impressive tool that is readily available and gives good results. It does introduce some drawbacks, such as complex parameter selection for optimal results.

We believe that while the current character consistency methods might not be perfect, they already allow us to use the exciting new technology of image generative AI to improve and simplify the work of artists and content creators. With the rapid advancement of the technology, driven as much by the corporate side as by the very motivated open-source community, who knows: in a couple of years you might not be able to correctly identify the third image of Miha as “fake”!