Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been achieved in customized image generation, which aims to generate images of user-specified concepts. Existing approaches focus extensively on single-concept customization and still struggle in complex scenarios that combine multiple concepts. They often require retraining or fine-tuning on a few images, which makes the process time-consuming and hinders rapid deployment. Furthermore, the reliance on multiple images to represent a single concept adds to the difficulty of customization.
To this end, we propose FreeCustom, a novel tuning-free method that generates customized images of multi-concept composition from reference concepts, using only one image per concept as input. Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enable the generated image to access and focus more on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when the provided images contain contextual interactions. Experiments show that the images produced by our method are consistent with the given concepts and better aligned with the input text. Our method outperforms or performs on par with training-based methods on both multi-concept composition and single-concept customization, while being far simpler.
Given a set of reference images $ \mathcal{I} = \{I_1, I_2, I_3\} $ and their corresponding prompts $\mathcal{P} = \{P_1, P_2, P_3\}$, we generate a customized multi-concept composition image $I$ aligned with the target prompt $P$. Our method operates entirely in the latent space. At each intermediate timestep $t$, we extract key and value features from the input images, which contain contextual interactions. We then apply a weighted mask strategy to emphasize the relevant target features and filter out irrelevant content. The refined features are leveraged by our proposed MRSA mechanism to generate high-quality customized results.
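To make the mechanism concrete, the sketch below shows one way MRSA with a weighted mask could be realized as a self-attention variant in PyTorch. The function name `mrsa`, the tensor shapes, and the weighting scalar `w` are illustrative assumptions rather than the exact implementation: the generated image's queries attend to its own keys/values concatenated with those of every reference image, while the weighted mask drops background tokens of the references and up-weights tokens inside each concept region.

```python
import torch


def mrsa(q, k_self, v_self, ref_kvs, ref_masks, w=3.0):
    """Minimal MRSA sketch (shapes and the weight `w` are assumptions).

    q, k_self, v_self: (B, heads, N, d) projections of the generated image.
    ref_kvs:  list of (k_ref, v_ref) pairs, each (B, heads, N, d), extracted
              from the reference images at the same timestep and layer.
    ref_masks: list of (B, 1, N) binary masks marking each concept region.
    """
    ref_ks = [k for k, _ in ref_kvs]
    ref_vs = [v for _, v in ref_kvs]

    # The generated image attends to its own tokens plus all reference tokens.
    k = torch.cat([k_self] + ref_ks, dim=2)               # (B, heads, N_total, d)
    v = torch.cat([v_self] + ref_vs, dim=2)

    # Weighted mask: own tokens keep weight 1, concept tokens are scaled by w,
    # background tokens of the references are filtered out entirely.
    own = torch.ones(k_self.shape[0], 1, k_self.shape[2],
                     device=q.device, dtype=q.dtype)
    mask = torch.cat([own] + [m * w for m in ref_masks], dim=-1)  # (B, 1, N_total)

    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) / q.shape[-1] ** 0.5
    logits = logits.masked_fill(mask.unsqueeze(1) == 0, float("-inf"))
    attn = logits.softmax(dim=-1)

    # Re-weight attention toward the concept regions and renormalize.
    attn = attn * mask.unsqueeze(1)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)
```

In practice, the reference keys and values would be cached from the self-attention layers while denoising each reference image with its own prompt, which is what allows the contextual interactions present in the references to carry over to the composed result.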
Existing methods can be broadly categorized into two types: those that require fine-tuning on a small number of images and those that involve training on large-scale datasets. In contrast, our approach requires no training and produces results that are on par with, or even superior to, those of these methods. Additionally, our method supports training-free video customization.
Comparisons of multi-concept composition.
Comparisons of single-concept customization.
Single-concept customization. Our method enables extensive customization of a single concept given only a single input image.
Our method generates objects whose appearance and materials are similar to those of the input image.
(a) BLIP Diffusion with Ours.
(b) ControlNet with Ours.
Our method can enhance ControlNet and BLIP Diffusion in a plug-and-play manner. (a) With our method, the output of BLIP Diffusion becomes more faithful to the input image and better aligned with the input text. (b) Furthermore, when combined with our method, ControlNet generates results that are consistent in layout and identity.
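As a rough illustration of this plug-and-play usage, the sketch below wires a hypothetical `MRSAProcessor` (wrapping the MRSA function sketched earlier) into a diffusers ControlNet pipeline by swapping only the UNet's self-attention processors. `MRSAProcessor`, its module path, and the checkpoint names are assumptions for illustration, not shipped components.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Hypothetical processor that applies MRSA inside self-attention layers;
# it is not part of diffusers and stands in for the method sketched above.
from mrsa_processor import MRSAProcessor

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Swap only the self-attention ("attn1") processors so the reference concepts
# are injected there, while cross-attention and ControlNet conditioning stay intact.
processors = {
    name: MRSAProcessor() if name.endswith("attn1.processor") else proc
    for name, proc in pipe.unet.attn_processors.items()
}
pipe.unet.set_attn_processor(processors)
```

Because nothing is trained, the same swap could in principle be applied to other pipelines, such as BLIP Diffusion, in the same way.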