FreerCustom: Training-Free Multi-Concept Customization for Image and Video Generation

CAD&CG, Zhejiang University, China
*Equal Contribution    Corresponding Author

Customized Image Results



Customized Video Results

Given our available computational resources and the trade-off between cost and performance across video generation models, we use AnimateDiff as the base model for video generation. However, results generated by AnimateDiff tend to exhibit flickering. To enhance visual clarity, we post-process the generated results with CapCut, an off-the-shelf deflickering tool. For the videos presented here, the first row shows the raw results generated by AnimateDiff, while the second row shows the same videos after deflickering.

Despite the flickering introduced by AnimateDiff, the primary contribution of our work remains intact: our approach generates customized videos without any additional training, effectively handling both single-concept and multi-concept scenarios.


Add New Concepts and Remove Existing Concepts


Our method supports both single-concept and multi-concept image and video generation. Its greatest advantage lies in being training-free and requiring minimal user input, while the generated results are on par with, and even surpass, those of existing training-based methods.

Abstract

Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been made in customized image generation, which aims to generate images containing user-specified concepts. Existing approaches focus largely on single-concept customization and still struggle with complex scenarios that combine multiple concepts. They often require retraining or fine-tuning on a few images, leading to time-consuming training and hindering swift deployment. Furthermore, their reliance on multiple images to represent a single concept increases the difficulty of customization.

To this end, we propose FreerCustom, a novel tuning-free method that generates customized images composing multiple reference concepts, using only one image per concept as input. Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enable the generated image to access and focus on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when the reference images include contextual interactions. Experiments show that the images produced by our method are consistent with the given concepts and better aligned with the input text. Our method outperforms or is on par with training-based methods in both multi-concept composition and single-concept customization, while being much simpler.

Method Overview


Given a set of reference images $\mathcal{I} = \{I_1, I_2, I_3\}$ and their corresponding prompts $\mathcal{P} = \{P_1, P_2, P_3\}$, we generate a multi-concept customized composition image $I$ aligned with the target prompt $P$. Our method operates entirely in the latent space. At each intermediate timestep $t$, we extract key and value features from the input images, which include contextual interactions. We then apply a weighted mask strategy to emphasize the relevant target features and filter out irrelevant content. The refined features are fed into our proposed MRSA mechanism to generate high-quality customized results.
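To make the idea concrete, below is a minimal PyTorch sketch of how one such multi-reference self-attention step with a weighted mask could look. The function name mrsa, the default ref_weight value, and the log-space formulation of the weighted mask are illustrative assumptions rather than the paper's exact implementation; the sketch only shows how the generated image's queries can attend to its own keys and values concatenated with concept features from the references, while masked-out reference background is suppressed.

    import math
    import torch

    def mrsa(q, k_self, v_self, ref_kvs, ref_masks, ref_weight=3.0):
        """One multi-reference self-attention (MRSA) step with a weighted mask.

        q, k_self, v_self : (B, N_q, d) query/key/value features of the image
                            being generated at the current layer and timestep.
        ref_kvs           : list of (k_i, v_i) pairs, each (B, N_i, d), extracted
                            from the reference images at the same layer/timestep.
        ref_masks         : list of flattened masks, each (B, N_i), marking the
                            concept region of each reference image.
        ref_weight        : scalar boosting attention toward the masked concept
                            regions (illustrative value, not the paper's).
        """
        d = q.shape[-1]
        ref_masks = [m.float() for m in ref_masks]

        # Concatenate keys/values: the generated image's own features first,
        # followed by the features of every reference concept.
        K = torch.cat([k_self] + [k for k, _ in ref_kvs], dim=1)
        V = torch.cat([v_self] + [v for _, v in ref_kvs], dim=1)

        # Weighted mask in log space: 0 for the image's own tokens,
        # +log(ref_weight) inside each reference concept region, and -inf
        # outside it so irrelevant reference background is filtered out.
        own_bias = q.new_zeros(q.shape[0], k_self.shape[1])
        ref_bias = [torch.where(m > 0.5,
                                m.new_full(m.shape, math.log(ref_weight)),
                                m.new_full(m.shape, float("-inf")))
                    for m in ref_masks]
        bias = torch.cat([own_bias] + ref_bias, dim=1)      # (B, N_q + sum N_i)

        # Standard scaled dot-product attention over the concatenated features.
        attn = (q @ K.transpose(-2, -1)) * d ** -0.5 + bias[:, None, :]
        return attn.softmax(dim=-1) @ V

In practice such a hook would replace the self-attention computation inside the denoising U-Net at selected layers and timesteps; the weighted mask is what keeps the generation focused on each concept rather than on the reference images' backgrounds.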

Paradigm Comparison


Existing methods can be broadly categorized into two types: those that require fine-tuning on a small number of images and those that require training on large-scale datasets. In contrast, our approach requires no training at all and produces results that are on par with, or even superior to, these methods. Additionally, our method supports training-free video customization.

Qualitative Comparison

Comparisons of multi-concept composition.

Comparisons of single-concept customization.

More Visual Results

Single-concept customization. Our method enables extensive customization of a single concept from a single input image.

Appearance Transfer

Our method generates objects whose appearance and materials are similar to those of the input image.

Empower Other Methods

(a) BLIP Diffusion with Ours.

(b) ControlNet with Ours.

Our method can enhance BLIP Diffusion and ControlNet in a plug-and-play manner. (a) With our method, the output of BLIP Diffusion becomes more faithful to the input image and better aligned with the input text. (b) Furthermore, ControlNet generates results that are consistent in layout and identity when combined with our method.