OIR

Abstract

Diffusion-based image editing methods have achieved remarkable advances in text-driven image editing. The editing task aims to convert an input image, described by the original text prompt, into the desired image that is well aligned with the target text prompt. By comparing the original and target prompts, we can obtain numerous editing pairs, each comprising an object and its corresponding editing target. To allow editability while maintaining fidelity to the input image, existing editing methods typically involve a fixed number of inversion steps that project the whole input image to its noisier latent representation, followed by a denoising process guided by the target prompt. However, we find that the optimal number of inversion steps for achieving ideal editing results varies significantly among different editing pairs, owing to varying editing difficulties. Therefore, the current literature, which relies on a fixed number of inversion steps, produces sub-optimal generation quality, especially when handling multiple editing pairs in a natural image. To this end, we propose a new image editing paradigm, dubbed Object-aware Inversion and Reassembly (OIR), to enable object-level fine-grained editing. Specifically, we design a new search metric, which determines the optimal inversion steps for each editing pair by jointly considering the editability of the target and the fidelity of the non-editing region. When editing an image, we first search for the optimal inversion step for each editing pair with our search metric and edit the pairs separately to circumvent concept mismatch. Subsequently, we propose an additional reassembly step to seamlessly integrate the respective editing results and the non-editing region to obtain the final edited image. To systematically evaluate the effectiveness of our method, we collect two datasets for benchmarking single- and multi-object editing, respectively. Experiments demonstrate that our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.

Method

Search Metric: (a) For an editing pair, we obtain candidate images by denoising the inverted latent from each inversion step. (b) We use a mask generator to jointly compute two metrics: S_e, which measures the editability of the editing region, and S_ne, which measures the fidelity of the non-editing region. The final score S is their average. The search metric can also serve as an editing paradigm in its own right, comparable to feature-injection-based image editing methods.
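The selection in step (b) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-step S_e and S_ne values are made-up numbers standing in for scores computed on real candidate images, and the min-max normalization applied before averaging is an assumption added here so that the two metrics share a scale.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def search_optimal_step(
    steps: Sequence[int],
    s_e: Sequence[float],
    s_ne: Sequence[float],
) -> Tuple[int, List[float]]:
    """Return the inversion step with the highest averaged score S.

    s_e[i]  : editability score of the candidate from steps[i] (higher is better)
    s_ne[i] : non-editing-region fidelity score of the same candidate (higher is better)
    Normalizing before averaging is an assumption of this sketch.
    """
    if not steps or not (len(steps) == len(s_e) == len(s_ne)):
        raise ValueError("steps, s_e and s_ne must be non-empty and equally long")
    e_norm = min_max_normalize(s_e)
    ne_norm = min_max_normalize(s_ne)
    s = [0.5 * (e + n) for e, n in zip(e_norm, ne_norm)]
    best_index = max(range(len(s)), key=s.__getitem__)
    return steps[best_index], s


# Made-up scores: editability rises with more inversion steps, fidelity falls.
steps = [10, 20, 30, 40, 50]
s_e = [0.20, 0.24, 0.30, 0.31, 0.32]
s_ne = [-0.01, -0.02, -0.03, -0.08, -0.15]  # e.g. a negated reconstruction error

best_step, scores = search_optimal_step(steps, s_e, s_ne)
print(best_step)
```

With these illustrative numbers the averaged score peaks at an intermediate step, which mirrors the trade-off the caption describes: too few inversion steps limit editability, too many damage the non-editing region.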

OIR: (a) We create guided prompts for all editing pairs from the original prompt P_o and the target prompt P_t. (b) For each editing pair, we use the search pipeline to automatically find the optimal inversion step. (c) Starting from each optimal inversion step, we denoise each editing pair individually, guided by its own prompt. At the reassembly step, we crop the denoised latents of the editing regions and splice them with the inverted latent of the non-editing region. We then apply a re-inversion process to the reassembled latent and denoise it under the guidance of P_t. OIR is particularly effective in multi-object editing.
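The crop-and-splice operation in step (c) amounts to a masked composition of latents. The NumPy sketch below shows only that operation and rests on assumptions: the function name `reassemble`, the (C, H, W) latent layout, and the boolean (H, W) masks are all hypothetical, and all latents are assumed to already be at the same reassembly step. The re-inversion and the final denoising guided by P_t are not shown.

```python
import numpy as np


def reassemble(edited_latents, masks, background_latent):
    """Paste each edited region into the non-editing (background) latent.

    edited_latents   : list of arrays of shape (C, H, W), one per editing pair
    masks            : list of boolean arrays of shape (H, W); True marks the
                       editing region of the corresponding pair
    background_latent: array of shape (C, H, W), the inverted latent that
                       supplies the non-editing region
    If masks overlap, later pairs overwrite earlier ones.
    """
    out = background_latent.copy()
    for latent, mask in zip(edited_latents, masks):
        # Broadcast the (H, W) mask over the channel axis.
        out = np.where(mask[None, :, :], latent, out)
    return out


# Toy example: two editing regions on a 4-channel 4x4 latent.
background = np.zeros((4, 4, 4))
latent_a = np.ones((4, 4, 4))
latent_b = np.full((4, 4, 4), 2.0)

mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:2, :2] = True   # top-left region belongs to pair A
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[2:, 2:] = True   # bottom-right region belongs to pair B

combined = reassemble([latent_a, latent_b], [mask_a, mask_b], background)
```

In the toy example the top-left block takes its values from `latent_a`, the bottom-right block from `latent_b`, and every other position keeps the background latent.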

Other results

Visualization of the Search Metric: The images on the right are the candidate images obtained with the proposed search metric for each editing pair. The curve in the bottom-left corner plots the search-metric score S (y-axis) against the inversion step (x-axis).

BibTeX


        @article{yang2023OIR,
          title={Object-aware Inversion and Reassembly for Image Editing},
          author={Yang, Zhen and Ding, Ganggui and Wang, Wen and Chen, Hao and Zhuang, Bohan and Shen, Chunhua},
          journal={arXiv preprint arXiv:2310.12149},
          year={2023},
        }

Object-aware Inversion and Reassembly for Image Editing

vase → teapot
book → plate with fried egg and fork

cat → tiger
straw hat → colorful straw hat

house → church
tables → tank

starfish → stuffed toy starfish
foam → chequered blanket

lighthouse → rocket taking off
blue sky → sky with sunset

monitor → gift box
keyboard → game controller

TV → TV playing a cartoon picture
bench → crochet bench

plant → Lego plant
human hand → wooden table

Wall-E → crochet Wall-E
water → a pile of confetti

dumplings → fry rice
chopsticks → knife

car → cabriolet
wooden house → glass house

rabbit wearing glasses → origami rabbit wearing glasses
book → newspaper