Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
[Demo videos showcasing: facial makeup, exaggerated expressions, extreme poses, fast motion, face occlusion, long-take video, reappearing faces, cinematic lighting, and semi-transparent face occlusion.]
Abstract
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from the source video can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this idea, this work presents LivingSwap, the first video-reference-guided face-swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and reverse the data pairs to obtain reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video’s expressions, lighting, and motion, while significantly reducing manual effort in production workflows.
Methodology
Overview of the proposed LivingSwap framework for video face swapping. (1) Keyframes serve as temporal anchors to ensure consistent identity injection across long sequences. (2) We feed the source video as a reference, enabling high-fidelity reconstruction of non-identity attributes such as lighting and expressions. (3) By generating chunks sequentially and propagating the final frame of each chunk as guidance for the next, LivingSwap achieves seamless transitions in long videos. (4) We generate data with a per-frame editing method and reverse the data roles to construct paired samples, ensuring reliable, artifact-free supervision.
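To make step (3) concrete, the sketch below shows how chunked generation with last-frame propagation could be orchestrated. This is an illustration under assumptions, not the actual LivingSwap implementation: `swap_long_video`, `generate_chunk`, and `chunk_len` are hypothetical names, and the video generation backbone that consumes the keyframe, reference frames, and guidance frame is abstracted behind a single callable.

```python
from typing import Callable, List, Optional, Sequence

import numpy as np

Frame = np.ndarray  # a single H x W x 3 frame

def swap_long_video(
    source_frames: Sequence[Frame],
    keyframes: Sequence[Frame],
    generate_chunk: Callable[..., List[Frame]],
    chunk_len: int = 16,
) -> List[Frame]:
    """Generate a long swapped video chunk by chunk (hypothetical sketch).

    Each chunk is conditioned on (i) a keyframe carrying the target identity,
    (ii) the corresponding source-video frames as a reference for lighting,
    expression, and motion, and (iii) the last frame of the previous chunk,
    which is propagated forward so consecutive chunks stitch together seamlessly.
    """
    output: List[Frame] = []
    prev_last: Optional[Frame] = None
    for start in range(0, len(source_frames), chunk_len):
        reference = list(source_frames[start:start + chunk_len])
        keyframe = keyframes[min(start // chunk_len, len(keyframes) - 1)]
        chunk = generate_chunk(
            reference=reference,   # non-identity attributes from the source video
            keyframe=keyframe,     # identity anchor for this chunk
            guidance=prev_last,    # None for the first chunk, else the carried-over frame
        )
        output.extend(chunk)
        prev_last = chunk[-1]      # propagate the final frame to the next chunk
    return output
```

The keyframe lookup above assumes roughly one identity anchor per chunk; denser or sparser keyframe placement would only change that indexing, not the propagation loop.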
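Step (4) can be sketched in the same spirit. The recipe assumed here is that a per-frame face-swap editor converts a real video into a pseudo-source, and the roles are then reversed so the untouched real video serves as ground truth; `build_reversed_pairs` and `per_frame_swap` are illustrative names, not the paper's actual pipeline.

```python
from typing import Callable, Dict, List, Sequence

import numpy as np

Frame = np.ndarray  # H x W x 3

def build_reversed_pairs(
    real_videos: Sequence[Sequence[Frame]],
    donor_identities: Sequence[Frame],
    per_frame_swap: Callable[[Frame, Frame], Frame],
) -> List[Dict[str, object]]:
    """Construct reversed training pairs (an assumed recipe).

    For each real video A and a donor identity image, a per-frame editor
    produces a swapped pseudo-video B, which may be flickery or imperfect.
    Reversing the roles, the model is trained to swap B back toward A's
    identity, so the supervision target is always a real, artifact-free video.
    """
    pairs: List[Dict[str, object]] = []
    for video, donor in zip(real_videos, donor_identities):
        pseudo = [per_frame_swap(frame, donor) for frame in video]  # per-frame edit
        pairs.append(
            {
                "source_video": pseudo,        # model input: edited pseudo-video
                "target_identity": video[0],   # identity to inject: the original subject
                "ground_truth": list(video),   # real video as reliable supervision
            }
        )
    return pairs
```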