Given only a reference image, our model generates a realistic audio-video clip with synchronized speech.
Given a reference image and an audio clip, our model generates a realistic video with matching appearance, similar voice, and natural motion.
Given a reference image and a depth map video, our model generates a realistic video with matching appearance and depth-guided motion.
Given a reference image and a pose sequence video, our model generates a realistic video with matching appearance and pose-guided motion.
Given a reference image, an audio clip, and a depth map video, our model generates a realistic video with matching appearance, similar voice, and depth-guided motion.
Given a reference image, an audio clip, and a pose sequence video, our model generates a realistic video with matching appearance, similar voice, and pose-guided motion.