MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation


Liyang Li*, Wen Wang*, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Zhejiang University
* Equal contribution

Ref Image → Generated Audio-Video

Given only a reference image, our model generates realistic audio-video content with synchronized speech.

[Visual] A man on stage, holding a microphone. Wearing a blue shirt over a gray t-shirt, left arm extended outward. Warm orange-yellow background.
[Speech] “The Thatcher pubes, that’s what it would be called.”
[Visual] A woman with shoulder-length brown hair, wearing a white top with a floral pattern. Gold earrings. Concerned or questioning expression.
[Speech] “Sent away family somewhere.”
[Visual] A man sitting in an office setting. Short dark hair, mustache. Light purple button-up shirt over a beige t-shirt. Hands clasped.
[Speech] “I don’t feel good about it, but he just.”
[Visual] A man with short curly brown hair and a beard, wearing a dark blue jacket with a “Jeffersonian Institute of Forensic Medicine” patch.
[Speech] “So sure our victim was a firefighter after all.”
[Visual] A man standing in a professional office setting. Gray suit, white shirt, purple tie. Pocket square in jacket.
[Speech] “and I am not going back again.”
[Visual] A person with medium-length wavy brown hair, wearing a green jacket with a white fur-lined collar over a dark shirt. Outdoor background.
[Speech] “Yeah, but the birds are still here.”
[Visual] A person with blonde curly hair wearing a red sweater with green and blue patterns. Holding up a clear plastic bag filled with medicine bottles.
[Speech] “Not anymore. Here’s a variety of medicine.”
[Visual] A woman with dark hair tied back is shown. She is wearing a white lab coat. The background appears to be an indoor setting, possibly a medical or laboratory environment, with a window and some blurred objects visible.
[Speech] “Your husband must be a pretty important guy.”

Ref Audio + Ref Image → Generated Audio-Video

Given a reference image and audio clip, our model generates a realistic video with matching appearance, similar voice, and natural motion.

[Visual] A man is sitting in what appears to be an office or study. He is wearing a plaid blazer, a checkered shirt, and a patterned tie. He has glasses and short brown hair. The background shows a window with a view of a building and a shelf with several books. The man is looking to his right and appears to be speaking.
[Speech] “Right. She was our in-house publicist about ten years”
[Visual] A woman with short, dark, wavy hair. She is wearing a black leather jacket over a colorful striped shirt. Her expression appears to be one of concern or worry. In the background, there is a man with short, dark hair, wearing a light blue shirt and dark pants. He is walking away from the woman, and the setting appears to be an outdoor area with trees and a road. The lighting suggests it might be daytime, but the sky looks overcast.
[Speech] “No one is”
[Visual] A woman with long, wavy, light brown hair is holding a black phone to her ear. She appears to be in a professional setting, possibly an office or a conference room. There is another person in the background, wearing a white shirt and a dark tie, but they are out of focus. The woman is wearing a white top. The lighting is bright, suggesting an indoor environment.
[Speech] “Thank you for coming all the way down here”
[Visual] A woman with long, dark hair. She is wearing a black top. Her expression appears to be one of surprise or shock, with her mouth slightly open and her eyes wide. The background is somewhat blurred, but it seems to be an indoor setting with a greenish, bokeh-like light effect.
[Speech] “I found that it was best to ask”
[Visual] A man with light-colored hair and glasses, wearing a dark cardigan over a plaid shirt. He appears to be in a room with a fireplace in the background. The man is looking towards another person who is not fully visible in the frame. The lighting in the room is warm and there are some reflections on the fireplace.
[Speech] “I'm sorry, I shouldn't be so”
[Visual] A man is shown in a dimly lit room. He has a receding hairline, a beard, and is wearing a dark suit with a white shirt and a patterned tie. There is a small American flag pin on his left lapel. The background appears to be a storage area with shelves and boxes. The lighting is low, creating a somewhat somber atmosphere.
[Speech] “I'll have your passport returned”
[Visual] A man is standing in what appears to be a room with a decorative wall in the background. The wall has a pattern of light-colored tiles and some gold accents. The man is wearing a dark purple cap and a dark shirt with a floral pattern. He has a black neck brace around his neck. He is gesturing with his right hand as he speaks.
[Speech] “And knowing how to take care of your body”
[Visual] A man with short, dark hair, wearing a dark shirt. He appears to be in a room with blinds covering a window in the background. The lighting is somewhat dim, and the man's face is in focus. There is another person partially visible in the foreground, wearing a light-colored shirt, but their face is not shown. The man's expression seems serious or concerned.
[Speech] “It's nothing personal”

Depth + Ref Image → Generated Audio-Video

Given a reference image and a depth map video, our model generates a realistic video with matching appearance and depth-guided motion.

[Visual] A man is sitting in a dimly lit room. He is wearing a blue and white plaid shirt and has dark hair. He appears to be in a relaxed posture, leaning slightly forward.
[Speech] “Beautiful memories, man.”
[Visual] A man with light brown, slightly wavy hair wearing a dark jacket over a light blue shirt. His expression appears to be one of concern or distress, gesturing with his hands.
[Speech] “I swear I didn’t do anything.”
[Visual] A man sitting in the passenger seat of a car, wearing a dark suit with a white shirt. His expression seems serious or focused.
[Speech] “You gotta work a little outside the system.”
[Visual] A woman with short, dark hair wearing a dark green sweater. She appears to be in a room with a bookshelf in the background, looking concerned.
[Speech] “Listen, I will look into reviving myself.”
[Visual] A woman in a blue scrub top standing in a medical setting. She has shoulder-length brown hair and is gesturing as if explaining something.
[Speech] “The gems are small enough he should be able to excrete them.”
[Visual] A man wearing a black jacket with “DEA” on it, speaking into a microphone outdoors with trees and a building in the background.
[Speech] “Which point we apprehended three individuals in place.”
[Visual] A man talking on a phone with short, dark hair, wearing a dark jacket over a light-colored shirt. The background appears to be indoors with dim lighting.
[Speech] “Please spare me the boring details.”
[Visual] A man with short, dark hair wearing a black and white striped polo shirt in an indoor setting with large windows and plants.
[Speech] “I don’t suppose that’s a Rubenesque 19.”

Pose + Ref Image → Generated Audio-Video

Given a reference image and a pose sequence video, our model generates a realistic video with matching appearance and pose-guided motion.

[Visual] A woman in a dimly lit setting. She has dark hair pulled back and is wearing a blue shirt with a dark jacket over it. She also has large hoop earrings. Her expression appears to be one of surprise or concern as she looks to the side. There is another person partially visible in the background, but their face is not shown. The overall lighting is low, creating a somewhat dramatic or intense atmosphere.
[Speech] “No alarms, no contact with control.”
[Visual] A woman with brown hair tied back in a ponytail is shown. She is wearing a blue shirt with a dark blue collar. She has earrings on. The background appears to be an indoor setting with some blue lighting. There is another person partially visible on the right side of the frame, but only their arm and part of their body are shown. The woman seems to be in the middle of a conversation or reacting to something.
[Speech] “an injury sustained in the attack but of”
[Visual] An older man with gray hair and a beard. He is wearing a gray suit with a white shirt and a patterned tie. He appears to be in an office setting, as there are papers and a whiteboard in the background. The lighting is dim, giving the scene a somewhat somber atmosphere. The man is looking to the side with a serious expression.
[Speech] “No. Okay, since we're a bunch of angers.”
[Visual] A woman with blonde hair styled in an updo. She is wearing a blue and black striped dress and large, ornate earrings with a green gemstone. The background appears to be an indoor setting with a painting on the wall and some furniture partially visible. The woman is looking to the side.
[Speech] “Well maybe that's why he hasn't returned my call.”
[Visual] A man with short, graying hair and a beard. He is wearing a dark hooded garment with a chainmail-like collar. The background appears to be a stone wall, suggesting an old or medieval setting. The man's expression seems serious or intense.
[Speech] “Using heirs, bastards and otherwise.”
[Visual] A man with short brown hair and a beard. He is wearing a dark, textured vest over a high-collared shirt. The background appears to be a dimly lit room with a wooden wall that has some intricate carvings or engravings. The lighting is focused on the man, creating a dramatic effect.
[Speech] “For the man that I chose, my lord.”
[Visual] A man in a suit with a serious expression. He has short, dark hair and is looking slightly to the side. In the foreground, there is a potted plant with large green leaves and a red flower. The background appears to be a wooden wall or door. The lighting is natural, suggesting it might be daytime.
[Speech] “None.”
[Visual] A man with short brown hair and a beard, wearing a dark blue sweater over a plaid shirt. He is in a dimly lit room with a green wall in the background. There is a shelf with a white object on it to the right. The man appears to be speaking, and his mouth is open as if in the middle of a conversation. His expression seems serious or focused.
[Speech] “I know how we can survive. I also know how we can die.”

Ref Audio + Depth + Ref Image → Generated Audio-Video

Given a reference image, an audio clip, and a depth map video, our model generates a realistic video with matching appearance, voice, and depth-guided motion.

[Visual] A man is holding a telephone receiver to his ear. He has short, dark hair and is wearing a light-colored jacket with a plaid shirt underneath. The background appears to be an indoor setting, possibly a kitchen or dining area, with a patterned wall and some furniture visible. The lighting is somewhat dim, giving the scene a somewhat somber or serious atmosphere. The man's expression seems focused or concerned as he listens to the phone.
[Speech] “All we need is to walk Harold.”
[Visual] A man in a white tuxedo with a black bow tie is singing into a microphone. He is holding the microphone in his right hand and gesturing with his left hand. The background is dark with neon signs. One sign says "OON" in blue and another sign has red Chinese characters. The man has short black hair and is looking down while singing. The setting appears to be a stage or performance area.
[Speech] “I'm sick of men making excuses.”
[Visual] An older man with gray hair is sitting in a black leather chair. He is wearing a dark blue suit, a white shirt, and a red tie with a pattern of small white squares. The background features a window with sheer white curtains and patterned red and green curtains on either side. The lighting is dim, creating a somewhat somber atmosphere.
[Speech] “Well, I've already pulled every string I can.”
[Visual] A person with long dark hair, wearing a red and gray uniform with a circular emblem on the chest. The background appears to be a high-tech setting with various screens and panels, some of which display blue and white graphics. The person is standing and looking to the side.
[Speech] “According to Regor two's history that's right.”
[Visual] A close-up of a person with long, wavy blonde hair. The lighting is dim, creating a warm, moody atmosphere. The person appears to be in a relaxed setting, possibly a bar or a cozy room. There is a glass in the foreground, suggesting they might be drinking something.
[Speech] “to have sex and”
[Visual] A woman with shoulder-length brown hair wearing glasses and a gray blazer over a black shirt. She is seated in a room with a dark background. There is a window with vertical blinds behind her, and a silhouette of a dancing figure is visible on the right side of the image. The lighting is dim, creating a somewhat somber atmosphere.
[Speech] “the new big thing these days is”
[Visual] A man in a dark suit, white shirt, and dark tie is shown. He has short, dark hair and is looking slightly to the side. The background is dark and out of focus, with a small sign partially visible behind him. The lighting is dim, creating a serious and somewhat tense atmosphere.
[Speech] “We made the ticket out of here.”
[Visual] A man with short, dark hair and a beard. He is wearing a dark shirt. The background appears to be a dimly lit room with a patterned wall and a green door. The lighting is low, creating a somewhat dramatic or intense atmosphere.
[Speech] “Before we roll anywhere the rest of the team has to be here.”

Ref Audio + Pose + Ref Image → Generated Audio-Video

Given a reference image, an audio clip, and a pose sequence video, our model generates a realistic video with matching appearance, voice, and pose-guided motion.

[Visual] A woman with long dark hair and bangs, wearing a dark blazer over a light blue top. She is in a room with a bulletin board in the background, which has several photos pinned to it. The lighting is dim, creating a somewhat serious atmosphere. The woman appears to be engaged in a conversation with someone whose back is to the camera. The person's hair is gray and they are wearing a dark suit. The woman's expression seems to be one of concern or seriousness.
[Speech] “Whispering is just not enough to induce the baby.”
[Visual] A woman with curly hair, some of which is gray, is shown. She is wearing a light gray blazer over a white top. She has a necklace on. The background is a solid brownish color. The woman appears to be speaking, her mouth is slightly open and her eyes are looking downwards.
[Speech] “it's really important to think about all" followed by a pause and then "the different ways that we can help".”
[Visual] A man in a professional setting. He has dark, wavy hair and is wearing round glasses. His attire includes a gray tweed suit jacket, a light blue dress shirt, and a patterned tie. The background features a window with a view of greenery outside, and there are some framed pictures or certificates on the wall. The lighting is soft and natural, coming from the window.
[Speech] “At the top of all his brilliance he had a genius.”
[Visual] A man with long hair, a beard, and glasses is standing in a dimly lit room. He is wearing a dark jacket over a gray shirt. Behind him, there is a sign with the word "piper" partially visible. The background appears to be a door or a wall with some indistinct shapes and colors.
[Speech] “You fucked around with son of Anton's.”
[Visual] A person with curly, light-colored hair. The lighting is dim, with a light source coming from behind, creating a silhouette effect. The person is wearing a dark-colored top. The background is dark and out of focus, with some indistinct shapes and colors.
[Speech] “Nobody. And you can take over this hall.”
[Visual] A man with a beard and mustache, wearing a dark vest over a white shirt. He appears to be in a dimly lit room, possibly a study or office, with a dark background. The man is looking slightly to the side, and his expression seems serious or contemplative. There are no other people or objects clearly visible in the frame.
[Speech] “If there's blood on the streets of Chinatown, it's because.”
[Visual] A man is shown in a dimly lit room. He has a receding hairline, a beard, and is wearing a dark suit with a white shirt and a patterned tie. There is a small American flag pin on his left lapel. The background appears to be a storage area with shelves and boxes. The lighting is low, creating a somewhat somber atmosphere.
[Speech] “I'll have your passport returned.”
[Visual] A man is sitting in a room, talking on a phone. He has short, graying hair and is wearing a dark-colored long-sleeve shirt. His left hand is holding the phone to his ear, and his right hand is gesturing as he speaks. The background shows a softly lit room with some furniture, including a lamp and a cabinet. The overall lighting is warm and dim.
[Speech] “because if claire doesn't get her halloween she turns into”

Qualitative Comparison

[Visual] An elderly woman with short, gray hair. She is wearing round glasses and a patterned coat with black buttons. She also has a chunky necklace. The background appears to be an indoor setting with some blurred elements, possibly furniture or decorations. The lighting is somewhat dim, giving the scene a subdued feel.
[Speech] “You have no idea.”
Compared methods: Ours (MMControl), AniPortrait, Hallo3, HunyuanCustom, SadTalker
[Visual] A woman with long dark hair, wearing a light-colored top. She has a necklace with a small pendant. The background appears to be indoors, with a neutral-colored wall and some darker elements that might be furniture or decor. The lighting is dim, creating a somewhat somber or serious atmosphere.
[Speech] “Maybe not but Jessica I have watched you my whole.”
Compared methods: Ours (MMControl), AniPortrait, Hallo3, HunyuanCustom, SadTalker
[Visual] A close-up of a man in a suit and tie. He has short, light brown hair and is looking to his right with a serious expression. The background is blurred, but it appears to be an indoor setting, possibly an office or a room with some furniture. The lighting is dim, creating a somewhat somber atmosphere.
[Speech] “By the way, I’m gonna find out what it is.”
Compared methods: Ours (MMControl), AniPortrait, Hallo3, HunyuanCustom, SadTalker
[Visual] A woman with blonde hair styled in loose waves. She is wearing large, round, blue-framed glasses and has a headset with a microphone attached to it. She is wearing a dark top and a necklace. The background appears to be indoors, with some blurred lights visible, suggesting a dimly lit environment. The woman seems to be in the middle of speaking or reacting to something.
[Speech] “and I did some research on Courtney too.”
Compared methods: Ours (MMControl), AniPortrait, Hallo3, HunyuanCustom, SadTalker
[Visual] A woman with long, dark brown hair and bangs is shown. She is wearing a dark blue velvet top. She has a ring on her right hand and is pointing with her index finger. The background appears to be a light-colored wall with a patterned design. There is a black object partially visible on the right side of the frame.
[Speech] “I have a friend. She’s an artist. I like her.”
Compared methods: Ours (MMControl), AniPortrait, Hallo3, HunyuanCustom, SadTalker