
OmniJigsaw
Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Yiduo Jia1* Muzhi Zhu1* Hao Zhong1 Mingyu Liu1 Yuling Xi1 Hao Chen1† Bin Qin2 Yongjie Yang2 Zhenbo Luo2 Chunhua Shen1†
1 Zhejiang University  ·  2 Xiaomi Inc.
* Equal contribution  ·  † Corresponding authors

ABSTRACT

To extend reinforcement-learning post-training to omni-modal models, jointly strengthening video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task.

Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration (JMI), Sample-level Modality Selection (SMS), and Clip-level Modality Masking (CMM).

Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data.

Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning.

HIGHLIGHTS

🧩 Self-Supervised Proxy Task

Pioneers jigsaw-based RL post-training in the omni-modal domain using temporal reordering of shuffled audio-visual clips—requiring zero manual annotation.

🎯 Modality Orchestration

Three strategies (JMI, SMS, CMM) that govern cross-modal information flow, exposing the bi-modal shortcut phenomenon and compelling deep multi-modal reasoning.

🔧 Scalable Data Pipeline

A two-stage coarse-to-fine filtering pipeline (signal-based + semantic CoT screening) that transforms massive unannotated data into high-quality training puzzles.

📊 Gains Across 15 Benchmarks

CMM achieves +4.38 on MLVU-Test, +2.50 on MMAR, and +1.70 on OmniVideoBench over a strong Qwen3-Omni baseline.

OMNIJIGSAW FRAMEWORK

A self-supervised RL post-training framework that enhances omni-modal reasoning through modality-orchestrated temporal reordering.

Figure 1. OmniJigsaw framework overview. JMI (top) provides full audio-visual access; CMM (center) enforces an information bottleneck via adaptive clip-wise masking; SMS (bottom) selects the primary informative modality.
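Concretely, a training puzzle is just a shuffled sequence of synchronized clips plus a verifiable answer. Below is a minimal sketch of how one sample and its reward check could be constructed; all names here are hypothetical, and the paper's actual reward additionally uses the accuracy-dependent discount factor discussed under Ablations & Insights.

```python
import random

def build_jigsaw_sample(clips, seed=None):
    """Shuffle chronologically ordered (video, audio) clip pairs into a puzzle.

    Returns the shuffled clips and the target answer: for each chronological
    position k, the index of that clip within the shuffled sequence.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))   # order[j] = original index of shuffled clip j
    rng.shuffle(order)
    shuffled = [clips[i] for i in order]
    target = [0] * len(order)         # inverse permutation of `order`
    for shuffled_pos, original_idx in enumerate(order):
        target[original_idx] = shuffled_pos
    return shuffled, target

def exact_match_reward(predicted, target):
    """Verifiable reward: 1.0 iff the full chronological order is recovered."""
    return 1.0 if list(predicted) == list(target) else 0.0
```

Reading out shuffled[target[0]], shuffled[target[1]], and so on restores the chronological stream, so the label is free and requires no manual annotation.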

Strategy 1 · Baseline

Joint Modality Integration (JMI)

Retains complete synchronized visual and acoustic information for all clips. Acts as an identity mapping that preserves full omni-modal integrity.

Strategy 2 · Intermediate

Sample-level Modality Selection (SMS)

Deploys the model as a global dominance analyzer to identify the primary modality per sample, mitigating interference from less informative streams.

Strategy 3 · Advanced

Clip-level Modality Masking (CMM)

Evaluates semantic density per clip and selectively masks the less salient modality, creating a cross-modal information bottleneck that forces deep integration.

Best Performing
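The three strategies differ only in how each clip's streams are gated before the shuffled puzzle is posed. A minimal sketch, assuming per-clip salience scores for CMM and a per-sample dominant-modality prediction for SMS (the scoring mechanism itself is not reproduced from the paper):

```python
from typing import Any, List, Optional, Tuple

Clip = Tuple[Optional[Any], Optional[Any]]  # (video, audio); None means masked

def jmi(clips: List[Clip]) -> List[Clip]:
    """Joint Modality Integration: identity mapping, both streams kept everywhere."""
    return list(clips)

def sms(clips: List[Clip], primary: str) -> List[Clip]:
    """Sample-level Modality Selection: keep only the globally dominant modality,
    predicted once for the whole sample."""
    return [(v, None) if primary == "video" else (None, a) for v, a in clips]

def cmm(clips: List[Clip],
        video_salience: List[float],
        audio_salience: List[float]) -> List[Clip]:
    """Clip-level Modality Masking: per clip, keep the more semantically dense
    stream and mask the other, forming a cross-modal information bottleneck."""
    return [
        (v, None) if sv >= sa else (None, a)
        for (v, a), sv, sa in zip(clips, video_salience, audio_salience)
    ]
```

Because CMM can keep video for one clip and audio for the next, recovering the chronological order forces the model to stitch evidence across modalities rather than lean on whichever stream is globally dominant.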

DATA FILTERING PIPELINE

Figure 2. Two-stage data filtering pipeline: signal-based heuristic filtering ensures omni-modal integrity, followed by semantic-based CoT screening for narrative logic and state transitions.
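A rough sketch of the two stages, with illustrative thresholds and a hypothetical `judge` callable standing in for the CoT screening model (none of these names or values come from the paper):

```python
def signal_filter(sample) -> bool:
    """Stage 1 (coarse): cheap signal-level checks for omni-modal integrity."""
    if sample.video is None or sample.audio is None:
        return False                 # both streams must exist
    if sample.duration_s < 8.0:      # long enough to cut into multiple clips
        return False
    if sample.audio_rms < 0.01:      # reject near-silent audio tracks
        return False
    return True

def semantic_cot_filter(sample, judge) -> bool:
    """Stage 2 (fine): an LLM judge reasons (CoT) over clip captions and keeps
    samples with coherent narrative logic and identifiable state transitions."""
    verdict = judge(
        "Given these clip descriptions in order, do they form a coherent "
        "narrative with clear state changes between clips? Answer yes/no.\n"
        + "\n".join(sample.clip_captions)
    )
    return verdict.strip().lower().startswith("yes")

def filter_pipeline(samples, judge):
    coarse = [s for s in samples if signal_filter(s)]
    return [s for s in coarse if semantic_cot_filter(s, judge)]
```

The ordering matters for cost: the cheap signal checks discard most unusable data before any model-based screening runs, which is what lets the pipeline scale to massive unannotated corpora.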

QUANTITATIVE RESULTS

Comprehensive evaluation across video, audio, and omni-modal collaborative reasoning benchmarks.

Method            | AoTBench    | TUNA-Bench  | TempCompass | Video-TT    | Video-Holmes | MLVU-Test   | Video-MME   | MLVU
Omni-R1           | 52.09       | 53.84       | 63.10       | 36.80       | 40.72        | 52.59       | 63.20       | 65.69
HumanOmniV2       | 48.58       | 49.16       | 63.86       | 40.30       | 42.90        | 49.00       | 65.60       | 66.70
Qwen3-Omni-30B    | 64.88       | 62.57       | 70.63       | 44.30       | 50.84        | 57.97       | 72.10       | 70.01
VideoJigsaw       | 67.45 ↑2.57 | 63.41 ↑0.84 | 72.28 ↑1.65 | 44.90 ↑0.60 | 51.99 ↑1.15  | 60.16 ↑2.19 | 72.90 ↑0.80 | 71.90 ↑1.89
OmniJigsaw (JMI)  | 66.83 ↑1.95 | 62.78 ↑0.21 | 71.08 ↑0.45 | 44.70 ↑0.40 | 50.24        | 58.76 ↑0.79 | 72.90 ↑0.80 | 71.39 ↑1.38
OmniJigsaw (SMS)  | 68.12 ↑3.24 | 65.15 ↑2.58 | 72.03 ↑1.40 | 45.80 ↑1.50 | 52.26 ↑1.42  | 61.75 ↑3.78 | 72.90 ↑0.80 | 72.63 ↑2.62
OmniJigsaw (CMM)  | 68.90 ↑4.02 | 65.29 ↑2.72 | 72.03 ↑1.40 | 46.10 ↑1.80 | 52.53 ↑1.69  | 62.35 ↑4.38 | 73.10 ↑1.00 | 72.26 ↑2.25

↑ denotes the gain over the Qwen3-Omni-30B baseline.

ABLATIONS & INSIGHTS

01

Bi-Modal Shortcut Phenomenon. Under JMI, redundant audio-visual cues allow the model to rely on the dominant modality alone, bypassing deep cross-modal reasoning and underperforming uni-modal baselines.

02

Clip-level > Sample-level. CMM consistently outperforms SMS across fine-grained sub-capabilities: clip-wise modality selection tracks the dynamic flow of audio-visual information, maximizing local information entropy.

03

Data Quality is Critical. Training without the data filtering pipeline leads to significant degradation (−3.99 on MLVU-Test), confirming that puzzle solvability depends on identifiable state evolution between clips.

04

Discount Factor as Catalyst. The accuracy-dependent discount factor suppresses sub-optimal solutions, maintaining a persistent upward training trajectory and preventing premature convergence.

Ablation Study on Data Quality & Reward Discount Factor

Results under the CMM strategy. Each ↓ value is the drop relative to the full OmniJigsaw (CMM) row when the data filtering pipeline (w/o Filtering) or the reward discount factor (w/o DF) is removed.

Video
Method            | AoT         | TUNA        | TempC       | V-TT        | V-Holmes    | MLVU-T      | V-MME       | MLVU
OmniJigsaw (CMM)  | 66.83       | 66.20       | 72.34       | 46.50       | 48.29       | 62.75       | 69.30       | 73.46
w/o Filtering     | 64.94 ↓1.89 | 65.29 ↓0.91 | 71.14 ↓1.20 | 45.40 ↓1.10 | 47.41 ↓0.88 | 58.76 ↓3.99 | 68.60 ↓0.70 | 72.59 ↓0.87
w/o DF            | 66.00 ↓0.83 | 64.11 ↓2.09 | 71.27 ↓1.07 | 44.70 ↓1.80 | 47.96 ↓0.33 | 61.95 ↓0.80 | 68.60 ↓0.70 | 72.36 ↓1.10

Audio
Method            | MMAU-P      | MMAU-T      | MMSU        | MMAR
OmniJigsaw (CMM)  | 58.59       | 76.30       | 70.70       | 71.00
w/o Filtering     | 57.67 ↓0.92 | 76.00 ↓0.30 | 70.40 ↓0.30 | 68.90 ↓2.10
w/o DF            | 57.81 ↓0.78 | 75.90 ↓0.40 | 70.48 ↓0.22 | 69.30 ↓1.70

Omni-Modal
Method            | D-Omni      | Intent      | OVB
OmniJigsaw (CMM)  | 71.09       | 68.89       | 40.50
w/o Filtering     | 70.01 ↓1.08 | 66.77 ↓2.12 | 40.10 ↓0.40
w/o DF            | 69.76 ↓1.33 | 67.66 ↓1.23 | 39.50 ↓1.00
Figure 3. Performance comparison of JMI, CMM, and uni-modal Jigsaw.

Figure 4. Sub-capability comparison between CMM and SMS.

Figure 5. Task reward dynamics during training.

Figure 6. Optimization dynamics with and without the discount factor.

The accuracy-dependent discount factor amplifies the value gap between sub-optimal and optimal solutions, driving the model to explore more aggressively and avoid premature convergence at local plateaus.
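As one plausible instantiation (the paper's exact reward formula is not reproduced here), positional accuracy can serve as the base score, with every imperfect ordering additionally scaled down by a discount `gamma`:

```python
def jigsaw_reward(predicted, target, gamma=0.5):
    """Accuracy-dependent discounted reward (illustrative form, not the paper's
    exact definition). Partially correct orderings earn credit proportional to
    the number of correctly placed clips, but any imperfect solution is further
    scaled by `gamma`, widening the value gap between sub-optimal and optimal
    answers."""
    correct = sum(p == t for p, t in zip(predicted, target))
    acc = correct / len(target)
    return acc if acc == 1.0 else gamma * acc

# A fully correct order earns 1.0, while a half-correct order earns
# 0.5 * 0.5 = 0.25 rather than 0.5, so near-misses stop looking "good enough".
assert jigsaw_reward([0, 1, 2, 3], [0, 1, 2, 3]) == 1.0
assert jigsaw_reward([1, 0, 2, 3], [0, 1, 2, 3]) == 0.25
```

Under plain partial credit, a policy can plateau on mostly-correct orderings; the discount keeps a large residual gap to the perfect solution, which is the exploration pressure the ablation attributes to the discount factor.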

EXAMPLES

Figure 7. CoT reasoning comparison at training step 800. CMM (left) compels the model to jointly analyze visual and auditory cues by masking less salient modalities, while JMI (right) exhibits a bi-modal shortcut by "solely relying on linguistic cues."

Figure 8. Sub-Scene Captioning: Baseline vs OmniJigsaw (CMM).

Figure 9. Video Summarization: Baseline vs OmniJigsaw (CMM).

Case 1: Indistinct State Changes — rejected by semantic screening.

Case 2: Disjointed Narrative — rejected by semantic screening.