Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Zhejiang University, China
*Equal Contribution

Abstract

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because optimal keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
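
The two-system split can be pictured as a short inference loop. The sketch below is illustrative only: all names (GlobalPlan, reasoner, grounder, two_system_inference) are our own stand-ins rather than the released Omni-R1 API, and it assumes the Global Reasoning System returns keyframe indices plus a rewritten task, as described above.

from dataclasses import dataclass

@dataclass
class GlobalPlan:
    keyframe_ids: list[int]  # indices of the informative frames
    rewritten_task: str      # task reformulated for the detail system

def two_system_inference(frames_lowres, frames_highres, audio, task,
                         reasoner, grounder):
    # System 2 (Global Reasoning): dense temporal coverage over many
    # low-resolution frames, at low spatial cost.
    plan = reasoner(frames=frames_lowres, audio=audio, task=task)
    # System 1 (Detail Understanding): pixel-level grounding restricted
    # to the selected high-resolution keyframes.
    snippets = [frames_highres[i] for i in plan.keyframe_ids]
    return grounder(frames=snippets, task=plan.rewritten_task)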

Model Architecture

Figure 1: Overview of the Omni-R1 architecture featuring two-system collaboration for omnimodal reasoning with reinforcement learning.
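
The RL signal behind this architecture comes from GRPO: for each query, a group of rollouts (keyframe selections and task rewrites) is sampled, each rollout is scored by the hierarchical reward obtained through online collaboration with the Detail Understanding System, and advantages are computed relative to the group rather than by a learned critic. A minimal sketch of that group-relative normalization, with hypothetical reward values:

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO scores each rollout against its group: subtract the group
    # mean and normalize by the group standard deviation, so no
    # learned value critic is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled rollouts scored by a (hypothetical) hierarchical reward,
# e.g. downstream segmentation quality returned by the Detail
# Understanding System:
rewards = torch.tensor([0.82, 0.40, 0.65, 0.91])
advantages = grpo_advantages(rewards)  # above-average rollouts get positive advantage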

VOS Visualization Demo

Omni-R1 demonstrates superior performance across diverse temporal reasoning tasks.

Visual Reasoning Examples

Simple Scene Mask Quality

Omni-R1 preserves mask quality compared with other methods that incorporate SAM-decoder fine-tuning.

Video Context Understanding

Our method comprehends the video context and segments the bottles predicted to be picked up after the video ends.

Detail Reasoning

Omni-R1 is capable of fine-grained understanding in System 1, yet delegates decision-making to System 2 by preserving the recaptioning process.

Abstract Reasoning

Omni-R1 retains excellent general video understanding and reasoning capabilities, along with broad knowledge.

AVS the cello

Omni-R1 demonstrates exceptional audio-visual segmentation capabilities, accurately identifying the cello heard in the audio.

AVS the violin

Omni-R1 demonstrates exceptional audio-visual segmentation capabilities, accurately identifying the violin playing the fastest rhythm in the audio.


Performance Evaluation

General Benchmark Results

Figure 2: Performance comparison on general reasoning benchmarks. Omni-R1 achieves state-of-the-art results across multiple evaluation metrics.

Audio-Visual-Hallucination Benchmark

Figure 3: Results on AVHBench demonstrating superior multimodal integration capabilities.

RefAVS Benchmark

Figure 4: Performance on RefAVS benchmark showing Omni-R1's excellence in referring audio-visual segmentation tasks.

REVOS Benchmark

Figure 5: Results on the REVOS benchmark demonstrating superior performance in reasoning video object segmentation.

BibTeX

@article{zhong2025omnir1reinforcementlearningomnimodal,
  title={Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration},
  author={Hao Zhong and Muzhi Zhu and Zongze Du and Zheng Huang and Canyu Zhao and Mingyu Liu and Wen Wang and Hao Chen and Chunhua Shen},
  year={2025},
  eprint={2505.20256},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.20256},
}