Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on
omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding
calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets.
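To make the division of labor concrete, the following minimal sketch shows how such a two-system pipeline could be wired together. The function names, the toy intensity-based frame scoring, and the dummy masks are illustrative assumptions only; they do not reflect Omni-R1's actual interface or models.

```python
"""Sketch of a two-system pipeline: a global reasoner picks keyframes and rewrites
the task over cheap low-resolution frames, then a detail model grounds the rewritten
task on only those frames at full resolution. All names and logic are placeholders."""
from dataclasses import dataclass
from typing import List

@dataclass
class GlobalPlan:
    keyframe_ids: List[int]   # frames judged informative by the global reasoner
    rewritten_task: str       # task reformulated for the detail model

def select_keyframes_and_rewrite(lowres_frames: List[list], task: str, k: int = 4) -> GlobalPlan:
    # Placeholder "reasoning": score each low-res frame by mean intensity and keep the
    # top-k. A real system would reason jointly over video frames and audio.
    scores = [sum(f) / max(len(f), 1) for f in lowres_frames]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return GlobalPlan(keyframe_ids=sorted(top),
                      rewritten_task=f"Segment the referred object: {task}")

def ground_pixels(highres_frames: List[list], task: str) -> List[list]:
    # Placeholder grounding: return a dummy binary mask per selected frame.
    return [[1 if v > 0.5 else 0 for v in frame] for frame in highres_frames]

def two_system_inference(lowres: List[list], highres: List[list], task: str) -> List[list]:
    plan = select_keyframes_and_rewrite(lowres, task)     # dense temporal coverage, low spatial cost
    selected = [highres[i] for i in plan.keyframe_ids]    # only chosen frames at full resolution
    return ground_pixels(selected, plan.rewritten_task)   # pixel-level grounding on the selection

if __name__ == "__main__":
    lowres = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.6], [0.7, 0.3], [0.2, 0.1]]
    highres = [[0.1, 0.2, 0.15, 0.25], [0.9, 0.8, 0.95, 0.85], [0.4, 0.6, 0.5, 0.55],
               [0.7, 0.3, 0.65, 0.35], [0.2, 0.1, 0.25, 0.15]]
    print(two_system_inference(lowres, highres, "the speaker on the left"))
```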
Because optimal keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits.
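For context, the underlying optimizer is the standard Group Relative Policy Optimization recipe, sketched here in its simplified sequence-level form; the specific hierarchical reward $r_i$ returned through collaboration with the Detail Understanding System is defined in the paper and only referenced abstractly. For a group of $G$ rollouts $o_1,\dots,o_G$ sampled from the old policy for a query $q$, each reward is normalized against its group,
$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})},
$$
and the policy is updated with the clipped, KL-regularized objective
$$
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(\rho_i\hat{A}_i,\ \operatorname{clip}\big(\rho_i,\,1-\epsilon,\,1+\epsilon\big)\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.
$$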
Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination.
Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universally capable foundation models.