
ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Zhejiang University, China
Ant Group, China

Abstract

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there has been little exploration of how MLLMs can be equipped with, or learn, active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception capabilities compared to Qwen2.5-VL with chain-of-thought (CoT) prompting. For example, in zero-shot reasoning on the V* benchmark, ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception with MLLMs.
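To make the zoom-in flavor of active perception concrete, here is a minimal sketch of such a loop wired around a generic MLLM client. This is an illustration, not the ACTIVE-O3 implementation: the mllm object and its propose_region and answer methods are hypothetical names, and in ACTIVE-O3 the region-selection behavior is learned with GRPO rather than hand-coded.

    # Minimal sketch of a zoom-in active-perception loop. The `mllm` client
    # and its `propose_region` / `answer` methods are hypothetical names,
    # not the ACTIVE-O3 API.
    from PIL import Image

    def active_perception_answer(mllm, image_path, question, max_steps=3):
        """Iteratively zoom into model-proposed regions, then answer."""
        image = Image.open(image_path).convert("RGB")
        for _ in range(max_steps):
            # The policy decides where to look next: a pixel-space box
            # (x0, y0, x1, y1), or None if the current view already suffices.
            box = mllm.propose_region(image, question)
            if box is None:
                break
            # Crop and upsample the region so small details (e.g. a digit on a
            # distant traffic light) occupy more of the model's visual tokens.
            image = image.crop(box).resize((image.width, image.height),
                                           Image.BICUBIC)
        return mllm.answer(image, question)

Under a fixed computational budget, the quality of each proposed region dominates the outcome, which is exactly what the GRPO-based training in ACTIVE-O3 optimizes.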

Model Architecture


Figure 1: Overview of the ACTIVE-O3 architecture and working pipeline.
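Because the pipeline above is trained with GRPO, it is worth recalling the group-relative advantage that GRPO uses in place of a learned value baseline: each of the G responses sampled for the same prompt is scored, and its reward is normalized by the group's mean and standard deviation. The snippet below is a generic sketch of that computation; the reward values are made up, and the paper's actual reward design for region quality is not reproduced here.

    import torch

    def grpo_advantages(rewards, eps=1e-6):
        """Group-relative advantages used by GRPO:
        A_i = (r_i - mean(r)) / (std(r) + eps),
        computed over a group of rewards sampled from the same prompt."""
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: four sampled zoom proposals for one query, scored by some
    # rule-based reward (values are illustrative only).
    advantages = grpo_advantages(torch.tensor([0.9, 0.2, 0.5, 0.4]))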

Visual Results Demo

ACTIVE-O3 demonstrates superior performance across diverse active perception tasks.

Zero-shot Reasoning Examples

Active Perception Example 1

ACTIVE-O3 demonstrates effective active perception strategies, zooming into task-relevant regions to improve performance.

Active Perception Example 2

Our method shows intelligent region selection and efficient search.

Active Perception Example 3

ACTIVE-O3 demonstrates robust performance across different scenarios.

Active Perception Example 4

Our approach shows strong adaptation and generalization in this zero-shot setting.


Comprehensive Visual Analysis

Three Visual Comparisons

Detailed comparison results showcasing ACTIVE-O3's performance across multiple evaluation scenarios and benchmarks.


Performance Evaluation

LVIS Benchmark

Figure: Results on the LVIS benchmark.

SODA Benchmark

Figure: Results on the SODA benchmark.

Fine-Grained Interactive Segmentation

Figure: Fine-grained interactive segmentation results.

Citation

@article{zhu2025activeo3empoweringmultimodallarge,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Muzhi Zhu and Hao Zhong and Canyu Zhao and Zongze Du and Zheng Huang and Mingyu Liu and Hao Chen and Cheng Zou and Jingdong Chen and Ming Yang and Chunhua Shen},
  year={2025},
  eprint={2505.21457},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.21457},
}