¹Ant Group
²Zhejiang University
³Central South University
⁴HKUST(GZ)
* Equal Contribution † Corresponding Author
Current Vision-Language Models (VLMs) struggle with hours-long videos because processing full-length visual sequences causes prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and turns long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams a video to construct a Hierarchical Graph Memory: a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model performs agentic, tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges in an Observation-Reason-Action loop. Experiments show that MemDreamer achieves state-of-the-art results on four mainstream benchmarks, narrowing the gap with human experts on LVBench to only 3.7 points. It restricts the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning benchmarks and on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
Separates perception from reasoning via persistent memory and agentic retrieval, bypassing long-context bottlenecks. Reveals for the first time a positive correlation between reasoning capacity and long-video performance.
Achieves 90.7 on LVBench with a Gemini-3.1-Pro backbone, 3.7 points below human experts (94.4) and 12.5 points above the same backbone run end-to-end. Gains over identical-backbone end-to-end baselines hold across all four benchmarks.
Three-tier coarse-to-fine topology (Video Root → Super Events → Macro Events) with entity-event subgraphs capturing spatiotemporal and causal dependencies.
Multi-dimensional tool bank (Navigation + Search + Graph Traversal) drives an Observation-Reason-Action loop, transforming static understanding into active multi-step exploration.
| Method | Reason Model | LVBench Avg. | LVBench ER | LVBench EU | LVBench Rea | LVBench KIR | LVBench Sum | LVBench TG | LongVideoBench Avg. | LongVideoBench Long | Video-MME Long (w/o sub) | EgoSchema Val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| End-to-End VLMs (Proprietary) | ||||||||||||
| GPT-4o | GPT-4o | 48.9 | 48.9 | 49.5 | 50.3 | 48.1 | 50.0 | 40.9 | 66.7 | 60.9 | 65.3 | 70.4 |
| Gemini-2.0-Flash | Gemini-2.0-Flash | 48.6 | 47.4 | 48.5 | 44.4 | 56.8 | 41.4 | 39.3 | — | 45.7 | 63.0 | 71.2 |
| Gemini-2.5-Pro | Gemini-2.5-Pro | 72.0 | 71.5 | 71.1 | 67.7 | 80.0 | 63.5 | 69.1 | 71.0 | 68.6 | 75.9 | 72.8 |
| Gemini-3.1-Pro | Gemini-3.1-Pro | 78.2 | 78.1 | 76.7 | 74.1 | 86.3 | 70.7 | 81.4 | 78.6 | 77.0 | 80.3 | 76.4 |
| OpenAI-o3 | OpenAI-o3 | 57.1 | 57.6 | 56.4 | 50.8 | 62.9 | 67.2 | 46.8 | 66.7 | 60.6 | 64.7 | 63.2 |
| Seed1.5VL | Seed1.5VL-Thinking | 64.6 | 65.4 | 63.4 | 68.0 | 53.6 | 63.7 | 46.6 | — | 74.4 | — | — |
| End-to-End VLMs (Open-Source) | ||||||||||||
| Qwen3-VL | Qwen3-VL-235B-A22B-Thinking | 63.6 | 63.7 | 62.6 | 63.1 | 62.6 | 65.5 | 59.6 | 71.4 | 67.1 | 71.2 | 75.9 |
| Qwen3.5 | Qwen3.5-35A3B | 71.3 | 72.2 | 69.4 | 74.4 | 68.9 | 67.5 | 66.2 | — | — | 62.6 | — |
| GLM-4.6V | GLM-4.6V-106B-A12B | 59.5 | 58.8 | 59.2 | 69.7 | 57.8 | 61.8 | 53.9 | 67.6 | 58.7 | 66.0 | 68.8 |
| GLM-4.5V | GLM-4.5V-106B-A12B | 53.4 | 55.3 | 53.1 | 57.5 | 55.3 | 39.6 | 49.9 | 66.4 | 54.1 | 64.7 | 68.4 |
| InternVL2.5 | InternVL2.5-78B | 43.6 | 43.8 | 42.0 | 51.0 | 37.9 | 42.1 | 36.8 | — | — | 62.6 | — |
| InternVL3.5 | InternVL3.5-30BA3B | 44.4 | 42.7 | 44.1 | 48.3 | 46.4 | 36.2 | 40.9 | 62.9 | 52.7 | 64.1 | 86.8 |
| AdaRETAKE | AdaRETAKE | 53.3 | 53.0 | 50.7 | 54.7 | 62.2 | 37.9 | 45.5 | 67.0 | — | 65.0 | — |
| Agentic Video Understanding Methods | ||||||||||||
| VideoTree | VideoTree | 28.8 | 30.3 | 25.1 | 31.9 | 26.5 | 25.5 | 27.7 | — | — | — | 67.0 |
| VideoAgent | VideoAgent | 29.3 | 28.0 | 30.3 | 28.0 | 28.0 | 36.4 | 29.3 | — | — | — | 63.2 |
| VCA | VCA | 41.3 | 43.7 | 40.7 | 46.2 | 37.8 | 27.3 | 38.0 | — | — | — | 73.6 |
| MR. Video | MR. Video | 60.8 | 59.8 | 57.4 | 57.7 | 71.4 | 50.0 | 58.8 | — | 61.6 | 61.8 | 73.0 |
| M3-Agent | M3-Agent | 49.3 | — | — | — | — | — | — | — | — | 61.8 | — |
| WorldMM-GPT | GPT-5 | 61.9 | — | — | — | — | — | — | — | — | 76.6 | — |
| MM-Mem | MM-Mem | — | — | — | — | — | — | — | — | — | 66.1 | — |
| DVD | OpenAI-o3 | 74.2 | 73.4 | 73.3 | 70.7 | 80.4 | 74.1 | 72.3 | 71.6 | 68.6 | 67.3 | 76.6 |
| VideoARM | OpenAI-o3 | 79.7 | — | — | — | — | — | — | 78.0 | 76.4 | 81.2 | 76.2 |
| VideoSeek | GPT-5 | 68.4 | — | — | — | — | — | — | — | 73.5 | 60.9 | — |
| MemDreamer (Ours, Plug-and-Play Agentic Framework) | ||||||||||||
| MemDreamer | Qwen3-VL-235B-A22B-Thinking | 84.8 (+21.2) | 84.6 | 85.2 | 80.6 | 85.6 | 84.5 | 87.7 | 86.3 (+14.9) | 83.2 (+16.1) | 86.2 (+15.0) | 87.4 (+11.5) |
| MemDreamer | Gemini-2.5-Pro | 80.7 (+8.7) | 77.1 | 83.6 | 81.1 | 82.5 | 91.4 | 86.4 | 78.6 (+7.6) | 78.7 (+10.1) | 85.0 (+9.1) | 88.2 (+15.4) |
| MemDreamer | Gemini-3.1-Pro | 90.7 (+12.5) | 90.1 | 90.6 | 89.6 | 91.4 | 89.7 | 91.8 | 92.9 (+14.3) | 91.0 (+14.0) | 92.1 (+11.8) | 87.8 (+11.4) |
| Human Expert | Human | 94.4 | — | — | — | — | — | |||||
Main results on four long-video understanding benchmarks. Values in parentheses indicate the improvement over the strongest end-to-end baseline using the same backbone. LVBench sub-dimensions: ER (Event Recognition), EU (Event Understanding), Rea (Reasoning), KIR (Key Information Retrieval), Sum (Summarization), TG (Temporal Grounding).
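The parenthesized gains can be reproduced directly from the table. The short check below copies the LVBench average scores from the rows above and recomputes each same-backbone gain and the remaining gap to the human expert score; the variable names are illustrative only.

```python
# LVBench average scores copied from the results table.
end_to_end = {
    "Qwen3-VL-235B-A22B-Thinking": 63.6,
    "Gemini-2.5-Pro": 72.0,
    "Gemini-3.1-Pro": 78.2,
}
memdreamer = {
    "Qwen3-VL-235B-A22B-Thinking": 84.8,
    "Gemini-2.5-Pro": 80.7,
    "Gemini-3.1-Pro": 90.7,
}
human_expert = 94.4

# Gain of MemDreamer over the same backbone run end-to-end.
gains = {m: round(memdreamer[m] - end_to_end[m], 1) for m in memdreamer}

# Gap between the best MemDreamer configuration and the human expert score.
gap_to_human = round(human_expert - max(memdreamer.values()), 1)

for model, gain in gains.items():
    print(f"{model}: +{gain}")
print(f"gap to human expert: {gap_to_human}")
```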
Explore how MemDreamer constructs and navigates hierarchical graph memory for hours-long videos.
Coming Soon

@misc{chen2026memdreamerdecouplingperceptionreasoning,
title={MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism},
author={Cong Chen and Guo Gan and Kaixiang Ji and ChaoYang Zhang and Zhen Yang and Guangming Yao and Hao Chen and Jingdong Chen and Yi Yuan and Chunhua Shen},
year={2026},
eprint={2606.07512},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.07512},
}