MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Agentic Hierarchical Graph Memory

Cong Chen1,2,*, Guo Gan2,*, Kaixiang Ji1,*, Zhaoyang Zhang1,3, Zhen Yang4, Guangming Yao1, Hao Chen2, Jingdong Chen1, Yi Yuan1, Chunhua Shen1,2,†

1Ant Group    2Zhejiang University    3Central South University    4HKUST(GZ)
* Equal Contribution   † Corresponding Author

MemDreamer key results: correlation between reasoning ability and long video understanding, and context window comparison
(Left) Decoupling perception unlocks reasoning ability for long video understanding. End-to-end performance is insensitive to reasoning ability, while MemDreamer exhibits a strong linear trend (R=0.897). (Right) MemDreamer only needs a 5-6K reasoning window, 41-124x smaller than end-to-end input.
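As a quick arithmetic check on the caption, the sketch below converts the stated 41-124x reduction into the share of the end-to-end input that the reasoning window occupies. Only the two factors come from the caption; the helper name `remaining_fraction` is illustrative.

```python
# Reduction factors quoted in the caption (end-to-end input size / reasoning window size).
REDUCTION_FACTORS = (41, 124)


def remaining_fraction(factor: float) -> float:
    """Share of the end-to-end input occupied by a window `factor` times smaller."""
    return 1.0 / factor


fractions = {factor: remaining_fraction(factor) for factor in REDUCTION_FACTORS}

for factor, share in fractions.items():
    print(f"{factor}x smaller -> {share:.1%} of the end-to-end input")
```

The two factors correspond to roughly 2.4% and 0.8% of the end-to-end input, which brackets the "merely 2%" figure quoted in the abstract.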

Abstract

TL;DR: MemDreamer achieves SOTA on 4 long-video benchmarks, narrows the gap with human experts to only 3.7 points, and uses merely 2% of the context window while delivering a 12.5-point absolute gain over end-to-end baselines.

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory: a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logical reasoning benchmarks and on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Key Contributions

🔄

Decoupled Paradigm

Separates perception from reasoning via persistent memory and agentic retrieval, bypassing long-context bottlenecks. Reveals for the first time a positive correlation between reasoning capacity and long-video performance.

🏆

SOTA Performance

Achieves 90.7 on LVBench (only 3.7 points below human experts), a 12.5-point absolute gain over the end-to-end baseline with the identical backbone, with gains over same-backbone baselines on all 4 benchmarks.

📊

Hierarchical Graph Memory

Three-tier coarse-to-fine topology (Video Root → Super Events → Macro Events) with entity-event subgraphs capturing spatiotemporal and causal dependencies.

🤖

Agentic Tool-Augmented Retrieval

Multi-dimensional tool bank (Navigation + Search + Graph Traversal) drives an Observation-Reason-Action loop, transforming static understanding into active multi-step exploration.

Hierarchical Graph Memory

MemDreamer Hierarchical Graph Memory visualization
MemDreamer organizes a long video as a three-tier coarse-to-fine hierarchy with cross-tier topological edges. At the leaf tier, each segment is a subgraph of entities and micro-events connected via spatial-attribute, subject-object, and temporal-causal edges.
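One way to picture this layout is as nested records, sketched below under assumptions. The three tiers and the three edge types come from the description above; every class name, field name, and the toy segment are hypothetical and are not taken from the MemDreamer implementation.

```python
from dataclasses import dataclass, field

# Edge types named in the text. Everything else in this sketch is a hypothetical layout.
EDGE_TYPES = {"spatial-attribute", "subject-object", "temporal-causal"}


@dataclass
class Edge:
    source: str
    target: str
    kind: str  # must be one of EDGE_TYPES

    def __post_init__(self):
        if self.kind not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.kind}")


@dataclass
class MacroEvent:
    """Leaf tier: one video segment plus its entity / micro-event subgraph."""
    summary: str
    start_s: float
    end_s: float
    nodes: dict = field(default_factory=dict)   # node id -> description
    edges: list = field(default_factory=list)   # list of Edge


@dataclass
class SuperEvent:
    """Middle tier: groups macro events."""
    summary: str
    macro_events: list = field(default_factory=list)


@dataclass
class VideoRoot:
    """Top tier: a single node for the whole video."""
    summary: str
    super_events: list = field(default_factory=list)

    def leaf_count(self) -> int:
        return sum(len(s.macro_events) for s in self.super_events)


# Toy data, invented purely to exercise the structure.
segment = MacroEvent(
    summary="toy segment",
    start_s=0.0,
    end_s=30.0,
    nodes={"e1": "entity: cook", "m1": "micro-event: pour", "m2": "micro-event: boil"},
    edges=[Edge("e1", "m1", "subject-object"), Edge("m1", "m2", "temporal-causal")],
)
video = VideoRoot("toy video", [SuperEvent("toy super event", [segment])])
print(video.leaf_count())
```

Validating the edge type at construction keeps the leaf subgraphs restricted to the three relation kinds the text names.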

Method Overview

MemDreamer architectural workflow
Architectural workflow of MemDreamer. (Left) Memory construction comprises three phases: streaming adaptive segmentation, downward subgraph extraction, and upward hierarchical aggregation. (Right) Three tool categories—Hierarchical Navigation, Precise Search, and Local Graph Traversal—support an agentic retrieval mechanism driven by an Observation-Reason-Action loop.
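The retrieval side of this workflow can be sketched as a small loop over three tools. This is a minimal illustration under assumptions: the toy memory, the tool signatures, and `scripted_policy` (a fixed script standing in for the reasoning model) are all hypothetical; only the three tool categories and the Observation-Reason-Action cycle come from the text.

```python
from typing import Callable, Optional

# Toy memory: node id -> record. Layout and contents are hypothetical.
MEMORY = {
    "root": {"text": "cooking video", "children": ["s1"], "edges": []},
    "s1": {"text": "preparing ingredients", "children": ["m1"], "edges": []},
    "m1": {"text": "cook pours water into pot", "children": [],
           "edges": [("temporal-causal", "m2")]},
    "m2": {"text": "water boils", "children": [], "edges": []},
}


def navigate(node_id: str) -> list:
    """Hierarchical navigation: list the children of a node."""
    return MEMORY[node_id]["children"]


def search(keyword: str) -> list:
    """Precise search: ids of nodes whose text contains the keyword."""
    return [nid for nid, rec in MEMORY.items() if keyword in rec["text"]]


def traverse(node_id: str) -> list:
    """Local graph traversal: follow the outgoing edges of a node."""
    return [dst for _kind, dst in MEMORY[node_id]["edges"]]


TOOLS = {"navigate": navigate, "search": search, "traverse": traverse}


def scripted_policy(step: int, observation: list) -> Optional[tuple]:
    """Stand-in for the reasoning model: returns (tool, argument), or None to stop."""
    if step == 0:
        return ("navigate", "root")
    if step == 1:
        return ("search", "pours")
    if step == 2 and observation:
        return ("traverse", observation[0])
    return None


def run_loop(policy: Callable, max_steps: int = 5) -> list:
    trace, observation = [], []
    for step in range(max_steps):
        action = policy(step, observation)   # Reason over the latest observation
        if action is None:
            break
        tool, arg = action
        observation = TOOLS[tool](arg)       # Action yields the next Observation
        trace.append((tool, arg, observation))
    return trace


trace = run_loop(scripted_policy)
for tool, arg, result in trace:
    print(f"{tool}({arg!r}) -> {result}")
```

In this sketch each step sees only the previous tool result, which illustrates why the reasoning context can stay small; how MemDreamer actually accumulates observations across rounds is not specified here.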

Main Results

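| Method | Reason Model | LVBench (Avg. / ER / EU / Rea / KIR / Sum / TG) | LongVideoBench (Avg. / Long) / Video-MME Long (w/o sub) / EgoSchema Val |
| --- | --- | --- | --- |
| End-to-End VLMs (Proprietary) | | | |
| GPT-4o | GPT-4o | 48.9 / 48.9 / 49.5 / 50.3 / 48.1 / 50.0 / 40.9 | 66.7 / 60.9 / 65.3 / 70.4 |
| Gemini-2.0-Flash | Gemini-2.0-Flash | 48.6 / 47.4 / 48.5 / 44.4 / 56.8 / 41.4 / 39.3 | 45.7 / 63.0 / 71.2 |
| Gemini-2.5-Pro | Gemini-2.5-Pro | 72.0 / 71.5 / 71.1 / 67.7 / 80.0 / 63.5 / 69.1 | 71.0 / 68.6 / 75.9 / 72.8 |
| Gemini-3.1-Pro | Gemini-3.1-Pro | 78.2 / 78.1 / 76.7 / 74.1 / 86.3 / 70.7 / 81.4 | 78.6 / 77.0 / 80.3 / 76.4 |
| OpenAI-o3 | OpenAI-o3 | 57.1 / 57.6 / 56.4 / 50.8 / 62.9 / 67.2 / 46.8 | 66.7 / 60.6 / 64.7 / 63.2 |
| Seed1.5VL | Seed1.5VL-Thinking | 64.6 / 65.4 / 63.4 / 68.0 / 53.6 / 63.7 / 46.6 | 74.4 |
| End-to-End VLMs (Open-Source) | | | |
| Qwen3-VL | Qwen3-VL-235B-A22B-Thinking | 63.6 / 63.7 / 62.6 / 63.1 / 62.6 / 65.5 / 59.6 | 71.4 / 67.1 / 71.2 / 75.9 |
| Qwen3.5 | Qwen3.5-35A3B | 71.3 / 72.2 / 69.4 / 74.4 / 68.9 / 67.5 / 66.2 | 62.6 |
| GLM-4.6V | GLM-4.6V-106B-A12B | 59.5 / 58.8 / 59.2 / 69.7 / 57.8 / 61.8 / 53.9 | 67.6 / 58.7 / 66.0 / 68.8 |
| GLM-4.5V | GLM-4.5V-106B-A12B | 53.4 / 55.3 / 53.1 / 57.5 / 55.3 / 39.6 / 49.9 | 66.4 / 54.1 / 64.7 / 68.4 |
| InternVL2.5 | InternVL2.5-78B | 43.6 / 43.8 / 42.0 / 51.0 / 37.9 / 42.1 / 36.8 | 62.6 |
| InternVL3.5 | InternVL3.5-30BA3B | 44.4 / 42.7 / 44.1 / 48.3 / 46.4 / 36.2 / 40.9 | 62.9 / 52.7 / 64.1 / 86.8 |
| AdaRETAKE | AdaRETAKE | 53.3 / 53.0 / 50.7 / 54.7 / 62.2 / 37.9 / 45.5 | 67.0 / 65.0 |
| Agentic Video Understanding Methods | | | |
| VideoTree | VideoTree | 28.8 / 30.3 / 25.1 / 31.9 / 26.5 / 25.5 / 27.7 | 67.0 |
| VideoAgent | VideoAgent | 29.3 / 28.0 / 30.3 / 28.0 / 28.0 / 36.4 / 29.3 | 63.2 |
| VCA | VCA | 41.3 / 43.7 / 40.7 / 46.2 / 37.8 / 27.3 / 38.0 | 73.6 |
| MR. Video | MR. Video | 60.8 / 59.8 / 57.4 / 57.7 / 71.4 / 50.0 / 58.8 | 61.6 / 61.8 / 73.0 |
| M3-Agent | M3-Agent | 49.3 | 61.8 |
| WorldMM-GPT | GPT-5 | 61.9 | 76.6 |
| MM-Mem | MM-Mem | 66.1 | |
| DVD | OpenAI-o3 | 74.2 / 73.4 / 73.3 / 70.7 / 80.4 / 74.1 / 72.3 | 71.6 / 68.6 / 67.3 / 76.6 |
| VideoARM | OpenAI-o3 | 79.7 | 78.0 / 76.4 / 81.2 / 76.2 |
| VideoSeek | GPT-5 | 68.4 | 73.5 / 60.9 |
| MemDreamer (Ours, Plug-and-Play Agentic Framework) | | | |
| MemDreamer | Qwen3-VL-235B-A22B-Thinking | 84.8 (+21.2) / 84.6 / 85.2 / 80.6 / 85.6 / 84.5 / 87.7 | 86.3 (+14.9) / 83.2 (+16.1) / 86.2 (+15.0) / 87.4 (+11.5) |
| MemDreamer | Gemini-2.5-Pro | 80.7 (+8.7) / 77.1 / 83.6 / 81.1 / 82.5 / 91.4 / 86.4 | 78.6 (+7.6) / 78.7 (+10.1) / 85.0 (+9.1) / 88.2 (+15.4) |
| MemDreamer | Gemini-3.1-Pro | 90.7 (+12.5) / 90.1 / 90.6 / 89.6 / 91.4 / 89.7 / 91.8 | 92.9 (+14.3) / 91.0 (+14.0) / 92.1 (+11.8) / 87.8 (+11.4) |
| Human Expert | Human | 94.4 | |

Rows with fewer values than the header list only the values reported for that method, in column order.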

Main results on four long-video understanding benchmarks. Values in parentheses (shown in green) indicate the improvement over the strongest end-to-end baseline using the same backbone. LVBench sub-dimensions: ER (Event Recognition), EU (Event Understanding), Rea (Reasoning), KIR (Key Information Retrieval), Sum (Summarization), TG (Temporal Grounding).
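The parenthesized gains can be re-derived from the table itself. The short check below uses only LVBench Avg. scores copied from the rows above (end-to-end baseline and MemDreamer with the same backbone) plus the human-expert score; the variable names are illustrative.

```python
# LVBench Avg. scores copied from the results table: (end-to-end baseline, MemDreamer).
LVBENCH_AVG = {
    "Qwen3-VL-235B-A22B-Thinking": (63.6, 84.8),
    "Gemini-2.5-Pro": (72.0, 80.7),
    "Gemini-3.1-Pro": (78.2, 90.7),
}
HUMAN_EXPERT = 94.4

# Absolute gain per backbone, rounded to the table's one-decimal precision.
gains = {model: round(ours - base, 1) for model, (base, ours) in LVBENCH_AVG.items()}

# Gap between the best MemDreamer configuration and the human-expert score.
best_score = max(ours for _base, ours in LVBENCH_AVG.values())
human_gap = round(HUMAN_EXPERT - best_score, 1)

for model, gain in gains.items():
    print(f"{model}: +{gain}")
print(f"gap to human expert: {human_gap}")
```

The results match the table's +21.2, +8.7, and +12.5, and the 3.7-point gap to human experts quoted in the abstract.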

Case Analysis

Case study: Direct drill-down on Japanese culinary travelogue
Case 1: Direct Drill-Down. The agent localizes the question to a specific super event, drills into the subgraph, and traces the causal chain (pour → boil → spread → mesh-tray drying) to correctly answer "Drying them." End-to-end baselines see the pour shot but lack the downstream causal context, mistaking it for "Cleaning them."
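The causal-chain step in Case 1 amounts to following temporal-causal edges forward from the observed event. The sketch below illustrates that idea on the four events named in the caption; the adjacency-map representation and the `trace_chain` helper are assumptions for illustration, not the framework's actual traversal tool.

```python
# Temporal-causal successors for the micro-events named in the Case 1 caption.
# The adjacency-map form is a hypothetical representation of those edges.
CAUSAL_NEXT = {
    "pour": ["boil"],
    "boil": ["spread"],
    "spread": ["mesh-tray drying"],
}


def trace_chain(start: str, edges: dict) -> list:
    """Follow temporal-causal successors from `start` until no successor remains."""
    chain, seen = [start], {start}
    current = start
    while edges.get(current):
        nxt = edges[current][0]   # this toy chain has one successor per node
        if nxt in seen:           # guard against cycles in malformed graphs
            break
        chain.append(nxt)
        seen.add(nxt)
        current = nxt
    return chain


chain = trace_chain("pour", CAUSAL_NEXT)
print(" -> ".join(chain))
```

Starting from the pour event alone, the traversal reaches the drying step, which is the downstream context the caption says end-to-end baselines lack.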
Case study: Multi-round reformulation on Nigerian news broadcast
Case 2: Multi-Round Reformulation. When the first-round retrieval returns no matching node, the agent logs the negative result, reformulates by dropping the "cowboy hat" detail (unlikely in textual descriptions), and successfully retrieves the correct macro event in round 2. Round 3 drills into the subgraph to identify "Rivers Politics" via entity description and OCR overlay.
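The reformulation behaviour in Case 2 can be approximated as a retry loop that logs each miss and drops a detail before searching again. In this sketch the node text, the query terms other than "cowboy hat", and the explicit `droppable` list are invented for illustration; in MemDreamer the reasoning model decides what to drop, which a fixed list only imitates.

```python
def search_nodes(terms: list, memory: dict) -> list:
    """Return ids of nodes whose description contains every query term."""
    if not terms:
        return []
    return [
        node_id
        for node_id, text in memory.items()
        if all(term.lower() in text.lower() for term in terms)
    ]


def search_with_reformulation(terms: list, droppable: list, memory: dict):
    """Search; after a miss, drop the next droppable term and retry.

    Returns the final hits and a log of (query terms, hits) per round.
    """
    terms = list(terms)
    pending = list(droppable)
    log = []
    while True:
        hits = search_nodes(terms, memory)
        log.append((tuple(terms), hits))   # negative results are logged too
        if hits or not pending:
            return hits, log
        dropped = pending.pop(0)
        terms = [t for t in terms if t != dropped]


# Hypothetical macro-event description; visual details like clothing are absent.
toy_memory = {"macro_7": "A man is interviewed outdoors by a reporter."}

hits, log = search_with_reformulation(
    terms=["man", "cowboy hat", "interviewed"],
    droppable=["cowboy hat"],
    memory=toy_memory,
)
for round_no, (query, result) in enumerate(log, start=1):
    print(f"round {round_no}: {query} -> {result}")
```

Keeping the failed query in the log mirrors the caption's note that the agent records the negative result before reformulating.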

Interactive Memory Visualization

Hierarchical Graph Memory Explorer

Explore how MemDreamer constructs and navigates hierarchical graph memory for hours-long videos.

Coming Soon

Citation

@misc{chen2026memdreamerdecouplingperceptionreasoning,
      title={MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism}, 
      author={Cong Chen and Guo Gan and Kaixiang Ji and ChaoYang Zhang and Zhen Yang and Guangming Yao and Hao Chen and Jingdong Chen and Yi Yuan and Chunhua Shen},
      year={2026},
      eprint={2606.07512},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.07512}, 
}