MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Agentic Hierarchical Graph Memory

Cong Chen¹,²,*, Guo Gan²,*, Kaixiang Ji¹,*, Zhaoyang Zhang¹,³, Zhen Yang⁴, Guangming Yao¹, Hao Chen², Jingdong Chen¹, Yi Yuan¹, Chunhua Shen¹,²,†

¹Ant Group ²Zhejiang University ³Central South University ⁴HKUST(GZ)
* Equal Contribution † Corresponding Author

MemDreamer key results: correlation between reasoning ability and long video understanding, and context window comparison

(Left) Decoupling perception unlocks reasoning ability for long video understanding. End-to-end performance is insensitive to reasoning ability, while MemDreamer exhibits a strong linear trend (R=0.897). (Right) MemDreamer only needs a 5-6K reasoning window, 41-124x smaller than end-to-end input.
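The R value in the left panel is presumably a Pearson correlation coefficient, which measures how closely two sets of scores follow a straight line. The sketch below is only a reminder of how such a coefficient is computed; the function name `pearson_r` and the toy inputs are illustrative assumptions and are not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy, synthetic inputs (NOT benchmark scores): a perfectly linear pair.
toy_reasoning = [1.0, 2.0, 3.0, 4.0]
toy_video = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(toy_reasoning, toy_video)
```

A value near 1 indicates a strong positive linear trend, which is the sense in which the caption reports R=0.897.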

Abstract

TL;DR: MemDreamer achieves SOTA on 4 long-video benchmarks, narrows the gap with human experts to only 3.7 points, and uses merely 2% of the context window while delivering a 12.5-point absolute gain over end-to-end baselines.

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and recasts long-video understanding as an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory: a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Key Contributions

🔄

Decoupled Paradigm

Separates perception from reasoning via persistent memory and agentic retrieval, bypassing long-context bottlenecks. Reveals for the first time a positive correlation between reasoning capacity and long-video performance.

🏆

SOTA Performance

Achieves 90.7 on LVBench, only 3.7 points below human experts (94.4) and a 12.5-point absolute gain over the end-to-end baseline with the same backbone (Gemini-3.1-Pro, 78.2). Gains over same-backbone baselines hold on all 4 benchmarks.

📊

Hierarchical Graph Memory

Three-tier coarse-to-fine topology (Video Root → Super Events → Macro Events) with entity-event subgraphs capturing spatiotemporal and causal dependencies.
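To make the three-tier topology concrete, here is a minimal data-structure sketch. All class and field names (`VideoRoot`, `SuperEvent`, `MacroEvent`, `Edge`, `add_macro_event`) are hypothetical choices for illustration; the released implementation may organize the memory differently.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    source: str    # id of an entity or event node
    target: str
    relation: str  # e.g. "spatial", "temporal", or "causal"

@dataclass
class MacroEvent:
    event_id: str
    summary: str
    nodes: dict = field(default_factory=dict)  # node id -> text description
    edges: list = field(default_factory=list)  # entity-event subgraph

@dataclass
class SuperEvent:
    event_id: str
    summary: str
    macro_events: list = field(default_factory=list)

@dataclass
class VideoRoot:
    video_id: str
    summary: str
    super_events: list = field(default_factory=list)

    def add_macro_event(self, super_id, macro):
        """Attach a macro event under an existing super event."""
        for super_event in self.super_events:
            if super_event.event_id == super_id:
                super_event.macro_events.append(macro)
                return
        raise KeyError(super_id)

# Toy content, invented for illustration only.
root = VideoRoot("toy_video", "A short toy video.")
root.super_events.append(SuperEvent("super_1", "First segment."))
macro = MacroEvent("macro_1", "A person opens a door.")
macro.nodes = {"person": "a person", "open_door": "the door is opened"}
macro.edges.append(Edge("person", "open_door", "causal"))
root.add_macro_event("super_1", macro)
```

Keeping the subgraph inside each macro event means a reader can stop at a coarse summary or descend to individual entities and relations only when a question requires it.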

🤖

Agentic Tool-Augmented Retrieval

Multi-dimensional tool bank (Navigation + Search + Graph Traversal) drives an Observation-Reason-Action loop, transforming static understanding into active multi-step exploration.
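The loop can be sketched as a small driver that alternates between a reasoning step and a tool call. Everything here is an assumption made for illustration: the tool names, the toy memory, and the scripted `scripted_policy`, which stands in for the reasoning model. Only navigation and search are shown; a graph-traversal tool would be registered in the same way.

```python
# Hypothetical toy memory: node id -> text description, plus a child index.
MEMORY = {
    "root": "A toy video about a garden.",
    "super_1": "Segment in which a person tends plants.",
    "macro_1": "The person waters the plants with a can.",
}
CHILDREN = {"root": ["super_1"], "super_1": ["macro_1"], "macro_1": []}

def navigate(node_id):
    """Navigation tool: list the children of a node."""
    return ", ".join(CHILDREN.get(node_id, [])) or "(no children)"

def search(keyword):
    """Search tool: ids of nodes whose description mentions the keyword."""
    hits = [nid for nid, text in MEMORY.items() if keyword.lower() in text.lower()]
    return ", ".join(hits) or "(no match)"

TOOLS = {"navigate": navigate, "search": search}

def run_agent(policy, question, max_steps=5):
    """Alternate Reason (policy) and Action (tool call) until an answer is given."""
    observation = f"Question: {question}"
    trace = []
    for _ in range(max_steps):
        action, arg = policy(observation, trace)  # Reason
        if action == "answer":
            return arg, trace
        observation = TOOLS[action](arg)          # Action yields a new Observation
        trace.append((action, arg, observation))
    return None, trace                            # step budget exhausted

def scripted_policy(observation, trace):
    """Stand-in for the reasoning model: a fixed three-step script."""
    if not trace:
        return "search", "plants"
    if len(trace) == 1:
        return "navigate", "super_1"
    return "answer", "macro_1"

answer, trace = run_agent(scripted_policy, "What does the person do to the plants?")
```

Because each step passes only a short observation to the policy, the reasoning context stays small regardless of video length, which is the property the framework relies on.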

Main Results

Method	Reason Model	LVBench Avg.	LVBench ER	LVBench EU	LVBench Rea	LVBench KIR	LVBench Sum	LVBench TG	LongVideoBench Avg.	LongVideoBench Long	Video-MME Long (w/o sub)	EgoSchema Val
End-to-End VLMs (Proprietary)
GPT-4o	GPT-4o	48.9	48.9	49.5	50.3	48.1	50.0	40.9	66.7	60.9	65.3	70.4
Gemini-2.0-Flash	Gemini-2.0-Flash	48.6	47.4	48.5	44.4	56.8	41.4	39.3	—	45.7	63.0	71.2
Gemini-2.5-Pro	Gemini-2.5-Pro	72.0	71.5	71.1	67.7	80.0	63.5	69.1	71.0	68.6	75.9	72.8
Gemini-3.1-Pro	Gemini-3.1-Pro	78.2	78.1	76.7	74.1	86.3	70.7	81.4	78.6	77.0	80.3	76.4
OpenAI-o3	OpenAI-o3	57.1	57.6	56.4	50.8	62.9	67.2	46.8	66.7	60.6	64.7	63.2
Seed1.5VL	Seed1.5VL-Thinking	64.6	65.4	63.4	68.0	53.6	63.7	46.6	—	74.4	—	—
End-to-End VLMs (Open-Source)
Qwen3-VL	Qwen3-VL-235B-A22B-Thinking	63.6	63.7	62.6	63.1	62.6	65.5	59.6	71.4	67.1	71.2	75.9
Qwen3.5	Qwen3.5-35A3B	71.3	72.2	69.4	74.4	68.9	67.5	66.2	—	—	62.6	—
GLM-4.6V	GLM-4.6V-106B-A12B	59.5	58.8	59.2	69.7	57.8	61.8	53.9	67.6	58.7	66.0	68.8
GLM-4.5V	GLM-4.5V-106B-A12B	53.4	55.3	53.1	57.5	55.3	39.6	49.9	66.4	54.1	64.7	68.4
InternVL2.5	InternVL2.5-78B	43.6	43.8	42.0	51.0	37.9	42.1	36.8	—	—	62.6	—
InternVL3.5	InternVL3.5-30BA3B	44.4	42.7	44.1	48.3	46.4	36.2	40.9	62.9	52.7	64.1	86.8
AdaRETAKE	AdaRETAKE	53.3	53.0	50.7	54.7	62.2	37.9	45.5	67.0	—	65.0	—
Agentic Video Understanding Methods
VideoTree	VideoTree	28.8	30.3	25.1	31.9	26.5	25.5	27.7	—	—	—	67.0
VideoAgent	VideoAgent	29.3	28.0	30.3	28.0	28.0	36.4	29.3	—	—	—	63.2
VCA	VCA	41.3	43.7	40.7	46.2	37.8	27.3	38.0	—	—	—	73.6
MR. Video	MR. Video	60.8	59.8	57.4	57.7	71.4	50.0	58.8	—	61.6	61.8	73.0
M3-Agent	M3-Agent	49.3	—	—	—	—	—	—	—	—	61.8	—
WorldMM-GPT	GPT-5	61.9	—	—	—	—	—	—	—	—	76.6	—
MM-Mem	MM-Mem	—	—	—	—	—	—	—	—	—	66.1	—
DVD	OpenAI-o3	74.2	73.4	73.3	70.7	80.4	74.1	72.3	71.6	68.6	67.3	76.6
VideoARM	OpenAI-o3	79.7	—	—	—	—	—	—	78.0	76.4	81.2	76.2
VideoSeek	GPT-5	68.4	—	—	—	—	—	—	—	73.5	60.9	—
MemDreamer (Ours, Plug-and-Play Agentic Framework)
MemDreamer	Qwen3-VL-235B-A22B-Thinking	84.8 (+21.2)	84.6	85.2	80.6	85.6	84.5	87.7	86.3 (+14.9)	83.2 (+16.1)	86.2 (+15.0)	87.4 (+11.5)
MemDreamer	Gemini-2.5-Pro	80.7 (+8.7)	77.1	83.6	81.1	82.5	91.4	86.4	78.6 (+7.6)	78.7 (+10.1)	85.0 (+9.1)	88.2 (+15.4)
MemDreamer	Gemini-3.1-Pro	90.7 (+12.5)	90.1	90.6	89.6	91.4	89.7	91.8	92.9 (+14.3)	91.0 (+14.0)	92.1 (+11.8)	87.8 (+11.4)
Human Expert	Human	94.4	—	—	—	—	—	—	—	—	—	—

Main results on four long-video understanding benchmarks. Values in parentheses (shown in green on the original page) indicate the improvement over the end-to-end baseline using the same backbone. LVBench sub-dimensions: ER (Event Recognition), EU (Event Understanding), Rea (Reasoning), KIR (Key Information Retrieval), Sum (Summarization), TG (Temporal Grounding).
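The parenthesized gains can be re-derived from the table itself. The short check below uses only the LVBench Avg. scores copied from the rows above; the variable names are arbitrary.

```python
# LVBench Avg. scores copied from the table: end-to-end baseline vs. MemDreamer
# with the same backbone.
baseline = {
    "Qwen3-VL-235B-A22B-Thinking": 63.6,
    "Gemini-2.5-Pro": 72.0,
    "Gemini-3.1-Pro": 78.2,
}
memdreamer = {
    "Qwen3-VL-235B-A22B-Thinking": 84.8,
    "Gemini-2.5-Pro": 80.7,
    "Gemini-3.1-Pro": 90.7,
}
human_expert = 94.4

# Absolute gain per backbone, rounded to one decimal like the table.
gains = {name: round(memdreamer[name] - baseline[name], 1) for name in baseline}

# Remaining gap between the best MemDreamer score and human experts.
gap_to_human = round(human_expert - max(memdreamer.values()), 1)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"gap to human expert: {gap_to_human}")
```

The results match the table: +21.2, +8.7, and +12.5 for the three backbones, with a 3.7-point gap to the human expert score.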

Case Analysis

Case study: Direct drill-down on Japanese culinary travelogue

Case 1: Direct Drill-Down. The agent localizes the question to a specific super event, drills into the subgraph, and traces the causal chain (pour → boil → spread → mesh-tray drying) to correctly answer "Drying them." End-to-end baselines see the pour shot but lack the downstream causal context, mistaking it for "Cleaning them."
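Tracing the causal chain amounts to following directed edges from the observed action to its final outcome. The sketch below uses the four step names from Case 1; the edge-list representation and the function `trace_chain` are assumptions for illustration, not the framework's actual traversal tool.

```python
# Directed edges of a hypothetical subgraph; the step names come from Case 1.
EDGES = [
    ("pour", "boil"),
    ("boil", "spread"),
    ("spread", "mesh-tray drying"),
]

def trace_chain(start, edges):
    """Follow outgoing edges from `start` until a node has no successor.

    Assumes each node has at most one outgoing edge.
    """
    successor = dict(edges)
    chain = [start]
    seen = {start}
    while chain[-1] in successor:
        nxt = successor[chain[-1]]
        if nxt in seen:  # guard against cycles
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain

chain = trace_chain("pour", EDGES)
print(" -> ".join(chain))
```

Seeing only the first node ("pour") leaves the purpose of the action ambiguous, while the end of the chain ("mesh-tray drying") settles it.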

Case study: Multi-round reformulation on Nigerian news broadcast

Case 2: Multi-Round Reformulation. When the first-round retrieval returns no matching node, the agent logs the negative result, reformulates by dropping the "cowboy hat" detail (unlikely in textual descriptions), and successfully retrieves the correct macro event in round 2. Round 3 drills into the subgraph to identify "Rivers Politics" via entity description and OCR overlay.
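The reformulation strategy can be sketched as retrieval with a fallback that drops details unlikely to appear in textual descriptions. The memory entry, the matching rule (plain substring matching), and the function names are hypothetical; only the "cowboy hat" detail is taken from Case 2.

```python
def retrieve(query_terms, memory):
    """Return ids of nodes whose description contains every query term."""
    return [
        node_id
        for node_id, text in memory.items()
        if all(term.lower() in text.lower() for term in query_terms)
    ]

def retrieve_with_reformulation(query_terms, droppable, memory):
    """Retry retrieval, dropping one droppable term after each empty round."""
    terms = list(query_terms)
    to_drop = list(droppable)
    log = []  # one (terms, hits) entry per round, including negative results
    while True:
        hits = retrieve(terms, memory)
        log.append((tuple(terms), hits))
        if hits or not to_drop:
            return hits, log
        dropped = to_drop.pop(0)
        terms = [t for t in terms if t != dropped]

# Hypothetical memory entry, invented for illustration.
memory = {"macro_7": "A man is interviewed during a news broadcast about politics."}

hits, log = retrieve_with_reformulation(
    ["man", "cowboy hat", "interview"],
    droppable=["cowboy hat"],
    memory=memory,
)
```

Round 1 returns no node because the description never mentions the hat; round 2 succeeds once that term is dropped, and the log preserves the negative result from the first round.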

Citation

@misc{chen2026memdreamerdecouplingperceptionreasoning,
      title={MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism}, 
      author={Cong Chen and Guo Gan and Kaixiang Ji and ChaoYang Zhang and Zhen Yang and Guangming Yao and Hao Chen and Jingdong Chen and Yi Yuan and Chunhua Shen},
      year={2026},
      eprint={2606.07512},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.07512}, 
}
