CVPR 2026 · Denver, Colorado Zhejiang University · Ant Group · Westlake University

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hao Zhong*, Muzhi Zhu*, Shenyan Zeng*, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen

* Equal contribution  ·  Corresponding authors

The Question

Can MLLMs match the same physical region across extreme viewpoint changes?

Physical-world MLLMs need more than object recognition. Wide-baseline matching (WBM) asks a model to bind local cues into a globally consistent cross-view map — demanding geometry, semantics, fine-grained perception, and occlusion reasoning all at once.

We turn this physical-world spatial skill into a scalable, verifiable training signal and benchmark. View pairs harvested from RGB-D / SfM reconstructions come with exact, checkable correspondences — so models can be trained with verifiable rewards and evaluated without subjective judgement.

70.5 F1
DCRL on ReasonMatch-Bench — surpassing GPT-5-mini and Gemini-2.5-Pro
2,810
Stratified image pairs in ReasonMatch-Bench
220k
Automatically harvested training corpus from RGB-D and SfM reconstructions
Viewpoint

Large camera motion

Objects and shelves move from frontal to oblique views, so local appearance alone is ambiguous.

Visibility

Occlusion and no-match cases

Some annotated regions leave the frame or become hidden, requiring explicit unmatched predictions.

Perception

Fine-grained labels

Dense visual prompts stress label reading, object boundaries, and small region discrimination.

Reasoning

Global consistency

Correct matches depend on scene layout, repeated structures, and stable anchors across views.

Method

DCRL — Dynamic Correspondence Reinforcement Learning

We harvest RGB-D / SfM view pairs with verifiable supervision and train an 8B MLLM with format and matching rewards — no explicit chain-of-thought labels required.

Data

220k multi-view pairs

RGB-D videos and SfM reconstructions provide real-world viewpoint diversity.

RE10KDL3DVCO3DuCO3DScanNet
Supervision

Verified point pool

Depth or shared 3D landmarks produce 10–50 spatially separated correspondences per pair.

Curriculum

Dynamic scheduling

  • Viewpoint: Identical → Hard
  • Point-level: L1 → L2 → L3
  • Spatial distribution refinement
Outcome

DCRL · 8B MLLM

RL with a holistic matching reward turns matching into a transferable spatial skill.

Supervision is mined directly from multi-view reconstructions, so every region correspondence is geometrically grounded and machine-checkable. The model is optimized end-to-end with reinforcement learning using both format compliance and a holistic matching reward over all query regions:

Reward rmatch  =  1n   Σi=1n   𝟙[ (i) = f*(i) ]
VP

Image-level viewpoint progression

Overlap bins schedule pairs from near-identical views toward hard wide baselines.

L1–3

Point-level correspondence curriculum

L1 unambiguous → L2 selective → L3 partial matching, with distractors and no-match cases.

SD

Spatial distribution refinement

Sampling evolves from sparse global anchors toward denser fine-grained spatial layouts.

Benchmark

ReasonMatch-Bench

A scalable, verifiable MLLM benchmark of 2,810 image pairs, curated from a 220k-pair corpus and balanced across data sources, scene types, and task difficulty levels.

2,810
Evaluation image pairs with machine-checkable correspondences
10–50
Spatially separated verified correspondences per preprocessed pair
Viewpoint
Stratified by camera displacement — from modest overlap to extreme wide baselines.
L1–L3
Progressively harder matching tasks with distractors and unmatched regions

Data sources

uCO3D28.0%
ScanNet27.7%
DL3DV27.0%
RE10K17.2%

Scene types

Indoor55.1%
Object28.0%
Outdoor16.9%

Task levels

L132.5%
L236.8%
L330.7%
L1

Unambiguous matching

One-to-one correspondences, with no distractors in either view.

L2

Selective matching

Target-view distractors force the model to select the geometrically consistent counterpart.

L3

Partial matching

Distractors and unmatched regions appear in both views, modeling occlusion and limited overlap.

On the hard subset, humans reach 84.0 F1 while the best model (DCRL) reaches 52.0 F1 — a substantial headroom that marks wide-baseline matching as an open challenge for MLLMs.

Results

Targeted RL makes an 8B model outperform far stronger baselines

DCRL reaches 70.5 F1 on ReasonMatch-Bench, outperforming all evaluated open- and closed-source MLLM baselines while transferring to related spatial-reasoning benchmarks.

Hard high-divergence subset

Human84.0
DCRL52.0
GPT-5-mini37.2

Transfer beyond matching

OmniSpatial43.6 → 48.9
MindCube40.0 → 43.5
SAT Real70.0 → 75.3

Main benchmark

ReasonMatch-Bench
Overall ReasonMatch-Bench F1Higher is better · 2,810 image pairs
Model / SettingF1
Qwen3-VL-8B-Instruct27.5
GPT-4o (241106)33.5
Claude-4.5-Sonnet41.7
Gemini-2.5-Pro42.8
Qwen3-VL-235B49.2
GPT-5-Chat51.5
GPT-5-mini57.9
Qwen3-VL-8B + DCRL70.5
Full-benchmark overall F1 and selected DCRL difficulty slices. Object-centric L3 remains challenging.

Generalization & transfer

Held-out skills & datasets
Transfer to external spatial benchmarksSkill transfers from matching to other spatial tasks
BenchmarkBaseDCRLGain
ReasonMatch27.570.5+43.0
OmniSpatial43.648.9+5.3
MindCube40.043.5+3.5
SAT Real70.075.3+5.3
Why reinforcement learning?RL transfers under distribution shift; SFT does not
StrategyOmniSpatialSAT RealReasonMatch
Base43.670.027.5
SFT42.641.351.0
DCRL48.975.370.5

Curriculum design

Ablation
Dynamic curriculum beats fixed-difficulty trainingReasonMatch-Bench F1 under the same RL setup
Training strategyReasonMatch F1Δ vs. uniform
Uniform sampling65.3
Easy-only samples59.9−10.6
Hard-only samples62.3−8.2
Dynamic curriculum70.5+5.2

Qualitative

Reasoning over a wide-baseline pair

Given two views of the same scene, the model must map region IDs from image A to image B. DCRL reasons over multi-tier scene structure; the base-style model over-relies on the nearest anchors.

A cross-view matching example on a plant-shelf scene. Qwen3-VL+DCRL produces correct multi-tier shelf reasoning and viewpoint-invariant matching, while GPT-5-mini produces a wrong inference by over-relying on nearest anchors.
Left: Qwen3-VL + DCRL — correct, via multi-tier shelf reasoning and viewpoint-invariant matching. Right: GPT-5-mini — wrong, from over-reliance on nearest anchors.

Open Release

Code, benchmark archives, and evaluation scripts

The release is designed for reproducible evaluation and recipe inspection. Training data is not included; training entry points are provided for users with compatible LMDB-formatted data.

Code repository

Paper-specific training, reward, buffer, and evaluation code built on top of a vendored verl stack.

Open GitHub

Benchmark data

ModelScope release includes reasonmatch_bench.tar.gz and ood_dataset.tar.gz.

Open Dataset

Evaluation protocol

Evaluation expects an OpenAI-compatible chat endpoint and reports benchmark summaries from saved predictions.

Citation

Citation

BibTeX

If you find ReasonMatch useful, please cite our work.

@InProceedings{Zhong_2026_CVPR,
  author    = {Zhong, Hao and Zhu, Muzhi and Zeng, Shenyan and Li, Anzhou and Chen, Cong and Geng, Hua and Shi, Duochao and Ye, Wentao and Lin, Tao and Chen, Hao and Shen, Chunhua},
  title     = {Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {16768-16778}
}