ReasonMatch — Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

The Question

Can MLLMs match the same physical region across extreme viewpoint changes?

Physical-world MLLMs need more than object recognition. Wide-baseline matching (WBM) asks a model to bind local cues into a globally consistent cross-view map — demanding geometry, semantics, fine-grained perception, and occlusion reasoning all at once.

We turn this physical-world spatial skill into a scalable, verifiable training signal and benchmark. View pairs harvested from RGB-D / SfM reconstructions come with exact, checkable correspondences — so models can be trained with verifiable rewards and evaluated without subjective judgement.

70.5 F1

DCRL on ReasonMatch-Bench — surpassing GPT-5-mini and Gemini-2.5-Pro

2,810

Stratified image pairs in ReasonMatch-Bench

220k

Automatically harvested training corpus from RGB-D and SfM reconstructions

Viewpoint

Large camera motion

Objects and shelves move from frontal to oblique views, so local appearance alone is ambiguous.

Visibility

Occlusion and no-match cases

Some annotated regions leave the frame or become hidden, requiring explicit unmatched predictions.

Perception

Fine-grained labels

Dense visual prompts stress label reading, object boundaries, and small region discrimination.

Reasoning

Global consistency

Correct matches depend on scene layout, repeated structures, and stable anchors across views.

Method

DCRL — Dynamic Correspondence Reinforcement Learning

We harvest RGB-D / SfM view pairs with verifiable supervision and train an 8B MLLM with format and matching rewards — no explicit chain-of-thought labels required.

Data

220k multi-view pairs

RGB-D videos and SfM reconstructions provide real-world viewpoint diversity.

RE10K · DL3DV · CO3D · uCO3D · ScanNet

Supervision

Verified point pool

Depth or shared 3D landmarks produce 10–50 spatially separated correspondences per pair.
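One simple way to obtain a spatially separated point pool is greedy minimum-distance filtering, sketched below. This is an illustrative assumption rather than the released pipeline: the function name `select_separated`, the pixel threshold `min_dist`, and the first-come ordering are all hypothetical, and only the upper bound of 50 points comes from the text above.

```python
import math

def select_separated(points, min_dist, max_points=50):
    """Greedily keep points at least `min_dist` pixels from every kept point.

    `points` is a list of (x, y) pixel coordinates of verified
    correspondences in one view (hypothetical input format).
    """
    kept = []
    for x, y in points:
        # Accept the point only if it is far enough from all accepted points.
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
            if len(kept) == max_points:
                break
    return kept

# Two tight clusters and one isolated point; one point survives per cluster.
candidates = [(10, 10), (12, 11), (100, 40), (101, 42), (300, 200)]
pool = select_separated(candidates, min_dist=20)
print(pool)  # [(10, 10), (100, 40), (300, 200)]
```

A pair whose filtered pool falls below the lower bound of 10 points could then be discarded, although the source does not state how such pairs are handled.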

Curriculum

Dynamic scheduling

Viewpoint: Identical → Hard
Point-level: L1 → L2 → L3
Spatial distribution refinement

Outcome

DCRL · 8B MLLM

RL with a holistic matching reward turns matching into a transferable spatial skill.

Supervision is mined directly from multi-view reconstructions, so every region correspondence is geometrically grounded and machine-checkable. The model is optimized end-to-end with reinforcement learning using both format compliance and a holistic matching reward over all query regions:

Reward: $r_{\text{match}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ \hat{f}(i) = f^{*}(i) \right]$
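The matching reward is the fraction of query regions whose predicted target agrees with the ground truth. The sketch below is a minimal reading of that formula, not the released reward code. It assumes predictions and ground truth are dictionaries from query ID to target ID, that `None` encodes "no match", and that a query missing from the prediction counts as wrong. The format reward is not modeled here.

```python
def matching_reward(pred, gt):
    """Fraction of query regions i with pred[i] == gt[i].

    `gt` maps every query ID to its target ID, or to None for "no match"
    (an assumed encoding). Queries absent from `pred` count as incorrect.
    """
    if not gt:
        return 0.0
    correct = sum(
        1 for query, target in gt.items()
        if query in pred and pred[query] == target
    )
    return correct / len(gt)

gt = {1: "A", 2: "C", 3: None, 4: "B"}
pred = {1: "A", 2: "B", 3: None}  # query 4 was left unanswered
print(matching_reward(pred, gt))  # 0.5
```

Checking `query in pred` explicitly matters under this encoding: a plain `pred.get(query)` would return `None` for an unanswered query and wrongly score it as a correct "no match".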

Image-level viewpoint progression

Overlap bins schedule pairs from near-identical views toward hard wide baselines.

L1–3

Point-level correspondence curriculum

L1 unambiguous → L2 selective → L3 partial matching, with distractors and no-match cases.

Spatial distribution refinement

Sampling evolves from sparse global anchors toward denser fine-grained spatial layouts.
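The image-level progression can be pictured as a sampler whose probability mass moves from easy overlap bins to hard ones as training advances. The sketch below is only one possible schedule. The bin names, the Gaussian weighting, and the `progress` and `sharpness` parameters are assumptions made for illustration; the source does not specify the scheduling rule, which may instead adapt to model performance.

```python
import math
import random

# Hypothetical overlap bins, ordered from near-identical views to hard wide baselines.
BINS = ["identical", "easy", "medium", "hard"]

def bin_weights(progress, n_bins=len(BINS), sharpness=2.0):
    """Sampling weights over bins for a training progress in [0, 1].

    The weight peaks at the bin index `progress * (n_bins - 1)`, so the peak
    slides from the easiest bin toward the hardest as training advances.
    """
    center = progress * (n_bins - 1)
    raw = [math.exp(-sharpness * (i - center) ** 2) for i in range(n_bins)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_bin(progress, rng):
    """Draw one bin name according to the current weights."""
    return rng.choices(BINS, weights=bin_weights(progress), k=1)[0]

rng = random.Random(0)
for progress in (0.0, 0.5, 1.0):
    weights = [round(w, 3) for w in bin_weights(progress)]
    print(progress, weights, sample_bin(progress, rng))
```

Because neighbouring bins keep non-zero weight, earlier difficulty levels are still revisited occasionally instead of being dropped outright.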

Benchmark

ReasonMatch-Bench

A scalable, verifiable MLLM benchmark of 2,810 image pairs, curated from a 220k-pair corpus and balanced across data sources, scene types, and task difficulty levels.

2,810

Evaluation image pairs with machine-checkable correspondences

10–50

Spatially separated verified correspondences per preprocessed pair

Viewpoint

Stratified by camera displacement — from modest overlap to extreme wide baselines.

L1–L3

Progressively harder matching tasks with distractors and unmatched regions

Data sources

uCO3D	28.0%
ScanNet	27.7%
DL3DV	27.0%
RE10K	17.2%

Scene types

Indoor	55.1%
Object	28.0%
Outdoor	16.9%

Task levels

L1	32.5%
L2	36.8%
L3	30.7%

L1 · Unambiguous matching

One-to-one correspondences, with no distractors in either view.

L2 · Selective matching

Target-view distractors force the model to select the geometrically consistent counterpart.

L3 · Partial matching

Distractors and unmatched regions appear in both views, modeling occlusion and limited overlap.
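The three task levels can be read as progressively larger candidate sets built from the same verified correspondences. The sketch below illustrates that reading under stated assumptions: `build_task`, its arguments, and the use of `None` for "no match" are hypothetical and are not taken from the released code.

```python
def build_task(level, matched, src_extra=(), tgt_extra=()):
    """Assemble one matching task at difficulty level 1, 2, or 3.

    matched   : list of (source_id, target_id) verified correspondences
    src_extra : source-view regions with no counterpart in the target view
    tgt_extra : target-view regions with no counterpart in the source view
    Returns (queries, candidates, ground_truth), with None meaning "no match".
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    ground_truth = dict(matched)
    candidates = [target for _, target in matched]
    if level >= 2:
        # L2 and L3: distractors appear among the target-view candidates.
        candidates += list(tgt_extra)
    if level == 3:
        # L3: some source regions have no counterpart at all.
        for source in src_extra:
            ground_truth[source] = None
    return list(ground_truth), candidates, ground_truth

matched = [("s1", "t1"), ("s2", "t2")]
queries, candidates, gt = build_task(3, matched, src_extra=["s3"], tgt_extra=["t9"])
print(queries)     # ['s1', 's2', 's3']
print(candidates)  # ['t1', 't2', 't9']
print(gt)          # {'s1': 't1', 's2': 't2', 's3': None}
```

In practice the region labels would also need to be shuffled independently in each view so that IDs alone do not reveal the answer.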

On the hard subset, humans reach 84.0 F1 while the best model (DCRL) reaches 52.0 F1. The 32-point gap marks wide-baseline matching as an open challenge for MLLMs.

Results

Targeted RL lets an 8B model outperform much larger and proprietary baselines

DCRL reaches 70.5 F1 on ReasonMatch-Bench, outperforming all evaluated open- and closed-source MLLM baselines while transferring to related spatial-reasoning benchmarks.

Hard high-divergence subset (F1)

Human	84.0
DCRL	52.0
GPT-5-mini	37.2

Transfer beyond matching

OmniSpatial	43.6 → 48.9
MindCube	40.0 → 43.5
SAT Real	70.0 → 75.3

Main benchmark

ReasonMatch-Bench

Overall ReasonMatch-Bench F1 · Higher is better · 2,810 image pairs

Model / Setting	F1
Qwen3-VL-8B-Instruct	27.5
GPT-4o (241106)	33.5
Claude-4.5-Sonnet	41.7
Gemini-2.5-Pro	42.8
Qwen3-VL-235B	49.2
GPT-5-Chat	51.5
GPT-5-mini	57.9
Qwen3-VL-8B + DCRL	70.5

Overall F1 (0–100): bar chart of the values in the table above.

DCRL by task difficulty · selected F1

Slice	F1
Outdoor L1	90.9
Indoor L1	84.6
Outdoor L3	73.6
Indoor L3	67.0
Object L3	33.7

Full-benchmark overall F1 and selected DCRL difficulty slices. Object-centric L3 remains challenging.

Generalization & transfer

Held-out skills & datasets

Transfer to external spatial benchmarks · Skill transfers from matching to other spatial tasks

Benchmark	Base	DCRL	Gain
ReasonMatch	27.5	70.5	+43.0
OmniSpatial	43.6	48.9	+5.3
MindCube	40.0	43.5	+3.5
SAT Real	70.0	75.3	+5.3

Why reinforcement learning? · RL transfers under distribution shift; SFT does not

Strategy	OmniSpatial	SAT Real	ReasonMatch
Base	43.6	70.0	27.5
SFT	42.6	41.3	51.0
DCRL	48.9	75.3	70.5

Curriculum design

Ablation

Dynamic curriculum beats fixed-difficulty training · ReasonMatch-Bench F1 under the same RL setup

Training strategy	ReasonMatch F1	Δ vs. uniform
Uniform sampling	65.3	—
Easy-only samples	59.9	−5.4
Hard-only samples	62.3	−3.0
Dynamic curriculum	70.5	+5.2

Qualitative

Reasoning over a wide-baseline pair

Given two views of the same scene, the model must map region IDs from image A to image B. DCRL reasons over the multi-tier scene structure, whereas the GPT-5-mini baseline over-relies on the nearest anchors.

Figure: a cross-view matching example on a plant-shelf scene. Left: Qwen3-VL + DCRL, correct, via multi-tier shelf reasoning and viewpoint-invariant matching. Right: GPT-5-mini, wrong, from over-reliance on nearest anchors.

Open Release

Code, benchmark archives, and evaluation scripts

The release is designed for reproducible evaluation and recipe inspection. Training data is not included; training entry points are provided for users with compatible LMDB-formatted data.

Code repository

Paper-specific training, reward, buffer, and evaluation code built on top of a vendored verl stack.

Available on GitHub.

Benchmark data

Dataset release includes reasonmatch_bench.tar.gz and ood_dataset.tar.gz.

Available on Hugging Face and ModelScope.

Evaluation protocol

Evaluation expects an OpenAI-compatible chat endpoint and reports benchmark summaries from saved predictions.
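Scoring from saved predictions might look like the sketch below. It is a hedged illustration, not the released evaluation script. The JSON Lines layout with `pred` and `gt` fields, the use of `null` for "no match", and the precision and recall definitions are all assumptions, since the exact F1 definition is not stated here.

```python
import json

def match_f1(pred, gt):
    """F1 over matched regions for one image pair (assumed definition).

    A prediction is a true positive when it names the ground-truth target.
    Precision is taken over predicted matches, recall over true matches.
    """
    tp = sum(1 for q, t in pred.items() if t is not None and gt.get(q) == t)
    n_pred = sum(1 for t in pred.values() if t is not None)
    n_gt = sum(1 for t in gt.values() if t is not None)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical saved predictions: one JSON record per image pair.
saved = """\
{"pred": {"1": "A", "2": "C", "3": "D"}, "gt": {"1": "A", "2": "B", "3": null}}
{"pred": {"1": "A", "2": null}, "gt": {"1": "A", "2": null}}
"""

records = [json.loads(line) for line in saved.splitlines() if line.strip()]
scores = [match_f1(r["pred"], r["gt"]) for r in records]
benchmark_f1 = 100 * sum(scores) / len(scores)
print(f"{benchmark_f1:.1f}")  # 70.0
```

Averaging per-pair F1 is itself a design choice; pooling true positives across all pairs before computing F1 would weight pairs with many regions more heavily.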

Citation

BibTeX

If you find ReasonMatch useful, please cite our work.

@InProceedings{Zhong_2026_CVPR,
  author    = {Zhong, Hao and Zhu, Muzhi and Zeng, Shenyan and Li, Anzhou and Chen, Cong and Geng, Hua and Shi, Duochao and Ye, Wentao and Lin, Tao and Chen, Hao and Shen, Chunhua},
  title     = {Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {16768-16778}
}