Visual Spatial Tuning

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which introduces additional overhead and usually harms general capabilities. To enhance spatial ability within general-purpose architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework that cultivates human-like visuospatial abilities in VLMs, from spatial perception to reasoning.
(1) We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos.
(2) Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space.
(3) Finally, we adopt a progressive training pipeline: supervised fine-tuning (SFT) to build foundational spatial knowledge, followed by reinforcement learning (RL) to further improve spatial reasoning abilities.
Without degrading general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. Furthermore, Vision-Language-Action (VLA) models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
Overview of the VST dataset. (a) The distribution of VST-P, which is used for SFT. (b) The distribution of VST-R, which is used for CoT cold start and RL. 'SR' denotes spatial reasoning, and 'GR' denotes general reasoning.

The VST-P dataset contains 4.1M samples across 19 different tasks for supervised fine-tuning, covering three primary vision scenarios, i.e., single-image, multi-image, and video. The VLM tuned on this dataset exhibits significantly enhanced fundamental spatial perception capabilities. Notably, there is a ~20% improvement on CVBench-3D, a ~5% increase on BLINK, and a ~16% gain on VSIBench.
The VST-R dataset contains 135K samples in two parts: one provides chain-of-thought (CoT) traces that teach the model how to reason, and the other provides rule-checkable data used during online RL to further improve reasoning. The VLM tuned on this dataset demonstrates significantly enhanced spatial reasoning abilities, with an 8.9% improvement on MMSI-Bench.
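As a rough illustration of what rule-checkable supervision could look like in the RL stage, the sketch below scores a response as 1.0 when its final answer matches the ground truth, using exact matching for categorical answers and a relative-error tolerance for numeric ones. The answer-tag convention, tolerance value, and function names are our own assumptions, not the reward actually used in VST.

```python
import re

def extract_answer(response: str) -> str:
    """Pull the final answer out of a model response.

    Assumes the model is prompted to wrap its answer in <answer>...</answer>;
    this tag convention is an assumption, not necessarily what VST uses.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

def rule_reward(response: str, ground_truth: str, numeric_tol: float = 0.10) -> float:
    """Rule-checkable reward: 1.0 if the prediction matches the ground truth.

    Categorical answers (e.g., multiple-choice letters) are compared
    case-insensitively; numeric answers (e.g., distances or sizes) pass if the
    relative error is within `numeric_tol`. Both rules are illustrative.
    """
    pred = extract_answer(response)
    try:
        pred_val, gt_val = float(pred), float(ground_truth)
        if gt_val == 0.0:
            return 1.0 if pred_val == 0.0 else 0.0
        return 1.0 if abs(pred_val - gt_val) / abs(gt_val) <= numeric_tol else 0.0
    except ValueError:
        return 1.0 if pred.lower() == ground_truth.strip().lower() else 0.0
```

A verifier of this kind can serve as the scalar reward in any online policy-optimization loop, since it requires only the model response and the stored ground-truth answer.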
VST exhibits superior performance in spatial perception and reasoning while maintaining strong competitiveness in general multi-modal understanding.
| Models | CVBench | 3DSRBench | MMSI-Bench | BLINK | VSIBench | MMStar | MMBench | RealWorldQA | MMMU | OCRBench | AI2D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 76.0 | 45.3 | 30.3 | 65.9 | 34.0 | 65.1 | 84.3 | 76.2 | 70.7 | 80.6 | 84.9 |
| Gemini-2.5-Pro | - | - | 36.9 | 70.6 | - | 77.5 | 90.1 | 78.0 | 81.7 | 86.6 | 88.4 |
| Seed1.5-VL | 85.2 | 61.6 | 29.7 | 72.1 | 41.5 | 77.8 | 89.9 | 78.4 | 77.9 | 86.1 | 87.3 |
| LLaVA-OneVision-7B | 61.9 | 54.4 | 26.6 | 48.2 | 32.4 | 61.7 | 80.8 | 66.3 | 48.8 | 62.2 | 81.4 |
| Qwen2.5-VL-3B | 71.8 | 50.2 | 26.5 | 47.6 | 29.6 | 55.9 | 79.9 | 65.4 | 47.9 | 79.7 | 81.6 |
| Qwen2.5-VL-7B | 75.4 | 53.2 | 25.9 | 56.4 | 38.9 | 63.9 | 83.5 | 68.5 | 58.6 | 86.4 | 83.9 |
| InternVL3-8B | 81.0 | 55.7 | 25.7 | 55.5 | 42.1 | 68.2 | 83.4 | 70.8 | 62.7 | 88.0 | 85.2 |
| MiMo-VL-7B-RL | 82.3 | 50.8 | 29.3 | 62.4 | 37.2 | 65.1 | 84.4 | 68.2 | 66.7 | 86.6 | 83.5 |
| SpaceR-7B | 74.8 | 53.3 | 20.1 | 55.4 | 43.5 | 61.6 | 84.3 | 64.7 | 53.1 | 85.9 | 85.5 |
| SPAR-8B | 80.7 | 57.5 | - | 43.9 | 41.1 | - | 79.9 | 64.7 | - | - | - |
| VST-3B-SFT (ours) | 84.4 | 54.1 | 30.2 | 59.1 | 57.9 | 58.0 | 80.9 | 68.4 | 45.2 | 83.7 | 82.5 |
| VST-3B-RL (ours) | 84.2 | 56.5 | 31.3 | 57.2 | 57.7 | 58.9 | 80.5 | 68.5 | 49.8 | 80.9 | 82.4 |
| VST-7B-SFT (ours) | 85.5 | 54.6 | 32.0 | 62.1 | 60.6 | 63.1 | 83.3 | 72.2 | 50.6 | 85.5 | 84.9 |
| VST-7B-RL (ours) | 86.5 | 60.1 | 34.8 | 62.6 | 61.2 | 63.5 | 83.0 | 68.5 | 49.4 | 86.1 | 83.5 |
Visuospatial tuning enables robust perception and reasoning over scenes, even without explicit 3D features.
| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5-Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| LLaVA-OneVision-7B | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LLaVA-Video-7B | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| Qwen2.5-VL-7B | 32.7 | 34.5 | 19.4 | 47.6 | 40.8 | 32.8 | 24.5 | 32.5 | 29.4 |
| SAT-7B | - | - | - | - | - | 47.3 | 41.1 | 37.1 | 36.1 |
| InternVL-Spatial-8B | - | 68.7 | 40.9 | 63.1 | 54.3 | 47.7 | - | 29.9 | 60.5 |
| SpaceR-7B | 43.5 | 61.9 | 28.6 | 60.9 | 35.2 | 38.2 | 46.0 | 31.4 | 45.6 |
| VILASR-7B | 45.4 | 63.5 | 34.4 | 60.6 | 30.9 | 48.9 | 45.2 | 30.4 | 49.2 |
| VLM-3R-7B | 60.9 | 70.2 | 49.4 | 69.2 | 67.1 | 65.4 | 80.5 | 45.4 | 40.1 |
| VST-3B-SFT (ours) | 57.9 | 69.3 | 45.4 | 71.8 | 62.4 | 59.0 | 46.0 | 38.7 | 70.2 |
| VST-3B-RL (ours) | 57.7 | 66.6 | 45.0 | 72.8 | 60.9 | 59.9 | 47.6 | 40.7 | 68.3 |
| VST-7B-SFT (ours) | 60.6 | 72.0 | 44.4 | 74.3 | 68.3 | 59.7 | 55.8 | 44.9 | 65.2 |
| VST-7B-RL (ours) | 61.2 | 71.6 | 43.8 | 75.5 | 69.2 | 60.0 | 55.6 | 44.3 | 69.2 |
VST shows strong 3D object detection abilities.

The integration of spatial knowledge provides a significant performance benefit to VLA models.
| VLA Backbone | LIBERO-spatial | LIBERO-object | LIBERO-goal | LIBERO-10 | Avg. |
|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 56.6 | 86.6 | 53.8 | 15.2 | 53.1 |
| Qwen2.5-VL-3B (VST, ours) | 65.0 (+8.4) | 88.4 (+1.8) | 67.8 (+14.0) | 25.6 (+10.4) | 61.7 (+8.6) |
Task: "pick up the black bowl between the plate and the ramekin and place it on the plate"
Task: "pick up the alphabet soup and place it in the basket"
Task: "open the middle drawer of the cabinet"
Task: "put both the cream cheese box and the butter in the basket"
We normalize the input data by projecting all images onto a virtual camera with a predefined, uniform FoV.
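A minimal sketch of this kind of FoV normalization, assuming a pinhole camera model: because the virtual camera shares the source camera's center and orientation, the re-projection reduces to a single homography built from the two intrinsic matrices. The 90° default FoV and the OpenCV-based warp below are illustrative choices, not the exact procedure used for VST.

```python
import cv2
import numpy as np

def project_to_virtual_camera(image: np.ndarray,
                              K_src: np.ndarray,
                              target_fov_deg: float = 90.0) -> np.ndarray:
    """Warp `image` onto a virtual pinhole camera with a fixed horizontal FoV.

    The virtual camera shares the source camera's optical center and
    orientation, so the mapping is the homography H = K_src @ inv(K_virt)
    applied to virtual-image coordinates. The 90-degree default FoV is an
    illustrative assumption.
    """
    h, w = image.shape[:2]
    # Focal length of the virtual camera for the requested horizontal FoV.
    f_virt = (w / 2.0) / np.tan(np.deg2rad(target_fov_deg) / 2.0)
    K_virt = np.array([[f_virt, 0.0, w / 2.0],
                       [0.0, f_virt, h / 2.0],
                       [0.0, 0.0, 1.0]])
    # H maps virtual-camera pixels to source pixels; WARP_INVERSE_MAP tells
    # OpenCV to use it directly as the dst->src mapping when sampling.
    H = K_src @ np.linalg.inv(K_virt)
    return cv2.warpPerspective(image, H, (w, h),
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```

Given source intrinsics `K_src`, calling `project_to_virtual_camera(img, K_src)` yields an image that behaves as if captured with the target FoV, so apparent sizes and angles become comparable across data sources.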
The scene caption describes the layout and spatial relationships within the image.
We provide RGB images, the corresponding BEV visualizations, detailed object information, and question-answer pairs to prompt the teacher VLM.
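The snippet below sketches one plausible way such a teacher prompt could be assembled: ground-truth object records are serialized to text and attached to the RGB and BEV images in a chat-style message. The field names, message schema, and instruction wording are hypothetical and only illustrate the kinds of inputs listed above.

```python
import json
from dataclasses import dataclass

@dataclass
class ObjectInfo:
    """Per-object metadata passed to the teacher VLM (fields are illustrative)."""
    category: str
    center_xyz: tuple  # object center in meters, scene coordinates
    size_whd: tuple    # width / height / depth in meters

def build_teacher_prompt(objects, question, answer):
    """Assemble a multimodal prompt for the teacher VLM.

    Returns a chat-style message list with image placeholders; the exact
    schema expected by the teacher model may differ.
    """
    object_block = json.dumps(
        [{"category": o.category, "center": o.center_xyz, "size": o.size_whd}
         for o in objects],
        indent=2,
    )
    instruction = (
        "You are given an RGB image, its bird's-eye-view (BEV) rendering, "
        "and ground-truth object information. Explain, step by step, how the "
        "spatial layout leads to the given answer.\n\n"
        f"Objects:\n{object_block}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": "rgb_image_placeholder"},
            {"type": "image", "image": "bev_image_placeholder"},
            {"type": "text", "text": instruction},
        ],
    }]
```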


@article{vst,
  title={Visual Spatial Tuning},
  author={Rui Yang and Ziyu Zhu and Yanwei Li and Jingjia Huang and Shen Yan and Siyuan Zhou and Zhe Liu and Xiangtai Li and Shuangye Li and Wenqian Wang and Yi Lin and Guang Shi and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2511.05491},
  year={2025}
}