Visual Spatial Tuning

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which introduces additional overhead and usually harms general capabilities. To enhance spatial ability within general-purpose architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework that cultivates human-like visuospatial abilities in VLMs, from spatial perception to reasoning.
(1) We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos.
(2) Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space.
(3) Finally, we adopt a progressive training pipeline: supervised fine-tuning (SFT) to build foundational spatial knowledge, followed by reinforcement learning (RL) to further improve spatial reasoning abilities.
Without degrading general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. Furthermore, Vision-Language-Action (VLA) models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
Overview of the VST dataset. (a) The distribution of VST-P, which is used for SFT. (b) The distribution of VST-R, which is used for CoT cold start and RL. 'SR' denotes spatial reasoning, and 'GR' denotes general reasoning.

The VST-P dataset contains 4.1M samples across 19 different tasks for supervised fine-tuning, covering three primary vision scenarios, i.e., single-image, multi-image, and video. The VLM tuned on this dataset exhibits significantly enhanced fundamental spatial perception capabilities. Notably, there is a ~20% improvement on CVBench-3D, a ~5% increase on BLINK, and a ~16% gain on VSIBench.
The VST-R dataset contains 135K samples in two parts: one provides chain-of-thought (CoT) traces that teach the model how to reason, and the other provides rule-checkable data used during online RL to further improve reasoning. The VLM tuned on this dataset demonstrates significantly enhanced spatial reasoning abilities, with an 8.9% improvement on MMSI-Bench.
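As a rough illustration of what rule-checkable supervision could look like in the RL stage, the sketch below scores a response as 1.0 when its final answer matches the ground truth, using exact matching for categorical answers and a relative-error tolerance for numeric ones. The answer-tag convention, tolerance value, and function names are our own assumptions, not the reward actually used in VST.

```python
import re

def extract_answer(response: str) -> str:
    """Pull the final answer out of a model response.

    Assumes the model is prompted to wrap its answer in <answer>...</answer>;
    this tag convention is an assumption, not necessarily what VST uses.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

def rule_reward(response: str, ground_truth: str, numeric_tol: float = 0.10) -> float:
    """Rule-checkable reward: 1.0 if the prediction matches the ground truth.

    Categorical answers (e.g., multiple-choice letters) are compared
    case-insensitively; numeric answers (e.g., distances or sizes) pass if the
    relative error is within `numeric_tol`. Both rules are illustrative.
    """
    pred = extract_answer(response)
    try:
        pred_val, gt_val = float(pred), float(ground_truth)
        if gt_val == 0.0:
            return 1.0 if pred_val == 0.0 else 0.0
        return 1.0 if abs(pred_val - gt_val) / abs(gt_val) <= numeric_tol else 0.0
    except ValueError:
        return 1.0 if pred.lower() == ground_truth.strip().lower() else 0.0
```

A verifier of this kind can serve as the scalar reward in any online policy-optimization loop, since it requires only the model response and the stored ground-truth answer.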
VST exhibits superior performance in spatial perception and reasoning while maintaining strong competitiveness in general multi-modal understanding.
| Models | CVBench | 3DSRBench | MMSI-Bench | BLINK | VSIBench | MMStar | MMBench | RealWorldQA | MMMU | OCRBench | AI2D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 76.0 | 45.3 | 30.3 | 65.9 | 34.0 | 65.1 | 84.3 | 76.2 | 70.7 | 80.6 | 84.9 |
| Gemini-2.5-Pro | - | - | 36.9 | 70.6 | - | 77.5 | 90.1 | 78.0 | 81.7 | 86.6 | 88.4 |
| Seed1.5-VL | 85.2 | 61.6 | 29.7 | 72.1 | 41.5 | 77.8 | 89.9 | 78.4 | 77.9 | 86.1 | 87.3 |
| LLaVA-OneVision-7B | 61.9 | 54.4 | 26.6 | 48.2 | 32.4 | 61.7 | 80.8 | 66.3 | 48.8 | 62.2 | 81.4 |
| Qwen2.5-VL-3B | 71.8 | 50.2 | 26.5 | 47.6 | 29.6 | 55.9 | 79.9 | 65.4 | 47.9 | 79.7 | 81.6 |
| Qwen2.5-VL-7B | 75.4 | 53.2 | 25.9 | 56.4 | 38.9 | 63.9 | 83.5 | 68.5 | 58.6 | 86.4 | 83.9 |
| InternVL3-8B | 81.0 | 55.7 | 25.7 | 55.5 | 42.1 | 68.2 | 83.4 | 70.8 | 62.7 | 88.0 | 85.2 |
| MiMo-VL-7B-RL | 82.3 | 50.8 | 29.3 | 62.4 | 37.2 | 65.1 | 84.4 | 68.2 | 66.7 | 86.6 | 83.5 |
| SpaceR-7B | 74.8 | 53.3 | 20.1 | 55.4 | 43.5 | 61.6 | 84.3 | 64.7 | 53.1 | 85.9 | 85.5 |
| SPAR-8B | 80.7 | 57.5 | - | 43.9 | 41.1 | - | 79.9 | 64.7 | - | - | - |
| VST-3B-SFT (ours) | 84.4 | 54.1 | 30.2 | 59.1 | 57.9 | 58.0 | 80.9 | 68.4 | 45.2 | 83.7 | 82.5 |
| VST-3B-RL (ours) | 84.2 | 56.5 | 31.3 | 57.2 | 57.7 | 58.9 | 80.5 | 68.5 | 49.8 | 80.9 | 82.4 |
| VST-7B-SFT (ours) | 85.5 | 54.6 | 32.0 | 62.1 | 60.6 | 63.1 | 83.3 | 72.2 | 50.6 | 85.5 | 84.9 |
| VST-7B-RL (ours) | 86.5 | 60.1 | 34.8 | 62.6 | 61.2 | 63.5 | 83.0 | 68.5 | 49.4 | 86.1 | 83.5 |
Visuospatial tuning enables robust perception and reasoning over scenes, even without explicit 3D features.
| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5-Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| LLaVA-OneVision-7B | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LLaVA-Video-7B | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| Qwen2.5-VL-7B | 32.7 | 34.5 | 19.4 | 47.6 | 40.8 | 32.8 | 24.5 | 32.5 | 29.4 |
| SAT-7B | - | - | - | - | - | 47.3 | 41.1 | 37.1 | 36.1 |
| InternVL-Spatial-8B | - | 68.7 | 40.9 | 63.1 | 54.3 | 47.7 | - | 29.9 | 60.5 |
| SpaceR-7B | 43.5 | 61.9 | 28.6 | 60.9 | 35.2 | 38.2 | 46.0 | 31.4 | 45.6 |
| VILASR-7B | 45.4 | 63.5 | 34.4 | 60.6 | 30.9 | 48.9 | 45.2 | 30.4 | 49.2 |
| VLM-3R-7B | 60.9 | 70.2 | 49.4 | 69.2 | 67.1 | 65.4 | 80.5 | 45.4 | 40.1 |
| VST-3B-SFT (ours) | 57.9 | 69.3 | 45.4 | 71.8 | 62.4 | 59.0 | 46.0 | 38.7 | 70.2 |
| VST-3B-RL (ours) | 57.7 | 66.6 | 45.0 | 72.8 | 60.9 | 59.9 | 47.6 | 40.7 | 68.3 |
| VST-7B-SFT (ours) | 60.6 | 72.0 | 44.4 | 74.3 | 68.3 | 59.7 | 55.8 | 44.9 | 65.2 |
| VST-7B-RL (ours) | 61.2 | 71.6 | 43.8 | 75.5 | 69.2 | 60.0 | 55.6 | 44.3 | 69.2 |
VST shows strong 3D object detection abilities.

The integration of spatial knowledge provides a significant performance benefit to VLA models.
| VLA Backbone | LIBERO-spatial | LIBERO-object | LIBERO-goal | LIBERO-10 | Avg. |
|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 56.6 | 86.6 | 53.8 | 15.2 | 53.1 |
| Qwen2.5-VL-3B (VST, ours) | 65.0 (+8.4) | 88.4 (+1.8) | 67.8 (+14.0) | 25.6 (+10.4) | 61.7 (+8.6) |
Task: "pick up the black bowl between the plate and the ramekin and place it on the plate"
Task: "pick up the alphabet soup and place it in the basket"
Task: "open the middle drawer of the cabinet"
Task: "put both the cream cheese box and the butter in the basket"
We normalize the input data by projecting all images onto a virtual camera with a predefined, uniform FoV.
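A minimal sketch of this kind of FoV normalization, assuming a pinhole camera model: because the virtual camera shares the source camera's center and orientation, the re-projection reduces to a single homography built from the two intrinsic matrices. The 90° default FoV and the OpenCV-based warp below are illustrative choices, not the exact procedure used for VST.

```python
import cv2
import numpy as np

def project_to_virtual_camera(image: np.ndarray,
                              K_src: np.ndarray,
                              target_fov_deg: float = 90.0) -> np.ndarray:
    """Warp `image` onto a virtual pinhole camera with a fixed horizontal FoV.

    The virtual camera shares the source camera's optical center and
    orientation, so the mapping is the homography H = K_src @ inv(K_virt)
    applied to virtual-image coordinates. The 90-degree default FoV is an
    illustrative assumption.
    """
    h, w = image.shape[:2]
    # Focal length of the virtual camera for the requested horizontal FoV.
    f_virt = (w / 2.0) / np.tan(np.deg2rad(target_fov_deg) / 2.0)
    K_virt = np.array([[f_virt, 0.0, w / 2.0],
                       [0.0, f_virt, h / 2.0],
                       [0.0, 0.0, 1.0]])
    # H maps virtual-camera pixels to source pixels; WARP_INVERSE_MAP tells
    # OpenCV to use it directly as the dst->src mapping when sampling.
    H = K_src @ np.linalg.inv(K_virt)
    return cv2.warpPerspective(image, H, (w, h),
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```

Given source intrinsics `K_src`, calling `project_to_virtual_camera(img, K_src)` yields an image that behaves as if captured with the target FoV, so apparent sizes and angles become comparable across data sources.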
The scene caption describes the layout and spatial relationships within the image.
We provide RGB images, the corresponding BEV visualizations, detailed object information, and question-answer pairs to prompt the teacher VLM.
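The snippet below sketches one plausible way such a teacher prompt could be assembled: ground-truth object records are serialized to text and attached to the RGB and BEV images in a chat-style message. The field names, message schema, and instruction wording are hypothetical and only illustrate the kinds of inputs listed above.

```python
import json
from dataclasses import dataclass

@dataclass
class ObjectInfo:
    """Per-object metadata passed to the teacher VLM (fields are illustrative)."""
    category: str
    center_xyz: tuple  # object center in meters, scene coordinates
    size_whd: tuple    # width / height / depth in meters

def build_teacher_prompt(objects, question, answer):
    """Assemble a multimodal prompt for the teacher VLM.

    Returns a chat-style message list with image placeholders; the exact
    schema expected by the teacher model may differ.
    """
    object_block = json.dumps(
        [{"category": o.category, "center": o.center_xyz, "size": o.size_whd}
         for o in objects],
        indent=2,
    )
    instruction = (
        "You are given an RGB image, its bird's-eye-view (BEV) rendering, "
        "and ground-truth object information. Explain, step by step, how the "
        "spatial layout leads to the given answer.\n\n"
        f"Objects:\n{object_block}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": "rgb_image_placeholder"},
            {"type": "image", "image": "bev_image_placeholder"},
            {"type": "text", "text": instruction},
        ],
    }]
```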


@article{vst,
  title={Visual Spatial Tuning},
  author={Rui Yang and Ziyu Zhu and Yanwei Li and Jingjia Huang and Shen Yan and Siyuan Zhou and Zhe Liu and Xiangtai Li and Shuangye Li and Wenqian Wang and Yi Lin and Guang Shi and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2511.05491},
  year={2025}
}