# Spatial-MLLM

**Repository Path**: yu_shaonian/Spatial-MLLM

## Basic Information

- **Project Name**: Spatial-MLLM
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-24
- **Last Updated**: 2025-07-24

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ✨Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence✨

Diankun Wu<sup>1*</sup>, Fangfu Liu<sup>1*</sup>, Yi-Hsin Hung<sup>1</sup>, Yueqi Duan<sup>1</sup>

<sup>*</sup>Equal Contribution.

<sup>1</sup>Tsinghua University

![Teaser Visualization](assets/teaser-spatialmllm.png)

Spatial-MLLM: We propose Spatial-MLLM, a method that significantly enhances the visual-based spatial intelligence of existing video MLLMs. As shown, Spatial-MLLM can understand and reason about the underlying scene from video input and achieves SOTA performance on a wide range of spatial reasoning tasks.

## 📢 News

- 🎉[05/30/2025] We release [Spatial-MLLM-subset-sft](https://huggingface.co/Diankun/Spatial-MLLM-subset-sft), which is trained on a subset of our proposed Spatial-MLLM-120k dataset. We also release the evaluation code for VSI-Bench.
- 🔥[05/30/2025] We release "Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence". Check our [project page](https://diankun-wu.github.io/Spatial-MLLM/) and [arXiv paper](https://arxiv.org/pdf/).

## 🌟 Overview

![Pipeline Visualization](assets/pipeline-spatialmllm.png)

Overview of Spatial-MLLM. Our model is composed of a 2D visual encoder, a spatial encoder initialized from a feed-forward visual geometry foundation model, a connector, and a large language model backbone. At inference time, we use a space-aware frame sampling strategy to select spatially informative frames when the number of input frames is limited by GPU memory.

## 🎉 Performance

![Results Visualization](assets/eval_VSIbench.png)

![Results Visualization](assets/eval_scanqa_sqa3d.png)

## ⚙️ Setup

### 1. Clone Repository

```bash
git clone https://github.com/diankun-wu/Spatial-MLLM
cd Spatial-MLLM
```

### 2. Environment Setup

1. **Create conda environment:**

   ```bash
   conda create -n spatial-mllm python=3.10 -y
   conda activate spatial-mllm
   ```

2. **Install required packages for inference and evaluation:**

   ```bash
   # Adjust the CUDA version as needed
   pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
   pip install transformers==4.51.3 accelerate==1.5.2 qwen_vl_utils decord ray Levenshtein tyro
   pip install flash-attn --no-build-isolation
   ```

## 💻 Inference and Evaluation

### Inference

To run inference, use the provided script:

```bash
python scripts/inference.py
```

This will:

- Automatically download the `Spatial-MLLM-subset-sft` model from Hugging Face Hub.
- Process the input video and text prompt (specified in the script parameters) and generate the response.

The script uses bfloat16 precision by default and requires ~13GB of VRAM. For a full list of options, see the inline help:

```bash
python scripts/inference.py --help
```

### Evaluation on VSI-Bench

To evaluate the model on VSI-Bench, first download the VSI-Bench dataset and place it in the `evaluate/annotation/VSIBench` directory.
You can use the following commands:

```bash
# download the VSI-Bench dataset from Hugging Face
huggingface-cli download --resume-download nyu-visionx/VSI-Bench --local-dir evaluate/annotation/VSIBench --repo-type dataset

# extract the downloaded dataset
unzip evaluate/annotation/VSIBench/arkitscenes.zip -d evaluate/annotation/VSIBench
unzip evaluate/annotation/VSIBench/scannet.zip -d evaluate/annotation/VSIBench
unzip evaluate/annotation/VSIBench/scannetpp.zip -d evaluate/annotation/VSIBench
```

Then use the following command to evaluate the model:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Set the GPU devices you want to use
python evaluate/eval_vsibench.py \
    --model_path Diankun/Spatial-MLLM-subset-sft \
    --video_root evaluate/annotation/VSIBench \
    --model_type spatial-mllm-subset-sft \
    --batch_size 8
```

Alternatively, use the provided bash script:

```bash
bash scripts/evaluate_vsibench.sh
```

## 🚀 Todo List

- [ ] Release the full Spatial-MLLM model and the code for space-aware frame sampling.
- [ ] Release the evaluation code on ScanQA and SQA3D.
- [ ] Release the training code for Spatial-MLLM.
- [ ] Release the Spatial-MLLM-120k dataset and its creation scripts.
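
Before launching the VSI-Bench evaluation described above, it can help to confirm that the three archives were actually extracted under the directory passed as `--video_root`. The following is a minimal Python sketch and is not part of the repository. It assumes that each archive unpacks into a folder named after its zip file (`arkitscenes`, `scannet`, `scannetpp`), which this README does not state, and the helper name `missing_vsibench_dirs` is hypothetical.

```python
from pathlib import Path

# Assumption: each archive unpacks into a folder named after its zip file.
# Adjust these names if the extracted layout differs.
EXPECTED_SUBDIRS = ("arkitscenes", "scannet", "scannetpp")


def missing_vsibench_dirs(video_root, expected=EXPECTED_SUBDIRS):
    """Return the expected sub-directories that are absent under video_root."""
    root = Path(video_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Same path that the evaluation command passes as --video_root.
    missing = missing_vsibench_dirs("evaluate/annotation/VSIBench")
    if missing:
        print("Missing VSI-Bench folders:", ", ".join(missing))
    else:
        print("VSI-Bench layout looks complete.")
```

Run it from the repository root so that the relative path resolves. If any folder is reported missing, re-run the corresponding `unzip` command before starting the evaluation.
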
## 📚 Citation

If you find this work useful for your research and applications, please cite our paper using this BibTeX:

```bibtex
@article{wu2025spatialmllmboostingmllmcapabilities,
    title={Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence},
    author={Wu, Diankun and Liu, Fangfu and Hung, Yi-Hsin and Duan, Yueqi},
    journal={arXiv preprint arXiv:2505.23747},
    year={2025}
}
```

## Acknowledgements

Thanks to these great repositories: [thinking-in-space](https://github.com/vision-x-nyu/thinking-in-space), [VGGT](https://github.com/facebookresearch/vggt), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [open-r1](https://github.com/huggingface/open-r1), [R1-V](https://github.com/Deep-Agent/R1-V), [VLM-R1](https://github.com/om-ai-lab/VLM-R1) and many other inspiring works in the community.