SLAG

Scalable Language-Augmented Gaussian Splatting

Accepted to IEEE Robotics and Automation Letters (RA-L), 2025

1Stanford University

SLAG is a scalable method for augmenting large Gaussian splatting scenes with language features. Our method leverages multi-GPU parallelization to accelerate scene encoding and overcome the memory limitations of single-GPU systems.


exp_grasp
SLAG computes embeddings as normalized weighted averages of Gaussian parameters, enabling fully parallel processing (SAM-CLIP-Rasterization-Masking) with a simple aggregation step at the end. Bottom: LERF teatime, figurines and MegaNerf rubble scenes with corresponding query words and query responses.

Abstract

Language-augmented scene representations hold great promise for large-scale robotics applications such as search-and-rescue, smart cities, and mining. By enabling open-vocabulary natural language queries, language-augmented Gaussian splatting allows flexible retrieval of semantic information from large 3D scenes and integration with LLM agents. Many of these use cases are time-sensitive and data-intensive, demanding both rapid scene encoding and scalablility. Deploying such representationson for inference on robots with limited computational resources further compounds the challenge. To address this, we introduce SLAG—a multi-GPU framework for language-augmented Gaussian splatting that significantly improves the speed and scalability of language-augmenting large 3D Gaussian scenes.

Method Overview

Our method integrates 2D visual-language model features into 3D scenes using SAM and CLIP. Unlike prior approaches, SLAG eliminates the need for a loss function to compute per-Gaussian language embeddings. Instead, it derives embeddings from 3D Gaussian scene parameters via a normalized weighted average, enabling highly parallelized scene encoding. Additionally, we introduce a vector database for efficient language embedding feature storage and inference retrieval. To further improve scalability, we introduce a vector database that stores language embeddings for fast and efficient querying at inference time.


exp_grasp
Scene encoding flow with the SAM-CLIP-Rasterization-Parameter Masking phase and the Aggregation phase.

Quantitative Results

SLAG significantly accelerates scene encoding while maintaining state-of-the-art open-vocabulary segmentation performance. Our experiments demonstrate an 19× speedup in scene encoding on a 16-GPU setup compared to OpenGaussian, with no loss in embedding quality on the ScanNet and LERF datasets.


exp_grasp
Scene encoding time comparison of OpenGaussians with 1 GPU, SLAG with 1 GPU and SLAG with 16 GPUs (units are seconds). SLAG is 1.4× faster with 1 GPU and 19× faster with 16 GPUs.


exp_grasp
3D binary open vocabulary segmentation scores on LERF. SLAG is on par with state of the art methods.

BibTeX

@article{szilagyi2025slag,
      title={SLAG: Scalable Language-Augmented Gaussian Splatting}, 
      author={Szilagyi, Laszlo and Engelmann, Francis and Bohg, Jeannette},
      journal={IEEE Robotics and Automation Letters}, 
      year={2025},
      doi={10.1109/LRA.2025.3573203}
}