Language-augmented scene representations hold significant potential for robotics applications such as search and rescue, smart cities, and mining, where fast, natural language-based queries over large areas are crucial. Existing methods often struggle with slow embedding speeds and limited scene sizes due to single-GPU constraints. Additionally, deploying these embeddings on resource-constrained edge devices like NVIDIA Jetson remains a challenge.
To address these issues, we propose SLAG, a multi-GPU framework for language-augmented Gaussian splatting that enables faster, scalable embedding of large scenes. We build on prior methods that map 2D visual-language model embeddings to 3D scenes using SAM and CLIP. Unlike approaches that rely on optimization-based embedding computation, our method simplifies the process by computing each embedding as a normalized, weighted sum of masked language features across multiple viewpoints. We also integrate a vector database for efficient storage and retrieval of embeddings, with a partitioning mechanism that supports deployment on mobile robots with limited resources. Our experiments show an 18x speedup in embedding computation on a 16-GPU setup while achieving state-of-the-art embedding quality. These results highlight the effectiveness of our method for real-world robotics applications.
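The per-Gaussian aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-view masked CLIP features and per-view weights (e.g. each Gaussian's contribution to the corresponding SAM mask) have already been extracted, and all names and shapes are illustrative.

```python
import numpy as np

def aggregate_gaussian_embedding(features: np.ndarray,
                                 weights: np.ndarray) -> np.ndarray:
    """Fuse per-view language features for one Gaussian into a single
    embedding as a normalized, weighted sum (hypothetical helper).

    features: (V, D) array of masked CLIP features, one row per viewpoint.
    weights:  (V,) array of per-view weights, e.g. mask-overlap scores.
    """
    # Weighted sum over the V viewpoints, then L2-normalize.
    summed = (weights[:, None] * features).sum(axis=0)
    norm = np.linalg.norm(summed)
    return summed / norm if norm > 0 else summed

# Toy example: 3 viewpoints, 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
w = np.array([0.5, 0.3, 0.2])
emb = aggregate_gaussian_embedding(feats, w)
```

Because the fusion is a closed-form sum rather than an optimization loop, it parallelizes trivially across Gaussians and GPUs, which is what enables the multi-GPU scaling reported above.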
SLAG achieves a significant speedup in embedding computation while maintaining state-of-the-art embedding quality.
@article{szilagyi2024slag,
title={SLAG: Scaling Language Augmented Gaussian Splatting},
author={Laszlo Szilagyi and Francis Engelmann and Jeannette Bohg},
journal={},
year={},
url={}
}