VALA

Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting

Sen Wang1,2, Kunyi Li1,2, Siyun Liang1, Elena Alegret1, Jing Ma4,
Nassir Navab1,2, Stefano Gasperini1,2,3
1Technical University of Munich 2Munich Center for Machine Learning 3VisualAIs 4Ludwig Maximilian University of Munich
3DV 2026

TL;DR. VALA assigns 2D language features to 3D Gaussians in a visibility-aware way and aggregates noisy multi-view features via a streaming cosine geometric median for consistent open-vocabulary 3D segmentation.

Abstract

Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions with 3D scenes, we observe two fundamental issues: background Gaussians that contribute negligibly to a rendered pixel receive the same feature as the dominant foreground ones, and view-specific noise in language embeddings causes multi-view inconsistencies. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.

Method Overview


Overview of VALA. We compute per-ray marginal contributions to obtain visibility-aware gating, and aggregate multi-view language features with a streaming cosine geometric median to reduce view-specific noise and improve cross-view consistency.
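To make the aggregation step more concrete, here is a minimal sketch of one way a weighted geometric median could be approximated in a streaming fashion on unit-normalized features. It is not the paper's update rule: the class name `StreamingCosineMedian`, the step size `weight / total_weight`, and the use of chord distance between unit vectors (which is monotonic in cosine distance) are all assumptions made for illustration. The key property it shows is that each new view moves the estimate by a bounded step, so an outlier view cannot pull the result further than an inlier can.

```python
import numpy as np

def _unit(v, eps=1e-12):
    """Scale a vector to unit length (hypothetical helper)."""
    v = np.asarray(v, dtype=np.float64)
    return v / max(np.linalg.norm(v), eps)

class StreamingCosineMedian:
    """Hypothetical online approximation of a weighted geometric median
    of unit vectors. Only the current estimate and the accumulated weight
    are stored, so memory does not grow with the number of views."""

    def __init__(self):
        self.median = None
        self.total_weight = 0.0

    def update(self, feature, weight=1.0):
        if weight <= 0.0:
            return self.median  # ignore views that carry no weight
        f = _unit(feature)
        self.total_weight += weight
        if self.median is None:
            self.median = f  # the first observation initializes the estimate
            return self.median
        diff = f - self.median
        dist = np.linalg.norm(diff)
        if dist > 0.0:
            # Step along the unit direction toward f. The step length depends
            # on the weight, not on the distance, and is capped at dist so the
            # estimate never overshoots the new observation.
            step = min(weight / self.total_weight, dist)
            self.median = _unit(self.median + step * diff / dist)
        return self.median

# Four consistent views and one noisy view of the same Gaussian (toy data).
views = [[1.0, 0.0]] * 4 + [[0.0, 1.0]]
est = StreamingCosineMedian()
for f in views:
    est.update(f)
mean_dir = _unit(np.sum(views, axis=0))
```

On this toy input the streaming estimate stays closer to the majority direction than the normalized mean `mean_dir` does, which is the kind of robustness to view-specific noise that a median-style aggregator is meant to provide. In practice the per-view `weight` could be any confidence score; which one VALA uses is not specified here.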

Visibility-aware Gating


VALA retains only visible Gaussians when propagating 2D language features, preventing background Gaussians (with negligible contribution) from inheriting foreground semantics.
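As a rough illustration of the idea, the sketch below computes standard alpha-compositing weights for the Gaussians along one ray and then applies a simple gate. The compositing formula is the usual front-to-back blending used in Gaussian splatting; the gate itself (keep a Gaussian if its weight is at least `tau` times the largest weight on the ray) and the names `marginal_contributions`, `visibility_gate`, and `tau` are assumptions for illustration, not the gating rule defined in the paper.

```python
import numpy as np

def marginal_contributions(alphas):
    """Blending weights w_i = alpha_i * prod_{j<i} (1 - alpha_j) for
    Gaussians sorted front to back along a single ray."""
    alphas = np.asarray(alphas, dtype=np.float64)
    # Transmittance reaching each Gaussian: 1 for the first one, then the
    # running product of (1 - alpha) over all Gaussians in front of it.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return alphas * transmittance

def visibility_gate(weights, tau=0.1):
    """Hypothetical gate: keep Gaussians whose weight is at least
    tau times the largest weight on the ray."""
    weights = np.asarray(weights, dtype=np.float64)
    return weights >= tau * weights.max()

# One nearly opaque foreground Gaussian followed by two occluded ones.
w = marginal_contributions([0.9, 0.5, 0.8])
keep = visibility_gate(w, tau=0.1)
```

Here the two Gaussians behind the foreground one have high opacity of their own but receive little transmittance, so their contributions to the pixel are small and the gate drops them. Only the Gaussians marked in `keep` would inherit the pixel's 2D language feature.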

Results

LeRF-OVS qualitative results

Qualitative results on LeRF-OVS.

ScanNet qualitative results

Qualitative results on ScanNet.

Outdoor dataset qualitative results

Supplementary qualitative results on an outdoor dataset (Waymo).

Additional ScanNet results

Supplementary qualitative results on ScanNet.

Stress test curve

Robustness under mask boundary corruptions. mIoU/mAcc (%) are shown on the left y-axis and Disp (lower is better) on the right y-axis. We vary the erosion/dilation radius r (in pixels). VALA degrades more slowly than both Occam’s and the VALA ablation without gating (VALA w/o G), while achieving lower Disp across severities.
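For readers who want to reproduce this kind of stress test, the following is a minimal sketch of how a binary mask boundary could be corrupted by erosion or dilation with radius r. The square (2r+1)×(2r+1) window, the sign convention (positive r dilates, negative r erodes), and the function names are assumptions; the exact corruption protocol used for the figure is not described here.

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square window."""
    mask = np.asarray(mask, dtype=bool)
    if r == 0:
        return mask.copy()
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, r):
    """Binary erosion, computed as dilation of the complement.
    Pixels outside the image are treated as foreground, so the
    image border itself does not erode the mask."""
    return ~dilate(~np.asarray(mask, dtype=bool), r)

def corrupt_boundary(mask, r):
    """Hypothetical convention: r > 0 dilates, r < 0 erodes, r = 0 copies."""
    return dilate(mask, r) if r >= 0 else erode(mask, -r)

# A 3x3 square object in a 7x7 image.
mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True
grown = corrupt_boundary(mask, 1)
shrunk = corrupt_boundary(mask, -1)
```

Sweeping r over a range of values and re-running feature assignment on the corrupted masks would yield one point per severity level, as on the x-axis of the curve.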

BibTeX

@misc{wang2025vala,
  title={Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting},
  author={Sen Wang and Kunyi Li and Siyun Liang and Elena Alegret and Jing Ma and Nassir Navab and Stefano Gasperini},
  year={2025},
  eprint={2509.05515},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.05515}
}