Open-Vocabulary Segmentation of Aerial Point Clouds
Alami, Ashkan; Remondino, Fabio
2026-01-01
Abstract
The growing diversity and dynamics of urban environments demand 3D semantic segmentation methods that can recognize a wide range of objects without relying on predefined classes or on time-consuming manually labelled training data. As urban scenes evolve and application requirements vary across locations, flexible, annotation-free 3D segmentation methods are becoming increasingly desirable for large-scale 3D analytics. This work presents the first training-free, open-vocabulary (OV) method for 3D aerial point cloud classification and benchmarks it against state-of-the-art supervised 3D neural networks for the semantic enrichment of these geospatial data. The proposed approach leverages open-vocabulary object recognition across multiple 2D images and subsequently projects and refines these detections in 3D space, enabling semantic labelling without prior class definitions or annotated data. In contrast, the supervised baselines are trained on labelled datasets and restricted to a fixed set of object categories. We evaluate all methods with quantitative metrics and qualitative analysis, highlighting their respective strengths, limitations and suitability for scalable urban 3D mapping. By removing the dependency on annotated data and fixed taxonomies, this work represents a key step toward adaptive, scalable and semantic understanding of 3D urban environments.
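The core 2D-to-3D transfer step described in the abstract — projecting points into each view and aggregating per-view labels — can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a simple pinhole camera model (intrinsics `K`, rotation `R`, translation `t`), per-view 2D label masks, and a plain majority vote across views, with no occlusion handling or refinement.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project Nx3 world points into pixel coordinates with a pinhole model."""
    cam = (R @ points.T + t.reshape(3, 1)).T        # world -> camera frame
    in_front = cam[:, 2] > 0                        # keep points ahead of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                     # perspective divide
    return uv, in_front

def label_points(points, views):
    """Assign each 3D point a label by majority vote over 2D label masks.

    `views` is a list of (K, R, t, mask) tuples, one per image; `mask` is an
    HxW integer array of per-pixel class IDs (hypothetical input format).
    """
    votes = {}
    for K, R, t, mask in views:
        uv, ok = project_points(points, K, R, t)
        px = np.round(uv).astype(int)
        h, w = mask.shape
        inside = (ok & (px[:, 0] >= 0) & (px[:, 0] < w)
                     & (px[:, 1] >= 0) & (px[:, 1] < h))
        for i in np.flatnonzero(inside):
            votes.setdefault(i, []).append(int(mask[px[i, 1], px[i, 0]]))
    # majority vote per point; -1 marks points unseen in every view
    out = np.full(len(points), -1, dtype=int)
    for i, v in votes.items():
        out[i] = np.bincount(v).argmax()
    return out
```

In practice the paper's pipeline also refines the projected detections in 3D space; this sketch only shows the bare projection-and-vote idea.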
