SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation

1 Technical University of Munich, 2 Fudan University,
3 National University of Singapore, 4 Munich Center for Machine Learning

NeurIPS 2025

Abstract

LiDAR-based large-scale 3D scene generation has achieved great success by leveraging recent diffusion models. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes, and relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose SPIRAL, a novel range-view LiDAR diffusion model that simultaneously generates depth images, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that SPIRAL achieves state-of-the-art performance with the smallest parameter count, outperforming two-step methods that combine generative and segmentation models. Additionally, we show that range images generated by SPIRAL can be effectively used for synthetic data augmentation in downstream segmentation training, significantly reducing the labeling effort on LiDAR data.



Figure 1: Visualization of LiDAR scenes and their semantic labels jointly generated by SPIRAL, exhibiting high geometric fidelity and semantic-geometric consistency.

Method



(a) Two-step methods: Existing range-view LiDAR generative models typically generate only depth and reflectance images, requiring an additional pretrained segmentation model to predict semantic labels. (b) SPIRAL: In contrast, SPIRAL jointly generates depth, reflectance, and semantic maps. A closed-loop inference mechanism (highlighted by the dashed arrow) further improves cross-modal consistency.
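To make this contrast concrete, the following is a minimal interface-level sketch in Python. The function names, the callables they take, and the tensor shapes are hypothetical placeholders for illustration, not the released SPIRAL API.

def two_step_generation(lidar_generator, segmenter):
    """(a) Two-step baseline: sample an unlabeled range image first, then label it
    with a separately pretrained segmentation model (both arguments are hypothetical callables)."""
    depth_refl = lidar_generator()                    # (B, 2, H, W) depth + reflectance image
    semantics = segmenter(depth_refl).argmax(dim=1)   # labels predicted after the fact
    return depth_refl, semantics

def joint_generation(spiral_sampler):
    """(b) SPIRAL-style joint generation: a single diffusion model emits geometry,
    reflectance, and semantic logits from the same denoising trajectory."""
    depth_refl, semantic_logits = spiral_sampler()    # shared denoising for all modalities
    return depth_refl, semantic_logits.argmax(dim=1)

Because the labels in (b) come from the same network that generates the geometry, there is no separate segmentation model whose predictions can drift from the generated scene, which is the consistency issue the two-step pipeline suffers from.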

Pipeline



(a) Unconditional Step: SPIRAL takes noisy LiDAR scenes \(x_t\) as input and predicts both the semantic map \(\hat{y}_t\) and the noise \(\hat{\epsilon}_t\), where switch A is off and B is on. (b) Conditional Step: SPIRAL predicts \(\hat{\epsilon}_t\) conditioned on the given semantic map \(y\) smoothed by the progressive filter, where switch A is on and B is off. (c) Progressive Filter: During inference, SPIRAL begins in an open-loop mode with unconditional denoising. Once the predicted semantic map reaches high confidence, it switches to a closed-loop mode that alternates between unconditional and conditional steps, enhancing cross-modal consistency.
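As a rough illustration of this schedule, the sketch below implements the open-loop to closed-loop switch under several assumptions: model is a hypothetical network returning a (noise prediction, semantic logits) pair, scheduler is assumed to expose a diffusers-style step() interface, and progressive_filter is a simple placeholder for the paper's filter. It is a sketch under these assumptions, not the official SPIRAL implementation.

import torch
import torch.nn.functional as F

def progressive_filter(y_hat, t, num_steps):
    """Hypothetical stand-in for the progressive filter: smooth the semantic
    probabilities more aggressively at noisier (larger) timesteps t."""
    k = max(1, int(7 * t / num_steps)) | 1                 # odd kernel size, shrinks as t -> 0
    return F.avg_pool2d(y_hat, kernel_size=k, stride=1, padding=k // 2)

@torch.no_grad()
def spiral_style_sampling(model, scheduler, shape, conf_threshold=0.9, device="cuda"):
    """Jointly denoise depth/reflectance and semantics, switching from open-loop
    (unconditional only) to closed-loop (alternating conditional) denoising."""
    x_t = torch.randn(shape, device=device)                # noisy range image (depth + reflectance)
    num_steps = len(scheduler.timesteps)
    closed_loop, conditional_turn = False, False
    y_hat = None

    for t in scheduler.timesteps:
        if closed_loop and conditional_turn:
            # Conditional step (switch A on, B off): condition on the smoothed semantic map.
            y_cond = progressive_filter(y_hat, int(t), num_steps)
            eps_hat, _ = model(x_t, t, semantics=y_cond)
        else:
            # Unconditional step (switch A off, B on): predict noise and semantics jointly.
            eps_hat, y_logits = model(x_t, t, semantics=None)
            y_hat = y_logits.softmax(dim=1)
            if not closed_loop and y_hat.max(dim=1).values.mean() > conf_threshold:
                closed_loop = True                          # semantics confident: enter closed loop
        conditional_turn = closed_loop and not conditional_turn

        # Standard ancestral update (diffusers-style scheduler assumed).
        x_t = scheduler.step(eps_hat, t, x_t).prev_sample

    return x_t, y_hat.argmax(dim=1)                         # generated scene and its semantic labels

The confidence threshold that triggers the switch and the smoothing schedule of the filter are the two knobs this sketch exposes; the actual values and filter design should be taken from the paper.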

Visualization of the Generated LiDAR Scenes

scene 1

scene 2

scene 3

BibTeX

@article{zhu2025spiral,
      title={SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation},
      author={Zhu, Dekai and Hu, Yixuan and Liu, Youquan and Lu, Dongyue and Kong, Lingdong and Ilic, Slobodan},
      journal={arXiv preprint arXiv:2505.22643},
      year={2025}
}