UP-Fuse performs LiDAR–camera fusion for 3D panoptic segmentation by first bringing both sensors into a shared 2D range-view (spherical) representation. The raw LiDAR point cloud is projected into a dense range-view image and encoded with a hierarchical backbone to produce multi-scale geometric features. In parallel, each calibrated camera view is encoded to extract appearance cues, and these image features are geometrically aligned to the same range-view grid via a view-transformation step: LiDAR points are projected into the cameras to form sparse depth maps, which are densified by depth completion, back-projected into 3D in the LiDAR frame, and finally re-projected into the range view, creating a dense correspondence for warping multi-view image features. This yields pixel-aligned LiDAR and camera feature pyramids in a common coordinate system, enabling efficient, spatially consistent fusion while retaining the full 360° LiDAR context.
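The spherical range-view projection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the image size (`H`, `W`) and vertical field of view (`fov_up`, `fov_down`) are assumed values typical of a 64-beam sensor, and only the range channel is rasterised.

```python
# Hypothetical sketch of projecting a LiDAR point cloud into a dense
# range-view (spherical) image; parameters are illustrative assumptions.
import numpy as np

def lidar_to_range_view(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud into an (H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)           # range per point

    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))   # elevation angle

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r

    # Normalise angles to pixel coordinates (u: column, v: row).
    u = 0.5 * (1.0 - yaw / np.pi) * W
    v = (1.0 - (pitch - fov_down_r) / fov) * H

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Keep the nearest point per pixel: write far-to-near so that
    # close points overwrite distant ones (a simple z-buffer).
    order = np.argsort(-r)
    range_img = np.full((H, W), -1.0, dtype=np.float32)  # -1 = empty pixel
    range_img[v[order], u[order]] = r[order]
    return range_img, u, v
```

The same `(u, v)` coordinates can serve as the dense correspondence used to warp camera features into the range-view grid once per-pixel depth is available.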
At the core of the method is an uncertainty-aware fusion module that decides not only where image information is useful, but also when it is trustworthy. Camera reliability is modelled by training a lightweight predictor to estimate feature instability under realistic, non-spatial camera degradations (e.g., photometric shifts, dropout, and domain-style changes). The resulting uncertainty map dynamically down-weights unreliable visual features inside a deformable cross-modal attention fusion, so that LiDAR queries selectively attend to visual evidence that is both relevant and reliable. The fused range-view representation is then processed by a hybrid 2D–3D panoptic decoder: a Mask2Former-style pixel and transformer decoder produces query-based mask predictions, while a 3D-aware mask head “unprojects” features back to points using local range-consistent neighbors, avoiding label bleeding across occlusion boundaries and reducing 360° wrap-around fragmentation. This design yields direct 3D panoptic masks that remain robust under camera corruption, calibration drift, or partial sensor failure.
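The uncertainty-gated attention can be illustrated with a small sketch. This is not the paper's module: the deformable sampling is omitted (plain dot-product attention stands in for it), the gating form `exp(-sigma)` and all shapes are assumptions, chosen only to show how a predicted uncertainty down-weights camera features before softmax normalisation.

```python
# Minimal sketch of uncertainty-gated cross-modal attention: LiDAR
# queries attend to warped camera features, with attention logits
# penalised by a predicted per-feature uncertainty. The gating form
# and shapes are illustrative assumptions, not the paper's design.
import numpy as np

def uncertainty_gated_attention(lidar_q, cam_kv, sigma):
    """
    lidar_q : (N, d)  LiDAR query features
    cam_kv  : (M, d)  warped camera features (keys == values here)
    sigma   : (M,)    predicted per-feature uncertainty (>= 0)
    """
    d = lidar_q.shape[1]
    logits = lidar_q @ cam_kv.T / np.sqrt(d)       # (N, M) similarity
    # Down-weight unreliable camera features: high sigma -> low weight.
    logits = logits + np.log(np.exp(-sigma) + 1e-8)
    # Numerically stable softmax over the camera features.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ cam_kv                           # (N, d) fused visual evidence
```

With `sigma = 0` this reduces to standard attention; as `sigma` grows for a camera feature, its contribution decays smoothly toward zero, so a fully corrupted view degrades gracefully to LiDAR-only evidence rather than injecting noise.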