UP-Fuse performs LiDAR–camera fusion for 3D panoptic segmentation by first bringing both sensors into a shared 2D range-view (spherical) representation. The raw LiDAR point cloud is projected into a dense range-view image and encoded with a hierarchical backbone to produce multi-scale geometric features. In parallel, each calibrated camera view is encoded to extract appearance cues, and these image features are geometrically aligned to the same range-view grid via a view-transformation step: LiDAR points are projected into the cameras to form sparse depth maps, which are densified by depth completion, back-projected into 3D in the LiDAR frame, and finally re-projected into the range view, creating a dense correspondence for warping multi-view image features. This yields pixel-aligned LiDAR and camera feature pyramids in a common coordinate system, enabling efficient, spatially consistent fusion while retaining the full 360° LiDAR context.
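The spherical range-view projection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the image size (`H`, `W`) and vertical field of view (`fov_up`, `fov_down`) are assumed values typical of a 64-beam sensor, and only the range channel is rasterised.

```python
# Hypothetical sketch of projecting a LiDAR point cloud into a dense
# range-view (spherical) image; parameters are illustrative assumptions.
import numpy as np

def lidar_to_range_view(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud into an (H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)           # range per point

    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))   # elevation angle

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up_r - fov_down_r

    # Normalise angles to pixel coordinates (u: column, v: row).
    u = 0.5 * (1.0 - yaw / np.pi) * W
    v = (1.0 - (pitch - fov_down_r) / fov) * H

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Keep the nearest point per pixel: write far-to-near so that
    # close points overwrite distant ones (a simple z-buffer).
    order = np.argsort(-r)
    range_img = np.full((H, W), -1.0, dtype=np.float32)  # -1 = empty pixel
    range_img[v[order], u[order]] = r[order]
    return range_img, u, v
```

The same `(u, v)` coordinates can serve as the dense correspondence used to warp camera features into the range-view grid once per-pixel depth is available.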
At the core of the method is an uncertainty-aware fusion module that decides not only where image information is useful, but also when it is trustworthy. Camera reliability is modelled by training a lightweight predictor to estimate feature instability under realistic, non-spatial camera degradations (e.g., photometric shifts, dropout, and domain-style changes). The resulting uncertainty map dynamically down-weights unreliable visual features inside a deformable cross-modal attention fusion, so that LiDAR queries selectively attend to visual evidence that is both relevant and reliable. The fused range-view representation is then processed by a hybrid 2D–3D panoptic decoder: a Mask2Former-style pixel and transformer decoder produces query-based mask predictions, while a 3D-aware mask head “unprojects” features back to points using local range-consistent neighbors, avoiding label bleeding across occlusion boundaries and reducing 360° wrap-around fragmentation. This design yields direct 3D panoptic masks that remain robust under camera corruption, calibration drift, or partial sensor failure.
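The uncertainty-gated attention can be illustrated with a small sketch. This is not the paper's module: the deformable sampling is omitted (plain dot-product attention stands in for it), the gating form `exp(-sigma)` and all shapes are assumptions, chosen only to show how a predicted uncertainty down-weights camera features before softmax normalisation.

```python
# Minimal sketch of uncertainty-gated cross-modal attention: LiDAR
# queries attend to warped camera features, with attention logits
# penalised by a predicted per-feature uncertainty. The gating form
# and shapes are illustrative assumptions, not the paper's design.
import numpy as np

def uncertainty_gated_attention(lidar_q, cam_kv, sigma):
    """
    lidar_q : (N, d)  LiDAR query features
    cam_kv  : (M, d)  warped camera features (keys == values here)
    sigma   : (M,)    predicted per-feature uncertainty (>= 0)
    """
    d = lidar_q.shape[1]
    logits = lidar_q @ cam_kv.T / np.sqrt(d)       # (N, M) similarity
    # Down-weight unreliable camera features: high sigma -> low weight.
    logits = logits + np.log(np.exp(-sigma) + 1e-8)
    # Numerically stable softmax over the camera features.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ cam_kv                           # (N, d) fused visual evidence
```

With `sigma = 0` this reduces to standard attention; as `sigma` grows for a camera feature, its contribution decays smoothly toward zero, so a fully corrupted view degrades gracefully to LiDAR-only evidence rather than injecting noise.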