
RESEARCH PRODUCT

Depth Attention for Scene Understanding

Zongwei Wu

subject

Multi-modal fusion; Deep learning; Computer vision; RGB-D fusion; Deep learning for computer vision; Computer vision and artificial intelligence; [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing; [INFO] Computer Science [cs]

description

Deep learning models can now perform many tasks, sometimes with better precision than humans. Among the modules of an intelligent machine, perception is the most essential: without it, the action modules cannot safely and precisely carry out the target task in complex scenes. Conventional perception systems rely on RGB images, which provide rich texture information about the 3D scene. However, the quality of RGB images depends heavily on environmental factors, which in turn affects the performance of deep learning models. In this thesis, we therefore aim to improve the performance and robustness of RGB models with complementary depth cues by proposing novel RGB-D fusion designs.

Traditionally, pixel-wise concatenation, addition, and convolution are the widely applied operations in RGB-D fusion designs. Inspired by the success of attention modules in deep networks, in this thesis we analyze and propose different depth-aware attention modules and demonstrate their effectiveness on basic segmentation tasks such as saliency detection and semantic segmentation. First, we leverage geometric cues and propose a novel depth-wise channel attention. We merge fine-grained details and semantic cues to constrain the channel attention to various local regions, improving the model's discriminability during feature extraction. Second, we investigate a depth-adapted offset that serves as a local but deformable spatial attention for convolution; it forces the network to take more relevant pixels into account with the help of the depth prior. Third, we improve contextual awareness within RGB-D fusion by leveraging transformer attention, and show that transformer attention improves model robustness against feature misalignment. Finally, we focus on the fusion architecture itself and propose an adaptive fusion design that learns the trade-off between early and late fusion with respect to depth quality, yielding a more robust way to merge RGB-D cues in deep networks.

Extensive comparisons on the reference benchmarks validate the effectiveness of the proposed methods against other fusion alternatives.
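The abstract does not give the exact formulation of the depth-guided channel attention, so the following is only a minimal NumPy sketch of the general idea: pool a channel descriptor from depth features, pass it through a small gating MLP, and use the resulting per-channel weights to rescale the RGB features. All names, shapes, and the random placeholder weights are assumptions for illustration, not the thesis's actual module.

```python
import numpy as np

rng = np.random.default_rng(0)

def depth_channel_attention(rgb_feat, depth_feat, w1, w2):
    """Reweight RGB feature channels using a descriptor pooled from depth.

    rgb_feat, depth_feat: (C, H, W) feature maps.
    w1, w2: weights of a small two-layer gating MLP (placeholders here).
    """
    # Global average pooling over the depth feature map -> (C,)
    desc = depth_feat.mean(axis=(1, 2))
    # Two-layer MLP with ReLU, then sigmoid -> per-channel gates in (0, 1)
    hidden = np.maximum(w1 @ desc, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Broadcast gates over spatial dims to rescale the RGB channels
    return rgb_feat * gates[:, None, None]

C, H, W = 8, 4, 4
rgb = rng.standard_normal((C, H, W))
depth = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 2, C)) * 0.1
w2 = rng.standard_normal((C, C // 2)) * 0.1

out = depth_channel_attention(rgb, depth, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the gates lie in (0, 1), the module can only attenuate channels, never amplify them; a residual connection is a common way to keep the original signal if that matters.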
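For the transformer-based fusion, a single-head cross-attention sketch conveys why attention tolerates feature misalignment: each RGB position aggregates over all depth positions, weighted by similarity, rather than reading a single fixed pixel. Again, this is a generic NumPy illustration under assumed shapes and projection matrices, not the architecture from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(rgb_tokens, depth_tokens, wq, wk, wv):
    """Single-head cross-attention: RGB tokens query depth tokens.

    rgb_tokens: (N, d); depth_tokens: (M, d); wq, wk, wv: (d, d) projections.
    Each output row mixes all depth positions, so a small spatial shift
    between the modalities only perturbs the attention weights.
    """
    q = rgb_tokens @ wq
    k = depth_tokens @ wk
    v = depth_tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M) rows sum to 1
    # Residual connection keeps the RGB stream dominant
    return rgb_tokens + attn @ v

N, M, d = 6, 6, 8
rgb_t = rng.standard_normal((N, d))
depth_t = rng.standard_normal((M, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = cross_attention_fusion(rgb_t, depth_t, wq, wk, wv)
print(fused.shape)  # (6, 8)
```

In practice the tokens would come from flattened feature maps of the two encoder streams, with multiple heads and layer normalization around the attention block.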

https://hal.science/tel-04066754