Abstract:
To improve the performance of object detection in complex scenes, a multimodal object detection model based on feature interaction and adaptive grouping fusion is proposed, combining deep learning with multimodal information fusion. The model takes paired infrared and visible images as input, builds a symmetric dual-branch feature extraction structure on the PP-LCNet backbone, and introduces a feature interaction module so that complementary information is exchanged between the two modalities during feature extraction. A binary grouping attention mechanism is then designed: global pooling combined with the sign function divides the output features of the interaction module into groups associated with their respective object categories, and a spatial attention mechanism enhances the object information within each group. Finally, corresponding feature groups are collected across scales, and multi-scale fusion is performed through adaptive weighting from deep to shallow layers; object prediction is then carried out on the fused features at each scale. Experimental results show that the proposed method effectively strengthens multimodal feature interaction, key feature enhancement, and multi-scale fusion, and that the model is more robust in complex scenes and generalizes better across different scenarios.
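As a rough illustration of the grouping and fusion steps summarized above, the following is a minimal PyTorch sketch. The module names, the learned per-channel score applied before the sign function, the per-group spatial attention, and the softmax-normalised fusion weights are assumptions made for illustration only, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Standard spatial attention: channel-wise max/mean maps -> conv -> sigmoid gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        max_map, _ = x.max(dim=1, keepdim=True)
        mean_map = x.mean(dim=1, keepdim=True)
        gate = torch.sigmoid(self.conv(torch.cat([max_map, mean_map], dim=1)))
        return x * gate


class BinaryGroupingAttention(nn.Module):
    """Assumed reading of the binary grouping attention: global pooling plus a learned
    per-channel score, sign() splits channels into two groups, and each group is
    enhanced by its own spatial attention before recombination."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)   # hypothetical per-channel scoring layer
        self.attn_pos = SpatialAttention()
        self.attn_neg = SpatialAttention()

    def forward(self, x):
        b, c, _, _ = x.shape
        score = self.fc(F.adaptive_avg_pool2d(x, 1).flatten(1))   # (B, C) global descriptor
        group = torch.sign(score).view(b, c, 1, 1)                # +1 / -1 group label per channel
        pos_mask = (group > 0).float()
        neg_mask = 1.0 - pos_mask
        return self.attn_pos(x * pos_mask) + self.attn_neg(x * neg_mask)


class AdaptiveFusion(nn.Module):
    """Deep-to-shallow fusion with softmax-normalised learnable weights per level
    (one plausible form of 'adaptive weighting'; not taken from the paper)."""
    def __init__(self, num_levels):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_levels, 2))  # (shallow, deep) pair per level

    def forward(self, feats):
        # feats: list ordered shallow -> deep, all with the same channel count
        fused = feats[-1]
        outputs = [fused]
        for i in range(len(feats) - 2, -1, -1):
            w = torch.softmax(self.weights[i], dim=0)
            up = F.interpolate(fused, size=feats[i].shape[-2:], mode="nearest")
            fused = w[0] * feats[i] + w[1] * up
            outputs.insert(0, fused)
        return outputs


if __name__ == "__main__":
    x = torch.randn(2, 64, 40, 40)
    print(BinaryGroupingAttention(64)(x).shape)                  # torch.Size([2, 64, 40, 40])
    feats = [torch.randn(2, 64, s, s) for s in (80, 40, 20)]
    print([f.shape for f in AdaptiveFusion(3)(feats)])           # fused maps at all three scales
```

In this sketch the grouping is hard (a binary mask from the sign of the pooled score), matching the abstract's description, while the fusion weights are learned jointly with the rest of the network; the actual model may differ in both respects.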