Efficient Multimodal Feature Refinement via Adaptive RGB-IR Interaction for Robust Drone Detection and Classification

Thien Huynh-The, Van-Phuc Hoang, Thanh-Dat Tran

Abstract



The rapid proliferation of unmanned aerial vehicles (UAVs) has intensified the need for robust surveillance systems that can distinguish drones from biological entities such as birds in unpredictable environments. Although multispectral vision offers a resilient alternative to unimodal sensors under adverse weather and lighting, existing architectures often struggle with cross-modal feature alignment and noise-induced spatial distortions. This paper proposes the Multispectral Attention Context and Receptive-field Network (MACR-Net), an ultra-lightweight multimodal framework for high-precision drone detection. MACR-Net introduces a Global-Local Cross-Scale Interaction (GLCI) module to capture multi-scale semantic context and a Multimodal Spatial Cross-Perception (MSCP) mechanism to adaptively fuse the RGB and infrared (IR) streams while preserving target-specific thermal and structural signatures. Furthermore, we design an improved hybrid neck that integrates Coordinate-Aware Attention (CAA) and Receptive Field Deformable (RFD) modules to anchor precise spatial coordinates and mitigate geometric distortions. Experimental results on the benchmark Multimodal Drone Detection Dataset demonstrate that MACR-Net outperforms state-of-the-art models, achieving a peak mAP_50 of 91.13% and an mAP_50-95 of 65.77%. The architecture also remains extremely compact, with only 2.77M parameters and 0.77 GFLOPs, striking a favorable balance between detection robustness and real-time feasibility for resource-constrained edge deployment.
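
For readers less familiar with the two reported metrics, the following minimal sketch illustrates how mAP_50 and mAP_50-95 differ in strictness. It assumes the common COCO-style convention, in which a detection counts as correct at a given threshold when its intersection over union (IoU) with the ground-truth box meets that threshold, and mAP_50-95 averages over thresholds from 0.50 to 0.95 in steps of 0.05. The abstract does not state the exact evaluation protocol, so this convention, the helper names `iou` and `thresholds_passed`, and the example boxes are all illustrative assumptions rather than the authors' code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style threshold grid: 0.50, 0.55, ..., 0.95 (10 values).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which the prediction would count as a true positive."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]


# Hypothetical boxes: a prediction shifted 2 pixels from a 40x40 target.
gt = (10, 10, 50, 50)
pred = (12, 12, 52, 52)
print(round(iou(pred, gt), 4))           # IoU of the shifted box
print(len(thresholds_passed(pred, gt)))  # how many of the 10 thresholds it meets
```

Under this convention, a small localization offset that is fully accepted at the 0.50 threshold can still be rejected at the stricter thresholds, which is why mAP_50-95 values are typically well below mAP_50 for the same detector.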




DOI: http://dx.doi.org/10.21553/rev-jec.451

Copyright (c) 2026 REV Journal on Electronics and Communications


ISSN: 1859-378X

Copyright © 2011-2025
Radio and Electronics Association of Vietnam
All rights reserved