TY - JOUR
T1 - ACNTrack
T2 - Agent cross-attention guided Multimodal Multi-Object Tracking with Neural Kalman Filter
AU - Zhang, Lian
AU - Wang, Lingxue
AU - Wu, Yuzhen
AU - Chen, Mingkun
AU - Zheng, Dezhi
AU - Cai, Yi
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/10/14
Y1 - 2025/10/14
AB - Exploring and associating the complementary information in visible, thermal infrared, and low-light images is crucial for advancing Multimodal Multi-Object Tracking (MMOT). While previous studies have shown that efficient feature fusion modules can bolster tracking performance in complex environments, these methods are often limited in global feature interaction and computational efficiency. We present a novel multimodal multi-object tracker based on the tracking-by-detection paradigm, comprising a multimodal detector and a data associator. A dual cross-attention feature fusion detection framework, built on an agent attention mechanism, is introduced to improve feature interaction efficiency and effectively capture cross-modal complementary information. To more accurately capture the detailed and complex information inherent in each modality, we propose a Feature Pyramid Shared Convolution (FPS-Conv) operation to replace the Spatial Pyramid Pooling Fast (SPPF) operation in the detector. Additionally, a Neural Kalman Filter (NKF), which dynamically adjusts process and observation noise according to the current motion state, is developed to improve the data associator. Our fusion architecture significantly reduces computational complexity while maintaining high-quality feature interactions, and the proposed NKF handles diverse motion patterns better than traditional fixed-parameter approaches. Experimental results validate these advantages: our method achieves state-of-the-art results on the KAIST, FLIR, and UniRTL test datasets and demonstrates competitive performance on the VT-MOT dataset.
KW - Feature Pyramid Shared Convolution
KW - Multi-object tracking
KW - Multimodal image
KW - Neural Kalman Filter
UR - http://www.scopus.com/pages/publications/105009515417
U2 - 10.1016/j.neucom.2025.130811
DO - 10.1016/j.neucom.2025.130811
M3 - Article
AN - SCOPUS:105009515417
SN - 0925-2312
VL - 650
JO - Neurocomputing
JF - Neurocomputing
M1 - 130811
ER -