TY - JOUR
T1 - ACNTrack
T2 - Agent cross-attention guided Multimodal Multi-Object Tracking with Neural Kalman Filter
AU - Zhang, Lian
AU - Wang, Lingxue
AU - Wu, Yuzhen
AU - Chen, Mingkun
AU - Zheng, Dezhi
AU - Cai, Yi
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/10/14
Y1 - 2025/10/14
AB - Exploring and associating the complementary information in visible, thermal infrared, and low-light images is crucial for advancing Multimodal Multi-Object Tracking (MMOT). While previous studies have shown that efficient feature fusion modules can bolster tracking performance in complex environments, these methods are often limited in global feature interaction and computational efficiency. We present a novel multimodal multi-object tracker based on the tracking-by-detection paradigm, comprising a multimodal detector and a data associator. A dual cross-attention feature fusion detection framework, built on an agent attention mechanism, is introduced to improve feature interaction efficiency and effectively capture cross-modal complementary information. To more accurately capture the detailed and complex information inherent in each modality, we propose a Feature Pyramid Shared Convolution (FPS-Conv) operation to replace the Spatial Pyramid Pooling Fast (SPPF) operation in the detector. Additionally, a Neural Kalman Filter (NKF), which dynamically adjusts process and observation noise according to the current motion state, is developed to improve the data associator. Our fusion architecture significantly reduces computational complexity while maintaining high-quality feature interactions, and the proposed NKF handles diverse motion patterns better than traditional fixed-parameter approaches. Experimental results validate these advantages: our method achieves state-of-the-art results on the KAIST, FLIR, and UniRTL test datasets and demonstrates competitive performance on the VT-MOT dataset.
KW - Feature Pyramid Shared Convolution
KW - Multi-object tracking
KW - Multimodal image
KW - Neural Kalman Filter
UR - http://www.scopus.com/pages/publications/105009515417
U2 - 10.1016/j.neucom.2025.130811
DO - 10.1016/j.neucom.2025.130811
M3 - Article
AN - SCOPUS:105009515417
SN - 0925-2312
VL - 650
JO - Neurocomputing
JF - Neurocomputing
M1 - 130811
ER -