MCSSAFNet: A multi-scale state-space attention fusion network for RGBT tracking

Chunbo Zhao, Bo Mo*, Dawei Li, Xinchun Wang, Jie Zhao, Junwei Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Most current cross-modal feature fusion methods take only the deep features from the last layer of the backbone network as input, ignoring the detailed information contained in the backbone's shallow features. This limits the model's ability to cope with the challenges posed by rapid target changes in cross-modal images. To address this problem, this paper proposes a novel tracker based on a Multi-scale State-Space Attention Fusion Network (MCSSAFNet), which introduces Mamba to learn and fuse feature information from different modalities at different scales. On this basis, an adaptive-aware loss function is proposed. It first adaptively weights the classification loss, mitigating the imbalance between classification and localization scores by directing more learning attention to hard samples and thereby improving discrimination of difficult targets. It then adaptively weights the IoU loss, strengthening the learning of high-quality samples while still improving that of low-quality ones, which in turn raises the model's IoU accuracy. Comprehensive experiments on four mainstream public RGBT tracking datasets (RGBT210, RGBT234, LasHeR, and VTUAV) show that the proposed algorithm outperforms existing trackers while running at 37 fps on an RTX 3090 GPU.
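To make the adaptive-aware loss concrete, the PyTorch sketch below pairs a focal-style weight on the classification loss (up-weighting hard, low-confidence samples) with an IoU-dependent weight on the regression loss (keeping a strong signal for high-quality boxes while leaving a floor for low-quality ones). This is a minimal illustration under assumed weighting formulas: the function names, the `gamma` exponent, and the clamped IoU weight are hypothetical stand-ins, not the paper's exact formulation.

```python
# Minimal sketch of an "adaptive-aware" loss. The specific weighting
# schemes are illustrative assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F


def box_iou(a, b, eps=1e-7):
    """IoU of matched (x1, y1, x2, y2) box pairs, both of shape (N, 4)."""
    lt = torch.max(a[:, :2], b[:, :2])            # top-left of intersection
    rb = torch.min(a[:, 2:], b[:, 2:])            # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)


def adaptive_cls_loss(logits, targets, gamma=2.0):
    """Classification loss whose per-sample weight grows for hard samples.

    `gamma` (assumed) controls how strongly low-confidence samples are
    up-weighted, in the spirit of focal weighting.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)   # prob. of the true class
    weight = (1 - p_t) ** gamma                   # hard samples -> large weight
    return (weight * ce).mean()


def adaptive_iou_loss(pred_boxes, gt_boxes):
    """IoU loss re-weighted by the IoU itself: high-quality samples keep a
    strong gradient, and the clamp floor keeps low-quality ones learning."""
    iou = box_iou(pred_boxes, gt_boxes)
    weight = iou.detach().clamp(min=0.1)          # adaptive, gradient-free weight
    return (weight * (1.0 - iou)).sum() / weight.sum()
```

A total loss would then combine the two terms, e.g. `adaptive_cls_loss(logits, targets) + lam * adaptive_iou_loss(pred, gt)` with a trade-off coefficient `lam`; the balance between the terms is likewise an assumption here.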

Original language: English
Article number: 131394
Journal: Optics Communications
Volume: 577
Publication status: Published - Mar 2025

Keywords

  • Adaptive-aware loss
  • Mamba
  • Multiscale fusion
  • RGBT tracking
  • State space modeling
