CVT-Track: Concentrating on Valid Tokens for One-Stream Tracking

Jianan Li; Xiaoying Yuan; Haolin Qin; Ying Wang; Xincong Liu; Tingfa Xu

doi:10.1109/TCSVT.2024.3452231

Jianan Li^*, Xiaoying Yuan, Haolin Qin, Ying Wang, Xincong Liu, Tingfa Xu^*

^* Corresponding author for this work

School of Optics and Photonics

Research output: Contribution to journal › Article › Peer-reviewed

2 citations (Scopus)

Abstract

In the domain of single object tracking, the Ground Truth bounding box is intentionally sized larger than the minimum dimensions required to enclose the target in the initial video frame, inadvertently including extraneous elements and interferences in the template image. Moreover, significant appearance changes of the target during movement present substantial challenges for maintaining robust tracking. To address these issues, this study introduces a novel one-stream tracking framework named CVT-Track. CVT-Track comprises two main components: the Target Valid Token Collection (TaVTC) and the Temporal Valid Token Collection (TeVTC) modules. The TaVTC module effectively mitigates background noise and interference from similar targets, thereby sharpening the focus on the target's unique features and enhancing tracking accuracy. Conversely, the TeVTC module skillfully extracts target information from historical frames, capturing the target's dynamic appearance changes throughout the tracking process and thereby improving tracking robustness. The synergistic operation of these modules markedly enhances both the accuracy and robustness of tracking. Empirical evaluations demonstrate that CVT-Track achieves state-of-the-art performance across multiple datasets and maintains superior inference speeds.

Original language	English
Pages (from-to)	33-44
Number of pages	12
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Volume	35
Issue	1
DOI	http://doi.org/10.1109/TCSVT.2024.3452231
Publication status	Published - 2025

Cite this

@article{d395c4ded6e14df68f591cb9a3d6d10b,

title = "CVT-Track: Concentrating on Valid Tokens for One-Stream Tracking",

abstract = "In the domain of single object tracking, the Ground Truth bounding box is intentionally sized larger than the minimum dimensions required to enclose the target in the initial video frame, inadvertently including extraneous elements and interferences in the template image. Moreover, significant appearance changes of the target during movement present substantial challenges for maintaining robust tracking. To address these issues, this study introduces a novel one-stream tracking framework named CVT-Track. CVT-Track comprises two main components: the Target Valid Token Collection (TaVTC) and the Temporal Valid Token Collection (TeVTC) modules. The TaVTC module effectively mitigates background noise and interference from similar targets, thereby sharpening the focus on the target's unique features and enhancing tracking accuracy. Conversely, the TeVTC module skillfully extracts target information from historical frames, capturing the target's dynamic appearance changes throughout the tracking process and thereby improving tracking robustness. The synergistic operation of these modules markedly enhances both the accuracy and robustness of tracking. Empirical evaluations demonstrate that CVT-Track achieves state-of-the-art performance across multiple datasets and maintains superior inference speeds.",

keywords = "One-stream tracking, temporal information, valid tokens, vision transformer",

author = "Jianan Li and Xiaoying Yuan and Haolin Qin and Ying Wang and Xincong Liu and Tingfa Xu",

note = "Publisher Copyright: {\textcopyright} 1991-2012 IEEE.",

year = "2025",

doi = "10.1109/TCSVT.2024.3452231",

language = "English",

volume = "35",

pages = "33--44",

journal = "IEEE Transactions on Circuits and Systems for Video Technology",

issn = "1051-8215",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "1",

}

TY - JOUR

T1 - CVT-Track

T2 - Concentrating on Valid Tokens for One-Stream Tracking

AU - Li, Jianan

AU - Yuan, Xiaoying

AU - Qin, Haolin

AU - Wang, Ying

AU - Liu, Xincong

AU - Xu, Tingfa

PY - 2025

Y1 - 2025

N2 - In the domain of single object tracking, the Ground Truth bounding box is intentionally sized larger than the minimum dimensions required to enclose the target in the initial video frame, inadvertently including extraneous elements and interferences in the template image. Moreover, significant appearance changes of the target during movement present substantial challenges for maintaining robust tracking. To address these issues, this study introduces a novel one-stream tracking framework named CVT-Track. CVT-Track comprises two main components: the Target Valid Token Collection (TaVTC) and the Temporal Valid Token Collection (TeVTC) modules. The TaVTC module effectively mitigates background noise and interference from similar targets, thereby sharpening the focus on the target's unique features and enhancing tracking accuracy. Conversely, the TeVTC module skillfully extracts target information from historical frames, capturing the target's dynamic appearance changes throughout the tracking process and thereby improving tracking robustness. The synergistic operation of these modules markedly enhances both the accuracy and robustness of tracking. Empirical evaluations demonstrate that CVT-Track achieves state-of-the-art performance across multiple datasets and maintains superior inference speeds.

AB - In the domain of single object tracking, the Ground Truth bounding box is intentionally sized larger than the minimum dimensions required to enclose the target in the initial video frame, inadvertently including extraneous elements and interferences in the template image. Moreover, significant appearance changes of the target during movement present substantial challenges for maintaining robust tracking. To address these issues, this study introduces a novel one-stream tracking framework named CVT-Track. CVT-Track comprises two main components: the Target Valid Token Collection (TaVTC) and the Temporal Valid Token Collection (TeVTC) modules. The TaVTC module effectively mitigates background noise and interference from similar targets, thereby sharpening the focus on the target's unique features and enhancing tracking accuracy. Conversely, the TeVTC module skillfully extracts target information from historical frames, capturing the target's dynamic appearance changes throughout the tracking process and thereby improving tracking robustness. The synergistic operation of these modules markedly enhances both the accuracy and robustness of tracking. Empirical evaluations demonstrate that CVT-Track achieves state-of-the-art performance across multiple datasets and maintains superior inference speeds.

KW - One-stream tracking

KW - temporal information

KW - valid tokens

KW - vision transformer

UR - http://www.scopus.com/pages/publications/85202732375

U2 - 10.1109/TCSVT.2024.3452231

DO - 10.1109/TCSVT.2024.3452231

M3 - Article

AN - SCOPUS:85202732375

SN - 1051-8215

VL - 35

SP - 33

EP - 44

JO - IEEE Transactions on Circuits and Systems for Video Technology

JF - IEEE Transactions on Circuits and Systems for Video Technology

IS - 1

ER -
