A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes

Lian Zhang; Yuzhen Wu; Lingxue Wang; Mingkun Chen; Yi Cai

doi:10.1117/12.3054715

Lian Zhang, Yuzhen Wu, Lingxue Wang*, Mingkun Chen, Yi Cai

*Corresponding author for this work

School of Optics and Photonics

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this study, we propose a two-stage cognitive architecture for Referring Multi-Object Tracking (RMOT), inspired by human cognitive processes in event understanding. The framework distinguishes between simple tasks, which can be comprehended rapidly, and complex tasks, which require deeper analysis. In the fast-understanding phase, our method achieves 25 FPS using a language-guided detector based on the GroundingDINO model, which quickly infers detection instances of the primary targets specified by the input text. These instances are matched with tracking trajectories by an association module that combines an additional exit decision mechanism with a Re-ID model designed for minimal feature-extraction overhead, which accelerates matching while maintaining high accuracy. In the subsequent slow-understanding phase, our approach re-evaluates the textual semantics relative to each target along its tracking trajectory, so that detected objects are correctly correlated with their descriptions. Notably, our method saves 12 minutes compared to the latest algorithms, improving efficiency without compromising accuracy. Central to this two-stage framework is the cascade attention architecture within the knowledge unification module. We employ the agent attention mechanism, which enables the model to focus selectively on relevant features in both local object and global scene contexts. By dynamically weighting feature contributions, agent attention enhances the model's ability to discern critical information, improving both tracking precision and contextual awareness. Overall, our two-stage cognitive architecture improves both performance and speed, achieving a relative performance improvement of 2.17 HOTA.
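
The agent attention mechanism named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: it assumes the commonly described formulation in which a small set of agent tokens first aggregates information from the keys and values, and the queries then read from those agents. The function names, the toy inputs, and the choice to pass the agent tokens in directly are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain nested-list matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Standard scaled dot-product attention on nested lists."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def agent_attention(q, k, v, agents):
    """Two-step attention routed through a small set of agent tokens.

    `agents` is a hypothetical input here; how agent tokens are produced
    is not specified in the abstract.
    """
    # Step 1: agents act as queries and summarise the keys/values.
    agent_values = attention(agents, k, v)
    # Step 2: the original queries attend to the agents instead of all keys.
    return attention(q, agents, agent_values)


if __name__ == "__main__":
    # Toy example: 3 query tokens, 3 key/value tokens, 1 agent token.
    q = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    agents = [[0.2, 0.1]]
    for row in agent_attention(q, k, v, agents):
        print(row)
```

With n queries, m keys and a agent tokens, the two steps compute score matrices of size a x m and n x a rather than a single n x m matrix, which is the usual motivation for routing attention through a small agent set.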

Original language	English
Title of host publication	Tenth Symposium on Novel Optoelectronic Detection Technology and Applications
Editors	Chen Ping
Publisher	SPIE
ISBN (Electronic)	9781510688148
DOIs	http://doi.org/10.1117/12.3054715
Publication status	Published - 2025
Event	10th Symposium on Novel Optoelectronic Detection Technology and Applications - Taiyuan, China; Duration: 1 Nov 2024 → 3 Nov 2024

Publication series

Name	Proceedings of SPIE - The International Society for Optical Engineering
Volume	13511
ISSN (Print)	0277-786X
ISSN (Electronic)	1996-756X

Conference

Conference	10th Symposium on Novel Optoelectronic Detection Technology and Applications
Country/Territory	China
City	Taiyuan
Period	1 Nov 2024 → 3 Nov 2024

Keywords

Agent Attention
Referring Multi-Object Tracking
Two-stage Cognitive Architecture

Cite this

Zhang, L., Wu, Y., Wang, L., Chen, M., & Cai, Y. (2025). A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes. In C. Ping (Ed.), Tenth Symposium on Novel Optoelectronic Detection Technology and Applications, Article 135110L (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 13511). SPIE. http://doi.org/10.1117/12.3054715

@inproceedings{4fae302baaa54295836668e72e78691c,

title = "A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes",

keywords = "Agent Attention, Referring Multi-Object Tracking, Two-stage Cognitive Architecture",

author = "Lian Zhang and Yuzhen Wu and Lingxue Wang and Mingkun Chen and Yi Cai",

note = "Publisher Copyright: {\textcopyright} 2025 SPIE; 10th Symposium on Novel Optoelectronic Detection Technology and Applications; Conference date: 01-11-2024 through 03-11-2024",

year = "2025",

doi = "10.1117/12.3054715",

language = "English",

series = "Proceedings of SPIE - The International Society for Optical Engineering",

publisher = "SPIE",

editor = "Chen Ping",

booktitle = "Tenth Symposium on Novel Optoelectronic Detection Technology and Applications",

address = "United States",

}

Zhang, L, Wu, Y, Wang, L, Chen, M & Cai, Y 2025, A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes. in C Ping (ed.), Tenth Symposium on Novel Optoelectronic Detection Technology and Applications, 135110L, Proceedings of SPIE - The International Society for Optical Engineering, vol. 13511, SPIE, 10th Symposium on Novel Optoelectronic Detection Technology and Applications, Taiyuan, China, 1/11/24. http://doi.org/10.1117/12.3054715

A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes. / Zhang, Lian; Wu, Yuzhen; Wang, Lingxue et al.
Tenth Symposium on Novel Optoelectronic Detection Technology and Applications. ed. / Chen Ping. SPIE, 2025. 135110L (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 13511).

TY - GEN

T1 - A Two-Stage Cognitive Framework for Referring Multi-Object Tracking

T2 - 10th Symposium on Novel Optoelectronic Detection Technology and Applications

AU - Zhang, Lian

AU - Wu, Yuzhen

AU - Wang, Lingxue

AU - Chen, Mingkun

AU - Cai, Yi

PY - 2025

Y1 - 2025

KW - Agent Attention

KW - Referring Multi-Object Tracking

KW - Two-stage Cognitive Architecture

UR - http://www.scopus.com/pages/publications/85219217667

U2 - 10.1117/12.3054715

DO - 10.1117/12.3054715

M3 - Conference contribution

AN - SCOPUS:85219217667

T3 - Proceedings of SPIE - The International Society for Optical Engineering

BT - Tenth Symposium on Novel Optoelectronic Detection Technology and Applications

A2 - Ping, Chen

PB - SPIE

Y2 - 1 November 2024 through 3 November 2024

ER -
