A Two-Stage Cognitive Framework for Referring Multi-Object Tracking: Mimicking Human Cognitive Processes

Lian Zhang, Yuzhen Wu, Lingxue Wang*, Mingkun Chen, Yi Cai

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We propose a two-stage cognitive architecture for Referring Multi-Object Tracking (RMOT), inspired by how humans understand events. The framework distinguishes simple tasks, which can be comprehended rapidly, from complex tasks that require deeper analysis. In the fast-understanding phase, our method runs at 25 FPS using a language-guided detector based on the GroundingDINO model, which quickly infers detection instances of the primary targets specified by the input text. An association module then matches these instances to tracking trajectories; it combines an extra exiting decision mechanism with a Re-ID model that minimizes feature-extraction overhead, which significantly accelerates matching while maintaining high accuracy. In the subsequent slow-understanding phase, our approach re-evaluates the textual semantics against each target along its tracking trajectory, ensuring that detected objects are correctly matched to their descriptions. Notably, our method saves 12 minutes compared with the latest algorithms without compromising accuracy. Central to this two-stage framework is the cascade attention architecture within the knowledge unification module. It employs the agent attention mechanism, which lets the model focus selectively on relevant features in both the local object context and the global scene context. By dynamically weighting feature contributions, agent attention improves the model's ability to pick out critical information, and with it both tracking precision and contextual awareness. Overall, the two-stage cognitive architecture improves both performance and speed, with a relative performance improvement of 2.17 HOTA.
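The abstract names agent attention but does not spell out its computation. As a rough illustration only, the sketch below follows the commonly described form of the mechanism: a small set of agent tokens is pooled from the queries, the agents first aggregate information from the keys and values, and the queries then read from the agents. The function names, the average-pooling choice, and the single-head, bias-free setup are assumptions made for this example; they are not taken from the paper, whose cascade attention architecture may differ.

```python
import math


def softmax_rows(matrix):
    """Apply a numerically stable softmax to each row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_scores(x, y):
    """Compute x @ y^T / sqrt(d), the usual scaled dot-product scores."""
    d = len(x[0])
    y_t = [list(col) for col in zip(*y)]
    return [[s / math.sqrt(d) for s in row] for row in matmul(x, y_t)]


def pool_agents(queries, num_agents):
    """Average-pool contiguous groups of queries into agent tokens (assumed pooling)."""
    n = len(queries)
    if num_agents <= 0 or n % num_agents != 0:
        raise ValueError("number of queries must be a positive multiple of num_agents")
    size = n // num_agents
    agents = []
    for i in range(num_agents):
        group = queries[i * size:(i + 1) * size]
        agents.append([sum(col) / size for col in zip(*group)])
    return agents


def agent_attention(queries, keys, values, num_agents):
    """Single-head agent attention without bias terms (illustrative only)."""
    agents = pool_agents(queries, num_agents)
    # Step 1: agents act as queries and aggregate the values.
    agent_values = matmul(softmax_rows(scaled_scores(agents, keys)), values)
    # Step 2: original queries attend to the agents and read the aggregated values.
    return matmul(softmax_rows(scaled_scores(queries, agents)), agent_values)


# Hypothetical usage: 4 object-level query tokens, 3 scene-level key/value tokens.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]
out = agent_attention(q, k, v, num_agents=2)
```

With n queries, m keys, and a agents, the two softmax steps cost on the order of a·m + n·a score computations instead of n·m, which is the reason this family of mechanisms is attractive when a is small. In an RMOT setting the queries might be per-object features and the keys and values scene features, but that mapping is a guess here.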

Original language: English
Title of host publication: Tenth Symposium on Novel Optoelectronic Detection Technology and Applications
Editors: Chen Ping
Publisher: SPIE
ISBN (Electronic): 9781510688148
Publication status: Published - 2025
Event: 10th Symposium on Novel Optoelectronic Detection Technology and Applications - Taiyuan, China
Duration: 1 Nov 2024 to 3 Nov 2024

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 13511
ISSN (Print): 0277-786X
ISSN (Electronic): 1996-756X

Conference

Conference: 10th Symposium on Novel Optoelectronic Detection Technology and Applications
Country/Territory: China
City: Taiyuan
Period: 1/11/24 to 3/11/24

Keywords

  • Agent Attention
  • Referring Multi-Object Tracking
  • Two-stage Cognitive Architecture
