TY - JOUR
T1 - Enhanced Grounding DINO
T2 - Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing
AU - Hu, Zibo
AU - Gao, Kun
AU - Wang, Jingyi
AU - Yang, Zhijia
AU - Zhang, Zefeng
AU - Cheng, Haobo
AU - Li, Wei
N1 - Publisher Copyright: © 2008-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Open-set object detection unifies candidate category object detection and remote sensing visual grounding, supporting both candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are built on candidate category detectors by introducing text information. These methods must process text and images at the same time, which increases their training overhead and computational complexity. An open-set detector consists of a backbone, a neck, and a prediction head, with the neck being the main source of computational complexity because of its multiscale self-attention and cross-modal attention. However, little research has focused on improving the computational efficiency of these detectors while maintaining model performance. This article addresses this gap by proposing an enhanced Grounding DINO that optimizes the neck network, reducing computational complexity while preserving model performance. Specifically, the key contribution is the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on the visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.
KW - Efficient cross-modality block
KW - inverse pyramid feature refinement (IPFR)
KW - multiscale visual-cross-text fusion module (MSVCTFM)
KW - open-set object detection
UR - http://www.scopus.com/pages/publications/105007296333
U2 - 10.1109/JSTARS.2025.3575770
DO - 10.1109/JSTARS.2025.3575770
M3 - Article
AN - SCOPUS:105007296333
SN - 1939-1404
VL - 18
SP - 15291
EP - 15303
JO - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
JF - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
ER -