Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing

Zibo Hu; Kun Gao; Jingyi Wang; Zhijia Yang; Zefeng Zhang; Haobo Cheng; Wei Li

doi:10.1109/JSTARS.2025.3575770

Zibo Hu, Kun Gao^*, Jingyi Wang, Zhijia Yang, Zefeng Zhang, Haobo Cheng, Wei Li

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.
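The abstract attributes most of the cost to attention in the neck. The following is a minimal sketch of how dense cross-modal attention cost scales with the number of visual tokens per pyramid level. It is not the paper's accounting: the function name, feature-map sizes, text length, and embedding width are all hypothetical values chosen for illustration, and linear projections are ignored.

```python
def cross_attention_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """Multiply-accumulates for the two matrix products in dense attention.

    Counts only Q @ K^T and (attention weights) @ V; the linear
    projections and the softmax itself are left out.
    """
    scores = num_queries * num_keys * dim    # Q @ K^T
    weighted = num_queries * num_keys * dim  # attention weights @ V
    return scores + weighted


# Hypothetical values, not taken from the paper.
level_shapes = [(100, 100), (50, 50), (25, 25), (13, 13)]  # (H, W) per level
num_text_tokens = 32
dim = 256

# Visual tokens of each level attend to the text tokens.
per_level = [
    cross_attention_macs(h * w, num_text_tokens, dim) for h, w in level_shapes
]
total = sum(per_level)
share_of_finest = per_level[0] / total

for (h, w), macs in zip(level_shapes, per_level):
    print(f"{h}x{w}: {macs:,} MACs")
print(f"total: {total:,} MACs, finest level share: {share_of_finest:.1%}")
```

Under these assumed sizes the highest-resolution level accounts for roughly three quarters of the cross-attention work, which illustrates why reducing computation on the multiscale visual features can lower the overall cost noticeably.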

Original language	English
Pages (from-to)	15291-15303
Number of pages	13
Journal	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Volume	18
DOI	http://doi.org/10.1109/JSTARS.2025.3575770
Publication status	Published - 2025
Externally published	Yes

Cite this

@article{c11e36be4f0c4dc08afe914d0de57b2b,

title = "Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing",

abstract = "Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7\% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.",

keywords = "Efficient cross-modality block, inverse pyramid feature refinement (IPFR), multiscale visual-cross-text fusion module (MSVCTFM), open-set object detection",

author = "Zibo Hu and Kun Gao and Jingyi Wang and Zhijia Yang and Zefeng Zhang and Haobo Cheng and Wei Li",

note = "Publisher Copyright: {\textcopyright} 2008-2012 IEEE.",

year = "2025",

doi = "10.1109/JSTARS.2025.3575770",

language = "English",

volume = "18",

pages = "15291--15303",

journal = "IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing",

issn = "1939-1404",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Enhanced Grounding DINO

T2 - Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing

AU - Hu, Zibo

AU - Gao, Kun

AU - Wang, Jingyi

AU - Yang, Zhijia

AU - Zhang, Zefeng

AU - Cheng, Haobo

AU - Li, Wei

PY - 2025

Y1 - 2025

N2 - Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.

AB - Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.

KW - Efficient cross-modality block

KW - inverse pyramid feature refinement (IPFR)

KW - multiscale visual-cross-text fusion module (MSVCTFM)

KW - open-set object detection

UR - http://www.scopus.com/pages/publications/105007296333

U2 - 10.1109/JSTARS.2025.3575770

DO - 10.1109/JSTARS.2025.3575770

M3 - Article

AN - SCOPUS:105007296333

SN - 1939-1404

VL - 18

SP - 15291

EP - 15303

JO - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

JF - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ER -
