Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing

Zibo Hu; Kun Gao; Jingyi Wang; Zhijia Yang; Zefeng Zhang; Haobo Cheng; Wei Li

doi:10.1109/JSTARS.2025.3575770

Zibo Hu, Kun Gao*, Jingyi Wang, Zhijia Yang, Zefeng Zhang, Haobo Cheng, Wei Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.

Original language	English
Pages (from-to)	15291-15303
Number of pages	13
Journal	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Volume	18
DOIs	http://doi.org/10.1109/JSTARS.2025.3575770
Publication status	Published - 2025
Externally published	Yes

Keywords

Efficient cross-modality block
inverse pyramid feature refinement (IPFR)
multiscale visual-cross-text fusion module (MSVCTFM)
open-set object detection

Cite this

@article{c11e36be4f0c4dc08afe914d0de57b2b,

title = "Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing",

abstract = "Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7\% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.",

keywords = "Efficient cross-modality block, inverse pyramid feature refinement (IPFR), multiscale visual-cross-text fusion module (MSVCTFM), open-set object detection",

author = "Zibo Hu and Kun Gao and Jingyi Wang and Zhijia Yang and Zefeng Zhang and Haobo Cheng and Wei Li",

note = "Publisher Copyright: {\textcopyright} 2008-2012 IEEE.",

year = "2025",

doi = "10.1109/JSTARS.2025.3575770",

language = "English",

volume = "18",

pages = "15291--15303",

journal = "IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing",

issn = "1939-1404",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Enhanced Grounding DINO: Efficient Cross-Modality Block for Open-Set Object Detection in Remote Sensing

AU - Hu, Zibo

AU - Gao, Kun

AU - Wang, Jingyi

AU - Yang, Zhijia

AU - Zhang, Zefeng

AU - Cheng, Haobo

AU - Li, Wei

PY - 2025

Y1 - 2025

N2 - Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.

AB - Open-set object detection unifies candidate category object detection and remote sensing visual grounding, and can simultaneously meet candidate category multiobject detection and text-guided object detection. Most existing open-set detectors are developed based on candidate category detectors by introducing text information. These methods need to process text and images at the same time, which will increase their training overhead and computational complexity. The open-set detector consists of a backbone, neck, and prediction head, with the neck being the main source of computational complexity due to multiscale self-attention and cross-modal attention. However, little research has focused on improving their computational efficiency while maintaining model performance. This article addresses this gap by proposing an enhanced grounding DINO to optimize the neck network, reducing computational complexity while preserving model performance. Specifically, the key contributions are the proposed efficient cross-modality block, which consists of the multiscale visual-cross-text fusion module (MSVCTFM) and inverse pyramid feature refinement (IPFR). The efficient cross-modality block reduces the computational complexity of both multiscale visual feature refinement and the fusion of text and visual features, while maintaining model performance. The MSVCTFM decouples and optimizes the fusion of multiscale visual and text features, thereby enhancing model performance. The IPFR further reduces the computational complexity involved in refining multiscale visual features. The method achieves a 49.7% reduction in GFLOPs, improves performance on visual grounding datasets DIOR-RSVG and RSVG-HR, and delivers competitive results on the candidate category dataset DOTA.

KW - Efficient cross-modality block

KW - inverse pyramid feature refinement (IPFR)

KW - multiscale visual-cross-text fusion module (MSVCTFM)

KW - open-set object detection

UR - http://www.scopus.com/pages/publications/105007296333

U2 - 10.1109/JSTARS.2025.3575770

DO - 10.1109/JSTARS.2025.3575770

M3 - Article

AN - SCOPUS:105007296333

SN - 1939-1404

VL - 18

SP - 15291

EP - 15303

JO - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

JF - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ER -
