R2G: Reasoning to ground in 3D scenes

Yixuan Li, Zan Wang, Wei Liang*

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

We propose Reasoning to Ground (R2G), a neural-symbolic model that grounds target objects in 3D scenes in a reasoning manner. Unlike previous works that rely on end-to-end models for grounding, which often function as black boxes, our approach seeks to provide a more interpretable and reliable solution. R2G explicitly models the 3D scene using a semantic concept-based scene graph, recurrently simulates attention transferring across object entities, and interpretably grounds the target objects with the highest attention score. Specifically, we embed multiple object properties within the graph nodes and spatial relations among entities within the edges through a predefined semantic vocabulary. To guide attention transfer, we employ learning- or prompting-based approaches to interpret the referential utterance into reasoning instructions within the same semantic space. In each reasoning round, we either (1) merge the current attention distribution with the similarity between the instructions and the embedded entity properties, or (2) shift the attention across the scene graph based on the similarity between the instructions and the embedded spatial relations. Experiments on the Sr3D/Nr3D benchmarks show that R2G achieves results comparable to prior works while offering improved interpretability, opening a new path for 3D grounding. The code and dataset for this work are available at: http://sites.google.com/view/reasoning-to-ground.
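The abstract's two per-round updates, merging attention with property similarity and shifting it along relation edges, can be sketched concretely. The snippet below is a minimal illustrative sketch, not the authors' released implementation: the function names, embedding dimensions, softmax-based similarity, and row-stochastic transition are all our assumptions based only on the abstract.

```python
# Illustrative sketch of R2G-style merge/shift attention reasoning
# over a scene graph. All names and update rules are assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def merge_step(attention, node_embeds, instruction):
    """Rule (1): merge attention with instruction/property similarity."""
    sim = softmax(node_embeds @ instruction)   # similarity to entity properties
    merged = attention * sim                   # element-wise merge
    return merged / merged.sum()               # renormalize to a distribution

def shift_step(attention, edge_embeds, instruction):
    """Rule (2): shift attention along edges by instruction/relation similarity.

    edge_embeds[i, j] embeds the spatial relation from entity i to entity j.
    """
    n = attention.shape[0]
    trans = np.stack([softmax(edge_embeds[i] @ instruction) for i in range(n)])
    return attention @ trans                   # propagate mass across the graph

# Toy usage: 4 entities, 8-dim semantic space, 2 reasoning instructions.
rng = np.random.default_rng(0)
attention = np.full(4, 0.25)                   # uniform initial attention
nodes = rng.normal(size=(4, 8))                # embedded object properties
edges = rng.normal(size=(4, 4, 8))             # embedded pairwise spatial relations
instructions = rng.normal(size=(2, 8))         # parsed from the utterance

attention = merge_step(attention, nodes, instructions[0])
attention = shift_step(attention, edges, instructions[1])
target = int(attention.argmax())               # ground the highest-scoring entity
print(target, attention.round(3))
```

Because each row of the transition matrix in the shift step sums to one, the total attention mass is conserved as it moves across the graph, which is what keeps the final argmax interpretable as a distribution over candidate entities.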

Original language: English
Article number: 111728
Journal: Pattern Recognition
Volume: 168
DOI
Publication status: Published - Dec 2025
