Doctopus: A System for Budget-aware Structural Data Extraction from Unstructured Documents

Yuanhao Zhong, Yuhao Deng, Chengliang Chai, Ruixin Gu, Ye Yuan, Guoren Wang, Lei Cao

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

1 Citation (Scopus)

Abstract

Extracting structured data from unstructured documents is essential for applications like analytical SQL queries and decision-making. Strategies such as pre-trained language models (PLMs) can be employed, but they often fall short in quality. Large language models (LLMs) have shown effectiveness in attribute extraction but are costly, making them impractical for large-scale document sets. To best trade off quality and cost, we present Doctopus, a system designed for accurate and cost-effective attribute extraction. Overall, Doctopus combines LLMs with non-LLM strategies to achieve an optimal quality-cost balance. First, the system employs an index-based approach to efficiently identify and process only relevant chunks. Afterwards, it further estimates the quality of multiple strategies for each attribute. Finally, based on the cost and estimated quality, Doctopus dynamically selects the optimal strategies through budget-aware optimization. With a real-life scenario, we demonstrate that Doctopus allows users to extract attributes accurately and affordably. The corresponding video is available at http://youtu.be/Cxl-PfvZY10?si=NYoHt2SyD9KHqd6V.
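The abstract does not say how the budget-aware optimization is formulated. One common way to model "pick one strategy per attribute under a cost cap" is a multiple-choice knapsack, and the sketch below illustrates that formulation only; it is not the authors' algorithm. The attribute names, strategy names, integer cost units, and quality estimates are all hypothetical.

```python
from typing import Dict, List, Tuple

# (strategy name, cost in integer budget units, estimated quality in [0, 1]).
# All three fields are illustrative assumptions, not values from the paper.
Strategy = Tuple[str, int, float]


def select_strategies(
    candidates: Dict[str, List[Strategy]], budget: int
) -> Tuple[Dict[str, str], float]:
    """Choose exactly one strategy per attribute so that the summed estimated
    quality is maximal and the summed cost stays within the budget."""
    neg_inf = float("-inf")
    # best[b] = (best total quality, plan) over plans whose cost is exactly b.
    best: List[Tuple[float, Dict[str, str]]] = [(neg_inf, {})] * (budget + 1)
    best[0] = (0.0, {})
    for attr, options in candidates.items():
        nxt: List[Tuple[float, Dict[str, str]]] = [(neg_inf, {})] * (budget + 1)
        for spent, (quality, plan) in enumerate(best):
            if quality == neg_inf:
                continue  # no feasible plan reaches this exact cost
            for name, cost, est in options:
                total = spent + cost
                if total <= budget and quality + est > nxt[total][0]:
                    nxt[total] = (quality + est, {**plan, attr: name})
        best = nxt
    quality, plan = max(best, key=lambda entry: entry[0])
    if quality == neg_inf:
        raise ValueError("budget too small to cover every attribute")
    return plan, quality


# Hypothetical candidates: a free rule-based extractor, a cheap PLM, a costly LLM.
candidates = {
    "author": [("regex", 0, 0.60), ("plm", 1, 0.75), ("llm", 5, 0.95)],
    "diagnosis": [("plm", 1, 0.50), ("llm", 5, 0.90)],
}
plan, total_quality = select_strategies(candidates, budget=6)
print(plan, total_quality)
```

With these made-up numbers and a budget of 6, the sketch spends the LLM on the attribute where it improves estimated quality the most and falls back to the cheaper strategy for the other. That is the kind of quality-cost trade-off the abstract describes.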

Original language: English
Title of host publication: SIGMOD-Companion 2025 - Companion of the 2025 International Conference on Management of Data
Editors: Amol Deshpande, Ashraf Aboulnaga, Babak Salimi, Badrish Chandramouli, Bill Howe, Boon Thau Loo, Boris Glavic, Carlo Curino, Daisy Zhe Wang, Dan Suciu, Daniel Abadi, Divesh Srivastava, Eugene Wu, Faisal Nawab, Ihab Ilyas, Jeffrey Naughton, Jennie Rogers, Jignesh Patel, Joy Arulraj, Jun Yang, Karima Echihabi, Kenneth Ross, Khuzaima Daudjee, Laks Lakshmanan, Minos Garofalakis, Mirek Riedewald, Mohamed Mokbel, Mourad Ouzzani, Oliver Kennedy, Paolo Papotti, Peter Alvaro, Peter Bailis, Renee Miller, Senjuti Basu Roy, Sergey Melnik, Stratos Idreos, Sudeepa Roy, Theodoros Rekatsinas, Viktor Leis, Wenchao Zhou, Wolfgang Gatterbauer, Zack Ives
Publisher: Association for Computing Machinery
Pages: 275-278
Number of pages: 4
ISBN (Electronic): 9798400715648
DOI
Publication status: Published - 22 Jun 2025
Event: 2025 ACM SIGMOD/PODS International Conference on Management of Data, SIGMOD-Companion 2025 - Berlin, Germany
Duration: 22 Jun 2025 to 27 Jun 2025

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2025 ACM SIGMOD/PODS International Conference on Management of Data, SIGMOD-Companion 2025
Country/Territory: Germany
City: Berlin
Period: 22/06/25 to 27/06/25
