TY - GEN
T1 - Doctopus
T2 - 2025 ACM SIGMOD/PODS International Conference on Management of Data, SIGMOD-Companion 2025
AU - Zhong, Yuanhao
AU - Deng, Yuhao
AU - Chai, Chengliang
AU - Gu, Ruixin
AU - Yuan, Ye
AU - Wang, Guoren
AU - Cao, Lei
N1 - Publisher Copyright: © 2025 ACM.
PY - 2025/6/22
Y1 - 2025/6/22
AB - Extracting structured data from unstructured documents is essential for applications like analytical SQL queries and decision-making. Strategies such as pre-trained language models (PLMs) can be employed, but they often fall short in quality. Large language models (LLMs) have shown effectiveness in attribute extraction but are costly, making them impractical for large-scale document sets. To best trade off quality and cost, we present Doctopus, a system designed for accurate and cost-effective attribute extraction. Overall, Doctopus combines LLMs with non-LLM strategies to achieve an optimal quality-cost balance. First, the system employs an index-based approach to efficiently identify and process only relevant chunks. Afterwards, it further estimates the quality of multiple strategies for each attribute. Finally, based on the cost and estimated quality, Doctopus dynamically selects the optimal strategies through budget-aware optimization. With a real-life scenario, we demonstrate that Doctopus allows users to extract attributes accurately and affordably. The corresponding video is available at http://youtu.be/Cxl-PfvZY10?si=NYoHt2SyD9KHqd6V.
KW - cost optimization
KW - information extraction
KW - query enhancement
UR - http://www.scopus.com/pages/publications/105010185673
U2 - 10.1145/3722212.3725103
DO - 10.1145/3722212.3725103
M3 - Conference contribution
AN - SCOPUS:105010185673
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 275
EP - 278
BT - SIGMOD-Companion 2025 - Companion of the 2025 International Conference on Management of Data
A2 - Deshpande, Amol
A2 - Aboulnaga, Ashraf
A2 - Salimi, Babak
A2 - Chandramouli, Badrish
A2 - Howe, Bill
A2 - Loo, Boon Thau
A2 - Glavic, Boris
A2 - Curino, Carlo
A2 - Wang, Daisy Zhe
A2 - Suciu, Dan
A2 - Abadi, Daniel
A2 - Srivastava, Divesh
A2 - Wu, Eugene
A2 - Nawab, Faisal
A2 - Ilyas, Ihab
A2 - Naughton, Jeffrey
A2 - Rogers, Jennie
A2 - Patel, Jignesh
A2 - Arulraj, Joy
A2 - Yang, Jun
A2 - Echihabi, Karima
A2 - Ross, Kenneth
A2 - Daudjee, Khuzaima
A2 - Lakshmanan, Laks
A2 - Garofalakis, Minos
A2 - Riedewald, Mirek
A2 - Mokbel, Mohamed
A2 - Ouzzani, Mourad
A2 - Kennedy, Oliver
A2 - Papotti, Paolo
A2 - Alvaro, Peter
A2 - Bailis, Peter
A2 - Miller, Renee
A2 - Roy, Senjuti Basu
A2 - Melnik, Sergey
A2 - Idreos, Stratos
A2 - Roy, Sudeepa
A2 - Rekatsinas, Theodoros
A2 - Leis, Viktor
A2 - Zhou, Wenchao
A2 - Gatterbauer, Wolfgang
A2 - Ives, Zack
PB - Association for Computing Machinery
Y2 - 22 June 2025 through 27 June 2025
ER -