Cost-effective Missing Value Imputation for Data-effective Machine Learning

Chengliang Chai, Kaisen Jin, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, Guoliang Li, Ye Yuan, Guoren Wang

Research output: Contribution to journal › Article › peer-review

Abstract

Given a dataset with incomplete data (e.g., missing values), training a machine learning model over it requires two steps. The first is a data-effective step that cleans the data to improve data quality (and, in turn, the quality of the model trained on the cleaned data). The second is a data-efficient step that selects a core subset of the data (called a coreset) such that models trained on the entire data and on the coreset have similar quality, which saves the computational cost of training. Methods that are data-effective first and data-efficient second are too costly, because cleaning the whole dataset is expensive; methods that are data-efficient first and data-effective second yield low model quality, because they cannot select a high-quality coreset from incomplete data. In this article, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data so that a high-quality coreset can be selected. To this end, we propose the GoodCore framework for selecting a good coreset over incomplete data at low cost. To model the unknown complete data, we treat the combinations of possible repairs as possible worlds of the incomplete data. Based on these possible worlds, GoodCore selects an expected optimal coreset through gradient approximation, without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation methods into our framework. Moreover, a group-based strategy further accelerates coreset selection over incomplete data for large datasets. Experimental results show that our framework is effective and efficient at low cost.
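
The idea of selecting a coreset in expectation over possible worlds can be illustrated with a small sketch. It is not the GoodCore algorithm, whose details the abstract does not give. The sketch assumes that every possible world is equally likely, that the gradient difference between two examples can be approximated by their feature distance (a proxy used by some gradient-approximation coreset methods), and that the objective is a facility-location function maximized greedily. The names `possible_worlds`, `expected_similarity`, and `greedy_coreset` are hypothetical.

```python
import math
from itertools import product


def possible_worlds(rows, candidates):
    """Yield every complete version of `rows`.

    rows: list of tuples; None marks a missing cell.
    candidates: dict mapping (row_index, col_index) to a list of candidate repairs.
    """
    missing = sorted(candidates)
    for combo in product(*(candidates[cell] for cell in missing)):
        world = [list(r) for r in rows]
        for (i, j), value in zip(missing, combo):
            world[i][j] = value
        yield [tuple(r) for r in world]


def expected_similarity(rows, candidates):
    """Average pairwise similarity over all worlds (assumed equally likely).

    Feature distance stands in for gradient distance; this is an assumption.
    """
    n = len(rows)
    dists = [
        [[math.dist(w[i], w[j]) for j in range(n)] for i in range(n)]
        for w in possible_worlds(rows, candidates)
    ]
    # Subtract from the largest distance so that similarities are non-negative.
    d_max = max(d for w in dists for row in w for d in row)
    return [
        [sum(d_max - w[i][j] for w in dists) / len(dists) for j in range(n)]
        for i in range(n)
    ]


def greedy_coreset(sim, k):
    """Greedily maximize the facility-location value sum_i max_{j in S} sim[i][j]."""
    n = len(sim)
    selected = []
    best = [0.0] * n  # similarity of each example to its closest selected example
    for _ in range(min(k, n)):
        def gain(j):
            return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
        j_star = max((j for j in range(n) if j not in selected), key=gain)
        selected.append(j_star)
        best = [max(best[i], sim[i][j_star]) for i in range(n)]
    return selected


# Toy data: row 1 has one missing cell with two candidate repairs.
rows = [(0.0, 0.0), (0.1, None), (5.0, 5.0), (5.1, 5.0)]
candidates = {(1, 1): [0.0, 0.2]}
coreset = greedy_coreset(expected_similarity(rows, candidates), k=2)
```

Enumerating all worlds is exponential in the number of missing cells, so this only works on toy inputs. For a monotone submodular objective such as facility location, greedy selection carries the classical 1 − 1/e guarantee; the abstract does not state which ratio GoodCore achieves.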

Original language: English
Article number: 10
Journal: ACM Transactions on Database Systems
Volume: 50
Issue number: 3
Publication status: Published - 14 May 2025

Keywords

  • coreset selection
  • data cleaning
  • Data-centric AI
  • machine learning
