Cost-effective Missing Value Imputation for Data-effective Machine Learning

Chengliang Chai; Kaisen Jin; Nan Tang; Ju Fan; Dongjing Miao; Jiayi Wang; Yuyu Luo; Guoliang Li; Ye Yuan; Guoren Wang

doi:10.1145/3716376

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Given a dataset with incomplete data (e.g., missing values), training a machine learning model over it requires two steps. First, a data-effective step cleans the data to improve data quality (and hence the quality of the model trained on the cleaned data). Second, a data-efficient step selects a core subset of the data (called a coreset) such that models trained on the entire data and on the coreset have similar quality, which saves the computational cost of training. Methods that apply the data-effective step first and the data-efficient step second are too costly, because cleaning the whole dataset is expensive; methods that apply them in the reverse order yield low model quality, because they cannot select a high-quality coreset from incomplete data. In this article, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data so that a high-quality coreset can be selected. To this end, we propose the GoodCore framework, which selects a good coreset over incomplete data at low cost. To model the unknown complete data, we treat the combinations of possible repairs as possible worlds of the incomplete data. Based on these possible worlds, GoodCore selects an expected optimal coreset through gradient approximation, without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation methods into our framework. Moreover, a group-based strategy further accelerates coreset selection over incomplete data for large datasets. Experimental results show that our framework is effective and efficient at low cost.
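
The idea of selecting a coreset in expectation over possible worlds can be illustrated with a minimal sketch. This is not the GoodCore implementation: the function names `expected_similarity` and `greedy_coreset` are hypothetical, raw feature vectors stand in for the per-tuple gradients that the abstract mentions, candidate repairs are assumed to be equally likely and independent across tuples, and the greedy step maximizes a facility-location objective chosen here only for illustration.

```python
import numpy as np


def expected_similarity(candidates):
    """Expected pairwise similarity over possible worlds (illustrative only).

    candidates[i] has shape (r_i, d): the r_i candidate repairs of tuple i.
    A complete tuple has r_i == 1. Assumption: repairs are equally likely
    and independent across tuples.
    """
    n = len(candidates)
    all_points = np.vstack(candidates)
    # Largest distance between any two repairs; converts distances to similarities.
    d_max = np.linalg.norm(
        all_points[:, None, :] - all_points[None, :, :], axis=-1
    ).max()
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                # In every possible world a tuple is identical to itself.
                sim[i, j] = d_max
                continue
            dist = np.linalg.norm(
                candidates[i][:, None, :] - candidates[j][None, :, :], axis=-1
            )
            # Average distance over all combinations of repairs of i and j.
            sim[i, j] = d_max - dist.mean()
    return sim


def greedy_coreset(sim, k):
    """Greedily pick k tuples maximizing sum_i max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    chosen = np.zeros(n, dtype=bool)
    best = np.zeros(n)  # best[i]: similarity of tuple i to its closest selected tuple
    selected = []
    for _ in range(k):
        # Objective value after adding each candidate j, minus the current value.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf  # never pick the same tuple twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return selected


# Toy data (made up): two clusters; tuples 2 and 5 each have a missing
# second attribute with two candidate repairs.
candidates = [
    np.array([[0.0, 0.0]]),
    np.array([[0.1, 0.0]]),
    np.array([[0.0, 0.1], [0.0, 0.2]]),
    np.array([[10.0, 10.0]]),
    np.array([[10.1, 10.0]]),
    np.array([[10.0, 10.1], [10.0, 10.3]]),
]
coreset = greedy_coreset(expected_similarity(candidates), k=2)
```

On this toy input the two selected tuples come from different clusters, so the coreset covers both regions of the data without any tuple having been repaired first.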

Original language	English
Article number	10
Journal	ACM Transactions on Database Systems
Volume	50
Issue number	3
DOIs	http://doi.org/10.1145/3716376
Publication status	Published - 14 May 2025

Keywords

coreset selection
data cleaning
Data-centric AI
machine learning

Cite this

@article{a9755e3c16aa415398d1d278a916519b,
title = "Cost-effective Missing Value Imputation for Data-effective Machine Learning",
abstract = "Given a dataset with incomplete data (e.g., missing values), training a machine learning model over the incomplete data requires two steps. First, it requires a data-effective step that cleans the data in order to improve the data quality (and the model quality on the cleaned data). Second, it requires a data-efficient step that selects a core subset of the data (called coreset) such that the trained models on the entire data and the coreset have similar model quality, in order to save the computational cost of training. The first-data-effective-then-data-efficient methods are too costly, because they are expensive to clean the whole data; while the first-data-efficient-then-data-effective methods have low model quality, because they cannot select high-quality coreset for incomplete data. In this article, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data for selecting high-quality coreset. To this end, we propose the GoodCore framework towards selecting a good coreset over incomplete data with low cost. To model the unknown complete data, we utilize the combinations of possible repairs as possible worlds of the incomplete data. Based on possible worlds, GoodCore selects an expected optimal coreset through gradient approximation without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation method into our framework. Moreover, a group-based strategy is utilized to further accelerate the coreset selection with incomplete data given large datasets. Experimental results show the effectiveness and efficiency of our framework with low cost.",

keywords = "coreset selection, data cleaning, Data-centric AI, machine learning",
author = "Chengliang Chai and Kaisen Jin and Nan Tang and Ju Fan and Dongjing Miao and Jiayi Wang and Yuyu Luo and Guoliang Li and Ye Yuan and Guoren Wang",
note = "Publisher Copyright: {\textcopyright} 2025 Copyright held by the owner/author(s).",
year = "2025",
month = may,
day = "14",
doi = "10.1145/3716376",
language = "English",
volume = "50",
journal = "ACM Transactions on Database Systems",
issn = "0362-5915",
publisher = "Association for Computing Machinery (ACM)",
number = "3",
}

TY - JOUR
T1 - Cost-effective Missing Value Imputation for Data-effective Machine Learning
AU - Chai, Chengliang
AU - Jin, Kaisen
AU - Tang, Nan
AU - Fan, Ju
AU - Miao, Dongjing
AU - Wang, Jiayi
AU - Luo, Yuyu
AU - Li, Guoliang
AU - Yuan, Ye
AU - Wang, Guoren
PY - 2025/5/14
Y1 - 2025/5/14
N2 - Given a dataset with incomplete data (e.g., missing values), training a machine learning model over the incomplete data requires two steps. First, it requires a data-effective step that cleans the data in order to improve the data quality (and the model quality on the cleaned data). Second, it requires a data-efficient step that selects a core subset of the data (called coreset) such that the trained models on the entire data and the coreset have similar model quality, in order to save the computational cost of training. The first-data-effective-then-data-efficient methods are too costly, because they are expensive to clean the whole data; while the first-data-efficient-then-data-effective methods have low model quality, because they cannot select high-quality coreset for incomplete data. In this article, we investigate the problem of coreset selection over incomplete data for data-effective and data-efficient machine learning. The essential challenge is how to model the incomplete data for selecting high-quality coreset. To this end, we propose the GoodCore framework towards selecting a good coreset over incomplete data with low cost. To model the unknown complete data, we utilize the combinations of possible repairs as possible worlds of the incomplete data. Based on possible worlds, GoodCore selects an expected optimal coreset through gradient approximation without training ML models. We formally define the expected optimal coreset selection problem, prove its NP-hardness, and propose a greedy algorithm with an approximation ratio. To make GoodCore more efficient, we propose optimization methods that incorporate human-in-the-loop imputation or automatic imputation method into our framework. Moreover, a group-based strategy is utilized to further accelerate the coreset selection with incomplete data given large datasets. Experimental results show the effectiveness and efficiency of our framework with low cost.

KW - coreset selection
KW - data cleaning
KW - Data-centric AI
KW - machine learning
UR - http://www.scopus.com/pages/publications/105009965972
U2 - 10.1145/3716376
DO - 10.1145/3716376
M3 - Article
AN - SCOPUS:105009965972
SN - 0362-5915
VL - 50
JO - ACM Transactions on Database Systems
JF - ACM Transactions on Database Systems
IS - 3
M1 - 10
ER -
