HARNESSING DIVERSITY FOR IMPORTANT DATA SELECTION IN PRETRAINING LARGE LANGUAGE MODELS

Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai*, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, Conghui He*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data selection is of great significance in pretraining large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, i.e., a high influence score indicates that incorporating this instance into the training set is likely to enhance the model performance. Consequently, they select the top-k instances with the highest scores. However, this approach has several limitations. (1) Computing the accurate influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pretrained model's ability to generalize effectively to various downstream tasks. In this paper, we introduce Quad, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pretraining results. To compute the influence (i.e., the quality) more accurately and efficiently, we incorporate the attention layers to capture more semantic details, and the computation can be accelerated through the Kronecker product. For diversity, Quad clusters the dataset so that instances are similar within each cluster and diverse across clusters. When we opt to select data from a cluster, we evaluate the influence of only a sample of its instances rather than processing all of them. Overall, we favor clusters containing highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby striking a good balance between quality and diversity. Experiments on Slimpajama and FineWeb with 7B large language models demonstrate that Quad significantly outperforms other data selection methods with low FLOPs consumption. Further analysis also validates the effectiveness of our influence calculation. Our code and data are available at http://anonymous.4open.science/r/Quad/.
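The abstract's claim that influence computation "can be accelerated through the Kronecker product" matches a standard trick in Kronecker-factored (K-FAC-style) curvature approximations; the paper's exact formulation is not given here, so the following NumPy snippet is only a toy illustration of the general identity, not the authors' computation. When a curvature matrix factors as H ≈ A ⊗ S, the inverse-curvature-vector product needed for influence scores can be obtained from the small factors alone, via (A ⊗ S)^{-1} vec(G) = vec(S^{-1} G A^{-1}), instead of inverting the full Kronecker product. All names and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32                                 # toy layer dimensions

# Symmetric positive-definite Kronecker factors, as in K-FAC-style
# curvature approximations: H ~ A kron S (A: input covariance,
# S: output-gradient covariance). The small ridge keeps them invertible.
A = np.cov(rng.standard_normal((d_in, 1000))) + 1e-3 * np.eye(d_in)
S = np.cov(rng.standard_normal((d_out, 1000))) + 1e-3 * np.eye(d_out)
G = rng.standard_normal((d_out, d_in))               # weight-matrix gradient

# Naive route: materialize and solve with the full (d_in*d_out)^2 matrix.
H = np.kron(A, S)
naive = np.linalg.solve(H, G.flatten(order="F"))     # column-major vec(G)

# Kronecker route: (A kron S)^{-1} vec(G) = vec(S^{-1} G A^{-1}); only the
# small factors are ever inverted, which is what makes influence scoring
# over many instances tractable.
fast = np.linalg.solve(S, G) @ np.linalg.inv(A)

assert np.allclose(naive, fast.flatten(order="F"), atol=1e-6)
```

The two routes agree, but the naive one scales with the cube of d_in*d_out while the factored one only ever inverts the d_in- and d_out-sized matrices.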
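Likewise, the cluster-level trade-off described above (favoring clusters with highly influential instances or clusters selected less frequently) has the shape of a multi-armed bandit problem. The sketch below is one hypothetical reading of that idea using a standard upper-confidence-bound rule; the scoring formula, the names (quad_select, influence_fn), and all hyperparameters are assumptions for illustration, not the authors' exact algorithm.

```python
import math
import random

def quad_select(clusters, influence_fn, budget, sample_size=32, c=1.0):
    """Bandit-style data selection over pre-computed clusters (illustrative).

    clusters:     list of lists; each inner list holds one cluster's instances
                  (similar within a cluster, diverse across clusters).
    influence_fn: placeholder for a per-instance influence estimate.
    budget:       total number of instances to select.
    sample_size:  instances scored per cluster visit (avoids scoring all).
    c:            exploration weight trading quality against diversity.
    """
    k = len(clusters)
    visits = [0] * k          # how often each cluster has been selected
    mean_inf = [0.0] * k      # running mean of sampled influence per cluster
    selected, t = [], 0

    while len(selected) < budget and any(clusters):
        t += 1

        def ucb(i):
            # High mean influence (quality) or few visits (diversity) both
            # raise a cluster's score -- an upper-confidence-bound rule.
            if visits[i] == 0:
                return float("inf")   # visit every cluster at least once
            return mean_inf[i] + c * math.sqrt(math.log(t) / visits[i])

        i = max((j for j in range(k) if clusters[j]), key=ucb)

        # Estimate the cluster's quality from a small sample instead of
        # computing influence for every instance it contains.
        batch = random.sample(clusters[i], min(sample_size, len(clusters[i])))
        scores = [influence_fn(x) for x in batch]
        visits[i] += 1
        mean_inf[i] += (sum(scores) / len(scores) - mean_inf[i]) / visits[i]

        # Keep the top half of the sampled batch by influence score.
        ranked = sorted(zip(batch, scores), key=lambda p: -p[1])
        for x, _ in ranked[: max(1, len(ranked) // 2)]:
            if len(selected) >= budget:
                break
            selected.append(x)
            clusters[i].remove(x)

    return selected

# Toy usage with dummy clusters and a random influence estimate.
random.seed(0)
toy_clusters = [[(cid, j) for j in range(100)] for cid in range(5)]
picked = quad_select(toy_clusters, lambda x: random.random(), budget=50)
print(len(picked))  # 50
```

Here influence_fn stands in for whatever per-instance influence estimate is used (e.g., the Kronecker-accelerated one sketched above); scoring only sample_size instances per visit is what avoids processing every instance in every cluster, while the confidence bonus keeps rarely visited clusters in play.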

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 22941-22964
Number of pages: 24
ISBN (Electronic): 9798331320850
Publication status: Published - 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: 24 Apr 2025 → 28 Apr 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
Period: 24/04/25 → 28/04/25
