Algorithms for online fault tolerance server consolidation

Boyu Li, Bin Wu, Meng Shen, Hao Peng, Weisheng Li, Hong Zhang, Jie Gan, Zhihong Tian*, Guangquan Xu

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

Abstract

We study a novel replication mechanism that ensures service continuity under multiple simultaneous server failures. In this mechanism, each item represents a computing task and is replicated onto ξ+1 servers for some integer ξ≥1, with workloads specified by the amount of required resources. If one or more servers fail, the affected workloads can be redirected to other servers that host replicas of the same item, so that the service is not interrupted by the failure of up to ξ servers. Any feasible assignment algorithm must therefore reserve some capacity in each server to accommodate the workload redirected from potentially failed servers without overloading, and determining how best to reserve this capacity becomes a key issue. Unlike existing algorithms, which assume that no two servers share replicas of more than one item, we first formulate capacity reservation for the general case. Because of the combinatorial nature of this problem, finding the optimal solution is difficult. To this end, we propose a Generalized and Simple Calculating Reserved Capacity (GSCRC) algorithm, whose time complexity depends only on the number of items packed in the server. In conjunction with GSCRC, we propose a robust replica packing algorithm with capacity optimization (RobustPack), which aims to minimize the number of servers hosting replicas while tolerating multiple server failures. Through theoretical analysis and experimental evaluations, we show that the RobustPack algorithm can achieve better performance.
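The capacity-reservation idea in the abstract can be illustrated with a small sketch. This is not the GSCRC algorithm, whose details the abstract does not give. It assumes a deliberately simplified, conservative redirection model: when a server fails, the full load of each of its replicas may land on any one surviving server that hosts the same item. The names `reserved_capacity`, `is_feasible`, `placement`, and `capacity` are hypothetical and chosen for illustration only. Under this assumption, the worst case for a server is the failure of the ξ other servers with which it shares the most load, and that shared load adds up across items when two servers share more than one item.

```python
from collections import defaultdict


def reserved_capacity(server, placement, xi):
    """Worst-case load redirected to `server` if up to `xi` other servers fail.

    placement: dict mapping item -> dict mapping server -> replica workload.
    Assumption (illustrative, not from the paper): the whole load of a failed
    replica may be redirected to a single surviving host of the same item.
    """
    shared = defaultdict(int)  # other server -> load it shares with `server`
    for replicas in placement.values():
        if server not in replicas:
            continue
        for other, load in replicas.items():
            if other != server:
                # Two servers may share several items, so loads accumulate.
                shared[other] += load
    # Loads are non-negative, so the worst failure set is the xi largest sharers.
    return sum(sorted(shared.values(), reverse=True)[:xi])


def is_feasible(placement, capacity, xi):
    """Check that no server overloads under any failure of up to `xi` servers."""
    servers = {s for replicas in placement.values() for s in replicas}
    for s in servers:
        own = sum(replicas.get(s, 0) for replicas in placement.values())
        if own + reserved_capacity(s, placement, xi) > capacity:
            return False
    return True


# Hypothetical instance with xi = 1: every item has xi + 1 = 2 replicas.
# Servers s1 and s2 share two items ("a" and "b"), the general case.
placement = {
    "a": {"s1": 20, "s2": 20},
    "b": {"s1": 10, "s2": 10},
    "c": {"s1": 30, "s3": 30},
}
print(reserved_capacity("s1", placement, xi=1))
print(is_feasible(placement, capacity=100, xi=1))
```

In this instance, s1 carries 60 units of its own load and shares 30 units with each of s2 and s3, so a single failure can redirect at most 30 units to it. Integer workloads are used to avoid floating-point rounding in the comparison.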

Original language: English
Pages (from-to): 514-523
Number of pages: 10
Journal: Digital Communications and Networks
Volume: 11
Issue number: 2
DOI
Publication status: Published - Apr 2025
Externally published
