RESEARCH PRODUCT
Improving checkpointing intervals by considering individual job failure probabilities
André Brinkmann, Alvaro Frank, Reza Salkhordeh, Manuel Baumgartner
subject
High probability, Systems simulation, Computer science, Batch processing, Interval (mathematics), Medium Risk, Resilience (network), Reliability engineering
description
Checkpointing is a popular resilience method in HPC, and its efficiency depends heavily on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for big, long-running jobs that fail with high probability, but they are unable to minimize checkpointing overheads for jobs with a low or medium probability of failing. Nevertheless, our analysis of batch traces of four HPC systems shows that these jobs are extremely common. We therefore propose an iterative checkpointing algorithm to compute efficient intervals for jobs with a medium risk of failure. The method also supports big and long-running jobs, for which it converges to the results of various traditional methods. We validated our algorithm using batch system simulations including traces from four HPC systems and compared it to five alternative checkpoint methods. The evaluations show up to 40% checkpoint savings for individual jobs when using our method, while reducing the checkpointing costs of complete HPC systems by between 2.8% and 24.4% compared to the best alternative approach.
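To illustrate the kind of standard analytical approach the description refers to, here is a minimal sketch of Young's classic first-order approximation, which sets the interval to sqrt(2 · C · MTBF). This is not the iterative algorithm proposed in the paper, and the record does not state which methods the authors compared against. The function names, the exponential failure model, and all numeric values are assumptions made for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed value).
    mtbf_s: mean time between failures seen by the job, in seconds (assumed value).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def failure_probability(runtime_s: float, mtbf_s: float) -> float:
    """Probability of at least one failure during the job.

    Assumes exponentially distributed failures, a common simplification
    that is not taken from the paper.
    """
    return 1.0 - math.exp(-runtime_s / mtbf_s)


if __name__ == "__main__":
    # Hypothetical job: 2 h runtime, 50 s checkpoint cost, 10 000 s MTBF.
    runtime, cost, mtbf = 7200.0, 50.0, 10_000.0
    print("interval [s]:", young_interval(cost, mtbf))
    print("P(failure):  ", failure_probability(runtime, mtbf))
```

The interval returned by this formula depends only on the checkpoint cost and the MTBF, not on the job's runtime or its individual failure probability, which is consistent with the description's point that such approaches are tuned for long-running jobs that are likely to fail.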
year | journal | country | edition | language |
---|---|---|---|---|
2021-05-01 | 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | | | |