
AUTHOR

Alvaro Frank

Improving checkpointing intervals by considering individual job failure probabilities

Checkpointing is a popular resilience method in HPC, and its efficiency depends heavily on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for big, long-running jobs that fail with high probability, but they cannot minimize checkpointing overheads for jobs with a low or medium probability of failing. Nevertheless, our analysis of batch traces of four HPC systems shows that these jobs are extremely common. We therefore propose an iterative checkpointing algorithm to compute efficient intervals for jobs with a medium risk of failure. The method also supports big and long-running jobs by converging to the results of various traditional methods for…

research product

Reducing False Node Failure Predictions in HPC

Future HPC applications must be able to scale to thousands of compute nodes while running for several days. The increased runtime and node count raise the probability of hardware failures that may interrupt computations. Scientists must therefore protect their simulations against hardware failures. This is typically done using frequent checkpoint & restart, which may have significant overheads. Consequently, the frequency at which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to address this problem, but has remaining issues like false alarms at large scales. In this paper, we introduce the probability of unnece…

research product

Effects and Benefits of Node Sharing Strategies in HPC Batch Systems

Processor manufacturers today scale performance by increasing the number of cores on each CPU. Unfortunately, not all HPC applications can efficiently saturate all cores of a single node, even if they successfully scale to thousands of nodes. For these applications, sharing nodes with other applications can help to stress different resources on the nodes and use them more efficiently. Previous work has shown that the performance impact of node sharing is highly application-dependent, but very little work has studied its effects within batch systems and for complex parallel application mixes. Administrators therefore typically fear the complexity of running a batch system supporting node sharing…

research product