RESEARCH PRODUCT

Reducing False Node Failure Predictions in HPC

Martin Schulz, Tim Suss, Dai Yang, André Brinkmann, Alvaro Frank

subject

Computer science; Node (networking); Metric (mathematics); False positive paradox; Fault tolerance; False positive rate; Resilience (network); Cluster (spacecraft); Reliability engineering

description

Future HPC applications must be able to scale to thousands of compute nodes while running for several days. Longer runtimes and higher node counts raise the probability of hardware failures that may interrupt computations, so scientists must protect their simulations against such failures. This is typically done with frequent checkpoint and restart, which can incur significant overhead. Consequently, the frequency at which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to this problem, but open issues remain, such as false alarms at large scale. In this paper, we introduce the probability of unnecessarily triggering checkpoints (UC) to evaluate the quality of node-level failure predictors for checkpointing large-scale applications. We use this metric to show that current predictors suffer from too many false alarms at large node counts. Further, we propose a new failure predictor that chains several machine learning classifiers to make predictions with minimal false alarms. We aim for extremely low false positive rates to guarantee that no unnecessary checkpoints will be performed even for very large node counts. Our experiments, based on real system traces from a large production cluster, show that our predictor achieves a lead-up time of four minutes, a recall of 0.7302, a false positive rate of 0.0004, a precision of 0.9944, and a probability of unnecessary checkpoints (UC) of 0.00011 for 1024 nodes.
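For readers unfamiliar with the reported metrics, the following minimal Python sketch shows how recall, false positive rate, and precision are conventionally derived from confusion-matrix counts. It also includes a deliberately naive illustration of why per-node false alarms become a problem at scale: if every node raised false alarms independently at the same rate, the chance that at least one of N nodes does so grows quickly with N. That independence model, the function names, and the example counts are assumptions made here for illustration only. The abstract does not spell out how UC is defined, so this sketch should not be read as the paper's UC formula.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of real failures that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of healthy samples flagged as failing: FP / (FP + TN)."""
    return fp / (fp + tn)


def precision(tp: int, fp: int) -> float:
    """Fraction of raised alarms that were real failures: TP / (TP + FP)."""
    return tp / (tp + fp)


def prob_any_false_alarm(fpr: float, nodes: int) -> float:
    """Naive model (an assumption, not the paper's UC metric):
    probability that at least one of `nodes` independent nodes
    raises a false alarm in a single prediction round."""
    return 1.0 - (1.0 - fpr) ** nodes


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, not taken from the paper.
    tp, fn, fp, tn = 73, 27, 4, 9996
    print("recall   :", recall(tp, fn))
    print("FPR      :", false_positive_rate(fp, tn))
    print("precision:", precision(tp, fp))

    # Under the naive model, false alarms compound as node count grows.
    for n in (1, 64, 1024):
        print(n, "nodes ->", prob_any_false_alarm(0.0004, n))
```

Under this simplified model, even a small per-node false positive rate yields a substantial chance of some false alarm across a thousand nodes, which is the motivation the abstract gives for targeting extremely low false positive rates.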

https://doi.org/10.1109/hipc.2019.00047