RESEARCH PRODUCT
Reducing False Node Failure Predictions in HPC
Authors: Martin Schulz, Tim Süß, Dai Yang, André Brinkmann, Alvaro Frank

Subject: Computer science; Node (networking); Metric (mathematics); False positive paradox; Fault tolerance; False positive rate; Resilience (network); Cluster (spacecraft); Reliability engineering
Description: Future HPC applications must be able to scale to thousands of compute nodes while running for several days. The increased runtime and node count raise the probability of hardware failures that may interrupt computations. Scientists must therefore protect their simulations against hardware failures. This is typically done with frequent checkpoint & restart, which can incur significant overhead, so the frequency at which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to this problem, but it still suffers from false alarms at large scale. In this paper, we introduce the probability of unnecessarily triggering checkpoints (UC) to evaluate the quality of node-level failure predictors for checkpointing large-scale applications. This metric is used to show how current predictors suffer from too many false alarms at large node counts. Further, we propose a new failure predictor that chains several machine learning classifiers to make predictions with minimal false alarms. We aim for extremely low false positive rates to guarantee that no unnecessary checkpoints will be performed, even for very large node counts. Our experiments, based on real system traces from a large production cluster, show that our predictor achieves a lead-up time of four minutes, a recall of 0.7302, a false positive rate of 0.0004, a precision of 0.9944, and a probability of unnecessary checkpoints (UC) of 0.00011 for 1024 nodes.
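The chaining idea described above can be illustrated with a minimal sketch. The actual models, features, and training procedure from the paper are not reproduced here; `classifiers` is a hypothetical list of callables, each returning True ("node will fail soon") or False. An alarm is raised only when every classifier in the chain agrees, which is one simple way to suppress false positives.

```python
# Hedged sketch of chained failure prediction, not the paper's implementation.
# `classifiers` and `features` are hypothetical stand-ins for the paper's
# trained models and node telemetry.

def chained_predict(classifiers, features):
    """Raise a failure alarm only if ALL classifiers in the chain agree.

    Under an idealized independence assumption, the combined false
    positive rate is roughly the product of the individual classifiers'
    rates, which is why chaining can push false alarms close to zero
    (at some cost in recall).
    """
    return all(clf(features) for clf in classifiers)


# Example with two toy threshold classifiers on hypothetical node metrics:
clf_temp = lambda f: f["temp_c"] > 85          # overheating signal
clf_ecc = lambda f: f["ecc_errors"] > 10       # memory-error signal
chain = [clf_temp, clf_ecc]

alarm = chained_predict(chain, {"temp_c": 92, "ecc_errors": 25})  # both agree
quiet = chained_predict(chain, {"temp_c": 92, "ecc_errors": 0})   # chain vetoes
```

The trade-off is visible even in this toy: requiring unanimity lowers the false positive rate multiplicatively but also lowers recall, since a single dissenting classifier suppresses a true alarm.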
year | journal
---|---
2019-12-01 | 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)