RESEARCH PRODUCT
Effects and Benefits of Node Sharing Strategies in HPC Batch Systems
Alvaro Frank, Tim Suss, André Brinkmann
subject
Job scheduler; 020203 distributed computing; Single node; Computer science; Distributed computing; 0202 electrical engineering, electronic engineering, information engineering; Batch processing; 020201 artificial intelligence & image processing; Workload; 02 engineering and technology; computer.software_genre; computer; Scheduling (computing)
description
Processor manufacturers today scale performance by increasing the number of cores on each CPU. Unfortunately, not all HPC applications can efficiently saturate all cores of a single node, even if they successfully scale to thousands of nodes. For these applications, sharing nodes with other applications can help to stress different resources on the nodes and thus use them more efficiently. Previous work has shown that the performance impact of node sharing is highly application-dependent, but very little work has studied its effects within batch systems and for complex parallel application mixes. Administrators therefore typically fear the complexity of running a batch system that supports node sharing, and also fear that interference between co-allocated jobs leads to worse performance in practice. This paper focuses on sharing nodes by oversubscribing cores through hyper-threading. We introduce new node sharing strategies for batch systems by deriving extensions to the well-known backfill and first fit algorithms. These strategies have been implemented in the SLURM workload manager, and the evaluation is based on NERSC Trinity scientific mini applications. The evaluation of our node sharing strategies shows no overhead when using co-allocation, but an increased computational efficiency of 19% and an increased scheduling efficiency of 25.2% compared to standard node allocation.
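As a rough illustration of the idea of extending first fit with hyper-thread co-allocation, the following minimal Python sketch places a job on the first node with enough free physical cores and, only if none exists, falls back to the sibling hardware threads of a node that is already in use. This is not the authors' SLURM implementation: the `Node` class, the `first_fit` function, the two-threads-per-core model, and the "exclusive"/"shared" labels are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model: `cores` physical cores, two hardware threads each."""
    name: str
    cores: int
    primary: dict = field(default_factory=dict)  # job id -> cores held on primary threads
    sibling: dict = field(default_factory=dict)  # job id -> cores held on sibling threads

    def free_primary(self) -> int:
        return self.cores - sum(self.primary.values())

    def free_sibling(self) -> int:
        return self.cores - sum(self.sibling.values())


def first_fit(nodes, job, cores, share=True):
    """Place `job` on the first node that fits and return (node name, mode).

    Pass 1 is plain first fit over free physical cores.
    Pass 2 (only if `share` is True) oversubscribes cores by using the
    sibling hardware threads of the first node with enough of them free.
    Returns None when the job cannot be placed.
    """
    for node in nodes:
        if node.free_primary() >= cores:
            node.primary[job] = cores
            return node.name, "exclusive"
    if share:
        for node in nodes:
            if node.free_sibling() >= cores:
                node.sibling[job] = cores
                return node.name, "shared"
    return None
```

With two four-core nodes, two four-core jobs fill the physical cores; a third job can then only start if sharing is enabled, in which case it is co-allocated on sibling threads instead of waiting in the queue.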
year | journal | country | edition | language |
---|---|---|---|---|
2019-05-01 | 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | | | |