RESEARCH PRODUCT

Accelerating Application Migration in HPC

Lars Nagel, Tim Süß, Stefan Lankes, Simon Pickartz, André Brinkmann, Ramy Gad

subject

Mean time between failures; Computer science; business.industry; 020206 networking & telecommunications; 02 engineering and technology; Load balancing (computing); Computer security; computer.software_genre; Shared resource; Virtual machine; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; business; computer; Computer network

description

The number of cores per node is predicted to increase rapidly in the upcoming era of exascale supercomputers. As a result, multiple applications will have to share one node and compete for the (often scarce) resources available on it. Furthermore, the growing number of hardware components decreases the mean time between failures. Application migration between nodes has been proposed as a tool to mitigate both problems: bottlenecks due to resource sharing can be addressed by load-balancing schemes that migrate applications, and hardware errors can often be tolerated by the system if faulty nodes are detected and processes are migrated ahead of time.
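
The link between component count and mean time between failures can be made concrete with a simple reliability model. The sketch below is not taken from the paper. It assumes that the N components fail independently, that each has the same constant failure rate λ, and that the failure of any single component counts as a system failure. The figures in the worked example are hypothetical.

```latex
% Assumption: N independent components, each with constant failure rate \lambda,
% arranged in series (any single component failure is a system failure).
\[
  \lambda_{\mathrm{sys}} = \sum_{i=1}^{N} \lambda = N\lambda,
  \qquad
  \mathrm{MTBF}_{\mathrm{sys}} = \frac{1}{\lambda_{\mathrm{sys}}}
                               = \frac{\mathrm{MTBF}_{\mathrm{comp}}}{N}.
\]
% Hypothetical worked example: component MTBF of 5 years (43{,}800 h), N = 100{,}000.
\[
  \mathrm{MTBF}_{\mathrm{sys}} = \frac{43\,800\ \mathrm{h}}{100\,000}
                               = 0.438\ \mathrm{h} \approx 26\ \mathrm{min}.
\]
```

Under these assumptions the system-level mean time between failures shrinks in inverse proportion to the number of components, which is why migrating processes away from nodes before they fail becomes attractive at scale.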

https://doi.org/10.1007/978-3-319-46079-6_46