
RESEARCH PRODUCT

AIOC2: A deep Q-learning approach to autonomic I/O congestion control in Lustre

Shijun Deng, André Brinkmann, Cheng Wen, Yang Wang, Lingfang Zeng

subject

Exploit; Computer Networks and Communications; Computer science; business.industry; Q-learning; Interference (wave propagation); Supercomputer; Computer Graphics and Computer-Aided Design; Theoretical Computer Science; Network congestion; Artificial Intelligence; Hardware and Architecture; Embedded system; Lustre (file system); Latency (engineering); business; Throughput (business); Software

description

Abstract: In high-performance computing systems, I/O congestion is a common problem in large-scale distributed file systems. However, current implementations mainly require administrators to manually design and tune the low-level implementation. We propose an adaptive I/O congestion control framework, named AIOC2, which not only adaptively tunes the I/O congestion control parameters but also uses deep Q-learning to train the parameters and optimize the tuning for different types of workloads from the server and the client at the same time. AIOC2 combines feedback-based dynamic I/O congestion control with deep Q-learning parameter tuning to achieve autonomic I/O congestion control, improve system I/O throughput, and thus reduce I/O latency without human intervention. Experimental results show that AIOC2 can greatly reduce the impact of I/O congestion on I/O throughput and I/O latency in Lustre clusters. Compared to existing Lustre cluster systems, AIOC2 increases write I/O throughput by 34.82% and decreases I/O latency by 26.17% on average.
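To make the idea of learning a congestion control parameter from feedback concrete, the following is a minimal sketch only. It is not the authors' implementation and does not use any Lustre interface. It swaps the deep Q-network for plain tabular Q-learning, and the environment is a hypothetical toy: a single tunable limit with five discrete levels, where the reward is highest at an assumed sweet spot (`BEST_LEVEL`) that the agent does not know in advance. All names (`LEVELS`, `ACTIONS`, `step`, `train`, `greedy_rollout`) are illustrative assumptions.

```python
import random

# Hypothetical toy model: one tunable limit with 5 discrete levels.
LEVELS = 5
ACTIONS = (-1, 0, 1)   # lower, keep, or raise the limit
BEST_LEVEL = 3         # assumed sweet spot, hidden from the agent


def step(level, action):
    """Apply an action; return (next_level, reward).

    The reward is a stand-in for measured feedback: it is 0 at the
    sweet spot and increasingly negative further away from it.
    """
    nxt = min(LEVELS - 1, max(0, level + action))
    reward = -float((nxt - BEST_LEVEL) ** 2)
    return nxt, reward


def train(episodes=500, steps=20, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(LEVELS)]
    for _ in range(episodes):
        level = rng.randrange(LEVELS)
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))          # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[level][i])
            nxt, reward = step(level, ACTIONS[a])
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + gamma * max(q[nxt])
            q[level][a] += alpha * (target - q[level][a])
            level = nxt
    return q


def greedy_rollout(q, level, steps=10):
    """Follow the learned greedy policy and return the final level."""
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[level][i])
        level, _ = step(level, ACTIONS[a])
    return level


if __name__ == "__main__":
    q_table = train()
    print([greedy_rollout(q_table, s) for s in range(LEVELS)])
```

In a deep Q-learning setting, the table `q` would be replaced by a neural network that maps observed system metrics to action values, which is what allows the approach to handle larger state spaces than this toy can represent.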

https://doi.org/10.1016/j.parco.2021.102855