Optimizing HPC Scheduling: A Hierarchical Reinforcement Learning Approach for Intelligent Job Selection and Allocation

Abstract

High-performance computing (HPC) systems face growing job-scheduling challenges as computational tasks become more complex and resources become more diverse and heterogeneous. Job allocation is critical in contemporary HPC systems because compute nodes now have greater physical resource capacity and can execute multiple jobs simultaneously. However, existing reinforcement learning (RL)-based schedulers often overlook allocation: they focus mainly on selecting suitable jobs from the job queue and leave allocation to overly simplistic policies, such as first-available allocation. The bin-packing nature of node-level scheduling in modern HPC systems calls for more refined and intelligent allocation strategies. This paper introduces HeraSched, a novel hierarchical reinforcement learning (HRL)-based scheduler that performs both intelligent job selection and intelligent job allocation.
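To make the allocation gap concrete, the following minimal sketch contrasts a first-available policy with a simple best-fit bin-packing heuristic on a toy cluster. It is an illustration only and not HeraSched's learned policy: the function names, the single "free cores" resource, and the example workload are all assumptions introduced here.

```python
from typing import Callable, List, Optional

Policy = Callable[[List[int], int], Optional[int]]


def first_available(free: List[int], request: int) -> Optional[int]:
    """Return the index of the first node with enough free cores, else None."""
    for i, cores in enumerate(free):
        if cores >= request:
            return i
    return None


def best_fit(free: List[int], request: int) -> Optional[int]:
    """Return the node that would be left with the fewest idle cores, else None."""
    candidates = [i for i, cores in enumerate(free) if cores >= request]
    if not candidates:
        return None
    return min(candidates, key=lambda i: free[i] - request)


def allocate(free: List[int], requests: List[int], policy: Policy) -> List[Optional[int]]:
    """Place requests in queue order; None means the job could not be placed."""
    free = list(free)  # work on a copy so the caller's state is untouched
    placement: List[Optional[int]] = []
    for request in requests:
        node = policy(free, request)
        if node is not None:
            free[node] -= request
        placement.append(node)
    return placement


# Hypothetical toy cluster: two nodes with 8 and 4 free cores,
# and two queued jobs requesting 4 and 8 cores.
nodes = [8, 4]
jobs = [4, 8]
print(allocate(nodes, jobs, first_available))
print(allocate(nodes, jobs, best_fit))
```

In this toy case, first-available puts the 4-core job on the 8-core node and leaves no node large enough for the 8-core job, while best-fit places both jobs. A hand-written heuristic like best-fit is only one possible refinement over first-available allocation.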

Publication
The Journal of Supercomputing
