MetaPilot: A DRL-Based Controller for Dynamic Adaptation to Shifting Scheduling Objectives in HPC Systems

Abstract

Efficient job scheduling in high-performance computing (HPC) systems requires simultaneously considering system-centric objectives, such as maximizing resource utilization, and user-centric objectives, such as minimizing job waiting times. In practice, the relative importance of these objectives is not static; it shifts with fluctuations in workload characteristics and system state. However, existing scheduling frameworks, including both traditional workload managers and reinforcement learning (RL)-based methods, typically rely on fixed policies or fixed reward functions that encode a predetermined combination of objectives. As a result, they cannot adjust their scheduling priorities as workload intensity or system conditions change. In this work, we propose MetaPilot, a deep reinforcement learning (DRL)-based controller that dynamically adapts to shifting scheduling objectives in HPC systems.

Publication
Future Generation Computer Systems
