MetaPilot: A DRL-Based Controller for Dynamic Adaptation to Shifting Scheduling Objectives in HPC Systems

Lingfei Wang, Maria A Rodriguez, Nir Lipovetzky

November 2025

Abstract

Efficient job scheduling in high-performance computing (HPC) systems necessitates the simultaneous consideration of system-centric objectives, such as maximizing resource utilization, and user-centric objectives, such as minimizing job waiting times. In practice, the relative importance of these objectives is not static, but shifts dynamically in response to fluctuations in workload characteristics and system state. However, existing scheduling frameworks — including both traditional workload managers and reinforcement learning (RL)-based methods — typically rely on fixed policies or fixed reward functions that encode a predetermined combination of objectives. As a result, they lack the flexibility to adjust their scheduling priorities as workload intensity or system conditions change. In this work, we propose MetaPilot, a deep reinforcement learning-based controller designed to enable dynamic adaptation to shifting scheduling objectives in HPC systems.

Type

Journal article

Publication

Future Generation Computer Systems
