Kubernetes v1.36 marks a major leap forward in workload-aware scheduling, building on the foundation laid in v1.35. With AI/ML and batch jobs demanding more intelligent resource management, this release introduces a cleaner architectural separation, new APIs, and advanced scheduling capabilities. Here are 10 things you need to know about the v1.36 improvements that make Kubernetes scheduling more efficient, scalable, and topology-aware.
1. Clean Separation of Workload and PodGroup APIs
In v1.35, the Workload API bundled both static templates and runtime state. v1.36 decouples these responsibilities: the Workload API now serves as a static template, while the new PodGroup API manages runtime state. This separation simplifies the scheduler's logic, since it watches only PodGroups rather than Workloads, improving performance and scalability. Controllers stamp out PodGroup instances from Workload templates, enabling per-replica sharding of status updates. This architectural shift is the cornerstone for all subsequent scheduling enhancements in this release.
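To make the split concrete, here is a minimal sketch of how a controller might stamp PodGroup instances out of a Workload template. The field names (`podGroupTemplate`, `minCount`) and the replica-naming scheme are assumptions for illustration only, not the actual v1alpha2 schema.

```python
# Illustrative sketch: a controller stamping per-replica PodGroup instances
# from a static Workload template. Field names are hypothetical.

def stamp_pod_groups(workload: dict, replicas: int) -> list[dict]:
    """Create one PodGroup manifest per replica from a Workload template."""
    template = workload["spec"]["podGroupTemplate"]
    name = workload["metadata"]["name"]
    groups = []
    for i in range(replicas):
        groups.append({
            "apiVersion": "scheduling.k8s.io/v1alpha2",
            "kind": "PodGroup",
            "metadata": {
                "name": f"{name}-{i}",
                # Owner reference ties runtime state back to the template.
                "ownerReferences": [{"kind": "Workload", "name": name}],
            },
            # Spec is copied from the static template; status is tracked
            # per instance, which is what enables per-replica sharding.
            "spec": dict(template),
        })
    return groups

workload = {
    "apiVersion": "scheduling.k8s.io/v1alpha2",
    "kind": "Workload",
    "metadata": {"name": "training-job"},
    "spec": {"podGroupTemplate": {"minCount": 4}},
}
groups = stamp_pod_groups(workload, replicas=2)
print([g["metadata"]["name"] for g in groups])  # ['training-job-0', 'training-job-1']
```

Because each PodGroup is its own object, status updates for one replica never contend with updates for another.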
2. Introduction of the PodGroup Scheduling Cycle
The kube-scheduler now features a dedicated PodGroup scheduling cycle that can process an entire group of Pods atomically. Instead of scheduling Pods one by one, the scheduler evaluates the group as a whole, checking gang constraints (e.g., minimum Pod count) before committing any Pod. This atomic approach prevents partially scheduled gangs, where a few Pods hold resources that the rest of the group can never join, and it paves the way for future enhancements like batch-level optimizations. The new cycle integrates with existing scheduling plugins.
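The atomic check can be illustrated with a toy simulation: place every Pod tentatively against a copy of the cluster state, and commit only if the gang constraint holds. This is a pure-Python sketch of the idea, not the kube-scheduler's actual algorithm.

```python
# Toy simulation of an atomic gang-scheduling cycle: either enough Pods in
# the group receive nodes, or none are committed at all.

def schedule_gang(pod_cpu_requests: dict, node_free_cpu: dict, min_count: int):
    """Return pod -> node assignments, or None if the gang constraint fails."""
    free = dict(node_free_cpu)          # work on a copy: commit nothing early
    assignments = {}
    for pod, cpu in pod_cpu_requests.items():
        node = next((n for n, f in free.items() if f >= cpu), None)
        if node is None:
            break                       # this Pod cannot be placed
        free[node] -= cpu
        assignments[pod] = node
    if len(assignments) < min_count:
        return None                     # gang constraint unmet: commit no Pod
    return assignments

nodes = {"node-a": 4.0, "node-b": 2.0}
gang = {"worker-0": 2.0, "worker-1": 2.0, "worker-2": 2.0}
print(schedule_gang(gang, nodes, min_count=3))
# {'worker-0': 'node-a', 'worker-1': 'node-a', 'worker-2': 'node-b'}
print(schedule_gang(gang, {"node-a": 2.0}, min_count=3))  # None
```

The key design point is that all placement happens against a scratch copy of node capacity, so a failed gang leaves the cluster state untouched.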
3. Streamlined Scheduler Performance with Decoupled APIs
Because the scheduler reads scheduling info directly from the PodGroup instead of parsing the Workload object, it no longer needs to watch Workload resources at all. This reduces CPU and memory overhead, especially in clusters with many Workloads. The decoupling also allows the Workload and PodGroup controllers to scale independently. For large-scale batch jobs, this means faster scheduling decisions and lower latency, which is critical for time-sensitive AI training workloads.
4. Topology-Aware Scheduling – First Iteration
v1.36 debuts topology-aware scheduling for PodGroups, enabling the scheduler to optimize placements based on node locality (e.g., rack, zone). While still early, this first iteration allows administrators to define topology constraints in the PodGroup spec. For example, you can ensure that all Pods of a gang are placed within the same failure domain or spread across zones. This functionality is crucial for high-performance computing and data-intensive jobs that benefit from low-latency interconnects.
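As an illustration, a PodGroup carrying a topology constraint might look like the sketch below. The field names (`topologyConstraint`, `topologyKey`, `policy`) are hypothetical, since the exact v1alpha2 schema is not spelled out here; only `topology.kubernetes.io/zone` is a real, well-known node label key.

```python
# Hypothetical PodGroup spec expressing a topology constraint, plus a tiny
# helper showing what satisfying a "pack" policy means.

pod_group = {
    "apiVersion": "scheduling.k8s.io/v1alpha2",
    "kind": "PodGroup",
    "metadata": {"name": "hpc-gang-0"},
    "spec": {
        "minCount": 8,
        # "pack" keeps the whole gang inside one failure domain for
        # low-latency interconnects; "spread" would distribute it instead.
        "topologyConstraint": {
            "topologyKey": "topology.kubernetes.io/zone",
            "policy": "pack",
        },
    },
}

def satisfies_pack(zones_of_assigned_nodes: list) -> bool:
    """'pack' is satisfied when every Pod landed in the same domain."""
    return len(set(zones_of_assigned_nodes)) == 1

print(satisfies_pack(["zone-a", "zone-a", "zone-a"]))  # True
print(satisfies_pack(["zone-a", "zone-b"]))            # False
```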
5. Workload-Aware Preemption – First Steps
v1.36 introduces the initial phase of workload-aware preemption, where the scheduler considers the entire PodGroup’s priority before preempting individual Pods. Previously, preemption decisions were based on single-Pod priorities, which could break gang constraints. Now, the scheduler evaluates whether evicting a Pod helps (or harms) the group’s scheduling goals. This is a fundamental step toward smarter resource sharing between batch and interactive workloads.
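The core idea can be sketched in a few lines: a Pod is only a viable preemption victim if evicting it leaves its own PodGroup at or above `minCount`; otherwise the eviction would collapse the whole gang and waste more than it frees. This is illustrative logic only, not the kube-scheduler's implementation.

```python
# Sketch of workload-aware victim selection: respect each group's minCount
# when deciding which Pods may be evicted.
from collections import Counter

def viable_victims(pods: list, min_counts: dict) -> list:
    """pods: list of (pod_name, group, priority) tuples.
    min_counts: group -> minCount. Returns pods that can be evicted
    (lowest priority first) without breaking any group's gang constraint."""
    running = Counter(group for _, group, _ in pods)
    victims = []
    for name, group, priority in sorted(pods, key=lambda p: p[2]):
        if running[group] - 1 >= min_counts[group]:
            running[group] -= 1     # account for this planned eviction
            victims.append(name)
    return victims

pods = [("w-0", "gang-a", 10), ("w-1", "gang-a", 10),
        ("w-2", "gang-a", 10), ("solo-0", "batch-b", 5)]
print(viable_victims(pods, {"gang-a": 3, "batch-b": 0}))
# ['solo-0'] -- evicting any gang-a Pod would drop it below minCount 3
```

Note how this differs from single-Pod preemption: `solo-0` is evictable, but the higher-priority gang members are each protected by their group's constraint.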
6. Dynamic Resource Allocation (DRA) Support for PodGroups
With ResourceClaim support for Workloads and PodGroups, v1.36 unlocks Dynamic Resource Allocation (DRA) for grouped Pods. DRA allows workloads to request specialized hardware (e.g., GPUs, FPGAs) with fine-grained lifecycle management. By extending DRA to PodGroups, multiple Pods in a gang can share a single ResourceClaim or receive individual claims. This is especially valuable for AI training jobs that require consistent accelerator access across all workers.
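A PodGroup referencing a shared claim might look like the sketch below. The `resourceClaims` list and its `mode` field are assumptions made for illustration; consult the actual v1alpha2 and DRA documentation for the real schema.

```python
# Hypothetical shape of a PodGroup whose Pods share one ResourceClaim for
# GPUs via DRA. Field names here are illustrative assumptions.

pod_group = {
    "apiVersion": "scheduling.k8s.io/v1alpha2",
    "kind": "PodGroup",
    "metadata": {"name": "trainer-gang-0"},
    "spec": {
        "minCount": 4,
        "resourceClaims": [
            {
                "name": "gpu-claim",
                # One claim shared by every worker in the gang, so all
                # workers see the same accelerator allocation. An
                # "individual" mode would instead stamp one claim per Pod.
                "mode": "shared",
            },
        ],
    },
}

assert pod_group["spec"]["resourceClaims"][0]["mode"] == "shared"
```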
7. Job Controller Integration – Phase 1
The Job controller—the standard batch workload manager—now integrates with the new Workload and PodGroup APIs. In this first phase, Jobs automatically create a Workload template and corresponding PodGroup instances, enabling gang scheduling for existing Job manifests with minimal changes. This demonstrates real-world readiness and provides a migration path for users currently relying on custom scheduling solutions. Future phases will add more advanced interactions.
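For reference, a plain batch/v1 Job such as the one below (a real, current manifest shape) is all a user needs to write; per the phase-1 integration described above, the controller derives the Workload template and PodGroup instances from it automatically. The image name is a placeholder.

```python
# A standard batch/v1 Job manifest, expressed as a Python dict. Nothing
# scheduling-specific is added here: the gang behavior comes from the
# controller-side integration, not from the manifest.

job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "dist-train"},
    "spec": {
        "parallelism": 4,   # four workers that benefit from running together
        "completions": 4,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "worker",
                    "image": "example.com/trainer:latest",  # placeholder
                }],
            },
        },
    },
}
```

The point of phase 1 is exactly this: existing manifests keep working, and gang scheduling is layered on behind the scenes.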
8. Replacement of v1alpha1 API with v1alpha2
The new Workload and PodGroup APIs live under scheduling.k8s.io/v1alpha2, completely replacing the previous v1alpha1 version. The old API (where PodGroup status was embedded in the Workload) has been removed outright, as alpha APIs carry no deprecation window. Users must update their manifests to the new version. The migration is straightforward: static templates now define PodGroup specs, and controllers handle runtime instances. This API evolution cleans up the experimental foundations and sets a stable path forward.
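A rough migration sketch: split a v1alpha1 Workload (template plus embedded runtime status) into a v1alpha2 static Workload and a standalone PodGroup. The field names on both sides are illustrative assumptions, not the real schemas.

```python
# Illustrative v1alpha1 -> v1alpha2 migration: the static template stays on
# the Workload, the runtime state moves to a separate PodGroup object.

def migrate_v1alpha1(old: dict) -> tuple:
    name = old["metadata"]["name"]
    workload = {
        "apiVersion": "scheduling.k8s.io/v1alpha2",
        "kind": "Workload",
        "metadata": {"name": name},
        # Static template only: no runtime status lives here anymore.
        "spec": {"podGroupTemplate": old["spec"]["podGroup"]},
    }
    pod_group = {
        "apiVersion": "scheduling.k8s.io/v1alpha2",
        "kind": "PodGroup",
        "metadata": {"name": f"{name}-0"},
        "spec": dict(old["spec"]["podGroup"]),
        # Runtime state now belongs to the PodGroup's own status.
        "status": old.get("status", {}),
    }
    return workload, pod_group

old = {
    "apiVersion": "scheduling.k8s.io/v1alpha1",
    "kind": "Workload",
    "metadata": {"name": "training-job"},
    "spec": {"podGroup": {"minCount": 4}},
    "status": {"scheduled": 2},
}
new_workload, new_group = migrate_v1alpha1(old)
```

After the split, the Workload carries no status at all, which is precisely what lets the scheduler stop watching it.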
9. Enhanced Gang Scheduling Through PodGroup Templates
Gang scheduling, which requires a minimum number of Pods to run simultaneously, is now defined declaratively in the Workload template. Each template can specify a podGroupTemplate with a minCount field (e.g., 4). The scheduler ensures that all Pods in the group can be scheduled before assigning any, avoiding deadlocks where partial allocations waste resources. The template approach also simplifies configuration updates: change the template, and new PodGroups automatically adopt the new policy.
10. Real-World Testing and Community Feedback
Kubernetes v1.36’s scheduling enhancements have been tested with real AI/ML workloads (including distributed training frameworks like PyTorch and TensorFlow). Early adopters report improved scheduling success rates and reduced job completion times. The community has provided feedback on the new APIs, leading to refinements in the PodGroup status conditions and error reporting. This iterative development ensures that the features are robust enough for production use, with more improvements planned for upcoming releases.
In conclusion, Kubernetes v1.36 represents a significant evolution in workload-aware scheduling, shifting from monolithic APIs to a decoupled, scalable architecture. The new PodGroup API, atomic scheduling cycles, and first-class support for gang scheduling, topology awareness, and preemption bring Kubernetes closer to meeting the demanding requirements of AI/ML and batch workloads. Whether you’re running distributed training or large-scale data processing, these enhancements provide the tools you need to optimize resource utilization and job performance.