Search before asking
- I had searched in the issues and found no similar feature requirement.
Description
Summary
We propose adding KubeRay Federation capability to enable RayCluster deployment and auto-scaling across multiple Kubernetes clusters. This feature allows organizations to unify fragmented compute resources (especially GPUs) distributed across different availability zones, cloud providers, or physical locations into a single logical Ray cluster.
Motivation
The Problem
In large-scale production environments, compute resources are often distributed across:
- Multiple availability zones (AZs) within a single cloud provider
- Multiple cloud vendors (for supply assurance and cost optimization)
- On-premise and cloud hybrid environments
Current KubeRay is limited to single Kubernetes cluster deployment, which creates several operational challenges:
1. Fragmented GPU Resources
Organizations procure GPUs from multiple cloud vendors to ensure supply. These resources are physically isolated across different K8s clusters, making it impossible to:
- create a single RayCluster spanning multiple AZs
- dynamically scale workers across cluster boundaries
- efficiently utilize all available resources as a unified pool (for scheduling user Ray jobs within a single Ray cluster)
2. Operational Overhead with Current Workarounds
To work around single-cluster limitations, teams currently:
- Deploy multiple smaller Ray clusters (e.g., 20 clusters × 10 GPUs instead of one cluster × 200 GPUs) and split datasets into multiple Ray Data jobs for large-scale computing.
- These jobs are isolated and cannot be balanced across Ray clusters, which may lead to long-tail performance issues.
- Manually start/stop clusters across different AZs when resources are unavailable
3. Virtual Kubelet Limitations
Some organizations use Virtual Kubelet to aggregate resources into a single entry-point K8s cluster. However, this approach has scalability issues:
- The entry-point K8s control plane must handle all pods, creating bottlenecks
- During peak scaling (e.g., scaling from 10K to 400K+ cores within an hour), the control plane may fail to keep up
- Single cluster pod capacity limits require multiple entry-point clusters, increasing deployment complexity
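To gauge the pressure on a single entry-point control plane, here is a back-of-envelope calculation using the scaling numbers above; the cores-per-pod figure is an assumption, not a measured value:

```python
# Back-of-envelope estimate (hypothetical pod size) for the Virtual Kubelet
# bottleneck: pods a single entry-point control plane must create per second
# when scaling from 10K to 400K cores within one hour.

CORES_START, CORES_PEAK = 10_000, 400_000
CORES_PER_POD = 8          # assumption; actual Ray worker pod sizes vary
SCALE_WINDOW_S = 3600      # "within an hour"

pods_needed = (CORES_PEAK - CORES_START) // CORES_PER_POD
print(pods_needed)                             # 48750 new pods
print(round(pods_needed / SCALE_WINDOW_S, 1))  # ~13.5 pod creations/second, sustained
```

Sustaining double-digit pod creations per second for an hour, on top of scheduling and status updates for all existing pods, is exactly where a single control plane starts to fall behind.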
Use case
Use Cases: Large-Scale Elastic Batch Inference
In large-scale batch inference scenarios, workloads often require hundreds or even thousands of GPUs to process massive datasets within tight time windows. However, because GPUs are procured from multiple cloud vendors and availability zones, these resources are scattered across different Kubernetes clusters.
Current Pain Points:
1. Forced Multi-Cluster Deployment
Users are forced to create multiple RayCluster instances across different K8s clusters and manually partition datasets and submit jobs separately to each cluster.
2. Dynamic Resource Availability Creates Imbalance
In elastic resource environments, idle capacity curves vary across cloud vendors and AZs over time — compute power is constantly fluctuating.
- At time T1: Cloud-A/AZ1 has 200 idle GPUs, Cloud-B/AZ1 has 50 idle GPUs
- At time T2: Cloud-A/AZ1 drops to 80 GPUs (preemption), Cloud-B/AZ1 scales to 300 GPUs
Once data is partitioned and distributed to separate clusters, Ray Data cannot perform cross-cluster load balancing. The cluster with fewer resources becomes a bottleneck while others sit idle, which causes severe long-tail effects.
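The long-tail effect can be quantified with a toy calculation. GPU counts follow the T1/T2 example above; the total block count and the proportional 80/20 split are hypothetical:

```python
# Toy illustration: data was partitioned at T1 proportionally to idle GPUs
# (Cloud-A: 200, Cloud-B: 50 -> an 80/20 split), but capacity shifted by T2
# (Cloud-A: 80 after preemption, Cloud-B: 300 after scale-up).

TOTAL_BLOCKS = 1000                  # hypothetical dataset size in Ray Data blocks
blocks_a, blocks_b = 800, 200        # static 80/20 partition decided at T1
gpus_a_t2, gpus_b_t2 = 80, 300       # capacity at T2

# Each isolated cluster finishes only its own pre-assigned blocks;
# the slower one determines overall completion time.
static_makespan = max(blocks_a / gpus_a_t2, blocks_b / gpus_b_t2)

# A unified pool keeps every GPU busy until all blocks are done.
pooled_makespan = TOTAL_BLOCKS / (gpus_a_t2 + gpus_b_t2)

print(static_makespan)               # 10.0 -- Cloud-A is the straggler, Cloud-B idles
print(round(pooled_makespan, 2))     # 2.63 with cross-cluster balancing
```

Under these (illustrative) numbers, the static partition takes roughly 3.8x longer than a unified pool, even though the total GPU count is identical.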
3. Operational Complexity
Managing multiple RayClusters creates a significant operational burden:
- Deployment complexity: Manually create/configure clusters across different K8s environments
- Data partitioning: Estimate resource availability and split data accordingly (often inaccurate)
- Monitoring overhead: Track job progress across multiple dashboards
- Failure handling: Manually redistribute work when one cluster fails or is preempted
Desired State with KubeRay Federation:
- Single job submission: one RayCluster, one dataset, one job
- Automatic load balancing: Ray Data dynamically schedules tasks to available resources across all AZs
- Preemption resilience: When GPUs are reclaimed in one AZ, work automatically shifts to others
- Optimal resource utilization: No idle waiting; work flows to wherever capacity exists
- Simplified operations: Single cluster to deploy, monitor, and maintain
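To make the desired state concrete, below is a purely hypothetical sketch of what a federated RayCluster manifest might look like. The `memberCluster` / `memberClusters` fields do not exist in KubeRay today; they only illustrate the proposed direction:

```yaml
# Hypothetical sketch only -- these federation fields are NOT part of KubeRay.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: federated-batch-inference
spec:
  headGroupSpec:
    # Hypothetical: pin the head node to a "home" member cluster
    memberCluster: cloud-a-az1
    template: {}          # usual pod template, elided for brevity
  workerGroupSpecs:
    - groupName: gpu-workers
      minReplicas: 0
      maxReplicas: 400
      # Hypothetical: let the autoscaler place workers in whichever
      # member cluster currently has idle GPUs
      memberClusters:
        - cloud-a-az1
        - cloud-b-az1
      template: {}        # usual pod template, elided for brevity
```

The exact API shape (a new field on RayCluster vs. a separate federation CRD) is an open design question for the community.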
Scope & Suitable Workloads
Federated KubeRay is designed to enable unified Ray clusters across multiple Kubernetes clusters, AZs, or cloud providers. While it brings significant benefits, it is important to clarify when it is suitable and when it is not.
Suitable Scenarios
- Compute-Intensive Workloads
- Model inference (OCR, ASR, embeddings, LLM inference)
- Large-scale batch processing where computation dominates communication
- Workloads where single jobs require hundreds to thousands of GPUs across clusters
- Elastic / Multi-Cloud Resource Environments
- Resources are scattered across multiple cloud vendors or availability zones
- GPU availability fluctuates due to preemption or autoscaling
- Desire to dynamically balance workloads via Ray-level task/actor scheduling across multiple Kubernetes clusters
Scenarios Where Federation May Not Be Ideal
- IO-Bound or Lightweight Tasks
- Simple filtering, text transformations, or small-scale ETL
- When network latency dominates execution time, local execution is more efficient
- Workloads with Large-Scale Shuffle
- Applications that require frequent all-to-all data exchanges between tasks
- Examples: large-scale joins, sorts, or other shuffle-heavy distributed algorithms
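A rough way to see why shuffle-heavy workloads suffer: under a uniform all-to-all exchange, the share of transfers crossing the cluster boundary depends only on how workers are split between the two sides. A minimal sketch (the split sizes are hypothetical):

```python
# Rough estimate, assuming a uniform all-to-all shuffle, of how much traffic
# crosses the AZ boundary when N workers are split between two clusters.

def cross_az_fraction(n_a: int, n_b: int) -> float:
    """Fraction of worker-to-worker transfers whose source and destination
    sit in different clusters (self-transfers excluded)."""
    n = n_a + n_b
    return 2 * n_a * n_b / (n * (n - 1))

print(round(cross_az_fraction(50, 50), 3))   # ~0.505: half the shuffle crosses AZs
print(round(cross_az_fraction(90, 10), 3))   # ~0.182 with a skewed split
```

With an even split, roughly half of all shuffle bytes traverse the (slow, possibly metered) cross-AZ link, which is why such workloads are better kept within a single cluster.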
Feel free to suggest additional scenarios.
Design Considerations & Community Concerns
1. Cross-AZ Communication Overhead
A key concern is the potential communication overhead when Ray workers are distributed across multiple availability zones (AZs) or Kubernetes clusters.
In practice, the impact depends heavily on the workload characteristics.
For compute-heavy workloads such as model inference, the computation time per data block is significantly larger than the data transfer cost. In the current LLM era, compute-intensive workloads are becoming the dominant use case, which motivates exploring cross-cluster federation.
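As a back-of-envelope sketch of that claim (the block size, cross-AZ bandwidth, and per-block compute time are all assumed, not measured):

```python
# Illustrative ratio (hypothetical numbers) of per-block compute time to
# cross-AZ transfer time for a compute-heavy workload such as LLM inference.

BLOCK_MB = 100                 # size of one data block (assumption)
CROSS_AZ_GBPS = 1.0            # usable cross-AZ bandwidth (assumption)
COMPUTE_S_PER_BLOCK = 60.0     # e.g. batched LLM inference on the block (assumption)

# MB -> megabits -> seconds on the wire
transfer_s = BLOCK_MB * 8 / (CROSS_AZ_GBPS * 1000)
print(transfer_s)                               # 0.8 s to move the block
print(round(COMPUTE_S_PER_BLOCK / transfer_s))  # compute dominates ~75x
```

When compute dominates transfer by one to two orders of magnitude like this, the cross-AZ hop amortizes to noise; for IO-bound tasks the same ratio inverts, matching the "not ideal" scenarios above.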
Prior Art & References
We noticed that Tencent may have implemented a similar approach, which serves as a useful reference: ray-forward-2025
We're happy to share more details about our production use cases. Looking forward to the community's feedback!
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!