
[Feature] KubeRay Federation: Multi-Cluster RayCluster Deployment and AutoScaling #4561

@yuchen-ecnu


Search before asking

  • I have searched the existing issues and found no similar feature request.

Description

Summary

We propose adding KubeRay Federation capability to enable RayCluster deployment and auto-scaling across multiple Kubernetes clusters. This feature allows organizations to unify fragmented compute resources (especially GPUs) distributed across different availability zones, cloud providers, or physical locations into a single logical Ray cluster.

Motivation

The Problem

In large-scale production environments, compute resources are often distributed across:

  • Multiple availability zones (AZs) within a single cloud provider
  • Multiple cloud vendors (for supply assurance and cost optimization)
  • On-premise and cloud hybrid environments

KubeRay is currently limited to deployment within a single Kubernetes cluster, which creates several operational challenges:

1. Fragmented GPU Resources

Organizations procure GPUs from multiple cloud vendors to ensure supply. These resources are physically isolated across different K8s clusters, making it impossible to:

  • create a single RayCluster spanning multiple AZs
  • dynamically scale workers across cluster boundaries
  • efficiently utilize all available resources as a unified pool (e.g., scheduling a user's Ray jobs within a single Ray cluster)

2. Operational Overhead with Current Workarounds

To work around single-cluster limitations, teams currently:

  • Deploy multiple smaller Ray clusters (e.g., 20 clusters × 10 GPUs instead of one cluster × 200 GPUs) and split datasets into multiple Ray Data jobs for large-scale computing.
    • These jobs are isolated and cannot be balanced across Ray clusters, which may lead to long-tail performance issues.
  • Manually start/stop clusters across different AZs when resources are unavailable
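
The long-tail cost of this static split can be illustrated with a small back-of-envelope sketch. All numbers below are hypothetical (the 20 × 10 GPU layout mirrors the example above; `BLOCK_MINUTES`, `makespan`, etc. are illustrative names, not part of any KubeRay or Ray API):

```python
import math

# Back-of-envelope comparison: 20 static Ray clusters vs. one unified pool.
# All numbers are hypothetical, for illustration only.

BLOCK_MINUTES = 1.0        # compute time for one data block on one GPU
TOTAL_BLOCKS = 4000        # dataset split into 4000 equal blocks

# Static split: 20 clusters x 10 GPUs, dataset partitioned evenly up front.
clusters = [10] * 20
blocks_per_cluster = TOTAL_BLOCKS // len(clusters)   # 200 blocks each

# Suppose one cluster is preempted down to 2 GPUs after partitioning.
clusters[0] = 2

def makespan(blocks, gpus):
    """Time for `gpus` workers to drain `blocks` equal blocks."""
    return math.ceil(blocks / gpus) * BLOCK_MINUTES

# Static split: the job finishes only when the slowest cluster finishes.
static_finish = max(makespan(blocks_per_cluster, g) for g in clusters)
print(static_finish)   # 100.0 minutes — the preempted cluster is the long tail

# Unified pool: all surviving GPUs drain one shared queue of blocks.
pooled_gpus = sum(clusters)                          # 192 GPUs
pooled_finish = makespan(TOTAL_BLOCKS, pooled_gpus)
print(pooled_finish)   # 21.0 minutes
```

The point is not the exact numbers but the shape: with static partitions the makespan is set by the unluckiest cluster, while a shared queue degrades only in proportion to the total GPUs lost.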

3. Virtual Kubelet Limitations

Some organizations use Virtual Kubelet to aggregate resources into a single entry-point K8s cluster. However, this approach has scalability issues:

  • The entry-point K8s control plane must handle all pods, creating bottlenecks
  • During peak scaling (e.g., scaling from 10K to 400K+ cores within an hour), the control plane may fail to keep up
  • Single cluster pod capacity limits require multiple entry-point clusters, increasing deployment complexity

Use case

Use Cases: Large-Scale Elastic Batch Inference

In large-scale batch inference scenarios, workloads often require hundreds or even thousands of GPUs to process massive datasets within tight time windows. However, because GPUs are procured from multiple cloud vendors and availability zones, these GPU resources are scattered across different Kubernetes clusters.


Current Pain Points:

1. Forced Multi-Cluster Deployment

Users are forced to create multiple RayCluster instances across different K8s clusters, manually partition datasets, and submit jobs separately to each cluster.

2. Dynamic Resource Availability Creates Imbalance

In elastic resource environments, idle capacity curves vary across cloud vendors and AZs over time — compute power is constantly fluctuating.

  • At time T1: Cloud-A/AZ1 has 200 idle GPUs, Cloud-B/AZ1 has 50 idle GPUs
  • At time T2: Cloud-A/AZ1 drops to 80 GPUs (preemption), Cloud-B/AZ1 scales to 300 GPUs

Once data is partitioned and distributed to separate clusters, Ray Data cannot perform cross-cluster load balancing. The cluster with fewer resources becomes a bottleneck while others sit idle, causing severe long-tail effects.
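
The T1/T2 example can be made concrete with a short sketch. The split ratio and per-block timing are hypothetical, chosen only to show the mechanism:

```python
# T1: data is partitioned proportionally to the GPUs visible at submission time.
t1_gpus = {"cloud_a_az1": 200, "cloud_b_az1": 50}
total_blocks = 1000
split = {az: total_blocks * g // sum(t1_gpus.values()) for az, g in t1_gpus.items()}
# split == {"cloud_a_az1": 800, "cloud_b_az1": 200}

# T2: capacity flips, but the partitions are already fixed per cluster.
t2_gpus = {"cloud_a_az1": 80, "cloud_b_az1": 300}

minutes_per_block = 1.0
finish = {az: split[az] / t2_gpus[az] * minutes_per_block for az in split}
# cloud_a_az1: 800 blocks on  80 GPUs -> 10.0 minutes (the long tail)
# cloud_b_az1: 200 blocks on 300 GPUs -> ~0.67 minutes, then sits idle

# With one federated cluster, all 380 GPUs drain a single shared queue instead:
federated_finish = total_blocks / sum(t2_gpus.values()) * minutes_per_block
# -> ~2.6 minutes
```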

3. Operational Complexity

Managing multiple RayClusters creates significant burden:

  • Deployment complexity: Manually create/configure clusters across different K8s environments
  • Data partitioning: Estimate resource availability and split data accordingly (often inaccurate)
  • Monitoring overhead: Track job progress across multiple dashboards
  • Failure handling: Manually redistribute work when one cluster fails or is preempted

Desired State with KubeRay Federation:

  • Single job submission: one RayCluster, one dataset, one job
  • Automatic load balancing: Ray Data dynamically schedules tasks to available resources across all AZs
  • Preemption resilience: When GPUs are reclaimed in one AZ, work automatically shifts to others
  • Optimal resource utilization: No idle waiting; work flows to wherever capacity exists
  • Simplified operations: Single cluster to deploy, monitor, and maintain


Scope & Suitable Workloads

Federated KubeRay is designed to enable unified Ray clusters across multiple Kubernetes clusters, AZs, or cloud providers. While it brings significant benefits, it is important to clarify when it is suitable and when it is not.

Suitable Scenarios

  1. Compute-Intensive Workloads
  • Model inference (OCR, ASR, embeddings, LLM inference)
  • Large-scale batch processing where computation dominates communication
  • Workloads where single jobs require hundreds to thousands of GPUs across clusters
  2. Elastic / Multi-Cloud Resource Environments
  • Resources are scattered across multiple cloud vendors or availability zones
  • GPU availability fluctuates due to preemption or autoscaling
  • A desire to dynamically balance workloads with Ray-level tasks/actors across multiple Kubernetes clusters

Scenarios Where Federation May Not Be Ideal

  1. IO-Bound or Lightweight Tasks
  • Simple filtering, text transformations, or small-scale ETL
  • When network latency dominates execution time, local execution is more efficient
  2. Workloads with Large-Scale Shuffle
  • Applications that require frequent all-to-all data exchanges between tasks
  • Examples: large-scale joins, sorts, or other shuffle-heavy distributed algorithms

Feel free to suggest additional scenarios.

Design Considerations & Community Concerns

1. Cross-AZ Communication Overhead

A key concern is the potential communication overhead when Ray workers are distributed across multiple availability zones (AZs) or Kubernetes clusters.
In practice, the impact depends heavily on the workload characteristics.

For compute-heavy workloads such as model inference (OCR, ASR, embeddings, LLM inference), the computation time per data block is significantly larger than the data transfer cost. In the current LLM era, compute-intensive workloads are becoming the dominant use case, which motivates exploring cross-cluster federation.
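
A rough model makes the dependence on workload shape explicit. The block size, bandwidth, and task durations below are assumed for illustration, not measured:

```python
# Rough model of when cross-AZ transfer cost matters; all numbers hypothetical.
block_mb = 64                      # assumed size of one data block
cross_az_mb_per_s = 100            # assumed usable cross-AZ bandwidth per worker
transfer_s = block_mb / cross_az_mb_per_s          # 0.64 s to move one block

# Compute-heavy task (e.g., LLM inference on the block): tens of seconds.
compute_heavy_s = 30.0
# Lightweight task (e.g., simple filtering): well under a second.
io_light_s = 0.1

# Fraction of end-to-end time spent on transfer:
overhead_heavy = transfer_s / (compute_heavy_s + transfer_s)   # ~2%, negligible
overhead_light = transfer_s / (io_light_s + transfer_s)        # ~86%, dominates
```

Under these assumptions, cross-AZ transfer is noise for compute-heavy blocks but dominates IO-light ones, which matches the "Suitable Scenarios" split above.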

Prior Art & References

We noticed that Tencent may have implemented a similar approach, which serves as a useful reference: ray-forward-2025

We're happy to share more details about our production use cases. Looking forward to the community's feedback!

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
