
[Feature] KubeRay Federation: Multi-Cluster RayCluster Deployment and AutoScaling #4561

@yuchen-ecnu


Search before asking

  • I have searched the existing issues and found no similar feature request.

Description

Summary

We propose adding KubeRay Federation capability to enable RayCluster deployment and auto-scaling across multiple Kubernetes clusters. This feature allows organizations to unify fragmented compute resources (especially GPUs) distributed across different availability zones, cloud providers, or physical locations into a single logical Ray cluster.

Motivation

The Problem

In large-scale production environments, compute resources are often distributed across:

  • Multiple availability zones (AZs) within a single cloud provider
  • Multiple cloud vendors (for supply assurance and cost optimization)
  • On-premise and cloud hybrid environments

KubeRay is currently limited to deployment within a single Kubernetes cluster, which creates several operational challenges:

1. Fragmented GPU Resources

Organizations procure GPUs from multiple cloud vendors to ensure supply. These resources are physically isolated across different K8s clusters, making it impossible to:

  • create a single RayCluster spanning multiple AZs
  • dynamically scale workers across cluster boundaries
  • efficiently utilize all available resources as a unified pool (e.g., scheduling a user's Ray jobs within a single Ray cluster)

2. Operational Overhead with Current Workarounds

To work around single-cluster limitations, teams currently:

  • Deploy multiple smaller Ray clusters (e.g., 20 clusters × 10 GPUs instead of one cluster × 200 GPUs) and split datasets into multiple Ray Data jobs for large-scale computing.
    • These jobs are isolated and cannot be balanced across Ray clusters, which may lead to long-tail performance issues.
  • Manually start/stop clusters across different AZs when resources are unavailable
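
The long-tail cost of this static split can be illustrated with a small back-of-envelope sketch. All numbers below are hypothetical (the 20 × 10 GPU layout mirrors the example above; `BLOCK_MINUTES`, `makespan`, etc. are illustrative names, not part of any KubeRay or Ray API):

```python
import math

# Back-of-envelope comparison: 20 static Ray clusters vs. one unified pool.
# All numbers are hypothetical, for illustration only.

BLOCK_MINUTES = 1.0        # compute time for one data block on one GPU
TOTAL_BLOCKS = 4000        # dataset split into 4000 equal blocks

# Static split: 20 clusters x 10 GPUs, dataset partitioned evenly up front.
clusters = [10] * 20
blocks_per_cluster = TOTAL_BLOCKS // len(clusters)   # 200 blocks each

# Suppose one cluster is preempted down to 2 GPUs after partitioning.
clusters[0] = 2

def makespan(blocks, gpus):
    """Time for `gpus` workers to drain `blocks` equal blocks."""
    return math.ceil(blocks / gpus) * BLOCK_MINUTES

# Static split: the job finishes only when the slowest cluster finishes.
static_finish = max(makespan(blocks_per_cluster, g) for g in clusters)
print(static_finish)   # 100.0 minutes — the preempted cluster is the long tail

# Unified pool: all surviving GPUs drain one shared queue of blocks.
pooled_gpus = sum(clusters)                          # 192 GPUs
pooled_finish = makespan(TOTAL_BLOCKS, pooled_gpus)
print(pooled_finish)   # 21.0 minutes
```

The point is not the exact numbers but the shape: with static partitions the makespan is set by the unluckiest cluster, while a shared queue degrades only in proportion to the total GPUs lost.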

3. Virtual Kubelet Limitations

Some organizations use Virtual Kubelet to aggregate resources into a single entry-point K8s cluster. However, this approach has scalability issues:

  • The entry-point K8s control plane must handle all pods, creating bottlenecks
  • During peak scaling (e.g., scaling from 10K to 400K+ cores within an hour), the control plane may fail to keep up
  • Single cluster pod capacity limits require multiple entry-point clusters, increasing deployment complexity

Use case

Use Cases: Large-Scale Elastic Batch Inference

In large-scale batch inference scenarios, workloads often require hundreds or even thousands of GPUs to process massive datasets within tight time windows. However, because GPUs are procured from multiple cloud vendors and availability zones, these GPU resources are scattered across different Kubernetes clusters.


Current Pain Points:

1. Forced Multi-Cluster Deployment

Users are forced to create multiple RayCluster instances across different K8s clusters, manually partition datasets, and submit jobs separately to each cluster.

2. Dynamic Resource Availability Creates Imbalance

In elastic resource environments, idle capacity curves vary across cloud vendors and AZs over time — compute power is constantly fluctuating.

  • At time T1: Cloud-A/AZ1 has 200 idle GPUs, Cloud-B/AZ1 has 50 idle GPUs
  • At time T2: Cloud-A/AZ1 drops to 80 GPUs (preemption), Cloud-B/AZ1 scales to 300 GPUs

Once data is partitioned and distributed to separate clusters, Ray Data cannot perform cross-cluster load balancing. The cluster with fewer resources becomes a bottleneck while others sit idle, causing severe long-tail effects.
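
The T1/T2 example can be made concrete with a short sketch. The split ratio and per-block timing are hypothetical, chosen only to show the mechanism:

```python
# T1: data is partitioned proportionally to the GPUs visible at submission time.
t1_gpus = {"cloud_a_az1": 200, "cloud_b_az1": 50}
total_blocks = 1000
split = {az: total_blocks * g // sum(t1_gpus.values()) for az, g in t1_gpus.items()}
# split == {"cloud_a_az1": 800, "cloud_b_az1": 200}

# T2: capacity flips, but the partitions are already fixed per cluster.
t2_gpus = {"cloud_a_az1": 80, "cloud_b_az1": 300}

minutes_per_block = 1.0
finish = {az: split[az] / t2_gpus[az] * minutes_per_block for az in split}
# cloud_a_az1: 800 blocks on  80 GPUs -> 10.0 minutes (the long tail)
# cloud_b_az1: 200 blocks on 300 GPUs -> ~0.67 minutes, then sits idle

# With one federated cluster, all 380 GPUs drain a single shared queue instead:
federated_finish = total_blocks / sum(t2_gpus.values()) * minutes_per_block
# -> ~2.6 minutes
```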

3. Operational Complexity

Managing multiple RayClusters creates significant burden:

  • Deployment complexity: Manually create/configure clusters across different K8s environments
  • Data partitioning: Estimate resource availability and split data accordingly (often inaccurate)
  • Monitoring overhead: Track job progress across multiple dashboards
  • Failure handling: Manually redistribute work when one cluster fails or is preempted

Desired State with KubeRay Federation:

  • Single job submission: one RayCluster, one dataset, one job
  • Automatic load balancing: Ray Data dynamically schedules tasks to available resources across all AZs
  • Preemption resilience: When GPUs are reclaimed in one AZ, work automatically shifts to others
  • Optimal resource utilization: No idle waiting; work flows to wherever capacity exists
  • Simplified operations: Single cluster to deploy, monitor, and maintain


Scope & Suitable Workloads

Federated KubeRay is designed to enable unified Ray clusters across multiple Kubernetes clusters, AZs, or cloud providers. While it brings significant benefits, it is important to clarify when it is suitable and when it is not.

Suitable Scenarios

  1. Compute-Intensive Workloads
  • Model inference (OCR, ASR, embeddings, LLM inference)
  • Large-scale batch processing where computation dominates communication
  • Workloads where single jobs require hundreds to thousands of GPUs across clusters
  2. Elastic / Multi-Cloud Resource Environments
  • Resources are scattered across multiple cloud vendors or availability zones
  • GPU availability fluctuates due to preemption or autoscaling
  • A desire to dynamically balance workloads with Ray-level tasks/actors across multiple Kubernetes clusters

Scenarios Where Federation May Not Be Ideal

  1. IO-Bound or Lightweight Tasks
  • Simple filtering, text transformations, or small-scale ETL
  • When network latency dominates execution time, local execution is more efficient
  2. Workloads with Large-Scale Shuffle
  • Applications that require frequent all-to-all data exchanges between tasks
  • Examples: large-scale joins, sorts, or other shuffle-heavy distributed algorithms

Feel free to suggest additional scenarios.

Design Considerations & Community Concerns

1. Cross-AZ Communication Overhead

A key concern is the potential communication overhead when Ray workers are distributed across multiple availability zones (AZs) or Kubernetes clusters.
In practice, the impact depends heavily on the workload characteristics.

For compute-heavy workloads such as model inference (OCR, ASR, embeddings, LLM inference), the computation time per data block is significantly larger than the data transfer cost. In the current LLM era, compute-intensive workloads are becoming the dominant use case, which motivates exploring cross-cluster federation.
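
A rough model makes the dependence on workload shape explicit. The block size, bandwidth, and task durations below are assumed for illustration, not measured:

```python
# Rough model of when cross-AZ transfer cost matters; all numbers hypothetical.
block_mb = 64                      # assumed size of one data block
cross_az_mb_per_s = 100            # assumed usable cross-AZ bandwidth per worker
transfer_s = block_mb / cross_az_mb_per_s          # 0.64 s to move one block

# Compute-heavy task (e.g., LLM inference on the block): tens of seconds.
compute_heavy_s = 30.0
# Lightweight task (e.g., simple filtering): well under a second.
io_light_s = 0.1

# Fraction of end-to-end time spent on transfer:
overhead_heavy = transfer_s / (compute_heavy_s + transfer_s)   # ~2%, negligible
overhead_light = transfer_s / (io_light_s + transfer_s)        # ~86%, dominates
```

Under these assumptions, cross-AZ transfer is noise for compute-heavy blocks but dominates IO-light ones, which matches the "Suitable Scenarios" split above.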

Prior Art & References

We noticed that Tencent may have implemented a similar approach, which serves as a useful reference: ray-forward-2025

We're happy to share more details about our production use cases. Looking forward to the community's feedback!

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
