
An Open Source Stack for AI Compute: Kubernetes + Ray + PyTorch + vLLM

By Robert Nishihara   |   May 28, 2025

Through our experience building Ray, we work with hundreds of platform teams and AI teams productionizing AI*. These teams use a range of open source frameworks to run AI workloads. Over the years, as AI workloads have evolved from classical ML to deep learning to generative AI, the underlying software stack for running and scaling these workloads has grown in complexity.

As an industry matures, a standard tech stack tends to emerge. For example, after a proliferation of deep learning frameworks, PyTorch has emerged as the winner. Among the classical big data analytics frameworks, Spark is the winner. Among container orchestrators, Kubernetes has won. And so on. When it comes to infrastructure products, the winners tend to be open source. It’s hard to imagine any of those technologies gaining much traction if they had been proprietary products.

We are in the early days of AI, and yet some pieces of a common stack are beginning to emerge. This blog post describes an early picture based on experience working with Ray users.

I’ll start by explaining the components of this stack and then illustrate it with case studies from Pinterest, Uber, and Roblox as well as from five popular open source post-training frameworks.

Background

The purpose of this tech stack is to connect a team’s AI workloads to their underlying compute resources (especially GPUs). The teams piecing together these software stacks tend to solve for a few common requirements.

  • Three key workloads. They need to support model training, model serving, and batch inference (especially multimodal data processing). Within training, pre-training and post-training are emerging as distinct workloads each with their own requirements and tech stacks.

  • Scale. AI is computationally intensive as well as data intensive. Many platform and AI teams are prioritizing training and inference on orders of magnitude more data. Supporting this scale means solving hard reliability and cost-efficiency challenges.

  • Iteration speed. The speed at which teams can test new ideas often bottlenecks progress. Notably, performance and scale are sometimes at odds with the flexibility required for rapid iteration: new ideas can be incompatible with existing infrastructure that has been optimized for performance at scale. To measure iteration speed, some AI teams track the number of experiments run per person per month.

  • Flexibility for the future. Many teams architect for flexibility because AI is changing rapidly. They don’t want to be locked into rigid frameworks that prevent the adoption of the latest models, techniques, and optimizations. They also value portability across cloud providers and hardware accelerators as many of them want to be prepared to jump on the best sources of compute that arise.

The Software Stack

We see Ray being used with nearly every tool out there, but some combinations occur repeatedly. One of the most common software stacks that we see for AI compute is Kubernetes + Ray + PyTorch + vLLM.

The software stack connects AI workloads at the top with the hardware powering those workloads at the bottom. It has three layers, each with its own responsibilities:

  • The training and inference framework: running the model efficiently on the GPU (model optimizations, model parallelism strategies, model definition, automatic differentiation).

  • The distributed compute engine: distributed computing, including task scheduling, process management, RPCs, data movement, failure handling, and workload autoscaling.

  • The container orchestrator: compute provisioning, including container management, compute autoscaling, and workload and user multitenancy.

Fig 1. Kubernetes + Ray + PyTorch + vLLM: One of the most popular software stacks for AI compute.

Training and Inference Framework

Key Responsibility: Running the model efficiently on the GPU

  • Model compilation and model-level performance optimization
  • Accelerator memory management
  • Parallelism strategies (model parallelism, data parallelism, etc.)
  • Model definition and automatic differentiation

GPUs and other accelerators are the centerpiece of AI compute, and so it is vitally important that models take full advantage of them.

At a high level, this layer is responsible for running AI models in a performant way on hardware accelerators. This layer bears the primary responsibility for interfacing with GPUs. In addition to model definition and automatic differentiation (in the case of training), these responsibilities include the execution of high-performance kernels, the optimization and compilation of models for different accelerator backends, accelerator memory management, and model parallelism (of which there are many varieties: tensor parallelism, pipeline parallelism, expert parallelism, and so forth).

A deep learning framework is at the heart of this layer. PyTorch is the dominant framework today, with Jax as a popular alternative, especially for running on TPUs. These frameworks have consolidated after an explosion of options over the past decade including Theano, Torch, Caffe, Chainer, MXNet, and TensorFlow.
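To ground what this layer provides, here is a minimal PyTorch sketch of the two capabilities at its core, model definition and automatic differentiation; the model and data are toy placeholders chosen only to keep the example self-contained.

```python
# Minimal PyTorch sketch: define a model, compute a loss, and let autograd
# produce gradients. The model and data here are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 16), torch.randn(32, 1)   # stand-in training batch
for _ in range(100):
    loss = loss_fn(model(x), y)   # forward pass
    optimizer.zero_grad()
    loss.backward()               # automatic differentiation
    optimizer.step()              # gradient update
```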

Surrounding PyTorch, we have frameworks designed for model parallelism (sharding models across multiple hardware accelerators). These frameworks can be divided along two primary axes. First, these frameworks may be designed for training or for inference (or for both). Second, they may be general purpose (meaning they can be used with any model) or specialized for transformers. Transformers are sufficiently important and complex that they deserve their own specialized frameworks.

PyTorch FSDP (fully-sharded data parallel) and DeepSpeed are general frameworks for model parallelism, primarily optimized for training though they can be used for inference as well. Nvidia Megatron is a model-parallel training framework specialized for transformers. This specialization allows for more optimized sharding strategies (e.g., different sharding strategies for different weights).
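To make the general-purpose approach concrete, below is a hedged sketch of wrapping a model in PyTorch FSDP. It assumes the script is launched with torchrun so that the distributed environment variables are set; the model and training loop are stand-ins rather than a real transformer workload.

```python
# Hedged FSDP sketch: shard parameters, gradients, and optimizer state across
# all ranks. Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(            # stand-in model, not a real transformer
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

model = FSDP(model)                     # each rank holds only a shard of the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    batch = torch.randn(8, 1024, device="cuda")   # stand-in training batch
    loss = model(batch).pow(2).mean()             # stand-in loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```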

Another set of frameworks, comprising vLLM, SGLang, and Nvidia TensorRT-LLM, consists of inference engines specialized for running transformers. Important techniques for optimizing transformer inference include continuous batching, paged attention, speculative decoding, and chunked prefill.
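For concreteness, here is a minimal sketch of offline inference with vLLM's Python API; the model name is illustrative, and optimizations such as continuous batching and paged attention are applied by the engine automatically.

```python
# Minimal vLLM sketch for offline (batch) inference. The model name is
# illustrative; any Hugging Face-compatible model identifier works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "Explain paged attention in one sentence.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```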

Distributed Compute Engine

Key Responsibility: Distributed computing

  • Task scheduling and execution
  • Process creation and lifecycle management, process coordination (RPCs)
  • Data handling (ingest and movement)
  • Workload-aware failure handling and autoscaling

At a high level, this layer is responsible for the distributed systems challenges beyond running the model efficiently on the GPU. These challenges include process lifecycle management, task scheduling across processes, efficient data movement, failure handling, and workload autoscaling.

These frameworks typically “wrap” the training and inference frameworks, invoking them as components of a broader distributed workload (e.g., Ray might schedule vLLM tasks to do batch LLM inference as part of a broader data processing pipeline).
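As a hedged sketch of that pattern, the snippet below uses Ray Data to drive a batch pipeline that calls vLLM inside a pool of GPU actors. The dataset, model name, and output path are illustrative, and the exact map_batches arguments vary somewhat across Ray versions.

```python
# Ray Data + vLLM batch inference sketch: each actor in the pool hosts one
# vLLM engine on one GPU; Ray streams batches of prompts through the pool.
import ray
from vllm import LLM, SamplingParams

class VLLMPredictor:
    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
        self.params = SamplingParams(max_tokens=128)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["response"] = [o.outputs[0].text for o in outputs]
        return batch

ds = ray.data.from_items([{"prompt": f"Describe product {i}"} for i in range(1_000)])
results = ds.map_batches(VLLMPredictor, concurrency=4, num_gpus=1, batch_size=64)
results.write_parquet("/tmp/llm_outputs")  # illustrative output location
```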

Ray and Spark are two of the most popular distributed compute engines. Ray is a Python-native, general-purpose framework for scaling arbitrary Python workloads on CPUs, GPUs, or both. It provides low-level task and actor primitives for building distributed applications, not so different conceptually from an RPC framework. Its most common use cases are model training, model serving, multimodal data processing, and reinforcement learning. Spark is a unified analytics engine whose sweet spot is large-scale data pipelines and SQL & dataframe operations on CPUs.
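To make those primitives concrete, here is a minimal sketch of Ray's task and actor APIs.

```python
# Ray primitives sketch: tasks are stateless remote functions, actors are
# stateful worker processes; both return futures (object refs).
import ray

ray.init()

@ray.remote
def square(x: int) -> int:
    return x * x

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self) -> int:
        self.count += 1
        return self.count

print(ray.get([square.remote(i) for i in range(4)]))             # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))   # [1, 2, 3]
```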

It’s worth noting that failure handling and autoscaling responsibilities are split with the container orchestrator layer.

  • Failure handling: The distributed compute engine is workload aware and is responsible for ensuring that the overall AI workload handles the failure and continues executing smoothly (possibly restarting dead processes, re-executing interrupted tasks, or reconfiguring itself on the remaining available compute). The container orchestrator is not workload aware; it is responsible for removing the problematic GPU (or other component) from the underlying cluster.

  • Autoscaling: The distributed compute engine is responsible for workload-aware autoscaling. It ensures that the AI workload requests the resources it needs (from the container orchestrator), releases resources when they are no longer needed, and reconfigures itself to run on the compute it has available. The container orchestrator is responsible for actually obtaining the required compute resources from the underlying cloud provider, allocating them to different AI workloads, and possibly taking them back when appropriate.
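The sketch below illustrates this division of labor from the application's side, under the assumption that a Ray autoscaler is connected to the container orchestrator: the code declares per-task resource needs and retry behavior, and the cluster is grown or shrunk to match. The function body is a placeholder.

```python
# Workload-aware resource requests and failure handling in Ray (sketch).
import ray

ray.init()

@ray.remote(num_gpus=1, max_retries=3)   # re-executed automatically if its worker dies
def process_shard(shard_id: int) -> str:
    # ... GPU-bound work on one shard of data ...
    return f"shard-{shard_id}-done"

# If fewer GPUs are available than queued tasks, Ray's autoscaler asks the
# container orchestrator for more nodes; idle nodes scale back down afterward.
print(ray.get([process_shard.remote(i) for i in range(16)]))
```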

Container Orchestrator

Key Responsibility: Compute provisioning

  • Workload-level scheduling and multitenancy
  • User multitenancy including isolation and resource quotas
  • Container creation and lifecycle management
  • Compute autoscaling

Workloads do not exist in isolation. In a typical company, many users may run many jobs, and these jobs share a common pool of compute resources.

The container orchestration layer is responsible for scheduling workloads and multiplexing users across a shared pool of compute. The scheduling granularity is important: the container orchestrator schedules entire jobs, while the distributed compute engine schedules the tasks and processes that make up an individual job. This layer is responsible for creating containers and managing container lifecycles. It is also responsible for interfacing with cloud providers to select instance types and provision compute.

Kubernetes is the dominant framework at this layer, but SLURM is a popular alternative, especially among researchers; its strength is offline workloads (training and batch processing). Kubernetes handles both offline and online workloads, but it was originally created for microservices and therefore biases toward online workloads. That said, while many teams start off using SLURM, the momentum is toward standardizing on Kubernetes.

Note that this layer could equally well be called the compute orchestrator (as opposed to the container orchestrator). Software environments are usually defined with containers, but that is not a hard requirement. For example, SLURM often runs without containers.

One subtle point about this layer is the target users. Kubernetes fundamentally targets platform engineers: the people who operate clusters, configure infrastructure through YAML, manage multitenancy and resource sharing, and standardize policies across their organization. Ray, PyTorch, and vLLM target ML developers: people who write Python code and use PyTorch and Hugging Face, who may find Docker and Kubernetes intimidating, yet have no problem debugging complex numerical bugs or performance bottlenecks in their inference pipelines. Individually, these layers do not satisfy the needs of both sets of users, but together they can meet both sets of requirements.

Case Studies

A few companies have written extensively about their AI compute tech stacks.

Pinterest

Pinterest wrote an impressive three-part blog series covering AI infra at Pinterest.

  • Part 1 is on last mile data processing for training and covers developer velocity bottlenecks, especially around ML dataset iteration.

  • Part 2 is on managing Ray at Pinterest and covers the year-long journey to adopt Ray best practices and manage Ray workloads in production. It covers challenges ranging from the interaction with Kubernetes to authentication, logging, metrics, and cost optimization.

  • Part 3 is on batch inference and describes how they evolved their previous generation batch inference solution by pairing Ray with vLLM and PyTorch.

Their tech stack perfectly exhibits the Kubernetes + Ray + PyTorch + vLLM combination.

| Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
| --- | --- | --- | --- | --- |
| Last-mile data processing + model training | Kubernetes-based PinCompute platform | Previously Spark, now Ray (for streaming execution and heterogeneous CPU + GPU compute) | PyTorch | Dataset-iteration wall-clock time cut by 6x (from 90h to 15h). Dev cycles cut from days to hours. GPU utilization over 90%. Training throughput up 45% while per-job cost down 25%. ✧ (blog) |
| Offline / batch inference | Kubernetes-based PinCompute platform | Previously Spark, now Ray | PyTorch, vLLM | 30x decrease in cost for search-quality jobs. 4.5x throughput increase. 4x reduction in job runtime for GPU inference jobs. Running around 1,800 jobs per month. ✧ (blog, blog) |
| Large-scale training (especially recommender models) | Kubernetes-based PinCompute platform | Ray on heterogeneous clusters | PyTorch | Running 5,000+ training jobs per month. Transparent scaling. Greater developer velocity and interactive development. Heterogeneous resources keep GPUs saturated. ✧ (blog, talk) |

Fig 2. Components of Pinterest’s software stack for managing AI compute.

Uber

Uber has been highly influential in the ML platform space and has written extensively about their AI infrastructure journey.

  • They announced Michelangelo in 2017, a pioneering ML platform, where they detail their stack from managing data to training models to deploying models to monitoring predictions.

  • Around the same time, they open sourced Horovod, an early distributed training framework. They followed that post up with additional details on scaling both Horovod with Ray for deep learning and XGBoost with Ray for classical ML.

  • When it comes to batch inference, they wrote about their PySpark-based architecture and spoke recently about an architecture based on Ray + vLLM.

  • In this blog post, they describe the steps they took to optimize the cost and reliability of their training and serving infrastructure, across both on-prem and cloud settings.

  • In 2024, they shared an extensive history of the evolution of Michelangelo and how it made the transition from predictive ML to generative AI.

  • They describe their LLM training tech stack, which is based on PyTorch, Ray Train, Hugging Face Transformers, and DeepSpeed.

  • In 2024, Uber migrated its machine learning workloads to Kubernetes and recently described that migration in two parts. Part 1 describes the open source tech stack they adopted in order to enable this migration and the challenges they had to solve along the way. Part 2 zooms in on their job management platform on top of Kubernetes and some of the enhancements they had to make to Kubernetes.

Uber has evolved its ML platform significantly and has used many different open source technologies as part of its stack. The Kubernetes + Ray + PyTorch + vLLM combination plays a major role in their tech stack today, and it is surrounded by many other tools. Notably, Uber makes heavy use of Spark.

Fig 3. Uber LLM training software stack

| Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
| --- | --- | --- | --- | --- |
| LLM training & evaluation (fine-tuning 7B to 70B models on A100s & H100s; model evals) | Kubernetes-based Michelangelo Job Controller across multiple AZs (for GPU availability and scale) | Ray for coordinating and scaling | PyTorch DDP, DeepSpeed, Hugging Face Transformers, vLLM for offline scoring | GPU memory reduction enabling 2-7x larger batches, leading to a 2-3x throughput increase on Llama-2 70B. ✧ (blog, blog) |
| Deep-learning & classical ML training (distributed training, hyperparameter optimization) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Originally Spark, now Ray | Horovod, XGBoost, PyTorch | Improved scalability and reliability. ✧ (blog, blog, blog) |
| Batch inference (classical models and LLMs) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Spark, Ray | TensorFlow, PyTorch, XGBoost, vLLM, Triton for embeddings | Scalability and GPU support. Multi-GPU inference. ✧ (blog, talk) |
| Model serving (latency sensitive) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Michelangelo online prediction service | TensorFlow, PyTorch; previously served with Neuropod, now Triton | Framework agnostic, with support for low-latency GPU serving. ✧ (blog) |
| Marketplace-incentive optimization (adjusting incentives & discounts across thousands of cities) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Originally Spark, switched to a Spark + Ray hybrid | CVXOPT and pure Python logic for optimization | 40x overall speed-up. Job deployment reduced from 15–20 min to 2 min. Improved iteration speed. ✧ (blog) |

Fig 4. Components of Uber’s software stack for managing AI compute.

Roblox

Roblox has also written about their ML platform, describing its evolution in three phases.

  • Phase 1 centered on Kubeflow and Spark for data processing and training workloads and on KServe and NVIDIA Triton for online serving.

  • Phase 2 focused on scaling training to support larger datasets and larger models as well as optimizing performance and efficiency for training and inference. This phase introduced Ray and Flink.

  • Phase 3 tackles LLM operations and brings in vLLM as a central piece of the platform.

Their tech stack features all of the core components described above along with Spark, Flink, Kubeflow, KServe, NVIDIA Triton, and many other tools.

Fig 5. Roblox’s AI platform.

| Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
| --- | --- | --- | --- | --- |
| Batch / offline inference (personalization, multimodal search, CLIP embeddings, LLM batch inference) | Kubernetes across hybrid cloud & on-prem data centers, with YuniKorn for queueing | Previously Spark or reused Triton online inference services, now Ray | PyTorch, vLLM | 58% cost reduction (for large-image CLIP inference). 9x speedup compared to calling the online LLM server. Engineering + job runtime shortened by 50%. Hundreds of Ray jobs launched daily. Enabled multistage processing and heterogeneous resources, 10x improved GPU utilization, greater fault tolerance, and tremendous efficiency gains, handling 1B personalization requests per day. ✧ (blog, talk) |
| Online LLM inference (assistant, chat translation, voice safety) | Kubernetes across hybrid cloud & on-prem data centers | KServe | vLLM as the primary LLM inference engine, Triton | Switch to vLLM delivered 2x lower latency and higher throughput. ✧ (blog) |
| Training pipelines (daily retraining, distributed training) | Kubernetes across hybrid cloud & on-prem data centers, with YuniKorn for queueing | Kubeflow Pipelines, Ray for distributed training | PyTorch | Platform expanded from 50 to 250 inference pipelines. ✧ (blog) |

Fig 6. Components of Roblox’s software stack for managing AI compute.

Post-training Frameworks

This last case study is not about a specific company, but instead illustrates the stack through an important emerging AI workload, namely post-training.

Post-training is notable for its complexity. Compared with regular training, post-training combines model training with model inference: inference is run in order to generate additional data for training. Model inference often interacts with simulation (e.g., if you are teaching the model to do software engineering, you may be running and scaling containerized coding environments), and reward models score the quality of generated data. Model weights need to be shipped from the training portion of the algorithm to the inference portion, and rollouts (essentially execution traces) need to be shipped from the generation portion back to the training portion. As a result, there are many moving parts, and the distributed systems challenges are harder.
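The following is a highly simplified sketch of that data flow: a trainer and a generator run as separate Ray actors, rollouts flow from generation to training, and updated weights flow back. All class and method names here are illustrative and not taken from any particular post-training framework.

```python
# Simplified post-training loop: rollouts flow generator -> trainer,
# weight versions flow trainer -> generator. Everything is a placeholder.
import ray

ray.init()

@ray.remote
class Generator:
    """In a real system this actor would host a vLLM engine on its own GPUs."""
    def __init__(self):
        self.weights_version = 0

    def generate_rollouts(self, prompts):
        # Placeholder for sampling responses and scoring them with a reward model.
        return [{"prompt": p, "response": "...", "reward": 0.0} for p in prompts]

    def sync_weights(self, version):
        self.weights_version = version

@ray.remote
class Trainer:
    """In a real system this actor would run FSDP or DeepSpeed training updates."""
    def __init__(self):
        self.version = 0

    def train_step(self, rollouts):
        # Placeholder for a PPO/GRPO update on the collected rollouts.
        self.version += 1
        return self.version

generator, trainer = Generator.remote(), Trainer.remote()
for step in range(3):
    rollouts = ray.get(generator.generate_rollouts.remote(["task A", "task B"]))
    new_version = ray.get(trainer.train_step.remote(rollouts))
    ray.get(generator.sync_weights.remote(new_version))
```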

The table below describes the software stacks used to implement five of the most popular open source post-training frameworks. These frameworks are generally implemented by combining technologies at the distributed compute engine layer (primarily Ray) with technologies at the training and inference framework layer. They are designed to run across multiple container orchestrators (Kubernetes and SLURM). SLURM is especially popular for deploying these frameworks because researchers are the main people doing post-training right now.

Note that because this workload requires both training and inference, we separate out training frameworks from inference frameworks in this table.

| Framework | Container orchestrator | Distributed compute engine | Training framework | Inference framework | Other comments |
| --- | --- | --- | --- | --- | --- |
| VeRL (ByteDance) | Documentation primarily for SLURM | Ray for scheduling and coordination of RL components | PyTorch FSDP, Megatron-LM (for very large models) | vLLM, SGLang | Single-controller programming paradigm. Easily switch algorithms (PPO, GRPO) with a few lines of code. Decouples data flow and compute, allowing flexible placement on GPUs. Achieves 1.5x–20x throughput versus earlier frameworks. ✧ (GitHub) |
| SkyRL (UC Berkeley Sky Lab) | Experiments run on Kubernetes | Ray for scheduling and coordination of RL components | PyTorch FSDP | vLLM, SGLang | Extends VeRL with advanced agentic capabilities to perform tasks like SWE-Bench. Includes pre-built SkyRL-Agent models trained on SWE-Bench tasks. ✧ (GitHub, blog) |
| OpenRLHF (OpenRLHF community) | Scripts provided for SLURM | Ray for scheduling and coordination of RL components | DeepSpeed | vLLM | Easy-to-use and scalable RLHF framework designed to make RLHF training simple. Enables training for models up to 70B+ parameters. ✧ (GitHub, blog) |
| Open-Instruct (AllenAI) | Experiments run on the Kubernetes-based Beaker platform | Ray for scheduling and coordination of RL components | PyTorch, DeepSpeed | vLLM | Post-training and instruction-tuning toolkit. Includes fine-tuning of language models, DPO, preference tuning, and RL with verifiable rewards (RLVR). Includes scripts for measuring dataset contamination. Used to produce AllenAI’s TÜLU-3 models. ✧ (GitHub) |
| NeMo-RL (NVIDIA) | Documentation primarily for SLURM, but also Kubernetes | Ray for scheduling and coordination of RL components | PyTorch FSDP, Megatron-LM (for very large models) | vLLM | A scalable and efficient post-training library for jobs ranging from a single GPU to thousands of GPUs, and from tiny models to models with over 100B parameters. Emphasizes scalability. ✧ (GitHub) |

Fig 7. Technologies used for building the most popular open source post-training frameworks. Ray, PyTorch, and vLLM are universally used in the implementation. These frameworks are typically deployed on top of Kubernetes and SLURM.

Appendix

*Beyond the examples given in this blog post, others include Spotify, ByteDance, Tencent, Canva, Coinbase, Instacart, Niantic, Netflix, Runway, Reddit, Cohere, Ant Group, Samsara, eBay, Handshake, Workday, Zoox, Airbnb, and Apple.
