
An Open Source Stack for AI Compute: Kubernetes + Ray + PyTorch + vLLM

By Robert Nishihara   |   May 28, 2025

Through our experience building Ray, we work with hundreds of platform teams and AI teams productionizing AI*. These teams use a range of open source frameworks to run AI workloads. Over the years, as AI workloads have evolved from classical ML to deep learning to generative AI, the underlying software stack for running and scaling these workloads has grown in complexity.

As an industry matures, a standard tech stack tends to emerge. For example, after a proliferation of deep learning frameworks, PyTorch has emerged as the winner. Among the classical big data analytics frameworks, Spark is the winner. Among container orchestrators, Kubernetes has won. And so on. When it comes to infrastructure products, the winners tend to be open source. It’s hard to imagine any of those technologies gaining much traction if they had been proprietary products.

We are in the early days of AI, and yet some pieces of a common stack are beginning to emerge. This blog post describes an early picture based on experience working with Ray users.

I’ll start by explaining the components of this stack and then illustrate it with case studies from Pinterest, Uber, and Roblox as well as from five popular open source post-training frameworks.

Background

The purpose of this tech stack is to connect a team’s AI workloads to their underlying compute resources (especially GPUs). The teams piecing together these software stacks tend to solve for a few common requirements.

  • Three key workloads. They need to support model training, model serving, and batch inference (especially multimodal data processing). Within training, pre-training and post-training are emerging as distinct workloads each with their own requirements and tech stacks.

  • Scale. AI is computationally intensive as well as data intensive. Many platform and AI teams are prioritizing training and inference on orders of magnitude more data. Supporting this scale means solving hard reliability and cost-efficiency challenges.

  • Iteration speed. The speed at which teams can test new ideas often bottlenecks progress. Notably, performance and scale are sometimes at odds with the flexibility required for rapid iteration: new ideas can be incompatible with existing infrastructure that has been optimized for performance at scale. To measure iteration speed, some AI teams track the number of experiments run per person per month.

  • Flexibility for the future. Many teams architect for flexibility because AI is changing rapidly. They don’t want to be locked into rigid frameworks that prevent the adoption of the latest models, techniques, and optimizations. They also value portability across cloud providers and hardware accelerators as many of them want to be prepared to jump on the best sources of compute that arise.

The Software Stack

We see Ray being used with nearly every tool out there, but some combinations occur repeatedly. One of the most common software stacks that we see for AI compute is Kubernetes + Ray + PyTorch + vLLM.

The software stack connects AI workloads at the top with the hardware powering those workloads at the bottom. It has three layers, each with its own responsibilities:

  • The training and inference framework: running the model efficiently on the GPU (model optimizations, model parallelism strategies, model definition, automatic differentiation).

  • The distributed compute engine: distributed computing, including task scheduling, process management, RPCs, data movement, failure handling, and workload autoscaling.

  • The container orchestrator: compute provisioning, including container management, compute autoscaling, and workload and user multitenancy.

Fig 1. Kubernetes + Ray + PyTorch + vLLM: One of the most popular software stacks for AI compute.

Training and Inference Framework

Key Responsibility: Running the model efficiently on the GPU

  • Model compilation and model-level performance optimization
  • Accelerator memory management
  • Parallelism strategies (model parallelism, data parallelism, etc.)
  • Model definition and automatic differentiation

GPUs and other accelerators are the centerpiece of AI compute, and so it is vitally important that models take full advantage of them.

At a high level, this layer is responsible for running AI models in a performant way on hardware accelerators. This layer bears the primary responsibility for interfacing with GPUs. In addition to model definition and automatic differentiation (in the case of training), these responsibilities include the execution of high-performance kernels, the optimization and compilation of models for different accelerator backends, accelerator memory management, and model parallelism (of which there are many varieties: tensor parallelism, pipeline parallelism, expert parallelism, and so forth).

A deep learning framework is at the heart of this layer. PyTorch is the dominant framework today, with Jax as a popular alternative, especially for running on TPUs. These frameworks have consolidated after an explosion of options over the past decade including Theano, Torch, Caffe, Chainer, MXNet, and TensorFlow.
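To ground what this layer provides, here is a minimal PyTorch sketch of the two capabilities at its core, model definition and automatic differentiation; the model and data are toy placeholders chosen only to keep the example self-contained.

```python
# Minimal PyTorch sketch: define a model, compute a loss, and let autograd
# produce gradients. The model and data here are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 16), torch.randn(32, 1)   # stand-in training batch
for _ in range(100):
    loss = loss_fn(model(x), y)   # forward pass
    optimizer.zero_grad()
    loss.backward()               # automatic differentiation
    optimizer.step()              # gradient update
```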

Surrounding PyTorch, we have frameworks designed for model parallelism (sharding models across multiple hardware accelerators). These frameworks can be divided along two primary axes. First, these frameworks may be designed for training or for inference (or for both). Second, they may be general purpose (meaning they can be used with any model) or specialized for transformers. Transformers are sufficiently important and complex that they deserve their own specialized frameworks.

PyTorch FSDP (fully-sharded data parallel) and DeepSpeed are general frameworks for model parallelism, primarily optimized for training though they can be used for inference as well. Nvidia Megatron is a model-parallel training framework specialized for transformers. This specialization allows for more optimized sharding strategies (e.g., different sharding strategies for different weights).
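To make the general-purpose approach concrete, below is a hedged sketch of wrapping a model in PyTorch FSDP. It assumes the script is launched with torchrun so that the distributed environment variables are set; the model and training loop are stand-ins rather than a real transformer workload.

```python
# Hedged FSDP sketch: shard parameters, gradients, and optimizer state across
# all ranks. Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(            # stand-in model, not a real transformer
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

model = FSDP(model)                     # each rank holds only a shard of the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    batch = torch.randn(8, 1024, device="cuda")   # stand-in training batch
    loss = model(batch).pow(2).mean()             # stand-in loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```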

Another set of frameworks, comprising vLLM, SGLang, and Nvidia TensorRT-LLM, consists of inference engines specialized for running transformers. Important techniques for optimizing transformer inference include continuous batching, paged attention, speculative decoding, and chunked prefill.
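For concreteness, here is a minimal sketch of offline inference with vLLM's Python API; the model name is illustrative, and optimizations such as continuous batching and paged attention are applied by the engine automatically.

```python
# Minimal vLLM sketch for offline (batch) inference. The model name is
# illustrative; any Hugging Face-compatible model identifier works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "Explain paged attention in one sentence.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```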

Distributed Compute Engine

Key Responsibility: Distributed computing

  • Task scheduling and execution
  • Process creation and lifecycle management, process coordination (RPCs)
  • Data handling (ingest and movement)
  • Workload-aware failure handling and autoscaling

At a high level, this layer is responsible for the distributed systems challenges beyond running the model efficiently on the GPU. These challenges include process lifecycle management, task scheduling across processes, efficient data movement, failure handling, and workload autoscaling.

These frameworks typically “wrap” the training and inference frameworks, invoking them as components of a broader distributed workload (e.g., Ray might schedule vLLM tasks to do batch LLM inference as part of a broader data processing pipeline).
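As a hedged sketch of that pattern, the snippet below uses Ray Data to drive a batch pipeline that calls vLLM inside a pool of GPU actors. The dataset, model name, and output path are illustrative, and the exact map_batches arguments vary somewhat across Ray versions.

```python
# Ray Data + vLLM batch inference sketch: each actor in the pool hosts one
# vLLM engine on one GPU; Ray streams batches of prompts through the pool.
import ray
from vllm import LLM, SamplingParams

class VLLMPredictor:
    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
        self.params = SamplingParams(max_tokens=128)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["response"] = [o.outputs[0].text for o in outputs]
        return batch

ds = ray.data.from_items([{"prompt": f"Describe product {i}"} for i in range(1_000)])
results = ds.map_batches(VLLMPredictor, concurrency=4, num_gpus=1, batch_size=64)
results.write_parquet("/tmp/llm_outputs")  # illustrative output location
```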

Ray and Spark are two of the most popular distributed compute engines. Ray is a Python-native, general-purpose framework for scaling arbitrary Python workloads on CPUs, GPUs, or both. It provides low-level task and actor primitives for building distributed applications, not so different conceptually from an RPC framework. Its most common use cases are model training, model serving, multimodal data processing, and reinforcement learning. Spark is a unified analytics engine whose sweet spot is large-scale data pipelines and SQL & dataframe operations on CPUs.
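To make those primitives concrete, here is a minimal sketch of Ray's task and actor APIs.

```python
# Ray primitives sketch: tasks are stateless remote functions, actors are
# stateful worker processes; both return futures (object refs).
import ray

ray.init()

@ray.remote
def square(x: int) -> int:
    return x * x

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self) -> int:
        self.count += 1
        return self.count

print(ray.get([square.remote(i) for i in range(4)]))             # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))   # [1, 2, 3]
```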

It’s worth noting that failure handling and autoscaling responsibilities are split with the container orchestrator layer.

  • Failure handling: The distributed compute engine is workload aware and is responsible for ensuring that the overall AI workload handles the failure and continues executing smoothly (possibly restarting dead processes, re-executing interrupted tasks, or reconfiguring itself on the remaining available compute). The container orchestrator is not workload aware; it is responsible for removing the problematic GPU (or other component) from the underlying cluster.

  • Autoscaling: The distributed compute engine is responsible for workload-aware autoscaling. It ensures that the AI workload requests the resources it needs (from the container orchestrator), releases resources when they are no longer needed, and reconfigures itself to run on the compute it has available. The container orchestrator is responsible for actually obtaining the required compute resources from the underlying cloud provider, allocating them to different AI workloads, and possibly taking them back when appropriate.
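The sketch below illustrates this division of labor from the application's side, under the assumption that a Ray autoscaler is connected to the container orchestrator: the code declares per-task resource needs and retry behavior, and the cluster is grown or shrunk to match. The function body is a placeholder.

```python
# Workload-aware resource requests and failure handling in Ray (sketch).
import ray

ray.init()

@ray.remote(num_gpus=1, max_retries=3)   # re-executed automatically if its worker dies
def process_shard(shard_id: int) -> str:
    # ... GPU-bound work on one shard of data ...
    return f"shard-{shard_id}-done"

# If fewer GPUs are available than queued tasks, Ray's autoscaler asks the
# container orchestrator for more nodes; idle nodes scale back down afterward.
print(ray.get([process_shard.remote(i) for i in range(16)]))
```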

Container Orchestrator

Key Responsibility: Compute provisioning

  • Workload-level scheduling and multitenancy
  • User multitenancy including isolation and resource quotas
  • Container creation and lifecycle management
  • Compute autoscaling

Workloads do not exist in isolation. In a typical company, many users may run many jobs, and these jobs share a common pool of compute resources.

The container orchestration layer is responsible for scheduling workloads and multiplexing users across a shared pool of compute. The scheduling granularity is important: the container orchestrator schedules entire jobs, while the distributed compute engine schedules the tasks and processes that make up an individual job. This layer is responsible for creating containers and managing container lifecycles. It is also responsible for interfacing with cloud providers to select instance types and provision compute.

Kubernetes is the dominant framework at this layer, but SLURM is a popular alternative, especially among researchers; its strength is offline workloads (training and batch processing). Kubernetes handles both offline and online workloads, but it was originally created for microservices and therefore biases toward online workloads. That said, while many teams start off using SLURM, the momentum is toward standardizing on Kubernetes.

Note that this layer could equally well be called the compute orchestrator (as opposed to the container orchestrator). Software environments are usually defined with containers, but that is not a hard requirement. For example, SLURM often runs without containers.

One subtle point about this layer is the target users. Kubernetes fundamentally targets platform engineers: the people who operate clusters, configure infrastructure through YAML, manage multitenancy and resource sharing, and standardize policies across their organization. Ray, PyTorch, and vLLM target ML developers: people who write Python code and use PyTorch and Hugging Face, who may find Docker and Kubernetes intimidating, yet have no problem debugging complex numerical bugs or performance bottlenecks in their inference pipelines. Individually, these layers do not satisfy the needs of both sets of users, but together they can meet both sets of requirements.

Case Studies

A few companies have written extensively about their AI compute tech stacks.

Pinterest

Pinterest wrote an impressive three-part blog series covering AI infra at Pinterest.

  • Part 1 is on last mile data processing for training and covers developer velocity bottlenecks, especially around ML dataset iteration.

  • Part 2 is on managing Ray at Pinterest and covers the year-long journey to adopt Ray best practices and manage Ray workloads in production. It covers challenges ranging from the interaction with Kubernetes to authentication, logging, metrics, and cost optimization.

  • Part 3 is on batch inference and describes how they evolved their previous generation batch inference solution by pairing Ray with vLLM and PyTorch.

Their tech stack perfectly exhibits the Kubernetes + Ray + PyTorch + vLLM combination.

| Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
| --- | --- | --- | --- | --- |
| Last-mile data processing + model training | Kubernetes-based PinCompute platform | Previously Spark, now Ray (for streaming execution and heterogeneous CPU + GPU compute) | PyTorch | Dataset-iteration wall-clock time cut by 6x (from 90h to 15h). Dev cycles cut from days to hours. GPU utilization over 90%. Training throughput up 45% while per-job cost down 25%. ✧ (blog) |
| Offline / batch inference | Kubernetes-based PinCompute platform | Previously Spark, now Ray | PyTorch, vLLM | 30x decrease in cost for search-quality jobs. 4.5x throughput increase. 4x reduction in job runtime for GPU inference jobs. Running around 1,800 jobs per month. ✧ (blog, blog) |
| Large-scale training (especially recommender models) | Kubernetes-based PinCompute platform | Ray on heterogeneous clusters | PyTorch | Running 5,000+ training jobs per month. Transparent scaling. Greater developer velocity and interactive development. Heterogeneous resources keep GPUs saturated. ✧ (blog, talk) |

Fig 2. Components of Pinterest’s software stack for managing AI compute.

Uber

Uber has been highly influential in the ML platform space and has written extensively about their AI infrastructure journey.

  • They announced Michelangelo in 2017, a pioneering ML platform, where they detail their stack from managing data to training models to deploying models to monitoring predictions.

  • Around the same time, they open sourced Horovod, an early distributed training framework. They followed that post up with additional details on scaling both Horovod with Ray for deep learning and XGBoost with Ray for classical ML.

  • When it comes to batch inference, they wrote about their PySpark-based architecture and spoke recently about an architecture based on Ray + vLLM.

  • In this blog post, they describe the steps they took to optimize the cost and reliability of their training and serving infrastructure, across both on-prem and cloud settings.

  • In 2024, they shared an extensive history of the evolution of Michelangelo and how it made the transition from predictive ML to generative AI.

  • They describe their LLM training tech stack, which is based on PyTorch, Ray Train, Hugging Face Transformers, and DeepSpeed.

  • In 2024, Uber migrated its machine learning workloads to Kubernetes and recently described that migration in two parts. Part 1 describes the open source tech stack they adopted in order to enable this migration and the challenges they had to solve along the way. Part 2 zooms in on their job management platform on top of Kubernetes and some of the enhancements they had to make to Kubernetes.

Uber has evolved its ML platform significantly and has used many different open source technologies as part of its stack. The Kubernetes + Ray + PyTorch + vLLM combination plays a major role in their tech stack today, and it is surrounded by many other tools. Notably, Uber makes heavy use of Spark.

Fig 3. Uber LLM training software stack

| Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
| --- | --- | --- | --- | --- |
| LLM training & evaluation (fine-tuning 7B to 70B models on A100s & H100s; model evals) | Kubernetes-based Michelangelo Job Controller across multiple AZs (for GPU availability and scale) | Ray for coordinating and scaling | PyTorch DDP, DeepSpeed, Hugging Face Transformers, vLLM for offline scoring | GPU memory reduction enabling 2-7x larger batches, leading to a 2-3x throughput increase on Llama-2 70B. ✧ (blog, blog) |
| Deep-learning & classical ML training (distributed training, hyperparameter optimization) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Originally Spark, now Ray | Horovod, XGBoost, PyTorch | Improved scalability and reliability. ✧ (blog, blog, blog) |
| Batch inference (classical models and LLMs) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Spark, Ray | TensorFlow, PyTorch, XGBoost, vLLM, Triton for embeddings | Scalability and GPU support. Multi-GPU inference. ✧ (blog, talk) |
| Model serving (latency sensitive) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Michelangelo online prediction service | TensorFlow, PyTorch; previously served with Neuropod, now Triton | Framework agnostic, with support for low-latency GPU serving. ✧ (blog) |
| Marketplace-incentive optimization (adjusting incentives & discounts across thousands of cities) | Originally Peloton, now Kubernetes-based Michelangelo Job Controller | Originally Spark, switched to a Spark + Ray hybrid | CVXOPT and pure Python logic for optimization | 40x overall speed-up. Job deployment reduced from 15–20 min to 2 min. Improved iteration speed. ✧ (blog) |

Fig 4. Components of Uber’s software stack for managing AI compute.

Roblox

Roblox has also written about their ML platform, describing its evolution in three phases.

  • Phase 1 centered on Kubeflow and Spark for data processing and training workloads and on KServe and NVIDIA Triton for online serving.

  • Phase 2 focused on scaling training to support larger datasets and larger models as well as optimizing performance and efficiency for training and inference. This phase introduced Ray and Flink.

  • Phase 3 tackles LLM operations and brings in vLLM as a central piece of the platform.

Their tech stack features all of the core components described above along with Spark, Flink, Kubeflow, KServe, NVIDIA Triton, and many other tools.

Fig 5. Roblox’s AI platform.

| Workload | Container orchestrator | Distributed compute engine | Training & inference framework | Outcomes |
| --- | --- | --- | --- | --- |
| Batch / offline inference (personalization, multimodal search, CLIP embeddings, LLM batch inference) | Kubernetes across hybrid cloud & on-prem data centers, with YuniKorn for queueing | Previously Spark or reused Triton online inference services, now Ray | PyTorch, vLLM | 58% cost reduction (for large-image CLIP inference). 9x speedup compared to calling the online LLM server. Engineering + job runtime shortened by 50%. Hundreds of Ray jobs launched daily. Enabled multistage processing and heterogeneous resources, 10x improved GPU utilization, greater fault tolerance, and tremendous efficiency gains, handling 1B personalization requests per day. ✧ (blog, talk) |
| Online LLM inference (assistant, chat translation, voice safety) | Kubernetes across hybrid cloud & on-prem data centers | KServe | vLLM as the primary LLM inference engine, Triton | Switch to vLLM delivered 2x lower latency and higher throughput. ✧ (blog) |
| Training pipelines (daily retraining, distributed training) | Kubernetes across hybrid cloud & on-prem data centers, with YuniKorn for queueing | Kubeflow Pipelines, Ray for distributed training | PyTorch | Platform expanded from 50 to 250 inference pipelines. ✧ (blog) |

Fig 6. Components of Roblox’s software stack for managing AI compute.

Post-training Frameworks

This last case study is not about a specific company, but instead illustrates the stack through an important emerging AI workload, namely post-training.

Post-training is notable for its complexity. Compared with regular training, post-training combines model training with model inference: inference is run in order to generate additional data for training. Model inference often interacts with simulation (e.g., if you are teaching the model to do software engineering, you may be running and scaling containerized coding environments), and reward models score the quality of generated data. Model weights need to be shipped from the training portion of the algorithm to the inference portion, and rollouts (essentially execution traces) need to be shipped from the generation portion back to the training portion. As a result, there are many moving parts, and the distributed systems challenges are harder.
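The following is a highly simplified sketch of that data flow: a trainer and a generator run as separate Ray actors, rollouts flow from generation to training, and updated weights flow back. All class and method names here are illustrative and not taken from any particular post-training framework.

```python
# Simplified post-training loop: rollouts flow generator -> trainer,
# weight versions flow trainer -> generator. Everything is a placeholder.
import ray

ray.init()

@ray.remote
class Generator:
    """In a real system this actor would host a vLLM engine on its own GPUs."""
    def __init__(self):
        self.weights_version = 0

    def generate_rollouts(self, prompts):
        # Placeholder for sampling responses and scoring them with a reward model.
        return [{"prompt": p, "response": "...", "reward": 0.0} for p in prompts]

    def sync_weights(self, version):
        self.weights_version = version

@ray.remote
class Trainer:
    """In a real system this actor would run FSDP or DeepSpeed training updates."""
    def __init__(self):
        self.version = 0

    def train_step(self, rollouts):
        # Placeholder for a PPO/GRPO update on the collected rollouts.
        self.version += 1
        return self.version

generator, trainer = Generator.remote(), Trainer.remote()
for step in range(3):
    rollouts = ray.get(generator.generate_rollouts.remote(["task A", "task B"]))
    new_version = ray.get(trainer.train_step.remote(rollouts))
    ray.get(generator.sync_weights.remote(new_version))
```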

The table below describes the software stacks used to implement five of the most popular open source post-training frameworks. These frameworks are generally implemented by combining technologies at the distributed compute engine layer (primarily Ray) with technologies at the training and inference framework layer. They are designed to run across multiple container orchestrators (Kubernetes and SLURM). SLURM is especially popular for deploying these frameworks because researchers are the main people doing post-training right now.

Note that because this workload requires both training and inference, we separate out training frameworks from inference frameworks in this table.

| Framework | Container orchestrator | Distributed compute engine | Training framework | Inference framework | Other comments |
| --- | --- | --- | --- | --- | --- |
| VeRL (ByteDance) | Documentation primarily for SLURM | Ray for scheduling and coordination of RL components | PyTorch FSDP, Megatron-LM (for very large models) | vLLM, SGLang | Single-controller programming paradigm. Easily switch algorithms (PPO, GRPO) with a few lines of code. Decouples data flow and compute, allowing flexible placement on GPUs. Achieves 1.5x–20x throughput versus earlier frameworks. ✧ (GitHub) |
| SkyRL (UC Berkeley Sky Lab) | Experiments run on Kubernetes | Ray for scheduling and coordination of RL components | PyTorch FSDP | vLLM, SGLang | Extends VeRL with advanced agentic capabilities to perform tasks like SWE-Bench. Includes pre-built SkyRL-Agent models trained on SWE-Bench tasks. ✧ (GitHub, blog) |
| OpenRLHF (OpenRLHF community) | Scripts provided for SLURM | Ray for scheduling and coordination of RL components | DeepSpeed | vLLM | Easy-to-use and scalable RLHF framework designed to make RLHF training simple. Enables training for models up to 70B+ parameters. ✧ (GitHub, blog) |
| Open-Instruct (AllenAI) | Experiments run on the Kubernetes-based Beaker platform | Ray for scheduling and coordination of RL components | PyTorch, DeepSpeed | vLLM | Post-training and instruction-tuning toolkit. Includes fine-tuning of language models, DPO, preference tuning, and RL with verifiable rewards (RLVR). Includes scripts for measuring dataset contamination. Used to produce AllenAI’s TÜLU-3 models. ✧ (GitHub) |
| NeMo-RL (NVIDIA) | Documentation primarily for SLURM, but also Kubernetes | Ray for scheduling and coordination of RL components | PyTorch FSDP, Megatron-LM (for very large models) | vLLM | A scalable and efficient post-training library for jobs ranging from a single GPU to thousands of GPUs, and from tiny models to models with over 100B parameters. Emphasizes scalability. ✧ (GitHub) |

Fig 7. Technologies used for building the most popular open source post-training frameworks. Ray, PyTorch, and vLLM are universally used in the implementation. These frameworks are typically deployed on top of Kubernetes and SLURM.

Appendix

*Beyond the examples given in this blog post, others include Spotify, ByteDance, Tencent, Canva, Coinbase, Instacart, Niantic, Netflix, Runway, Reddit, Cohere, Ant Group, Samsara, eBay, Handshake, Workday, Zoox, Airbnb, and Apple.
