How Adyen trains a Transaction Foundation Model (TFM) on 51 trillion tokens and other stories on scaling AI with Ray from Xoople, Criteo, and BMW

By Katarina Stanley   |   June 11, 2026

Ray Day London marked the last stop on Anyscale's 2026 Ray on the Road series, an eight-city tour focused on helping organizations navigate the growing complexity of AI workloads. The event showcased how Ray and Anyscale simplify AI infrastructure while bringing together AI developers and platform teams for technical learning, community building, and networking.

From the community: Ray user talks

Philipp Moritz, co-creator of Ray and co-founder of Anyscale

Philipp Moritz, co-creator of Ray and co-founder of Anyscale, opened the day alongside Richard Liaw, one of Anyscale's founding engineers, and Christian Stano, Anyscale's Field CTO. Together they argued that AI infrastructure keeps getting harder, as multimodal data processing, foundation model training, and the shift toward reinforcement learning (RL) workloads outgrow the tools AI and platform teams have relied on.

Four practitioner talks followed from teams running AI at scale with Ray. The industries differed, but the conclusion was the same each time: when they needed to scale efficiently, Ray was what made it possible.

How Xoople runs geospatial foundation model inference at global scale

Presented by Marcell Ferencz, Head of Enterprise Product Engineering, Xoople

Xoople's goal is to run geospatial foundation model predictions at global scale, so the team set out to test how fast it could run pixel-level predictions over an area the size of Spain. The result: 500,000 km² of satellite imagery processed in under 5 minutes on only 12 A10 GPUs, using Anyscale on Azure.
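
As a back-of-the-envelope check, the headline figure can be turned into throughput numbers. The sketch assumes the run took the full 5 minutes (so the results are lower bounds) and that predictions are made on a 10 m pixel grid, Sentinel-2's finest resolution; both are assumptions, not figures from the talk.

```python
# Back-of-the-envelope throughput for the Xoople run.
# Assumptions (not from the talk): the run took the full 300 s, and
# predictions are made on a 10 m x 10 m pixel grid.
AREA_KM2 = 500_000
RUNTIME_S = 5 * 60
NUM_GPUS = 12
PIXEL_SIZE_M = 10

pixels_per_km2 = (1_000 // PIXEL_SIZE_M) ** 2   # 100 x 100 = 10,000 pixels
total_pixels = AREA_KM2 * pixels_per_km2

km2_per_s = AREA_KM2 / RUNTIME_S                # whole cluster
km2_per_s_per_gpu = km2_per_s / NUM_GPUS
pixels_per_s_per_gpu = total_pixels / RUNTIME_S / NUM_GPUS

print(f"{total_pixels:,} pixels")                            # 5,000,000,000 pixels
print(f"{km2_per_s:,.0f} km²/s, {km2_per_s_per_gpu:,.0f} km²/s per GPU")  # 1,667 km²/s, 139 km²/s per GPU
print(f"{pixels_per_s_per_gpu:,.0f} pixels/s per GPU")       # 1,388,889 pixels/s per GPU
```

Because the run finished in under 5 minutes, the real per-GPU rate was at least this high.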

Behind that number is a data pipeline built specifically for geospatial inference. Sentinel-2 imagery is organized into 12-band data cubes with four dimensions (x, y, spectral band, and time) and stored in Zarr, a cloud-native format similar to HDF5 that uses chunked arrays designed for parallel reads and writes. To run this at scale, Marcell's team built extensions for Ray Data, a multimodal data processing library for running heterogeneous CPU and GPU compute pipelines. To get the most out of every GPU, they used fractional-GPU Ray actors, batching, and bf16 precision. With Anyscale, the team can submit a Job, spin up resources quickly, and let it run reliably to completion, so they can focus on development instead of Ray cluster operations.
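
Chunked storage is what makes the cube easy to parallelize: every chunk can be read, processed, and written independently, so each one can become a unit of work for a separate worker. The stdlib sketch below enumerates the chunk grid of a 4-D (x, y, band, time) array. The `chunk_slices` helper and the cube and chunk shapes are hypothetical illustrations, not Zarr's API or Xoople's actual chunking.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple

def chunk_slices(shape: Sequence[int], chunks: Sequence[int]) -> Iterator[Tuple[slice, ...]]:
    """Yield one tuple of slices per chunk of an N-d array.

    Tiles the array the way a chunked store does: edge chunks are
    truncated to the array bounds, and no two chunks overlap.
    """
    starts_per_dim = [range(0, dim, step) for dim, step in zip(shape, chunks)]
    for starts in product(*starts_per_dim):
        yield tuple(
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, chunks, shape)
        )

# Hypothetical cube: x, y, spectral bands, time steps.
cube_shape = (1000, 1000, 12, 4)
# Hypothetical chunking: 256 x 256 spatial tiles, all bands, one time step.
chunk_shape = (256, 256, 12, 1)

tiles = list(chunk_slices(cube_shape, chunk_shape))
print(len(tiles))   # 64  (4 x 4 spatial tiles x 1 band group x 4 time steps)
print(tiles[-1])
```

Each tuple of slices selects one chunk; the last one is the truncated 232 × 232 corner tile at the final time step. Summing the chunk volumes recovers all 48,000,000 cells, so the grid covers the cube exactly once.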

Their first pipeline ran IBM's TerraMind, the first generative, any-to-any multimodal Earth observation foundation model. The pipeline is built so that models can be swapped or added without extra operational complexity, whether for land use and land cover (LULC) classification, embeddings, or cloud masking.
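
The talk did not show how Xoople's pipeline swaps models. One common way to get that property is a small registry that maps task names to model callables, so the data plumbing stays fixed and only the model changes. In the sketch below, `MODEL_REGISTRY`, `register`, `run_pipeline`, and both placeholder models are hypothetical; they stand in for real models such as TerraMind and do no actual inference.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Pixel = List[float]          # hypothetical: one reflectance value per spectral band
Batch = List[Pixel]
ModelFn = Callable[[Batch], list]

MODEL_REGISTRY: Dict[str, ModelFn] = {}

def register(name: str) -> Callable[[ModelFn], ModelFn]:
    """Decorator that adds a model callable to the registry under a task name."""
    def decorator(fn: ModelFn) -> ModelFn:
        MODEL_REGISTRY[name] = fn
        return fn
    return decorator

@register("cloud_mask")
def placeholder_cloud_mask(batch: Batch) -> list:
    # Placeholder rule, not a real cloud model: flag bright pixels.
    return [sum(px) / len(px) > 0.3 for px in batch]

@register("embeddings")
def placeholder_embeddings(batch: Batch) -> list:
    # Placeholder "embedding": per-pixel min and max band values.
    return [(min(px), max(px)) for px in batch]

def run_pipeline(batches: Iterable[Batch], task: str) -> Iterator[list]:
    """Same data path for every task; only the looked-up model changes."""
    model = MODEL_REGISTRY[task]
    for batch in batches:
        yield model(batch)

batches = [[[0.1, 0.2], [0.5, 0.6]]]
print(list(run_pipeline(batches, "cloud_mask")))   # [[False, True]]
print(list(run_pipeline(batches, "embeddings")))   # [[(0.1, 0.2), (0.5, 0.6)]]
```

Adding a new task means registering one more callable; the reading, batching, and writing code does not change.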

How Adyen upgraded its AI platform to train a Transaction Foundation Model (TFM) on a 51-trillion-token dataset

Presented by Martin Iglesias, Research Engineer, AI, and Maxime Batello, MLOps Engineer, Adyen

Adyen processes 200 million payments a day, 45 billion a year, across 400 payment methods, 143 countries, and 230 currencies. ML is embedded throughout the payment flow (risk, authentication, and payment route optimization), all aimed at maximizing authorization rates while minimizing risk.

Their bet was a Transaction Foundation Model (TFM). Fraud patterns emerge from behavioral sequences, as the same payment can carry very different risks depending on the history that precedes it. Rather than compressing that history into hand-engineered features and task-specific models, they trained a single foundation model directly on raw payment sequences, allowing it to learn the behavioral patterns most relevant to fraud detection.
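
To make "the same payment, different history" concrete, here is a minimal sketch that groups raw payments by shopper, orders them by time, and attaches to each payment the payments that preceded it. The `Payment` fields, the `build_histories` helper, and the window size are hypothetical; the talk did not describe Adyen's schema or how sequences are keyed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Payment:
    # Hypothetical fields; Adyen's real schema was not described in the talk.
    shopper_id: str
    timestamp: int
    amount_cents: int
    method: str

def build_histories(payments: List[Payment], max_history: int) -> Dict[Payment, List[Payment]]:
    """Map each payment to the (up to max_history) earlier payments of the same shopper."""
    by_shopper: Dict[str, List[Payment]] = defaultdict(list)
    for p in payments:
        by_shopper[p.shopper_id].append(p)
    histories: Dict[Payment, List[Payment]] = {}
    for seq in by_shopper.values():
        seq.sort(key=lambda p: p.timestamp)  # oldest first
        for i, p in enumerate(seq):
            histories[p] = seq[max(0, i - max_history):i]
    return histories

payments = [
    Payment("shopper-a", 1, 5_000, "card"),
    Payment("shopper-a", 2, 5_000, "card"),
    Payment("shopper-a", 3, 90_000, "card"),
    Payment("shopper-b", 3, 90_000, "card"),
]
histories = build_histories(payments, max_history=3)
# Same amount and method, very different context:
print(len(histories[payments[2]]), len(histories[payments[3]]))  # 2 0
```

A sequence model sees the preceding payments along with the current one, whereas a single tabular row for the 90,000-cent payment would look the same for both shoppers unless someone hand-engineered history features.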

The dataset is the headline here: 200 billion payments, each with a sequence of up to 256 tokens, came to roughly 51 trillion tokens. That is more than twice the roughly 22 trillion tokens Meta reported for pretraining Llama 4 Maverick. The approach paid off once the training set grew past hundreds of millions of payments, at which point the sequence model outperformed their tabular approaches by more than 200 basis points.
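
The headline figure is exactly what you get if every one of the 200 billion payments is counted at the 256-token maximum, so it is best read as the size of the fixed-length token budget rather than a count of non-padding tokens. That reading is an inference from the arithmetic, not something stated in the talk.

```python
# Figures from the talk.
NUM_PAYMENTS = 200 * 10**9
MAX_TOKENS_PER_SEQUENCE = 256
LLAMA4_TOKENS = 22 * 10**12   # the Llama 4 comparison figure used in the talk

max_total_tokens = NUM_PAYMENTS * MAX_TOKENS_PER_SEQUENCE
ratio = max_total_tokens / LLAMA4_TOKENS

print(f"{max_total_tokens:,} tokens")   # 51,200,000,000,000 tokens
print(f"{ratio:.2f}x")                  # 2.33x
```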

Getting there required modernizing both the data and the training stack. With the legacy PySpark/JVM pipeline, engineers waited 4 to 6 hours for data preparation and disk materialization, and scaling PyTorch beyond a single node meant building distributed Python infrastructure by hand. With Ray, the preprocessing, tokenization, dataloading, and training code that used to run on a single node now executes as one streaming Python pipeline, with each step scaling independently across multiple nodes: no separate engines, and no materialization between data preparation and training. Each engineer now runs 5x more experiments per week, on a platform that breaks far less often.
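
The core idea of a streaming pipeline is that each stage pulls rows from the previous one, so nothing is written to disk between data preparation and training. In Ray, each stage would be a Ray Data operation (for example `map_batches`) scheduled across the cluster, and a Ray Train worker would typically consume its shard via `ray.train.get_dataset_shard(...)`. The sketch below stays in a single process with Python generators so it runs anywhere; the stage names, the padding id, and the keep-most-recent truncation rule are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List

MAX_LEN = 256  # maximum sequence length cited in the talk
PAD_ID = 0     # hypothetical padding token id

def read_sequences(source: Iterable[List[int]]) -> Iterator[List[int]]:
    """Stand-in for streaming raw sequences from storage, one row at a time."""
    for seq in source:
        yield seq

def tokenize(rows: Iterator[List[int]]) -> Iterator[List[int]]:
    """Hypothetical step: keep the most recent MAX_LEN ids and left-pad shorter rows."""
    for seq in rows:
        seq = seq[-MAX_LEN:]
        yield [PAD_ID] * (MAX_LEN - len(seq)) + seq

def batches(rows: Iterator[List[int]], batch_size: int) -> Iterator[List[List[int]]]:
    """Group rows into fixed-size batches; the last batch may be smaller."""
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        yield chunk

def train(stream: Iterator[List[List[int]]]) -> int:
    """Stand-in for the training loop: counts rows instead of updating a model."""
    rows_seen = 0
    for batch in stream:
        assert all(len(row) == MAX_LEN for row in batch)
        rows_seen += len(batch)
    return rows_seen

# Synthetic source: 1,000 sequences of varying length (1 to 300 ids).
source = ([i % 7 + 1] * (i % 300 + 1) for i in range(1_000))
print(train(batches(tokenize(read_sequences(source)), batch_size=64)))  # 1000
```

Only one batch's worth of rows is in flight at a time; in the distributed version, each stage can additionally be given its own number of workers.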

How Criteo made its relevancy model multimodal and sped up inference 50x

Presented by Paul Coursaux, Senior Machine Learning Engineer, Criteo

Criteo, the 20-year-old French adtech company, operates 11 datacenters worldwide, runs 2,500+ A/B tests a year, and serves an average of 59,000 ads per second. Paul's team works on sponsored products in retail media, where strong relevancy constraints demand a model that truly understands the match between a search query and a product.

That model is a two-tower architecture trained with a contrastive loss: a keyword encoder on one side and, on the other, a product encoder that combines text and image transformers, trained on 100 million SKU-keyword pairs mined from user behavior. Training runs on Ray Train over more than 1 TB of data on 8 B200 GPUs and takes around 5 days, with Ray Data handling last-mile processing and Ray Tune driving experiments.
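
The talk did not specify which contrastive loss Criteo uses. A common choice for two-tower retrieval is InfoNCE with in-batch negatives, sketched below in plain Python under that assumption: each keyword is pulled toward its own product and pushed away from the other products in the batch. The function names and the temperature of 0.05 are hypothetical, the loss is computed in the query-to-product direction only, and a real model would compute this in a framework such as PyTorch inside Ray Train.

```python
import math
from typing import List

Vector = List[float]

def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def in_batch_contrastive_loss(queries: List[Vector], products: List[Vector],
                              temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives, query-to-product direction only.

    Row i of `queries` and row i of `products` form a positive pair;
    every other product in the batch is a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in products]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every product, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Cross-entropy with target i, via a numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)

keywords = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]      # each product points toward its own keyword
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # products swapped
print(in_batch_contrastive_loss(keywords, matched) < in_batch_contrastive_loss(keywords, mismatched))  # True
```

The two towers only meet in the similarity scores, which is what lets product embeddings be computed offline in bulk, as in the inference pipeline described below.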

What Paul liked about Ray was the flexibility in where the work runs. CPU-heavy steps like tokenization and JPEG decoding can sit on dedicated CPU nodes, or use the spare CPUs on GPU nodes. Decoupling CPU processing from GPU training made the pipeline more scalable, kept smaller GPUs fully fed, and unlocked fault tolerance and support for preemptible workloads.
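
Ray handles this placement across a cluster, running CPU stages and GPU actors on whichever nodes have the resources. The single-machine sketch below shows only the underlying pattern: CPU workers and a GPU-side consumer run concurrently, connected by a bounded queue, so preprocessing overlaps with training and a slow consumer throttles the producers instead of letting buffered data pile up in memory. Threads stand in for Ray workers here, and `cpu_worker`, `gpu_consumer`, and `run` are hypothetical names; this is not how Ray implements it.

```python
import queue
import threading
from typing import List

def cpu_worker(shard: List[bytes], out: queue.Queue, done: object) -> None:
    """Stand-in for CPU-heavy preprocessing such as tokenization or JPEG decoding."""
    for raw in shard:
        out.put(len(raw))   # blocks while the queue is full: backpressure on producers
    out.put(done)           # tell the consumer this worker has finished

def gpu_consumer(inp: queue.Queue, num_workers: int, done: object, seen: List[int]) -> None:
    """Stand-in for the GPU training step; stops after every worker has finished."""
    finished = 0
    while finished < num_workers:
        item = inp.get()
        if item is done:
            finished += 1
        else:
            seen.append(item)

def run(shards: List[List[bytes]], max_in_flight: int = 8) -> List[int]:
    q: queue.Queue = queue.Queue(maxsize=max_in_flight)
    done = object()
    seen: List[int] = []
    threads = [threading.Thread(target=cpu_worker, args=(s, q, done)) for s in shards]
    threads.append(threading.Thread(target=gpu_consumer, args=(q, len(shards), done, seen)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

shards = [[b"x" * i for i in range(10)] for _ in range(4)]
seen = run(shards)
print(len(seen), sum(seen))  # 40 180
```

Scaling the number of CPU workers independently of the consumer is the single-machine analogue of adding CPU nodes, or using spare CPUs on GPU nodes, to keep a GPU fed.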

Offline inference followed the same shape. To embed 2 billion products across a 30TB dataset, CPU workers handle the Parquet reads and writes on either side of the GPU inference actors. Tuning block sizes to avoid memory-hungry repartition steps yielded a 50x speedup in the inference pipeline. Ray also accelerated the original text pipeline and made it feasible to add images, taking the model multimodal.
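
The talk did not say which block sizes Criteo settled on, but the starting arithmetic is simple: 30 TB over 2 billion products is about 15 KB per product, and dividing a per-block memory budget by that average gives a row count per block. The `rows_per_block` helper and the 128 MiB budget below are hypothetical; in Ray Data the corresponding knobs live in settings such as `DataContext.target_max_block_size` and the `batch_size` passed to `map_batches`, so check the Ray Data docs for the version you run.

```python
import math

# Figures from the talk, in decimal units (1 TB = 10**12 bytes).
DATASET_BYTES = 30 * 10**12
NUM_PRODUCTS = 2 * 10**9

def rows_per_block(target_block_bytes: int, avg_row_bytes: float) -> int:
    """Largest row count whose estimated size fits the block budget (at least one row)."""
    return max(1, int(target_block_bytes // avg_row_bytes))

avg_row_bytes = DATASET_BYTES / NUM_PRODUCTS    # average bytes per product row
target_block_bytes = 128 * 1024**2              # hypothetical 128 MiB budget per block
rows = rows_per_block(target_block_bytes, avg_row_bytes)
num_blocks = math.ceil(NUM_PRODUCTS / rows)

print(f"{avg_row_bytes:,.0f} B/row, {rows:,} rows/block, {num_blocks:,} blocks")
# 15,000 B/row, 8,947 rows/block, 223,539 blocks
```

Sizing blocks up front like this is one way to keep each block within memory limits without a separate repartition pass, the kind of step the talk identified as memory-hungry.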

Beyond LLMs: How BMW scaled its AI Gateway from chat to video with Ray and vLLM

Presented by Thomas Riedl, AI Platform Engineer, BMW Group

BMW's Connected AI Platform (CAIP) serves 550+ users, 55+ models, and 60+ use cases across 7 modalities and 25 million connected vehicles, all through one central AI Gateway. Thomas traced its evolution from text chat models accessed through hyperscaler APIs in 2024, to multimodality in 2025, and then to a hard lesson: you cannot build on a model that gets deprecated underneath you, as happened with OpenAI's Sora. Unreliable model availability, data privacy concerns, and per-token costs pushed BMW toward self-hosting.

As a first step, the team began fine-tuning and deploying open-source models in-house. To make these models available to the teams they already supported, they connected a Ray Serve cluster running vLLM to their existing OpenAI-compatible AI Gateway. Because BMW pays for the whole GPU either way, the team uses Ray's fractional GPU allocation to pack three models onto a single L40S card: a Qwen3.5-9B chat model on 0.55 of the GPU, an embedding model on 0.1, and a speech recognition model on 0.15. For image and video, the recently released vLLM-Omni extension adds support for non-autoregressive models such as FLUX.2 and Wan2.2, extending coverage to those modalities.
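
Ray treats a fractional GPU request as a scheduling quantity: a Ray Serve replica asks for, say, `ray_actor_options={"num_gpus": 0.55}`, and Ray will co-locate replicas on one GPU as long as their requests sum to at most 1. The check below reproduces that accounting for BMW's three models with exact fractions, which avoids floating-point rounding in the sum. `fits_on_one_gpu` is a hypothetical helper, not a Ray API.

```python
from fractions import Fraction

# Fractional GPU shares on one L40S, as given in the talk.
allocations = {
    "Qwen3.5-9B chat": Fraction("0.55"),
    "embedding": Fraction("0.1"),
    "speech recognition": Fraction("0.15"),
}

def fits_on_one_gpu(shares) -> bool:
    """A set of fractional requests can share one GPU if they sum to at most 1."""
    return sum(shares, Fraction(0)) <= 1

total = sum(allocations.values(), Fraction(0))
print(fits_on_one_gpu(allocations.values()), float(total), float(1 - total))
# True 0.8 0.2
```

The remaining 0.2 is headroom in Ray's bookkeeping only. Ray does not partition GPU memory for fractional requests, so each model server still has to be configured to stay within its share (for vLLM, typically via `gpu_memory_utilization`).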

The cost case is compelling at scale. At 20,000 videos a month, self-hosted Wan2.2-5B runs about $0.035 per video, versus a flat $0.40 with Sora-2, and the self-hosted per-video cost keeps dropping as volume climbs.
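
At the quoted volume the monthly totals are easy to check. The talk did not break down what drives the self-hosted cost, so the second half of the sketch uses a deliberately simplified model in which the entire $700 is a fixed monthly GPU cost; under that assumption the per-video cost falls in inverse proportion to volume, which is one way to read "keeps dropping as volume climbs". `per_video_cost` is a hypothetical helper, and `Decimal` keeps the currency arithmetic exact.

```python
from decimal import Decimal

VIDEOS_PER_MONTH = 20_000
SELF_HOSTED_PER_VIDEO = Decimal("0.035")  # Wan2.2-5B at 20,000 videos/month (from the talk)
SORA2_PER_VIDEO = Decimal("0.40")         # flat per-video price (from the talk)

self_hosted_monthly = VIDEOS_PER_MONTH * SELF_HOSTED_PER_VIDEO
sora2_monthly = VIDEOS_PER_MONTH * SORA2_PER_VIDEO
print(f"self-hosted ${self_hosted_monthly:,.2f} vs Sora-2 ${sora2_monthly:,.2f} per month")
# self-hosted $700.00 vs Sora-2 $8,000.00 per month

# Simplifying assumption (not from the talk): treat the whole $700 as a fixed
# monthly GPU cost, so the per-video cost is that fixed cost spread over volume.
def per_video_cost(fixed_monthly: Decimal, videos: int) -> Decimal:
    return fixed_monthly / videos

for videos in (20_000, 40_000, 80_000):
    print(videos, per_video_cost(self_hosted_monthly, videos))

# Volume at which the fixed cost equals paying the flat per-video price.
break_even = self_hosted_monthly / SORA2_PER_VIDEO
print(f"break-even: {break_even:,.0f} videos/month")  # break-even: 1,750 videos/month
```

Under this fixed-cost assumption, self-hosting would beat the flat API price above about 1,750 videos a month; a real estimate would also account for variable costs and GPU capacity limits.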

Next stops

The next stop is Ray Summit in San Francisco, August 24-26, bringing together AI leaders working on physical AI, foundation model training, and reinforcement learning for LLMs. Join us!
