
Case Study

Inside Coinbase’s ML Infrastructure Overhaul

By migrating to Ray and Anyscale, Coinbase has created a foundation that supports both the speed of experimentation and the scale needed for production-grade models.


15x

more jobs at the same cost

8x

faster data transformation

50x

larger training dataset volumes processed

Quick Takeaways

  • 2 hours → seconds for iteration cycles
  • 15x more jobs at the same cost
  • Able to process 50x greater training dataset volumes
  • 8x faster data transformation
  • Handling billions of rows and terabytes of data
  • Full migration onto Ray and Anyscale in under one year

Introduction

Coinbase's mission is to "increase economic freedom in the world." To fulfill this vision, the company has long been at the forefront of applying machine learning to protect customers and improve product experiences. From preventing fraud and account takeovers to powering customer support tools, the machine learning team at Coinbase supports a wide range of critical use cases.

The Coinbase team has built sophisticated AI systems to secure the cryptocurrency ecosystem and enhance user experiences. Their machine learning models power critical functions across the platform:

  • Fraud Prevention and Security: Models that detect suspicious logins, prevent account takeovers, and block fraudulent fund transfers

  • Risk Assessment: Advanced systems that evaluate transaction risk and monitor external Ethereum and Bitcoin wallet safety

  • Customer Experience: AI-powered chatbots and support systems that enhance service quality

Since deploying their first risk model in 2017, Coinbase's ML capabilities have expanded dramatically. By 2023, they had built proprietary internal systems like EasyML, a feature store, and EasyTensor for recommendation systems, followed by CBGPT for internal and external AI applications.

The Challenge: Infrastructure Bottleneck

Like many organizations, as Coinbase's AI ambitions grew, so did the demands on their ML systems. This increased strain was caused by a variety of factors, including:

  • Expanding Use Cases: From simple fraud detection to real-time predictions, personalization, and large-scale analytics

  • Growing Datasets: Increasing volumes of transaction data, user behaviors, and external signals pushed the limits of their existing infrastructure

  • Complex Models: Moving from tree-based models to DNNs and LLMs

Suffice it to say, their infrastructure needed to become more efficient. The company initially relied on an internally built solution, but that approach could not keep pace as their ML needs expanded.

"We realized our ML Engineers were spending too much time waiting before they could iterate." – Wenyue Liu, Senior Machine Learning Platform Engineer @ Coinbase

There were three main blockers: 

  1. Slow Iteration Speed: On their internal solution, training a model required creating a pull request, building and pushing a Docker image, then triggering a job. Because of the level of complexity and reliance on outside tooling, engineers often had to wait up to two hours to test even minor code changes.

  2. Limited Scale: Training workloads also struggled to scale with the team's needs. Their original system didn't support tree-based models, and many jobs had to run on single large-memory instances, wasting GPU and CPU resources.

  3. Increasing Costs: The complexity of the pipeline made it hard for the Coinbase ML Infra team to optimize, leaving several stages underoptimized and driving up costs. Hyperparameter optimization, for example, often redundantly processed the same data, multiplying infrastructure costs and slowing down experimentation.

"Sometimes engineers would choose to work on features instead of working on a model because they never knew if their PR would work or not. Three or four failed PRs were equal to a whole day wasted." – Wenyue Liu, Senior Machine Learning Platform Engineer @ Coinbase

For a company handling sensitive financial transactions, these constraints weren't just inconvenient – they limited the ability to develop more powerful models for important issues like fraud prevention and risk assessments.

The Solution: From Monolith to Modularity

Coinbase needed an AI infrastructure setup that could accelerate iteration cycles while efficiently handling distributed workloads. To modernize their ML infrastructure, the team chose Ray as the foundation, largely due to Ray’s Python-native, distributed compute engine for training, data processing, and HPO. 

With Ray, the Coinbase team got: 

  • Purpose-Built ML Architecture: Ray’s distributed architecture can handle everything from massive datasets to complex models.

  • Single Unified Framework: With a Python-first approach, Ray enabled teams to work across data preprocessing, training, and deployment without switching tools or frameworks.

  • Seamless Integration: Ray’s compatibility with popular libraries like PyTorch, TensorFlow, XGBoost, and Hugging Face made it easy for the Coinbase team to migrate with minimal disruption.

  • Optimized Resource Management: By offering fine-grained control over CPUs, GPUs, and memory, Ray significantly reduced resource waste and costs.

“Ray aligned with our vision: to iterate faster, scale smarter, and operate more efficiently.” –Wenyue Liu, Senior Machine Learning Platform Engineer @ Coinbase

After evaluating several options, the team decided to move their training and serving from their in-house solution to Ray in Q3 2023. The team began a systematic migration process:

  1. First, they decomposed their monolithic Docker images into modular components

  2. Next, they deployed Ray clusters on Kubernetes with robust integrations for S3 and other resources

  3. They started with simple workloads for quick wins before tackling more complex model training

  4. Finally, they rebuilt data pipelines using Ray Core for truly distributed processing
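
Step 2 above can be illustrated with a minimal KubeRay `RayCluster` manifest. This is a hedged sketch only: the cluster name, Ray version, images, and resource sizes are hypothetical, not Coinbase's actual configuration.

```yaml
# Illustrative KubeRay RayCluster manifest (all names and sizes are hypothetical).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ml-training-cluster
spec:
  rayVersion: "2.9.0"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "4"
                memory: 16Gi
  workerGroupSpecs:
    - groupName: training-workers
      replicas: 4
      minReplicas: 0
      maxReplicas: 70      # scale out horizontally for large training jobs
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                requests:
                  cpu: "8"
                  memory: 32Gi
```

In a setup like this, S3 access is typically granted through the pod spec (for example, via IAM roles for service accounts), so jobs on the cluster can read and write training data directly.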

The Results: Coinbase Runs on Ray and Anyscale

The shift to Ray and Anyscale delivered improvements in speed, scale, and efficiency.

1. Increased Iteration Speed

The new Ray workflow significantly streamlined iterations:

  1. Create a working directory with training logic

  2. Write a simple configuration file

  3. Trigger the job using a CLI wrapper
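
As a hedged illustration of this lighter-weight flow (the file contents and dependency names below are hypothetical), a job's environment can be declared in a small YAML file instead of baked into a Docker image:

```yaml
# runtime_env.yaml – illustrative runtime environment for one training job
working_dir: "."          # upload the local working directory with the training logic
pip:
  - xgboost
  - torch
env_vars:
  EXPERIMENT_NAME: "fraud-model-exp-01"
```

A CLI wrapper can then hand this to Ray's job submission, e.g. `ray job submit --runtime-env runtime_env.yaml -- python train.py`. Code changes are picked up on the next submission rather than after a pull request, Docker build, and image push.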

Coinbase flow – Increased Iteration Speed

2. Increased Scale

Ray's distributed architecture enabled Coinbase to process datasets up to 50 times larger than before. The team replaced their single-instance approach with horizontal scaling across 70+ workers, utilizing Ray Data for distributed preprocessing.

A key breakthrough came when Coinbase implemented and contributed a Ray-to-Delta Sharing connector, accelerating data transfer between Spark (on Databricks) and Ray.

"The last mile data transformation dropped from 120 minutes to 15 minutes because of distributed data processing with Ray." –Wenyue Liu, Senior Machine Learning Platform Engineer @ Coinbase

Other scale improvements: 

  • Ability to process training datasets with 50x more volume

  • Models train on terabytes of data and billions of rows

  • Last-mile data transformation is 8x faster

  • Clusters scale up to 70+ nodes, depending on job complexity

3. Increased Cost Efficiency

Ray's approach to resource sharing delivered substantial cost savings thanks to three main features:

  1. Runtime Environment Caching: Now, long-running clusters cache dependencies – dramatically cutting down on wait time and waste.

  2. Shared Data Processing: With Ray Tune, preprocessing and data loading are shared across all HPO trials – cutting down compute time and cost.

  3. Fine-grained Resource Control: Teams can specify exactly which compute resources each job needs, eliminating waste.
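
The second point above can be sketched in plain Python. This is a conceptual illustration of shared preprocessing across hyperparameter trials, not the Ray Tune API; the function names and values are invented for the example.

```python
# Conceptual sketch: preprocess the data once and share it across all
# hyperparameter-optimization trials, instead of rebuilding it per trial.
calls = {"preprocess": 0}

def preprocess():
    # Stand-in for an expensive transformation over a large dataset.
    calls["preprocess"] += 1
    return [x * 2 for x in range(10)]

def run_trial(data, learning_rate):
    # Each trial consumes the shared dataset rather than recomputing it.
    return sum(data) * learning_rate

shared = preprocess()  # done once, up front
results = [run_trial(shared, lr) for lr in (0.1, 0.01, 0.001)]
print(calls["preprocess"])  # -> 1: three trials, one preprocessing pass
```

Without sharing, the preprocessing counter would equal the number of trials, which is the redundant work (and cost) described in the challenge section.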

The result? Coinbase now runs 15x more training jobs at the same cost as their original solution.

Coinbase – Increased Cost Efficiency

What’s Next for ML at Coinbase?

Coinbase’s machine learning journey reflects a growing trend: the move from managed services to modular, scalable, developer-first infrastructure. With Ray and Anyscale, the Coinbase team has a platform that enables their teams to move faster, train smarter, and operate at scale.

Looking forward, Coinbase continues to expand its Ray-powered infrastructure:

  • LLM Fine-Tuning: Adding capabilities for fine-tuning large language models.

  • Granular Task Tracking: Shifting to task-level insights for improved observability.

  • Managed Services: Exploring managed Ray offerings to further reduce infrastructure complexity.

  • Resource Optimization: Implementing more sophisticated CPU/GPU utilization monitoring.

With a modern ML platform in place, Coinbase is poised to scale its AI capabilities even further – powering the next generation of intelligent financial products.

Conclusion

Coinbase’s ML journey mirrors the evolution many organizations face. Starting with manageable workloads, companies often find themselves outgrowing initial solutions as their ambitions expand. 

But Coinbase's journey represents more than just a technology migration – it's a blueprint for how financial technology companies can build AI infrastructure that enables true innovation. By migrating to Ray and Anyscale, Coinbase has created a foundation that supports both the speed of experimentation and the scale needed for production-grade models.

“Ray and Anyscale aligned with our vision: to iterate faster, scale smarter, and operate more efficiently.”

Wenyue Liu

Senior Machine Learning Platform Engineer @ Coinbase
