Home BlogBlog Detail

Ray 1.13: Improving support for shuffling terabyte-scale and larger datasets

By Stephanie Wang, Jiao Dong, Dmitri Gekhtman and Sang Cho | June 9, 2022

Ray 1.13 is here! The highlights in this release include:

Scalable shuffle in Ray Datasets (now in alpha), improving support for shuffling terabyte-scale and larger datasets
Performance fixes and other enhancements to the Ray Serve Deployment Graph API
Enhancements to KubeRay, including improved stability of autoscaling support (alpha)
Usage stats data collection, now on by default (guarded by an opt-out prompt), helping the open source Ray engineering team better understand how to improve our libraries and core APIs

You can run pip install -U ray to access these features and more. With that, let’s dive in.

LinkImproved support for shuffling terabyte-scale and larger datasets with Ray Datasets

Large-scale (terabyte-scale datasets or larger and/or 1,000+ partitions) Ray Datasets users who are using random_shuffle for ML ingest or sort: this one’s for you. We’ve improved support for shuffling terabyte-scale and larger datasets (now in alpha). The affected API calls are Dataset.random_shuffle and Dataset.sort. In addition to stability improvements in Ray Core, we’ve also introduced a new shuffle algorithm in the Datasets library for scaling to larger clusters and datasets. Check out the documentation for more information.

Fun fact: This feature was inspired by an academic paper co-authored by the Anyscale team — and the team will be giving a talk on this topic at Ray Summit as well.

Read the release notes for more on Ray Datasets improvements now available in 1.13.

Note: Ray AI Runtime (AIR) is currently in alpha. Prior to the Ray AIR beta release, we plan to improve the documentation and scope of the Ray Datasets integrations with Ray AIR.

LinkDeployment Graph API enhancements

Ray 1.13 introduces performance fixes and other enhancements to the Ray Serve Deployment Graph API, which provides a Python-native API to create a complex inference graph of multiple models to complete an inference or prediction, such as chaining, ensembling, and dynamic control flow.

The API is also now nearly ready for beta — but we want your feedback! If you have suggestions on how the Deployment Graph API can work better for your use case, head over to our user forums or file an issue or feature request on GitHub, or get in touch with us via this form. We can’t wait to partner with you to adopt Ray Serve in your project!

For more on creating inference graphs with Ray Serve, check out our recent Ray Meetup or read the full end-to-end walkthrough in the documentation.

LinkImproved autoscaling support in KubeRay

KubeRay is a toolkit for running Ray applications on Kubernetes. Since the Ray 1.12.0 release, we’ve improved the stability of autoscaling support (currently in alpha). We’ve also introduced CI tests guaranteeing functionality of autoscaling, Ray Client, and Ray Job Submission when deploying using the KubeRay operator.

LinkUsage stats collection

Starting in Ray 1.13, Ray collects usage stats data by default (guarded by an opt-out prompt). This data will be used by the open-source Ray engineering team to better understand how to improve our libraries and core APIs, and how to prioritize bug fixes and enhancements.

Here are the guiding principles of our collection policy:

No surprises: You will be notified before we begin collecting data. You will be notified of any changes to the data being collected or how it is used.
Easy opt-out: You will be able to easily opt out of data collection
Transparency: You will be able to review all data that is sent to us.
Control: You will have control over your data, and we will honor requests to delete your data.
We will not collect any personally identifiable data or proprietary code/data.
We will not sell data or buy data about you.

For more details, check the official documentation or RFC.

LinkOther highlights

Python 3.10 support is now in alpha.
Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!).
Ray Workflow comes with a new API and is integrated with Ray DAG.

LinkLearn more

Many thanks to all those who contributed to this release. To learn about all the features and enhancements in this release, check out the Ray 1.13.0 release notes. If you would like to keep up to date with all things Ray, follow @raydistributed on Twitter and sign up for the Ray newsletter.

Improved support for shuffling terabyte-scale and larger datasets with Ray Datasets
Deployment Graph API enhancements
Improved autoscaling support in KubeRay
Usage stats collection
Other highlights
Learn more

Sharing

Sign up for product updates

Introducing KubeRay v1.4

Deploy DeepSeek‑R1 with vLLM and Ray Serve on Kubernetes

The architecture of a Reinforcement Learning (RL) library is split into two primary components: Generation and Training. During the generation phase, an LLM Engine performs multi-turn rollouts within an environment to produce data and reward signals. This output is then fed into the training phase to update the model's parameters. This process forms a feedback loop, where the progressively improved model generates the next iteration of data for continuous refinement.

Open Source RL Libraries for LLMs

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.