Ray Datasets is a Ray-native distributed dataset library that serves as the standard way to load, process, and exchange data in Ray libraries and applications. It offers performant distributed data loading, flexible parallel compute operations, broad datasource compatibility, and integrations with distributed frameworks, all behind a simple API.
Register for the webinar and get a first-hand look at how Ray Datasets:
Provides highly scalable parallel I/O to the most popular storage backends and file formats
Supports common last-mile preprocessing operations, including basic parallel data transformations such as map, batched map, and filter, and global operations such as sort, shuffle, groupby, and stats aggregations
Efficiently integrates with data processing libraries (e.g., Spark, Pandas, NumPy, Dask, Mars) and machine learning frameworks (e.g., TensorFlow, Torch, Horovod)
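To make the difference between per-block transformations (map, batched map, filter) and global operations (such as groupby) concrete, here is a minimal sketch in plain Python. It is not the Ray Datasets API: it uses only the standard library, and every function name in it (`split_into_blocks`, `parallel_map_batches`, `parallel_filter`, `global_groupby_count`) is a hypothetical helper invented for illustration. It assumes data is held as a list of blocks, with a thread pool standing in for distributed workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(rows, num_blocks):
    """Partition rows into contiguous blocks (hypothetical helper)."""
    size = max(1, -(-len(rows) // num_blocks))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_map_batches(blocks, fn):
    """Batched map: apply fn to each whole block, one task per block."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, blocks))


def parallel_filter(blocks, predicate):
    """Filter: each block is filtered independently of the others."""
    return parallel_map_batches(
        blocks, lambda block: [row for row in block if predicate(row)]
    )


def global_groupby_count(blocks, key):
    """Global operation: per-block partial counts must be merged."""
    def count_block(block):
        partial = defaultdict(int)
        for row in block:
            partial[key(row)] += 1
        return partial

    totals = defaultdict(int)
    for partial in parallel_map_batches(blocks, count_block):
        for group, n in partial.items():
            totals[group] += n
    return dict(totals)


rows = list(range(10))
blocks = split_into_blocks(rows, 3)

# Per-block transformations need no communication between blocks.
squared = parallel_map_batches(blocks, lambda block: [x * x for x in block])
evens = parallel_filter(squared, lambda x: x % 2 == 0)

# A global operation combines information from every block.
counts = global_groupby_count(blocks, key=lambda x: x % 3)
```

The per-block operations can run without any coordination, while the groupby has to merge partial results from every block. That merge step is why sort, shuffle, and groupby are described as global operations.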
Clark Zinzow is a software engineer at Anyscale. He loves working at the boundaries of service, data, and compute scale. When he's not reading papers or trapped under a stack of books, you can probably find him biking, skiing, or trapped under a different stack of books.
Alex is a software engineer at Anyscale and a Ray committer. His main contributions are to Ray's autoscaler and to the stability, scalability, and performance of Ray Core.