Welcome to our first Ray meetup in the New Year 2022, after two years of absence due to the Covid-19 pandemic. We are super excited to be back.
6:00 PM - 6:05 PM: Kickoff, welcome remarks, and agenda by Jules Damji
6:05 PM - 6:15 PM: “The year 2021 in Review of Ray” by Robert Nishihara
6:15 PM - 6:40 PM: “What’s New in Ray 1.9 and Beyond” by Zhe Zhang
6:45 PM - 7:15 PM: “Unifying Preprocessing and Training at Scale with Ray Datasets” by Alex Wu and Clark Zinzow
7:25 PM - 7:50 PM: Ray Community Talk by TBD
Talk 1: The 2021 Year in Review of Ray
We will reflect on Ray’s major milestones, Ray’s ecosystem of ML-native and integrated libraries, and community growth and contributions.
Bio: Robert Nishihara is the co-creator of Ray and CEO and co-founder of Anyscale, the company behind Ray.
Talk 2: What’s New in Ray 1.9 and Beyond
We’ll share what’s new in Ray, what’s coming in the near future and on the roadmap, and how you can get involved and contribute.
Bio: Zhe Zhang leads the Ray OSS project at Anyscale.
Talk 3: Unifying Preprocessing and Training at Scale with Ray Datasets
ML tasks such as distributed training and batch inference stretch the abstractions of modern data processing systems. In this talk, we’ll discuss the wide-ranging problems that the Python community faces when building large-scale preprocessing and training pipelines. Some of these problems are caused by the complexity of stitching together distributed systems that weren’t designed to be compatible, for example, creating a pipeline with Dask and Horovod that can efficiently use the CPUs and GPUs in a cluster. Other problems, like per-epoch dataset shuffling, show a gap between the operations ML practitioners want and what data processing libraries can do efficiently. We’ll also introduce Ray Datasets, a simple, scalable, and Pythonic way of solving these problems.
Bio(s): Alex Wu and Clark Zinzow are both software engineers at Anyscale, working on the Ray Core and cloud computing infrastructure teams.
Talk 4: DeltaCAT: A Scalable Data Catalog for Ray Datasets
Most of today’s open-source data catalogs and data lakes are written for Java, with Python support either unavailable or tacked on as an afterthought. This can lead to awkward programming models and cross-language integration overhead for Python developers. To better connect Python devs to their data, we introduced the DeltaCAT project to the Ray Project ecosystem.
We’ll discuss how DeltaCAT leverages Ray Datasets to manage petabyte-scale data catalog tables. We’ll also review the goals of the project, how Amazon is using it internally, its current state, and future roadmap.
Bio: Patrick Ames is a Senior Software Engineer working on Big Data Technologies / Data Management and Optimization at Amazon.