Ray Summit

Petabyte Scale Datalake Table Management with Ray, Arrow, Parquet, and S3

Managing a data lake that your business depends on to continuously deliver critical insights can be a daunting task. From applying table upserts/deletes during log compaction to managing structural changes through schema evolution or repartitioning, there's a lot that can go wrong and countless trade-offs to weigh. Moreover, as the volume of data in individual tables grows to petabytes and beyond, the jobs that fulfill these tasks grow increasingly expensive, fail to complete on time, and entrench teams in operational burden. Scalability limits are reached and yesterday's corner cases become everyday realities. In this talk, we will discuss Amazon's progress toward resolving these issues in its S3-based data lake by leveraging Ray, Arrow, and Parquet. We will also review past approaches, subsequent lessons learned, goals met/missed, and anticipated future work.

Speakers

Patrick Ames

Senior Software Engineer, Amazon
