
Agenda (US Pacific Time):
12:00 - 12:05 pm: Introduction
12:05 - 12:40 pm: Tech Talk: Magnet Shuffle Service: Push-based Shuffle at LinkedIn
1:30 pm: Event closed
Abstract:
The number of daily Spark applications at LinkedIn has more than tripled in the past year. The shuffle process alone now handles more than 10 PB of data and billions of blocks every day in our clusters. With such a rapid increase in Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed reliability issues caused by shuffle fetch connection failures, as well as efficiency issues caused by random reads of small shuffle blocks on disk.
To tackle these challenges and optimize shuffle performance in Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Spark. Our paper describing this work was recently accepted to VLDB 2020. In this talk, we will show how push-based shuffle can drastically increase shuffle efficiency compared with the existing pull-based shuffle. In addition, we will show how combining push-based and pull-based shuffle lets Magnet shuffle service harden shuffle infrastructure at LinkedIn scale, both by reducing shuffle-related failures and by removing scaling bottlenecks. Finally, we will cover a few highlights of the implementation behind Magnet shuffle service, which does not require deploying any external infrastructure or specialized hardware.