This document summarizes Netflix's big data platform, which uses Presto and Spark on Amazon EMR and S3. Key points:
- Netflix processes over 50 billion hours of streaming per quarter from 65+ million members across over 1000 devices.
- Their data warehouse contains over 25PB stored on S3. They read 10% daily and write 10% of reads.
- They use Presto for interactive queries and Spark for both batch and iterative jobs.
- They have customized Presto and Spark for better performance on S3 and Parquet, and contributed code back to open source projects.
- Their architecture leverages dynamic EMR clusters with Presto and Spark deployed via bootstrap actions for scalability.