Confidential
Mao Ye
Big Data Platform at Pinterest
Data Architecture
Design Choices for Hadoop Platform
Pinball for Workflow Management
Data Architecture
Data at Pinterest
• 60 Billion Pins
• 1 Billion boards
• 100M MAU
• 60 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 250 engineers
Pinterest Data Architecture
App
events
Kafka
Secor
Skyline
Pinball
Redshift
Pinalytics
Features
Qubole (Hadoop)
Singer
Design Choices for Hadoop Platform
• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
Hadoop Platform Requirements
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters
Decoupling compute & storage
Hadoop Cluster 1
Transient
HDFS
Hadoop Cluster 2
Transient
HDFS
S3 Persistent Store
Centralized Hive Metastore
Hive Metastore
Pig
Cascading
Hive
HDFS/S3
Data
Metadata
Multi-layered Packaging
• Runtime Staging (on S3): MapReduce Jobs, Hadoop Jars/Libs, Job/User level Configs
• Automated Configuration (Masterless Puppet): Software Packages/Libs, Configs (OS/Hadoop), Misc Sys Admin
• Baked AMI: OS, Bootstrap Script, Core SW
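One way to picture the three packaging layers is as stacked settings, where a more dynamic layer can override the one beneath it. The sketch below is only an illustration: the override order (runtime staging over automated configuration over baked AMI) and every key and value are assumptions, not details taken from the slides.

```python
from collections import ChainMap

# Hypothetical settings per layer; all keys and values are illustrative only.
baked_ami = {"os_image": "base-image", "hadoop_heap_mb": 1024}
automated_config = {"hadoop_heap_mb": 2048, "ntp_enabled": True}
runtime_staging = {"job_jar": "s3://example-bucket/jobs/job.jar"}

# ChainMap searches its maps left to right, so the first layer listed wins.
# The precedence chosen here is an assumption.
effective = ChainMap(runtime_staging, automated_config, baked_ami)


def lookup(key):
    """Return the value from the first layer that defines `key`."""
    return effective[key]
```

With this ordering, `lookup("hadoop_heap_mb")` resolves to the automated-configuration value, while `lookup("os_image")` falls through to the baked AMI layer.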
Executor Abstraction Layer
Hive Metastore
HDFS/S3
Qubole Managed Hadoop
EMR
Executor
Pinball
Dev Server
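An executor abstraction of this kind can be sketched as a small interface with one implementation per backend. This is a minimal illustration, not Pinterest's code: the class names, the `submit` method, and the returned job-id strings are all hypothetical, and no real Qubole or EMR API is called.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Common interface the workflow layer talks to (hypothetical)."""

    @abstractmethod
    def submit(self, command: str) -> str:
        """Submit a command and return a backend-specific job id."""


class QuboleExecutor(Executor):
    def submit(self, command: str) -> str:
        # A real implementation would call the vendor API here.
        return f"qubole:{command}"


class EmrExecutor(Executor):
    def submit(self, command: str) -> str:
        # Placeholder for an EMR-backed submission.
        return f"emr:{command}"


def run_job(executor: Executor, command: str) -> str:
    """Callers depend only on Executor, so backends can be swapped."""
    return executor.submit(command)
```

Because `run_job` only sees the `Executor` interface, a caller such as a workflow manager or a dev server could switch backends without changing job definitions.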
Why Qubole?
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
Pinball for Workflow Management
Scale of Processing
● Scale:
o 60 Billion Pins
o Hundreds of workflows
o Thousands of jobs
o 500+ jobs in a workflow
o 3 petabytes processed daily
● Support:
o Hadoop, Cascading, Hive, Spark …
job
workflow
Why Pinball?
● Requirements
o Simple abstractions
o Extensible in future
o Reliable stateless computing
o Easy to debug
o Scales horizontally
o Can be upgraded w/o aborting workflows
o Rich features like auto-retries, per-job emails, overrun policies…
● Options
o Apache Oozie, Azkaban, Luigi
Pinball Design
Master
Worker
Scheduler
Command Line Clients
UI
Workflow Model
● Workflow
o A directed graph of nodes called jobs
● Edge
o Run-after dependency
● Node
o Each node is a job
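The workflow model can be illustrated with a small dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and is not Pinball's own API; the job names are hypothetical. Each job maps to the set of jobs it must run after.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: job -> set of jobs it must run after.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "report": {"transform"},
    "email": {"transform"},
}


def execution_order(graph):
    """Return one valid job order that respects run-after dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
```

A valid order only exists when the graph has no cycles; `TopologicalSorter` raises `graphlib.CycleError` otherwise.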
Job State
● Job state is captured in a token
● Tokens are named hierarchically
Master
Job Token
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
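The token shown above can be modeled as a small record. This is a sketch under assumptions: the field names come from the slide, but the types, the defaults, and the `tokens_under` helper are hypothetical and only illustrate why hierarchical names are convenient.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Fields mirror the slide; types and defaults are assumptions."""
    version: int
    name: str
    owner: Optional[str] = None
    expiration: Optional[int] = None
    data: Optional[str] = None


def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Hierarchical names allow selecting a subtree by name prefix."""
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(version=123, name="/workflow/w1/job", owner="worker_0", expiration=1234567),
    Token(version=124, name="/workflow/w2/job"),
]
w1_tokens = tokens_under(tokens, "/workflow/w1/")
```

Here `w1_tokens` holds only the token whose name starts with `/workflow/w1/`, which is how a prefix query could fetch everything belonging to one workflow.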
Job State Machine
RUNNABLE
RUNNING
WAITING
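A state machine over these three states could be sketched as follows. The slide names the states but the arrows did not survive extraction, so the transitions below (WAITING to RUNNABLE, RUNNABLE to RUNNING, RUNNING back to WAITING) are an assumption made for illustration.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed transitions; the slide lists the states but not the arrows.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING},
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the allowed moves in one table makes illegal jumps, such as WAITING straight to RUNNING, fail loudly instead of silently corrupting job state.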
Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
Worker, Master, Persistent Store
1: request (Worker to Master)
2: update (Master to Persistent Store)
3: ack
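Claiming work can be sketched as taking a time-limited lease on a token. This is an assumption-laden illustration: the lease semantics are inferred from the `owner` and `expiration` fields of the job token, and the `claim` function, its parameters, and the dict-based token store are hypothetical.

```python
import time
from typing import Dict, Optional


def claim(tokens: Dict[str, dict], name: str, worker: str,
          lease_sec: int, now: Optional[float] = None) -> bool:
    """Give `worker` ownership of a token if it is unowned or its lease expired."""
    now = time.time() if now is None else now
    token = tokens[name]
    if token["owner"] is not None and token["expiration"] > now:
        return False  # another worker still holds a live lease
    token["owner"] = worker
    token["expiration"] = now + lease_sec
    token["version"] += 1  # every accepted update bumps the version
    return True
```

Under this scheme a crashed worker does not block progress: once its lease expires, another worker's claim succeeds, which fits the slide's point that workers scale horizontally while the master holds the state.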
Master
● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread – no concurrency issues
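The persist-before-reply rule can be sketched with a toy master. This is not Pinball's implementation: the `Master` class, its `update` method, and the use of a local JSON file as the persistent store are assumptions chosen to keep the example self-contained.

```python
import json
import os


class Master:
    """In-memory state with a synchronous write to disk before each reply."""

    def __init__(self, path: str):
        self._path = path
        self._state = {}

    def update(self, name: str, value: dict) -> str:
        # Build the new state without touching the current one yet.
        new_state = {**self._state, name: value}
        # Persist first; a local file stands in for the persistent store.
        with open(self._path, "w") as f:
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())
        # Commit in memory and acknowledge only after the write succeeded.
        self._state = new_state
        return "ack"
```

Because the in-memory state changes only after the write succeeds, a failed write leaves the master's memory unchanged, and a single-threaded caller never observes a half-applied update.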
Worker
Open Source
Git repo: https://github.com/pinterest/pinball
Mailing list: https://groups.google.com/forum/#!forum/pinball-users
Thank You

Big Data Platform at Pinterest