Big Data Platform at Pinterest

Confidential
Mao Ye
Big Data Platform at Pinterest

Data Architecture
Design Choices for Hadoop Platform
Pinball for Workflow Management

Data at Pinterest
• 60 Billion Pins
• 1 Billion boards
• 100M MAU
• 60 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 250 engineers

Pinterest Data Architecture
[Architecture diagram. Components: App, events, Singer, Kafka, Secor, Qubole (Hadoop), Pinball, Skyline, Redshift, Pinalytics, Features]

Design Choices for Hadoop Platform

Hadoop Platform Requirements
• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters

Decoupling compute & storage
Hadoop Cluster 1: transient HDFS
Hadoop Cluster 2: transient HDFS
S3: persistent store

Centralized Hive Metastore
Pig, Cascading, Hive
Metadata: Hive Metastore
Data: HDFS/S3

Multi-layered Packaging
Runtime Staging (on S3): MapReduce Jobs, Hadoop Jars/Libs, Job/User level Configs
Automated Configuration (Masterless Puppet): Software Packages/Libs, Configs (OS/Hadoop), Misc Sys Admin
Baked AMI: OS, Bootstrap Script, Core SW

Executor Abstraction Layer
[Diagram components: Pinball, Dev Server, Executor, Qubole Managed Hadoop, EMR, Hive Metastore, HDFS/S3]
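
One way to picture an executor abstraction layer is a small interface that job definitions target, with one implementation per backend. The sketch below is illustrative only: the class names, the run method, and the returned strings are hypothetical and do not reflect the actual Pinball, Qubole, or EMR APIs.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical interface: submit a command to some Hadoop backend."""

    @abstractmethod
    def run(self, command: str) -> str:
        ...


class QuboleExecutor(Executor):
    # Placeholder: a real implementation would call the Qubole service.
    def run(self, command: str) -> str:
        return f"qubole:{command}"


class EmrExecutor(Executor):
    # Placeholder: a real implementation would call Amazon EMR.
    def run(self, command: str) -> str:
        return f"emr:{command}"


def make_executor(backend: str) -> Executor:
    """Pick a backend by name so job definitions stay backend-agnostic."""
    executors = {"qubole": QuboleExecutor, "emr": EmrExecutor}
    return executors[backend]()
```

With this shape, switching a job from one backend to the other would only change the name passed to make_executor.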

Why Qubole?
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling

Pinball for Workflow Management

Scale of Processing
● Scale:
o 60 Billion Pins
o Hundreds of workflows
o Thousands of jobs
o 500+ jobs in a workflow
o 3 petabytes processed daily
● Support:
o Hadoop, Cascading, Hive, Spark …
[Figure labels: job, workflow]

Why Pinball?
● Requirements
o Simple abstractions
o Extensible in future
o Reliable stateless computing
o Easy to debug
o Scales horizontally
o Can be upgraded w/o aborting workflows
o Rich features like auto-retries, per-job emails, overrun policies…
● Options
o Apache Oozie, Azkaban, Luigi

Pinball Design
[Diagram components: Master, Worker, Scheduler, Command Line Clients, UI]

Workflow Model
● Workflow
o A directed graph of nodes called jobs
● Edge
o Run-after dependency
● Node
o Job is a node
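
The workflow model can be sketched as a graph in which each job lists the jobs it must run after. The example below is a minimal illustration using Python's standard graphlib module; the job names are hypothetical, and this is not how Pinball itself represents or schedules workflows.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it must run after.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(run_after):
    """Return one valid job order that respects every run-after edge."""
    return list(TopologicalSorter(run_after).static_order())


order = execution_order(workflow)
```

Any order returned this way places a job after all of the jobs it depends on; jobs with no edge between them (here "load" and "report") may appear in either order.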

Job State
● Job state is captured in a token
● Tokens are named hierarchically
Master
Job Token
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
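
A token with the fields shown above could be modeled as a small record, and hierarchical names make it possible to select all tokens of one workflow by name prefix. This is a sketch under assumptions: the field types, the defaults, and the tokens_under helper are hypothetical and are not taken from the Pinball code base.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Fields mirror the Job Token slide; types and defaults are assumptions."""
    version: int
    name: str
    owner: str = ""
    expiration: int = 0
    data: object = None


def tokens_under(tokens, prefix):
    """Select tokens whose hierarchical name starts with the given prefix."""
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(version=123, name="/workflow/w1/job", owner="worker_0", expiration=1234567),
    Token(version=124, name="/workflow/w2/job"),
]
w1_tokens = tokens_under(tokens, "/workflow/w1/")
```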

Job State Machine
RUNNABLE
RUNNING
WAITING
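
The three states can be encoded as an enumeration with an explicit table of allowed moves. The slide text names the states but not the transitions, so the cycle WAITING to RUNNABLE to RUNNING and back to WAITING used below is an assumption, as are the names ALLOWED and advance.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed transitions; the slide text does not spell them out.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING},
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```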

Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
[Diagram: Worker, Master, Persistent Store. 1: request, 2: update, 3: ack]

Master
● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread – no concurrency issues
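
The request, update, ack sequence can be sketched as a toy master that holds state in memory and writes each change to a persistent store before replying. Everything here is an assumption made for illustration: the Master class, the update method, and the JSON-lines log file stand in for Pinball's real master and store, which are not described in these slides.

```python
import json
import os
import tempfile


class Master:
    """Toy master: in-memory state, each update persisted before the ack."""

    def __init__(self, store_path):
        self.store_path = store_path
        self.tokens = {}  # token name -> token fields, kept in memory

    def update(self, name, fields):
        # 1: request arrives; apply it to the in-memory state.
        self.tokens[name] = fields
        # 2: update the persistent store and wait for it to reach disk.
        with open(self.store_path, "a") as store:
            store.write(json.dumps({"name": name, "fields": fields}) + "\n")
            store.flush()
            os.fsync(store.fileno())
        # 3: ack only after the write has completed.
        return "ack"


store_file = os.path.join(tempfile.mkdtemp(), "tokens.log")
master = Master(store_file)
reply = master.update("/workflow/w1/job", {"owner": "worker_0", "version": 123})
```

Because the caller only sees "ack" after the write returns, a restart could rebuild the in-memory state by replaying the log; handling every request on one thread keeps the in-memory state and the log in the same order.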

Open Source
Git repo: https://github.com/pinterest/pinball
Mailing list: https://groups.google.com/forum/#!forum/pinball-users
