Big Data, Fast Data @ PayPal (YOW 2018)

@r39132
Big Data, Fast Data @ PayPal
Sid Anand (@r39132)
YOW! Conferences (Sydney, Brisbane, Melbourne)
Nov-Dec 2018
A Data Infrastructure Story

@r39132
About Me
Worked @
Committer & PPMC on
Father of 2
Co-Chair @
Work @

@r39132
@Scale: Last Year
PayPal by the Numbers! (Full year 2017 numbers)
• 200+ Markets
• 100+ Currencies
• 227M Active Customer Accounts
• 7.8B Payments Transactions
• 2,700 Applications
• 4,500 Engineers
• 17,000 Releases
• 200,000 Servers
• 27 Megawatts Power
• 238 Petabytes Storage

@r39132
Putting our data scale in perspective …
[Figure: a stack of DVDs 7x the height of Mt Everest; two further comparisons labeled x 500,000 and x 2,000,000]

@r39132
And we continue to see growth in all areas…

@r39132
And to keep up with this growth, we’ve had to scale our data infrastructure
OLTP DBs
• 2,000+ Database Instances
• ~116 Billion Calls/day
• ~74 PB Total Storage
Kafka (Messaging)
• 400+ Billion Messages/day
• ~7 PB Total Storage
• 50+ Clusters
• 3K+ Topics
Hadoop (Analytics)
• 200,000+ Jobs/day
• 32 Hadoop Clusters
• 250+ PB Storage

@r39132
Interlude …
Why we love Ozzies!
• Oz has ~25MM people
• Ozzies Eligible for PayPal: ~19MM people
• Ozzies with Active Accounts: ~7MM
• @ 37%, it’s PayPal’s most penetrated market!!
• PayPal

@r39132
Setting the Context
To understand PayPal’s Data Infrastructure today, scale is only half the story!
Its Data Infrastructure has evolved based on the creation of new technologies as well as changing requirements
PayPal is a 20-year-old company!

@r39132
Building A Modern Website
A Data Infrastructure Evolution Story

@r39132
Building a Modern Day Web Site
[Diagram, built up over several slides: CName and DB; then Load Balancers; then Search; then CDC and Indexing; then a Media Store and a second CName]

@r39132
[Diagram: DB, CName, Search, CDC, Indexing, Media Store (CName); adds Reporting, Ad-hoc and DP/ML]
Analytics Use-cases
1. Reporting (Nightly)
• Well-defined columns
2. Ad-hoc Analysis (throughout Day)
• Fast reads, any column
3. Data Processing / ML training
• Large scans & writes

@r39132
[Diagram: DB, CName, Search, CDC, Indexing, Media Store (CName)]
Impedance Mismatch
• Serving needs
  • Fast reads & writes
  • Well-defined workloads
  • Simple queries
  • OLTP DBs
• Analytic (Ad-hoc) needs
  • Fast reads
  • Unknown workloads
  • Complex (exploratory) queries
  • OLAP DBs

@r39132
[Diagram adds: Ad-hoc DB, ETL/ELT, Reporting, DP/ML]
Analytics Use-cases
2. Ad-hoc Analysis (throughout Day)
• Fast reads, any column
3. Data Processing / ML training

[Diagram adds: Scheduler]
A workflow scheduler needs to coordinate the nightly/hourly loads!

[Diagram adds: HDFS, next to DP/ML and the Scheduler]

@r39132
[Diagram adds: a second Ad-hoc label next to DP/ML]
Increasingly, ad-hoc exploratory queries are also being moved to the data lake to keep costs down!

@r39132
[Diagram adds: Kafka, HDFS Ingest]
What about app engagement metrics & other business metric events?
• The web apps log business events to Kafka
• A Kafka consumer ingests these events into HDFS, where they can be aggregated & possibly also used in ML features

@r39132
[Diagram adds: Graph Ingest, Graph Processing, Graph DBs]
We live in a connected world.
• We can infer a lot from what goes on around us in our connected neighborhood.
• Graph Processing
• Graph DBs

@r39132
[Diagram adds: Cache]
And who can forget about caches?

@r39132
[Diagram adds: RT OLAP]
And RT OLAP engines like Apache Druid or LinkedIn’s Pinot!
A specialty data system optimized for time-oriented roll-ups

@r39132
[Diagram: DB, CName, Search, CDC, Indexing, Media Store, Ad-hoc DB, ETL/ELT, Reporting, DP/ML, Ad-hoc, Kafka, HDFS Ingest, Graph Ingest, Graph Processing, Graph DBs, Cache, RT OLAP]
Modern Data Infrastructure

@r39132
Data Infrastructure Domain: Specialty Data Systems (Examples)
Online Serving
• OLTP DBs (NoSQL, NewSQL, RDBMS): MySQL, Postgres, FoundationDB
• Caches: Redis, Memcached
• Search Engines: Elasticsearch, SOLR
• Graph Engines: JanusGraph, AWS Neptune, TigerGraph
• Media Stores (Object, Filers): AWS S3, LinkedIn Ambry
• RT OLAP engines: LinkedIn Pinot, Apache Druid
Offline Analytics
• OLAP (MPP) DBs: Teradata, AWS Redshift, BigQuery
• Graph Processing: GraphX
• Large Scale Data Processing: Pig, Spark, M/R
• SQL-on-Hadoop: Presto, Impala, KSQL
• Stream Processing: Spark, Flink, Beam, Storm
• ML Platforms: MLFlow, Kubeflow
• BI tools (Reporting): Tableau, MicroStrategy
Data Movement
• Streams: Kafka
• Workflow Schedulers: Apache Airflow, UC4, Control-M
• Ingesters (Graph, Search, Hadoop, ETL/ELT): Sqoop, LinkedIn Gobblin, Informatica

@r39132
Key Take-aways
• Common pitfall!
• When your primary OLTP data store is struggling under load, your
first reaction may be to
• Scale it out! Or
• Replace it with a hot new technology

@r39132
Key Take-aways
• Better approach
• Analyze the workloads & potentially
• Move different workloads to different systems
• Hire specialty talent to manage those systems
• Separate those systems by well-defined interfaces & protocols

@r39132
Key Take-aways
This is Microservices & Conway’s law applied to Data Engineering

@r39132
PayPal Data Architecture
An Overview

@r39132
PayPal’s (Core) Architecture (Simplified)
[Diagram: PHX and SLC data centers]
2 Customer-Serving Data Centers today, more on the way

@r39132
[Diagram: CName, PHX, SLC]
Mobile & Web App traffic that hits paypal.com is Akamai-routed to one of these 2 Data Centers

@r39132
[Diagram: CName, PHX and SLC, Load Balancers]
Within a Data Center, we have multiple Availability Zones (AZs).
A routing layer within the Data Center will route to one of the Availability Zones.
Each AZ is composed of many microservices as well as other services, such as Kafka clusters, etc.

[Diagram adds: OCC, OCC, DB, DB (RO), GG]
DB requests are made to a single “Horizontal” AZ that contains all of the Core DBs (Oracle RACs)
OCC = Oracle Connection Cache

@r39132
[Diagram: PHX and SLC (CName, OCC, DB) plus LVS]
PP has one Analytics Data Center in Las Vegas!

@r39132
[Diagram: PHX and SLC (CName, OCC, DB) plus LVS, which holds Hadoop and Teradata]
We have 2 major data store types in our Analytics Data Center:
• Teradata
• Hadoop
While Reporting is primarily from Teradata, the other use cases (DP/ML, Ad-hoc) can hit either store

@r39132
[Diagram: PHX, SLC, LVS; GG, DB (Pump), OIS, CDH-R, Core Data Highway, Informatica (ETL/ELT); Hadoop, Teradata]
Custom pipelines feed both Teradata & Hadoop from our Site DBs

@r39132
[Diagram: PHX, SLC, LVS; GG, DB (Pump), OIS, CDH-R, Core Data Highway, Informatica (ETL/ELT); Hadoop, Teradata; adds Scheduler]
We have 3 schedulers today for Batch Job execution

@r39132
[Diagram: PHX, SLC, LVS; GG, DB (Pump), OIS, CDH-R, Core Data Highway, Informatica (ETL/ELT), Scheduler; Hadoop, Teradata; adds Steam Donkey]
Our home-grown Steam Donkey transfers data between Teradata & Hadoop

@r39132
[Diagram: PHX, SLC, LVS; GG, DB (Pump), OIS, CDH-R, Core Data Highway, Informatica (ETL/ELT), Scheduler, Steam Donkey; Hadoop, Teradata]
The remainder of this talk will focus on the highlighted components:
• Fast Data (CDH)
• Big Data (Hadoop & More)

@r39132
Fast Data in Action
Let’s look at a use-case

@r39132
Fast Data in Action
Say I want to send my
wife money!

@r39132
Fast Data in Action
After specifying an
amount & a message, I
hit Send

@r39132
Fast Data in Action
I see a confirmation
page

@r39132
Fast Data in Action
And I see the transfer
in my activity feed!

@r39132
Fast Data in Action
[Diagram labels: Asynchronous, Synchronous; DB, DB, Synchronization]

@r39132
Fast Data in Action
[Diagram (SLC): DB]
Once the customer sees the confirmation screen, she can rest assured that a commit has completed to the TXN database!

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat]
• Oracle GoldenGate reads the Redo log into its proprietary trail file format & streams it to the CDH Replicat

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Avro Schema Registry; arrow labeled “register”]
• The Replicat reads the trail file, record by record
• Extracts the DB schema of each row, converts it into an Avro schema, and registers that with the Avro Schema Registry (ASR)

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Avro Schema Registry, Kafka]
• Composes an Avro message
• Sends the message to Kafka

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Avro Schema Registry; arrow labeled “get”]
• A Storm Router gets the Writer’s schema id from the message header
• Contacts the ASR to download the schema by id, if not in a local cache
• Decodes the datum using the Writer’s schema
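
The lookup-then-decode steps on this slide can be sketched as below. This is a minimal illustration under stated assumptions, not the CDH implementation: `SchemaRegistryStub`, `CachingDecoder`, the `schema_id` header name, and the JSON-array payload (standing in for Avro's binary encoding, which likewise carries values in schema field order without field names) are all hypothetical.

```python
import json


class SchemaRegistryStub:
    """Hypothetical in-memory stand-in for the Avro Schema Registry (ASR)."""

    def __init__(self):
        self._schemas = {}   # schema id -> canonical JSON text
        self.lookups = 0     # counts remote "get" calls, to show the cache working

    def register(self, schema: dict) -> int:
        # Registering an identical schema again returns the existing id.
        text = json.dumps(schema, sort_keys=True)
        for schema_id, existing in self._schemas.items():
            if existing == text:
                return schema_id
        schema_id = len(self._schemas) + 1
        self._schemas[schema_id] = text
        return schema_id

    def get(self, schema_id: int) -> dict:
        self.lookups += 1
        return json.loads(self._schemas[schema_id])


class CachingDecoder:
    """Decodes messages with the writer's schema, fetching each schema at most once."""

    def __init__(self, registry: SchemaRegistryStub):
        self._registry = registry
        self._cache = {}     # local cache: schema id -> schema

    def schema_for(self, schema_id: int) -> dict:
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get(schema_id)
        return self._cache[schema_id]

    def decode(self, message: dict) -> dict:
        # 1. Read the writer's schema id from the message header.
        schema = self.schema_for(message["headers"]["schema_id"])
        # 2. The payload holds values only; the schema supplies the field names.
        names = [f["name"] for f in schema["fields"]]
        return dict(zip(names, json.loads(message["payload"])))
```

Because the payload carries no field names, a consumer cannot interpret it without the writer's schema, which is why the id travels in the header and the cache keeps the registry off the hot path.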

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Avro Schema Registry; arrow labeled “Read full record by PK”]
• Hydrates the message from Oracle to get all columns (not just CDC columns)!

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Avro Schema Registry, Kafka]
• Generates N output messages, one per destination, masking sensitive columns by destination
• Sends N messages
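
A minimal sketch of the per-destination fan-out described on this slide. The destination names, column names, and the `****` mask value are hypothetical; the deck does not say how masking is configured.

```python
MASK = "****"  # hypothetical mask value


def fan_out(record: dict, destinations: dict) -> dict:
    """Return one output message per destination.

    `destinations` maps a destination name to the set of columns that must be
    masked for that destination. The input record is left unmodified.
    """
    outputs = {}
    for destination, masked_columns in destinations.items():
        outputs[destination] = {
            column: (MASK if column in masked_columns else value)
            for column, value in record.items()
        }
    return outputs


# Example: two destinations with different masking rules (all names invented).
record = {"txn_id": "t1", "email": "a@example.com", "amount": "10.00"}
destinations = {"activity": set(), "analytics": {"email"}}
messages = fan_out(record, destinations)
```

With N destinations configured, `fan_out` yields N messages, each of which would then be sent to that destination's topic.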

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka, Avro Schema Registry; arrow labeled “get”]
• The Activity Services consumer app follows the same steps previously mentioned to decode the Avro message

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka, Avro Schema Registry, DB]
• It does some transformation to the data before storing it in its own DB

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka, Avro Schema Registry, DB]
• When you visit the Activity mobile or web app, your data is retrieved from the Activity Services DB!

@r39132
Fast Data in Action
[Diagram (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka, Avro Schema Registry, DB]
Activity Streams -- by the Numbers!
• Scale: hundreds of millions of events / day
• Latency (99%ile): < 60s
• Correctness: 100%

@r39132
Take-Aways: Change Data Capture

@r39132
Why Change Data Capture?
[Diagram: DB, Synchronization, DB]
Many ways to sync two or more databases:
• XA Transactions
• Event Sourcing
• Change Data Capture

@r39132
XA Transactions (a.k.a. 2-Phase Commits)
DB DB
Problem:
• Giving up Availability for consistency (CAP Theorem)

@r39132
Event Sourcing
[Diagram: DB, DB; writes (W W W W W W W W) in Kafka]
Problem:
• Giving up Read-Your-Write Consistency

@r39132
Change Data Capture
DB DB
Solution:
• Guaranteed eventual consistency with low latency

@r39132
Take-Aways: Apache Avro

@r39132
Why is Avro Needed?
DB

@r39132
Why is Avro Needed?
DB
The Data Contract between Reader & Writer is
enforced by the DB via a table Schema

@r39132
Why is Avro Needed?
DB
Kafka

@r39132
Why is Avro Needed?
Avro
• Is an efficient self-describing (schema’d) data serialization format
• Supports Schema evolution
• Has good support in most languages
• Is widely accepted in the Big & Fast data space
• Is used for data interchange across both streams and files (HDFS)
[Diagram: Kafka]

@r39132
Fast Data Architecture
The Control Plane

@r39132
[Diagram: CDH Data Plane (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka; ASR (register / get)]

@r39132
Some More Requirements
• We have ~60K tables in our Oracle databases
• We can’t just turn on 60K streams as it would be wasteful, especially if no one needs to consume them!
• We have 4500+ engineers in PayPal & 6 engineers on the CDH dev team
• How do we enable anyone in the company to launch any stream?
• If we did eventually have 60K+ streams, how would we manage them?

@r39132
[Diagram: CDH Self-Service Control Plane (Metadata DB) and CDH Data Plane (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka; ASR (register / get)]

@r39132
[Diagram: CDH Control Plane (Metadata DB) and CDH Data Plane (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka; ASR (register / get)]
• A PP engineer visits the CDH self-service portal (a ReactJS app) to provision a data pipeline
• He or she submits a request for a new pipeline
• The provision request is recorded in the metadata DB
• A periodic Airflow job kicks off to call an API on the Squbs server to execute long-running pipeline provisioning tasks
• This task creates new Kafka topics, GG Replicat processes, and Storm topologies!
• Within a minute a new pipeline is flowing!
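
The split between recording a request and acting on it can be sketched as follows. This is a minimal in-memory illustration, not PayPal's code: the `MetadataDB` class, its fields, the status values, and the resource naming scheme are all hypothetical, and `provision_resources` is a placeholder for the real long-running work.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataDB:
    """In-memory stand-in for the control plane's metadata DB (hypothetical schema)."""
    requests: list = field(default_factory=list)

    def record_request(self, table: str, requested_by: str) -> int:
        # Intent capture: the portal only records what the engineer asked for.
        request = {"id": len(self.requests) + 1, "table": table,
                   "requested_by": requested_by, "status": "REQUESTED",
                   "resources": None}
        self.requests.append(request)
        return request["id"]

    def pending(self) -> list:
        return [r for r in self.requests if r["status"] == "REQUESTED"]


def provision_resources(table: str) -> dict:
    """Placeholder for creating the Kafka topic, GG Replicat process and
    Storm topology. The names returned here are invented for illustration."""
    return {"kafka_topic": f"cdh.{table}",
            "gg_replicat": f"replicat_{table}",
            "storm_topology": f"router_{table}"}


def orchestrate_once(db: MetadataDB) -> int:
    """One run of the periodic job: provision every pending request.

    Safe to call repeatedly; requests already ACTIVE are skipped.
    """
    activated = 0
    for request in db.pending():
        request["resources"] = provision_resources(request["table"])
        request["status"] = "ACTIVE"
        activated += 1
    return activated
```

Keeping the portal limited to writing intent means a slow or failed provisioning step never blocks the engineer's request; the periodic job simply picks it up on its next run.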

@r39132
Design Principles
1. System built from OSS components & runs on containers (HA)!
2. Separation of Concerns:
• Intent Capture vs Orchestration
3. Orchestration is the brains of the control plane!
• DP Self-healing
• DP Auto-scaling
• Fault-tolerant actions
• Maintenance-aware
Metadata
DB
CDH Control
Plane

@r39132
Fast Data Requirements
Data Plane Requirements

@r39132
• Correctness – 0% data loss/corruption
• Latency – 99%ile < 1 minute (rain or shine)
• Availability – Always Available

@r39132
• Correctness – 0% data loss/corruption
• Causes of data loss/corruption are typically
• Deployments of Buggy Code
• Data corner-cases – latent bugs not related to recent code changes but to data outliers
• Latency – 99%ile < 1 minute (rain or shine)
• Definition of Latency SLA Misses
• Data is arriving, but it is delayed
• Causes of latency SLA Misses
• Scalability bottlenecks
• Performance bottlenecks
• Availability – Always Available
• Definition of Availability SLA Misses
• No data is arriving
• Causes of availability loss are typically
• Deployments of Buggy Code
• SPOF outages
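
The two SLA-miss definitions above can be told apart mechanically. A minimal sketch, assuming end-to-end latencies are sampled per monitoring window and using a nearest-rank percentile; the function names and the windowing are assumptions, not something the deck specifies.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    # ceil(pct / 100 * n), computed in integer arithmetic to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def classify_window(latencies_s, threshold_s: float = 60.0) -> str:
    """Classify one monitoring window of end-to-end latencies (in seconds)."""
    if not latencies_s:
        # No data is arriving: an availability SLA miss.
        return "availability_miss"
    if percentile(latencies_s, 99) >= threshold_s:
        # Data is arriving, but it is delayed: a latency SLA miss.
        return "latency_miss"
    return "ok"
```

A single slow event in a window of 100 does not breach the 99th-percentile target, but two do, which is what makes the percentile a stricter signal than an average.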

@r39132
Fast Data Challenges
Solutions!

@r39132
Fast Data Challenges
1. Performance Bottlenecks
2. Data Corner Cases
3. Deployments of Buggy Code

@r39132
Performance Bottlenecks
[Diagram: CDH Control Plane (Metadata DB) and CDH Data Plane (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka; ASR (register / get)]
• Hydration Queries
  • The biggest bottleneck is the hydration query back to the source DB for updated rows
  • Hydration queries can take 20-40 ms vs 500 microseconds
  • Solution: Oracle GG Full-Supplemental Logging! No more hydration!
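
To see why hydration dominates, here is a back-of-the-envelope calculation. It assumes the 500 microseconds figure is the per-record cost without hydration and that a worker handles records one at a time; neither assumption is stated in the deck.

```python
def max_records_per_second(per_record_seconds: float) -> float:
    """Upper bound on throughput for one worker that handles records sequentially."""
    return 1.0 / per_record_seconds


cases = [
    ("hydration, 20 ms", 0.020),
    ("hydration, 40 ms", 0.040),
    ("no hydration, 500 microseconds", 0.0005),
]
for label, seconds in cases:
    rate = max_records_per_second(seconds)
    print(f"{label}: ~{rate:,.0f} records/s per sequential worker")
```

Under these assumptions a worker tops out at roughly 25 to 50 records per second with hydration versus about 2,000 without, a 40x to 80x difference.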

@r39132
Data Corner Cases
• Considerations
1. A latent bug can be triggered when it encounters unexpected data!
• Approach
• We do 0 type conversions!
• Oracle GoldenGate provides everything as String
• Due to the Number type, which does not map to any numeric type in Avro or any programming language, we had to abandon end-to-end type safety
• The upside is that we don’t run into type-conversion issues related to data corner cases!
• We don’t replicate LOB fields
• Currently, we have no transformation logic in our pipelines!
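
A small illustration of why passing numbers through as strings avoids conversion corner cases. The sample value and the choice of `Decimal` on the consumer side are assumptions for illustration; the deck only says the pipeline itself performs no type conversions.

```python
from decimal import Decimal


def parse_number_field(text: str) -> Decimal:
    """One possible consumer-side parse: exact, arbitrary-precision decimal."""
    return Decimal(text)


# A value with more significant digits than a 64-bit float can hold (invented).
raw = "12345678901234567890.123456789"

exact = parse_number_field(raw)   # keeps every digit
lossy = float(raw)                # rounds to the nearest representable double

print("string round-trips exactly:", str(exact) == raw)
print("float preserved the value: ", Decimal(lossy) == exact)
```

Had the pipeline converted such a field to a floating-point type in flight, the rounding would be silent and data-dependent, exactly the kind of latent corner case the slide describes.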

@r39132
Code Deployments
[Diagram: CDH Control Plane (Metadata DB) and CDH Data Plane (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router; ASR (register / get)]
1. Set maintenance mode (pausing all orchestration actions)
2. Stop Storm topology
3. Back up checkpoint
4. Deploy new code
5. Start Storm topology
6. Monitor for errors. If errors are detected:
   a. stop topology
   b. roll back checkpoint
   c. roll back code version
   d. restart topology
7. Set maintenance mode off (unpausing all orchestration actions)
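
The deployment steps above can be sketched as a single procedure. This is a minimal model under stated assumptions: the `Pipeline` class and the `is_healthy` callback are hypothetical stand-ins for the Storm topology, its checkpoint store, and whatever error monitoring is actually used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical stand-in for one pipeline's deployable state."""
    code_version: str
    checkpoint: int
    running: bool = True
    maintenance_mode: bool = False


def deploy(pipeline: Pipeline, new_version: str,
           is_healthy: Callable[[Pipeline], bool]) -> bool:
    """Deploy new_version; return True if it stays, False if it was rolled back."""
    pipeline.maintenance_mode = True               # 1. pause orchestration actions
    try:
        pipeline.running = False                   # 2. stop topology
        saved_checkpoint = pipeline.checkpoint     # 3. back up checkpoint
        saved_version = pipeline.code_version
        pipeline.code_version = new_version        # 4. deploy new code
        pipeline.running = True                    # 5. start topology
        if is_healthy(pipeline):                   # 6. monitor for errors
            return True
        pipeline.running = False                   # 6a. stop topology
        pipeline.checkpoint = saved_checkpoint     # 6b. roll back checkpoint
        pipeline.code_version = saved_version      # 6c. roll back code version
        pipeline.running = True                    # 6d. restart topology
        return False
    finally:
        pipeline.maintenance_mode = False          # 7. unpause orchestration
```

Restoring the checkpoint along with the code means events consumed by the faulty version are reprocessed by the old one, and the `finally` clause ensures maintenance mode is lifted on either path.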

@r39132
Fast Data Stats!
[Diagram: CDH Control Plane (Metadata DB) and CDH Data Plane (SLC): DB, GG, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka; ASR (register / get)]
Data Plane
• GA’d in August 2018
• 2.2 TB streamed / day
• 300+ Pipelines activated through our self-service portal!

@r39132
Closing Thoughts
• Favor microservice approaches to building data architectures
• When possible (almost always), favor OSS data projects over proprietary ones
• In stream processing, #NO_OPS is the only way to meet SLAs
• Check out our OSS Data Projects on http://paypal.github.io/

@r39132
Acknowledgments
• Akara Sucharitakul
• Anil Gursel
• Doron Mimon
• Na Yang
• Maulin Vasavada
• Kevin Lu
• Prasanna Krishna
• Sri Shivananda
• Kamlakar Singh
• Nagendra Rai
• Swroop Singh
• Anoj Rawat
• Rahul Srivastava
• Naitra Muralykrishnan
• Prabhu Kasinathan
• Vincent Chen
• Anisha Nainani
• Pramod Garre
• Harsh Bhimani
• Nirmalya Ghosh
• Yash Shah
• Aastha Sinha
• Deepak Mohanakumar Chandramouli
• Romit Mehta
• Dheeraj Rampally
• Stalin Subbiah
• Ashwin Nellore
• Lohit Giri
• Plamen Jeliazhov
• Sehmuz Bayhan
And Many More…
