@r39132
Big Data, Fast Data @ PayPal
Sid Anand (@r39132)
YOW! Conferences (Sydney, Brisbane, Melbourne)
Nov-Dec 2018
A Data Infrastructure Story
@r39132
About Me
Worked @
Committer & PPMC on
Father of 2
Co-Chair @
Work @
@r39132
Let’s talk scale!
@r39132
@Scale: Last Year
200+
Markets
100+
Currencies
227M
Active Customer Accounts
7.8B
Payments Transactions
2,700
Applications
4,500
Engineers
17,000
Releases
200,000
Servers
27 Megawatts
Power
238 Petabytes
Storage
Full year 2017 numbers
PayPal by the Numbers!
@r39132
Putting our data scale in perspective …
PayPal by the Numbers!
DVDs: 7x the height of Mt Everest
x 500,000
x 2,000,000
@r39132
And we continue to see growth in all areas…
PayPal by the Numbers!
@r39132
And to keep up with this growth, we’ve had to scale our data infrastructure
PayPal by the Numbers!
2,000 +
Database Instances
~116 Billion
Calls/day
~74 PB
Total Storage
OLTP DBs
Kafka
Messaging
400+ Billion
Messages/day
~7 PB
Total Storage
50 +
Clusters
3K +
Topics
200,000 +
Jobs/day
32
Hadoop Clusters
250+ PB
Storage
Hadoop
Analytics
@r39132
Interlude …
Why we love Ozzies!
• Oz has ~25MM people
• Ozzies Eligible for PayPal: ~19MM
people
• Ozzies with Active Accounts: ~7MM
• @ 37%, it’s PayPal’s most penetrated
market!!
• PayPal
@r39132
Setting the Context
To understand PayPal’s Data Infrastructure today, scale is only half the story!
Its Data Infrastructure has evolved with the creation of new technologies as well
as changing requirements
PayPal is a 20-year-old company!
@r39132
Building A Modern Website
A Data Infrastructure Evolution Story
@r39132
Building a Modern Day Web Site
DB
CName
@r39132
DB
Load Balancer
CName
Load Balancer Load Balancer Load Balancer
Building a Modern Day Web Site
@r39132
DB
Load Balancer
CName
Load Balancer Load Balancer Load Balancer
Search
Building a Modern Day Web Site
@r39132
DB
Load Balancer
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
Building a Modern Day Web Site
@r39132
DB
Load Balancer
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
Media
Store
CName
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Media
Store
Ad-hoc ReportingDP/ML
Analytics Use-cases
1. Reporting (Nightly)
• Well-defined columns
2. Ad-hoc Analysis (throughout Day)
• Fast reads, any column
3. Data Processing / ML training
• Large scans & writes
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Impedance Mismatch
• Serving needs
• Fast reads & writes
• Well-defined workloads
• Simple queries
• Analytic (Ad-hoc) needs
• Fast reads
• Unknown workloads
• Complex
(exploratory)
queries
Media
Store
Ad-hoc ReportingDP/ML
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Impedance Mismatch
• Serving needs
• Fast reads & writes
• Well-defined workloads
• Simple queries
• OLTP DBs
• Analytic (Ad-hoc) needs
• Fast reads
• Unknown workloads
• Complex
(exploratory)
queries
• OLAP DBs
Media
Store
Ad-hoc ReportingDP/ML
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
ReportingDP/ML
Analytics Use-cases
1. Reporting (Nightly)
• Well-defined columns
2. Ad-hoc Analysis
(throughout Day)
• Fast reads, any
column
3. Data Processing / ML
training
• Large scans & writes
Media
Store
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
ReportingDP/ML
Scheduler
A workflow scheduler needs
to coordinate the
nightly/hourly loads!
Media
Store
Scheduler
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
Analytics Use-cases
1. Reporting (Nightly)
• Well-defined columns
2. Ad-hoc Analysis
(throughout Day)
• Fast reads, any
column
3. Data Processing / ML
training
• Large scans & writes
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Ad-hoc
Increasingly, ad-hoc
exploratory queries are also
being moved to the data
lake to keep costs down!
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Kafka
HDFS
Ingest
Ad-hoc
What about App
engagement metric & other
business metric events?
• The web apps log business
events to Kafka
• A Kafka consumer ingests
these events into HDFS
where they can be
aggregated & possibly
also used in ML features
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
We live in a connected world.
• We can infer a lot from
what goes on around us in
our connected
neighborhood.
• Graph Processing
• Graph DBs
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Kafka
HDFS
Ingest
Ad-hoc
Graph
Processing
Graph
DBs
Graph
Ingest
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Kafka
HDFS
Ingest
Ad-hoc
Graph
Processing
Graph
DBs
Graph
Ingest
Cache
And who can forget about
caches?
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Kafka
HDFS
Ingest
Ad-hoc
Graph
Processing
Graph
DBs
Graph
Ingest
Cache
And RT OLAP engines like
Apache Druid or LinkedIn’s
Pinot!
A specialty data system
optimized for
time-oriented roll-ups
RT OLAP
Building a Modern Day Web Site
@r39132
DB
CName
Load Balancer Load Balancer Load Balancer
Search
CDCIndexing
CName
Ad-hoc
DB
ETL/ELT
Reporting
Media
Store
DP/ML
HDFS HDFS HDFSScheduler
Kafka
HDFS Ingest
Ad-hoc
Graph
Processing
Graph
DBs
Graph
Ingest
Cache RT OLAP
Modern Data
Infrastructure
Building a Modern Day Web Site
@r39132
Data Infrastructure Domain / Specialty Data Systems: Examples
Online Serving
• OLTP DBs (NoSQL, NewSQL, RDBMS): MySQL, Postgres, FoundationDB
• Caches: Redis, Memcached
• Search Engines: Elasticsearch, SOLR
• Graph Engines: JanusGraph, AWS Neptune, TigerGraph
• Media Stores (Object, Filers): AWS S3, LinkedIn Ambry
• RT OLAP engines: LinkedIn Pinot, Apache Druid
Offline Analytics
• OLAP (MPP) DBs: Teradata, AWS Redshift, Big Query
• Graph Processing: GraphX
• Large Scale Data Processing: Pig, Spark, M/R
• SQL-on-Hadoop: Presto, Impala, KSQL
• Stream Processing: Spark, Flink, Beam, Storm
• ML Platforms: MLFlow, Kubeflow
• BI tools (Reporting): Tableau, Microstrategy
Data Movement
• Streams: Kafka
• Workflow Schedulers: Apache Airflow, UC4, Control-M
• Ingesters (Graph, Search, Hadoop, ETL/ELT): Sqoop, LinkedIn Gobblin, Informatica
Building a Modern Day Web Site
@r39132
Key Take-aways
• Common pitfall!
• When your primary OLTP data store is struggling under load, your
first reaction may be to
• Scale it out! Or
• Replace it with a hot new technology
@r39132
Key Take-aways
• Better approach
• Analyze the workloads & potentially
• Move different workloads to different systems
• Hire specialty talent to manage those systems
• Separate those systems by well-defined interfaces & protocols
@r39132
Key Take-aways
This is Microservices & Conway’s law applied to Data Engineering
@r39132
PayPal Data Architecture
An Overview
@r39132
PayPal’s (Core) Architecture (Simplified)
PHXSLC
2 Customer-Serving Data Centers today,
more on the way
@r39132
PayPal’s (Core) Architecture (Simplified)
PHXSLC
CName
Mobile & Web App traffic that hits paypal.com is Akamai-
routed to one of these 2 Data Centers
@r39132
PayPal’s (Core) Architecture (Simplified)
PHXSLC
CName
Load Balancer Load Balancer
Within a Data Center, we have multiple
Availability Zones.
A routing layer within the Data Center
will route to one of the Availability Zones
Each AZ is composed of many
microservices as well as other services,
such as Kafka clusters, etc…
@r39132
PayPal’s (Core) Architecture (Simplified)
PHXSLC
CName
Load Balancer Load Balancer
Within a Data Center, we have multiple
Availability Zones.
A routing layer within the Data Center
will route to one of the Availability Zones
Each AZ is composed of many
microservices as well as other services,
such as Kafka clusters, etc…
OCC OCC
DB (RO)DB
DB requests are made to a single
“Horizontal” AZ that contains all of the
Core DBs (Oracle RACs)
OCC = Oracle Connection Cache
GG
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
PP has one Analytics Data Center in Las
Vegas!
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
We have 2 major data store types in our
Analytics Data Center:
• Teradata
• Hadoop
Hadoop
Teradata
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
While Reporting is primarily from
Teradata, the other use cases can hit
either store
Hadoop
Teradata
ReportingDP/MLAd-hoc
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
Custom pipelines feed both Teradata &
Hadoop from our Site DBs
Hadoop
Teradata
ReportingDP/MLAd-hoc
DB
(Pump)
OIS
CDH-R
Informatica
(ETL/ELT)
Core Data HighwayGG
GG
GG
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
Hadoop
Teradata
ReportingDP/MLAd-hoc
DB
(Pump)
OIS
CDH-R
Core Data HighwayGG
GG
GG
We have 3 schedulers today for Batch Job
execution
Scheduler
Informatica
(ETL/ELT)
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
Our home-grown Steam Donkey
transfers data between Teradata &
Hadoop
Hadoop
Teradata
ReportingDP/MLAd-hoc
DB
(Pump)
OIS
CDH-R
Core Data HighwayGG
Steam
DonkeyGG
GG
Scheduler
Informatica
(ETL/ELT)
@r39132
PHX
PayPal’s (Core) Architecture (Simplified)
CName
Load Balancer Load Balancer
=
OCC OCC
DB
SLC LVS
The remainder of this talk will focus on
the highlighted components:
• Fast Data (CDH)
• Big Data (Hadoop & More)
Hadoop
Teradata
ReportingDP/MLAd-hoc
DB
(Pump)
OIS
CDH-R
Core Data HighwayGG
GG
GG
Scheduler
Steam
Donkey
Informatica
(ETL/ELT)
@r39132
Fast Data in Action
Let’s look at a use-case
@r39132
Fast Data in Action
Say I want to send my
wife money!
@r39132
Fast Data in Action
After specifying an
amount & a message, I
hit Send
@r39132
Fast Data in Action
I see a confirmation
page
@r39132
Fast Data in Action
And I see the transfer
in my activity feed!
@r39132
Fast Data in Action
AsynchronousSynchronous
@r39132
Fast Data in Action
AsynchronousSynchronous
DB DBSynchronization
@r39132
Fast Data in Action
DB
SLC
Once the customer sees the confirmation screen, she can rest
assured that a commit has completed to the TXN database!
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
SLC
• Oracle Golden Gate
reads the Redo log into
its proprietary trail file
format & streams it to
the CDH Replicat
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Avro
Schema
Registry
register
SLC
• The Replicat reads the
trail file, record by record
• Extracts the db schema
of each row, converts it
into an Avro schema, and
registers that with the
Avro Schema Registry
(ASR)
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Avro
Schema
Registry
SLC
• Composes an Avro
message
• Sends the message to
Kafka
Kafka
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
Avro
Schema
Registry
get
SLC
• A Storm Router gets
the Writer’s schema id
from the message
header
• Contacts the ASR to
download the schema
by id, if not in a local
cache
• Decodes the datum
using the Writer’s
schema
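The router's decode path described on this slide can be sketched roughly as follows. This is a minimal Python illustration, not PayPal's Storm code: `FakeSchemaRegistry`, `SchemaCache`, and the message layout (a dict with a `schema_id` header, with JSON standing in for Avro binary) are all assumptions made for the example.

```python
import json


class FakeSchemaRegistry:
    """Hypothetical stand-in for the Avro Schema Registry (ASR)."""

    def __init__(self, schemas):
        self._schemas = schemas
        self.lookups = 0  # counts remote calls so caching is observable

    def get(self, schema_id):
        self.lookups += 1
        return self._schemas[schema_id]


class SchemaCache:
    """Local cache: contact the registry only on a miss."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def schema_for(self, schema_id):
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get(schema_id)
        return self._cache[schema_id]


def decode(message, cache):
    """Decode a message using the writer's schema named in its header."""
    schema = cache.schema_for(message["headers"]["schema_id"])
    datum = json.loads(message["payload"])  # JSON stands in for Avro binary
    # Keep only the fields the writer's schema declares.
    return {name: datum[name] for name in schema["fields"]}


registry = FakeSchemaRegistry({7: {"name": "txn", "fields": ["id", "amount"]}})
cache = SchemaCache(registry)
msg = {
    "headers": {"schema_id": 7},
    "payload": json.dumps({"id": 1, "amount": "9.99"}),
}

first = decode(msg, cache)
second = decode(msg, cache)  # served from the local cache
```

Carrying only a schema id in the header keeps each message small, and the cache means the registry is contacted once per schema version rather than once per message.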
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
Avro
Schema
Registry
SLC
• Hydrates the message
from Oracle to get all
columns (not just CDC
columns)!
Read full record by PK
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
Avro
Schema
Registry
SLC
• Generates N output
messages, one per
destination, masking
sensitive columns by
destination
• Sends N messages
Read full record by PK
Kafka
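The fan-out step might look something like the minimal Python sketch below. The destination names, the masked columns, and the `***` mask value are hypothetical; the talk does not say how masking rules are configured.

```python
# Hypothetical per-destination masking rules; the names are illustrative only.
MASKED_COLUMNS = {
    "activity": {"card_number"},
    "analytics": {"card_number", "email"},
}


def fan_out(record, masked_columns=MASKED_COLUMNS, mask="***"):
    """Return one output message per destination, masking sensitive columns."""
    outputs = {}
    for destination, sensitive in masked_columns.items():
        outputs[destination] = {
            column: (mask if column in sensitive else value)
            for column, value in record.items()
        }
    return outputs


record = {"txn_id": "42", "email": "a@example.com", "card_number": "4111"}
messages = fan_out(record)
```

Each destination gets its own copy, so a consumer never receives a column it is not cleared to see, and the hydrated source record is left untouched.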
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
SLC
Kafka Kafka
Avro
Schema
Registry get
• The Activity Services
consumer app follows
the same steps
previously mentioned
to decode the Avro
message
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
SLC
Kafka Kafka
Avro
Schema
Registry
• It does some
transformation to the
data before storing it in
its own DB
DB
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
SLC
Kafka Kafka
Avro
Schema
Registry
• When you visit the
Activity mobile or web
app, your data is
retrieved from the
Activity Services DB!
DB
@r39132
Fast Data in Action
DB DB
(Pump)
CDH
Replicat
GG GG
Router
SLC
Kafka
Avro
Schema
Registry
Activity Streams -- by the
Numbers!
• Scale: hundreds of millions of
events / day
• Latency (99%ile): < 60s
• Correctness: 100%
DB
Kafka
@r39132
Take-Aways: Change Data Capture
@r39132
Why Change Data Capture?
DB DBSynchronization
Many Ways to Sync two-or-more databases:
• XA Transactions
• Event Sourcing
• Change Data Capture
@r39132
XA Transactions (a.k.a. 2-Phase Commits)
DB DB
Problem:
• Giving up Availability for consistency (CAP Theorem)
@r39132
Event Sourcing
DB DB
Problem:
• Giving up Read-Your-Write Consistency
W W W W W W W WKafka
@r39132
Change Data Capture
DB DB
Solution:
• Guaranteed eventual consistency with low-latency
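As a toy illustration of the idea, the Python sketch below has a source store that appends every commit to an ordered change log, and a replica that converges by replaying that log from its checkpoint. `SourceDB` and `Replica` are invented for the example and model none of GoldenGate's actual behavior.

```python
class SourceDB:
    """Toy source database that appends every committed write to a change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # ordered change log, standing in for a redo log

    def commit(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))


class Replica:
    """Toy downstream store that replays the change log from a checkpoint."""

    def __init__(self):
        self.rows = {}
        self.checkpoint = 0  # index of the next log entry to apply

    def catch_up(self, log):
        for key, value in log[self.checkpoint:]:
            self.rows[key] = value
        self.checkpoint = len(log)


source, replica = SourceDB(), Replica()
source.commit("txn-1", "pending")
source.commit("txn-1", "complete")
lagging = dict(replica.rows)  # replica has not consumed the log yet
replica.catch_up(source.log)
```

The writer commits to one database only, so it stays available and reads its own writes there; the replica lags briefly and then matches the source once it has replayed the log.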
@r39132
Take-Aways: Apache Avro
@r39132
Why is Avro Needed?
DB
@r39132
Why is Avro Needed?
DB
The Data Contract between Reader & Writer is
enforced by the DB via a table Schema
@r39132
Why is Avro Needed?
DB
Kafka
@r39132
Why is Avro Needed?
Avro
• Is an efficient self-describing
(schema’d) data serialization format
• Supports Schema evolution
• Has good support in most languages
• Is widely accepted in the Big & Fast
data space
• Is used for data interchange across
both streams and files (HDFS)
Kafka
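To make schema evolution concrete, here is a small pure-Python sketch that mimics two of Avro's record resolution rules: fields only the writer knows are ignored, and fields only the reader knows take the reader's default. It does not use the Avro library, and the schemas and field names are hypothetical.

```python
def resolve(datum, writer_fields, reader_fields):
    """Project a record written with one schema onto a reader's schema.

    Mimics two Avro resolution rules for records: writer-only fields are
    ignored, and reader-only fields take the reader's default value.
    """
    resolved = {}
    for name, spec in reader_fields.items():
        if name in writer_fields:
            resolved[name] = datum[name]
        elif "default" in spec:
            resolved[name] = spec["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Version 1 of a hypothetical schema, used by the writer.
writer_v1 = {"id": {"type": "string"}, "amount": {"type": "string"}}
# Version 2 adds a field with a default.
reader_v2 = {
    "id": {"type": "string"},
    "amount": {"type": "string"},
    "currency": {"type": "string", "default": "USD"},
}

old_record = {"id": "42", "amount": "9.99"}
upgraded = resolve(old_record, writer_v1, reader_v2)
```

Because both directions resolve cleanly, producers and consumers on a stream can upgrade independently, which is the property a table schema gives you inside a single database.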
@r39132
Fast Data Architecture
The Control Plane
@r39132
Fast Data Architecture
SLC
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka Kafka
ASR
register
get
Read full record by PK
CDH Data
Plane
@r39132
Fast Data Architecture
Some More Requirements
• We have ~60K tables in our Oracle databases
• We can’t just turn on 60K streams as it would be wasteful, especially if no one
needs to consume them!
• We have 4500+ engineers in PayPal & 6 engineers on the CDH dev team
• How do we enable anyone in the company to launch any stream?
• If we did eventually have 60K+ streams, how would we manage them?
@r39132
SLC
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka Kafka
ASR
register
get
Read full record by PK
CDH Self-
Service
Control Plane
CDH Data
Plane
Metadata
DB
Fast Data Architecture
@r39132
Fast Data Architecture
SLC
DB DB
(Pump)
GG GG
ASR
CDH Data
Plane
Metadata
DB
CDH Control
Plane
• PP Engineer visits the
CDH self-service portal (a
ReactJS app) to provision
a data pipeline
• He or she submits a
request for a new pipeline
• The provision request is
recorded in the metadata
db
@r39132
SLC
DB DB
(Pump)
GG GG
ASR
CDH Data
Plane
• A periodic Airflow job
kicks off to call an API on
the Squbs server to
execute long-running
pipeline provisioning tasks
Fast Data Architecture
Metadata
DB
CDH Control
Plane
@r39132
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka Kafka
ASR
register
get
Read full record by PK
CDH Data
Plane
• This task creates new
Kafka topics, GG Replicat
processes, and Storm
topologies!
• Within a minute a new
pipeline is flowing!
Fast Data Architecture
@r39132
Design Principles
1. System built from OSS components & runs on containers (HA)!
2. Separation of Concerns:
• Intent Capture vs Orchestration
3. Orchestration is the brains of the control plane!
• DP Self-healing
• DP Auto-scaling
• Fault-tolerant actions
• Maintenance-aware
Fast Data Architecture
Metadata
DB
CDH Control
Plane
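The split between intent capture and orchestration can be sketched as below. This is a simplified Python illustration using an in-memory SQLite database in place of the metadata DB; the table layout, status values, and function names are assumptions, and the real system uses a portal, Airflow, and a Squbs server rather than two local functions.

```python
import sqlite3

# In-memory database standing in for the control plane's metadata DB.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pipeline_requests ("
    " id INTEGER PRIMARY KEY, source_table TEXT, status TEXT)"
)


def capture_intent(source_table):
    """Portal side: record the request and return immediately."""
    cur = db.execute(
        "INSERT INTO pipeline_requests (source_table, status)"
        " VALUES (?, 'REQUESTED')",
        (source_table,),
    )
    db.commit()
    return cur.lastrowid


def orchestrate(provision):
    """Periodic job side: provision every request that is still pending."""
    pending = db.execute(
        "SELECT id, source_table FROM pipeline_requests"
        " WHERE status = 'REQUESTED'"
    ).fetchall()
    for request_id, source_table in pending:
        provision(source_table)  # e.g. create topics, replicats, topologies
        db.execute(
            "UPDATE pipeline_requests SET status = 'ACTIVE' WHERE id = ?",
            (request_id,),
        )
    db.commit()
    return len(pending)


provisioned = []
capture_intent("PAYMENTS.TXN")  # hypothetical table name
first_run = orchestrate(provisioned.append)
second_run = orchestrate(provisioned.append)  # nothing left to do
```

Recording intent separately means the portal never blocks on long-running provisioning, and the orchestrator can retry or resume from the recorded state.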
@r39132
Fast Data Requirements
Data Plane Requirements
@r39132
Fast Data Requirements
• Correctness – 0% data loss/corruption
• Latency – 99%ile < 1 minute (rain or shine)
• Availability – Always Available
@r39132
Fast Data Requirements
• Correctness – 0% data loss/corruption
• Latency – 99%ile < 1 minute (rain or shine)
• Availability – Always Available
@r39132
Fast Data Requirements
• Correctness – 0% data loss/corruption
• Causes of data loss/corruption are typically
• Deployments of Buggy Code
• Data corner-cases – latent bugs not related to recent code changes but to data outliers
• Latency – 99%ile < 1 minute (rain or shine)
• Definition of Latency SLA Misses
• Data is arriving, but it is delayed
• Causes of latency SLA Misses
• Scalability bottlenecks
• Performance bottlenecks
• Availability – Always Available
• Definition of Availability SLA Misses
• No data is arriving
• Causes of availability loss are typically
• Deployments of Buggy Code
• SPOF outages
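The two SLA definitions above can be expressed as a small check over a monitoring window. The Python sketch below uses a nearest-rank percentile and invented sample latencies; the function names and the window contents are assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify(latencies_s, sla_s=60.0):
    """Classify a monitoring window against the SLA definitions above."""
    if not latencies_s:
        return "availability miss"  # no data is arriving
    if percentile(latencies_s, 99) >= sla_s:
        return "latency miss"  # data is arriving, but it is delayed
    return "ok"


# Hypothetical end-to-end latencies in seconds for 100 events each.
healthy = [2.0] * 99 + [45.0]
delayed = [2.0] * 98 + [75.0, 90.0]
```

With these samples, `classify(healthy)` reports the window as fine because a single slow event does not move the 99th percentile, while `classify(delayed)` reports a latency miss and an empty window reports an availability miss.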
@r39132
Fast Data Challenges
Solutions!
@r39132
Fast Data Challenges
1. Performance Bottlenecks
2. Data Corner Cases
3. Deployments of Buggy Code
@r39132
Performance Bottlenecks
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka Kafka
ASR
register
get
Read full record by PK
CDH Data
Plane
• Hydration Queries
• The biggest bottleneck
is the hydration query
back to the source DB
for updated rows
• Hydration queries can
take 20-40 ms vs 500
microseconds
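A back-of-the-envelope calculation shows why this matters. The sketch assumes a single worker that processes records strictly one at a time, and it treats the 500-microsecond figure as the per-record cost without hydration; both are assumptions for illustration, not details stated in the talk.

```python
def max_records_per_second(per_record_seconds):
    """Upper bound on throughput for one strictly sequential worker."""
    return 1.0 / per_record_seconds


with_hydration_fast = max_records_per_second(0.020)  # 20 ms per record
with_hydration_slow = max_records_per_second(0.040)  # 40 ms per record
without_hydration = max_records_per_second(0.0005)   # 500 microseconds

# How many times more records a worker could handle without hydration.
speedup_low = without_hydration / with_hydration_fast
speedup_high = without_hydration / with_hydration_slow
```

Under those assumptions a worker tops out at roughly 25 to 50 records per second with hydration versus about 2,000 without, a gap of about 40x to 80x.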
@r39132
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka Kafka
ASR
register
get
CDH Data
Plane
• Hydration Queries
• Solution : Oracle GG
Full-Supplemental
Logging! No more
hydration!
Performance Bottlenecks
@r39132
Fast Data Challenges
1. Performance Bottlenecks
2. Data Corner Cases
3. Deployments of Buggy Code
@r39132
Data Corner Cases
• Considerations
1. A latent bug can be triggered when it encounters unexpected data!
• Approach
• We do 0 type conversions!
• Oracle Golden Gate provides everything as String
• Due to the Number type, which does not map to any numeric type in
Avro or any programming language, we had to abandon end-to-end type
safety
• The upside is that we don’t run into type-conversion issues
related to data corner cases!
• We don’t replicate LOB fields
• Currently, we have no transformation logic in our pipelines!
@r39132
Fast Data Challenges
1. Performance Bottlenecks
2. Data Corner Cases
3. Deployments of Buggy Code
@r39132
Code Deployments
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
ASR
register
get
CDH Data
Plane
1. Set maintenance
mode (pausing all
orchestration actions)
@r39132
Code Deployments
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
ASR
register
get
CDH Data
Plane
2. Stop Storm topology
3. Backup checkpoint
4. Deploy new code
5. Start Storm topology
6. Monitor for errors
@r39132
Code Deployments
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
ASR
register
get
CDH Data
Plane
6. If errors detected,
- a. stop topology
- b. rollback checkpoint
- c. rollback code version
- d. restart topology
@r39132
Code Deployments
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka
ASR
register
get
CDH Data
Plane
7. Set maintenance
mode off (unpausing all
orchestration actions)
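The deployment steps on these slides can be read as one procedure with a rollback branch. The Python sketch below models it with a plain dict standing in for pipeline state and a caller-supplied health check; both are hypothetical, and none of this is PayPal's actual tooling.

```python
def deploy(pipeline, new_version, healthy):
    """Run the deployment steps; roll back if monitoring reports errors.

    `pipeline` is a plain dict standing in for real control-plane state, and
    `healthy` is a caller-supplied check; both are hypothetical.
    """
    pipeline["maintenance"] = True            # 1. pause orchestration actions
    try:
        pipeline["running"] = False           # 2. stop topology
        backup = pipeline["checkpoint"]       # 3. back up checkpoint
        previous = pipeline["version"]
        pipeline["version"] = new_version     # 4. deploy new code
        pipeline["running"] = True            # 5. start topology
        if not healthy(pipeline):             # 6. monitor for errors
            pipeline["running"] = False       # 6a. stop topology
            pipeline["checkpoint"] = backup   # 6b. roll back checkpoint
            pipeline["version"] = previous    # 6c. roll back code version
            pipeline["running"] = True        # 6d. restart topology
            return False
        return True
    finally:
        pipeline["maintenance"] = False       # 7. resume orchestration


def make_pipeline():
    return {"version": "1.0", "checkpoint": 100, "running": True, "maintenance": False}


def bad_release(pipeline):
    # Simulate a buggy release that advances the checkpoint, then fails.
    pipeline["checkpoint"] += 5
    return False


good = make_pipeline()
good_ok = deploy(good, "1.1", healthy=lambda p: True)

bad = make_pipeline()
bad_ok = deploy(bad, "1.1", healthy=bad_release)
```

Restoring the checkpoint along with the code version means events consumed by the buggy release are replayed by the previous version, and the `finally` clause ensures maintenance mode is always switched off again.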
@r39132
Fast Data Stats!
SLC
Metadata
DB
CDH Control
Plane
DB DB
(Pump)
CDH
Replicat
GG GG
Router
Kafka Kafka
ASR
register
get
CDH Data
Plane
Data Plane
• GA’d in August 2018
• 2.2 TB streamed / day
• 300+ Pipelines activated
through our self-service
portal!
@r39132
Closing Thoughts
• Favor microservice approaches to building data architectures
• When possible (almost always), favor OSS data projects over proprietary ones
• In stream processing, #NO_OPS is the only way to meet SLAs
• Check out our OSS Data Projects on http://paypal.github.io/
@r39132
Acknowledgments
• Akara Sucharitakul
• Anil Gursel
• Doron Mimon
• Na Yang
• Maulin Vasavada
• Kevin Lu
• Prasanna Krishna
• Sri Shivananda
• Kamlakar Singh
• Nagendra Rai
• Swroop Singh
• Anoj Rawat
• Rahul Srivastava
• Naitra Muralykrishnan
• Prabhu Kasinathan
• Vincent Chen
• Anisha Nainani
• Pramod Garre
• Harsh Bhimani
• Nirmalya Ghosh
• Yash Shah
• Aastha Sinha
• Deepak Mohanakumar
Chandramouli
• Romit Mehta
• Dheeraj Rampally
• Stalin Subbiah
• Ashwin Nellore
• Lohit Giri
• Plamen Jeliazhov
• Sehmuz Bayhan
And Many More…
@r39132
Questions?
@r39132
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
DataWorks Summit
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
SingleStore
 
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
DataStax Academy
 
ScyllaDB Virtual Workshop: Getting Started with ScyllaDB 2024
ScyllaDB Virtual Workshop: Getting Started with ScyllaDB 2024ScyllaDB Virtual Workshop: Getting Started with ScyllaDB 2024
ScyllaDB Virtual Workshop: Getting Started with ScyllaDB 2024
ScyllaDB
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark Summit
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
SoftServe
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Databricks
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
Mark Rittman
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Ad


Big Data, Fast Data @ PayPal (YOW 2018)

  • 1. @r39132 Big Data, Fast Data @ PayPal Sid Anand (@r39132) YOW! Conferences (Sydney, Brisbane, Melbourne) Nov-Dec 2018 A Data Infrastructure Story
  • 2. @r39132 About Me Worked @ Committer & PPMC on Father of 2 Co-Chair @ Work @
  • 4. @r39132 @Scale: Last Year 200+ 100+ Markets Currencies 227M Active Customer Accounts 7.8B Payments Transactions 2,700 Applications 4,500 Engineers 17,000 Releases 200,000 Servers 27 Megawatts Power 238 Petabytes Storage Full year 2017 numbers PayPal by the Numbers!
  • 5. @r39132 Putting our data scale in perspective … PayPal by the Numbers! DVDs 7x height of Mt Everest x 500,000 x 2,000,000
  • 6. @r39132 And we continue to see growth in all areas… PayPal by the Numbers!
  • 7. @r39132 And to keep up with this growth, we’ve had to scale our data infrastructure PayPal by the Numbers! 2,000 + Database Instances ~116 Billion Calls/day ~74 PB Total Storage OLTP DBs Kafka Messaging Hadoop Analytics
  • 8. @r39132 And to keep up with this growth, we’ve had to scale our data infrastructure PayPal by the Numbers! 2,000 + Database Instances ~116 Billion Calls/day ~74 PB Total Storage OLTP DBs Kafka Messaging 400+ Billion Messages/day ~7 PB Total Storage 50 + Clusters 3K + Topics Hadoop Analytics
  • 9. @r39132 And to keep up with this growth, we’ve had to scale our data infrastructure PayPal by the Numbers! 2,000 + Database Instances ~116 Billion Calls/day ~74 PB Total Storage OLTP DBs Kafka Messaging 400+ Billion Messages/day ~7 PB Total Storage 50 + Clusters 3K + Topics 200,000 + Jobs/day 32 Hadoop Clusters 250+ PB Storage Hadoop Analytics
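As a rough back-of-the-envelope check on the per-day figures in slides 7 to 9, the sketch below converts them into average per-second rates. It assumes load is spread evenly over an 86,400-second day, which real traffic is not, and the dictionary keys are labels invented here rather than names from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Per-day figures quoted on the slides (approximate, as stated there).
per_day = {
    "oltp_db_calls": 116e9,
    "kafka_messages": 400e9,
    "hadoop_jobs": 200_000,
}

def average_per_second(daily_total: float) -> float:
    """Average rate assuming perfectly uniform load over one day."""
    return daily_total / SECONDS_PER_DAY

rates = {name: average_per_second(total) for name, total in per_day.items()}

for name, rate in rates.items():
    print(f"{name}: ~{rate:,.1f} per second on average")
```

Under that uniform-load assumption, the OLTP tier averages a little over a million calls per second and Kafka several million messages per second; peak rates would be higher.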
  • 10. @r39132 Interlude … Why we love Ozzies! • Oz has ~25MM people • Ozzies Eligible for PayPal: ~19MM people • Ozzies with Active Accounts: ~7MM • @ 37%, it’s PayPal’s most penetrated market!! • PayPal
  • 11. @r39132 Setting the Context To understand PayPal’s Data Infrastructure today, scale is only half the story! Its Data Infrastructure has evolved based on the creation of new technologies as well as changing requirements. PayPal is a 20-year-old company!
  • 12. @r39132 Building A Modern Website A Data Infrastructure Evolution Story
  • 13. @r39132 Building a Modern Day Web Site DB CName
  • 14. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Building a Modern Day Web Site
  • 15. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Search Building a Modern Day Web Site
  • 16. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Search CDCIndexing Building a Modern Day Web Site
  • 17. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Search CDCIndexing Media Store CName Building a Modern Day Web Site
  • 18. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Media Store Ad-hoc ReportingDP/ML Analytics Use-cases 1. Reporting (Nightly) • Well-defined columns 2. Ad-hoc Analysis (throughout Day) • Fast reads, any column 3. Data Processing / ML training • Large scans & writes Building a Modern Day Web Site
  • 19. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Impedance Mismatch • Serving needs • Fast reads & writes • Well-defined workloads • Simple queries • Analytic (Ad-hoc) needs • Fast reads • Unknown workloads • Complex (exploratory) queries Media Store Ad-hoc ReportingDP/ML Building a Modern Day Web Site
  • 20. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Impedance Mismatch • Serving needs • Fast reads & writes • Well-defined workloads • Simple queries • OLTP DBs • Analytic (Ad-hoc) needs • Fast reads • Unknown workloads • Complex (exploratory) queries • OLAP DBs Media Store Ad-hoc ReportingDP/ML Building a Modern Day Web Site
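The impedance mismatch in slides 19 and 20 can be illustrated with two query shapes against the same data. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for both kinds of store; the `payments` table and its rows are made up for illustration and are not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, account TEXT, country TEXT, amount REAL)"
)
rows = [
    (1, "alice", "AU", 20.0),
    (2, "bob", "US", 35.5),
    (3, "alice", "AU", 12.5),
    (4, "carol", "AU", 80.0),
]
conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", rows)

# Serving-style query: known access path, touches one row by primary key.
serving_row = conn.execute(
    "SELECT account, amount FROM payments WHERE id = ?", (3,)
).fetchone()

# Analytic-style query: scans the table and aggregates over an arbitrary column.
analytic_rows = conn.execute(
    "SELECT country, COUNT(*), SUM(amount) FROM payments GROUP BY country ORDER BY country"
).fetchall()

print(serving_row)
print(analytic_rows)
```

The first query is what an OLTP store is tuned for. The second reads every row, which is why exploratory workloads of that shape tend to be moved to an OLAP store.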
  • 21. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT ReportingDP/ML Analytics Use-cases 1. Reporting (Nightly) • Well-defined columns 2. Ad-hoc Analysis (throughout Day) • Fast reads, any column 3. Data Processing / ML training • Large scans & writes Media Store Building a Modern Day Web Site
  • 22. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT ReportingDP/ML Scheduler A workflow scheduler needs to coordinate the nightly/hourly loads! Media Store Scheduler Building a Modern Day Web Site
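At its core, the coordination a workflow scheduler performs (slide 22) is running jobs in dependency order. The sketch below shows only that ordering step, using the standard-library `graphlib` module (Python 3.9+). The job names are hypothetical, and a real scheduler such as Apache Airflow adds timing, retries and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly jobs: each key runs only after the jobs in its set.
dependencies = {
    "extract_site_db": set(),
    "load_adhoc_db": {"extract_site_db"},
    "build_reports": {"load_adhoc_db"},
    "train_ml_model": {"load_adhoc_db"},
}

def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)
```

The two downstream jobs have no dependency on each other, so either may come last; only the constraint that each follows `load_adhoc_db` is guaranteed.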
  • 23. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting Analytics Use-cases 1. Reporting (Nightly) • Well-defined columns 2. Ad-hoc Analysis (throughout Day) • Fast reads, any column 3. Data Processing / ML training • Large scans & writes Media Store DP/ML HDFS HDFS HDFSScheduler Building a Modern Day Web Site
  • 24. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFSScheduler Ad-hoc Increasingly, ad-hoc exploratory queries are also being moved to the data lake to keep costs down! Building a Modern Day Web Site
  • 25. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFSScheduler Kafka HDFS Ingest Ad-hoc What about app engagement metrics & other business metric events? • The web apps log business events to Kafka • A Kafka consumer ingests these events into HDFS, where they can be aggregated & possibly also used in ML features Building a Modern Day Web Site
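One way to picture the ingest step in slide 25 is a consumer that groups events into time-partitioned files. This is only a sketch under loud assumptions: a plain Python list stands in for the Kafka topic, a temporary local directory stands in for HDFS, and the `dt=/hr=` partition layout, the event fields and the function names are all invented here rather than taken from PayPal's pipeline.

```python
import json
import tempfile
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

def partition_for(event: dict) -> str:
    """Map an event to an hourly partition such as dt=2018-11-30/hr=09."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return f"dt={ts:%Y-%m-%d}/hr={ts:%H}"

def ingest(events, root: Path) -> dict:
    """Group events by hour and append them as JSON lines under root."""
    batches = defaultdict(list)
    for event in events:  # stand-in for polling a Kafka consumer
        batches[partition_for(event)].append(event)
    written = {}
    for partition, batch in batches.items():
        target = root / partition
        target.mkdir(parents=True, exist_ok=True)
        with open(target / "events.jsonl", "a", encoding="utf-8") as out:
            for event in batch:
                out.write(json.dumps(event) + "\n")
        written[partition] = len(batch)
    return written

# Stand-ins: a list plays the Kafka topic, a temp directory plays HDFS.
sample_events = [
    {"ts": 1543568400, "type": "page_view"},  # 2018-11-30 09:00:00 UTC
    {"ts": 1543568460, "type": "checkout"},   # 2018-11-30 09:01:00 UTC
    {"ts": 1543572000, "type": "page_view"},  # 2018-11-30 10:00:00 UTC
]
with tempfile.TemporaryDirectory() as tmp:
    counts = ingest(sample_events, Path(tmp))
print(counts)
```

Partitioning by event time keeps later aggregation jobs cheap, since an hourly or nightly job only needs to read the directories for its own window.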
  • 26. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting We live in a connected world. • We can infer a lot from what goes on around us in our connected neighborhood. • Graph Processing • Graph DBs Media Store DP/ML HDFS HDFS HDFSScheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Building a Modern Day Web Site
  • 27. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFSScheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Cache And who can forget about caches? Building a Modern Day Web Site
  • 28. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFSScheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Cache And RT OLAP engines like Apache Druid or LinkedIn’s Pinot! A specialty data system optimized for time-oriented roll-ups RT OLAP Building a Modern Day Web Site
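A time-oriented roll-up, the operation slide 28 says RT OLAP engines are optimized for, amounts to pre-aggregating events into fixed time buckets. The sketch below shows the idea in plain Python; it is not how Druid or Pinot implement it, and the function name, event tuples and 60-second granularity are assumptions chosen for illustration.

```python
from collections import Counter

def rollup(events, bucket_seconds=60):
    """Count events per (bucket start, event type) at a fixed time granularity."""
    counts = Counter()
    for ts, event_type in events:
        bucket_start = ts - (ts % bucket_seconds)  # truncate to bucket boundary
        counts[(bucket_start, event_type)] += 1
    return dict(counts)

# (epoch seconds, event type) pairs; values are made up for illustration.
events = [(0, "login"), (15, "login"), (59, "payment"), (60, "login"), (125, "payment")]
per_minute = rollup(events)
print(per_minute)
```

Storing the rolled-up counts instead of raw events is what lets queries such as "logins per minute over the last day" return quickly, at the cost of losing detail finer than the bucket size.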
  • 29. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDCIndexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFSScheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Cache RT OLAP Modern Data Infrastructure Building a Modern Day Web Site
  • 30. @r39132 Building a Modern Day Web Site
      Data Infrastructure Domain | Specialty Data Systems | Examples
      Online Serving
        • OLTP DBs (NoSQL, NewSQL, RDBMS): MySQL, Postgres, FoundationDB
        • Caches: Redis, Memcached
        • Search Engines: Elasticsearch, SOLR
        • Graph Engines: JanusGraph, AWS Neptune, TigerGraph
        • Media Stores (Object, Filers): AWS S3, LinkedIn Ambry
        • RT OLAP engines: LinkedIn Pinot, Apache Druid
      Offline Analytics
        • OLAP (MPP) DBs: Teradata, AWS Redshift, Big Query
        • Graph Processing: GraphX
        • Large Scale Data Processing: Pig, Spark, M/R
        • SQL-on-Hadoop: Presto, Impala, KSQL
        • Stream Processing: Spark, Flink, Beam, Storm
        • ML Platforms: MLFlow, Kubeflow
        • BI tools (Reporting): Tableau, Microstrategy
      Data Movement
        • Streams: Kafka
        • Workflow Schedulers: Apache Airflow, UC4, Control-M
        • Ingesters (Graph, Search, Hadoop, ETL/ELT): Sqoop, LinkedIn Gobblin, Informatica
  • 31. @r39132 Key Take-aways • Common pitfall! • When your primary OLTP data store is struggling under load, your first reaction may be to • Scale it out! Or • Replace it with a hot new technology
  • 32. @r39132 Key Take-aways • Better approach • Analyze the workloads & potentially • Move different workloads to different systems • Hire specialty talent to manage those systems • Separate those systems by well-defined interfaces & protocols
  • 33. @r39132 Key Take-aways This is Microservices & Conway’s law applied to Data Engineering
  • 35. @r39132 PayPal’s (Core) Architecture (Simplified): 2 Customer-Serving Data Centers today (PHX, SLC), more on the way
  • 36. @r39132 PayPal’s (Core) Architecture (Simplified): Mobile & Web App traffic that hits paypal.com is Akamai-routed to one of these 2 Data Centers. [Diagram: PHX, SLC, CName]
  • 37. @r39132 PayPal’s (Core) Architecture (Simplified): Within a Data Center, we have multiple Availability Zones. A routing layer within the Data Center will route to one of the Availability Zones. Each AZ is composed of many microservices as well as other services, such as Kafka clusters. [Diagram: PHX, SLC, CName, Load Balancer (x2)]
  • 38. @r39132 PayPal’s (Core) Architecture (Simplified): Within a Data Center, we have multiple Availability Zones. A routing layer within the Data Center will route to one of the Availability Zones. Each AZ is composed of many microservices as well as other services, such as Kafka clusters. DB requests are made to a single “Horizontal” AZ that contains all of the Core DBs (Oracle RACs). OCC = Oracle Connection Cache. [Diagram: PHX, SLC, CName, Load Balancer (x2), OCC (x2), DB, DB (RO), GG]
  • 39. @r39132 PayPal’s (Core) Architecture (Simplified): PP has one Analytics Data Center in Las Vegas! [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB]
  • 40. @r39132 PayPal’s (Core) Architecture (Simplified): We have 2 major data store types in our Analytics Data Center: • Teradata • Hadoop [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB, Hadoop, Teradata]
  • 41. @r39132 PayPal’s (Core) Architecture (Simplified): While Reporting is primarily from Teradata, the other use cases can hit either store. [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB, Hadoop, Teradata, Reporting, DP/ML, Ad-hoc]
  • 42. @r39132 PayPal’s (Core) Architecture (Simplified): Custom pipelines feed both Teradata & Hadoop from our Site DBs. [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB, Hadoop, Teradata, Reporting, DP/ML, Ad-hoc, DB (Pump), OIS, CDH-R, Informatica (ETL/ELT), Core Data Highway, GG (x3)]
  • 43. @r39132 PayPal’s (Core) Architecture (Simplified): We have 3 schedulers today for Batch Job execution. [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB, Hadoop, Teradata, Reporting, DP/ML, Ad-hoc, DB (Pump), OIS, CDH-R, Informatica (ETL/ELT), Core Data Highway, GG (x3), Scheduler]
  • 44. @r39132 PayPal’s (Core) Architecture (Simplified): Our home-grown Steam Donkey transfers data between Teradata & Hadoop. [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB, Hadoop, Teradata, Reporting, DP/ML, Ad-hoc, DB (Pump), OIS, CDH-R, Informatica (ETL/ELT), Core Data Highway, GG (x3), Scheduler, Steam Donkey]
  • 45. @r39132 PayPal’s (Core) Architecture (Simplified): The remainder of this talk will focus on the highlighted components: • Fast Data (CDH) • Big Data (Hadoop & More) [Diagram: PHX, SLC, LVS, CName, Load Balancer (x2), OCC (x2), DB, Hadoop, Teradata, Reporting, DP/ML, Ad-hoc, DB (Pump), OIS, CDH-R, Informatica (ETL/ELT), Core Data Highway, GG (x3), Scheduler, Steam Donkey]
  • 46. @r39132 Fast Data in Action Let’s look at a use-case
  • 47. @r39132 Fast Data in Action Say I want to send my wife money!
  • 48. @r39132 Fast Data in Action After specifying an amount & a message, I hit Send
  • 49. @r39132 Fast Data in Action I see a confirmation page
  • 50. @r39132 Fast Data in Action And I see the transfer in my activity feed!
  • 51. @r39132 Fast Data in Action [Diagram: Asynchronous, Synchronous]
  • 52. @r39132 Fast Data in Action [Diagram: Asynchronous, Synchronous, DB, DB, Synchronization]
  • 53. @r39132 Fast Data in Action: Once the customer sees the confirmation screen, she can rest assured that a commit has completed to the TXN database! [Diagram: SLC, DB]
  • 54. @r39132 Fast Data in Action • Oracle Golden Gate reads the Redo log into its proprietary trail file format & streams it to the CDH Replicat [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat]
  • 55. @r39132 Fast Data in Action • The Replicat reads the trail file, record by record • Extracts the db schema of each row, converts it into an Avro schema, and registers that with the Avro Schema Registry (ASR) [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Avro Schema Registry (register)]
  • 56. @r39132 Fast Data in Action • Composes an Avro message • Sends the message to Kafka [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Avro Schema Registry, Kafka]
  • 57. @r39132 Fast Data in Action • A Storm Router gets the Writer’s schema id from the message header • Contacts the ASR to download the schema by id, if not in a local cache • Decodes the datum using the Writer’s schema [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka, Avro Schema Registry (get)]
  • 58. @r39132 Fast Data in Action • Hydrates the message from Oracle to get all columns (not just CDC columns)! [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka, Avro Schema Registry, Read full record by PK]
  • 59. @r39132 Fast Data in Action • Generates N output messages, one per destination, masking sensitive columns by destination • Sends N messages [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), Avro Schema Registry, Read full record by PK]
  • 60. @r39132 Fast Data in Action • The Activity Services consumer app follows the same steps previously mentioned to decode the Avro message [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), Avro Schema Registry (get)]
  • 61. @r39132 Fast Data in Action • It does some transformation to the data before storing it in its own DB [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), Avro Schema Registry, DB]
  • 62. @r39132 Fast Data in Action • When you visit the Activity mobile or web app, your data is retrieved from the Activity Services DB! [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), Avro Schema Registry, DB]
  • 63. @r39132 Fast Data in Action: Activity Streams -- by the Numbers! • Scale: hundreds of millions of events / day • Latency (99%ile): < 60s • Correctness: 100% [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), Avro Schema Registry, DB]
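The Router steps on slides 57 to 59 (look up the writer's schema by id with a local cache, then fan out one masked message per destination) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PayPal's implementation: `SchemaCache`, `REGISTRY`, `MASK_RULES`, and `fan_out` are hypothetical names, the registry is a plain dictionary, and Avro decoding and Oracle hydration are left out.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for the Avro Schema Registry (ASR): schema id -> schema.
REGISTRY: Dict[int, dict] = {
    7: {"name": "txn", "fields": ["txn_id", "amount", "email", "card_number"]},
}


class SchemaCache:
    """Fetch a writer's schema by id, contacting the registry only on a miss."""

    def __init__(self, fetch_schema: Callable[[int], dict]) -> None:
        self._fetch_schema = fetch_schema
        self._cache: Dict[int, dict] = {}
        self.registry_calls = 0  # counts round trips to the registry

    def get(self, schema_id: int) -> dict:
        if schema_id not in self._cache:
            self.registry_calls += 1
            self._cache[schema_id] = self._fetch_schema(schema_id)
        return self._cache[schema_id]


# Hypothetical per-destination masking rules: destination -> sensitive columns.
MASK_RULES: Dict[str, Set[str]] = {
    "activity": {"card_number"},
    "analytics": {"card_number", "email"},
}


def fan_out(record: dict, rules: Dict[str, Set[str]]) -> List[dict]:
    """Produce one output message per destination, masking sensitive columns."""
    messages = []
    for destination, sensitive in rules.items():
        payload = {
            column: ("***" if column in sensitive else value)
            for column, value in record.items()
        }
        messages.append({"destination": destination, "payload": payload})
    return messages


cache = SchemaCache(REGISTRY.__getitem__)
schema = cache.get(7)
cache.get(7)  # the second lookup is served from the local cache

record = {"txn_id": "T1", "amount": "25.00", "email": "a@example.com", "card_number": "4111"}
outputs = fan_out(record, MASK_RULES)
```

Keeping the masking rules as data rather than code means a new destination only needs a new entry in the rule table, which fits the self-service goal described later in the talk.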
  • 65. @r39132 Why Change Data Capture? Many ways to sync two or more databases: • XA Transactions • Event Sourcing • Change Data Capture [Diagram: DB, DB, Synchronization]
  • 66. @r39132 XA Transactions (a.k.a. 2-Phase Commits) Problem: • Giving up Availability for Consistency (CAP Theorem) [Diagram: DB, DB]
  • 67. @r39132 Event Sourcing Problem: • Giving up Read-Your-Write Consistency [Diagram: DB, DB, W (x8), Kafka]
  • 68. @r39132 Change Data Capture Solution: • Guaranteed eventual consistency with low latency [Diagram: DB, DB]
  • 70. @r39132 Why is Avro Needed? DB
  • 71. @r39132 Why is Avro Needed? DB The Data Contract between Reader & Writer is enforced by the DB via a table Schema
  • 72. @r39132 Why is Avro Needed? DB Kafka
  • 73. @r39132 Why is Avro Needed? Avro: • Is an efficient self-describing (schema’d) data serialization format • Supports Schema evolution • Has good support in most languages • Is widely accepted in the Big & Fast data space • Is used for data interchange across both streams and files (HDFS) [Diagram: Kafka]
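To make "supports Schema evolution" concrete, here is a simplified illustration of the reader/writer schema-resolution idea, written in plain Python rather than with an Avro library. `resolve` is a hypothetical helper; it covers only two rules (fields the reader does not know are dropped, fields the writer did not send take the reader's default), while real Avro resolution also handles type promotion, aliases, and unions.

```python
from typing import Any, Dict

_NO_DEFAULT = object()  # sentinel, because None (null) is a legal default


def resolve(datum: Dict[str, Any], writer_schema: dict, reader_schema: dict) -> Dict[str, Any]:
    """Project a datum written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field known to both sides: take the written value.
            result[name] = datum[name]
        else:
            # Field added on the reader side: fall back to its default.
            default = field.get("default", _NO_DEFAULT)
            if default is _NO_DEFAULT:
                raise ValueError(f"reader field {name!r} has no default")
            result[name] = default
    return result


# Illustrative schemas; the field names are made up for this example.
writer_v1 = {
    "type": "record",
    "name": "Txn",
    "fields": [
        {"name": "txn_id", "type": "string"},
        {"name": "amount", "type": "string"},
        {"name": "memo", "type": "string"},
    ],
}
reader_v2 = {
    "type": "record",
    "name": "Txn",
    "fields": [
        {"name": "txn_id", "type": "string"},
        {"name": "amount", "type": "string"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

resolved = resolve({"txn_id": "T1", "amount": "25.00", "memo": "hi"}, writer_v1, reader_v2)
```

Here the reader never sees `memo` and receives `currency` from its own default, so producers and consumers can upgrade their schemas at different times.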
  • 75. @r39132 Fast Data Architecture [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), ASR (register, get), Read full record by PK, CDH Data Plane]
  • 76. @r39132 Fast Data Architecture: Some More Requirements • We have ~60K tables in our Oracle databases • We can’t just turn on 60K streams, as that would be wasteful, especially if no one needs to consume them! • We have 4500+ engineers in PayPal & 6 engineers on the CDH dev team • How do we enable anyone in the company to launch any stream? • If we did eventually have 60K+ streams, how would we manage them?
  • 77. @r39132 Fast Data Architecture [Diagram: SLC, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), ASR (register, get), Read full record by PK, CDH Self-Service Control Plane, CDH Data Plane, Metadata DB]
  • 78. @r39132 Fast Data Architecture • PP Engineer visits the CDH self-service portal (a ReactJS app) to provision a data pipeline • He or she submits a request for a new pipeline • The provision request is recorded in the metadata db [Diagram: SLC, DB, DB (Pump), GG (x2), ASR, CDH Data Plane, Metadata DB, CDH Control Plane]
  • 79. @r39132 Fast Data Architecture • A periodic Airflow job kicks off to call an API on the Squbs server to execute long-running pipeline provisioning tasks [Diagram: SLC, DB, DB (Pump), GG (x2), ASR, CDH Data Plane, Metadata DB, CDH Control Plane]
  • 80. @r39132 Fast Data Architecture • This task creates new Kafka topics, GG Replicat processes, and Storm topologies! • Within a minute a new pipeline is flowing! [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), ASR (register, get), Read full record by PK, CDH Data Plane]
  • 81. @r39132 Fast Data Architecture: Design Principles 1. System built from OSS components & runs on containers (HA)! 2. Separation of Concerns: • Intent Capture vs Orchestration 3. Orchestration is the brains of the control plane! • DP Self-healing • DP Auto-scaling • Fault-tolerant actions • Maintenance-aware [Diagram: Metadata DB, CDH Control Plane]
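The split between intent capture (the portal records a request in the metadata DB) and orchestration (a periodic job acts on pending requests) might look something like the sketch below. It is an assumption-laden toy, not the CDH control plane: `MetadataDB`, `PipelineRequest`, `provision`, and `orchestrate` are hypothetical names, the store is in memory, and provisioning only returns the names of the resources it would create.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PipelineRequest:
    table: str
    status: str = "REQUESTED"


@dataclass
class MetadataDB:
    """In-memory stand-in for the metadata DB that stores user intent."""

    requests: Dict[str, PipelineRequest] = field(default_factory=dict)

    def record(self, table: str) -> None:
        # Intent capture: idempotent, so a repeated request does not duplicate work.
        self.requests.setdefault(table, PipelineRequest(table))

    def pending(self) -> List[PipelineRequest]:
        return [r for r in self.requests.values() if r.status == "REQUESTED"]


def provision(request: PipelineRequest) -> List[str]:
    """Hypothetical provisioning step: name the resources a pipeline needs."""
    return [
        f"kafka-topic:{request.table}",
        f"gg-replicat:{request.table}",
        f"storm-topology:{request.table}",
    ]


def orchestrate(db: MetadataDB, maintenance_mode: bool = False) -> List[str]:
    """One pass of the periodic job: act on pending intent unless paused."""
    if maintenance_mode:
        return []  # maintenance-aware: take no action while paused
    created: List[str] = []
    for request in db.pending():
        created.extend(provision(request))
        request.status = "ACTIVE"
    return created


db = MetadataDB()
db.record("TXN")
db.record("TXN")  # duplicate request from the portal
first_pass = orchestrate(db)
second_pass = orchestrate(db)
```

Because the orchestrator only reads recorded intent, a pass that is skipped (for example, during maintenance) loses nothing: the next pass picks up the same pending requests.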
  • 83. @r39132 Fast Data Requirements • Correctness – 0% data loss/corruption • Latency – 99%ile < 1 minute (rain or shine) • Availability – Always Available
  • 85. @r39132 Fast Data Requirements
    • Correctness – 0% data loss/corruption
      • Causes of data loss/corruption are typically
        • Deployments of Buggy Code
        • Data corner-cases – latent bugs not related to recent code changes but to data outliers
    • Latency – 99%ile < 1 minute (rain or shine)
      • Definition of Latency SLA Misses
        • Data is arriving, but it is delayed
      • Causes of latency SLA Misses
        • Scalability bottlenecks
        • Performance bottlenecks
    • Availability – Always Available
      • Definition of Availability SLA Misses
        • No data is arriving
      • Causes of availability loss are typically
        • Deployments of Buggy Code
        • SPOF outages
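The latency requirement (99%ile under 1 minute) can be checked against a batch of end-to-end latency samples along these lines. This is only a sketch: the nearest-rank percentile definition and the function names `percentile` and `meets_latency_sla` are assumptions, since the slides do not say how the percentile is measured.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_sla(latencies_s: Sequence[float], threshold_s: float = 60.0) -> bool:
    """True when the 99th-percentile latency is below the threshold."""
    return percentile(latencies_s, 99) < threshold_s


# 100 samples with a single slow outlier: the outlier sits above the 99th percentile.
mostly_fast = [5.0] * 99 + [300.0]
# Two slow samples out of 100 push the 99th percentile over the limit.
too_many_slow = [5.0] * 98 + [300.0] * 2

sla_ok = meets_latency_sla(mostly_fast)
sla_missed = meets_latency_sla(too_many_slow)
```

A percentile target tolerates a few outliers but not a systematic slowdown, which is why the slide lists scalability and performance bottlenecks as the causes of latency SLA misses.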
  • 87. @r39132 Fast Data Challenges 1. Performance Bottlenecks 2. Data Corner Cases 3. Deployments of Buggy Code
  • 88. @r39132 Performance Bottlenecks • Hydration Queries • The biggest bottleneck is the hydration query back to the source DB for updated rows • Hydration queries can take 20-40 ms vs 500 microseconds [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), ASR (register, get), Read full record by PK, CDH Data Plane]
  • 89. @r39132 Performance Bottlenecks • Hydration Queries • Solution: Oracle GG Full-Supplemental Logging! No more hydration! [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), ASR (register, get), CDH Data Plane]
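A quick back-of-the-envelope calculation shows why hydration dominates. It assumes, since the slide does not say what the 500 microseconds refers to, that this is the per-message cost without hydration, and that a single worker handles messages strictly one at a time; `messages_per_second` is a hypothetical helper.

```python
# Figures quoted on the slide, expressed in microseconds.
HYDRATION_US = (20_000, 40_000)  # 20-40 ms per hydration query
BASELINE_US = 500                # assumed: per-message cost without hydration


def messages_per_second(cost_us: int) -> int:
    """Messages one sequential worker can handle per second at this per-message cost."""
    return 1_000_000 // cost_us


baseline_rate = messages_per_second(BASELINE_US)
hydrated_rates = [messages_per_second(cost) for cost in HYDRATION_US]
slowdowns = [cost // BASELINE_US for cost in HYDRATION_US]
```

Under these assumptions a worker drops from 2,000 messages per second to between 25 and 50, a 40x to 80x slowdown, which is the gap that full supplemental logging removes by putting all columns in the change record.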
  • 90. @r39132 Fast Data Challenges 1. Performance Bottlenecks 2. Data Corner Cases 3. Deployments of Buggy Code
  • 91. @r39132 Data Corner Cases • Considerations 1. A latent bug can be triggered when it encounters unexpected data! • Approach • We do 0 type conversions! • Oracle Golden Gate provides everything as String • Due to the Number type, which does not map to any numeric type in Avro or any programming language, we had to abandon end-to-end type safety • The upside is that we don’t run into type-conversion issues related to data corner cases! • We don’t replicate LOB fields • Currently, we have no transformation logic in our pipelines!
  • 92. @r39132 Fast Data Challenges 1. Performance Bottlenecks 2. Data Corner Cases 3. Deployments of Buggy Code
  • 93. @r39132 Code Deployments 1. Set maintenance mode (pausing all orchestration actions) [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka, ASR (register, get), CDH Data Plane]
  • 94. @r39132 Code Deployments 2. Stop Storm topology 3. Backup checkpoint 4. Deploy new code 5. Start Storm topology 6. Monitor for errors [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka, ASR (register, get), CDH Data Plane]
  • 95. @r39132 Code Deployments 6. If errors are detected: a. stop topology b. roll back checkpoint c. roll back code version d. restart topology [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka, ASR (register, get), CDH Data Plane]
  • 96. @r39132 Code Deployments 7. Set maintenance mode off (unpausing all orchestration actions) [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka, ASR (register, get), CDH Data Plane]
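The seven deployment steps above, including the rollback branch of step 6, can be sketched as a single function that returns the ordered list of actions it would take. This is a hypothetical outline, not PayPal's tooling: `deploy`, the action strings, and the `healthy` callback (a stand-in for "monitor for errors") are all assumptions.

```python
from typing import Callable, List


def deploy(new_version: str, current_version: str, healthy: Callable[[str], bool]) -> List[str]:
    """Return the ordered actions for one deployment, rolling back on errors."""
    actions = [
        "maintenance_mode:on",          # 1. pause all orchestration actions
        "stop_topology",                # 2.
        "backup_checkpoint",            # 3.
        f"deploy_code:{new_version}",   # 4.
        "start_topology",               # 5.
        "monitor",                      # 6.
    ]
    if not healthy(new_version):
        # 6a-6d: undo both the code and the checkpoint before restarting.
        actions += [
            "stop_topology",
            "rollback_checkpoint",
            f"rollback_code:{current_version}",
            "restart_topology",
        ]
    actions.append("maintenance_mode:off")  # 7. unpause orchestration
    return actions


good_run = deploy("v2", "v1", healthy=lambda version: True)
bad_run = deploy("v2", "v1", healthy=lambda version: False)
```

Restoring the checkpoint together with the code matters for correctness: anything the faulty version processed is replayed by the previous version from the backed-up position.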
  • 97. @r39132 Fast Data Stats! Data Plane: • GA’d in August 2018 • 2.2 TB streamed / day • 300+ Pipelines activated through our self-service portal! [Diagram: SLC, Metadata DB, CDH Control Plane, DB, DB (Pump), GG (x2), CDH Replicat, Router, Kafka (x2), ASR (register, get), CDH Data Plane]
  • 98. @r39132 Closing Thoughts • Favor microservice approaches to building data architectures • When possible (almost always), favor OSS data projects over proprietary ones • In stream processing, #NO_OPS is the only way to meet SLAs • Check out our OSS Data Projects on http://paypal.github.io/
  • 99. @r39132 Acknowledgments • Akara Sucharitakul • Anil Gursel • Doron Mimon • Na Yang • Maulin Vasavada • Kevin Lu • Prasanna Krishna • Sri Shivananda • Kamlakar Singh • Nagendra Rai • Swroop Singh • Anoj Rawat • Rahul Srivastava • Naitra Muralykrishnan • Prabhu Kasinathan • Vincent Chen • Anisha Nainani • Pramod Garre • Harsh Bhimani • Nirmalya Ghosh • Yash Shah • Aastha Sinha • Deepak Mohanakumar Chandramouli • Romit Mehta • Dheeraj Rampally • Stalin Subbiah • Ashwin Nellore • Lohit Giri • Plamen Jeliazhov • Sehmuz Bayhan And Many More…