How @twitterhadoop chose Google Cloud
Joep Rottinghuis & Lohit VijayaRenu
Twitter Hadoop Team (@twitterhadoop)
1
1. Twitter infrastructure
2. Hadoop evaluation
3. Evaluation outcomes
4. Recommendations and conclusions
5. Q&A
Credit to the presentation at Google Next 2019 by Derek Lyon & Dave Beckett (https://youtu.be/4FLFcWgZdo4)
2
Twitter Infrastructure
3
Twitter’s infrastructure
● Twitter founded in 2006
● Global-scale application
● Unique scale and performance characteristics
● Real-time
● Built to purpose and well optimized
● Large data centers
4
Strategic questions
1. What is the long-term mix of cloud versus datacenter?
2. Which cloud provider(s) should we use?
3. How can we be confident in this type of decision?
4. Why should we evaluate this now (2016)?
5
Tactical questions
1. What is the feasibility and cost of large-scale adoption?
2. Which workloads are best-suited for the cloud and are they separable?
3. How would our architecture change on the cloud?
4. How do we get to an actionable plan?
6
Evaluation process
● Started evaluation in 2016
● Were able to make a patient, rigorous decision
● Defined baseline workload requirements
● Engaged major providers
● Analyzed clouds for each major workload
● Built overall cloud plan
● Iterated and optimized choices
7
Evaluation Timeline
Considering Moving
Dates: June ’16, Sept ’16, Mar ’17, July ’17, Nov ’17, Jan ’18, Apr ’18, June ’18
● PoCs Completed & Results Delivered
● Legal Agreement with T&Cs ratified
● Kickoff Dataproc, BigQuery, Dataflow experimentation
● Security and Platform Review
● v1 Hadoop on GCP Architecture Ratified
● Begin build for migration plan
● Consensus built with Product, Revenue, Eng
● Migration Kickoff
● Proposal to migrate Hadoop to GCP formally accepted
● Initial Cloud RfP release
● 27 Synthetic PoCs on GCP begin
● Testing Projects / Network established
8
Built overall cloud plan
● Created a series of candidate architectures for each platform with their resource requirements
● Developed a migration project plan & timeline
● Created financial projections
● Factored in some other business considerations
9
Financial modeling
● 10-year time horizon to avoid timing artifacts
● Compared on premise and multiple cloud scenarios
● Both migration costs and long-term costs
● Long-term price/performance curves (e.g. Moore’s Law, historical pricing)
● Two independent models to avoid model errors
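As a rough illustration of the modeling idea above, the sketch below sums yearly cost over a 10-year horizon while the workload grows and the unit price falls. All inputs (starting costs, growth rate, price decline, migration cost) are hypothetical placeholders, not figures from the talk, and the function name is an assumption made for this example.

```python
def cumulative_cost(initial_annual_cost, annual_price_decline, annual_growth,
                    years=10, one_time_cost=0.0):
    """Sum yearly costs over the horizon.

    Each year the workload grows by `annual_growth` while the unit price
    falls by `annual_price_decline`; `one_time_cost` stands in for migration.
    """
    total = one_time_cost
    yearly = initial_annual_cost
    for _ in range(years):
        total += yearly
        yearly *= (1 + annual_growth) * (1 - annual_price_decline)
    return total


# Hypothetical scenarios in arbitrary cost units (not Twitter's numbers).
on_prem = cumulative_cost(100.0, annual_price_decline=0.10, annual_growth=0.30)
cloud = cumulative_cost(110.0, annual_price_decline=0.15, annual_growth=0.30,
                        one_time_cost=50.0)
print(f"on premise: {on_prem:.0f}, cloud: {cloud:.0f}")
```

A long horizon matters here because a one-time migration cost dominates a short window but is amortized over ten years; running the same scenarios through a second, independently built model is one way to catch modeling mistakes.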
10
What we found
● An immediate all-in migration at Twitter scale is expensive, distracting, and risky
● More value comes from new architectures and transformation, so start smaller and learn as we go
● Hadoop offered several important, specific benefits with lower risk
● We gained confidence in our investments in both cloud projects and data centers
11
Hadoop@Twitter scale
● >1.4T messages per day
● >500K compute cores
● >300PB logical storage
● >12,500 peak cluster size
12
Twitter Hadoop cluster types
Type        Use                                                           Compute %
Real-time   Critical performance production jobs with dedicated capacity  10%
Processing  Regularly scheduled production jobs with dedicated capacity   60%
Ad-hoc      One off / ad-hoc queries and analysis                         30%
Cold        Dense storage clusters, not for compute                       minimal
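To make the compute split concrete, here is a small sketch that applies the percentages to the ">500K compute cores" figure. It assumes the percentages refer to that same fleet and uses 500,000 as a lower bound; the cold clusters are left out because their share is only given as "minimal".

```python
TOTAL_CORES = 500_000  # lower bound taken from the ">500K compute cores" figure

# Compute share per cluster type, as listed in the cluster-type table.
COMPUTE_SHARE = {"Real-time": 0.10, "Processing": 0.60, "Ad-hoc": 0.30}


def cores_by_type(total_cores, shares):
    """Split a core count across cluster types by their compute share."""
    return {name: round(total_cores * share) for name, share in shares.items()}


for name, cores in cores_by_type(TOTAL_CORES, COMPUTE_SHARE).items():
    print(f"{name:<11}{cores:>8,} cores")
```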
13
Twitter Hadoop challenges
1. Scaling: Significant YoY Compute & Storage growth
2. Hardware: Designing, building, maintaining & operating
3. Capacity Planning: Hard to predict for adhoc especially
4. Agility: Must respond fast especially for adhoc compute
5. Deployment: Must deploy at scale and in-flight
6. Network: Both cross-DC and cross-cluster
7. Disaster Recovery: Durable copies needed in 2+ DCs
14
Twitter Hadoop requirements
● Network sustained bandwidth per core
● Disk (data) sustained bandwidth per core
● Large sequential reads & writes
● Throughput not latency
● Capacity
● CPU / RAM not usually the bottleneck
● Consistency of datasets (set of HDFS files)
15
Twitter Hadoop on premise hardware numbers
Clusters: 10 to 10K nodes
Network: 10G moving to 25G
Data Disks: 24T-72T over 12 HDDs
CPU: 8 cores with 64G memory
I/O:
  Network: ~20MB/s sustained, peaks of 10x
  HDFS read: 20 rq/s sustained, peaks of 3x
  HDFS write: large variation, peaks of 10x
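Because the requirements are phrased per core, it can help to restate the node numbers above as per-core ratios. The sketch below assumes "10G" and "25G" are NIC line rates in Gbit/s and that "24T-72T" is raw disk capacity in TB per node; both readings are assumptions.

```python
CORES_PER_NODE = 8  # from "CPU: 8 cores with 64G memory"


def per_core(value, cores=CORES_PER_NODE):
    """Divide a per-node quantity evenly across the node's cores."""
    return value / cores


# Disk capacity per core (TB), for the 24T and 72T node variants.
print("disk TB per core:", per_core(24), "to", per_core(72))
# Network line rate per core (Gbit/s), for 10G and 25G NICs.
print("network Gbit/s per core:", per_core(10), "to", per_core(25))
```

Ratios like these are what a cloud instance type has to match, which is why the evaluation later compares results normalized by IO-per-core rather than per node.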
16
Cloud architectural options
1. Hadoop-as-a-Service (HaaS) from the cloud provider
2. Twitter Hadoop on cloud VMs
   Durable storage: cloud object store
   Scratch storage:
   a. with HDFS over cloud object store
   b. with HDFS on cloud block store
   c. with HDFS on local disks
17
Testing plan
1. Baseline Tests
● TestDFSIO: low level IO read/write
● Teragen: measure maximum write rate
● Terasort: read, shuffle, write
2. Functional Test
Gridmix: IO + Compute
● Capture of real production cluster workload (1k-5k jobs)
● Replays reads, writes, shuffles, compute
18
HDFS configurations tested
Availability
● Critical data: 2 regions
● Other data: 2 zones
Each type of Object, Block and Local Storage
Dataset consistency
Test cloud provider choices:
1. object store
2. object store with external consistency service
19
Hadoop Evaluation
20
GCP HaaS: Dataproc config
● Hadoop 2.7.2
● Performance tests with 800 vCPUs:
○ 100 x n1-standard-8 (8 vCPU, 30G memory)
○ 200 x n1-standard-4 (4 vCPU, 15G memory)
● Scale test with 8000 vCPUs:
○ 1000 x n1-standard-8 (8 vCPU, 30G memory)
● Modeled average CPU and average to peak CPU
● No preemptible instances in initial work
● Similar to on premise hardware SKUs
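A quick arithmetic check of the cluster sizes listed above: both performance configurations add up to the same vCPU total, which is what makes their results comparable per core. The dictionary labels are just names chosen for this sketch.

```python
def total_vcpus(nodes, vcpus_per_node):
    """Total vCPUs in a cluster of identical nodes."""
    return nodes * vcpus_per_node


# (node count, vCPUs per node) for each configuration on the slide.
configs = {
    "performance, n1-standard-8": (100, 8),
    "performance, n1-standard-4": (200, 4),
    "scale, n1-standard-8": (1000, 8),
}
for name, (nodes, vcpus) in configs.items():
    print(f"{name}: {total_vcpus(nodes, vcpus)} vCPUs")
```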
21
Decided to use Dataproc for evaluation.
Dataproc 100 x n1-standard-8 Results
Durable Storage   Scratch Storage   HDFS            Speedup vs on premise (normalized by IO-per-core)
Cloud Storage     Local SSD         3 x 375G SSD    ~2x (but expensive)
Cloud Storage     PD-HDD            1.5TB PD-HDD    ~1x
None              PD-HDD            1.5TB PD-HDD    ~1x
Tuned Compute Engine instance types to get the optimum balance of network : cores : storage (this changes over time)
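The slides do not spell out how the speedup was normalized. One plausible reading, sketched below, divides aggregate throughput by core count on each side and takes the ratio. The function names and the sample throughput numbers are hypothetical and only illustrate the calculation.

```python
def io_per_core(throughput_mb_s, cores):
    """Aggregate IO throughput divided by the number of cores."""
    return throughput_mb_s / cores


def normalized_speedup(cloud_mb_s, cloud_cores, onprem_mb_s, onprem_cores):
    """Ratio of cloud IO-per-core to on premise IO-per-core."""
    return io_per_core(cloud_mb_s, cloud_cores) / io_per_core(onprem_mb_s, onprem_cores)


# Hypothetical measurements: an 800-core cloud cluster against a
# 4,000-core on premise cluster.
speedup = normalized_speedup(cloud_mb_s=16_000, cloud_cores=800,
                             onprem_mb_s=40_000, onprem_cores=4_000)
print(f"normalized speedup: {speedup:.1f}x")
```

Normalizing this way lets a small test cluster be compared with a much larger production cluster without the difference in size dominating the result.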
22
Dataproc 200 x n1-standard-4 Results
Durable Storage   Scratch Storage   HDFS            Speedup vs on premise (normalized by IO-per-core)
Cloud Storage     Local SSD         2 x 375G SSD    ~2x (but expensive)
Cloud Storage     PD-HDD            1.5TB PD-HDD    1.4x
23
Benchmark Findings
1. Application Benchmarks are critical
Total job time is composed of multiple steps. We found variation both better and worse at each step.
Recommendation: You should rely on an application benchmark like GridMix rather than micro-benchmarks.
2. Can treat network storage like local disk
Both Cloud Storage and PD offered nearly as much bandwidth as typical direct attached HDDs on premise
24
Functional Test Findings
1. Live Migration of VMs was not noticeable during Hadoop testing. It was noticeable during other Twitter platform testing of Compute Engine (cache at very high rps of small objects).
2. Cloud Storage checksums and HDFS checksums were not directly comparable. Fixed via HDFS-13056 in collaboration with Google.
3. The fsync() system call on Local SSD was slow (fixed).
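The checksum finding can be illustrated with a simplified model. HDFS's default file checksum is built from per-chunk CRCs, so it depends on how the file is chunked, whereas a CRC over the whole byte stream does not; HDFS-13056 added a composite CRC mode that is comparable across layouts. The sketch below is not the HDFS algorithm: it uses `zlib.crc32` (plain CRC32, not the CRC32C used by HDFS and Cloud Storage) and a single MD5 level, purely to show the layout dependence.

```python
import hashlib
import zlib


def md5_of_chunk_crcs(data, chunk_size):
    """Layout-dependent checksum: MD5 over the CRC of each fixed-size chunk."""
    digest = hashlib.md5()
    for start in range(0, len(data), chunk_size):
        crc = zlib.crc32(data[start:start + chunk_size])
        digest.update(crc.to_bytes(4, "big"))
    return digest.hexdigest()


def streamed_crc(data, chunk_size):
    """Layout-independent checksum: one running CRC fed chunk by chunk."""
    crc = 0
    for start in range(0, len(data), chunk_size):
        crc = zlib.crc32(data[start:start + chunk_size], crc)
    return crc


data = bytes(range(256)) * 64  # 16 KiB of sample data

# Same bytes, different chunk sizes: the chunk-based checksum changes.
print("chunk-based equal:", md5_of_chunk_crcs(data, 512) == md5_of_chunk_crcs(data, 1024))
# The whole-stream CRC is the same however the data is split.
print("whole-stream equal:", streamed_crc(data, 512) == streamed_crc(data, 1024))
```

This is why a file copied from HDFS to an object store could not be verified by comparing the two systems' native checksums until a layout-independent form was available.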
25
Evaluation Outcomes
26
Disqualified Lift-and-Shift *Everything*
+ Leads to the fastest migration
+ Limits duplication of costs during migration period
- Introduces significant tech debt post-migration
- Requires a major rearchitecture post-migration to capture benefits of cloud
- Concerns around overall cost, risk, and distraction of this approach at Twitter scale
27
Hadoop to Cloud was Interesting
● Separable with fewer dependencies
● Standard open source software:
○ Continue to develop in house and run on premise
○ Reduces lock-in risk
● Rearchitecting is achievable
○ Not a lift-and-shift
● Data in Cloud Storage:
○ Enables broader diversity of data processing frameworks and services
● Long-term bet on Google’s Big Data ecosystem
28
Separate Hadoop Compute and Storage
Enables:
● Scaling the dimensions independently
● Makes it easy to run multiple clusters and processing frameworks over the same data
● Virtual network and project primitives provide segmentation of access and cost structures
● State is preserved in Cloud Storage, therefore deployments, upgrades, and testing are simpler
● Can treat storage as a commodity
29
Twitter Hadoop Rearchitected for Cloud
1. Cold Cluster
● Storage: Cloud Storage
● Compute: Limited ephemeral Dataproc an option
● Scaling: mostly storage driven
2. Ad-Hoc Clusters
● Storage: Cloud Storage
● Compute: Compute Engine and Twitter build of Hadoop (long running clusters)
● Scaling: mixture, with spiky compute
30
Twitter production Hadoop remains on premise
● Not as separable from other production workloads
● Focusing on non-production workloads limits our risk
● Regular compute-intensive usage patterns
● Benefits more from purpose built hardware
● Fewer processing frameworks are needed
31
Twitter Strategic Benefits
What does this do overall for Twitter?
● Next-generation architecture with numerous enhancements:
○ security, encryption, isolation, live migration
● Leverage Google’s capacity and R&D
● Larger ecosystem of open source & cloud software
● Long-term strategic collaboration with Google
● Beachhead that enables teams across Twitter to make tactical cloud adoption decisions
32
Twitter Functional Benefits
Infrastructure benefits
● Large-scale ad-hoc analysis and backfills
● Cloud Storage avoids HDFS limits
● Offsite Backup
● Increases availability of cold data
Platform benefits
● Built-in compliance support (e.g. SOX)
● Direct chargeback using Project
● Simplified retention
● GCP services such as BigQuery, Spanner, Cloud ML, TPUs, etc.
33
Finding: At Twitter Scale, Cloud has limits
● Cloud providers have limits for all sorts of things and we often need them increased
● Cloud HaaS offerings do not generally support 10K node Hadoop clusters
● Dynamic scaling down < O(days) is not yet feasible / cost-effective with current Hadoop at Twitter scale
● Capacity planning with cloud providers is encouraged for O(10K) vCPU deltas and required for O(100K) vCPU deltas
34
What we are working on now
❏ Finalizing bucket & user creation and IAM designs
❏ Building replication, cluster deployment, and data management software
❏ Hadoop Cloud Storage connector improvements continue (open source)
❏ Retention and “directory” / dataset atomicity in GCS
✓ Foundational network (8x100Gbps)
✓ Copy cluster
✓ Copying PBs of data to the cloud
✓ Early Presto analytics use case: up to 100K-core Dataproc cluster querying 15PB dataset in Cloud Storage
35
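For a sense of scale, the sketch below estimates how long a bulk copy takes over the 8x100Gbps foundational network. It assumes decimal units (1 PB = 10^15 bytes) and, by default, full line-rate utilization, which real transfers will not sustain; the `utilization` parameter is a hypothetical knob for that.

```python
def transfer_days(petabytes, links=8, gbps_per_link=100, utilization=1.0):
    """Days needed to move `petabytes` of data over the aggregate link."""
    bits = petabytes * 1e15 * 8
    bits_per_second = links * gbps_per_link * 1e9 * utilization
    return bits / bits_per_second / 86_400  # 86,400 seconds per day


print(f"1 PB at line rate:   {transfer_days(1) * 24:.1f} hours")
print(f"300 PB at line rate: {transfer_days(300):.1f} days")
print(f"300 PB at 50% use:   {transfer_days(300, utilization=0.5):.1f} days")
```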
Recommendations
and Conclusion
36
Recommendations
1. Run the most informative tests
Application-level benchmarking (e.g. GridMix)
Scale testing
2. Compare application benchmark costs
Compare the cost of running an application using benchmark results. Don’t just look at pricing pages.
e.g. the network is hugely important to performance.
3. Ensure migration plan captures benefits
Lift-and-shift may not deliver value in all cases.
Substantial iteration is required to balance tactical migration work with long-term strategy.
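The second recommendation can be made concrete with a small calculation: price an option by what a benchmark run costs, not by the hourly rate alone. The prices and runtimes below are hypothetical and chosen only to show that a more expensive instance can still be cheaper per run.

```python
def cost_per_run(nodes, hourly_price_per_node, runtime_hours):
    """Cost of one benchmark run: nodes x hourly price x measured runtime."""
    return nodes * hourly_price_per_node * runtime_hours


# Hypothetical options: B costs more per hour but finishes the job sooner,
# for example because of better network bandwidth.
option_a = cost_per_run(nodes=100, hourly_price_per_node=0.40, runtime_hours=2.0)
option_b = cost_per_run(nodes=100, hourly_price_per_node=0.50, runtime_hours=1.5)

print(f"option A: {option_a:.2f} per run")
print(f"option B: {option_b:.2f} per run")
print("cheaper per run:", "B" if option_b < option_a else "A")
```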
37
Conclusions
1. Separate compute and storage is a real thing
The better the network, the less locality matters.
Life gets much easier when Compute can be stateless.
You can treat PD like direct attached HDDs.
2. Cloud adoption is complex
Finding separable workloads can be a challenge.
Architectural choices are non-obvious.
Methodical evaluation is well worth the effort.
3. Very early in this process and lots more to come
We’re excited to be gaining experience with the platform and learning from everyone.
38
Thank You
Questions?
39
Ad

More Related Content

What's hot (20)

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
Jelena Zanko
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
confluent
 
Symantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in actionSymantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in action
DataStax Academy
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
Maturin BADO
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
The HDF-EOS Tools and Information Center
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
inside-BigData.com
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
Alex Van Boxel
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloud
Rafał Leszko
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
Rafał Leszko
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
Shu-Jeng Hsieh
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Scaling HDFS at Xiaomi
Scaling HDFS at XiaomiScaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
DataWorks Summit
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
Dendej Sawarnkatat
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
HBaseCon
 
Discover some "Big Data" architectural concepts with Redis
Discover some  "Big Data" architectural concepts with  Redis Discover some  "Big Data" architectural concepts with  Redis
Discover some "Big Data" architectural concepts with Redis
Maturin BADO
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
Jelena Zanko
 
Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium Change Data Streaming Patterns for Microservices With Debezium
Change Data Streaming Patterns for Microservices With Debezium
confluent
 
Symantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in actionSymantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in action
DataStax Academy
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics
 
DIscover Spark and Spark streaming
DIscover Spark and Spark streamingDIscover Spark and Spark streaming
DIscover Spark and Spark streaming
Maturin BADO
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
inside-BigData.com
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloud
Rafał Leszko
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
Rafał Leszko
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
Shu-Jeng Hsieh
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
HBaseCon
 
Discover some "Big Data" architectural concepts with Redis
Discover some  "Big Data" architectural concepts with  Redis Discover some  "Big Data" architectural concepts with  Redis
Discover some "Big Data" architectural concepts with Redis
Maturin BADO
 

Similar to How @twitterhadoop chose google cloud (20)

Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Alluxio, Inc.
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
Aswini Ashu
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
aswini pilli
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Sumeet Singh
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
DataWorks Summit
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Alluxio, Inc.
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Sumeet Singh
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
bigdata trunk
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
Chiou-Nan Chen
 
Bigdata and Hadoop with Docker
Bigdata and Hadoop with DockerBigdata and Hadoop with Docker
Bigdata and Hadoop with Docker
haridasnss
 
vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28
vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28
vBACD - Distributed Petabyte-Scale Cloud Storage with GlusterFS - 2/28
CloudStack - Open Source Cloud Computing Project
 
The Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgThe Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.org
John Mark Walker
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio, Inc.
 
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
Alluxio, Inc.
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
Alluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
Alluxio, Inc.
 
Getting more into GCP.pdf
Getting more into GCP.pdfGetting more into GCP.pdf
Getting more into GCP.pdf
Knoldus Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Alluxio, Inc.
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
Aswini Ashu
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
aswini pilli
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Sumeet Singh
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
DataWorks Summit
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Alluxio, Inc.
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Sumeet Singh
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
bigdata trunk
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
Chiou-Nan Chen
 
Bigdata and Hadoop with Docker
Bigdata and Hadoop with DockerBigdata and Hadoop with Docker
Bigdata and Hadoop with Docker
haridasnss
 
The Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgThe Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.org
John Mark Walker
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio, Inc.
 
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
Alluxio, Inc.
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
Alluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
Alluxio, Inc.
 
Getting more into GCP.pdf
Getting more into GCP.pdfGetting more into GCP.pdf
Getting more into GCP.pdf
Knoldus Inc.
 
Ad

More from lohitvijayarenu (8)

OpenSource and the Cloud ApacheCon.pptx
OpenSource and the Cloud  ApacheCon.pptxOpenSource and the Cloud  ApacheCon.pptx
OpenSource and the Cloud ApacheCon.pptx
lohitvijayarenu
 
The Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at TwitterThe Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at Twitter
lohitvijayarenu
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitter
lohitvijayarenu
 
Large Scale EventLog Management @Twitter
Large Scale EventLog Management @TwitterLarge Scale EventLog Management @Twitter
Large Scale EventLog Management @Twitter
lohitvijayarenu
 
Routing trillion events per day @twitter
Routing trillion events per day @twitterRouting trillion events per day @twitter
Routing trillion events per day @twitter
lohitvijayarenu
 
Open Source india 2014
Open Source india 2014Open Source india 2014
Open Source india 2014
lohitvijayarenu
 
Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at
lohitvijayarenu
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
lohitvijayarenu
 
OpenSource and the Cloud ApacheCon.pptx
OpenSource and the Cloud  ApacheCon.pptxOpenSource and the Cloud  ApacheCon.pptx
OpenSource and the Cloud ApacheCon.pptx
lohitvijayarenu
 
The Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at TwitterThe Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at Twitter
lohitvijayarenu
 
Scaling event aggregation at twitter
Scaling event aggregation at twitterScaling event aggregation at twitter
Scaling event aggregation at twitter
lohitvijayarenu
 
Large Scale EventLog Management @Twitter
Large Scale EventLog Management @TwitterLarge Scale EventLog Management @Twitter
Large Scale EventLog Management @Twitter
lohitvijayarenu
 
Routing trillion events per day @twitter
Routing trillion events per day @twitterRouting trillion events per day @twitter
Routing trillion events per day @twitter
lohitvijayarenu
 
Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at
lohitvijayarenu
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
lohitvijayarenu
 
Ad

How @twitterhadoop chose Google Cloud

  • 1. How @twitterhadoop chose Google Cloud. Joep Rottinghuis & Lohit VijayaRenu, Twitter Hadoop Team (@twitterhadoop)
  • 2. 1. Twitter infrastructure 2. Hadoop evaluation 3. Evaluation outcomes 4. Recommendations and conclusions 5. Q&A. Credit to presentation at GoogleNext 2019 by Derek Lyon & Dave Beckett (https://youtu.be/4FLFcWgZdo4)
  • 4. Twitter’s infrastructure ● Twitter founded in 2006 ● Global-scale application ● Unique scale and performance characteristics ● Real-time ● Built to purpose and well optimized ● Large data centers
  • 5. Strategic questions 1. What is the long-term mix of cloud versus datacenter? 2. Which cloud provider(s) should we use? 3. How can we be confident in this type of decision? 4. Why should we evaluate this now (2016)?
  • 6. Tactical questions 1. What is the feasibility and cost of large-scale adoption? 2. Which workloads are best-suited for the cloud and are they separable? 3. How would our architecture change on the cloud? 4. How do we get to an actionable plan?
  • 7. Evaluation process ● Started evaluation in 2016 ● Were able to make a patient, rigorous decision ● Defined baseline workload requirements ● Engaged major providers ● Analyzed clouds for each major workload ● Built overall cloud plan ● Iterated and optimized choices
  • 8. Evaluation Timeline (Considering, Moving). Date markers: June ‘16, Sept ‘16, Mar ‘17, July ‘17, Nov ‘17, Jan ‘18, Apr ‘18, June ‘18. Milestones: ● Initial Cloud RfP release ● 27 Synthetic PoC’s on GCP begin ● Testing Projects / Network established ● PoC’s Completed & Results Delivered ● Legal Agreement with T&C’s ratified ● Kickoff dataproc, bigquery, dataflow experimentation ● Security and Platform Review ● v1 Hadoop on GCP Architecture Ratified ● Begin build for migration plan ● Consensus built with Product, Revenue, Eng ● Migration Kickoff ● Proposal to migrate Hadoop to GCP formally accepted
  • 9. Built overall cloud plan ● Created a series of candidate architectures for each platform with their resource requirements ● Developed a migration project plan & timeline ● Created financial projections ● With some other business considerations
  • 10. Financial modeling ● 10-year time horizon to avoid timing artifacts ● Compared on premise and multiple cloud scenarios ● Costs of migration and long-term ● Long-term price/performance curves (e.g. Moore’s Law, historical pricing) ● Two independent models to avoid model errors
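As a rough illustration of the financial modeling on slide 10, the sketch below adds up a scenario's cost over a 10-year horizon with a one-time migration cost, growing demand, and a declining unit price. The function name, the parameters, and the compounding formula are all assumptions made for illustration; the slides do not describe the actual models Twitter used.

```python
def cumulative_cost(first_year_cost: float,
                    annual_growth: float,
                    annual_unit_price_decline: float,
                    years: int = 10,
                    migration_cost: float = 0.0) -> float:
    """Total cost of one scenario over a fixed horizon (hypothetical model).

    Demand is assumed to grow by `annual_growth` per year while the unit
    price falls by `annual_unit_price_decline` per year. Both rates are
    illustrative inputs, not figures from the presentation.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    total = migration_cost  # one-time cost paid up front
    for year in range(years):
        demand = (1.0 + annual_growth) ** year
        unit_price = (1.0 - annual_unit_price_decline) ** year
        total += first_year_cost * demand * unit_price
    return total


if __name__ == "__main__":
    # Made-up inputs: compare two scenarios over the same 10-year horizon.
    on_prem = cumulative_cost(100.0, annual_growth=0.25, annual_unit_price_decline=0.125)
    cloud = cumulative_cost(120.0, annual_growth=0.25, annual_unit_price_decline=0.25,
                            migration_cost=50.0)
    print(f"on premise: {on_prem:.1f}  cloud: {cloud:.1f}")
```

Running the same scenario through a second, independently written model and comparing the totals is one way to approximate the "two independent models" check mentioned on the slide.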
  • 11. What we found ● An immediate all-in migration at Twitter scale is: expensive, distracting, and risky ● More value from new architectures and transformation, so start smaller and learn as we go ● Hadoop offered several important, specific benefits with lower risk ● We gained confidence in our investments in both cloud projects and data centers
  • 12. Hadoop@Twitter scale ● >1.4T Messages Per Day ● >500K Compute Cores ● >300PB Logical Storage ● >12,500 Peak Cluster Size
  • 13. Twitter Hadoop cluster types
      Type | Use | Compute %
      Real-time | Critical performance production jobs with dedicated capacity | 10%
      Processing | Regularly scheduled production jobs with dedicated capacity | 60%
      Ad-hoc | One off / ad-hoc queries and analysis | 30%
      Cold | Dense storage clusters, not for compute | minimal
  • 14. Twitter Hadoop challenges 1. Scaling: Significant YoY Compute & Storage growth 2. Hardware: Designing, building, maintaining & operating 3. Capacity Planning: Hard to predict, especially for ad-hoc 4. Agility: Must respond fast, especially for ad-hoc compute 5. Deployment: Must deploy at scale and in-flight 6. Network: Both cross-DC and cross-cluster 7. Disaster Recovery: Durable copies needed in 2+ DCs
  • 15. Twitter Hadoop requirements ● Network sustained bandwidth per core ● Disk (data) sustained bandwidth per core ● Large sequential reads & writes ● Throughput not latency ● Capacity ● CPU / RAM not usually the bottleneck ● Consistency of datasets (set of HDFS files)
  • 16. Twitter Hadoop on premise hardware numbers
      ● Clusters: 10 to 10K nodes
      ● Network: 10G moving to 25G
      ● Data Disks: 24T-72T over 12 HDDs
      ● CPU: 8 cores with 64G memory
      ● I/O:
          ○ Network: ~20MB/s sustained, peaks of 10x
          ○ HDFS read: 20 rq/s sustained, peaks of 3x
          ○ HDFS write: large variation, peaks of 10x
  • 17. Cloud architectural options 1. Hadoop-as-a-Service (HaaS) from the cloud provider 2. Twitter Hadoop on cloud VMs. Durable storage: cloud object store. Scratch storage: a. with HDFS over cloud object store b. with HDFS on cloud block store c. with HDFS on local disks
  • 18. Testing plan 1. Baseline Tests ● TestDFSIO: low level IO read/write ● Teragen: measure maximum write rate ● Terasort: read, shuffle, write 2. Functional Test, Gridmix: IO + Compute ● Capture of real production cluster workload (1k-5k jobs) ● Replays reads, writes, shuffles, compute
  • 19. HDFS configurations tested
      ● Availability:
          ○ Critical data: 2 regions
          ○ Other data: 2 zones
      ● Each type of Object, Block and Local Storage
      ● Dataset consistency: test cloud provider choices:
          1. object store
          2. object store with external consistency service
  • 21. GCP HaaS: DataProc config. Decided to use DataProc for evaluation.
      ● Hadoop 2.7.2
      ● Performance tests with 800 vCPUs:
          ○ 100 x n1-standard-8 (8 vCPU, 30G memory)
          ○ 200 x n1-standard-4 (4 vCPU, 30G memory)
      ● Scale test with 8000 vCPUs:
          ○ 1000 x n1-standard-8 (8 vCPU, 30G memory)
      ● Modeled average CPU and average to peak CPU
      ● No preemptible instances in initial work
      ● Similar to on premise hardware SKUs
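The cluster shapes on slide 21 can be sanity-checked with a few lines of arithmetic. The sketch below only multiplies node counts by vCPUs per node as listed on the slide; the `ClusterShape` class and its field names are hypothetical and are not part of any Dataproc API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    """One worker pool from the slide: node count and vCPUs per node."""
    nodes: int
    machine_type: str
    vcpus_per_node: int

    def total_vcpus(self) -> int:
        return self.nodes * self.vcpus_per_node


# Shapes as listed on the slide.
PERF_A = ClusterShape(100, "n1-standard-8", 8)
PERF_B = ClusterShape(200, "n1-standard-4", 4)
SCALE = ClusterShape(1000, "n1-standard-8", 8)

if __name__ == "__main__":
    for shape in (PERF_A, PERF_B, SCALE):
        print(f"{shape.nodes} x {shape.machine_type}: {shape.total_vcpus()} vCPUs")
```

Both performance-test shapes come to the same 800 vCPUs, so differences between them reflect the per-node balance of cores, network, and storage rather than total compute.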
  • 22. DataProc 100 x n1-standard-8 Results
      Durable Storage | Scratch Storage | HDFS | Speedup vs on premise (normalized by IO-per-core)
      Cloud Storage | Local SSD | 3 x 375G SSD | ~2x (but expensive)
      Cloud Storage | PD-HDD | 1.5TB PD-HDD | ~1x
      None | PD-HDD | 1.5TB PD-HDD | ~1x
      Tuned Compute Engine instance types to get the optimum balance of network : cores : storage (this changes over time)
  • 23. DataProc 200 x n1-standard-4 Results
      Durable Storage | Scratch Storage | HDFS | Speedup vs on premise (normalized by IO-per-core)
      Cloud Storage | Local SSD | 2 x 375G SSD | ~2x (but expensive)
      Cloud Storage | PD-HDD | 1.5TB PD-HDD | 1.4x
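The result tables on slides 22 and 23 report speedup "normalized by IO-per-core". The slides do not give the formula, so the sketch below shows one plausible reading: divide each system's sustained I/O bandwidth by its core count, then take the ratio. The function names and all input numbers are hypothetical.

```python
def io_per_core(bandwidth_mb_s: float, cores: int) -> float:
    """Sustained I/O bandwidth available to each core, in MB/s."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return bandwidth_mb_s / cores


def normalized_speedup(cloud_bw_mb_s: float, cloud_cores: int,
                       onprem_bw_mb_s: float, onprem_cores: int) -> float:
    """Ratio of cloud to on-premise IO-per-core (assumed definition)."""
    return (io_per_core(cloud_bw_mb_s, cloud_cores)
            / io_per_core(onprem_bw_mb_s, onprem_cores))


if __name__ == "__main__":
    # Made-up node figures: same bandwidth, half the cores on the cloud node.
    print(normalized_speedup(400.0, 4, 400.0, 8))
```

Under this reading, a node with the same bandwidth but fewer cores scores higher, which is why the comparison depends on tuning the balance of network, cores, and storage.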
  • 24. Benchmark Findings 1. Application benchmarks are critical. Total job time is composed of multiple steps, and we found variation both better and worse at each step. Recommendation: rely on an application benchmark like GridMix rather than micro-benchmarks. 2. Can treat network storage like local disk. Both Cloud Storage and PD offered nearly as much bandwidth as typical direct attached HDDs on premise.
  • 25. Functional Test Findings 1. Live Migration of VMs was not noticeable during Hadoop testing. It was noticeable during other Twitter platform testing of Compute Engine (cache at very high rps of small objects). 2. Cloud Storage checksum vs HDFS checksum: fixed via HDFS-13056 in collaboration with Google. 3. fsync() system call on Local SSD was slow (fixed).
  • 27. Disqualified: Lift-and-Shift *Everything*
      + Leads to the fastest migration
      + Limits duplication of costs during migration period
      - Introduces significant tech debt post-migration
      - Requires a major rearchitecture post-migration to capture benefits of cloud
      - Concerns around overall cost, risk, and distraction of this approach at Twitter scale
  • 28. Hadoop to Cloud was Interesting ● Separable with fewer dependencies ● Standard open source software: ○ Continue to develop in house and run on premise ○ Reduces lock-in risk ● Rearchitecting is achievable ○ Not a lift-and-shift ● Data in Cloud Storage: ○ Enables broader diversity of data processing frameworks and services ● Long-term bet on Google’s Big Data ecosystem
  • 29. Separate Hadoop Compute and Storage. Enables: ● Scaling the dimensions independently ● Makes it easy to run multiple clusters and processing frameworks over the same data ● Virtual network and project primitives provide segmentation of access and cost structures ● State is preserved in Cloud Storage, therefore deployments, upgrades, and testing are simpler ● Can treat storage as a commodity
  • 30. Twitter Hadoop Rearchitected for Cloud 1. Cold Cluster ● Storage: Cloud Storage ● Compute: Limited ephemeral Dataproc an option ● Scaling: mostly storage driven 2. Ad-Hoc Clusters ● Storage: Cloud Storage ● Compute: Compute Engine and Twitter build of Hadoop (long running clusters) ● Scaling: mixture, with spiky compute
  • 31. Twitter production Hadoop remains on premise ● Not as separable from other production workloads ● Focusing on non-production workloads limits our risk ● Regular compute-intensive usage patterns ● Benefits more from purpose built hardware ● Fewer processing frameworks are needed
  • 32. Twitter Strategic Benefits: What does this do overall for Twitter? ● Next-generation architecture with numerous enhancements: ○ security, encryption, isolation, live migration ● Leverage Google’s capacity and R&D ● Larger ecosystem of open source & cloud software ● Long-term strategic collaboration with Google ● Beachhead that enables teams across Twitter to make tactical cloud adoption decisions
  • 33. Twitter Functional Benefits
      Infrastructure benefits ● Large-scale ad-hoc analysis and backfills ● Cloud Storage avoids HDFS limits ● Offsite Backup ● Increases availability of cold data
      Platform benefits ● Built-in compliance support (e.g. SOX) ● Direct chargeback using Project ● Simplified retention ● GCP services such as BigQuery, Spanner, Cloud ML, TPUs, etc
  • 34. Finding: At Twitter Scale, Cloud has limits ● Cloud providers have limits for all sorts of things and we often need them increased ● Cloud HaaS offerings do not generally support 10K node Hadoop clusters ● Dynamic scaling down < O(days) is not yet feasible / cost-effective with current Hadoop at Twitter scale ● Capacity planning with cloud providers is encouraged for O(10K) vCPU deltas and required for O(100K) vCPU deltas
  • 35. What we are working on now
      Done: ✓ Foundational network (8x100Gbps) ✓ Copy cluster ✓ Copying PBs of data to the cloud ✓ Early Presto analytics use case: up to 100K-core Dataproc cluster querying 15PB dataset in Cloud Storage
      In progress: ❏ Finalizing bucket & user creation and IAM designs ❏ Building replication, cluster deployment, and data management software ❏ Hadoop Cloud Storage connector improvements continue (open source) ❏ Retention and “directory” / dataset atomicity in GCS
  • 37. Recommendations 1. Run the most informative tests: application-level benchmarking (e.g. GridMix) and scale testing. 2. Compare application benchmark costs: compare the cost of running an application using benchmark results. Don’t just look at pricing pages, e.g. the network is hugely important to performance. 3. Ensure migration plan captures benefits: lift-and-shift may not deliver value in all cases. Substantial iteration is required to balance tactical migration work with long-term strategy.
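Recommendation 2 on slide 37 can be made concrete with a small calculation: the cost of a benchmark run is the hourly price times the node count times the measured runtime, so a cheaper list price can still produce a more expensive run. The function name and every number below are hypothetical; none of them come from the presentation.

```python
def run_cost(nodes: int, price_per_node_hour: float, runtime_hours: float) -> float:
    """Cost of one benchmark run: nodes x hourly price x measured runtime."""
    if nodes < 0 or price_per_node_hour < 0 or runtime_hours < 0:
        raise ValueError("inputs must be non-negative")
    return nodes * price_per_node_hour * runtime_hours


if __name__ == "__main__":
    # Made-up example: B has half the list price but takes 2.5x as long.
    cost_a = run_cost(nodes=100, price_per_node_hour=0.50, runtime_hours=2.0)
    cost_b = run_cost(nodes=100, price_per_node_hour=0.25, runtime_hours=5.0)
    print(f"A: ${cost_a:.2f}  B: ${cost_b:.2f}")
```

The runtime input should come from an application-level benchmark such as GridMix, since it folds in the network and storage effects that a pricing page cannot show.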
  • 38. Conclusions 1. Separate compute and storage is a real thing: the better the network, the less locality matters. Life gets much easier when Compute can be stateless. You can treat PD like direct attached HDDs. 2. Cloud adoption is complex: finding separable workloads can be a challenge. Architectural choices are non-obvious. Methodical evaluation is well worth the effort. 3. Very early in this process and lots more to come: we’re excited to be gaining experience with the platform and learning from everyone.