Google BigData solution
Su plus Hu @ GCPUG.TW
Simon Su
var simon = {};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = 'simonsu.mail@gmail.com';
simon.say('Good luck to everybody!');
Sunny Hu
var sunny = {};
sunny.aboutme = 'https://plus.google.com/u/0/+sunnyHU/posts';
sunny.email = 'sunnyhu@linkernetworks.com';
sunny.language = ['Java', '.NET', 'NodeJS', 'SQL'];
sunny.skill = ['Project management',
  'System Analysis',
  'System design',
  'Car ho lan'];
sunny.say('Writing code is too dreary, so keep your mood sunny');
GCP Qualified Developer
● We are the “舒” (Su) “服” (Hu) duo; together 舒服 means "comfortable" ...
● This is Su Hu style ...
https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421
Google Cloud Platform User Group Taiwan
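We are the Google Cloud Platform Taiwan User Group. Since Google's cloud services began to make their mark in Taiwan, there have been many new services, new knowledge, and new ideas. Everyone is welcome to share and learn about Google's cloud services together...
GCPUG connects people who like Google Cloud over the Internet to share and exchange their experience with GCP. If you are new to Google Cloud Platform, come and hear how more experienced users work with it; if you are a Google Cloud Platform expert, come and share your valuable experience and exchange ideas with other experts; if you have not started using Google Cloud Platform yet, come and hear how we use Google Cloud right away!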
Linker Wants You...
● Data scientist
● Data engineer
● Frontend engineer
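How many hours of video are uploaded to YouTube every minute? 72 hours
How big is the Google search index? 100PB+ (over 100,000 TBs)
How many active users does Google have? 425M+
How long does Google take, on average, to answer a search query? 0.25 seconds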
Management, Mobile, Compute, Networking, Big Data, Storage, Developer Tools
Google innovation
(Timeline, 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Colossus, Flume, Spanner, MillWheel)
Provided as managed services ...
(Timeline, 2002 to 2013: GFS, MapReduce, Bigtable, Dremel, Colossus, Flume, Spanner, MillWheel)
Google Changed the Big Data Market
● Google MapReduce
● Google Bigtable
● Google Borg
● Google Borg
● Google Dremel
Big Data on Google Cloud Platform
Capture: Pub/Sub, BigQuery streaming
Store: Cloud Storage (objects), BigQuery Storage (structured)
Process: Dataflow (stream & batch), Hadoop / Spark (on GCE)
Analyze: BigQuery, Hadoop / Spark (on GCE), larger Hadoop ecosystem
Bigdata Scenario
1M devices -> 16.6K events/sec -> 43B events/month
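The scenario figures are consistent with each device reporting about once a minute. That rate is an assumption, since the slide only gives the totals. A quick Python check of the arithmetic, using a 30-day month:

```python
# Assumption (not stated on the slide): each device sends one event per minute.
DEVICES = 1_000_000
EVENTS_PER_DEVICE_PER_MINUTE = 1

events_per_sec = DEVICES * EVENTS_PER_DEVICE_PER_MINUTE / 60
events_per_month = DEVICES * EVENTS_PER_DEVICE_PER_MINUTE * 60 * 24 * 30  # 30-day month

print(f"{events_per_sec / 1e3:.1f}K events/sec")
print(f"{events_per_month / 1e9:.1f}B events/month")
```

Under that assumption the result is roughly 16.7K events/sec and 43.2B events/month, in line with the rounded figures on the slide.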
Cloud Pub/Sub
(Diagram: Publishers A, B and C publish Messages 1, 2 and 3 to Topics A, B and C. Subscriber X receives Messages 1 and 2 through Subscriptions XA and XB; Subscribers Y and Z each receive Message 3 through Subscriptions YC and ZC.)
● Globally redundant
● Low latency (sub sec.)
● N to N coupling
● Batched read/write
● Push & Pull
● Guaranteed Delivery
● Auto expiration
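The "N to N coupling" point can be sketched with a toy in-memory broker. This is not the Cloud Pub/Sub client API: the `InMemoryPubSub` class and its methods are hypothetical, and the routing of each message to a topic is an assumption read off the diagram labels. The sketch only shows that every subscription on a topic receives its own copy of each message.

```python
from collections import defaultdict, deque


class InMemoryPubSub:
    """Toy stand-in for a pub/sub broker; not the Cloud Pub/Sub client API."""

    def __init__(self):
        self.subscriptions_by_topic = defaultdict(list)  # topic -> subscription names
        self.queues = {}  # subscription name -> pending messages

    def subscribe(self, topic, subscription):
        self.subscriptions_by_topic[topic].append(subscription)
        self.queues[subscription] = deque()

    def publish(self, topic, message):
        # Every subscription on the topic gets its own copy of the message.
        for subscription in self.subscriptions_by_topic[topic]:
            self.queues[subscription].append(message)

    def pull(self, subscription):
        # Return everything pending on this subscription and clear the queue.
        queue = self.queues[subscription]
        messages = list(queue)
        queue.clear()
        return messages


broker = InMemoryPubSub()
broker.subscribe("Topic A", "Subscription XA")
broker.subscribe("Topic B", "Subscription XB")
broker.subscribe("Topic C", "Subscription YC")
broker.subscribe("Topic C", "Subscription ZC")

broker.publish("Topic A", "Message 1")
broker.publish("Topic B", "Message 2")
broker.publish("Topic C", "Message 3")

received = {name: broker.pull(name) for name in list(broker.queues)}
for name, messages in received.items():
    print(name, messages)
```

Because Topic C has two subscriptions, Message 3 is delivered twice, once per subscription, which matches the two copies of Message 3 in the diagram.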
Cloud Dataflow = Managed Flume + MillWheel on GCE
Dataflow use case
ETL
• Movement
• Filtering
• Enrichment
• Shaping
Analysis
• Reduction
• Batch computation
• Continuous computation
Orchestration
• Composition
• External orchestration
• Simulation
Cloud Dataflow SDK - Logic model
Input -> Transform -> Output
Pipeline{
Who => Inputs
What => Transforms <- Aggregations, Filters, Joins, ...
Where => Windows
When => Watermarks + Triggers <- Completeness
To => Outputs
}
Life of Pipeline
(Diagram labels: User Code & SDK; Deploy & Schedule; GCP Managed Service with Job Manager and Work Manager; Progress & Logs; Monitoring UI)
Cloud Dataflow SDK
❯ Unified programming model for both batch & stream processing
● Independent from the execution back-end aka “runner”
❯ Google driven & open sourced
● Java 7 or 8 @ github.com/GoogleCloudPlatform/DataflowJavaSDK
● Python
❯ Community sourced
● Scala @ github.com/darkjh/scalaflow
● Scala @ github.com/jhlch/scala-dataflow-dsl
Pipeline
● A Directed Acyclic Graph of data processing transformations
● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce operations
● PCollections flow through the pipeline
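To make the "directed acyclic graph" idea concrete, here is a minimal plain-Python sketch. It is not the Dataflow SDK: the step names, the `steps` table and `run_pipeline` are all hypothetical. Each step declares which upstream steps feed it, and the steps run in dependency order, so one input can fan out to several outputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to (function, list of upstream step names). All names are made up.
steps = {
    "read": (lambda: ["#GoHawks", "NFC Champions #GreenBay", "Green Bay #superbowl!"], []),
    "lowercase": (lambda tweets: [t.lower() for t in tweets], ["read"]),
    "hashtags": (
        lambda tweets: [w for t in tweets for w in t.split() if w.startswith("#")],
        ["lowercase"],
    ),
    "lengths": (lambda tweets: [len(t) for t in tweets], ["lowercase"]),
}


def run_pipeline(steps):
    """Execute steps in dependency order, feeding each one its upstream results."""
    graph = {name: deps for name, (_, deps) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = steps[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


results = run_pipeline(steps)
print(results["hashtags"])
print(results["lengths"])
```

Here `lowercase` feeds two independent branches, which is the "multiple outputs" case from the list above. A real runner would also optimize and distribute the graph rather than execute it step by step in one process.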
Inputs & Outputs
❯ Read from standard Google Cloud Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML, Avro formatted data
(Diagram label: Your Source/Sink Here)
PCollection
❯ A collection of data of type T in a pipeline, e.g. PCollection<KV<K, V>>
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs, using KV
{Seahawks, NFC, Champions, Seattle,
...}
{...,
“NFC Champions #GreenBay”,
“Green Bay #superbowl!”,
...
“#GoHawks”,
...}
Transforms
● A step, or a processing operation, that transforms data
○ Convert format, group, filter data
● Types of transforms
○ ParDo
○ GroupByKey
○ Combine
○ Flatten
■ When multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
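As an illustration of Flatten, the sketch below merges two collections of the same element type into one. It uses plain Python lists in place of PCollections; the `flatten` helper and the variable names are hypothetical, not SDK calls.

```python
from itertools import chain


def flatten(*pcollections):
    """Merge several same-typed collections into one logical collection."""
    return list(chain.from_iterable(pcollections))


nfc_tweets = ["NFC Champions #GreenBay", "Green Bay #superbowl!"]
hawks_tweets = ["#GoHawks"]

all_tweets = flatten(nfc_tweets, hawks_tweets)
print(all_tweets)
```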
ParDo (Parallel Do)
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
❯ Useful for
○ Filtering a data set.
○ Formatting or converting the type of each element in a data set.
○ Extracting parts of each element in a data set.
○ Performing computations on each element in a data set.
{Seahawks, NFC, Champions, Seattle, ...}
KeyBySessionId
{
KV<S, Seahawks>,
KV<C, Champions>,
KV<S, Seattle>,
KV<N, NFC>, …
}
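The element-by-element behaviour of ParDo can be sketched in plain Python. This is a hedged illustration rather than the SDK's `ParDo`/`DoFn` classes: `par_do` and the two functions passed to it are hypothetical, and a generator stands in for a DoFn so that each element can emit zero or more outputs.

```python
def par_do(pcollection, do_fn):
    """Apply do_fn to each element independently; do_fn may emit zero or more outputs."""
    return [output for element in pcollection for output in do_fn(element)]


def key_by_first_letter(word):
    # Emits one (key, value) pair per word, keyed by its first letter.
    yield (word[0], word)


def only_long_words(word):
    # Filtering: emit the element only when it passes the test.
    if len(word) > 3:
        yield word


words = ["Seahawks", "NFC", "Champions", "Seattle"]
keyed = par_do(words, key_by_first_letter)
long_words = par_do(words, only_long_words)
print(keyed)
print(long_words)
```

Since no element depends on any other, a runner is free to split the input across workers, which is what makes the operation "parallel".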
Map -> ParDo
Shuffle -> GroupByKey
Reduce -> ParDo
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
Group by key
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
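For a bounded collection, the grouping step can be sketched in a few lines of plain Python. The `group_by_key` helper is hypothetical and runs in one process; a real runner performs the same grouping as a distributed shuffle.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Gather all values that share a key, like the shuffle step in MapReduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


pairs = [("S", "Seahawks"), ("C", "Champions"), ("S", "Seattle"), ("N", "NFC")]
grouped = group_by_key(pairs)
print(grouped)
```

Note that the function has to see every pair before any group is complete, which is exactly why an unbounded input needs something more.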
GCPUG meetup 201610 - Dataflow Introduction
Advanced Dataflow
Windowing
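Windowing is how the earlier question about grouping an unbounded PCollection gets answered: elements are assigned to finite windows by event time, and grouping then happens per key within each window. A minimal sketch with fixed windows follows. The 60-second window size, the helper names and the sample events are all assumptions for illustration, and the real SDK also offers other window types.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size


def window_start(timestamp):
    """Start of the fixed window that contains this event timestamp (in seconds)."""
    return timestamp - timestamp % WINDOW_SECONDS


def group_by_key_and_window(events):
    """events are (timestamp, key, value) tuples; group values per (window, key)."""
    grouped = defaultdict(list)
    for timestamp, key, value in events:
        grouped[(window_start(timestamp), key)].append(value)
    return dict(grouped)


events = [(5, "S", "Seahawks"), (42, "S", "Seattle"), (61, "N", "NFC"), (75, "S", "Seahawks")]
windowed = group_by_key_and_window(events)
print(windowed)
```

The key "S" now appears in two groups, one per window, so each group is finite even if events keep arriving.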
Trigger
● Triggers control when results are emitted.
● Triggers are often relative to the watermark
http://cdn.oreillystatic.com/en/assets/1/event/155/Watermarks_%20Time%20and%20progress%20in%20streaming%20dataflow%20and%20beyond%20Presentation.pdf
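A rough sketch of a watermark-relative trigger, in plain Python. Everything here is an assumption for illustration: the watermark is estimated as "latest event time seen minus a fixed lag", whereas a real service estimates it from its sources, and `emit_with_watermark` is a hypothetical helper rather than an SDK call. A window's result is emitted once the watermark passes the end of that window.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size
WATERMARK_LAG = 10   # hypothetical: assume events arrive at most 10 s out of order


def emit_with_watermark(events):
    """events: (event_time, value) in arrival order. Returns [(window_start, values)]."""
    open_windows = defaultdict(list)
    emitted = []
    max_event_time = 0
    for event_time, value in events:
        start = event_time - event_time % WINDOW_SECONDS
        open_windows[start].append(value)
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK_LAG
        # Trigger: emit every open window whose end the watermark has reached.
        ready = sorted(w for w in open_windows if w + WINDOW_SECONDS <= watermark)
        for ws in ready:
            emitted.append((ws, open_windows.pop(ws)))
    return emitted


events = [(10, "a"), (55, "b"), (62, "c"), (58, "d"), (71, "e"), (130, "f")]
emitted = emit_with_watermark(events)
print(emitted)
```

The event at time 58 arrives after the one at time 62 but is still counted in the first window, because that window is not emitted until the watermark passes 60. The last window stays open since the watermark never reaches its end.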
Composite Transform
● Code reuse
● Better monitoring experience
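The code-reuse benefit can be illustrated by bundling smaller transforms into one named step. This is a plain-Python sketch, not the SDK's composite transform mechanism: `compose`, `extract_words` and `count_per_element` are hypothetical names.

```python
from collections import Counter


def extract_words(lines):
    """Split each line into words."""
    return [word for line in lines for word in line.split()]


def count_per_element(elements):
    """Count how many times each distinct element occurs."""
    return dict(Counter(elements))


def compose(*transforms):
    """Bundle several transforms into one reusable step, applied left to right."""
    def composite(pcollection):
        for transform in transforms:
            pcollection = transform(pcollection)
        return pcollection
    return composite


count_words = compose(extract_words, count_per_element)
counts = count_words(["go hawks", "go seattle"])
print(counts)
```

Callers only see `count_words`, so the inner steps can be reused or changed in one place.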
Benefits of Cloud Dataflow
● Functional (transform based) programming model
● Unified programming model for batch & stream processing
● Reduced operational cost of “cluster” management
● Decreased job clock time via platform innovation
● Open source ecosystem of SDKs, extensions, runners, etc.
Optimizing Your Time
Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements
Data Processing with Cloud Dataflow: Programming, More time to dig into your data
Cloud Dataflow Runners
Run the same code in multiple modes using different runners
❯ Direct Runner
• For local, in-memory execution
• Great for development and unit tests
❯ Cloud Dataflow Service Runner
• Runs on the fully managed Dataflow Service
• Your code runs distributed across GCE instances
❯ Community sourced
• Spark runner @ github.com/cloudera/spark-dataflow
• Flink runner from dataArtisans
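The idea of running the same code on different runners can be sketched with two hypothetical executors in plain Python. Neither is a real Dataflow runner: `direct_runner` runs sequentially in memory, and `threaded_runner` uses a local thread pool as a stand-in for distributed workers. The transform itself is written once and does not know which one executes it.

```python
from concurrent.futures import ThreadPoolExecutor


def to_upper(word):
    """The user code: written once, independent of the runner."""
    return word.upper()


def direct_runner(transform, elements):
    """Local, in-memory, sequential execution."""
    return [transform(e) for e in elements]


def threaded_runner(transform, elements, workers=4):
    """Stand-in for a distributed runner: same transform, executed by a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, elements))


words = ["Seahawks", "NFC", "Champions", "Seattle"]
local_result = direct_runner(to_upper, words)
pooled_result = threaded_runner(to_upper, words)
print(local_result == pooled_result)
```

Keeping the transform free of runner-specific details is what lets a pipeline be tested locally and then submitted elsewhere unchanged.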
Build a mobile gaming analytics platform
Q&A
