Google Dataflow: A Quick Try
Simon Su @ LinkerNetworks
{Google Developer Expert}
https://goo.gl/xkANFT
var simon = {/** I am at GCPUG.TW **/};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = 'simonsu.mail@gmail.com';
simon.say('Good luck to everybody!');
I’m Simon Su...
What… I'm in Japan!!
https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421
● Data scientist
● Data engineer
● Frontend engineer
Before Workshop...
Preparation doc
http://www.slideshare.net/peihsinsu/jcconf2016-dataflow-workshop
Lab doc
http://www.slideshare.net/peihsinsu/jcconf-2016-dataflow-workshop-labs
Or http://goo.gl/nfdBhV
Google Cloud in Big Data Solution
Google Focused Cloud
● 1st Wave: Colocation. Your kit, someone else's building. Yours to manage.
● 2nd Wave: Virtualized Data Centers. Standard virtual kit for rent. Still yours to manage.
● 3rd Wave: An actual, global elastic cloud. Invest your energy in great apps.
● Assembly required; True On Demand Cloud
● Next: Storage, Processing, Memory, Network; Clusters; Containers; Distributed Storage, Processing & Machine Learning
Google Cloud Family
Foundation
Infrastructure & Operations
Data Services
Application
Runtime Services
Enabling No-Touch Operations
Breakthrough Insights,
Breakthrough Applications
The Gear that Powers Google
GCP tools for data processing and analysis
● Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
● Store: Cloud Storage, Cloud Datastore (NoSQL), Cloud SQL (MySQL), BigQuery Storage
● Process: Dataflow, Dataproc
● Analyze: BigQuery, Dataproc, Larger Hadoop Ecosystem
Common big data processing flow
● Devices: smart devices, IoT devices and sensors
● Physical or virtual servers as the frontend data receiver
● A strong queue service for handling large-scale data ingestion
● MapReduce servers for large data transformation, in batch or streaming mode
● A large-scale data store for storing data and serving the query workload
● Scale: 1M devices, 16.6K events/sec, 43B events/month
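The scale figures on this slide fit together under two assumptions that the slide itself does not state: each device sends one event per minute, and a month is counted as 30 days. A small back-of-envelope sketch in plain Java (the class and method names are made up for illustration):

```java
// Back-of-envelope check of the slide's numbers.
// Assumptions (not stated in the slide): one event per device per minute,
// and a 30-day month.
class EventRateSketch {
    // Events per second produced by `devices` devices that each send
    // one event every `secondsBetweenEvents` seconds (integer division).
    static long eventsPerSecond(long devices, long secondsBetweenEvents) {
        return devices / secondsBetweenEvents;
    }

    // Total events over `days` days at a constant per-second rate.
    static long eventsPerMonth(long eventsPerSecond, int days) {
        return eventsPerSecond * 60L * 60L * 24L * days;
    }

    public static void main(String[] args) {
        long perSecond = eventsPerSecond(1_000_000L, 60L);
        long perMonth = eventsPerMonth(perSecond, 30);
        System.out.println(perSecond + " events/sec");   // about 16.6K
        System.out.println(perMonth + " events/month");  // about 43B
    }
}
```

Under these assumptions the result is 16,666 events/sec and roughly 43.2 billion events per month, which is in line with the rounded 16.6K and 43B on the slide.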
Google Big Data Evolution History
● Technologies: GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Spanner, MillWheel
● Timeline: 2002, 2004, 2006, 2008, 2010, 2012, 2013
Look into Dataflow
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Progress & Logs
Dataflow use case
• ETL: Movement, Filtering, Enrichment, Shaping
• Analysis: Reduction, Batch computation, Continuous computation
• Orchestration: Composition, External orchestration, Simulation
Getting Started - Installation
Get your GCP project
Install Eclipse Plugin for Dataflow
Verify your installation
Another Choice...
https://github.com/peihsinsu/gcp-dataflow-java
Getting Started with GCS
Google Cloud Storage Features
● Online cloud import (Cloud Storage Transfer Service)
● Offline import (third party)
● Object lifecycle management
● Object versioning
● Object change notification
● ACLs
● Regional buckets
Create your bucket for Dataflow use
Run Dataflow locally
Create dataflow project
Dataflow Sample
Execute Dataflow
Lab 1: Prepare your Dataflow environment and create your first Dataflow project
● Create GCP project
● Install Eclipse and Dataflow plugin
● Create first Dataflow project
● Run your project
After lab - What is in the GCS bucket
After lab - The Run Configuration
Dataflow in Batch Mode
What we do in a Big Data process...
● Map → ParDo
● Shuffle → GroupByKey
● Reduce → ParDo
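To make the Map / Shuffle / Reduce correspondence concrete, here is a minimal word-count sketch in plain Java collections. It does not use the Dataflow SDK; the class and method names are hypothetical and only mirror the three phases named on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the three phases; not Dataflow SDK code.
class MapShuffleReduceSketch {
    // "Map" (a ParDo in Dataflow terms): split every line into lower-case words.
    static List<String> map(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // "Shuffle" (GroupByKey): collect a 1 under each word for every occurrence.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // "Reduce" (a second ParDo): sum the grouped values per key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello dataflow");
        System.out.println(reduce(shuffle(map(lines))));
        // prints {dataflow=1, hello=2, world=1}
    }
}
```

In a real pipeline each phase would run distributed across workers; the sketch only shows how the data changes shape between phases.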
Pipeline
● A Directed Acyclic Graph (DAG) of data processing
transformations
● Can be submitted to the Dataflow Service for
optimization and execution or executed on an
alternate runner e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce
operations
● PCollections flow through the pipeline
Code Review
// Create the pipeline object
Pipeline p = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());

p.apply(Create.of("Hello", "World"))
 .apply(ParDo.of(new DoFn<String, String>() {
   @Override
   public void processElement(ProcessContext c) {
     c.output(c.element().toUpperCase());
   }
 }))
 .apply(ParDo.of(new DoFn<String, Void>() {
   @Override
   public void processElement(ProcessContext c) {
     LOG.info(c.element());
   }
 }));

// Run the pipeline
p.run();
Inputs & Outputs
Your
Source/Sink
Here
❯ Read from standard Google Cloud Platform
data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching
Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML,
Avro formatted data
Code Review
Pipeline p = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());

// Create the input
p.apply(Create.of("Hello", "World"))
 .apply(ParDo.of(new DoFn<String, String>() {
   @Override
   public void processElement(ProcessContext c) {
     c.output(c.element().toUpperCase());
   }
 }))
 // ParDo (parallel do): write each element to the log
 .apply(ParDo.of(new DoFn<String, Void>() {
   @Override
   public void processElement(ProcessContext c) {
     LOG.info(c.element());
   }
 }));

p.run();
PCollection
❯ A collection of data of type T in a pipeline
- PCollection<T>, e.g. PCollection<KV<K, V>> for key-value data
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs, represented with KV
{Seahawks, NFC, Champions, Seattle,
...}
{...,
“NFC Champions #GreenBay”,
“Green Bay #superbowl!”,
...
“#GoHawks”,
...}
Transforms
● A step, or a processing operation, that transforms data
○ Convert formats, group, or filter data
● Types of transforms
○ ParDo
○ GroupByKey
○ Combine
○ Flatten
■ When multiple PCollection objects contain the same data type, you can
merge them into a single logical PCollection using the Flatten transform
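As a rough illustration of what Flatten does, the sketch below merges several same-typed lists into one in plain Java. It is only an analogy: the names are hypothetical, and the list order it produces is an artifact of the sketch, not something to rely on in a distributed pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy for Flatten; not Dataflow SDK code.
class FlattenSketch {
    // Merge any number of collections of the same element type into one.
    @SafeVarargs
    static <T> List<T> flatten(List<T>... inputs) {
        List<T> merged = new ArrayList<>();
        for (List<T> input : inputs) {
            merged.addAll(input);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> fromFile = Arrays.asList("a", "b");
        List<String> fromQueue = Arrays.asList("c");
        System.out.println(flatten(fromFile, fromQueue)); // prints [a, b, c]
    }
}
```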
Code Review
Pipeline p = Pipeline.create(
    PipelineOptionsFactory.fromArgs(args).withValidation().create());

p.apply(Create.of("Hello", "World"))
 // Transform the input: ParDo (parallel do) converts each
 // input element to upper case in parallel
 .apply(ParDo.of(new DoFn<String, String>() {
   @Override
   public void processElement(ProcessContext c) {
     c.output(c.element().toUpperCase());
   }
 }))
 // ParDo: write each element to the log
 .apply(ParDo.of(new DoFn<String, Void>() {
   @Override
   public void processElement(ProcessContext c) {
     LOG.info(c.element());
   }
 }));

p.run();
ParDo (Parallel do)
❯ Processes each element of a PCollection
independently using a user-provided DoFn
❯ Corresponds to both the Map and Reduce
phases in Hadoop, i.e. ParDo -> GroupByKey (GBK) -> ParDo
❯ Useful for:
• Filtering a data set
• Formatting or converting the type of each element in a data set
• Extracting parts of each element in a data set
• Performing computations on each element in a data set
Example (KeyBySessionId):
{Seahawks, NFC, Champions, Seattle, ...}
→ {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, …}
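The element-wise behaviour of ParDo can be sketched in plain Java: a function is applied to each element on its own and may emit zero or more outputs. This is not the Dataflow SDK API; `parDo` and `keyByFirstLetter` are hypothetical helpers, and keying by first letter is an assumption read from the example keys S, C and N above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Plain-Java illustration of ParDo semantics; not Dataflow SDK code.
class ParDoSketch {
    // Apply fn to every element independently. fn receives the element and an
    // emitter, and may call the emitter zero, one, or many times.
    static <I, O> List<O> parDo(List<I> input, BiConsumer<I, Consumer<O>> fn) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            fn.accept(element, output::add);
        }
        return output;
    }

    // Hypothetical keying step: drop empty strings (filtering) and pair each
    // remaining word with its first letter (converting the element type).
    static List<Map.Entry<String, String>> keyByFirstLetter(List<String> words) {
        return parDo(words, (word, out) -> {
            if (!word.isEmpty()) {
                out.accept(new SimpleEntry<String, String>(word.substring(0, 1), word));
            }
        });
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Seahawks", "NFC", "Champions", "Seattle");
        System.out.println(keyByFirstLetter(words));
        // prints [S=Seahawks, N=NFC, C=Champions, S=Seattle]
    }
}
```

Because each element is handled without reference to any other, the work can be split across machines freely, which is the property that the "parallel do" name refers to.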
Group by key
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
→ {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
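The grouping step can be illustrated with the slide's own example data in plain Java. This is a sketch of the idea only, not the SDK's GroupByKey; the class and method names are hypothetical, and it works here only because the input list is bounded.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of GroupByKey on a bounded input; not SDK code.
class GroupByKeySketch {
    // Gather all values that share a key into one list per key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<String, String>("S", "Seahawks"));
        pairs.add(new SimpleEntry<String, String>("C", "Champions"));
        pairs.add(new SimpleEntry<String, String>("S", "Seattle"));
        pairs.add(new SimpleEntry<String, String>("N", "NFC"));
        System.out.println(groupByKey(pairs));
        // prints {S=[Seahawks, Seattle], C=[Champions], N=[NFC]}
    }
}
```

The loop has to see every pair before a group is complete, which is why an unbounded input needs to be cut into finite pieces before grouping.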
Group by key sample
Lab 2: Deploy your first
project to Google Cloud
Platform
● Check that the Lab 1 project works well
● Deploy to cloud and watch the dataflow
task dashboard
● Implement the Input/Output/Transform
in your project
Dataflow Task Dashboard
Dataflow in Streaming Mode
Pub/Sub working model
Pub/Sub Operations - Topics
Pub/Sub Operations - Subscriber
Simple Guide for Pub/Sub
Ref: https://gcpug-tw.gitbooks.io/google-cloud-platform-in-practice/content/pubsub_getting_start.html
Dataflow with Cloud Pub/Sub Use Case
[Figure: Cloud Pub/Sub message flow. Publishers A, B and C publish Messages 1, 2 and 3 to Topics A, B and C. Subscriptions XA and XB deliver Messages 1 and 2 to Subscriber X; Subscriptions YC and ZC each deliver Message 3, to Subscriber Y and Subscriber Z respectively.]
Globally redundant
Low latency (sub sec.)
N to N coupling
Batched read/write
Push & Pull
Guaranteed Delivery
Auto expiration
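The topic and subscription model can be mimicked with a tiny in-memory sketch in plain Java. It is not the Cloud Pub/Sub client library: the class, its methods, and the topic and subscription names are all hypothetical, and it leaves out acknowledgement, push delivery, and expiration. It only shows that every subscription attached to a topic receives its own copy of each message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// In-memory toy model of topics and subscriptions; not the Pub/Sub client.
class PubSubSketch {
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    private final Map<String, Queue<String>> queueBySubscription = new HashMap<>();

    // Attach a new subscription to a topic, creating the topic if needed.
    void createSubscription(String topic, String subscription) {
        subscriptionsByTopic.computeIfAbsent(topic, k -> new ArrayList<>()).add(subscription);
        queueBySubscription.put(subscription, new ArrayDeque<>());
    }

    // Deliver a copy of the message to every subscription on the topic.
    void publish(String topic, String message) {
        List<String> subscriptions =
            subscriptionsByTopic.getOrDefault(topic, Collections.<String>emptyList());
        for (String subscription : subscriptions) {
            queueBySubscription.get(subscription).add(message);
        }
    }

    // Pull the next message for a subscription, or null if none is waiting.
    String pull(String subscription) {
        Queue<String> queue = queueBySubscription.get(subscription);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        PubSubSketch pubsub = new PubSubSketch();
        pubsub.createSubscription("topic-c", "subscription-yc");
        pubsub.createSubscription("topic-c", "subscription-zc");
        pubsub.publish("topic-c", "Message 3");
        System.out.println(pubsub.pull("subscription-yc")); // prints Message 3
        System.out.println(pubsub.pull("subscription-zc")); // prints Message 3
    }
}
```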
Example of PubSub IO to BigQuery IO
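// Parse the command-line options and enable streaming mode
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);

// Read messages from the Pub/Sub topic (an unbounded input)
PCollection<String> input =
    p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));

// Split the unbounded input into fixed windows of getWindowSize() minutes
PCollection<String> windowedWords =
    input.apply(Window.<String>into(
        FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));

// Count the words within each window (MyCountWords is defined in TestMain)
PCollection<KV<String, Long>> wordCounts =
    windowedWords.apply(new TestMain.MyCountWords());

// Format each count as a table row and write it to BigQuery
wordCounts.apply(ParDo.of(new FormatAsTableRowFn())).apply(
    BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema()));

p.run();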
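Fixed windows are what make grouping possible on an unbounded stream: each element is assigned to a finite time bucket by its timestamp, and counting happens per bucket. The plain-Java sketch below shows that assignment for windows aligned to the epoch. It is a simplification with hypothetical names, not the SDK's FixedWindows, and it ignores watermarks, triggers and late data.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of fixed-window assignment; not Dataflow SDK code.
class FixedWindowSketch {
    // Start (inclusive) of the fixed window that contains the timestamp,
    // assuming windows are aligned to the epoch with no offset.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // Count words per window: outer key is the window start, inner key the word.
    static Map<Long, Map<String, Integer>> countPerWindow(
            long[] timestampsMillis, String[] words, long windowSizeMillis) {
        Map<Long, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < words.length; i++) {
            long start = windowStart(timestampsMillis[i], windowSizeMillis);
            counts.computeIfAbsent(start, k -> new TreeMap<>())
                  .merge(words[i], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        long[] timestamps = {5_000L, 30_000L, 65_000L};
        String[] words = {"a", "a", "b"};
        System.out.println(countPerWindow(timestamps, words, oneMinute));
        // prints {0={a=2}, 60000={b=1}}
    }
}
```

The first two elements fall into the window starting at 0 ms and the third into the window starting at 60,000 ms, so each window can be counted and emitted on its own even though the stream never ends.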
Lab 3: Create a Streaming
Dataflow model
● Create PubSub topic
● Deploy Dataflow streaming sample
● Watch Dataflow task dashboard
Streaming Dataflow Executing Log
Don't forget to shut down the streaming job...
After Dataflow
JCConf 2016 - Google Dataflow: A Quick Try
Analysis tools - BigQuery & Datalab
● BigQuery: An interactive analysis service
● Datalab: An easy tool for analysis and reporting
Google Data Studio
https://datastudio.google.com/
Think about your data orchestration...
More learning resources
● Google Codelab: https://codelabs.developers.google.com/?cat=Cloud
● Dataflow Doc: https://cloud.google.com/dataflow/docs/
--- THANKS ---
