SlideShare a Scribd company logo
Dustin Vannoy
Data Engineer
Cloud + Streaming
Spark Streaming with
Azure Databricks
Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming
© Microsoft Azure + AI Conference All rights reserved.
Agenda
 Shifting to Streaming
 Spark Structured Streaming
 Apache Kafka
 Azure Event Hubs
 Get Hands On
© Microsoft Azure + AI Conference All rights reserved.
Shifting to Streaming
If you haven’t started with streaming, you will soon
Life’s a batch stream
Why Streaming?
Data Engineers have
decided that the
business only updates
in batch
Our customers and
business leaders know
better
© Microsoft Azure + AI Conference All rights reserved.
Is streaming ingestion easier?
 Dealing with a large set of data at
once brings its own challenges
 Process as it comes in for cleaner
logic
 Even if not doing real-time analytics
yet, prepare for when you will
© Microsoft Azure + AI Conference All rights reserved.
Spark Structured Streaming
Technology overview
Why Spark?
Big data and the cloud
changed our mindset.
We want tools that
scale easily as data
size grows.
Spark is a leader in
data processing that
scales across many
machines. It can run
on Hadoop but is
faster and easier than
Map Reduce.
© Microsoft Azure + AI Conference All rights reserved.
What is Spark?
 Fast, general purpose engine for large-scale data processing
 Replaces MapReduce as Hadoop parallel programming API
 Many options:
 Yarn / Spark Cluster / Local
 Scala / Python / Java / R
 Spark Core / SQL / Streaming / ML / Graph
© Microsoft Azure + AI Conference All rights reserved.
What is Spark Structured Streaming?
 Alternative to traditional Spark Streaming which used
DStreams
 If you are familiar with Spark, it is best to think of
Structured Streaming as Spark SQL API but for streaming
 Use import spark.sql.streaming
© Microsoft Azure + AI Conference All rights reserved.
What is Spark Structured Streaming?
Tathagata Das “TD” - Lead Developer on Spark Streaming
 “Fast, fault-tolerant, exactly-once stateful stream
processing without having to reason about streaming"
 "The simplest way to perform streaming analytics is not
having to reason about streaming at all"
 A table that is constantly appended with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
© Microsoft Azure + AI Conference All rights reserved.
Structured Streaming - Read
df = spark.readStream
.format("kafka")
.options(**consumer_config)
.load()
© Microsoft Azure + AI Conference All rights reserved.
Structured Streaming - Write
df.writeStream
.format("kafka")
.options(**producer_config)
.option("checkpointLocation","/tmp/cp001")
.start()
© Microsoft Azure + AI Conference All rights reserved.
Structured Streaming - Window
df2.groupBy(
col("VendorID"),
window(col("pickup_dt"), "10 min"))
.avg("trip_distance")
© Microsoft Azure + AI Conference All rights reserved.
Structured Streaming – Output Mode
 Output triggered on a time interval (defaults to 1 second)
 Append
 Just keep adding newest records
 Complete mode
 Output latest state of table
 Useful for aggregation results
© Microsoft Azure + AI Conference All rights reserved.
Apache Kafka
Technology overview
Why Kafka?
Streaming data
directly from one
system to another
often problematic
Kafka serves as the
scalable broker,
keeping up with
producer and
persisting data for all
consumers
© Microsoft Azure + AI Conference All rights reserved.
The Log
“It is an append-only,
totally-ordered sequence
of records ordered by
time.” - Jay Kreps
Reference: The Log - Jay Kreps
© Microsoft Azure + AI Conference All rights reserved.
Kafka Topics
Feed where records published
Multiple partitions per topic
Order retained within partition
© Microsoft Azure + AI Conference All rights reserved.
Consumers and offset
Offset = record id
Consumers read in order
Multiple consumer per topic
© Microsoft Azure + AI Conference All rights reserved.
Event Hubs
Technology overview
Why Event Hubs?
Same core capability
as Kafka, using PaaS
instead of IaaS
Choose between Kafka
or Event Hub APIs;
avoid operational
overhead of managing
Kafka
© Microsoft Azure + AI Conference All rights reserved.
Event Hubs key concepts
 Namespace = container to hold multiple Event Hubs
 Event Hub = Topic
 Partitions and Consumer Groups
 Same concepts as Kafka
 Minor differences in implementation
 Throughput Units define level of scalability
Eventhub Namespace Setup
 Standard pricing to enable Kafka
 Each Throughput unit
 1 MB/s ingress
 2 MB/s egress
 Auto Inflate to allow autoscale
Eventhub Setup
 Partition count
 Max # of consumers
 Message retention
 More days = More $
 Capture
 Save to Azure Storage
Shared Access Key
Shared Access Key
Demo
Structured
Streaming
+
Event Hubs
for Kafka
© Microsoft Azure + AI Conference All rights reserved.
References
 The Log - Jay Kreps
 https://databricks.com/blog/2016/01/04/introducing-apache-spark-
datasets.html
 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-
spark-apis-rdds-dataframes-and-datasets.html
 https://github.com/Azure/azure-event-hubs-for-kafka
 https://github.com/Azure/azure-event-hubs-spark
© Microsoft Azure + AI Conference All rights reserved.
 Please use EventsXD to fill out a session evaluation.
Thank you!
Ad

More Related Content

What's hot (20)

201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
Mark Tabladillo
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Spark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksSpark as a Service with Azure Databricks
Spark as a Service with Azure Databricks
Lace Lofranco
 
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
Databricks
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
Laurent Leturgez
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
Alberto Diaz Martin
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Vasu S
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
Torsten Steinbach
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Michael Rys
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Microsoft Tech Community
 
Data Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Lakes with Azure Databricks
Data Lakes with Azure Databricks
Data Con LA
 
Using Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on DatabricksUsing Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on Databricks
Databricks
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Azure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake AnalyticsAzure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake Analytics
Waqas Idrees
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
Mark Tabladillo
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Databricks
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
Mark Tabladillo
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Spark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksSpark as a Service with Azure Databricks
Spark as a Service with Azure Databricks
Lace Lofranco
 
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
How Azure Databricks helped make IoT Analytics a Reality with Janath Manohara...
Databricks
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
Alberto Diaz Martin
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Vasu S
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
Torsten Steinbach
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Michael Rys
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Microsoft Tech Community
 
Data Lakes with Azure Databricks
Data Lakes with Azure DatabricksData Lakes with Azure Databricks
Data Lakes with Azure Databricks
Data Con LA
 
Using Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on DatabricksUsing Redash for SQL Analytics on Databricks
Using Redash for SQL Analytics on Databricks
Databricks
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Azure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake AnalyticsAzure Data Lake and Azure Data Lake Analytics
Azure Data Lake and Azure Data Lake Analytics
Waqas Idrees
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
Mark Tabladillo
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Databricks
 

Similar to Spark Streaming with Azure Databricks (20)

Azure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptxAzure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptx
pascalsegoul
 
Azure satpn19 time series analytics with azure adx
Azure satpn19   time series analytics with azure adxAzure satpn19   time series analytics with azure adx
Azure satpn19 time series analytics with azure adx
Riccardo Zamana
 
Concevoir une application scalable dans le Cloud
Concevoir une application scalable dans le CloudConcevoir une application scalable dans le Cloud
Concevoir une application scalable dans le Cloud
Stéphanie Hertrich
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
FedoRam1
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
Kai Wähner
 
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
HostedbyConfluent
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
AIDEVDAY_ Data-in-Motion to Supercharge AI
AIDEVDAY_ Data-in-Motion to Supercharge AIAIDEVDAY_ Data-in-Motion to Supercharge AI
AIDEVDAY_ Data-in-Motion to Supercharge AI
Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Microsoft Azure News - 2018 October
Microsoft Azure News - 2018 OctoberMicrosoft Azure News - 2018 October
Microsoft Azure News - 2018 October
Daniel Toomey
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADX
Riccardo Zamana
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
Building a Real-Time IoT monitoring application with Azure
Building a Real-Time IoT monitoring application with AzureBuilding a Real-Time IoT monitoring application with Azure
Building a Real-Time IoT monitoring application with Azure
Davide Mauri
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
Shu-Jeng Hsieh
 
Stargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIStargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data API
Data Con LA
 
Azure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptxAzure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptx
pascalsegoul
 
Azure satpn19 time series analytics with azure adx
Azure satpn19   time series analytics with azure adxAzure satpn19   time series analytics with azure adx
Azure satpn19 time series analytics with azure adx
Riccardo Zamana
 
Concevoir une application scalable dans le Cloud
Concevoir une application scalable dans le CloudConcevoir une application scalable dans le Cloud
Concevoir une application scalable dans le Cloud
Stéphanie Hertrich
 
Azure Data.pptx
Azure Data.pptxAzure Data.pptx
Azure Data.pptx
FedoRam1
 
When NOT to use Apache Kafka?
When NOT to use Apache Kafka?When NOT to use Apache Kafka?
When NOT to use Apache Kafka?
Kai Wähner
 
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
When NOT to Use Apache Kafka? With Kai Waehner | Current 2022
HostedbyConfluent
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
AIDEVDAY_ Data-in-Motion to Supercharge AI
AIDEVDAY_ Data-in-Motion to Supercharge AIAIDEVDAY_ Data-in-Motion to Supercharge AI
AIDEVDAY_ Data-in-Motion to Supercharge AI
Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Microsoft Azure News - 2018 October
Microsoft Azure News - 2018 OctoberMicrosoft Azure News - 2018 October
Microsoft Azure News - 2018 October
Daniel Toomey
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADX
Riccardo Zamana
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
Building a Real-Time IoT monitoring application with Azure
Building a Real-Time IoT monitoring application with AzureBuilding a Real-Time IoT monitoring application with Azure
Building a Real-Time IoT monitoring application with Azure
Davide Mauri
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
Shu-Jeng Hsieh
 
Stargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIStargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data API
Data Con LA
 
Ad

Recently uploaded (20)

Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Ad

Spark Streaming with Azure Databricks

  • 1. Dustin Vannoy Data Engineer Cloud + Streaming Spark Streaming with Azure Databricks
  • 2. Dustin Vannoy Data Engineering Consultant Co-founder Data Engineering San Diego /in/dustinvannoy @dustinvannoy [email protected] Technologies • Azure & AWS • Spark • Kafka • Python Modern Data Systems • Data Lakes • Analytics in Cloud • Streaming
  • 3. © Microsoft Azure + AI Conference All rights reserved. Agenda  Shifting to Streaming  Spark Structured Streaming  Apache Kafka  Azure Event Hubs  Get Hands On
  • 4. © Microsoft Azure + AI Conference All rights reserved. Shifting to Streaming If you haven’t started with streaming, you will soon
  • 6. Why Streaming? Data Engineers have decided that the business only updates in batch Our customers and business leaders know better
  • 7. © Microsoft Azure + AI Conference All rights reserved. Is streaming ingestion easier?  Dealing with a large set of data at once brings its own challenges  Process as it comes in for cleaner logic  Even if not doing real-time analytics yet, prepare for when you will
  • 8. © Microsoft Azure + AI Conference All rights reserved. Spark Structured Streaming Technology overview
  • 9. Why Spark? Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier than Map Reduce.
  • 10. © Microsoft Azure + AI Conference All rights reserved. What is Spark?  Fast, general purpose engine for large-scale data processing  Replaces MapReduce as Hadoop parallel programming API  Many options:  Yarn / Spark Cluster / Local  Scala / Python / Java / R  Spark Core / SQL / Streaming / ML / Graph
  • 11. © Microsoft Azure + AI Conference All rights reserved. What is Spark Structured Streaming?  Alternative to traditional Spark Streaming which used DStreams  If you are familiar with Spark, it is best to think of Structured Streaming as Spark SQL API but for streaming  Use import spark.sql.streaming
  • 12. © Microsoft Azure + AI Conference All rights reserved. What is Spark Structured Streaming? Tathagata Das “TD” - Lead Developer on Spark Streaming  “Fast, fault-tolerant, exactly-once stateful stream processing without having to reason about streaming"  "The simplest way to perform streaming analytics is not having to reason about streaming at all"  A table that is constantly appended with each micro-batch Reference: https://youtu.be/rl8dIzTpxrI
  • 13. © Microsoft Azure + AI Conference All rights reserved. Structured Streaming - Read df = spark.readStream .format("kafka") .options(**consumer_config) .load()
  • 14. © Microsoft Azure + AI Conference All rights reserved. Structured Streaming - Write df.writeStream .format("kafka") .options(**producer_config) .option("checkpointLocation","/tmp/cp001") .start()
  • 15. © Microsoft Azure + AI Conference All rights reserved. Structured Streaming - Window df2.groupBy( col("VendorID"), window(col("pickup_dt"), "10 min")) .avg("trip_distance")
  • 16. © Microsoft Azure + AI Conference All rights reserved. Structured Streaming – Output Mode  Output triggered on a time interval (defaults to 1 second)  Append  Just keep adding newest records  Complete mode  Output latest state of table  Useful for aggregation results
  • 17. © Microsoft Azure + AI Conference All rights reserved. Apache Kafka Technology overview
  • 18. Why Kafka? Streaming data directly from one system to another often problematic Kafka serves as the scalable broker, keeping up with producer and persisting data for all consumers
  • 19. © Microsoft Azure + AI Conference All rights reserved. The Log “It is an append-only, totally-ordered sequence of records ordered by time.” - Jay Kreps Reference: The Log - Jay Kreps
  • 20. © Microsoft Azure + AI Conference All rights reserved. Kafka Topics Feed where records published Multiple partitions per topic Order retained within partition
  • 21. © Microsoft Azure + AI Conference All rights reserved. Consumers and offset Offset = record id Consumers read in order Multiple consumer per topic
  • 22. © Microsoft Azure + AI Conference All rights reserved. Event Hubs Technology overview
  • 23. Why Event Hubs? Same core capability as Kafka, using PaaS instead of IaaS Choose between Kafka or Event Hub APIs; avoid operational overhead of managing Kafka
  • 24. © Microsoft Azure + AI Conference All rights reserved. Event Hubs key concepts  Namespace = container to hold multiple Event Hubs  Event Hub = Topic  Partitions and Consumer Groups  Same concepts as Kafka  Minor differences in implementation  Throughput Units define level of scalability
  • 25. Eventhub Namespace Setup  Standard pricing to enable Kafka  Each Throughput unit  1 MB/s ingress  2 MB/s egress  Auto Inflate to allow autoscale
  • 26. Eventhub Setup  Partition count  Max # of consumers  Message retention  More days = More $  Capture  Save to Azure Storage
  • 30. © Microsoft Azure + AI Conference All rights reserved. References  The Log - Jay Kreps  https://databricks.com/blog/2016/01/04/introducing-apache-spark- datasets.html  https://databricks.com/blog/2016/07/14/a-tale-of-three-apache- spark-apis-rdds-dataframes-and-datasets.html  https://github.com/Azure/azure-event-hubs-for-kafka  https://github.com/Azure/azure-event-hubs-spark
  • 31. © Microsoft Azure + AI Conference All rights reserved.  Please use EventsXD to fill out a session evaluation. Thank you!

Editor's Notes

  • #2: In the world of data science we often default to processing in nightly or hourly batches, but that pattern is not enough any more. Our customers and business leaders see information is being created all the time and realize it should be available much sooner. While the move to stream processing adds complexity, the tools we have available make it achievable for teams of any size. In this session we will talk about why we need to shift some of our workloads from batch data jobs to streaming in real-time. We'll dive into how Spark Structured Streaming in Azure Databricks enables this along with streaming data systems such as Kafka and EventHub. We will discuss the concepts, how Azure Databricks enables stream processing, and review code examples on a sample data set.
  • #4: Shifting to Streaming: We don’t have to convince our stakeholders that they don’t really need streaming. Understand the needs, find the right uses for streaming, and make it happen. Discuss pros and cons, considerations before going to production, and general use cases in AI/ML Spark, Event Hubs, and Kafka Define the systems we will be using for this session, including some of the reasons we choose them Talk about some of the options for using these together Getting Hands On Review dependencies that are not covered Walk through basic setup of the most important pieces Demo of use case code, highlight some important Structure Streaming components Best Practices Cover things to consider when working with Spark Structured Streaming and Kafka or Event Hubs
  • #7: In the world of data science, those of us who develop ETL pipelines have determined that everything can be processed in nightly or hourly batches, but that only makes sense to data engineers. Our customers and business leaders see information is being created all the time and realize it should be available much sooner.
  • #8: Dealing with a large set of data at once brings its own challenges (a lot of resources at once, large table joins, run out of memory, etc) Process as it comes in for cleaner logic (rather than seeing latest state, we see events as they happen and update state downstream) Even if not doing real-time analytics yet, prepare for when you will - the times they are a’changin
  • #11: A fast and general engine for large-scale data processing, uses memory to provide benefit Often replaces MapReduce as parallel programming api on Hadoop, the way it handles data (RDDs) provides one performance benefit and use of memory when possible provides another large performance benefit Can run on Hadoop (using Yarn) but also as a separate Spark cluster. Local is possible as well but reduces the performance benefits…I find its still a useful API though Run Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying it in Python and Scala and pick whichever is easiest for you. Several modules for different use cases, similar api so you can swap between modes relatively easily. For example, we have both streaming and batch sources of some data and we reuse the rest of the spark processing transformations.
  • #12: A fast and general engine for large-scale data processing, uses memory to provide benefit Often replaces MapReduce as parallel programming api on Hadoop, the way it handles data (RDDs) provides one performance benefit and use of memory when possible provides another large performance benefit Can run on Hadoop (using Yarn) but also as a separate Spark cluster. Local is possible as well but reduces the performance benefits…I find its still a useful API though Run Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying it in Python and Scala and pick whichever is easiest for you. Several modules for different use cases, similar api so you can swap between modes relatively easily. For example, we have both streaming and batch sources of some data and we reuse the rest of the spark processing transformations.
  • #13: A fast and general engine for large-scale data processing, uses memory to provide benefit Often replaces MapReduce as parallel programming api on Hadoop, the way it handles data (RDDs) provides one performance benefit and use of memory when possible provides another large performance benefit Can run on Hadoop (using Yarn) but also as a separate Spark cluster. Local is possible as well but reduces the performance benefits…I find its still a useful API though Run Java, Scala, Python, or R. If you don’t already know one of those languages really well, I recommend trying it in Python and Scala and pick whichever is easiest for you. Several modules for different use cases, similar api so you can swap between modes relatively easily. For example, we have both streaming and batch sources of some data and we reuse the rest of the spark processing transformations.
  • #16: Window is essentially like grouping. Continuously compute the average distance for each vendor over the last 10 minutes
  • #17: Window is essentially like grouping. Continuously compute the average distance for each vendor over the last 10 minutes
  • #30: Quick overview of important databricks workspace segments – Clusters, Tables, Notebooks Open create_parquet_tables notebook and run first few commands as examples of working without delta