IoT Applications and Patterns using Apache Spark & Apache Bahir
Luciano Resende
June 14th, 2018
© 2018 IBM Corporation
About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to : Jupyter Notebook ecosystem, Apache Bahir, Apache
Toree, Apache Spark among other projects related to AI/ML platforms
lresende@apache.org
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
Open Source Community Leadership
CODAIT
- Founding Partner
- 188+ Project Committers
- 77+ Projects
- Key open source steering committee memberships
- OSS Advisory Board
Center for Open Source Data and AI Technologies
CODAIT
codait.org
codait (French) = coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise
Relaunch of the Spark Technology Center (STC) to reflect expanded mission
Agenda
Introductions
- Apache Spark
- Apache Bahir
IoT Applications
Live Demo
Summary
References
Q&A
Apache Spark
Apache Spark Introduction
- Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
- Spark SQL: executes SQL statements
- Spark Streaming: performs streaming analytics using micro-batches
- Spark ML: common machine learning and statistical algorithms
- Spark GraphX: distributed graph processing framework
A large variety of data sources and formats can be supported, both on-premises and in the cloud: BigInsights (HDFS), Cloudant, dashDB, SQL DB
Apache Spark Evolution
Apache Spark – Spark SQL
- Unified data access APIs: query structured data sets with SQL or the Dataset/DataFrame APIs
- Fast, familiar query language across all of your enterprise data
- Inputs: RDBMS data sources, Structured Streaming data sources
Apache Spark – Spark SQL
You can run SQL statements through the SparkSession.sql(…) interface:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .enableHiveSupport() // "stored as" is Hive DDL and needs Hive support
  .getOrCreate()
spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")
You can further transform the resultant dataset:
val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")
The result is a DataFrame / Dataset[Row]
ds.show() displays the rows
Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…)
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // encoders needed by .as[Bank]
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type (column names and types must match the case class)
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
// select a column from the Dataset; show() prints the first 20 rows of "age" by default
bankFromCSV.select("age").show()
Apache Spark – Spark SQL
You can also configure a specific data source with specific options
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // encoders needed by .as[Bank]
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]
bankFromCSV.select("age").show() // prints the first 20 rows of column "age" by default
Apache Spark – Spark SQL – Data Sources
Data Sources under the covers
- Data source registration (e.g. spark.read.format("datasource"))
- Provide a BaseRelation implementation
• That implements support for table scans:
– TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
- Detailed information available at
• https://developer.ibm.com/code/2016/11/10/exploring-apache-spark-datasource-api/
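As a rough illustration of the BaseRelation and TableScan contract listed above, the sketch below defines a toy V1 data source that exposes a fixed range of integers. The names `RangeSource`, `RangeRelation`, `RangeSourceDemo` and the `rows` option are hypothetical and exist only for this example; `RelationProvider`, `BaseRelation` and `TableScan` come from Spark's `org.apache.spark.sql.sources` package.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical toy source: Spark instantiates this class when it is named in format(...).
class RangeSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters.getOrElse("rows", "5").toInt)
}

// The relation declares a schema and implements a full table scan.
class RangeRelation(val sqlContext: SQLContext, rows: Int)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType, nullable = false)))

  // TableScan: return every row; no column pruning or filter push-down.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until rows).map(i => Row(i))
}

object RangeSourceDemo {
  // Load the toy source by its fully qualified class name.
  def readRange(spark: SparkSession, rows: Int): DataFrame =
    spark.read
      .format(classOf[RangeSource].getName)
      .option("rows", rows.toString)
      .load()
}
```

A source like this cannot prune columns or push filters down, which is what the `PrunedScan` and `PrunedFilteredScan` variants add.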
Apache Spark – Spark SQL – Data Sources
Data Sources V1 Limitations
- Leaks upper-level APIs into the data source (DataFrame/SQLContext)
- Hard to extend the Data Sources API for more optimizations
- No transaction guarantee in the write APIs
- Limited extensibility
Apache Spark – Spark SQL – Data Sources
Data Sources V2
- Support for row-based scan and columnar scan
- Column pruning and filter push-down
- Can report basic statistics and data partitioning
- Transactional write API
- Streaming source and sink support for micro-batch and continuous mode
- Detailed information available at
• https://developer.ibm.com/code/2018/04/16/introducing-apache-spark-data-sources-api-v2/
Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive and batch queries
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Treats the data stream as an unbounded table
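One way to picture the unbounded-table model: each arriving record is appended as a new row, and the query result is updated incrementally rather than recomputed from scratch. The plain-Scala sketch below involves no Spark at all; `Reading` and `RunningCounts` are hypothetical names used only to show that an incrementally maintained count matches a batch query over all rows seen so far.

```scala
// Hypothetical record type standing in for one row of the input table.
final case class Reading(device: String, value: Double)

object RunningCounts {
  // Incremental step: fold one newly appended row into the previous result.
  def update(result: Map[String, Long], row: Reading): Map[String, Long] =
    result.updated(row.device, result.getOrElse(row.device, 0L) + 1L)

  // Batch equivalent: recompute the count from every row seen so far.
  def batch(rows: Seq[Reading]): Map[String, Long] =
    rows.groupBy(_.device).map { case (device, rs) => device -> rs.size.toLong }
}
```

Folding `RunningCounts.update` over the rows in arrival order yields the same map as `RunningCounts.batch` on the full sequence, which is the equivalence the unified programming model relies on.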
Apache Spark – Spark SQL Structured Streaming
Regular (batch) SQL APIs
// schema: a StructType describing the input, defined elsewhere
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.write
  .format("json")
  .save("dest-path")
Structured Streaming APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.writeStream
  .format("json")
  .option("checkpointLocation", "checkpoint-path") // placeholder path; file sinks require a checkpoint directory
  .start("dest-path")
Apache Spark – Spark Streaming
Spark
Streaming
Micro-batch event processing for
near-real time analytics
e.g. Internet of Things (IoT) devices,
Twitter feeds, Kafka (event hub), etc.
No multi-threading or parallel process
programming required
Apache Spark – Spark Streaming
Core abstraction: the discretized stream, or DStream
Abstracts a continuous stream of data
Based on micro-batching
Based on RDDs
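To make the micro-batching idea concrete, here is a plain-Scala sketch (not Spark code) that discretizes a sequence of timestamped events into fixed-interval batches, roughly the way a DStream turns a continuous stream into a series of RDDs. `Event` and `MicroBatcher` are hypothetical names introduced for this illustration.

```scala
// Hypothetical event type: a payload stamped with its arrival time in milliseconds.
final case class Event(timestampMs: Long, payload: String)

object MicroBatcher {
  // Assign each event to the batch whose interval contains its timestamp,
  // then return the batches ordered by their start time.
  def discretize(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => (e.timestampMs / intervalMs) * intervalMs)
      .toSeq
      .sortBy(_._1)
}
```

With a 2000 ms interval, events stamped 100 and 1900 fall into the batch starting at 0, while an event stamped 2100 opens the batch starting at 2000. Each batch can then be processed with ordinary batch operations.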
Apache Spark – Spark Streaming
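import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils
// brokerUrl and topic are assumed to be defined elsewhere (e.g. read from args)
val sparkConf = new SparkConf()
  .setAppName("MQTTWordCount")
// 2-second micro-batches
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()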
Apache Bahir
Origins of the Apache Bahir Project
MAY/2016: Established as a top-level Apache Project.
- PMC formed by Apache Spark committers/PMC members and Apache Members
- Initial contributions imported from Apache Spark
AUG/2016: Apache Flink community joins Apache Bahir
- Initial contributions of Flink extensions
- In October 2016, Robert Metzger was elected committer
Origins of the Bahir name
Naming an Apache Project is a science!
- We needed a name that wasn't used yet
- It needed to be related to Spark
We ended up with: Bahir
- A name of Arabic origin that means "sparkling"
- Also associated with a guy who succeeds at everything
Why Apache Bahir
It’s an Apache project
- And if you are here, you know what it means
Benefits of curating your extensions at Apache Bahir
- Apache Governance
- Apache License
- Apache Community
- Apache Brand
Why Apache Bahir
Flexibility
- Release flexibility
• Bound to platform or component release
Shared infrastructure
- Release, CI, etc
Shared knowledge
- Collaborate with experts on both platform and component areas
Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
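For the Structured Streaming flavor of the MQTT extension, a word count might look like the sketch below. It assumes the Bahir 2.2.0 `spark-sql-streaming-mqtt` artifact, where the source yields `(value, timestamp)` rows; later Bahir releases changed the schema, so treat the `.as[(String, Timestamp)]` cast as version-specific. The broker URL, topic name and the `words` helper are assumptions made for illustration.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object MqttStructuredWordCount {
  // Split a line into non-empty, space-separated words.
  def words(line: String): Seq[String] =
    line.split(" ").toSeq.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val brokerUrl = "tcp://localhost:1883" // assumed local MQTT broker
    val topic = "demo/words"               // hypothetical topic name

    val spark = SparkSession.builder()
      .appName("MqttStructuredWordCount")
      .getOrCreate()
    import spark.implicits._

    // Each MQTT message becomes one row of the unbounded input table.
    val lines = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option("topic", topic)
      .load(brokerUrl)
      .as[(String, Timestamp)]

    // Running word count over all messages received so far.
    val wordCounts = lines.map(_._1).flatMap(words _).groupBy("value").count()

    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

The application would be submitted with the matching `spark-sql-streaming-mqtt` package on the classpath, in the same way as the DStream-based connector.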
Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub
Apache Spark extensions in Bahir
Adding Bahir extensions into your application
- Using SBT
libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
- Using Maven
<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>spark-streaming-mqtt_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
- Spark-shell
bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
- Spark-submit
bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
Internet of Things - IoT
IoT – Definition by Wikipedia
The Internet of things (IoT) is the network of
physical devices, vehicles, home
appliances, and other
items embedded with electronics, software,
sensors, actuators, and network
connectivity which enable these objects to
connect and exchange data.
IoT – Interaction between multiple entities
[Diagram: Things, Software, and People interact through control, observe, inform, command, and actuate relationships]
IoT Patterns – Some of them …
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
MQTT – M2M / IoT Connectivity Protocol
Connect + Publish + Subscribe
History
- ~1999: IBM / Eurotech
- 2010: Published
- 2011: Eclipse M2M / Paho
- 2014: OASIS
- May 2018: V5
Open spec + 40 client implementations
Minimal overhead
- Header: 2-4 bytes (publish), 14 bytes (connect)
Tiny clients (Java 170KB)
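Part of the reason MQTT headers stay so small is that the fixed header is one control byte followed by a variable-length "Remaining Length" field: each byte carries 7 bits of the length plus a continuation bit, using at most four bytes. The sketch below follows that scheme as described in the MQTT 3.1.1 specification; the object name `RemainingLength` is a hypothetical name for this example.

```scala
import scala.collection.mutable.ArrayBuffer

object RemainingLength {
  // Largest value representable in four length bytes.
  val Max: Int = 268435455

  // Encode a length as 1 to 4 bytes, least significant 7-bit group first.
  def encode(length: Int): Array[Byte] = {
    require(length >= 0 && length <= Max, s"length out of range: $length")
    val out = ArrayBuffer.empty[Byte]
    var x = length
    var done = false
    while (!done) {
      var digit = x % 128
      x /= 128
      if (x > 0) digit |= 0x80 // continuation bit: more bytes follow
      out += digit.toByte
      done = x == 0
    }
    out.toArray
  }

  // Decode a well-formed length field (exactly the bytes produced by encode).
  def decode(bytes: Array[Byte]): Int =
    bytes.zipWithIndex.map { case (b, i) => (b & 0x7F) << (7 * i) }.sum
}
```

Packets whose remaining length is below 128 bytes therefore need only a two-byte fixed header, which is typical for small sensor readings.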
MQTT – Quality of Service
MQTT Broker
- QoS 0, at most once: no connection failover, never duplicates
- QoS 1, at least once: has connection failover, can duplicate
- QoS 2, exactly once: has connection failover, never duplicates
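The three delivery guarantees can be compared with a small, deliberately simplified simulation. It is a plain-Scala model, not an MQTT client: each transmission attempt has a fixed outcome, QoS 1 retries until acknowledged, and QoS 2 is modeled as "retry plus de-duplication" instead of the real four-packet handshake (PUBLISH, PUBREC, PUBREL, PUBCOMP). All names here are hypothetical.

```scala
// Outcome of one transmission attempt over an unreliable link.
sealed trait Attempt
case object Lost extends Attempt    // message never reached the receiver
case object AckLost extends Attempt // message arrived, acknowledgement did not
case object Acked extends Attempt   // message arrived and was acknowledged

object QosSim {
  // Attempts the sender actually makes: it stops after the first acknowledged one.
  private def untilAcked(attempts: Seq[Attempt]): Seq[Attempt] = {
    val (before, rest) = attempts.span(_ != Acked)
    before ++ rest.take(1)
  }

  // Number of copies of one message that the receiving application sees.
  def delivered(qos: Int, attempts: Seq[Attempt]): Int = qos match {
    case 0 =>
      // Fire and forget: only the first attempt is ever made.
      if (attempts.headOption.exists(_ != Lost)) 1 else 0
    case 1 =>
      // Retry until acknowledged; every arrival is handed to the application.
      untilAcked(attempts).count(_ != Lost)
    case 2 =>
      // Retry until acknowledged; repeats of the same packet id are dropped.
      math.min(1, untilAcked(attempts).count(_ != Lost))
    case other =>
      throw new IllegalArgumentException(s"unknown QoS level: $other")
  }
}
```

For the attempt sequence `Seq(AckLost, Lost, Acked)`, QoS 1 delivers two copies because the first arrival was never acknowledged, while QoS 2 delivers one. For `Seq(Lost, Acked)`, QoS 0 delivers nothing.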
MQTT – World usage
Smart Home Automation
Messaging
Notable Mentions:
- IBM IoT Platform
- AWS IoT
- Microsoft IoT Hub
- Facebook Messenger
Live Demo
IoT Simulator using MQTT
The demo environment
https://github.com/lresende/bahir-iot-demo
- Elevator simulator: Node.js web app that simulates elevator IoT devices
- Metrics:
• Weight
• Speed
• Power
• Temperature
• System
- MQTT broker: Mosquitto
Summary
Summary – Take away points
Apache Spark
- IoT analytics runtime with support for "Continuous Applications"
Apache Bahir
- Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
- Use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming
Join the Apache Bahir community
References
Apache Bahir
http://bahir.apache.org
Documentation for Apache Spark extensions
http://bahir.apache.org/docs/spark/current/documentation/
Source Repositories
https://github.com/apache/bahir
https://github.com/apache/bahir-website
Demo Repository
https://github.com/lresende/bahir-iot-demo
March 30 2018 / © 2018 IBM Corporation
