IBM Spark Technology Center
Paris Open Source Summit – Apache Software Foundation – Dec 2017
Building IoT Applications with
Apache Spark and Apache Bahir
Luciano Resende
IBM | Spark Technology Center
About me - Luciano Resende
Data Science Platform Architect – IBM – Spark Technology Center
• Have been contributing to open source at the ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, Apache Toree, among other projects related to the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
Open Source Community Leadership
Spark Technology Center
• Founding Partner
• 188+ Project Committers
• 77+ Projects
• Key open source steering committee memberships
• OSS Advisory Board
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications: http://bigdatauniversity.com
Key statistics:
About 40 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark: http://jiras.spark.tc
Apache SystemML is now a top-level Apache project!
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
Agenda
Introductions
Apache Spark
Apache Bahir
IoT Applications
Live Demo
Summary
References
Apache Spark
Apache Spark Introduction
What is Apache Spark?
• Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• Spark ML: common machine learning and statistical algorithms
• Spark GraphX: distributed graph processing framework
• Data sources: a large variety of data sources and formats can be supported, both on-premises and in the cloud, e.g. BigInsights (HDFS), Cloudant, dashDB, SQL DB
Apache Spark Evolution
Apache Spark – Spark SQL
▪ Unified data access APIs: query structured data sets with SQL or the Dataset/DataFrame APIs
▪ Fast, familiar query language across all of your enterprise data
▪ Diagram labels: RDBMS, Data Sources, Structured Streaming
Apache Spark – Spark SQL
You can run SQL statements with the SparkSession.sql(…) interface:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
// "stored as parquet" is Hive syntax, so this statement needs a session with Hive support
spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")
You can further transform the resultant dataset:
val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")
The result is a DataFrame / Dataset[Row]
ds.show() displays the rows
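The query-then-transform flow above can be tried without a Hive metastore. The following is a minimal sketch, not the presenter's demo code: it assumes Spark 2.x on the classpath, runs in local mode, and replaces the Parquet table with a few made-up rows registered as a temporary view named T1. The object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  // Registers a few hypothetical rows as T1 and returns sum(c2) per c1, ordered by c1.
  def sumC2ByC1(spark: SparkSession): Seq[(Int, Long)] = {
    import spark.implicits._ // enables .toDF on local collections

    // In-memory stand-in for the table T1 (c1 int, c2 int)
    Seq((1, 10), (1, 20), (2, 5)).toDF("c1", "c2").createOrReplaceTempView("T1")

    val ds = spark.sql("select * from T1")
    val sums = ds.groupBy("c1").agg("c2" -> "sum").orderBy("c1")
    sums.show() // prints the aggregated rows to the console

    // Integer sums come back as Long (the sum(c2) column is a bigint)
    sums.collect().map(row => (row.getInt(0), row.getLong(1))).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    try println(sumC2ByC1(spark)) finally spark.stop()
  }
}
```

A temporary view lives only for the lifetime of the session, which keeps the sketch free of side effects on disk.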
Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…)
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // encoders needed by .as[Bank]
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// .as[Bank] requires the columns read from the file to match Bank's field names and types
// loading CSV data to a Dataset of Bank type
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
// loading JSON data to a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
// select a column value from the Dataset
bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
Apache Spark – Spark SQL
You can also configure a specific data source with specific options
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
import spark.implicits._ // encoders needed by .as[Bank]
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading csv data to a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]
bankFromCSV.select('age).show() // will return all rows of column "age" from this dataset.
Apache Spark – Spark SQL
Data Sources under the covers
• Data source registration (e.g. spark.read.datasource)
• Provide BaseRelation implementation
• That implements support for table scans:
• TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
• Detailed information available at
• http://www.spark.tc/exploring-the-apache-spark-datasource-api/
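To make the BaseRelation and TableScan roles concrete, here is a minimal sketch of a custom data source against the Spark 2.x `org.apache.spark.sql.sources` API. The names RangeRelation, RangeSource and the option "n" are hypothetical, chosen only for illustration; the source simply exposes the integers from 0 until n as a one-column table. It is meant to be compiled as part of an application rather than pasted into spark-shell.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation: a single integer column holding 0 until n.
class RangeRelation(override val sqlContext: SQLContext, n: Int)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType, nullable = false)))

  // TableScan returns every row; there is no column pruning or filter push-down.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(i => Row(i))
}

// Entry point that Spark instantiates when the source is named in format(...).
class RangeSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters.getOrElse("n", "10").toInt)
}

object RangeSourceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RangeSourceDemo")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    val df = spark.read
      .format(classOf[RangeSource].getName) // resolved by fully qualified class name
      .option("n", "5")
      .load()
    df.show()
    spark.stop()
  }
}
```

PrunedScan and PrunedFilteredScan follow the same shape, with buildScan receiving the required columns (and filters) so the source can skip unneeded data.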
Apache Spark – Spark SQL Structured Streaming
Unified programming model for streaming, interactive and batch queries
Considers the data stream as an unbounded table
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Apache Spark – Spark SQL Structured Streaming
SQL regular APIs
// schema is a StructType defined elsewhere
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.write
  .format("json")
  .save("dest-path")
Structured Streaming APIs
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.writeStream
  .format("json")
  .option("checkpointLocation", "checkpoint-path") // file sinks require a checkpoint location
  .start("dest-path")
Apache Spark – Spark Streaming
▪ Micro-batch event processing for near-real-time analytics
▪ e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc.
▪ No multi-threading or parallel process programming required
Apache Spark – Spark Streaming
Core abstraction: the discretized stream, or DStream
Abstracts a continuous stream of data
Based on micro-batching
Based on RDDs
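The micro-batching idea can be illustrated without Spark at all. The sketch below is plain Scala and is not how Spark implements DStreams; it only shows the principle of cutting a continuous stream into fixed time intervals and running the same batch computation (a word count here) on each interval. The Event type and the sample data are hypothetical.

```scala
object MicroBatchSketch {
  // Hypothetical event: arrival time in milliseconds and a text payload.
  final case class Event(timestampMs: Long, text: String)

  // Assigns each event to the batch whose interval contains its timestamp,
  // then runs the same word count on every batch independently.
  def wordCountsPerBatch(events: Seq[Event], batchMs: Long): Map[Long, Map[String, Int]] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .map { case (batch, batchEvents) =>
        val words = batchEvents.flatMap(_.text.split(" ")).filter(_.nonEmpty)
        batch -> words.groupBy(identity).map { case (w, ws) => w -> ws.size }
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "up down"), Event(1900, "up"), Event(2100, "stop"))
    wordCountsPerBatch(events, batchMs = 2000).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In Spark Streaming each such batch is an RDD, so the per-batch computation is distributed across the cluster instead of running on a local collection.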
Apache Spark – Spark Streaming
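import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils
// brokerUrl (e.g. "tcp://localhost:1883") and topic are defined elsewhere
val sparkConf = new SparkConf()
  .setAppName("MQTTWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2)) // 2-second micro-batches
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()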
Apache Bahir
Origins of the Apache Bahir Project
MAY/2016: Established as a top-level Apache project.
• PMC formed by Apache Spark committers/PMC members and Apache Members
• Initial contributions imported from Apache Spark
AUG/2016: Flink community joins Apache Bahir
• Initial contributions of Flink extensions
• In October 2016, Robert Metzger was elected committer
Origins of the Bahir name
Naming an Apache project is a science!
• We needed a name that wasn't used yet
• It needed to be related to Spark
We ended up with: Bahir
• A name of Arabic origin that means "sparkling"
• Also associated with a guy who succeeds at everything
Why Apache Bahir
It’s an Apache project
• And if you are here, you know what it means
What are the benefits of curating your extensions at Apache Bahir?
• Apache Governance
• Apache License
• Apache Community
• Apache Brand
Why Apache Bahir
Flexibility
• Release flexibility
• Bound to platform or component releases
Shared infrastructure
• Release, CI, etc
Shared knowledge
• Collaborate with experts on both platform and component areas
Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
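As an illustration of the Structured Streaming flavour of the MQTT extension, the sketch below follows the word-count example in the Bahir 2.2.0 documentation for spark-sql-streaming-mqtt. It assumes that release's source schema (a string value plus a timestamp); later Bahir releases changed the schema, so treat the `.as[(String, Timestamp)]` step as version-specific. The broker URL and topic are taken from the command line and are assumptions, as is the helper name tokenize.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object MqttStructuredWordCount {
  // Splits a message into words, dropping empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split(" ").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: args are a broker URL such as tcp://localhost:1883 and a topic name
    val Array(brokerUrl, topic) = args

    val spark = SparkSession.builder()
      .appName("MqttStructuredWordCount")
      .getOrCreate()
    import spark.implicits._

    // Each MQTT message becomes a row of (value, timestamp) in an unbounded table
    val lines = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option("topic", topic)
      .load(brokerUrl)
      .as[(String, Timestamp)]

    val wordCounts = lines
      .map(_._1)
      .flatMap(line => tokenize(line))
      .groupBy("value")
      .count()

    // Print the full, updated counts to the console after every trigger
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

Running it requires the Bahir package on the classpath and a reachable MQTT broker such as Mosquitto.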
Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub
• https://issues.apache.org/jira/browse/BAHIR-116
Apache Spark extensions in Bahir
Adding Bahir extensions into your application
• Using SBT
• libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
• Using Maven
• <dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>spark-streaming-mqtt_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
• spark-shell
• bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
• spark-submit
• bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
IoT - Internet of Things
IoT – Definition by Wikipedia
The Internet of things (IoT) is the network of physical devices, vehicles, home
appliances, and other items embedded with electronics, software, sensors,
actuators, and network connectivity which enable these objects to connect and
exchange data.
IoT – Interaction between multiple entities
Diagram: Things, Software and People, linked by "actuate" and "inform" interactions
IoT Ecosystem in a Nutshell
• Manufacturer: Chipset, Board, Appliance
• Cloud
• Service provider
• Consumer
• IoT Platform: Connectivity, Security, Analysis, Management, Integration
IoT Patterns – Some of them …
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
IoT Patterns – Real-time decisions
• Action is triggered if an anomaly (+/-) is identified
• MTTR (mean time to repair) is critical
• High throughput might hide real issues
• QoS tradeoffs
• Payload size and format
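One simple way to decide that a reading is an anomaly in either direction (+/-) is to compare it with the recent history of the same sensor. The plain-Scala sketch below is an assumption made for illustration, not the detection logic used in the talk: it flags a reading that deviates from the mean of the preceding window by more than k standard deviations. The function names, the window size and the sample temperatures are hypothetical.

```scala
object AnomalySketch {
  // True when `reading` deviates from the window mean by more than k standard deviations.
  def isAnomaly(window: Seq[Double], reading: Double, k: Double = 3.0): Boolean =
    if (window.size < 2) false
    else {
      val mean = window.sum / window.size
      val variance = window.map(x => math.pow(x - mean, 2)).sum / window.size
      math.abs(reading - mean) > k * math.sqrt(variance)
    }

  // Indices of anomalous readings, each compared with the `windowSize` readings before it.
  def anomalies(readings: Seq[Double], windowSize: Int, k: Double = 3.0): Seq[Int] =
    readings.indices.filter { i =>
      i >= windowSize && isAnomaly(readings.slice(i - windowSize, i), readings(i), k)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical temperature readings with one spike
    val temperatures = Seq(20.0, 21.0, 20.0, 21.0, 20.0, 35.0, 20.0)
    println(anomalies(temperatures, windowSize = 4))
  }
}
```

The window size and k trade detection speed against false alarms; with a perfectly constant window any change at all is flagged, which may or may not be desirable for a given sensor.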
MQTT – M2M / IoT Connectivity Protocol
Connect + Publish + Subscribe
History
• 1999: IBM / Eurotech
• 2010: Published
• 2011: Eclipse M2M / Paho
• 2014: OASIS
• Soon: V5
Open spec
• 40+ client implementations
Minimal overhead
• Header: 2-4 bytes (publish), 14 bytes (connect)
Tiny clients
• Java: 170 KB
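The small header comes from MQTT's fixed header layout: one byte carrying the packet type and flags, followed by a variable-length "remaining length" field of 1 to 4 bytes. The plain-Scala sketch below encodes that fixed header for a PUBLISH packet following the MQTT 3.1.1 layout; the object and method names are hypothetical, and the variable header and payload are not covered.

```scala
import scala.collection.mutable.ArrayBuffer

object MqttFixedHeader {
  // Encodes the remaining-length field: 7 bits of data per byte,
  // with the top bit set when another byte follows (1 to 4 bytes in total).
  def encodeRemainingLength(length: Int): Array[Byte] = {
    require(length >= 0 && length <= 268435455, "remaining length out of range")
    val out = ArrayBuffer.empty[Byte]
    var x = length
    var more = true
    while (more) {
      val digit = x % 128
      x = x / 128
      more = x > 0
      out += (if (more) digit | 0x80 else digit).toByte
    }
    out.toArray
  }

  // Fixed header of a PUBLISH packet: type 3 in the high nibble,
  // then the DUP, QoS (2 bits) and RETAIN flags in the low nibble.
  def publishHeader(qos: Int, remainingLength: Int,
                    dup: Boolean = false, retain: Boolean = false): Array[Byte] = {
    require(qos >= 0 && qos <= 2, "QoS must be 0, 1 or 2")
    val flags = (if (dup) 0x08 else 0) | (qos << 1) | (if (retain) 0x01 else 0)
    val first = ((3 << 4) | flags).toByte
    first +: encodeRemainingLength(remainingLength)
  }

  def main(args: Array[String]): Unit = {
    val header = publishHeader(qos = 1, remainingLength = 10)
    println(header.map(b => f"0x${b & 0xFF}%02X").mkString(" "))
  }
}
```

When the remaining length is below 128, the whole fixed header fits in 2 bytes, which is what makes MQTT attractive for constrained devices and networks.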
MQTT – Quality of Service
MQTT Broker
• QoS0 (at most once): no connection failover; never duplicates
• QoS1 (at least once): has connection failover; can duplicate
• QoS2 (exactly once): has connection failover; never duplicates
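The QoS level is chosen per message by the publisher. As a minimal sketch, assuming the Eclipse Paho Java client (org.eclipse.paho.client.mqttv3) is on the classpath and a broker is reachable, the following publishes one message at each level. The broker URL, topic and payload strings are assumptions supplied for illustration, as are the object and helper names.

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object QosPublishSketch {
  // Builds a message with the requested QoS level (0, 1 or 2).
  def message(payload: String, qos: Int): MqttMessage = {
    val msg = new MqttMessage(payload.getBytes("UTF-8"))
    msg.setQos(qos)
    msg
  }

  def main(args: Array[String]): Unit = {
    // Assumption: args are a broker URL such as tcp://localhost:1883 and a topic name
    val Array(brokerUrl, topic) = args
    val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    client.connect()
    try {
      client.publish(topic, message("reading that may be lost", qos = 0))
      client.publish(topic, message("reading that must arrive, duplicates tolerated", qos = 1))
      client.publish(topic, message("reading that must arrive exactly once", qos = 2))
    } finally {
      client.disconnect()
      client.close()
    }
  }
}
```

Higher QoS levels cost extra round trips between client and broker, which is the trade-off to weigh for high-frequency sensor data.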
MQTT – World usage
Smart Home Automation
Messaging
Notable Mentions:
• IBM IoT Platform
• AWS IoT
• Microsoft IoT Hub
• Facebook Messenger
Live Demo
IoT Simulator using MQTT
The demo environment
https://github.com/lresende/bahir-iot-demo
Node.js Web app
• Simulates elevator IoT devices
Elevator simulator metrics:
• Weight
• Speed
• Power
• Temperature
• System
MQTT broker: Mosquitto
Summary
Summary – Take away points
Apache Spark
• IoT analytics runtime with support for "Continuous Applications"
Apache Bahir
• Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
• Use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming
Join the Apache Bahir community !!!
References
Apache Bahir
http://bahir.apache.org
Documentation for Apache Spark extensions
http://bahir.apache.org/docs/spark/current/documentation/
Source Repositories
https://github.com/apache/bahir
https://github.com/apache/bahir-website
Demo Repository
https://github.com/lresende/bahir-iot-demo
Ad

More Related Content

What's hot (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
DataWorks Summit/Hadoop Summit
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
 
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow ManagerBreathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Artem Ervits
 
Hadoop first ETL on Apache Falcon
Hadoop first ETL on Apache FalconHadoop first ETL on Apache Falcon
Hadoop first ETL on Apache Falcon
DataWorks Summit
 
Redis for Security Data : SecurityScorecard JVM Redis Usage
Redis for Security Data : SecurityScorecard JVM Redis UsageRedis for Security Data : SecurityScorecard JVM Redis Usage
Redis for Security Data : SecurityScorecard JVM Redis Usage
Timothy Spann
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Lucidworks
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
Hortonworks
 
Spark Security
Spark SecuritySpark Security
Spark Security
Yifeng Jiang
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
DataWorks Summit
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
Hortonworks
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks FusionWebinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Lucidworks
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Oracle Office Hours - Exposing REST services with APEX and ORDS
Oracle Office Hours - Exposing REST services with APEX and ORDSOracle Office Hours - Exposing REST services with APEX and ORDS
Oracle Office Hours - Exposing REST services with APEX and ORDS
Doug Gault
 
Apache Zeppelin Helium and Beyond
Apache Zeppelin Helium and BeyondApache Zeppelin Helium and Beyond
Apache Zeppelin Helium and Beyond
DataWorks Summit/Hadoop Summit
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
Joseph Niemiec
 
20150627 bigdatala
20150627 bigdatala20150627 bigdatala
20150627 bigdatala
gethue
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data points
Kasper Sørensen
 
Full Stack Scala
Full Stack ScalaFull Stack Scala
Full Stack Scala
Ramnivas Laddad
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
 
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow ManagerBreathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Artem Ervits
 
Hadoop first ETL on Apache Falcon
Hadoop first ETL on Apache FalconHadoop first ETL on Apache Falcon
Hadoop first ETL on Apache Falcon
DataWorks Summit
 
Redis for Security Data : SecurityScorecard JVM Redis Usage
Redis for Security Data : SecurityScorecard JVM Redis UsageRedis for Security Data : SecurityScorecard JVM Redis Usage
Redis for Security Data : SecurityScorecard JVM Redis Usage
Timothy Spann
 
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, LucidworksVisualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Lucidworks
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
Hortonworks
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
DataWorks Summit
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
Hortonworks
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks FusionWebinar: Replace Google Search Appliance with Lucidworks Fusion
Webinar: Replace Google Search Appliance with Lucidworks Fusion
Lucidworks
 
Oracle Office Hours - Exposing REST services with APEX and ORDS
Oracle Office Hours - Exposing REST services with APEX and ORDSOracle Office Hours - Exposing REST services with APEX and ORDS
Oracle Office Hours - Exposing REST services with APEX and ORDS
Doug Gault
 
20150627 bigdatala
20150627 bigdatala20150627 bigdatala
20150627 bigdatala
gethue
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data points
Kasper Sørensen
 

Similar to Building iot applications with Apache Spark and Apache Bahir (20)

H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
Sri Ambati
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
Eran Rom
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
Sri Ambati
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
ApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache Tuscany
ApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache TuscanyApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache Tuscany
ApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache Tuscany
Jean-Sebastien Delfino
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
Sri Ambati
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
Sri Ambati
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
Storlets fb session_16_9
Storlets fb session_16_9Storlets fb session_16_9
Storlets fb session_16_9
Eran Rom
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
Sri Ambati
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
ApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache Tuscany
ApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache TuscanyApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache Tuscany
ApacheCon NA 2010 - Developing Composite Apps for the Cloud with Apache Tuscany
Jean-Sebastien Delfino
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
Sri Ambati
 
Ad

Building IoT Applications with Apache Spark and Apache Bahir

  • 1. IBM Spark Technology Center. Paris Open Source Summit – Apache Software Foundation – Dec 2017. Building IoT Applications with Apache Spark and Apache Bahir. Luciano Resende, IBM | Spark Technology Center
  • 2. About me - Luciano Resende
      Data Science Platform Architect – IBM – Spark Technology Center
      • Have been contributing to open source at the ASF for over 10 years
      • Currently contributing to the Jupyter Notebook ecosystem, Apache Bahir, Apache Spark, and Apache Toree, among other projects related to the Apache Spark ecosystem
      • lresende@apache.org
      • http://lresende.blogspot.com/
      • https://www.linkedin.com/in/lresende
      • @lresende1975
      • https://github.com/lresende
  • 3. Open Source Community Leadership: Spark Technology Center; Founding Partner; 188+ Project Committers; 77+ Projects; Key open source steering committee memberships; OSS Advisory Board
  • 4. IBM Spark Technology Center
      Founded in 2015.
      Location:
      • Physical: 505 Howard St., San Francisco CA
      • Web: http://spark.tc Twitter: @apachespark_tc
      Mission:
      • Contribute intellectual and technical capital to the Apache Spark community.
      • Make the core technology enterprise- and cloud-ready.
      • Build data science skills to drive intelligence into business applications: http://bigdatauniversity.com
      Key statistics:
      • About 40 developers, co-located with 25 IBM designers.
      • Major contributions to Apache Spark: http://jiras.spark.tc
      • Apache SystemML is now a top-level Apache project!
      • Founding member of UC Berkeley AMPLab and RISE Lab
      • Member of R Consortium and Scala Center
  • 5. Agenda: Introductions, Apache Spark, Apache Bahir, IoT Applications, Live Demo, Summary, References
  • 7. Apache Spark Introduction: What is Apache Spark?
      • Spark Core: general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions
      • Spark SQL: executes SQL statements
      • Spark Streaming: performs streaming analytics using micro-batches
      • Spark ML: common machine learning and statistical algorithms
      • Spark GraphX: distributed graph processing framework
      • A large variety of data sources and formats can be supported, both on-premise and in the cloud: BigInsights (HDFS), Cloudant, dashDB, SQL DB
  • 9. Apache Spark – Spark SQL
      • Unified data access APIs: query structured data sets with SQL or the Dataset/DataFrame APIs
      • Fast, familiar query language across all of your enterprise data
      • Sources: RDBMS data sources, Structured Streaming data sources
  • 10. Apache Spark – Spark SQL
      You can run SQL statements with the SparkSession.sql(…) interface:
          import org.apache.spark.sql.SparkSession

          val spark = SparkSession.builder()
            .appName("Demo")
            .getOrCreate()
          // "stored as parquet" is Hive DDL; it needs .enableHiveSupport() on the builder
          spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
          val ds = spark.sql("select * from T1")
      You can further transform the resulting dataset:
          val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
          val ds2 = ds.orderBy("c1")
      The result is a DataFrame / Dataset[Row]; ds.show() displays the rows.
  • 11. Apache Spark – Spark SQL
      You can read from data sources using SparkSession.read.format(…):
          import org.apache.spark.sql.SparkSession

          val spark = SparkSession.builder()
            .appName("Demo")
            .getOrCreate()
          import spark.implicits._ // encoders for .as[Bank] and the 'age column syntax

          case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

          // Load CSV data into a Dataset of Bank type
          // (.as[Bank] needs column names and types that match the case class fields)
          val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]

          // Load JSON data into a Dataset of Bank type
          val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]

          // Select a column from the Dataset
          bankFromCSV.select('age).show()
      The last call shows the rows of the "age" column from this dataset.
  • 12. Apache Spark – Spark SQL
      You can also configure a data source with specific options:
          import org.apache.spark.sql.SparkSession

          val spark = SparkSession.builder()
            .appName("Demo")
            .getOrCreate()
          import spark.implicits._

          case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

          // Load CSV data into a Dataset of Bank type
          val bankFromCSV = spark.read
            .option("header", "true")      // Use first line of all files as header
            .option("inferSchema", "true") // Automatically infer data types
            .option("delimiter", " ")
            .csv("/users/lresende/data.csv")
            .as[Bank]

          bankFromCSV.select('age).show() // shows the rows of the "age" column
  • 13. Apache Spark – Spark SQL: Data Sources under the covers
      • Data source registration (e.g. spark.read.datasource)
      • Provide a BaseRelation implementation
      • The relation implements one of the table scan interfaces: TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
      • Detailed information is available at http://www.spark.tc/exploring-the-apache-spark-datasource-api/
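The registration and scan steps listed on this slide can be sketched as a tiny custom source. This is a minimal illustration against the Spark 2.x `org.apache.spark.sql.sources` API, not code from the talk: the names `RangeSource`, `RangeRelation`, and the `rows` option are hypothetical, and the example assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical provider: Spark instantiates it from the class name passed to format(...).
class RangeSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new RangeRelation(sqlContext, parameters.getOrElse("rows", "5").toInt)
}

// Hypothetical relation: one integer column, served through a full table scan.
class RangeRelation(val sqlContext: SQLContext, rows: Int)
    extends BaseRelation with TableScan {

  // The schema Spark SQL will expose for this source.
  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType, nullable = false)))

  // TableScan: return every row; no column pruning or filter push-down.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until rows).map(Row(_))
}

object RangeSourceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Demo")
      .master("local[*]") // local master assumed for a quick trial run
      .getOrCreate()

    // "rows" is an option defined by this sketch, not a Spark option.
    val df = spark.read
      .format(classOf[RangeSource].getName)
      .option("rows", "3")
      .load()

    df.show()
    spark.stop()
  }
}
```

Moving from `TableScan` to `PrunedScan` or `PrunedFilteredScan` lets Spark hand the relation the required columns and filters, so the source can avoid reading data that the query discards.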
  • 14. Apache Spark – Spark SQL Structured Streaming
      • Unified programming model for streaming, interactive, and batch queries
      • Treats the data stream as an unbounded table
      • Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 15. Apache Spark – Spark SQL Structured Streaming
      Regular SQL APIs:
          val spark = SparkSession.builder()
            .appName("Demo")
            .getOrCreate()
          // schema is a StructType describing the input, defined elsewhere
          val input = spark.read
            .schema(schema)
            .format("csv")
            .load("input-path")
          val result = input
            .select("age")
            .where("age > 18")
          result.write
            .format("json")
            .save("dest-path")
      Structured Streaming APIs:
          val spark = SparkSession.builder()
            .appName("Demo")
            .getOrCreate()
          val input = spark.readStream
            .schema(schema)
            .format("csv")
            .load("input-path")
          val result = input
            .select("age")
            .where("age > 18")
          // The slide used the pre-release startStream(...); released Spark 2.x uses writeStream/start
          result.writeStream
            .format("json")
            .option("checkpointLocation", "checkpoint-path") // placeholder path; file sinks require one
            .start("dest-path")
  • 16. Apache Spark – Spark Streaming
      • Micro-batch event processing for near-real-time analytics
      • e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc.
      • No multi-threading or parallel process programming required
  • 17. Apache Spark – Spark Streaming
      • Also known as a discretized stream, or DStream
      • Abstracts a continuous stream of data
      • Based on micro-batching
      • Based on RDDs
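To make the micro-batching idea concrete, here is a plain-Scala sketch with no Spark dependency. It only imitates the concept: events are grouped into fixed time intervals and the same computation runs on each group, much as a DStream runs it on each RDD. The `Event` type and function names are hypothetical.

```scala
object MicroBatchSketch {
  // An event carries an arrival time in milliseconds and a text payload.
  final case class Event(timestampMs: Long, text: String)

  // Assign each event to a batch by integer-dividing its timestamp by the interval.
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timestampMs / intervalMs).toSeq.sortBy(_._1)

  // Word count over a single batch.
  def wordCount(batch: Seq[Event]): Map[String, Int] =
    batch
      .flatMap(_.text.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "up up"), Event(1900, "down"), Event(2100, "up"))
    // With a 2-second interval, the first two events share a batch.
    toBatches(events, intervalMs = 2000).foreach { case (batchId, batch) =>
      println(s"batch $batchId: ${wordCount(batch)}")
    }
  }
}
```

The batch interval is the trade-off knob: a shorter interval lowers latency, while a longer one amortizes scheduling overhead across more events.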
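  • 18. Apache Spark – Spark Streaming
          import org.apache.spark.SparkConf
          import org.apache.spark.storage.StorageLevel
          import org.apache.spark.streaming.{Seconds, StreamingContext}
          import org.apache.spark.streaming.mqtt.MQTTUtils

          // brokerUrl and topic are Strings defined elsewhere
          val sparkConf = new SparkConf()
            .setAppName("MQTTWordCount")
          val ssc = new StreamingContext(sparkConf, Seconds(2)) // 2-second micro-batches
          val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
          val words = lines.flatMap(x => x.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
          wordCounts.print()
          ssc.start()
          ssc.awaitTermination()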
  • 20. Origins of the Apache Bahir Project
      MAY/2016: Established as a top-level Apache project.
      • PMC formed by Apache Spark committers/PMC members and Apache Members
      • Initial contributions imported from Apache Spark
      AUG/2016: The Flink community joins Apache Bahir
      • Initial contributions of Flink extensions
      • In October 2016, Robert Metzger was elected committer
  • 21. Origins of the Bahir name
      Naming an Apache project is a science!
      • We needed a name that wasn't used yet
      • It needed to be related to Spark
      We ended up with Bahir:
      • A name of Arabian origin that means "sparkling"
      • Also associated with a guy who succeeds at everything
  • 22. Why Apache Bahir
      It's an Apache project
      • And if you are here, you know what that means
      What are the benefits of curating your extensions at Apache Bahir?
      • Apache Governance
      • Apache License
      • Apache Community
      • Apache Brand
  • 23. Why Apache Bahir
      Flexibility
      • Release flexibility
      • Bound to platform or component release
      Shared infrastructure
      • Release, CI, etc.
      Shared knowledge
      • Collaborate with experts on both platform and component areas
  • 24. Bahir extensions for Apache Spark
      MQTT: enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
      • http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
      • http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
      CouchDB/Cloudant: enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
      Twitter: enables reading social data from Twitter using Spark Streaming.
      • http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
      Akka: enables reading data from Akka Actors using Spark Streaming or Structured Streaming.
      • http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
      ZeroMQ: enables reading data from ZeroMQ using Spark Streaming.
      • http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
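For the Structured Streaming side of the MQTT extension, a minimal sketch could look like the following. It assumes the `spark-sql-streaming-mqtt` artifact from Bahir is on the classpath and uses the provider class name given in the Bahir documentation; the broker URL and topic are hypothetical placeholders. Because the column layout of the source differs between Bahir releases, the sketch prints the schema instead of hard-coding column names.

```scala
import org.apache.spark.sql.SparkSession

object MqttStructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical broker URL and topic; replace with your own.
    val brokerUrl = "tcp://localhost:1883"
    val topic = "sensors/demo"

    val spark = SparkSession.builder()
      .appName("MQTTStructuredStreaming")
      .getOrCreate()

    // Subscribe to the topic; each MQTT message becomes a row of the unbounded table.
    val messages = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option("topic", topic)
      .load(brokerUrl)

    // Inspect the columns provided by the installed Bahir version.
    messages.printSchema()

    // Print each micro-batch of new messages to the console.
    val query = messages.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

A job like this would typically be launched with `--packages` pointing at the matching `spark-sql-streaming-mqtt` coordinates for your Scala and Spark versions.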
  • 25. Bahir extensions for Apache Spark
      Google Cloud Pub/Sub: adds a Spark Streaming connector to Google Cloud Pub/Sub
      • https://issues.apache.org/jira/browse/BAHIR-116
  • 26. Apache Spark extensions in Bahir
      Adding Bahir extensions to your application
      • Using SBT:
          libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
      • Using Maven:
          <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>spark-streaming-mqtt_2.11</artifactId>
            <version>2.2.0</version>
          </dependency>
  • 27. Apache Spark extensions in Bahir
      Submitting applications with Bahir extensions to Spark
      • spark-shell:
          bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
      • spark-submit:
          bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …..
  • 28. IoT - Internet of Things
  • 29. IoT – Definition by Wikipedia: The Internet of things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and network connectivity which enable these objects to connect and exchange data.
  • 31. IoT – Interaction between multiple entities: Things, Software, People (interactions: actuate, inform)
  • 32. IoT Ecosystem in a Nutshell
      • Participants: Manufacturer, Chipset, Board, Appliance, Cloud, Service provider, Consumer
      • IoT Platform: Connectivity, Security, Analysis, Management, Integration
  • 33. IoT Patterns – Some of them …
      • Remote control
      • Security analysis
      • Edge analytics
      • Historical data analysis
      • Distributed Platforms
      • Real-time decisions
  • 34. IoT Patterns – Real-time decisions
      • Action is triggered if an anomaly (+/-) is identified
      • MTTR (mean time to repair) is critical
      • High throughput might hide the real issue
      • QoS tradeoffs
      • Payload size and format
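The first point, triggering an action when a reading deviates in either direction, can be sketched in plain Scala. This is an illustrative threshold check under assumed values, not the logic used in the talk's demo; the `Reading` type, the expected value, and the tolerance are hypothetical.

```scala
object AnomalySketch {
  // Hypothetical device reading; the field names are illustrative only.
  final case class Reading(deviceId: String, metric: String, value: Double)

  // Flag a reading that deviates from the expected value by more than the
  // tolerance, in either direction (the "+/-" case).
  def isAnomaly(r: Reading, expected: Double, tolerance: Double): Boolean =
    math.abs(r.value - expected) > tolerance

  // Keep only the readings that should trigger an action.
  def anomalies(rs: Seq[Reading], expected: Double, tolerance: Double): Seq[Reading] =
    rs.filter(isAnomaly(_, expected, tolerance))

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading("elevator-1", "temperature", 21.5),
      Reading("elevator-2", "temperature", 48.0),
      Reading("elevator-3", "temperature", -5.0))

    // Here the triggered "action" is just a printed alert.
    anomalies(readings, expected = 22.0, tolerance = 10.0)
      .foreach(r => println(s"ALERT ${r.deviceId}: ${r.metric}=${r.value}"))
  }
}
```

In a streaming job the same predicate would run inside a `filter` on each micro-batch, which keeps the detection latency bounded by the batch interval.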
  • 35. MQTT – M2M / IoT Connectivity Protocol
      • Connect + Publish + Subscribe
      • History: 1999 IBM / Eurotech; 2010 Published; 2011 Eclipse M2M / Paho; 2014 OASIS; Soon V5
      • Open spec + 40 client implementations
      • Minimal overhead: header of 2-4 bytes (publish), 14 bytes (connect)
      • Tiny clients (Java 170KB)
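The small header comes from the layout of the MQTT fixed header: one control byte followed by a variable-length "Remaining Length" field. The sketch below follows the encoding scheme described in the MQTT 3.1.1 specification (7 data bits per byte, high bit set when more bytes follow); the object and function names are hypothetical.

```scala
object MqttFixedHeader {
  // Largest Remaining Length that the variable-length encoding can represent.
  val MaxRemainingLength: Int = 268435455

  // Encode the Remaining Length field: the low 7 bits go first, and the high
  // bit of a byte is set when further length bytes follow.
  def encodeRemainingLength(length: Int): Seq[Int] = {
    require(length >= 0 && length <= MaxRemainingLength, "length out of range")
    val low = length % 128
    val rest = length / 128
    if (rest > 0) (low | 128) +: encodeRemainingLength(rest) else Seq(low)
  }

  // Fixed header size: one control byte plus the encoded length bytes.
  def fixedHeaderSize(remainingLength: Int): Int =
    1 + encodeRemainingLength(remainingLength).size

  def main(args: Array[String]): Unit = {
    Seq(0, 100, 127, 128, 16383, 16384).foreach { n =>
      println(s"remaining length $n -> fixed header of ${fixedHeaderSize(n)} bytes")
    }
  }
}
```

For typical sensor readings well under 128 bytes, the fixed header stays at two bytes, and the length field grows only as the packet grows.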
  • 36. MQTT – Quality of Service (between client and MQTT broker)
      • QoS0, at most once: no connection failover; never duplicates
      • QoS1, at least once: has connection failover; can duplicate
      • QoS2, exactly once: has connection failover; never duplicates
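On the client side, the QoS level is chosen per message. The following sketch uses the Eclipse Paho Java client from Scala and assumes the `org.eclipse.paho.client.mqttv3` library is on the classpath and a broker is reachable; the broker URL and topic are hypothetical placeholders.

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttConnectOptions, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object QosPublishSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical broker URL and topic; replace with your own.
    val brokerUrl = "tcp://localhost:1883"
    val topic = "sensors/demo"

    // In-memory persistence keeps in-flight QoS 1/2 messages only while the process lives.
    val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    val options = new MqttConnectOptions()
    options.setCleanSession(true)
    client.connect(options)

    // Publish one message at each QoS level: 0 (at most once),
    // 1 (at least once), 2 (exactly once).
    for (qos <- 0 to 2) {
      val message = new MqttMessage(s"reading sent at QoS $qos".getBytes("UTF-8"))
      message.setQos(qos)
      client.publish(topic, message)
    }

    client.disconnect()
    client.close()
  }
}
```

Higher QoS levels add acknowledgement round trips, so QoS 0 is often acceptable for frequent sensor readings where a lost sample is quickly superseded, while QoS 1 or 2 suits commands and alerts.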
  • 37. MQTT – World usage
      • Smart Home Automation
      • Messaging
      • Notable mentions: IBM IoT Platform, AWS IoT, Microsoft IoT Hub, Facebook Messenger
  • 39. IoT Simulator using MQTT
      The demo environment: https://github.com/lresende/bahir-iot-demo
      • Elevator simulator: Node.js web app that simulates elevator IoT devices
      • Metrics: Weight, Speed, Power, Temperature, System
      • MQTT broker: Mosquitto
  • 41. Summary – Take away points
      Apache Spark
      • IoT analytics runtime with support for "Continuous Applications"
      Apache Bahir
      • Brings access to IoT data via supported connectors (e.g. MQTT)
      IoT Applications
      • Use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming
  • 42. Join the Apache Bahir community!
  • 43. References
      • Apache Bahir: http://bahir.apache.org
      • Documentation for Apache Spark extensions: http://bahir.apache.org/docs/spark/current/documentation/
      • Source repositories: https://github.com/apache/bahir and https://github.com/apache/bahir-website
      • Demo repository: https://github.com/lresende/bahir-iot-demo
      • Image source: http://az616578.vo.msecnd.net/files/2016/03/21/6359412499310138501557867529_thank-you-1400x800-c-default.gif