SlideShare a Scribd company logo
Processing Fast Data with Apache Spark:
The Tale of Two APIs
Big Data Lnd
Gerard Maas
Señor SW Engineer
16/Nov/2017
Gerard Maas
Señor SW Engineer Computer Engineer
Scala Programmer
Early Spark Adopter (v0.9)
Spark Notebook Contributor
Cassandra MVP (2015, 2016)
Stack Overflow Top
Contributor
(Spark, Spark Streaming, Scala)
Wannabe {
IoT Maker
Drone crasher/tinkerer
}
@maasg
https://github.com/maasg
https://www.linkedin.com/
in/gerardmaas/
https://stackoverflow.com
/users/764040/maasg
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Streams
Everywhere
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Once upon a
time...
Apache Spark Core
Spark SQL
SparkMLLib
SparkStreaming
Structured
Streaming
DataFrames
DataSets
GraphFrames
Data Sources
Apache Spark Core
Spark SQL
SparkMLLib
SparkStreaming
Structured
Streaming
DataFrames
DataSets
GraphFrames
Data Sources
Structured Streaming
Structured Streaming
Kafka
Sockets
HDFS/S3
Custom
Streaming
DataFrame
Structured Streaming
Kafka
Sockets
HDFS/S3
Custom
Streaming
DataFrame
Structured Streaming
Kafka
Sockets
HDFS/S3
Custom
Streaming
DataFrame
Query
Kafka
Files
foreachSink
console
memory
Output
Mode
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, reading: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"reading" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("group.id", "iot-data-consumer")
.option("startingOffsets", "earliest")
.load()
case class SensorData(sensorId: Int, timestamp: Long, value: Double)
def parse(str: String): SensorData = ...
val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record))
val validData = parsedData.where($"value" > 0)
val query = validData.writeStream
.outputMode("append")
.format("parquet")
.option("path", "/dest/sensor-data")
.option("checkpointLocation", "/tmp/checkpoint")
.start()
Structured Streaming - Usecases
● ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Join Streams with Fixed Datasets
● Apply Machine Learning Models
Structured Streaming
Spark Streaming
Kafka
Flume
Kinesis
Twitter
Sockets
HDFS/S3
Custom Apache Spark
SparkSQL
SparkML
...
Databases
HDFS
API Server
Streams
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
DStream[T]
RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]
t0 t1 t2 t3 ti ti+1
DStream[T]
RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]
t0 t1 t2 t3 ti ti+1
RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]
Transformation
T -> U
DStream[T]
RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]
t0 t1 t2 t3 ti ti+1
RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]
Transformation
T -> U
Actions
API: Transformations
map,
flatmap,
filter
count,
reduce,
countByValue,
reduceByKey
n
union,
join
cogroup
API: Transformations
mapWithState
…
…
API: Transformations
transform
val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform{rdd =>
rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions
print
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2
saveAsTextFiles,
saveAsObjectFiles,
saveAsHadoopFiles
xxx
yyy
zzz
foreachRDD
*
Actions
print
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2
saveAsTextFiles,
saveAsObjectFiles,
saveAsHadoopFiles
xxx
yyy
zzz
foreachRDD
*
Spark SQL
Dataframes
GraphFrames
Any API
Structured Streaming - Usecases
● Stream-stream joins
● Complex state management (local + cluster state)
● Streaming Machine Learning
○ Learn
○ Score
● Join Streams with Updatable Datasets
● Event-time oriented analytics
● Continuous processing
Structured Streaming
+
Spark Streaming + Structured Streaming
36
val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???
val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
.format("kafka")
.option("kafka.bootstrap.servers",bootstrapServers)
.option("topic", writeTopic)
.option("checkpointLocation", checkpointLocation)
.start()
val dstream = KafkaUtils.createDirectStream(...)
dstream.map{rdd =>
val ds = sparkSession.createDataset(rdd)
val f = parse andThen process andThen serialize
val result = f(ds)
result.write.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("topic", writeTopic)
.option("checkpointLocation", checkpointLocation)
.save()
}
Structured StreamingSpark Streaming
Streaming Pipelines
Structured Streaming
Keyword
Extraction
Keyword
Relevance
Similarity
DB Storage
Structured Streaming
lightbend.com/fast-data-platform
Features
1. One-click component installations
2. Automatic dependency checks
3. One-click access to install logs
4. Real-time cluster visualization
5. Access to consolidated production logs
Benefits:
1. Easy to get started
2. Ready access to all components
3. Increased developer velocity
Fast Data Platform Manager, for Managing Running Clusters
Fast Data Platform
Visit us at the Booth!
Thank You
Gerard Maas
@maasg

More Related Content

What's hot (20)

Streams, Tables, and Time in KSQL
Streams, Tables, and Time in KSQLStreams, Tables, and Time in KSQL
Streams, Tables, and Time in KSQL
confluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
 
Predictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo EvertsPredictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo Everts
Databricks
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
confluent
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
Stratio
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
DataStax
 
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik Erlandson
Databricks
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
Guozhang Wang
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
Joe Stein
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
Knoldus Inc.
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Brian O'Neill
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
yieldbot
 
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQLCrossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
confluent
 
Introduction to the Processor API
Introduction to the Processor APIIntroduction to the Processor API
Introduction to the Processor API
confluent
 
Streams, Tables, and Time in KSQL
Streams, Tables, and Time in KSQLStreams, Tables, and Time in KSQL
Streams, Tables, and Time in KSQL
confluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
 
Predictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo EvertsPredictive Maintenance at the Dutch Railways with Ivo Everts
Predictive Maintenance at the Dutch Railways with Ivo Everts
Databricks
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
confluent
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
Stratio
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
DataStax
 
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik Erlandson
Databricks
 
Performance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams ApplicationsPerformance Analysis and Optimizations for Kafka Streams Applications
Performance Analysis and Optimizations for Kafka Streams Applications
Guozhang Wang
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
Joe Stein
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
Knoldus Inc.
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Brian O'Neill
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
yieldbot
 
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQLCrossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
Crossing the Streams: Rethinking Stream Processing with Kafka Streams and KSQL
confluent
 
Introduction to the Processor API
Introduction to the Processor APIIntroduction to the Processor API
Introduction to the Processor API
confluent
 

Similar to Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs (20)

A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
Spark by Adform Research, Paulius
Spark by Adform Research, PauliusSpark by Adform Research, Paulius
Spark by Adform Research, Paulius
Vasil Remeniuk
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
appaji intelhunt
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
Sages
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
Denis Dus
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
José Carlos García Serrano
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Sages
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
Spark by Adform Research, Paulius
Spark by Adform Research, PauliusSpark by Adform Research, Paulius
Spark by Adform Research, Paulius
Vasil Remeniuk
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
nickmbailey
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
appaji intelhunt
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
Sages
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
Databricks
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
Denis Dus
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
Rustam Aliyev
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash courseCodepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Codepot - Pig i Hive: szybkie wprowadzenie / Pig and Hive crash course
Sages
 

More from Matt Stubbs (20)

Blueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures
Blueprint Series: Banking In The Cloud – Ultra-high Reliability ArchitecturesBlueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures
Blueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures
Matt Stubbs
 
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...
Matt Stubbs
 
Blueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformBlueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data Platform
Matt Stubbs
 
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...
Matt Stubbs
 
Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.
Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.
Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.
Matt Stubbs
 
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCEBig Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE
Matt Stubbs
 
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQLBig Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Matt Stubbs
 
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTSBig Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS
Matt Stubbs
 
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Matt Stubbs
 
Big Data LDN 2018: AI VS. GDPR
Big Data LDN 2018: AI VS. GDPRBig Data LDN 2018: AI VS. GDPR
Big Data LDN 2018: AI VS. GDPR
Matt Stubbs
 
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...
Matt Stubbs
 
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...
Matt Stubbs
 
Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...
Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...
Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...
Matt Stubbs
 
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...
Matt Stubbs
 
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICS
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICSBig Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICS
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICS
Matt Stubbs
 
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSE
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSEBig Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSE
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSE
Matt Stubbs
 
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNINGBig Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Matt Stubbs
 
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...
Matt Stubbs
 
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...
Matt Stubbs
 
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATE
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATEBig Data LDN 2018: DATA APIS DON’T DISCRIMINATE
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATE
Matt Stubbs
 
Blueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures
Blueprint Series: Banking In The Cloud – Ultra-high Reliability ArchitecturesBlueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures
Blueprint Series: Banking In The Cloud – Ultra-high Reliability Architectures
Matt Stubbs
 
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...
Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive P...
Matt Stubbs
 
Blueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data PlatformBlueprint Series: Expedia Partner Solutions, Data Platform
Blueprint Series: Expedia Partner Solutions, Data Platform
Matt Stubbs
 
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...
Blueprint Series: Architecture Patterns for Implementing Serverless Microserv...
Matt Stubbs
 
Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.
Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.
Big Data LDN 2018: DATA, WHAT PEOPLE THINK AND WHAT YOU CAN DO TO BUILD TRUST.
Matt Stubbs
 
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCEBig Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE
Big Data LDN 2018: DATABASE FOR THE INSTANT EXPERIENCE
Matt Stubbs
 
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQLBig Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Big Data LDN 2018: BIG DATA TOO SLOW? SPRINKLE IN SOME NOSQL
Matt Stubbs
 
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTSBig Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS
Big Data LDN 2018: ENABLING DATA-DRIVEN DECISIONS WITH AUTOMATED INSIGHTS
Matt Stubbs
 
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Big Data LDN 2018: DATA MANAGEMENT AUTOMATION AND THE INFORMATION SUPPLY CHAI...
Matt Stubbs
 
Big Data LDN 2018: AI VS. GDPR
Big Data LDN 2018: AI VS. GDPRBig Data LDN 2018: AI VS. GDPR
Big Data LDN 2018: AI VS. GDPR
Matt Stubbs
 
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...
Big Data LDN 2018: REALISING THE PROMISE OF SELF-SERVICE ANALYTICS WITH DATA ...
Matt Stubbs
 
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...
Big Data LDN 2018: TURNING MULTIPLE DATA LAKES INTO A UNIFIED ANALYTIC DATA L...
Matt Stubbs
 
Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...
Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...
Big Data LDN 2018: MICROSOFT AZURE AND CLOUDERA – FLEXIBLE CLOUD, WHATEVER TH...
Matt Stubbs
 
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...
Big Data LDN 2018: CONSISTENT SECURITY, GOVERNANCE AND FLEXIBILITY FOR ALL WO...
Matt Stubbs
 
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICS
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICSBig Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICS
Big Data LDN 2018: MICROLISE: USING BIG DATA AND AI IN TRANSPORT AND LOGISTICS
Matt Stubbs
 
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSE
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSEBig Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSE
Big Data LDN 2018: EXPERIAN: MAXIMISE EVERY OPPORTUNITY IN THE BIG DATA UNIVERSE
Matt Stubbs
 
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNINGBig Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Big Data LDN 2018: A LOOK INSIDE APPLIED MACHINE LEARNING
Matt Stubbs
 
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...
Big Data LDN 2018: DEUTSCHE BANK: THE PATH TO AUTOMATION IN A HIGHLY REGULATE...
Matt Stubbs
 
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...
Big Data LDN 2018: FROM PROLIFERATION TO PRODUCTIVITY: MACHINE LEARNING DATA ...
Matt Stubbs
 
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATE
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATEBig Data LDN 2018: DATA APIS DON’T DISCRIMINATE
Big Data LDN 2018: DATA APIS DON’T DISCRIMINATE
Matt Stubbs
 

Recently uploaded (20)

Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Decision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdfDecision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdf
Saikat Basu
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
Taqyea
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Decision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdfDecision Trees in Artificial-Intelligence.pdf
Decision Trees in Artificial-Intelligence.pdf
Saikat Basu
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证定制学历(美国Purdue毕业证)普渡大学电子版毕业证
定制学历(美国Purdue毕业证)普渡大学电子版毕业证
Taqyea
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Digilocker under workingProcess Flow.pptx
Digilocker  under workingProcess Flow.pptxDigilocker  under workingProcess Flow.pptx
Digilocker under workingProcess Flow.pptx
satnamsadguru491
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
 

Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs

  • 1. Processing Fast Data with Apache Spark: The Tale of Two APIs Big Data Lnd Gerard Maas Señor SW Engineer 16/Nov/2017
  • 2. Gerard Maas Señor SW Engineer Computer Engineer Scala Programmer Early Spark Adopter (v0.9) Spark Notebook Contributor Cassandra MVP (2015, 2016) Stack Overflow Top Contributor (Spark, Spark Streaming, Scala) Wannabe { IoT Maker Drone crasher/tinkerer } @maasg https://github.com/maasg https://www.linkedin.com/ in/gerardmaas/ https://stackoverflow.com /users/764040/maasg
  • 8. Apache Spark Core Spark SQL SparkMLLib SparkStreaming Structured Streaming DataFrames DataSets GraphFrames Data Sources
  • 9. Apache Spark Core Spark SQL SparkMLLib SparkStreaming Structured Streaming DataFrames DataSets GraphFrames Data Sources
  • 14. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 15. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 16. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 17. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 18. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 19. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 20. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, reading: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"reading" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 21. val rawData = sparkSession.readStream .format("kafka") .option("kafka.bootstrap.servers", kafkaBootstrapServer) .option("subscribe", topic) .option("group.id", "iot-data-consumer") .option("startingOffsets", "earliest") .load() case class SensorData(sensorId: Int, timestamp: Long, value: Double) def parse(str: String): SensorData = ... val parsedData = rawData.select($"value").as[String].flatMap(record => parse(record)) val validData = parsedData.where($"value" > 0) val query = validData.writeStream .outputMode("append") .format("parquet") .option("path", "/dest/sensor-data") .option("checkpointLocation", "/tmp/checkpoint") .start()
  • 22. Structured Streaming - Usecases ● ETL ● Stream aggregations, windows ● Event-time oriented analytics ● Join Streams with Fixed Datasets ● Apply Machine Learning Models
  • 24. Spark Streaming Kafka Flume Kinesis Twitter Sockets HDFS/S3 Custom Apache Spark SparkSQL SparkML ... Databases HDFS API Server Streams
  • 26. DStream[T] RDD[T] RDD[T] RDD[T] RDD[T] RDD[T] t0 t1 t2 t3 ti ti+1
  • 27. DStream[T] RDD[T] RDD[T] RDD[T] RDD[T] RDD[T] t0 t1 t2 t3 ti ti+1 RDD[U] RDD[U] RDD[U] RDD[U] RDD[U] Transformation T -> U
  • 28. DStream[T] RDD[T] RDD[T] RDD[T] RDD[T] RDD[T] t0 t1 t2 t3 ti ti+1 RDD[U] RDD[U] RDD[U] RDD[U] RDD[U] Transformation T -> U Actions
  • 31. API: Transformations transform val iotDstream = MQTTUtils.createStream(...) val devicePriority = sparkContext.cassandraTable(...) val prioritizedDStream = iotDstream.transform{rdd => rdd.map(d => (d.id, d)).join(devicePriority) }
  • 34. Structured Streaming - Usecases ● Stream-stream joins ● Complex state management (local + cluster state) ● Streaming Machine Learning ○ Learn ○ Score ● Join Streams with Updatable Datasets ● Event-time oriented analytics ● Continuous processing
  • 36. Spark Streaming + Structured Streaming 36 val parse: Dataset[String] => Dataset[Record] = ??? val process: Dataset[Record] => Dataset[Result] = ??? val serialize: Dataset[Result] => Dataset[String] = ??? val kafkaStream = spark.readStream… val f = parse andThen process andThen serialize val result = f(kafkaStream) result.writeStream .format("kafka") .option("kafka.bootstrap.servers",bootstrapServers) .option("topic", writeTopic) .option("checkpointLocation", checkpointLocation) .start() val dstream = KafkaUtils.createDirectStream(...) dstream.map{rdd => val ds = sparkSession.createDataset(rdd) val f = parse andThen process andThen serialize val result = f(ds) result.write.format("kafka") .option("kafka.bootstrap.servers", bootstrapServers) .option("topic", writeTopic) .option("checkpointLocation", checkpointLocation) .save() } Structured StreamingSpark Streaming
  • 40. Features 1. One-click component installations 2. Automatic dependency checks 3. One-click access to install logs 4. Real-time cluster visualization 5. Access to consolidated production logs Benefits: 1. Easy to get started 2. Ready access to all components 3. Increased developer velocity Fast Data Platform Manager, for Managing Running Clusters
  • 41. Fast Data Platform Visit us at the Booth!