Using Spark and Riak for IoT apps
Patterns and Anti-patterns
Pavel Hardak
Basho Technologies
IOT & INDUSTRY VERTICALS
IoT market - growth prediction
Number of connected “things”
•2016 – about 6.4 B
•30% YoY growth, 5.5M activations per day
•2020 – about 21 B
“By 2020 more than half of new major business processes and
systems will incorporate some element of Internet of Things”
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit East talk Pavel Hardak
Copyright © 2017 Daniel Elizalde
We want to be here!
IoT Project Plan
•Investigate those “things” and figure out
• What protocols they support (CoAP, MQTT, HTTP, …)
• What data they generate (temperature, humidity, location, speed, ...)
•Collect this data in our data center
• Implement protocols and parsing routines
• Store into persistent storage (“Data Lake” architecture)
•Once stored in Data Lake
• Analyze, summarize, “slice and dice”
• Predict, make recommendations, discover insights
•Declare a victory (make profit, go for IPO, …)
REFERENCE ARCHITECTURE (?)
[Diagram: IoT Devices → (MQTT, CoAP and HTTP) → Data Lake → SQL → Apps & Analytics]
Not so fast, my friend.
What is wrong with “Data Lake” for IoT?
What is different (special) about IoT?
It is about the “things”… and more.
IoT Networks and Protocols
IoT Devices & IoT Network Protocols
•Wireless technologies
•Limited range
•Limited bandwidth
•Shared transmission media
•Mesh or Ad-hoc Topology
•Possible signal interference
•Low cost hardware components
•Low power radio transmitters
•Very small antennas
•“Custom-made” firmware
•Constrained Application Protocol (CoAP)
•“Best Effort” QoS (“fire and forget”)
IoT is “Big Data” - by definition.
Actually, lots and lots of Big Data.
IoT Data Categories
Category | Subcategory | Description
Metadata & Profiles | Devices | Device info (model, SN, firmware, sensors, ..), configuration, owner, …
Metadata & Profiles | Users | Personal info, preferences, billing info, registered devices, …
Time Series | Ingested (“Raw”) | Measurements, statuses and events from devices.
Time Series | Aggregated (“Derived”) | Calculated data from devices & profiles. Rollups: aggregate metrics from higher resolution to lower ones (minute → hour → day) using min, max, avg, ... Aggregations: aggregate measurements, configuration and profiles (model, region, …) over time ranges.
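The rollup idea can be illustrated with a minimal Python sketch. It assumes samples arrive as `(epoch_seconds, value)` pairs; the function name `rollup` and the sample values are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds):
    """Group (epoch_seconds, value) samples into fixed-size time buckets
    and compute min, max, avg and count for each bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # floor to bucket boundary
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": mean(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical 10-second samples; note that one of them arrives out of order.
raw = [(0, 20.0), (10, 21.0), (20, 22.0), (60, 30.0), (70, 32.0), (50, 23.0)]
per_minute = rollup(raw, 60)
```

Keeping `count` next to `avg` matters when rolling minutes up into hours: averaging the per-minute averages is only correct if each one is weighted by its count.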
Five “V”s of IoT data (plus Complexity)
Velocity | Torrent of small writes (sensors). Reads: millions of low-latency queries for user and device profiles, plus range queries for TS data (slices). Stream of updates (profiles) - beware of conflicts.
Variety | Sensor data (time series), user and device profiles, and “derived” time series data (e.g. rollups, aggregations).
Volume | Starts small, grows quickly, keeps coming 24x7x365 (nights, weekends and holidays). Spikes on new model launches or successful marketing campaigns. Can slow down, but will keep growing. An efficient data retention policy is critical to prevent overflows.
Veracity | Generally trustworthy, but beware of “low cost” sensors with low accuracy. Sent over not-so-reliable transport - expect that some data will be corrupted, arrive late or be lost. (Hopefully the devices were not hijacked or impersonated by hackers.)
Value | Profiles and summaries are much more valuable than raw data samples. The value of “raw” time series drops quickly once it has been processed and the clock has advanced. Aggregated (“derived”) data are more valuable than raw data. Exceptions: financial transactions, life support, nuclear plants, oil rigs, …
Complexity | Poly-structured, using simple schemas and simple relations (usually implicit). Some data is treated as unstructured (“opaque”) for speed or flexibility. Note: expect schema or structure changes without prior notice.
How are we going to solve it?
IoT Data & Processing
•Data
• Huge amounts of data records - arriving 24x7x365
• Some data records will arrive out-of-order, be late (minutes or hours) or lost
• Expect “unexpected” - e.g. errors, nulls, schema or type changes, drops
•Processing
• Preprocessing - validation and cleansing
• Translation (format, type, version, ...) and enrichment
• Aggregations - min, max, avg, sum, top or bottom N, percentile, …
• Grouping - device vendor and model, location, service, subscription type, ...
• Rollups - from 10 sec raw samples to 1 min, 1 hour, 1 day, 1 week, 1 month, ..
• Alarms (e.g. threshold crossing), anomaly detection (using ML)
• Predefined reports (e.g. daily, weekly, …)
• Ad-hoc reports or exploratory queries
• Insights, predictions, …
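The preprocessing and alarm steps in the list above might look like the following Python sketch. Everything specific here is an assumption made for illustration: the field names (`device_id`, `ts`, `temp_c`), the one-day lateness bound and the plausibility range are hypothetical, not values from the talk.

```python
import math

MAX_LATENESS_S = 24 * 3600  # assumption: accept records up to one day late


def clean(record, now_s):
    """Return a normalized reading, or None if the record should be dropped."""
    try:
        device_id = record["device_id"]
        ts = int(record["ts"])
        temp_c = float(record["temp_c"])  # tolerates a number-to-string type change
    except (KeyError, TypeError, ValueError):
        return None  # missing field, null, or unparseable value
    if not device_id:
        return None
    if math.isnan(temp_c) or not -90.0 <= temp_c <= 150.0:
        return None  # hypothetical plausibility range for a temperature sensor
    if ts > now_s or now_s - ts > MAX_LATENESS_S:
        return None  # timestamp in the future, or record arrived too late
    return {"device_id": str(device_id), "ts": ts, "temp_c": temp_c}


def threshold_alarms(readings, limit_c):
    """Return the readings that cross the alarm threshold."""
    return [r for r in readings if r["temp_c"] > limit_c]


now = 200_000
records = [
    {"device_id": "a", "ts": 199_990, "temp_c": 21.5},    # valid
    {"device_id": "b", "ts": 199_980, "temp_c": "85.0"},  # type changed to string
    {"device_id": "c", "ts": 199_970, "temp_c": None},    # null value
    {"device_id": "d", "ts": 1_000, "temp_c": 20.0},      # far too late
    {"device_id": "e", "ts": 199_960},                    # missing field
]
cleaned = [r for r in (clean(x, now) for x in records) if r is not None]
alarms = threshold_alarms(cleaned, 80.0)
```

In a real pipeline, rejected records would more likely be counted or routed to a side channel than silently discarded, so that a schema change on the device side is noticed quickly.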
Architectural Blueprints
•Lambda Architecture by Nathan Marz (ex-Twitter)
•Kappa Architecture by Jay Kreps (Confluent)
•Zeta Architecture by Jim Scott (MapR)
•… and their variants
[Diagrams: Lambda, Kappa and Zeta architectures]
Data Processing Framework for IoT
•Uses “Best of breed” OSS technologies
•Combines two paradigms
• “Speed Layer” – pipeline for Stream Processing for “Data in Motion”
• “Serving Layer” – analytics for “Data in Motion” and “Data at Rest”
•Every component is “Distributed by Design”
• Collection Layer
• Message Queue
• Stream Processing
• Data Storage (Database, Object System, Data Warehouse)
• Query and Analytics Engines
Data store for IoT – “Wish list”
• Ingested (Raw) Time Series
• Very high write throughput
• Fast slice (time range) reads
• Aggregated (Derived) Time Series
• Auto-distributed + slice locality
• SQL-like queries
• Aggregations
• Bulk queries (analytics)
• Secondary Indexes (Tags)
• Efficient Storage
• Auto Data Retention (TTL)
• Built-in anti-entropy
• Compression
• Hot Backups
• Profiles and Metadata
• Many concurrent reads with low latency
• Reliable writes (ACID or conflict resolution)
• Unstructured or partially structured
• Secondary Indexes + Text Search
• Scalability and Availability
• Distributed architecture, no SPoF
• Linearly scalable - up and down
• Operational simplicity
• Masterless architecture
• Automatic rebalancing
• Metrics, logs, events
• Rolling upgrades
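One way to get “auto-distributed + slice locality” is to partition by device and by a fixed time quantum, so that all samples of one device within the same quantum land on the same node. The Python sketch below illustrates that general idea only; the 15-minute quantum, the CRC32-based node choice and all function names are assumptions, not a description of any particular product's internals.

```python
import zlib


def partition_key(device_id, ts_s, quantum_s=900):
    """Same device and same time quantum give the same key, so the data is co-located."""
    return (device_id, ts_s - ts_s % quantum_s)


def node_for(key, nodes):
    """Pick a node with a stable hash (a stand-in for a real hash ring)."""
    digest = zlib.crc32(repr(key).encode("utf-8"))
    return nodes[digest % len(nodes)]


def keys_for_range(device_id, start_s, end_s, quantum_s=900):
    """All partition keys that a time range ('slice') query [start, end) must touch."""
    first = start_s - start_s % quantum_s
    return [(device_id, q) for q in range(first, end_s, quantum_s)]


nodes = ["node-1", "node-2", "node-3"]          # hypothetical cluster members
key = partition_key("dev-1", 1000)              # falls into the quantum starting at 900
slice_keys = keys_for_range("dev-1", 1000, 2000)
```

The quantum size is a trade-off: a larger quantum means a slice query touches fewer partitions, while a smaller one spreads a busy device's writes over more nodes.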
What DB type is a good fit for TS use cases?
Data Access Patterns
Category | Subcategory | Description | R:W %
Metadata & Profiles | Devices & Users | Many low-latency small reads all over the dataset. Occasional updates, possibly by different “actors” (web, device, app); conflicts need to be prevented or resolved. Fewer creates and deletes. | 90:10
Time Series | Ingested (“Raw”) | Very high throughput of relatively small writes. Most reads are over a recent time range (“slice”). Updates are rare (corrections). This category is the biggest part of the IoT application dataset. | 10:90
Time Series | Aggregated (“Derived”) | Mostly reads by users, platform services and reports. Writes are periodic, on each time interval or from batch jobs. | 80:20
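When several actors update the same profile, one simple way to resolve conflicts is a per-field “last writer wins” merge. The Python sketch below shows that technique as an illustration; it is not presented as the approach used in the talk, and the profile layout (each field mapped to a `(value, updated_at)` pair) is an assumption.

```python
def merge_profiles(a, b):
    """Merge two versions of a profile field by field.

    Each field maps to (value, updated_at); the newer timestamp wins.
    On a tie the value from `a` is kept.
    """
    merged = dict(a)
    for field, (value, updated_at) in b.items():
        if field not in merged or updated_at > merged[field][1]:
            merged[field] = (value, updated_at)
    return merged


# Hypothetical concurrent updates from the web UI and from the device itself.
from_web = {"name": ("Thermostat", 100), "target_c": (21.0, 200)}
from_device = {"target_c": (19.5, 150), "firmware": ("1.2", 180)}
merged = merge_profiles(from_web, from_device)
```

This relies on reasonably synchronized clocks and silently discards the older write, so it suits preferences and settings better than counters or billing data.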
Database Type For IoT or Time Series
Relational | Key Value | Document | Wide Column | Graph
MySQL | Riak KV | MongoDB | Cassandra | Neo4J
PostgreSQL | DynamoDB | CouchBase | HBase | Titan
Oracle | Voldemort | RethinkDB | Accumulo | Infinite Graph
We need a new type of NoSQL database – Time Series
None of the existing DB types was designed to handle time series data
• Wide column DBs have high write throughput, but reads and updates are not their strength
• Key Value and Document DBs handle metadata well, but struggle with heavy writes and time-slicing reads
• Relational – good with metadata (unless the number of updates is high), but a bad choice for TS data
• Graph DB – not a good choice for either time series or metadata, can be added later on
Database Type For IoT or Time Series
The Relational, Key Value, Document, Wide Column and Graph types are joined by a sixth type:
Time Series | InfluxDB, Riak TS, Blueflood, KairosDB, Prometheus, Druid, OpenTSDB, Dalmatiner, Graphite
IoT Sensors Data – Hot to Cold
SENSORS DATA – HOT N’ COLD
Temp | Purpose | Description | Immutable?
Boiling Hot | App usage | Last known value(s) and/or values for the last N minutes; useful for immediate responses; very frequently accessed | No
Hot | Operational dataset | Last 24 hours to several days or weeks (rarely months); frequently accessed; dashboards and online analytics | Almost*
Warm | Historical data | Older data, less frequently accessed, used mostly for offline analytics and historical analysis | Yes
Cold | Archives | Used only in rare situations, kept in long-term storage for regulatory or unpredicted purposes | Yes
STORAGE TIERS – FROM HOT TO COLD
RAM → Database (TSDB) → Object Storage → Archive
Data Lake
Temp | Purpose | Storage Products | Immutable?
Boiling Hot | App usage | Internal app cache, Redis or Memcached | No
Hot | Operational dataset | NoSQL Database (preferably Time Series DB): Riak TS, OpenTSDB, KairosDB, Cassandra, HBase | Almost*
Warm | Historical data | Object storage: HDFS (Hadoop), Ceph, Minio, Riak S2 or AWS S3 | Yes
Cold | Archives | Various | Yes
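The tiering above can be expressed as a small routing function keyed on data age. In this Python sketch the age thresholds (15 minutes, 30 days, 1 year) are purely illustrative assumptions; the slides give only rough ranges, and real cut-offs depend on the application.

```python
from datetime import timedelta

# (maximum age, tier) pairs, ordered from hottest to coldest.
# The thresholds are assumptions chosen for the example.
TIERS = [
    (timedelta(minutes=15), "boiling hot: in-memory cache"),
    (timedelta(days=30), "hot: time series database"),
    (timedelta(days=365), "warm: object storage"),
]


def tier_for(age):
    """Return the storage tier for data of the given age."""
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "cold: archive"


placement = [
    tier_for(timedelta(minutes=5)),
    tier_for(timedelta(days=2)),
    tier_for(timedelta(days=90)),
    tier_for(timedelta(days=800)),
]
```

A background job could apply the same function periodically to decide which time ranges to move from the database to object storage, and later to the archive.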
STORAGE TIERS – REALITY CHECK
RAM → Database (TSDB) → Object Storage → Archive
ElastiCache (Redis) → Database (Postgres, DynamoDB) → AWS S3 → Glacier
Data Lake
Temp | AWS Service | Storage price (USD per GB per month)
Boiling Hot | ElastiCache (Redis) | $15-45
Hot | DynamoDB, RDS (Postgres) | $0.25-0.35 (SSD), from $0.10 (Magnetic)
Warm | Simple Storage Service (S3) | $0.024 to $0.030
Cold | Glacier | $0.007
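To make the price gap concrete, here is a short Python calculation using the per-GB figures from the table above (2017-era list prices as quoted on the slide; the 1,000 GB dataset size and all names are assumptions for the example).

```python
# (low, high) storage price in USD per GB per month, taken from the table above.
PRICE_USD_PER_GB_MONTH = {
    "boiling hot": (15.0, 45.0),  # ElastiCache (Redis)
    "hot": (0.25, 0.35),          # DynamoDB / RDS (Postgres) on SSD
    "warm": (0.024, 0.030),       # S3
    "cold": (0.007, 0.007),       # Glacier
}


def monthly_cost_usd(size_gb, tier):
    """Return the (low, high) monthly storage cost for a dataset in a tier."""
    low, high = PRICE_USD_PER_GB_MONTH[tier]
    return size_gb * low, size_gb * high


def cost_ratio(tier_a, tier_b):
    """How many times more expensive tier_a is than tier_b, using the low prices."""
    return PRICE_USD_PER_GB_MONTH[tier_a][0] / PRICE_USD_PER_GB_MONTH[tier_b][0]


# Hypothetical 1,000 GB dataset priced in every tier.
estimate = {tier: monthly_cost_usd(1000, tier) for tier in PRICE_USD_PER_GB_MONTH}
```

At these prices, 1,000 GB costs roughly $250 to $350 per month in the hot tier, $24 to $30 in the warm tier and about $7 in the cold tier, which is why moving aging time series down the tiers pays off quickly. Request and retrieval charges are not included.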
OSS technologies for scalable IoT apps
Component | Open Source Technologies
Load balancer | Nginx, HAProxy
Ingestion | Kafka, RabbitMQ, ZeroMQ, Flume
Stream Computing | Spark Streaming, Apache Flink, Kafka Streams, Samza
Time Series Store | InfluxDB, KairosDB, Riak, Cassandra, OpenTSDB
Profiles Store | CouchBase, Riak, MySQL, Postgres, MongoDB
Search | Solr, Elasticsearch
Object Storage | HDFS (Hadoop), Minio, Riak S2, Ceph
Analytics Framework | Apache Spark (& MLlib), MapReduce, Hive
SQL Query Engine | Spark SQL, Presto, Impala, Drill
Cluster Manager | Mesosphere DC/OS or Mesos, Kubernetes, Docker Swarm
Checklist for IoT technology stack
❑ Is it vendor lock-in or open source software? Are there open APIs?
❑ Can it be deployed in the cloud? At the edge? In a data center? Using a hybrid approach?
❑ Can it be used for free or at low cost (no big upfront investment)?
❑ Are the components pre-integrated, or can they be easily integrated?
❑ Can you develop your app on your laptop? How many “moving parts”?
❑ Can you easily scale each component in this architecture by 2x? 10x? 50x?
❑ Is there an actively maintained roadmap that is aligned with your vision?
❑ Is there a company behind the technology to provide 24x7 support when needed?
OSS technologies for IoT apps - the “opinionated” choice
Component | Open Source Technologies
Load balancer | HAProxy
Ingestion | Apache Kafka
Stream Computing | Spark Structured Streaming
Time Series Store | Riak (TS tables)
Profiles Store | Riak (KV buckets)
Search | Riak Search (based on Solr)
Object Storage | Riak S2
Analytics Framework | Apache Spark (& MLlib)
SQL Query Engine | Apache Spark SQL
Cluster Manager | Mesosphere DC/OS or Kubernetes
• Riak TS (Time Series) - highly scalable NoSQL database for IoT and Time Series
… and more
• Riak Spark Connector for Apache Spark
• Riak Integrations with Redis and Kafka
• Riak Mesos Framework (RMF) for DC/OS
Thank You!
Contact me at [pavel at basho dot com]
 
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
No More “Sbt Assembly”: Rethinking Spark-Submit Using CueSheet: Spark Summit ...
Spark Summit
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Spark Summit
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Spark Summit
 
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
Spark Summit
 
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
Spark Summit
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Spark Summit
 
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Spark Summit
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark Summit
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Spark Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
Databricks
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
Spark Summit
 
Ad


Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit East talk Pavel Hardak

  • 1. Using Spark and Riak for IoT apps Patterns and Anti-patterns Pavel Hardak Basho Technologies
  • 2. IOT & INDUSTRY VERTICALS
  • 3. IoT market - growth prediction Number of connected “things” •2016 – about 6.4 B •30% YoY growth, 5.5M activations per day •2020 – about 21 B “By 2020 more than half of new major business processes and systems will incorporate some element of Internet of Things”
  • 5. Copyright © 2017 Daniel Elizalde We want to be here!
  • 6. IoT Project Plan •Investigate those “things” and figure out • What protocols they support (CoAP, MQTT, HTTP, …) • What data they generate (temperature, humidity, location, speed, ...) •Collect this data in our data center • Implement protocols and parsing routines • Store into persistent storage (“Data Lake” architecture) •Once stored in Data Lake • Analyze, summarize, “slice and dice” • Predict, make recommendations, discover insights •Declare a victory (make profit, go for IPO, …)
  • 7. REFERENCE ARCHITECTURE (?) IoT Devices → MQTT, CoAP and HTTP → Data Lake → SQL → Apps & Analytics. Not so fast, my friend.
  • 8. What is wrong with “Data Lake” for IoT ?
  • 13. What is special about IoT? It is about the “things”… and more.
  • 16. IoT Networks and Protocols
  • 17. IoT Devices & IoT Network Protocols •Wireless technologies •Limited range •Limited bandwidth •Shared transmission media •Mesh or Ad-hoc Topology •Possible signals interference •Low cost hardware components •Low power radio transmitters •Very small antennas •“Custom-made” firmware •Constrained Application Protocol (CoAP) •“Best Effort” QoS (“shoot and forget”)
  • 18. IoT is “Big Data” - by definition. Actually, lots and lots of Big Data.
  • 19. IoT Data Categories (Category: Description)
      • Metadata & Profiles
          • Devices: Device info (model, SN, firmware, sensors, ..), configuration, owner, …
          • Users: Personal info, preferences, billing info, registered devices, …
      • Time Series
          • Ingested (“Raw”): Measurements, statuses and events from devices.
          • Aggregated (“Derived”): Calculated data - from devices & profiles
              • Rollups – aggregate metrics from high resolution to lower ones (min – hour – day) using min, max, avg, ...
              • Aggregations – aggregate measurements, configuration and profiles (model, region, …) over time ranges
  • 25. Five “V”s IoT data
      • Velocity: Torrent of small writes (sensors). Reads – millions of low-latency queries, user and device profiles, range queries for TS data (slices). Stream of updates (profiles) - beware of conflicts.
      • Variety: Sensors data (time series), users and devices profiles, also time series “derived” data (e.g. rollups, aggregations).
      • Volume: Starts small, grows quickly, keeps coming 24x7x365 (nights, weekends and holidays). Spikes up on new model launches or successful marketing campaign. Can slow down, but will keep growing. Efficient data retention policy is critical to prevent overflows.
      • Veracity: Generally trustworthy, but beware of “low cost” sensors with low accuracy. Sent over not-so-reliable transport - expect that some data will be corrupted or arrive late or might be lost. (Hopefully the devices were not hijacked or impersonated by hackers)
      • Value: Profiles and summaries are much more valuable than raw data samples. The value of “raw” time series quickly goes down after it was processed and clock advanced. Aggregated (“derived”) data are more valuable than raw data. Exceptions: financial transactions, life support, nuclear plants, oil rigs, …
      • Complexity: Poly-structured using simple schemas and simple relations (usually implicit). Some data is treated as unstructured (“opaque”) for speed or flexibility. Note: expect schema or structure changes without preliminary notice.
  • 27. How are we going to solve it ?
  • 28. IoT Data & Processing • Data • Huge amounts of data records - arriving 24x7x365 • Some data records will arrive out-of-order, be late (minutes or hours) or lost • Expect “unexpected” - e.g. errors, nulls, schema or type changes, drops • Processing • Preprocessing - validation and cleansing • Translation (format, type, version, ...) and enrichment • Aggregations - min, max, avg, sum, top or bottom N, percentile, … • Grouping - device vendor and model, location, service, subscription type, ... • Rollups - from 10 sec raw samples to 1 min, 1 hour, 1 day, 1 week, 1 month, .. • Alarms (e.g. threshold crossing), anomaly detection (using ML) • Predefined reports (e.g. daily, weekly, …) • Ad-hoc reports or exploratory queries • Insights, predictions, …
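The cleansing and rollup steps listed on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function name `rollup`, the `(epoch_seconds, value)` sample shape, and the choice to drop `None` readings are all assumptions made for the example. Because samples are assigned to buckets by their own timestamp, out-of-order arrivals land in the right bucket.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds=60):
    """Roll raw (epoch_seconds, value) samples up into fixed-width time buckets.

    Hypothetical helper: null readings are dropped (cleansing), and each
    bucket reports min, max, avg and count of the remaining values.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if value is None:  # cleansing: skip missing readings
            continue
        bucket_start = ts - ts % bucket_seconds  # align to bucket boundary
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": mean(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Raw samples, deliberately out of order and with one null reading.
raw = [(0, 1.0), (10, 3.0), (70, 5.0), (20, None), (50, 2.0)]
per_minute = rollup(raw, bucket_seconds=60)
```

The same function can be applied again with a larger `bucket_seconds` to the raw data to produce hourly or daily rollups; rolling up already-aggregated averages would need the counts as weights.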
  • 29. Architectural Blueprints •Lambda Architecture by Nathan Marz (ex-Twitter) •Kappa Architecture by Jay Kreps (Confluent) •Zeta Architecture by Jim Scott (MapR) •… and their variants
  • 30. Data Processing Framework for IoT • Uses “Best of breed” OSS technologies • Combines two paradigms • “Speed Layer” – pipeline for Stream Processing for “Data in Motion” • “Serving Layer” – analytics for “Data in Motion” and “Data at Rest” • Every component is “Distributed by Design” • Collection Layer • Message Queue • Stream Processing • Data Storage (Database, Object System, Data Warehouse) • Query and Analytics Engines
  • 31. Data store for IoT – “Wish list” • Ingested (Raw) Time Series • Very high write throughput • Fast slice (time range) reads • Aggregated (Derived) Time Series • Auto-distributed + slice locality • SQL-like queries • Aggregations • Bulk queries (analytics) • Secondary Indexes (Tags) • Efficient Storage • Auto Data Retention (TTL) • Built-in anti-entropy • Compression • Hot Backups • Profiles and Metadata • Many concurrent reads with low latency • Reliable writes (ACID or conflict resolution) • Unstructured or partially structured • Secondary Indexes + Text Search • Scalability and Availability • Distributed architecture, no SPoF • Linearly scalable - up and down • Operational simplicity • Masterless architecture • Automatic rebalancing • Metrics, logs, events • Rolling upgrades
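Two items from this wish list, fast slice (time range) reads and automatic data retention (TTL), can be illustrated with a toy in-memory structure. This is only a sketch of the idea under stated assumptions: the class `SeriesBuffer` and its methods are hypothetical names invented here, and a real time series database does this on disk and across nodes. Keeping timestamps sorted lets both the slice read and the expiry find their boundaries by binary search.

```python
import bisect


class SeriesBuffer:
    """Toy single-series store kept sorted by timestamp (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._timestamps = []
        self._values = []

    def write(self, ts, value):
        # Late or out-of-order samples are inserted at their sorted position.
        i = bisect.bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def read_slice(self, start, end):
        # Return samples with start <= ts < end, located by binary search.
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))

    def expire(self, now):
        # Drop samples older than the TTL; returns how many were removed.
        cut = bisect.bisect_left(self._timestamps, now - self.ttl_seconds)
        del self._timestamps[:cut]
        del self._values[:cut]
        return cut


buf = SeriesBuffer(ttl_seconds=100)
buf.write(30, "c")
buf.write(10, "a")  # arrives late
buf.write(20, "b")  # arrives late
recent = buf.read_slice(10, 30)
removed = buf.expire(now=125)  # cutoff is 125 - 100 = 25
remaining = buf.read_slice(0, 1000)
```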
  • 32. What DB type is a good fit for TS use cases?
  • 35. Data Access Patterns (Category: Description; R:W %)
      • Metadata & Profiles, Devices & Users: Many low latency small reads - all over the dataset. Occasional updates – possibly by different “actors” (web, device, app), conflicts need to be prevented or resolved. Fewer creates and deletes. R:W 90:10
      • Time Series, Ingested (“Raw”): Very high throughput of relatively small writes. Most reads are over a recent time range “slice”. Updates are rare (corrections). This category is the biggest part of the IoT application dataset. R:W 10:90
      • Time Series, Aggregated (“Derived”): Mostly reads – users, platform services, reports. Writes are periodic, on each time interval or from batch jobs. R:W 80:20
  • 36. Database Type For IoT or Time Series
      • Relational: MySQL, PostgreSQL, Oracle
      • Key Value: Riak KV, DynamoDB, Voldemort
      • Document: MongoDB, CouchBase, RethinkDB
      • Wide Column: Cassandra, HBase, Accumulo
      • Graph: Neo4J, Titan, Infinite Graph
      We need a new type of NoSQL database – Time Series. None of the existing DB types was designed to handle time series data:
      • Wide column DBs have high write throughput, but reads and updates are not their strength
      • Key Value and Document DBs handle metadata well, but struggle with heavy writes and time-slicing reads
      • Relational - good with metadata (unless number of updates is high), but a bad choice for TS data
      • Graph DB – not a good choice for either time series or metadata, can be added later on
  • 37. Database Type For IoT or Time Series
      • Relational: MySQL, PostgreSQL, Oracle
      • Key Value: Riak KV, DynamoDB, Voldemort
      • Document: MongoDB, CouchBase, RethinkDB
      • Wide Column: Cassandra, HBase, Accumulo
      • Graph: Neo4J, Titan, Infinite Graph
      • Time Series: InfluxDB, Riak TS, Blueflood, KairosDB, Prometheus, Druid, OpenTSDB, Dalmatiner, Graphite
  • 38. IoT Sensor Data – Hot to Cold
  • 39. SENSORS DATA – HOT N’ COLD (Temp, Purpose: Description; Immutable?)
      • Boiling Hot, App usage: Last known value(s) and/or for last N minutes, useful for immediate responses, very frequently accessed. Immutable? No
      • Hot, Operational dataset: Last 24 hours to several days or weeks (rarely months), frequently accessed, dashboards and online analytics. Immutable? Almost*
      • Warm, Historical data: Older data, less frequently accessed, used mostly for offline analytics and historical analysis. Immutable? Yes
      • Cold, Archives: Used only in rare situations, kept in long term storage for regulatory or unpredicted purposes. Immutable? Yes
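The hot-to-cold classification above is essentially a lookup on the age of a sample. A minimal sketch follows; the concrete cutoffs (15 minutes, 30 days, 365 days) are assumptions chosen for illustration, since the slide only gives ranges such as "last N minutes" and "several days or weeks", and the function name `tier_for_age` is hypothetical.

```python
from datetime import timedelta

# Assumed cutoffs, checked in order; anything older than the last one is "cold".
TIER_LIMITS = [
    (timedelta(minutes=15), "boiling hot"),  # last known values, app usage
    (timedelta(days=30), "hot"),             # operational dataset
    (timedelta(days=365), "warm"),           # historical data
]


def tier_for_age(age):
    """Map the age of a sample (a timedelta) to a temperature tier."""
    for limit, name in TIER_LIMITS:
        if age <= limit:
            return name
    return "cold"  # archives


examples = {
    "5 minutes": tier_for_age(timedelta(minutes=5)),
    "2 days": tier_for_age(timedelta(days=2)),
    "90 days": tier_for_age(timedelta(days=90)),
    "800 days": tier_for_age(timedelta(days=800)),
}
```

A periodic job could use such a mapping to decide when to move data from the database to object storage or archive; the right cutoffs depend on how the application actually reads its data.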
  • 40. STORAGE TIERS – FROM HOT TO COLD: RAM → Database (TSDB) → Object Storage → Archive (diagram label: Data Lake)
      (Temp, Purpose: Storage and products; Immutable?)
      • Boiling Hot, App usage: Internal app cache, Redis or Memcached. Immutable? No
      • Hot, Operational dataset: NoSQL Database (preferably Time Series DB) – Riak TS, OpenTSDB, KairosDB, Cassandra, HBase. Immutable? Almost*
      • Warm, Historical data: Object storage – HDFS (Hadoop), Ceph, Minio, Riak S2 or AWS S3. Immutable? Yes
      • Cold, Archives: Various. Immutable? Yes
  • 41. STORAGE TIERS – REALITY CHECK: RAM → Database (TSDB) → Object Storage → Archive, i.e. ElastiCache (Redis) → Database (Postgres, DynamoDB) → AWS S3 → Glacier (diagram label: Data Lake)
      (Temp, AWS Service: Storage price, GB per month)
      • Boiling Hot, ElastiCache (Redis): $15-45
      • Hot, DynamoDB or RDS (Postgres): $0.25-0.35 (SSD), from $0.1 (Magnetic)
      • Warm, Simple Storage Service (S3): $0.024 to $0.030
      • Cold, Glacier: $0.007
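The price gap between tiers is easier to appreciate as a worked calculation. The sketch below uses only the per-GB monthly prices quoted on this slide (SSD pricing for the hot tier); the dictionary layout and the function name `monthly_cost` are assumptions made for the example, and the prices are those of the talk, not current AWS pricing.

```python
# (low, high) storage price in USD per GB per month, as quoted on the slide.
PRICE_PER_GB_MONTH = {
    "boiling hot": (15.0, 45.0),   # in-memory cache (Redis)
    "hot": (0.25, 0.35),           # database on SSD
    "warm": (0.024, 0.030),        # object storage (S3)
    "cold": (0.007, 0.007),        # archive (Glacier)
}


def monthly_cost(gigabytes):
    """Return {tier: (low_usd, high_usd)} for storing `gigabytes` for a month."""
    return {
        tier: (round(low * gigabytes, 2), round(high * gigabytes, 2))
        for tier, (low, high) in PRICE_PER_GB_MONTH.items()
    }


costs = monthly_cost(1000)  # roughly one terabyte
```

For 1,000 GB this gives about $15,000 to $45,000 per month in the cache tier, $250 to $350 in the database tier, $24 to $30 in object storage and about $7 in the archive, which is why only a small working set should stay in the hottest tiers.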
  • 42. OSS technologies for scalable IoT apps (Component: Open Source Technologies)
      • Load balancer: Nginx, HAProxy
      • Ingestion: Kafka, RabbitMQ, ZeroMQ, Flume
      • Stream Computing: Spark Streaming, Apache Flink, Kafka Streams, Samza
      • Time Series Store: InfluxDB, KairosDB, Riak, Cassandra, OpenTSDB
      • Profiles Store: CouchBase, Riak, MySQL, Postgres, MongoDB
      • Search: Solr, Elasticsearch
      • Object Storage: HDFS (Hadoop), Minio, Riak S2, Ceph
      • Analytics Framework: Apache Spark (& MLlib), MapReduce, Hive
      • SQL Query Engine: Spark SQL, Presto, Impala, Drill
      • Cluster Manager: Mesosphere DC/OS or Mesos, Kubernetes, Docker Swarm
  • 43. Checklist for IoT technology stack ❑ Is it vendor lock-in or open source software? Are there open APIs? ❑ Can it be deployed in the cloud? At the edge? In a data center? Using a hybrid approach? ❑ Can it be used for free or at low cost (no big upfront investment)? ❑ Are the components pre-integrated, or can they be easily integrated together? ❑ Can you develop your app on your laptop? How many “moving parts”? ❑ Can you easily scale each component in this architecture by 2x? 10x? 50x? ❑ Is there a roadmap, actively worked on, which is aligned with your vision? ❑ Is there a company behind the technology to provide 24x7 support when needed?
  • 44. OSS technologies for IoT apps - the “opinionated” choice (Component: Open Source Technologies)
      • Load balancer: HAProxy
      • Ingestion: Apache Kafka
      • Stream Computing: Spark Structured Streaming
      • Time Series Store: Riak (TS tables)
      • Profiles Store: Riak (KV buckets)
      • Search: Riak Search (based on Solr)
      • Object Storage: Riak S2
      • Analytics Framework: Apache Spark (& MLlib)
      • SQL Query Engine: Apache Spark SQL
      • Cluster Manager: Mesosphere DC/OS or Kubernetes
  • 45. • Riak TS (Time Series) - highly scalable NoSQL database for IoT and Time Series … and more • Riak Spark Connector for Apache Spark • Riak Integrations with Redis and Kafka • Riak Mesos Framework (RMF) for DC/OS
  • 46. Thank You! Contact me at [pavel at basho dot com]