Analyzing Time-Series Data with
Apache Spark and Cassandra
Andrew Psaltis
HDF / IoT Product Solution Architect
@itmdata
StampedeCon 2016
If you ever wanted to…
Build models over measurements coming in
every second from sensors across the
world?
Dig into intra-day trading prices of millions
of financial instruments?
Compare hourly page view statistics across
every page on Wikipedia?
You need to do it over a large
sequence of measurements
over time.
A problem perfect for
Cassandra and Spark
Time-series data
consists of sequences
of measurements,
each occurring at a
point in time.
Example: Weather Station
Weather station collects data
Cassandra stores in sequence
Application reads in sequence
Query use cases
Get weather data given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
Aggregation use cases
Get temperature stats given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
Cassandra Overview
Cassandra architecture is
Shared nothing[1]
Masterless peer-to-peer
Shard Free
Based on Amazon Dynamo and Google BigTable
[1] 1986 paper “The Case for Shared Nothing” - http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
Row
Partition
Table
Keyspace
Table 1 Table 2
Partition Key Clustering Columns
Order Override
Partition Key   Clustering Columns
10010:99999     2016:07:28:12   -5.6
                2016:07:28:11   -5.1
                2016:07:28:10   -4.9
                2016:07:28:09   -5.3
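The layout above can be mimicked in a few lines to show why the three query use cases are cheap once rows are grouped by station and sorted by time. This is a minimal in-memory sketch, not Cassandra code: the class name `WeatherTable` and its methods are hypothetical, and the sample rows are read as hourly readings with timestamps in the `YYYY:MM:DD:HH` form, so string order matches time order.

```python
from bisect import insort
from collections import defaultdict


class WeatherTable:
    """Toy model: one partition per station, rows sorted by time within it."""

    def __init__(self):
        # partition key (station id) -> sorted list of (timestamp, temperature)
        self._partitions = defaultdict(list)

    def insert(self, station_id, timestamp, temperature):
        # insort keeps each partition ordered by the clustering column.
        insort(self._partitions[station_id], (timestamp, temperature))

    def by_station(self, station_id):
        """All rows for a station, newest first (descending clustering order)."""
        return list(reversed(self._partitions[station_id]))

    def at_time(self, station_id, timestamp):
        """Rows for a station at exactly one point in time."""
        return [row for row in self._partitions[station_id] if row[0] == timestamp]

    def in_range(self, station_id, start, end):
        """Rows for a station with start <= timestamp <= end, newest first."""
        rows = [r for r in self._partitions[station_id] if start <= r[0] <= end]
        return list(reversed(rows))


table = WeatherTable()
for ts, temp in [("2016:07:28:09", -5.3), ("2016:07:28:10", -4.9),
                 ("2016:07:28:11", -5.1), ("2016:07:28:12", -5.6)]:
    table.insert("10010:99999", ts, temp)
```

Every query starts from a single partition key, so only one station's rows are ever touched; the time filters then work on data that is already in order.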
Primary key relationship
Tokens
Consistent hash between -2^63 and 2^63 - 1
Each node owns a range of those values
The token is the beginning of that range to
the next node’s token value
Virtual Nodes break these down further
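The ownership rule above can be sketched as a lookup over sorted range boundaries. This is a simplified illustration under stated assumptions: it uses the 0-100 token space and node addresses from the deck's replication slides rather than real 64-bit tokens, it ignores virtual nodes, and MD5 stands in for the partitioner's hash function. The names `RING`, `token_for`, and `owner_of` are hypothetical.

```python
import hashlib
from bisect import bisect_right

# Simplified ring: (first token of the range, node). The 0-100 token space and
# the node addresses mirror the deck's replication slides, not real tokens.
RING = [(0, "10.0.0.1"), (26, "10.0.0.2"), (51, "10.0.0.3"), (76, "10.0.0.4")]
RANGE_STARTS = [start for start, _ in RING]


def token_for(partition_key):
    """Hash a partition key onto the 0-100 ring (MD5 is an assumption here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 101


def owner_of(token):
    """Return the node whose range contains the token."""
    index = bisect_right(RANGE_STARTS, token) - 1
    return RING[index][1]


station_token = token_for("10010:99999")
station_owner = owner_of(station_token)
```

Because the hash depends only on the partition key, every reading from one weather station lands on the same node.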
Replication
DC 1, RF: 1
Node      Primary
10.0.0.1  00-25
10.0.0.2  26-50
10.0.0.3  51-75
10.0.0.4  76-100
Replication
DC 1, RF: 2
Node      Primary  Replica
10.0.0.1  00-25    76-100
10.0.0.2  26-50    00-25
10.0.0.3  51-75    26-50
10.0.0.4  76-100   51-75
Replication
DC 1, RF: 3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50
Replication
[Diagram: the four-node ring in DC 1 with RF: 3. A client writes to partition 15, which falls in range 00-25 and is therefore stored on 10.0.0.1, 10.0.0.2, and 10.0.0.3.]
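The replica placement shown in the replication slides follows a simple pattern: the primary owner of a range, then the next nodes clockwise around the ring. The sketch below reproduces that pattern for the deck's simplified 0-100 token space. It is an illustration only; the function name `replicas_for` is hypothetical, and real clusters also account for racks, data centers, and virtual nodes.

```python
from bisect import bisect_right

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
RANGE_STARTS = [0, 26, 51, 76]   # first token of each node's primary range


def replicas_for(token, replication_factor):
    """Primary owner first, then the next nodes clockwise around the ring."""
    if not 1 <= replication_factor <= len(NODES):
        raise ValueError("replication factor must be between 1 and the node count")
    primary = bisect_right(RANGE_STARTS, token) - 1
    return [NODES[(primary + i) % len(NODES)] for i in range(replication_factor)]


# The write to partition 15 with RF 3 from the slide.
write_targets = replicas_for(15, 3)
```

With RF 3 on four nodes, any single token is missing from exactly one node, which matches the table where each node holds three of the four ranges.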
Multi-Datacenter
[Diagram: two four-node rings, DC 1 (RF: 3) and DC 2 (RF: 3), each holding every token range on three nodes. A client writes to partition 15.]
Query use cases
Get weather data given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
Spark Overview
What is Spark used for?
Fast and general purpose engine for large scale
data processing
Provides a framework that supports In-Memory
Cluster Computing
Designed for iterative computations and interactive
data mining
Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, ...) or other RDDs
• Immutable
• Partitioned
• Reusable
RDD Partitioning
Number of RDD partitions will
control how many parallel tasks
can be run against the data stored
in the RDD
Hint: in general, make it at least as large as the number of CPU cores in your cluster
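The link between partition count and parallelism can be shown without Spark. In this minimal sketch, worker threads stand in for CPU cores and plain lists stand in for RDD partitions; `split_into_partitions` and `run_tasks` are hypothetical helper names, not Spark functions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return [records[i::num_partitions] for i in range(num_partitions)]


def run_tasks(partitions, task, max_workers):
    """Run one task per partition; at most max_workers run at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, partitions))


readings = [-5.3, -4.9, -5.1, -5.6, -4.2, -3.8, -4.0, -4.4]
partitions = split_into_partitions(readings, 4)
partial_sums = run_tasks(partitions, sum, max_workers=4)
```

With only two partitions and four workers, two workers would have nothing to do, which is the reasoning behind the hint above.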
RDD Operations
• Transformations - similar to the Scala collections API
  • Produce new RDDs
  • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
  • Require materialization of the records to generate a value
  • collect: Array[T], count, fold, reduce, ...
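The split between transformations and actions is easiest to see in a toy model. The class below is a hypothetical `MiniRDD`, written for illustration and not part of Spark: transformations only record a step and return a new object, while actions walk the recorded steps and produce a value.

```python
from functools import reduce


class MiniRDD:
    """Toy immutable dataset: transformations are lazy, actions evaluate."""

    def __init__(self, source, steps=()):
        self._source = source      # any re-iterable collection
        self._steps = steps        # tuple of (kind, function) pairs

    # Transformations: return a new MiniRDD, compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._source, self._steps + (("filter", predicate),))

    def _records(self):
        records = iter(self._source)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    # Actions: materialize the records to generate a value.
    def collect(self):
        return list(self._records())

    def count(self):
        return sum(1 for _ in self._records())

    def reduce(self, fn):
        return reduce(fn, self._records())


temps = MiniRDD([-5.3, -4.9, -5.1, -5.6])
above = temps.filter(lambda t: t > -5.2)   # nothing is computed yet
how_many = above.count()                    # the action triggers evaluation
```

Because each transformation returns a new object, `temps` is unchanged and can be reused for other pipelines.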
Data Locality
Spark asks an RDD for a list of its partitions (splits)
Each split consists of one or more token-ranges
For every partition:
•Spark gets a list of preferred nodes to process on
from RDD
•Spark creates a task and sends it to one of the
nodes for execution
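The scheduling steps above can be sketched as a small assignment loop. Everything here is hypothetical: the `SPLITS` table, the node addresses, and the least-loaded tie-break are assumptions for illustration, and Spark's actual scheduler is considerably more involved.

```python
from collections import Counter

# Hypothetical splits: each lists the nodes that hold its token ranges locally.
SPLITS = {
    "split-0": ["10.0.0.1", "10.0.0.2"],
    "split-1": ["10.0.0.2", "10.0.0.3"],
    "split-2": ["10.0.0.3", "10.0.0.4"],
    "split-3": ["10.0.0.4", "10.0.0.1"],
}


def assign_tasks(splits):
    """Send one task per split to its least-loaded preferred node."""
    load = Counter()
    assignment = {}
    for split, preferred_nodes in splits.items():
        node = min(preferred_nodes, key=lambda n: load[n])
        assignment[split] = node
        load[node] += 1
    return assignment


assignment = assign_tasks(SPLITS)
```

Each task ends up on a node that already stores the data it needs, so rows do not have to cross the network before processing starts.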
What is Spark Streaming?
•Provides efficient, fault-tolerant stateful stream
processing
•Provides a simple API for implementing complex
algorithms
•Integrates with Spark’s batch and interactive
processing
•Integrates with other Spark extensions
Spark Streaming Overview
Discretized Streams (DStreams)
•The basic abstraction provided by Spark Streaming
•Continuous series of RDDs
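The idea of a stream as a continuous series of small datasets can be sketched by cutting timestamped events into fixed-width batches. This is a plain-Python illustration, not the Spark Streaming API; the function name `discretize` and the event format are assumptions.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (epoch_seconds, value) events into consecutive fixed-width batches."""
    if batch_seconds <= 0:
        raise ValueError("batch_seconds must be positive")
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = timestamp - timestamp % batch_seconds
        batches[batch_start].append(value)
    # Return batches in time order, like a series of small datasets.
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0, -5.3), (1, -5.2), (5, -5.0), (11, -4.9)]
micro_batches = discretize(events, batch_seconds=5)
```

Each batch can then be handled with the same operations used for batch data. Intervals with no events are simply absent in this sketch.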
Spark on Cassandra
Spark on Cassandra
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs +
Spark DStreams
Spark Cassandra Connector
Spark Cassandra Example
Locating a Row
The Cassandra RDD uses the token range to create node-local Spark partitions
The Spark executor uses the Java driver to pull rows from the local Cassandra instance
[Diagram: two four-node rings, one labeled Transactional and one labeled Analytics, each with nodes 10.0.0.1 (00-25), 10.0.0.2 (26-50), 10.0.0.3 (51-75), and 10.0.0.4 (76-100).]
Batch Weather Station Analysis
Weather Station Analysis
Weather station collects data
Cassandra stores in sequence
Spark rolls up data into new tables
Setup Connection
Get data and aggregate
Store back into Cassandra
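The three steps above (connect, aggregate, store back) were shown as code on the original slides, which did not survive extraction. The sketch below illustrates only the aggregation logic in plain Python and is not the connector API: `RAW_ROWS`, `daily_rollup`, and `daily_table` are hypothetical names, and a dict stands in for the roll-up table.

```python
from collections import defaultdict

# Hypothetical raw rows: (station_id, "YYYY:MM:DD:HH", temperature).
RAW_ROWS = [
    ("10010:99999", "2016:07:28:09", -5.3),
    ("10010:99999", "2016:07:28:10", -4.9),
    ("10010:99999", "2016:07:28:11", -5.1),
    ("10010:99999", "2016:07:28:12", -5.6),
    ("10010:99999", "2016:07:29:09", -3.0),
]


def daily_rollup(rows):
    """Aggregate hourly rows into one (low, high, mean) record per station and day."""
    grouped = defaultdict(list)
    for station_id, timestamp, temperature in rows:
        day = timestamp[:10]                      # "YYYY:MM:DD"
        grouped[(station_id, day)].append(temperature)
    return {
        key: {"low": min(temps), "high": max(temps), "mean": sum(temps) / len(temps)}
        for key, temps in grouped.items()
    }


# "Store back": the roll-up table is just a dict keyed by (station, day) here.
daily_table = daily_rollup(RAW_ROWS)
```

Keying the result by station and day keeps the roll-up table in the same shape as the raw table, so the temperature-stats queries become single-partition reads.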
Aggregation use cases
Get temperature stats given:
• Weather Station ID
• Weather Station ID and Time
• Weather Station ID and Range of Time
Weather Station Stream Analysis
Weather station collects data
Data processed in stream
Cassandra stores in sequence
Weather Station Stream Analysis
Counter
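A counter over a stream needs state that survives from one batch to the next. The class below is a minimal sketch of that idea in plain Python, assuming each micro-batch arrives as a list of `(station_id, temperature)` pairs; `RunningStats` is a hypothetical name and not part of Spark Streaming or Cassandra.

```python
class RunningStats:
    """Per-station count and mean, updated one micro-batch at a time."""

    def __init__(self):
        self._state = {}   # station_id -> (count, total)

    def update(self, batch):
        """Fold a batch of (station_id, temperature) readings into the state."""
        for station_id, temperature in batch:
            count, total = self._state.get(station_id, (0, 0.0))
            self._state[station_id] = (count + 1, total + temperature)

    def count(self, station_id):
        return self._state.get(station_id, (0, 0.0))[0]

    def mean(self, station_id):
        count, total = self._state[station_id]
        return total / count


stats = RunningStats()
stats.update([("10010:99999", -5.0), ("10010:99999", -4.0)])
stats.update([("10010:99999", -6.0)])
```

Only the count and running total are kept per station, so the state stays small no matter how many readings have passed through.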
To explore at home…
https://github.com/killrweather/killrweather
Thank You


Editor's Notes

  • #2: Analyzing Time-Series Data with Apache Spark and Cassandra
  • #6: What is Time-Series Data? Time-series data consists of sequences of measurements, each occurring at a point in time. A variety of terms are used to describe time-series data, and many of them apply to conflicting or overlapping concepts. In the interest of clarity, in spark-ts, we stick to a particular vocabulary: A time series is a sequence of floating-point values, each linked to a timestamp.
  • #18: Consistent hash between -2^63 and 2^63 - 1 • Each node owns a range of those values • The token is the beginning of that range to the next node’s token value • Virtual Nodes break these down further. Each partition is a 128 bit value
  • #37: Since we are really just creating discrete RDDs, we have the opportunity to combine streaming with the rest of the stack.