Feeding Cassandra with Spark Streaming & Kafka
Cary Bourgeois
Solutions Engineer
DataStax, Central Region
Who Am I
• DataStax < 2 years
• Not a “developer”
• Legacy BI/Database
• Business Objects
• SAP
• Demo Development
• R
• Java (If I have to)
• Scala (Someday)
Cassandra Summit 2015
September 22-24, Santa Clara Convention Center
7,000 Attendees
Last Week - Mission Impossible?

A stretch, but possible.
Sunday Afternoon - I’m getting my A$$ kicked
Monday Afternoon - Arghhhhh!
Monday Night - I got this!
Capture Raw Data
Analyze & ∑ummarize
Why Mess with Success?
• Spark 1.3+
• New/Improved Kafka Support
• DataFrames
• DataStax Enterprise 4.8
• Spark 1.4 support
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second
from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large
organization. It can be elastically and transparently expanded without downtime. Data
streams are partitioned and spread over a cluster of machines to allow data streams larger
than the capability of any single machine and to allow clusters of co-ordinated consumers.
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each
broker can handle terabytes of messages without performance impact.
Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance
guarantees.
• Producers
• Consumers
• Persistence
• Topics
• Partitions
• Replication
http://kafka.apache.org/documentation.html
• Create a Kafka topic
bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic stream_ts
• List all topics
bin/kafka-topics.sh --zookeeper localhost:2181 --list
• Monitor a topic
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic stream_ts --from-beginning
Kafka and the Producer
The Producer App
• Lots of Options
• I chose
• Scala
• Not steep enough
• Akka
• Producing this message
Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152
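The sample message is a single semicolon-delimited line with six fields. As a minimal sketch, assuming those fields line up, in order, with the columns edge_id, sensor, epoch_hr, ts, depth and value of the demo.data table shown on the next slide, formatting and parsing it in plain Scala could look like the following. The names Reading and SensorMessage are hypothetical and are not taken from the demo repository; the actual producer uses Akka, which is not shown here.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical record type; field order assumed from the sample message.
final case class Reading(
    edgeId: String,
    sensor: String,
    epochHr: String,
    ts: LocalDateTime,
    depth: Double,
    value: Double
)

object SensorMessage {
  // Timestamp layout as it appears in the sample message.
  private val tsFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Render a reading as the semicolon-delimited line a producer would send.
  def format(r: Reading): String =
    Seq(r.edgeId, r.sensor, r.epochHr, r.ts.format(tsFormat), r.depth.toString, r.value.toString)
      .mkString(";")

  // Parse one line; returns None when the field count or a field's type is wrong.
  def parse(line: String): Option[Reading] =
    line.split(";", -1) match {
      case Array(edge, sensor, hr, ts, depth, value) =>
        scala.util.Try(
          Reading(edge, sensor, hr, LocalDateTime.parse(ts, tsFormat), depth.toDouble, value.toDouble)
        ).toOption
      case _ => None
    }
}
```

Returning an Option rather than throwing lets a stream consumer drop malformed lines without stopping the job; whether the demo handles bad input this way is not stated in the slides.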
Destination - Cassandra Tables
CREATE TABLE demo.data (
    edge_id text,
    sensor text,
    epoch_hr text,
    ts timestamp,
    depth double,
    value double,
    PRIMARY KEY ((edge_id, sensor, epoch_hr), ts)
);

CREATE TABLE demo.last (
    edge_id text,
    sensor text,
    ts timestamp,
    depth double,
    value double,
    PRIMARY KEY ((edge_id, sensor))
);

CREATE TABLE demo.count (
    pk int,
    ts timestamp,
    count bigint,
    count_ma double,
    PRIMARY KEY (pk, ts)
);
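The partition key of demo.data includes epoch_hr, which keeps each partition bounded to one hour of readings per edge and sensor. The sketch below assumes epoch_hr is the number of whole hours since the Unix epoch; that reading is consistent with the sample message, where 401843 falls on 2015-11-04 in UTC, but the slides do not say how (or in which time zone) the demo derives it. EpochBucket and its methods are hypothetical names.

```scala
import java.time.Instant

object EpochBucket {
  private val SecondsPerHour = 3600L

  // Whole hours elapsed since 1970-01-01T00:00:00Z for the given instant.
  def epochHour(ts: Instant): Long =
    Math.floorDiv(ts.getEpochSecond, SecondsPerHour)

  // Partition key of demo.data as a tuple: (edge_id, sensor, epoch_hr).
  // epoch_hr is rendered as text to match the column type in the table.
  def partitionKey(edgeId: String, sensor: String, ts: Instant): (String, String, String) =
    (edgeId, sensor, epochHour(ts).toString)
}
```

All readings from one sensor within the same hour share a partition key and are ordered inside it by the ts clustering column; the first reading of the next hour starts a new partition.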
DSE Analytics => Spark
• No ETL
• Spark 1.4.1 certification
• Simplified map and reduce
• Very developer friendly
• SparkSQL
• Spark Streaming
• Machine Learning
• DSE Analytics and Search Integration
• Cassandra benefits (scaling, availability)
“I want to do processing on data before it hits Cassandra.”
“I need my sums, avgs, group by’s, etc.”
“I want to run real-time analytics on my Cassandra data.”
Processing the Stream
• Simple Scala Job
• Deal with the raw flow
• Capture the raw data
• Capture the latest sensor reading
• Summarize and Analyze
• Windowing the Stream
• Count records every x seconds
• Calculate a moving average of those counts over a number of periods
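The two summary steps, counting records every x seconds and averaging those counts over a number of periods, can be illustrated with plain Scala collections. This is only a sketch of the arithmetic behind the count and count_ma columns of demo.count; it is not Spark Streaming code and does not reproduce the demo job, whose windowing calls are not shown in the slides. WindowStats and its method names are hypothetical.

```scala
object WindowStats {
  // Count events per fixed window. Each element of eventTimes is an event time
  // in seconds; windowSec is the "x seconds" from the slide.
  // Returns (windowStart, count) pairs in time order, including empty windows.
  def countPerWindow(eventTimes: Seq[Long], windowSec: Long): Seq[(Long, Long)] = {
    require(windowSec > 0, "windowSec must be positive")
    if (eventTimes.isEmpty) Seq.empty
    else {
      val counts = eventTimes
        .groupBy(t => Math.floorDiv(t, windowSec) * windowSec)
        .map { case (start, events) => start -> events.size.toLong }
      val first = counts.keys.min
      val last = counts.keys.max
      (first to last by windowSec).map(start => start -> counts.getOrElse(start, 0L))
    }
  }

  // Trailing moving average over the most recent `periods` counts.
  // Early windows average over however many counts exist so far.
  def movingAverage(counts: Seq[Long], periods: Int): Seq[Double] = {
    require(periods > 0, "periods must be positive")
    counts.indices.map { i =>
      val window = counts.slice(math.max(0, i - periods + 1), i + 1)
      window.sum.toDouble / window.size
    }
  }
}
```

For example, with 5-second windows the event times 0, 1, 4, 5, 6, 7 and 16 give counts of 3, 3, 0 and 1, and a two-period moving average of 3.0, 3.0, 1.5 and 0.5.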
Full Demo
Next Steps
• SparkR
• MLlib workflows
• Notebooks
• Spark
• Jupyter
If you would like the code:
https://github.com/CaryBourgeois/KafkaSparkCassandraDemo
