OLAP WITH SPARK AND 
CASSANDRA 
#CassandraSummit 
EVAN CHAN 
SEPT 2014
WHO AM I? 
Principal Engineer, Socrata, Inc.
@evanfchan
http://github.com/velvia
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE 
PEOPLE. 
data.edmonton.ca finances.worldbank.org data.cityofchicago.org 
data.seattle.gov data.oregon.gov data.wa.gov 
www.metrochicagodata.org data.cityofboston.gov 
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov 
data.nola.gov data.illinois.gov data.colorado.gov 
data.austintexas.gov data.undp.org www.opendatanyc.com 
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it 
data.montgomerycountymd.gov data.cityofnewyork.us 
data.acgov.org data.baltimorecity.gov data.energystar.gov 
data.somervillema.gov data.maryland.gov data.taxpayer.net 
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT SOCRATA 
Tens of thousands of datasets, each one up to 30 million rows 
Customer demand for billion row datasets 
Want to analyze across datasets
BIG DATA AT OOYALA 
2.5 billion analytics pings a day = almost a trillion events a 
year. 
Roll up tables - 30 million rows per day
HOW CAN WE ALLOW CUSTOMERS TO QUERY A 
YEAR'S WORTH OF DATA? 
Flexible - complex queries included 
Sometimes you can't denormalize your data enough 
Fast - interactive speeds 
Near Real Time - can't make customers wait hours before 
querying new data
RDBMS? POSTGRES? 
Start hitting latency limits at ~10 million rows 
No robust and inexpensive solution for querying across shards 
No robust way to scale horizontally 
Postgres runs each query on a single thread unless you partition
(painful!)
Complex and expensive to improve performance (e.g. rollup
tables, huge expensive servers)
OLAP CUBES? 
Materialize summary for every possible combination 
Too complicated and brittle 
Takes forever to compute - not for real time 
Explodes storage and memory
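A rough count shows why materializing every combination explodes. The sketch below is illustrative only: the dimension cardinalities are made-up assumptions, not figures from the talk. It counts the group-by combinations for n dimensions (2^n) and an upper bound on the summary cells a full cube could hold.

```scala
object CubeSize {
  // Number of distinct group-by combinations (cuboids) for n dimensions: 2^n.
  def cuboids(numDimensions: Int): BigInt = BigInt(2).pow(numDimensions)

  // Upper bound on cells in a full cube: each dimension is either fixed to one
  // of its values or rolled up ("ALL"), giving a product of (cardinality + 1).
  def maxCells(cardinalities: Seq[Int]): BigInt =
    cardinalities.map(c => BigInt(c) + 1).product

  def main(args: Array[String]): Unit = {
    // Hypothetical cardinalities, e.g. country, device, hour of day, campaign.
    val cards = Seq(200, 50, 24, 1000)
    println(s"cuboids:   ${cuboids(cards.size)}")
    println(s"max cells: ${maxCells(cards)}")
  }
}
```

Even four modest dimensions allow hundreds of millions of summary cells, and each added dimension doubles the number of cuboids.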
When in doubt, use brute force 
- Ken Thompson
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
CASSANDRA 
Horizontally scalable 
Very flexible data modelling (lists, sets, custom data types) 
Easy to operate 
No fear of number of rows or documents 
Best of breed storage technology, huge community 
BUT: Simple queries only
APACHE SPARK 
Horizontally scalable, in-memory queries 
Functional Scala transforms - map, filter, groupBy, sort 
etc. 
SQL, machine learning, streaming, graph, R, many more plugins 
all on ONE platform - feed your SQL results to a logistic 
regression, easy! 
THE Hottest big data platform, huge community, leaving 
Hadoop in the dust 
Developers love it
SPARK PROVIDES THE MISSING FAST, DEEP 
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA 
Scala solutions: 
DataStax integration:
https://github.com/datastax/spark-cassandra-connector
(CQL-based)
Calliope
A bit more work: 
Use traditional Cassandra client with RDDs 
Use an existing InputFormat, like CqlPagedInputFormat 
Only reason to go here is probably you are not on CQL version of 
Cassandra, or you're using Shark/Hive.
A SPARK AND CASSANDRA 
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS 
Combine best of breed storage and query platforms 
Take full advantage of evolution of each 
Storage handles replication for availability 
Query can replicate data for scaling read concurrency - 
independent!
SCALE NODES, NOT 
DEVELOPER TIME!!
KEEPING IT SIMPLE 
Maximize row scan speed 
Columnar representation for efficiency 
Compressed bitmap indexes for fast algebra 
Functional transforms for easy memoization, testing, 
concurrency, composition
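The "bitmap indexes for fast algebra" idea can be sketched as follows. This is a minimal illustration, not the engine from the talk: it uses the JDK's uncompressed java.util.BitSet as a stand-in for a compressed bitmap, and the column names and values are hypothetical.

```scala
import java.util.BitSet

object BitmapIndexSketch {
  // Build one bitmap per distinct value: bit i is set when row i has that value.
  def buildIndex(column: Array[String]): Map[String, BitSet] = {
    val index = scala.collection.mutable.Map.empty[String, BitSet]
    for (i <- column.indices)
      index.getOrElseUpdate(column(i), new BitSet(column.length)).set(i)
    index.toMap
  }

  // AND two bitmaps without mutating either input.
  def and(a: BitSet, b: BitSet): BitSet = {
    val out = a.clone().asInstanceOf[BitSet]
    out.and(b)
    out
  }

  // Hypothetical columns, one entry per row.
  val crimeType: Array[String] = Array("Burglary", "Theft", "Theft", "Burglary", "Theft")
  val district: Array[String]  = Array("North", "North", "South", "South", "North")

  def main(args: Array[String]): Unit = {
    val byType = buildIndex(crimeType)
    val byDistrict = buildIndex(district)
    // WHERE crimeType = 'Theft' AND district = 'North' becomes one bitwise AND.
    val matches = and(byType("Theft"), byDistrict("North"))
    println(s"matching rows: $matches, count = ${matches.cardinality}")
  }
}
```

A filter with several predicates reduces to bitwise operations on the per-value bitmaps, so no row data is touched until the matching row ids are known.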
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
INITIAL ATTEMPTS 
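import org.apache.spark.SparkContext._  // pair-RDD operations such as groupByKey (needed before Spark 1.3)

// Each row is an untyped Seq[Any]: (crime type, address, count)
val rows = Seq(
  Seq("Burglary", "19xx Hurston", 10),
  Seq("Theft", "55xx Floatilla Ave", 5)
)
sc.parallelize(rows)
  .map { values => (values(0), values) }  // key each row by its first column
  .groupByKey
  // sum the third column per key; the cast is needed because values are Any
  .mapValues(rows => rows.map(r => r(2).asInstanceOf[Int]).sum)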
No existing generic query engine for Spark when we started 
(Shark was in infancy, had no indexes, etc.), so we built our own 
For every row, need to extract out needed columns 
Ability to select arbitrary columns means using Seq[Any], no 
type safety 
Boxing makes integer aggregation very expensive and memory 
inefficient
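The boxing cost can be seen in a small, Spark-free sketch. It contrasts summing a column out of Seq[Any] rows with summing the same values held in a primitive Array[Int]. The object and method names are illustrative, and no timings are claimed here.

```scala
object BoxingSketch {
  // Row-oriented: each row is a Seq[Any], so the Int in the chosen column is a
  // boxed java.lang.Integer and must be cast on every access.
  def sumRows(rows: Seq[Seq[Any]], col: Int): Long = {
    var total = 0L
    for (row <- rows) total += row(col).asInstanceOf[Int]
    total
  }

  // Column-oriented: the same values as a primitive Array[Int], no casts or boxes.
  def sumColumn(values: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < values.length) { total += values(i); i += 1 }
    total
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Seq[Any]] = Seq(
      Seq("Burglary", "19xx Hurston", 10),
      Seq("Theft", "55xx Floatilla Ave", 5)
    )
    val counts = Array(10, 5)
    println(sumRows(rows, 2))
    println(sumColumn(counts))
  }
}
```

Both functions return the same total; the difference is that the second one walks a contiguous block of ints, which is what the columnar design later in the deck builds on.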
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage 
approach is dead 
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE 
Same layout in memory and on disk: 
Name Age 
Barak 46 
Hillary 66 
Each row is stored contiguously. All columns in row 2 come after 
row 1.
COLUMNAR STORAGE (MEMORY) 
Name column 
0 1 
0 1 
Dictionary: {0: "Barak", 1: "Hillary"} 
Age column 
0 1 
46 66
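The dictionary-encoded name column above can be sketched in plain Scala. This is an assumption-laden illustration (the Encoded type and encode function are hypothetical names, not part of the talk's code): strings are replaced by small integer codes assigned in order of first appearance.

```scala
object DictionaryColumn {
  // Encoded form of a string column: small integer codes plus a lookup table.
  final case class Encoded(codes: Array[Int], dictionary: Vector[String]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  // Assign codes in order of first appearance.
  def encode(values: Seq[String]): Encoded = {
    val codeOf = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    val codes = values.map(v => codeOf.getOrElseUpdate(v, codeOf.size)).toArray
    Encoded(codes, codeOf.keys.toVector)
  }

  def main(args: Array[String]): Unit = {
    val names = encode(Seq("Barak", "Hillary", "Barak"))
    println(names.codes.mkString(" "))  // the integer codes stored per row
    println(names.dictionary)           // the code -> string lookup table
    println(names.decode(2))            // row 2 decoded back to its string
  }
}
```

For a low-cardinality column the codes array stays small and fixed-width, while each distinct string is stored only once.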
COLUMNAR STORAGE (CASSANDRA) 
Review: each physical row in Cassandra (i.e. everything under one
partition key) stores its columns together on disk.
Schema CF 
Rowkey Type 
Name StringDict 
Age Int 
Data CF 
Rowkey 0 1 
Name 0 1 
Age 46 66
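The two column families above can be modeled with plain Scala maps to show the access pattern. This is a simplified sketch under stated assumptions: cell values are modeled as Int for brevity (real cells would be serialized bytes), and the type alias and function names are hypothetical.

```scala
import scala.collection.immutable.SortedMap

object ColumnarLayoutSketch {
  // Data CF: partition key (dataset column name) -> cells ordered by row index.
  type DataCF = Map[String, SortedMap[Int, Int]]

  // Schema CF: dataset column name -> storage type.
  val schemaCF: Map[String, String] = Map("Name" -> "StringDict", "Age" -> "Int")

  val dataCF: DataCF = Map(
    "Name" -> SortedMap(0 -> 0, 1 -> 1),    // dictionary codes
    "Age"  -> SortedMap(0 -> 46, 1 -> 66)
  )

  // Reading one dataset column is a single-partition lookup; other columns
  // are never touched.
  def readColumn(cf: DataCF, column: String): Vector[Int] =
    cf.getOrElse(column, SortedMap.empty[Int, Int]).values.toVector

  def main(args: Array[String]): Unit =
    println(readColumn(dataCF, "Age"))
}
```

Because each dataset column lives under its own partition key, a query that needs only Age reads one physical row and skips Name entirely.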
ADVANTAGES OF COLUMNAR STORAGE 
Compression 
Dictionary compression - HUGE savings for low-cardinality 
string columns 
RLE 
Reduce I/O 
Only columns needed for query are loaded from disk 
Can keep strong types in memory, avoid boxing 
Batch multiple rows in one cell for efficiency
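Run-length encoding (RLE), mentioned above, is easy to sketch. The code below is a minimal illustration with hypothetical names, not the encoder used in the talk: consecutive repeats of a value collapse into (value, runLength) pairs.

```scala
object RunLength {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (runs, v) =>
      runs.lastOption match {
        case Some((prev, n)) if prev == v => runs.init :+ ((prev, n + 1))
        case _                            => runs :+ ((v, 1))
      }
    }

  // Expand the pairs back into the original sequence.
  def decode(runs: Seq[(Int, Int)]): Vector[Int] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector

  def main(args: Array[String]): Unit = {
    val codes = Seq(0, 0, 0, 1, 1, 0)
    val runs = encode(codes)
    println(runs)
    println(decode(runs) == codes.toVector)
  }
}
```

RLE pays off most on sorted or slowly changing columns, such as dictionary codes for a low-cardinality field.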
ADVANTAGES OF COLUMNAR QUERYING 
Cache locality for aggregating column of data 
Take advantage of CPU/GPU vector instructions for ints / 
doubles 
Avoid row-ifying until last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
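A group-by over columns, with rows assembled only at the end, might look like the sketch below. It assumes a dictionary-encoded key column and an integer value column; the data and names are hypothetical and this is not the engine described in the talk.

```scala
object ColumnarGroupBy {
  // Sum `values` grouped by dictionary code in one pass over two primitive arrays.
  def sumByCode(codes: Array[Int], values: Array[Int], numCodes: Int): Array[Long] = {
    require(codes.length == values.length, "columns must have the same row count")
    val sums = new Array[Long](numCodes)
    var i = 0
    while (i < codes.length) { sums(codes(i)) += values(i); i += 1 }
    sums
  }

  def main(args: Array[String]): Unit = {
    val dictionary = Vector("Burglary", "Theft")
    val typeCodes  = Array(0, 1, 1, 0, 1)   // dictionary-encoded key column
    val counts     = Array(10, 5, 7, 2, 1)  // value column
    val sums = sumByCode(typeCodes, counts, dictionary.size)
    // Row-ify only at the very end, when producing the result set.
    dictionary.zip(sums).foreach { case (name, total) => println(s"$name -> $total") }
  }
}
```

The inner loop reads two contiguous int arrays and writes one small array of sums, which is the cache-friendly access pattern the slide describes.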
COLUMNAR QUERY ENGINE VS ROW-BASED IN 
SCALA 
Custom RDD of column-oriented blocks of data 
Uses ~10x less heap 
10-100x faster for group by's on a single node 
Scan speed in excess of 150M rows/sec/core for integer 
aggregations
SO, GREAT, OLAP WITH CASSANDRA AND 
SPARK. NOW WHAT?
DATASTAX: CASSANDRA SPARK INTEGRATION 
DataStax Enterprise now comes with HA Spark
(HA master, that is)
spark-cassandra-connector
SPARK SQL 
Appeared with Spark 1.0 
In-memory columnar store 
Can read from Parquet and JSON now; direct Cassandra 
integration coming 
Querying is not column-based (yet) 
No indexes 
Write custom functions in Scala .... take that Hive UDFs!! 
Integrates well with MLBase, Scala/Java/Python
CACHING A SQL TABLE FROM CASSANDRA 
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD      // implicit RDD -> SchemaRDD conversion

sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")
  .registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER ...
Remember Spark is lazy, nothing is executed until the
collect()
In Spark 1.1+: registerTempTable
SOME PERFORMANCE NUMBERS 
GDELT dataset, 117 million rows, 57 columns, ~50GB 
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory 
Query                                                        Avg time (sec)
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN'   0.49
SELECT 4 columns Top K                                       1.51
SELECT Top countries by Avg Tone (Group By)                  2.69
IMPORTANT - CACHING 
By default, queries will read data from source - Cassandra - 
every time 
Spark RDD Caching - much faster, but big waste of memory 
(row oriented) 
Spark SQL table caching - fastest, memory efficient
WORK STILL NEEDED 
Indexes 
Columnar querying for fast aggregation 
Tachyon support for Cassandra/CQL 
Efficient reading from columnar storage formats
LESSONS 
Extremely fast distributed querying for these use cases 
Data doesn't change much (and only bulk changes) 
Analytical queries for subset of columns 
Focused on numerical aggregations 
Small numbers of group bys 
For fast query performance, cache your data using Spark SQL 
Concurrent queries are a frontier with Spark. Use additional
Spark contexts.
THANK YOU!
EXTRA SLIDES
EXAMPLE CUSTOM INTEGRATION USING 
ASTYANAX 
val cassRDD = sc.parallelize(rowkeys). 
flatMap { rowkey => 
columnFamily.get(rowkey).execute().asScala 
}
SOME COLUMNAR ALTERNATIVES 
MonetDB and Infobright - true columnar stores (storage +
querying)
Vertica and C-Store
Google BigQuery - columnar cloud database, Dremel based
Amazon Redshift
