Real-time Analytics with Cassandra, Spark and Shark
Who is this guy
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
• Scala/Akka guy
• Very excited by open source, big data projects - share some today
• @evanfchan
Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
• Demo
Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra
CONFIDENTIAL—DO NOT DISTRIBUTE
OOYALA
Powering personalized video experiences across all screens.
COMPANY OVERVIEW
• Founded in 2007
• Commercial launch in 2009
• 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
• Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
• Over 1 billion videos played per month and 2 billion analytic events per day
• 25% of U.S. online viewers watch video powered by Ooyala
TRUSTED VIDEO PARTNER
STRATEGIC PARTNERS
CUSTOMERS
We are a large Cassandra user
• 12 clusters ranging in size from 3 to 115 nodes
• Total of 28TB of data managed over ~200 nodes
• Largest cluster - 115 nodes, 1.92PB storage, 15TB RAM
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
What problem are we trying to solve?
Lots of data, complex queries, answered really quickly... but how??
From mountains of useless data...
To nuggets of truth...
• Quickly
• Painlessly
• At scale?
Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
– Top content for mobile users in France
– Engagement curves for users who watched recommendations
– Data mining, trends, machine learning
The static - dynamic continuum
100% Precomputation
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most common queries
100% Dynamic
• Always compute results from raw data
• Flexible but slow
Where we want to be
Partly dynamic
• Pre-aggregate most common queries
• Flexible, fast dynamic queries
• Easily generate many materialized views
Industry Trends
• Fast execution frameworks
– Impala
• In-memory databases
– VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data frameworks
– Cascading, Hive, Pig
Why Spark and Shark?
“Lightning-fast in-memory cluster computing”
Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~ 15 companies
[Diagram: a Hadoop MapReduce pipeline (HDFS, Map, Reduce, Map, Reduce) next to Spark operations (map(), join(), cache(), transform)]
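Throughput: Memory is king
[Bar chart: throughput for C* with a cold cache, C* with a warm cache, and a Spark RDD; axis ticks 0, 37500, 75000, 112500, 150000; bar values and units not recoverable]
6-node C*/DSE 1.1.9 cluster, Spark 0.7.0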
Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
• Low latency - quick development cycles
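Spark word count example
// `spark` is an existing SparkContext
val file = spark.textFile("hdfs://...")

file.flatMap(line => line.split(" "))  // split each line into words
    .map(word => (word, 1))            // pair each word with a count of 1
    .reduceByKey(_ + _)                // sum the counts for each word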
// Hadoop MapReduce word count (Java)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}
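The Spark version above needs a running SparkContext. To try the same split, pair and sum steps without a cluster, a rough local equivalent on plain Scala collections might look like the sketch below. The object name `LocalWordCount` is made up for illustration, and `groupBy` followed by a sum is only a local stand-in for `reduceByKey`, not how Spark executes it.

```scala
object LocalWordCount {
  // Same three steps as the Spark example, on an in-memory Seq instead of an RDD.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))      // split each line into words
      .filter(_.nonEmpty)                    // drop empty strings left by repeated spaces
      .map(word => (word, 1))                // pair each word with a count of 1
      .groupBy { case (word, _) => word }    // local stand-in for reduceByKey: group by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum the counts per word
}
```

Calling `LocalWordCount.wordCount(Seq("a b a", "b c"))` counts "a" and "b" twice each and "c" once.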
The Spark Ecosystem
Bagel - Pregel on Spark
HIVE on Spark
Spark Streaming - discretized stream processing
Spark
Tachyon - in-memory caching DFS
Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
• Easy Scala/Java integration via Spark - easier than writing UDFs
Our new analytics architecture
How we integrate Cassandra and Spark/Shark
From raw events to fast queries
[Diagram: Ingestion -> C* event store (raw events) -> Spark jobs -> View 1, View 2, View 3 -> Spark (predefined queries) and Shark (ad-hoc HiveQL)]
Our Spark/Shark/Cassandra Stack
[Diagram: three nodes (Node1, Node2, Node3), each running Cassandra, InputFormat, SerDe, Spark Worker and Shark; alongside them a Spark Master and a Job Server]
Event Store Cassandra schema
Event CF
  Row key: 2013-04-05T00:00Z#id1
  Columns: t0 = {event0: a0}, t1 = {event1: a1}, t2 = {event2: a2}, t3 = {event3: a3}, t4 = {event4: a4}
EventAttr CF
  Row key: 2013-04-05T00:00Z#id1
  Columns: ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
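One way to read the schema above is that row keys join a time bucket and an id with '#', the Event CF stores one column per event time, and the second CF stores composite column names of the form attribute:value:time that can serve as an index. The sketch below models that reading with plain Scala values. The object and function names are invented for illustration, and treating the second CF as an attribute index is an assumption, not something the slide states.

```scala
import scala.collection.immutable.SortedMap

object EventStoreSketch {
  // Row key as on the slide: a time bucket, '#', then an id.
  def rowKey(timeBucket: String, id: String): String = s"$timeBucket#$id"

  // Event CF: one wide row; column name = event time, value = event payload.
  val eventRow: SortedMap[String, String] =
    SortedMap("t0" -> "{event0: a0}", "t1" -> "{event1: a1}", "t2" -> "{event2: a2}")

  // Second CF: composite column names "attribute:value:time" (assumed to act as an index).
  val attrColumns: Seq[String] =
    Seq("ipaddr:10.20.30.40:t1", "videoId:45678:t1", "providerId:500:t0")

  // Event times in this row whose attribute has the given value.
  def timesFor(columns: Seq[String], attr: String, value: String): Seq[String] = {
    val prefix = s"$attr:$value:"
    columns.filter(_.startsWith(prefix)).map(_.stripPrefix(prefix))
  }

  // Use the index to fetch only the matching event payloads from the wide row.
  def eventsFor(attr: String, value: String): Seq[String] =
    timesFor(attrColumns, attr, value).flatMap(eventRow.get)
}
```

Under these assumptions, looking up `providerId` 500 touches only column t0 of the wide row instead of scanning every event.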
Unpacking raw events
Row key                  t0                     t1
2013-04-05T00:00Z#id1    {video: 10, type:5}    {video: 11, type:1}
2013-04-05T00:00Z#id2    {video: 20, type:5}    {video: 25, type:9}

UserID  Video  Type
id1     10     5
id1     11     1
id2     20     5
id2     25     9
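The unpacking shown above, one wide row in and one flat record per column out, is essentially a flatMap. A minimal sketch on plain Scala collections follows. The case classes and the `Unpack` object are hypothetical names, and the real job would run the same shape of transformation on a Spark RDD read through the InputFormat.

```scala
final case class RawEvent(video: Int, eventType: Int)
final case class FlatRecord(userId: String, video: Int, eventType: Int)

object Unpack {
  // Wide rows as on the slide: row key -> events in column (time) order.
  val rows: Seq[(String, Seq[RawEvent])] = Seq(
    "2013-04-05T00:00Z#id1" -> Seq(RawEvent(10, 5), RawEvent(11, 1)),
    "2013-04-05T00:00Z#id2" -> Seq(RawEvent(20, 5), RawEvent(25, 9))
  )

  // The user id is the part of the row key after the last '#'.
  def userIdOf(rowKey: String): String =
    rowKey.substring(rowKey.lastIndexOf('#') + 1)

  // Each wide row becomes one flat (UserID, Video, Type) record per column.
  def unpack(wideRows: Seq[(String, Seq[RawEvent])]): Seq[FlatRecord] =
    wideRows.flatMap { case (key, events) =>
      events.map(e => FlatRecord(userIdOf(key), e.video, e.eventType))
    }
}
```

`Unpack.unpack(Unpack.rows)` yields the four rows of the UserID / Video / Type table in the same order.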
Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
– Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
• Implement predicate pushdown for HIVE SerDe’s
– Use your indexes to reduce size of dataset
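To make the tip about sorting row keys by token concrete, here is a small sketch that orders keys by token and cuts them into fixed-size splits, so each split covers a contiguous token range. The `token` function is a stand-in based on `hashCode`, purely an assumption for illustration. A real InputFormat would use the tokens of the cluster's partitioner and map ranges to the nodes that own them.

```scala
object SplitSketch {
  // Stand-in token function (assumption): NOT Cassandra's partitioner.
  def token(rowKey: String): Long = rowKey.hashCode.toLong

  // Sort keys by token, then cut into fixed-size splits so that
  // every split covers one contiguous range of tokens.
  def splits(rowKeys: Seq[String], splitSize: Int): Seq[Seq[String]] =
    rowKeys.sortBy(token).grouped(splitSize).toSeq
}
```

Because each split is contiguous in token order, it can in principle be scheduled on a node that already holds that range, which is the data locality the tip is after.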
Example: OLAP processing
[Diagram: C* events (2013-04-05T00:00Z#id1 -> t0 = {video: 10, type:5}; 2013-04-05T00:00Z#id2 -> t0 = {video: 20, type:5}) -> Spark -> OLAP Aggregates (three partitions, held as cached materialized views) -> Union -> Query 1: Plays by Provider; Query 2: Top content for mobile]
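The flow in this example, rolling events up into aggregates, keeping those aggregates cached, then answering several queries from their union, can be sketched on plain Scala collections as below. The `Agg` fields and the sample numbers are invented for illustration. The real system would hold the views as cached Spark datasets rather than local Seqs.

```scala
final case class Agg(providerId: Int, device: String, videoId: Int, plays: Long)

object OlapSketch {
  // Two already-aggregated views (hypothetical sample data).
  val view1: Seq[Agg] = Seq(Agg(500, "mobile", 10, 4L), Agg(500, "desktop", 11, 2L))
  val view2: Seq[Agg] = Seq(Agg(501, "mobile", 20, 6L), Agg(500, "mobile", 10, 1L))

  // Union the cached views once; every query below reads from this.
  val all: Seq[Agg] = view1 ++ view2

  // Query 1: plays by provider.
  def playsByProvider: Map[Int, Long] =
    all.groupBy(_.providerId).map { case (p, as) => p -> as.map(_.plays).sum }

  // Query 2: top k content for mobile, as (videoId, plays) sorted by plays descending.
  def topContentForMobile(k: Int): Seq[(Int, Long)] =
    all.filter(_.device == "mobile")
      .groupBy(_.videoId)
      .map { case (v, as) => v -> as.map(_.plays).sum }
      .toSeq
      .sortBy { case (_, plays) => -plays }
      .take(k)
}
```

Both queries scan only the small aggregate set, not the raw events, which is why they can be answered quickly once the views are in memory.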
Performance numbers
Spark: C* -> OLAP aggregates, cold cache, 1.4 million events    130 seconds
C* -> OLAP aggregates, warmed cache                              20-30 seconds
OLAP aggregate query via Spark (56k records)                     60 ms
6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
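For a rough sense of scale, the figures above can be turned into rates. The arithmetic below assumes the warmed-cache run covers the same 1.4 million events as the cold run, which the slide implies but does not state.

```scala
object PerfArithmetic {
  val events = 1400000.0          // events in the cold-cache run (from the slide)
  val coldSeconds = 130.0         // cold cache
  val warmFastSeconds = 20.0      // warmed cache, best case
  val warmSlowSeconds = 30.0      // warmed cache, worst case

  // Aggregation throughput in events per second.
  val coldEventsPerSec: Double = events / coldSeconds
  val warmEventsPerSecLow: Double = events / warmSlowSeconds   // assumes the same 1.4M events
  val warmEventsPerSecHigh: Double = events / warmFastSeconds  // assumes the same 1.4M events

  // Speedup of a warmed cache over a cold one.
  val speedupLow: Double = coldSeconds / warmSlowSeconds
  val speedupHigh: Double = coldSeconds / warmFastSeconds
}
```

Under that assumption, the cold run works out to roughly 10,800 events per second and the warmed run to between about 46,700 and 70,000, a speedup of roughly 4.3x to 6.5x.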
Spark: Under the hood
[Diagram: a Driver and three executors, each labelled with Map and Reduce tasks and a Dataset]
One executor process per node
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is
expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now
recovery path is much faster
• Persistence also enables multiple processes to hold cached dataset
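The recovery choice described above can be sketched as a fast path and a slow path: reload a materialized view that was persisted earlier if one exists, otherwise recompute it from the raw events. The names below are hypothetical, and plain Scala values stand in for both the C* table and the Spark cache.

```scala
object RecoverySketch {
  // A materialized view, e.g. providerId -> plays.
  type View = Map[Int, Long]

  // Slow path: rebuild the view from raw (providerId, plays) events,
  // which is what lineage-based recomputation has to do.
  def recompute(rawEvents: Seq[(Int, Long)]): View =
    rawEvents.groupBy(_._1).map { case (p, es) => p -> es.map(_._2).sum }

  // Fast path first: use the persisted copy if there is one.
  // getOrElse takes its argument by name, so recompute runs only when needed.
  def recover(persisted: Option[View], rawEvents: Seq[(Int, Long)]): View =
    persisted.getOrElse(recompute(rawEvents))
}
```

A second process can call `recover` against the same persisted copy, which matches the last bullet: persistence lets more than one process hold the cached dataset.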
Demo time
Shark Demo
• Local shark node, 1 core, MBP
• How to create a table from C* using our inputformat
• Creating a cached Shark table
• Running fast queries
Backup Slides
• THE NEXT FEW SLIDES ARE STRICTLY BACKUP IN CASE
LIVE DEMO DOESN’T WORK
Creating a Shark Table from InputFormat
Creating a cached table
Doing a row count ... don’t try in HIVE!
Top k providerId query
THANK YOU

C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan Chan
