Real-Time Analytics With Cassandra, Spark and Shark
June 18, 2013
Scala/Akka guy
Very excited by open source and big data projects - will share some today
@evanfchan
Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Spark and Shark
Our Spark/Cassandra Architecture
Demo
Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra
OOYALA
Powering personalized video experiences across all screens.
COMPANY OVERVIEW
Founded in 2007
Commercially launched in 2009
230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
Over 1 billion videos played per month and 2 billion analytic events per day
25% of U.S. online viewers watch video powered by Ooyala
STRATEGIC PARTNERS
To nuggets of truth...
Quickly
Painlessly
At scale?
Pre-computed aggregates: super fast lookups, but inflexible and wasteful. Best for the 80% most common queries.
100% Dynamic: always compute results from raw data. Flexible, but slow.
Where we want to be: partly dynamic. Pre-aggregate the most common queries.
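A minimal sketch of that "partly dynamic" middle ground. The names here (Event, PartlyDynamicStore) are made up for illustration, not Ooyala's code: serve the most common query from a precomputed map, and fall back to scanning raw events for everything else.

case class Event(providerId: Int, videoId: Int, plays: Int)

class PartlyDynamicStore(rawEvents: Seq[Event]) {
  // Pre-aggregated: total plays per provider, standing in for the "80%" query
  private val playsByProvider: Map[Int, Int] =
    rawEvents.groupBy(_.providerId)
             .map { case (p, es) => p -> es.map(_.plays).sum }

  // Fast path: constant-time lookup against the precomputed aggregate
  def playsFor(providerId: Int): Int = playsByProvider.getOrElse(providerId, 0)

  // Slow path: any other aggregate, computed from raw data on demand
  def dynamic[K](key: Event => K): Map[K, Int] =
    rawEvents.groupBy(key).map { case (k, es) => k -> es.map(_.plays).sum }
}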
Industry Trends
Impala
In-memory databases: VoltDB, Druid
Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Targeted problems that MR is bad at:
Iterative algorithms (machine learning)
Interactive data mining
More general purpose than Hadoop MR
Active contributions from ~15 companies
[Diagram: Hadoop MR vs. Spark. Hadoop chains Map and Reduce stages, writing to HDFS between every stage. Spark builds a DAG of transformations over multiple sources (Data Source, Source 2) - map(), join(), cache(), transform - keeping intermediate results in memory instead of round-tripping through HDFS.]
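In Spark, that DAG is a few lines of Scala. A sketch - the paths and record layout are placeholders, and note that in mid-2013 the package was `spark` rather than the later `org.apache.spark`:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "dag-sketch")

// Two sources, as in the diagram
val source1 = sc.textFile("hdfs://.../events")   // placeholder path
val source2 = sc.textFile("hdfs://.../lookup")   // placeholder path

// map() each source into (key, value) pairs
val events = source1.map { line => val f = line.split('\t'); (f(0), f(1)) }
val lookup = source2.map { line => val f = line.split('\t'); (f(0), f(1)) }

// join() the two keyed datasets, then cache() the result in memory
val joined = events.join(lookup).cache()

joined.count()   // first action computes the DAG and populates the cache
joined.count()   // later actions reuse the in-memory result, no HDFS round-trip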
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Interactive REPL shell
EASY testing!!
Low latency - quick development cycles
Word count in Hadoop MapReduce:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Word count in Spark:

file = spark.textFile("hdfs://...")
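A complete version, as it might look in the Spark shell - assuming `spark` is the SparkContext, and with the hdfs:// paths left as placeholders:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")   // output path, also a placeholder

Four lines, against roughly sixty lines of Java above.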
The Spark stack:
Spark
Tachyon - in-memory caching DFS
HIVE on Spark (Shark)
[Architecture: Ingestion feeds Raw Events into the C* event store. Spark jobs build materialized views (View 1, View 2, View 3) and serve predefined queries; Shark serves ad-hoc HiveQL over the same event store.]
[Diagram: Cassandra nodes Node1, Node2, Node3, each running a co-located stack: Cassandra -> InputFormat -> SerDe -> Spark Worker -> Shark. A Job Server coordinates jobs across the Spark workers.]
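A rough sketch of how the InputFormat layer might be wired up from Spark against a 2013-era Cassandra. The org.apache.cassandra.hadoop classes and config helpers are version-sensitive, and the host, port, and keyspace names here are placeholders, so treat this as illustrative rather than exact:

import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "cassandra-sketch")

// Point the Hadoop job config at the cluster (host/port/names are placeholders)
val job = new Job()
val conf = job.getConfiguration
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1")
ConfigHelper.setInputRpcPort(conf, "9160")
ConfigHelper.setInputColumnFamily(conf, "demo_ks", "EventAttr")
ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner")

// Ask for (up to) the first 100 columns of every row
val predicate = new SlicePredicate().setSlice_range(
  new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100))
ConfigHelper.setInputSlicePredicate(conf, predicate)

// Each Spark worker reads the token ranges local to its Cassandra node
val rows = sc.newAPIHadoopRDD(
  conf,
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],                      // row key
  classOf[SortedMap[ByteBuffer, IColumn]])  // columns

println(rows.count())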
[Data model: events stored as a time series - columns t0..t4 holding {event0: a0}, {event1: a1}, {event2: a2}, {event3: a3}, {event4: a4}.]
EventAttr CF: row key 2013-04-05T00:00Z#id1, with composite columns such as ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0.
At t1, rows 2013-04-05T00:00Z#id1 and 2013-04-05T00:00Z#id2 unpack into a table:

UserID  Video  Type
id1     10     5
id1     11
id2     20     5
id2     25
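A hedged sketch of that unpacking step - the record shapes (WideRow, Flat) are made up for illustration, not the actual schema:

// One wide row: key "2013-04-05T00:00Z#id1", columns keyed by time bucket (t1, ...)
case class WideRow(rowKey: String, columns: Seq[(String, Map[String, Int])])
case class Flat(userId: String, video: Int, eventType: Int)

def unpack(row: WideRow): Seq[Flat] = {
  val userId = row.rowKey.split('#').last          // e.g. "id1"
  row.columns.map { case (_, attrs) =>
    Flat(userId, attrs("video"), attrs("type"))    // {video: 10, type: 5} -> Flat("id1", 10, 5)
  }
}

// Over an RDD of wide rows this becomes: rows.flatMap(unpack)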
[Diagram: C* events - e.g. at t0, 2013-04-05T00:00Z#id1 -> {video: 10, type: 5} and 2013-04-05T00:00Z#id2 -> {video: 20, type: 5} - flow through parallel Spark jobs into OLAP Aggregates; a Union merges them into Cached Materialized Views, which answer Query 1 (Plays by Provider) and Query 2 (Top content for mobile).]
Performance numbers
Spark: C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds
C* -> OLAP aggregates, warmed cache: 20-30 seconds
OLAP aggregate query via Spark (56k records): 60 ms
OLAP WorkFlow
[Diagram: an Aggregation Job running on the Spark Executors reads from Cassandra and produces a cached aggregate Dataset; Query Jobs then run against that Dataset and return Results.]
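A sketch of the two phases in Scala, under assumed record and key shapes: the Aggregation Job builds and cache()s the Dataset once; each Query Job then runs against the in-memory copy.

import org.apache.spark.rdd.RDD

case class PlayEvent(userId: String, video: Int, eventType: Int, providerId: Int)

// Aggregation Job: aggregate raw events and cache the result
def buildAggregate(events: RDD[PlayEvent]): RDD[((Int, Int), Long)] =
  events.map(e => ((e.providerId, e.video), 1L))
        .reduceByKey(_ + _)
        .cache()               // this cached RDD is the "Dataset" in the diagram

// Query Job: plays by provider, served from the cached Dataset
def playsByProvider(agg: RDD[((Int, Int), Long)]): Map[Int, Long] =
  agg.map { case ((provider, _), plays) => (provider, plays) }
     .reduceByKey(_ + _)
     .collect()
     .toMap

The 130-second vs. 60 ms gap in the numbers above is exactly this split: the first figure pays for buildAggregate over raw events, the second only for a query over the cached RDD.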
Fault Tolerance
Persist materialized views back to C*, then load them into cache; the recovery path is now much faster.
Demo time
Shark Demo
THANK YOU
@evanfchan
WE ARE HIRING!!
[Backup diagram: a Driver dispatching Map and Reduce tasks over a cached Dataset.]