Real-Time Analytics With Cassandra, Spark and Shark
June 18, 2013
Scala/Akka guy
Very excited by open source and big data projects - will share some today
@evanfchan
Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Spark and Shark
Our Spark/Cassandra Architecture
Demo
Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra
OOYALA
Powering personalized video experiences across all screens.
COMPANY OVERVIEW
Founded in 2007
Commercially launched in 2009
230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
Over 1 billion videos played per month and 2 billion analytic events per day
25% of U.S. online viewers watch video powered by Ooyala
STRATEGIC PARTNERS
To nuggets of truth...
Quickly
Painlessly
At scale?
Pre-computed aggregates: super fast lookups, but inflexible and wasteful. Best for the 80% most common queries.
100% Dynamic: always compute results from raw data. Flexible, but slow.
Where we want to be: partly dynamic. Pre-aggregate the most common queries.
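A minimal sketch of that "partly dynamic" middle ground. The names here (Event, PartlyDynamicStore) are made up for illustration, not Ooyala's code: serve the most common query from a precomputed map, and fall back to scanning raw events for everything else.

case class Event(providerId: Int, videoId: Int, plays: Int)

class PartlyDynamicStore(rawEvents: Seq[Event]) {
  // Pre-aggregated: total plays per provider, standing in for the "80%" query
  private val playsByProvider: Map[Int, Int] =
    rawEvents.groupBy(_.providerId)
             .map { case (p, es) => p -> es.map(_.plays).sum }

  // Fast path: constant-time lookup against the precomputed aggregate
  def playsFor(providerId: Int): Int = playsByProvider.getOrElse(providerId, 0)

  // Slow path: any other aggregate, computed from raw data on demand
  def dynamic[K](key: Event => K): Map[K, Int] =
    rawEvents.groupBy(key).map { case (k, es) => k -> es.map(_.plays).sum }
}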
Industry Trends
Impala
In-memory databases: VoltDB, Druid
Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Targeted problems that MR is bad at:
Iterative algorithms (machine learning)
Interactive data mining
More general purpose than Hadoop MR
Active contributions from ~15 companies
[Diagram: Hadoop MR vs. Spark. Hadoop chains Map and Reduce stages, writing to HDFS between every stage. Spark builds a DAG of transformations over multiple sources (Data Source, Source 2) - map(), join(), cache(), transform - keeping intermediate results in memory instead of round-tripping through HDFS.]
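In Spark, that DAG is a few lines of Scala. A sketch - the paths and record layout are placeholders, and note that in mid-2013 the package was `spark` rather than the later `org.apache.spark`:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "dag-sketch")

// Two sources, as in the diagram
val source1 = sc.textFile("hdfs://.../events")   // placeholder path
val source2 = sc.textFile("hdfs://.../lookup")   // placeholder path

// map() each source into (key, value) pairs
val events = source1.map { line => val f = line.split('\t'); (f(0), f(1)) }
val lookup = source2.map { line => val f = line.split('\t'); (f(0), f(1)) }

// join() the two keyed datasets, then cache() the result in memory
val joined = events.join(lookup).cache()

joined.count()   // first action computes the DAG and populates the cache
joined.count()   // later actions reuse the in-memory result, no HDFS round-trip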
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Interactive REPL shell
EASY testing!!
Low latency - quick development cycles
Word count in Hadoop MapReduce:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Word count in Spark:

file = spark.textFile("hdfs://...")
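A complete version, as it might look in the Spark shell - assuming `spark` is the SparkContext, and with the hdfs:// paths left as placeholders:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")   // output path, also a placeholder

Four lines, against roughly sixty lines of Java above.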
The Spark stack:
Spark
Tachyon - in-memory caching DFS
HIVE on Spark (Shark)
[Architecture: Ingestion feeds Raw Events into the C* event store. Spark jobs build materialized views (View 1, View 2, View 3) and serve predefined queries; Shark serves ad-hoc HiveQL over the same event store.]
[Diagram: Cassandra nodes Node1, Node2, Node3, each running a co-located stack: Cassandra -> InputFormat -> SerDe -> Spark Worker -> Shark. A Job Server coordinates jobs across the Spark workers.]
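A rough sketch of how the InputFormat layer might be wired up from Spark against a 2013-era Cassandra. The org.apache.cassandra.hadoop classes and config helpers are version-sensitive, and the host, port, and keyspace names here are placeholders, so treat this as illustrative rather than exact:

import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "cassandra-sketch")

// Point the Hadoop job config at the cluster (host/port/names are placeholders)
val job = new Job()
val conf = job.getConfiguration
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1")
ConfigHelper.setInputRpcPort(conf, "9160")
ConfigHelper.setInputColumnFamily(conf, "demo_ks", "EventAttr")
ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner")

// Ask for (up to) the first 100 columns of every row
val predicate = new SlicePredicate().setSlice_range(
  new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100))
ConfigHelper.setInputSlicePredicate(conf, predicate)

// Each Spark worker reads the token ranges local to its Cassandra node
val rows = sc.newAPIHadoopRDD(
  conf,
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],                      // row key
  classOf[SortedMap[ByteBuffer, IColumn]])  // columns

println(rows.count())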
[Data model: events stored as a time series - columns t0..t4 holding {event0: a0}, {event1: a1}, {event2: a2}, {event3: a3}, {event4: a4}.]
EventAttr CF: row key 2013-04-05T00:00Z#id1, with composite columns such as ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0.
At t1, rows 2013-04-05T00:00Z#id1 and 2013-04-05T00:00Z#id2 unpack into a table:

UserID  Video  Type
id1     10     5
id1     11
id2     20     5
id2     25
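A hedged sketch of that unpacking step - the record shapes (WideRow, Flat) are made up for illustration, not the actual schema:

// One wide row: key "2013-04-05T00:00Z#id1", columns keyed by time bucket (t1, ...)
case class WideRow(rowKey: String, columns: Seq[(String, Map[String, Int])])
case class Flat(userId: String, video: Int, eventType: Int)

def unpack(row: WideRow): Seq[Flat] = {
  val userId = row.rowKey.split('#').last          // e.g. "id1"
  row.columns.map { case (_, attrs) =>
    Flat(userId, attrs("video"), attrs("type"))    // {video: 10, type: 5} -> Flat("id1", 10, 5)
  }
}

// Over an RDD of wide rows this becomes: rows.flatMap(unpack)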
[Diagram: C* events - e.g. at t0, 2013-04-05T00:00Z#id1 -> {video: 10, type: 5} and 2013-04-05T00:00Z#id2 -> {video: 20, type: 5} - flow through parallel Spark jobs into OLAP Aggregates; a Union merges them into Cached Materialized Views, which answer Query 1 (Plays by Provider) and Query 2 (Top content for mobile).]
Performance numbers
Spark: C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds
C* -> OLAP aggregates, warmed cache: 20-30 seconds
OLAP aggregate query via Spark (56k records): 60 ms
OLAP WorkFlow
[Diagram: an Aggregation Job running on the Spark Executors reads from Cassandra and produces a cached aggregate Dataset; Query Jobs then run against that Dataset and return Results.]
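A sketch of the two phases in Scala, under assumed record and key shapes: the Aggregation Job builds and cache()s the Dataset once; each Query Job then runs against the in-memory copy.

import org.apache.spark.rdd.RDD

case class PlayEvent(userId: String, video: Int, eventType: Int, providerId: Int)

// Aggregation Job: aggregate raw events and cache the result
def buildAggregate(events: RDD[PlayEvent]): RDD[((Int, Int), Long)] =
  events.map(e => ((e.providerId, e.video), 1L))
        .reduceByKey(_ + _)
        .cache()               // this cached RDD is the "Dataset" in the diagram

// Query Job: plays by provider, served from the cached Dataset
def playsByProvider(agg: RDD[((Int, Int), Long)]): Map[Int, Long] =
  agg.map { case ((provider, _), plays) => (provider, plays) }
     .reduceByKey(_ + _)
     .collect()
     .toMap

The 130-second vs. 60 ms gap in the numbers above is exactly this split: the first figure pays for buildAggregate over raw events, the second only for a query over the cached RDD.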
Fault Tolerance
Persist materialized views back to C*, then load them into cache; the recovery path is now much faster.
Demo time
Shark Demo
THANK YOU
@evanfchan
WE ARE HIRING!!
[Backup diagram: a Driver dispatching Map and Reduce tasks over a cached Dataset.]