Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine

Splice Machine Proprietary and Confidential

Powering Real-Time Applications & Analytics
Enabling Decisions in the Moment
John Leach, CTO & Co-Founder

DECISIONS IN THE MOMENT
• Life Sciences
• Digital Marketing
• Fraud Detection
• Supply Chain Optimization

Today’s Reality: Stale Data, Backward-Looking Decisions

How old is the data in your reports?
• 1 day +
• 1 day
• 4 hours +
• 1 hour +
• Real-time

Poll results (chart values in slide order): 24%, 50%, 7%, 9%, 9%
* Source: Webinars on 11-3-15 and 12-10-15, 237 respondents

Data Gridlock: Complex, Outdated ETL Pipelines

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR, …); ODS; ETL (Extract, Transform, Load); Stream or Batch Updates; OLAP Systems (Data Warehouse, Datamart); Ad Hoc Analytics; Executive Business Reports; Operational Reports; Mixed Workload Apps]

• Pain
  • Separate OLTP & OLAP systems
  • Messy ETL “glue”
• Why?
  • Different workloads
  • Different data structures
  • Hard to isolate workloads
• No longer adequate
  • Can’t afford to wait days or hours to analyze data

Current architectures unable to keep up

Nirvana: No More ETL

[Diagram labels: OLTP App; OLAP Report; OLTP/OLAP; Simultaneous OLTP & OLAP workloads]

• Benefits
  • Faster since Big Data does not need to be moved
  • Eliminate expensive ETL and data warehouse systems
  • Act on real-time data instead of yesterday’s
• Why is it Possible Now?

Disruptive Technology Enablers
• Scale-Out Technology: Scale Up (increase server size) vs. Scale Out (more small servers)
• In-Memory Technology

The Splice Machine RDBMS: Replace Oracle & MySQL
The First RDBMS Powered by Hadoop & Spark

SQL, Scale Out, Speed
• ANSI SQL: No retraining or rewrites for SQL-based analysts, reports, and applications
• ¼ the Cost: Scales out on commodity hardware
• Transactions: Ensure reliable updates across multiple rows
• Mixed Workloads: Simultaneously support OLTP and OLAP workloads
• Elastic: Increase scale in just a few minutes
• 10-20x Faster: Leverages Spark in-memory technology

Omni-Channel Marketing: Harte-Hanks

Overview
• Digital marketing services provider
• Unified Customer Profile
• Real-time campaign management
• Operational application with BI reports

Challenges
• Oracle RAC too expensive to scale
• Queries too slow, even up to ½ hour
• Getting worse: expect 30-50% data growth
• Looked for 9 months for a cost-effective solution

Solution Diagram
[Diagram labels: Cross-Channel Campaigns; Real-Time Personalization; Real-Time Actions]

Initial Results
• ¼ cost with commodity scale out
• 3-7x faster through parallelized queries
• 10-20x price/perf with no application, BI or ETL rewrites

Simultaneous OLTP & OLAP Workloads
Very few applications are OLTP only

[Diagram: Traditional RDBMSs (OLTP and OLAP share one system: bottlenecks, delays) vs. Splice Machine (OLTP on HBase, OLAP on Spark: workload isolation)]

Simultaneous OLTP & OLAP Workloads
Separate OLTP & OLAP processes isolate workloads

[Charts: OLTP response time (y-axis) vs. OLAP load (x-axis)]
• Traditional RDBMSs: As OLAP load rises, OLTP response times increase
• Splice Machine: As OLAP load rises, OLTP response times remain flat

Proven Building Blocks: Spark, Hadoop and Derby

Apache Derby
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC Compliant

Apache HBase/Hadoop
• Auto-sharding
• High availability
• Scalability to 100s of PBs

Apache Spark
• Analytical engine
• Fast, in-memory technology
• Memory resilient to node failure

HBase: Proven Scale-Out
• Auto-sharding
• Scales with commodity hardware
• Cost-effective from GBs to PBs
• High availability through failover and replication
• LSM-trees

Apache Spark

Unmatched Performance
• Fastest sort of 1PB of data

Advanced In-Memory Technology
• Spill-to-disk for large datasets
• Resilient against node failures
• Pipelining for computation parallelism

Most Active Apache Community
• Almost 500 committers

Extensive Libraries
• Over 140 and growing
• Libraries for machine learning, streaming and graph processing

How is Spark aiding Splice Machine?

Address HBase Challenges
• Compactions
• Large Data Movements

RDBMS Features
• Index Creation
• Statistics Collection
• Import
• Admin UI

Analytic Processing
• Pipelining for computation parallelism
• Lineage

Machine Learning
• Incorporating MLlib into the RDBMS

Splice Machine: Advanced Spark Integration
Compaction: LSM Tree (Deal with the Devil)

Splice Machine: Spark/HBase Integration: Compaction
[Diagram: Minor Compaction and Major Compaction]
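
The compaction idea named on this slide can be sketched as follows. This is a minimal illustration, assuming each store file is a sorted map from row key to value and that a sentinel string stands in for a delete marker. The names `CompactionSketch`, `minorCompact`, `majorCompact` and `TOMBSTONE` are hypothetical; this is not HBase's or Splice Machine's implementation, and it says nothing about how Spark runs the work.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompactionSketch {
    // Hypothetical delete marker; a real store uses typed delete cells instead.
    static final String TOMBSTONE = "<deleted>";

    // Minor compaction: merge a few sorted files (oldest first) into one.
    // Newer files overwrite older values. Delete markers are kept because
    // files outside this merge may still hold the deleted key.
    static TreeMap<String, String> minorCompact(List<Map<String, String>> filesOldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> file : filesOldestFirst) {
            merged.putAll(file);
        }
        return merged;
    }

    // Major compaction: merge every file, then drop the delete markers,
    // since no older file can resurrect a deleted key any more.
    static TreeMap<String, String> majorCompact(List<Map<String, String>> allFilesOldestFirst) {
        TreeMap<String, String> merged = minorCompact(allFilesOldestFirst);
        merged.values().removeIf(TOMBSTONE::equals);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> older = Map.of("row1", "a", "row2", "b");
        Map<String, String> newer = Map.of("row2", TOMBSTONE, "row3", "c");
        System.out.println(minorCompact(List.of(older, newer)));
        System.out.println(majorCompact(List.of(older, newer)));
    }
}
```

The cost that matters for the talk is that both kinds of compaction rewrite whole files, which is the I/O-heavy work the deck lists among the HBase challenges.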

Splice Machine: Advanced Spark Integration

Innovative, High-Performance RDD Creation
• Fast access to HFiles in HDFS
• Merged with deltas from Memstore
• Avoids slower HBase API

Universal Execution Plan and Byte Code
• Optimizer, plan and code shared across Spark or HBase execution

[Diagram: on each physical node, an HBase Region Server (Region 1 … Region N, each with a Memstore) and a Spark Worker (RDD 1 … RDD N) over HFiles in HDFS]
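
The "HFiles merged with deltas from Memstore" idea can be sketched in plain Java. This is only an illustration under stated assumptions: flushed file contents and in-memory changes are modeled as sorted maps, an empty `Optional` marks a row deleted since the last flush, and the names `MergedScanSketch`, `scan` and `demo` are hypothetical. It uses neither the Spark nor the HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

class MergedScanSketch {
    // Overlay in-memory deltas on rows read from immutable files.
    // A present value replaces or adds a row; an empty Optional removes it.
    static List<String> scan(TreeMap<String, String> hfileRows,
                             TreeMap<String, Optional<String>> memstoreDeltas) {
        TreeMap<String, String> view = new TreeMap<>(hfileRows);
        for (Map.Entry<String, Optional<String>> delta : memstoreDeltas.entrySet()) {
            if (delta.getValue().isPresent()) {
                view.put(delta.getKey(), delta.getValue().get());
            } else {
                view.remove(delta.getKey());
            }
        }
        List<String> rows = new ArrayList<>();
        view.forEach((key, value) -> rows.add(key + "=" + value));
        return rows;
    }

    // Sample data: one updated row, one deleted row, one new row.
    static List<String> demo() {
        TreeMap<String, String> hfileRows = new TreeMap<>();
        hfileRows.put("k1", "v1");
        hfileRows.put("k2", "v2");
        TreeMap<String, Optional<String>> deltas = new TreeMap<>();
        deltas.put("k1", Optional.of("v1-new"));
        deltas.put("k2", Optional.empty());
        deltas.put("k3", Optional.of("v3"));
        return scan(hfileRows, deltas);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [k1=v1-new, k3=v3]
    }
}
```

Reading the files directly and patching in only the recent changes is what lets a bulk scan skip the row-at-a-time client path while still returning current data.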

Splice Machine Architecture
1. Standard install of HBase Cluster (HBase, HDFS, ZooKeeper) with Spark
2. Distribute Splice Machine JAR to each region server
3. Automatically invoke co-processors on each region

[Diagram: cluster with HMaster, ZooKeeper and Spark Master. Each node shows an HBase Region Server over HDFS with the Splice Parser, Planner, Optimizer and Executor, a Splice Executor (Snapshot Isolation, Indexes) on each Region, and a Spark Worker with RDDs and Executors, each holding Tasks and a Cache. Legend: HBase Co-Processor]

Splice Machine: Query Execution

1. Parse SQL
• Generate Abstract Syntax Tree (AST)
• Bind AST to Transactional Dictionary

2. Optimize query plan
• Determine join order and storage structure (e.g., base table, index) using table statistics (e.g., cardinality estimates)
• Push predicates
• Unroll nested subqueries

3. Generate optimal byte code

OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results

OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
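
The two execution paths share steps 1 to 3 and then diverge. The slides do not say how the engine is chosen, so the sketch below assumes, purely for illustration, a cutoff on the optimizer's estimated row count. `EngineRouterSketch`, `chooseEngine` and `OLAP_ROW_THRESHOLD` are hypothetical names and values, not Splice Machine's API.

```java
import java.util.List;

class EngineRouterSketch {
    enum Engine { HBASE_OLTP, SPARK_OLAP }

    // Hypothetical cutoff; the slides do not state the real selection rule.
    static final long OLAP_ROW_THRESHOLD = 20_000L;

    // Small, selective queries stay on HBase; large scans go to Spark.
    static Engine chooseEngine(long estimatedRowsScanned) {
        return estimatedRowsScanned > OLAP_ROW_THRESHOLD
                ? Engine.SPARK_OLAP
                : Engine.HBASE_OLTP;
    }

    // Remaining steps after byte code generation, as listed on the slides.
    static List<String> remainingSteps(Engine engine) {
        if (engine == Engine.HBASE_OLTP) {
            return List.of(
                    "4a. Execute OLTP query from byte code",
                    "5a. Use block cache and bloom filters to optimize data access",
                    "6a. Return results");
        }
        return List.of(
                "4b. Generate Spark execution plan",
                "5b. Submit Spark plan with byte code",
                "6b. Fair scheduling of distributed tasks",
                "7b. Generate RDD from HFiles and Memstore",
                "8b. Execute query and return results");
    }

    public static void main(String[] args) {
        for (long rows : new long[] {10L, 5_000_000L}) {
            Engine engine = chooseEngine(rows);
            System.out.println(rows + " rows -> " + engine);
            remainingSteps(engine).forEach(step -> System.out.println("  " + step));
        }
    }
}
```

Whatever the real rule is, the point the deck makes is that the same optimized plan and byte code feed both branches, so the choice happens late and cheaply.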

Isolated Resource Management
Isolate Spark & HBase resources through Linux cgroups


Configurable Spark Resource Management
• Prioritize Spark resources between Query, Admin & Import jobs
• Custom resource pools through XML

Spark Query Management
• Visualization of active and completed queries
• Visualization of stages for each query, plus kill function
• Visualization of stages for query plan, plus kill function
• Detailed metrics for tasks in each stage

Federated Query Support

Virtual Table Interface (VTI)
• Execute federated queries against external files, libraries or databases
• External Databases
  • Use JDBC to access data in DBs such as Oracle and DB2
• External Libraries
  • Access over 140 Spark libraries for machine learning and streaming
• External Files
  • Pre-defined or dynamic schema
  • Access local FS, HDFS, AWS S3
  • Sample query: [image on slide]

MapReduce I/O Formats
• Accept federated queries from MapReduce, Pig, and Hive
• Register Splice Machine schema in HCatalog
• Merge structured (Splice) and unstructured data in ad-hoc query
• Seamless integration with the Hadoop ecosystem

Machine Learning: Adding Multivariate Statistics via a Stored Procedure

public static void getStatementStatistics(String statement, ResultSet[] resultSets) throws SQLException {
    try {
        // Run SQL statement
        Connection con = DriverManager.getConnection("jdbc:default:connection");
        PreparedStatement ps = con.prepareStatement(statement);
        ResultSet rs = ps.executeQuery();
        // Convert result set to Java RDD
        JavaRDD<LocatedRow> resultSetRDD = ResultSetToRDD(rs);
        // Collect column statistics
        int[] fieldsToConvert = getFieldsToConvert(ps);
        MultivariateStatisticalSummary summary = getColumnStatisticsSummary(resultSetRDD, fieldsToConvert);
        IteratorNoPutResultSet resultsToWrap = wrapResults((EmbedConnection) con, getColumnStatistics(ps, summary, fieldsToConvert));
        resultSets[0] = new EmbedResultSet40((EmbedConnection) con, resultsToWrap, false, null, true);
    } catch (StandardException e) {
        throw new SQLException(Throwables.getRootCause(e));
    }
}

private static MultivariateStatisticalSummary getColumnStatisticsSummary(JavaRDD<LocatedRow> resultSetRDD, int[] fieldsToConvert) throws StandardException {
    JavaRDD<Vector> vectorJavaRDD = SparkMLibUtils.locatedRowRDDToVectorRDD(resultSetRDD, fieldsToConvert);
    MultivariateStatisticalSummary summary = Statistics.colStats(vectorJavaRDD.rdd());
    return summary;
}
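
For readers without a cluster at hand, the kind of per-column summary this procedure returns can be sketched in plain Java. The sketch computes a column-wise mean and an unbiased sample variance over a small in-memory matrix; it does not call Spark, and the class and method names (`ColumnStatsSketch`, `mean`, `variance`) are hypothetical.

```java
import java.util.Arrays;

class ColumnStatsSketch {
    // Column-wise mean of a row-major matrix; all rows must have equal length.
    static double[] mean(double[][] rows) {
        int cols = rows[0].length;
        double[] sum = new double[cols];
        for (double[] row : rows) {
            for (int c = 0; c < cols; c++) {
                sum[c] += row[c];
            }
        }
        for (int c = 0; c < cols; c++) {
            sum[c] /= rows.length;
        }
        return sum;
    }

    // Column-wise unbiased sample variance (n - 1 denominator); needs >= 2 rows.
    static double[] variance(double[][] rows) {
        double[] mean = mean(rows);
        double[] sumSquares = new double[mean.length];
        for (double[] row : rows) {
            for (int c = 0; c < mean.length; c++) {
                double deviation = row[c] - mean[c];
                sumSquares[c] += deviation * deviation;
            }
        }
        for (int c = 0; c < mean.length; c++) {
            sumSquares[c] /= (rows.length - 1);
        }
        return sumSquares;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 10}, {2, 20}, {3, 30}};
        System.out.println(Arrays.toString(mean(rows)));     // [2.0, 20.0]
        System.out.println(Arrays.toString(variance(rows))); // [1.0, 100.0]
    }
}
```

In the stored procedure the same summary is computed in a distributed fashion over the RDD built from the SQL result set, so the statistics run next to the data instead of in the client.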

ANSI SQL-99+ Coverage
• Data types (e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT)
• DDL (e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE)
• Predicates (e.g., IN, BETWEEN, LIKE, EXISTS)
• DML (e.g., INSERT, DELETE, UPDATE, SELECT)
• Query specification (e.g., GROUP BY, HAVING)
• SET functions (e.g., UNION, ABS, MOD, ALL, INTERSECT, EXCEPT)
• Aggregation functions (e.g., AVG, MAX, COUNT)
• String functions (e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH)
• Constraints (e.g., PRIMARY KEY, CHECK, FOREIGN KEY, UNIQUE, NOT NULL)
• Conditional functions (e.g., CASE, searched CASE)
• Privileges (e.g., privileges for SELECT, DELETE, INSERT, EXECUTE)
• Joins (e.g., INNER JOIN, LEFT OUTER JOIN)
• Transactions (e.g., COMMIT, ROLLBACK, Snapshot Isolation)
• Sub-queries
• Triggers
• User-defined functions (UDFs)
• Views, including grouped views
• Window Functions (e.g., FIRST_VALUE, LAST_VALUE, LEAD, LAG)

High Concurrency, ACID transactions
Required to support OLTP applications

share_quantity          share_price
TIMESTAMP  VALUE        TIMESTAMP  VALUE
T12        4,000        T7         $15.11
T7         2,000        T5         $15.65
T3         5,000        T2         $15.74
T1         3,000        T0         $15.27

“Virtual” snapshot for a transaction @T6: share_quantity = 5,000 (T3), share_price = $15.65 (T5)

value_held = share_quantity * share_price
@T6: value_held = 5,000 * $15.65
@T3: value_held = 5,000 * $15.74

• State-of-the-art, distributed snapshot isolation
• Form of Multi-Version Concurrency Control (MVCC)
• Writers do not block readers
• Fast, high concurrency
• Delivers performance for small reads/writes & batch loads
• Extends research from Google Percolator & Yahoo Labs
• Patent pending technology
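
The snapshot reads on this slide can be reproduced with a small sketch: each column keeps its versions keyed by commit timestamp, and a transaction at timestamp T sees the newest version committed at or before T. The data comes from the slide; the class and method names (`SnapshotReadSketch`, `readAt`, `valueHeldCents`) are hypothetical, prices are stored in cents as an illustrative choice, and the sketch covers only the read side of MVCC, not conflict detection or commits.

```java
import java.util.Map;
import java.util.TreeMap;

class SnapshotReadSketch {
    // Versions keyed by commit timestamp (T0, T1, ... written as 0, 1, ...).
    static final TreeMap<Integer, Long> SHARE_QUANTITY =
            new TreeMap<>(Map.of(1, 3_000L, 3, 5_000L, 7, 2_000L, 12, 4_000L));

    // Prices in cents to avoid floating-point rounding.
    static final TreeMap<Integer, Long> SHARE_PRICE_CENTS =
            new TreeMap<>(Map.of(0, 1_527L, 2, 1_574L, 5, 1_565L, 7, 1_511L));

    // A reader at snapshot timestamp ts sees the newest version with
    // commit timestamp <= ts; later writes are invisible to it.
    static long readAt(TreeMap<Integer, Long> versions, int ts) {
        Map.Entry<Integer, Long> visible = versions.floorEntry(ts);
        if (visible == null) {
            throw new IllegalStateException("no version visible at T" + ts);
        }
        return visible.getValue();
    }

    // value_held = share_quantity * share_price, both read at the same snapshot.
    static long valueHeldCents(int ts) {
        return readAt(SHARE_QUANTITY, ts) * readAt(SHARE_PRICE_CENTS, ts);
    }

    public static void main(String[] args) {
        System.out.println(valueHeldCents(6)); // 7825000 cents = 5,000 * $15.65
        System.out.println(valueHeldCents(3)); // 7870000 cents = 5,000 * $15.74
    }
}
```

Because the reader only consults versions at or before its own timestamp, the writes at T7 and T12 never block it or change its answer, which is the "writers do not block readers" property listed above.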

BI and SQL tool support via ODBC/JDBC
No application rewrites needed

Application Framework Support

Advisory Board
Advisory Board includes luminaries in databases and technology

• Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC
• Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; Founder of Apache Spark
• Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle
• Ken Rudin: Head of Growth and Analysis for Google Search; Head of Analytics at Facebook
• Abhinav Gupta: Co-Founder, VP Engineering at Rocket Fuel; Runs 15PB HBase Cluster


Make Decisions in the Moment

Next Steps
• Try Us!
• Proof of Concept
