Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Twitter Tag: #briefr The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
Mission
•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today's innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!
Topics
August: REAL-TIME DATA
September: HADOOP 2.0
October: DATA MANAGEMENT
Why Data Gets in a Jam
•  ETL is dated technology
•  New super-highways are needed
•  Data gravity is real
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
Splice Machine
•  Splice Machine is a SQL-on-Hadoop database
•  The product is ACID-compliant and can power both OLAP and OLTP workloads
•  Splice Machine is built on Java-based Apache Derby and HBase/Hadoop
Guest: Rich Reimer
Rich Reimer, VP of Marketing and Product Management
Rich has over 15 years of sales, marketing and management experience in high-tech companies. Before joining Splice Machine, Rich worked at Zynga as the Treasure Isle studio head, where he used petabytes of data from millions of daily users to optimize the business in real time. Prior to Zynga, he was the COO and co-founder of a social media platform named Grouply. Before founding Grouply, Rich held executive positions at Siebel Systems, Blue Martini Software and Oracle Corporation as well as sales and marketing positions at General Electric and Bell Atlantic.
Splice Machine Proprietary and Confidential
ETL: Gatekeeper to Real-Time Big Data
Rich Reimer
VP, Product Management
rreimer@splicemachine.com
August 11, 2015
What Is Real-Time? Are We There Yet?
Depends on where you are in the insight-to-action continuum
Capture
•  Current: Nightly ETL; Data Lakes
•  Real-Time: Millisecond Delay
Analyze
•  Current: Interactive Reports on Old Data; Days for Data Scientists to Analyze
•  Real-Time: Automated Machine Learning
Act
•  Current: Days to Update Rules; Months to Update Apps
•  Real-Time: Autonomic Applications
Crawl, Walk, Run
ETL: Boring, Unglamorous, Inevitable Burden
“ETL is something you do that nobody notices until you don’t do it.”
- Author Unknown
But It’s Killing You Slowly…
Inertia and hidden costs dragging your business down
[Diagram labels: Systems of Record (ERP, CRM, …), ODS, ETL, Data Warehouse]
•  Expensive: Scale-up hardware and proprietary software
•  Tuning: Ongoing database tuning to address performance issues
•  Script Maintenance: Constant updating of ETL scripts to handle changing sources and reports
•  Unable to Meet Business Needs: Takes weeks or months to change or create new reports
•  Delayed Reports: Errors or performance issues cause the ETL window to be missed, delaying reports
•  Data Too Old: Data is hours or days old when the business needs it in near real time
•  Too Slow: The ETL pipeline can take hours or even days to finish
Big Data Makes It Worse
ETL becomes a bigger bottleneck as data grows
[Diagram labels: ETL Bottleneck, Applications, Analysis]
30-40% data growth per year
Source: 2013 IBM Briefing Book
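As a rough worked example of the 30-40% growth figure: assuming the rate compounds at a constant pace every year (an assumption; the slide gives only the range), data volume doubles in roughly two to three years. The function name is illustrative.

```python
import math

def doubling_time_years(annual_growth):
    """Years for volume to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

low = doubling_time_years(0.30)   # slower end of the quoted range
high = doubling_time_years(0.40)  # faster end of the quoted range
print(f"30% growth: doubles in {low:.1f} years")
print(f"40% growth: doubles in {high:.1f} years")
```

An ETL window sized for today's volume would therefore need about twice the throughput within that period just to stay level.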
  
Scale-Out: The Future of Databases
Dramatic improvement in price/performance
Scale Up (Increase server size) vs. Scale Out (More small servers)
Fixing ETL: Incremental Approach
Incremental evolution to reduce lag from days to seconds
•  ETL: Scale-up. Timing: Legacy. Lag: Days/Hours. Architecture: OLTP, Transform, OLAP.
•  ETL: Scale-out. Timing: Now. Lag: Hours/Minutes. Architecture: OLTP, Transform, OLAP.
•  ELT. Timing: Now. Lag: Minutes/Seconds. Architecture: OLTP, OLAP with Transform.
•  T Only. Timing: Future. Lag: No Lag. Architecture: combined OLTP/OLAP with Transform.
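The ELT step in this progression can be sketched as follows: rows are loaded unchanged into a staging table, and the transform then runs as SQL inside the target database rather than in a separate ETL tier. This is a minimal illustration using Python's built-in sqlite3 as a stand-in for the target database; the table names, columns, and sample rows are hypothetical and not taken from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target database

# Load: land source rows unchanged in a staging table (the "L" before the "T").
conn.execute(
    "CREATE TABLE staging_orders (order_id INTEGER, amount_cents INTEGER, region TEXT)"
)
raw_rows = [(1, 1250, " west "), (2, 990, "EAST"), (3, 4300, "west")]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw_rows)

# Transform: clean and aggregate with SQL inside the database.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT UPPER(TRIM(region)) AS region,
           SUM(amount_cents) / 100.0 AS total_amount
    FROM staging_orders
    GROUP BY UPPER(TRIM(region))
""")
conn.commit()

totals = dict(conn.execute("SELECT region, total_amount FROM sales_by_region"))
print(totals)
```

Because the transform is just a query over data that is already loaded, it can be rerun or changed without re-extracting from the source systems.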
Reference Architecture: Typical Data Processing Pipeline
How do you reduce lag from days to minutes to seconds?
[Diagram components]
•  Systems of Record: ERP, CRM, Supply Chain, HR, …
•  ODS
•  ETL: Extract, Transform, Load
•  Data Warehouse
•  Datamart
•  Stream or Batch Updates
•  Mixed Workload Apps
•  Ad Hoc Analytics
•  Executive Business Reports
•  Operational Reports & Analytics
Reference Architecture: Scale-Out Data Processing Pipeline
Accelerate Data Processing Pipeline to minutes or even seconds
[Diagram components]
•  Systems of Record: ERP, CRM, Supply Chain, HR, …
•  Operational Data Lake
•  ETL: Extract, Transform, Load
•  Data Warehouse
•  Datamart
•  Stream or Batch Updates
•  Mixed Workload Apps
•  Ad Hoc Analytics
•  Executive Business Reports
•  Operational Reports & Analytics
Benefits
•  5-10x faster
•  75% less cost
•  Elastic scalability
•  Unstructured data support
You Need More Than Hadoop By Itself For ETL
Errors or data quality issues force ETL restarts
•  Hadoop ETL: Restart ETL to fix errors or update records (hours)
•  Hadoop RDBMS ETL: Use transaction to restart step or update records (seconds)
[Diagram labels: Apps, ETL, Analytics]
Benefits
•  SQL-based transforms
•  Improved data quality
•  Faster recovery with transactions
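The idea of restarting a single step inside a transaction, instead of rerunning the whole job, can be sketched like this. It is a minimal illustration using Python's built-in sqlite3 as a stand-in for a transactional database, not Splice Machine itself; the `orders` table, the `load_step` helper, and the sample batches are hypothetical.

```python
import sqlite3

def load_step(conn, rows):
    """Load one batch atomically: True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, amount REAL NOT NULL CHECK (amount >= 0))"
)
conn.commit()

good_batch = [(1, 10.0), (2, 25.5)]
bad_batch = [(3, 7.0), (4, -1.0)]   # second row violates the CHECK constraint
fixed_batch = [(3, 7.0), (4, 1.0)]  # same step, retried with corrected data

results = [load_step(conn, b) for b in (good_batch, bad_batch, fixed_batch)]
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(results, row_count)
```

The failed batch leaves no partial rows behind, so only that step is retried; the batch committed earlier is untouched.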
  
Streamlining the Structured Data Pipeline in Hadoop
Traditional Hadoop Pipeline
•  Source Systems (ERP, CRM, …)
•  Sqoop
•  Apply Inferred Schema
•  Stored as flat files
•  SQL Query Engines
•  BI Tools
vs.
Streamlined Hadoop Pipeline
•  Source Systems (ERP, CRM, …)
•  Existing ETL Tool
•  Stored in same schema
•  BI Tools
Benefits
•  Less cost and complexity
•  Faster w/ fewer translations
•  Improved data quality
•  Better SQL support
Seamless Integration of Structured and Unstructured Data
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines
[Diagram components: OLTP Systems (ERP, CRM, Supply Chain, HR, …); Structured Data; Unstructured Data; HCATALOG; Pig]
1. SCHEMA ON INGEST: Streamlined, structured-to-structured integration
2. SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data
3. SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data
Case Study: Operational Data Lake
Overview
•  Computer technology corporation
•  Update database technology for:
   •  ODS layer replacement
   •  ETL processing and analysis of Omniture data
   •  Real-time OLTP for Global Tech Support app
Challenges
•  Oracle and Teradata too expensive to scale
•  Many Oracle queries couldn’t complete
•  Can only hold 7 days' worth of data in Oracle
•  Missing ETL window with current Hadoop data lake
Solution Diagram
[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain); (400TB)]
Benefits
•  75% less cost with commodity scale out
•  Incremental ETL processing that gracefully handles data quality issues
•  5x-10x faster, completing queries on which Oracle failed
Use Cases
•  Internet of Things
•  ETL/Operational Data Lake
•  Digital Marketing
•  Precision Medicine
•  Fraud Detection
Who Are We?
THE HADOOP RDBMS
Replace Oracle with Splice Machine to scale out your applications
•  Affordable, Scale-Out – Commodity hardware
•  Elastic – Easy to expand or scale back
•  Transactional – Real-time updates & ACID Transactions
•  ANSI SQL – Leverage existing SQL code, tools, & skills
•  Flexible – Support operational and analytical workloads
10x Better Price/Perf
Proven Building Blocks: Hadoop and Derby
APACHE DERBY
•  ANSI SQL-99 RDBMS
•  Java-based
•  ODBC/JDBC Compliant
APACHE HBASE/HDFS
•  Auto-sharding
•  Real-time updates
•  Fault-tolerance
•  Scalability to 100s of PBs
•  Data replication
Distributed, Parallelized Query Execution
•  Parallelized computation across cluster
•  Moves computation to the data
•  Utilizes HBase co-processors
•  No MapReduce
[Diagram legend: HBase Co-Processor; HBase Server Memory Space]
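"Moving computation to the data" can be illustrated with a scatter-gather aggregate: each partition reduces its own rows to a small partial result, and only those partials are merged centrally. The sketch below is a plain Python analogy using threads; it does not use the HBase co-processor API, and the partition contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the set of rows held by one node.
partitions = [
    [12.0, 8.0, 10.0],
    [20.0, 30.0],
    [5.0],
]

def partial_aggregate(rows):
    """Runs next to the data: reduce a partition to a (sum, count) pair."""
    return sum(rows), len(rows)

def distributed_avg(parts):
    """Scatter the aggregate to every partition, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(partial_aggregate, parts))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

average = distributed_avg(partitions)
print(average)
```

Only one (sum, count) pair per partition crosses the network in this model, regardless of how many rows each partition holds.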
  
ANSI SQL-99 Coverage
•  Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
•  DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE
•  Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
•  DML – e.g., INSERT, DELETE, UPDATE, SELECT
•  Query specification – e.g., GROUP BY, HAVING
•  SET functions – e.g., UNION, ABS, MOD, ALL
•  Aggregation functions – e.g., AVG, MAX, COUNT
•  String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH
•  Constraints – e.g., PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL
•  Conditional functions – e.g., CASE, searched CASE
•  Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
•  Joins – e.g., INNER JOIN, LEFT OUTER JOIN
•  Transactions – e.g., COMMIT, ROLLBACK, Snapshot Isolation
•  Sub-queries
•  Triggers
•  User-defined functions (UDFs)
•  Views – including grouped views
Lockless, ACID transactions
•  Adds multi-row, multi-table transactions to HBase w/ rollback
•  Fast, lockless, high concurrency
•  Extends research from Google Percolator, Yahoo Labs, U of Waterloo
•  Patent pending technology
	
  
What People are Saying…
Recognized as a key innovator in databases
Quotes
•  Scaling out on Splice Machine presented some major benefits over Oracle ...automatic balancing between clusters...avoiding the costly licensing issues.
•  An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop.
•  The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop.
Awards
Initial Advisory Board
Advisory Board includes luminaries in databases and technology
•  Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC
•  Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; Founder of Apache Spark
•  Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle
•  Ken Rudin: Head of Analytics at Facebook; Former GM of Oracle Data Warehousing
•  Abhinav Gupta: Co-Founder, VP Engineering at Rocket Fuel; Runs 15PB HBase Cluster
The First Step to Real-Time Big Data Requires Fixing ETL
ETL on Hadoop
•  Drive lag down from hours → minutes → seconds
•  Start by replacing ODS with Operational Data Lake
•  5-10x faster and ¼ cost
Splice Machine
•  Replace RDBMSs like Oracle and MySQL
•  Best of both worlds
   •  SQL and transactions of RDBMSs
   •  Scale-out of NoSQL
•  10x better price/performance
[Diagram labels: ETL: Scale-up; ETL: Scale-out; ELT; T Only]
Focused on Operational Workloads
Oracle vs. Splice Machine TCO comparison

Oracle RAC Costs                                   List Price   Unit   3 Year Cost (Discounted 60%)
Oracle Database Enterprise Edition with RAC        $37,750      64     $966,400
3 years DB Maintenance (22% list price/yr)         $24,915      64     $637,824
3 years Operating System Support (Oracle Linux)    $6,897       4      $11,035
Server Costs (mid-range, Intel Xeon-based)         $16,000      4      $64,000
Primary Storage                                    $143,360            $143,360
TOTAL                                              $228,922            $1,822,619
Assumes Oracle Enterprise Edition ($47.5K/CPU) and RAC ($23K/CPU)

Splice Machine Costs                               List Price   Unit   3 Year Cost (without discount)
Splice Machine Annual Subscription                 $10,000      7      $210,000
Cloudera Enterprise Edition Annual Subscription    $7,500       8      $180,000
Server Costs with Storage                          $5,000       8      $40,000
TOTAL                                              $22,500             $430,000

76% TCO Reduction
Perceptions & Questions
Analyst:
Robin Bloor
Life in the Data Lake
Robin Bloor, Ph.D.
Hadoop: One Ring to Rule Them All
Hadoop has become the de facto processing environment for big data.
Is it going to become the de facto environment for ALL SERVER COMPUTING?
Empires to Conquer
•  Big Data
•  Analytics
•  Real-time analytics
•  OLTP
•  Document shares
•  Office systems
✔︎ ✔︎ ? ? ??
Just A Few Years Ago
What Hadoop Dreams Of
Hadoop Possibilities?
•  Hadoop is evolving faster than any equivalent technology I can remember
•  It has a very long way to go to become the “server OS for everything.”
•  First it would need to become a genuine OS
•  It has no stated direction.
•  It may vanish into the cloud.
•  Nevertheless it is interesting to watch
The Net Net
Meanwhile, it has become a lab for server software
•  It’s not just ETL: it’s ETL, data cleansing, metadata capture, MDM, etc. How do you accommodate that?
•  Do you have any ETL customer experiences to report?
•  How’s your OLTP business going? (Is this ETL emphasis a complementary activity?)
•  How well are you doing versus Oracle?
•  How well does it integrate with other technologies?
•  Who are your current largest customers?
•  Do you have any direct competition on Hadoop?
Upcoming Topics
www.insideanalysis.com
August: REAL-TIME DATA
September: HADOOP 2.0
October: DATA MANAGEMENT
THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons and by basykes [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons (https://upload.wikimedia.org/wikipedia/commons/9/94/Beijing_traffic_jam.jpg)

Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL

  • 1. Grab some coffee and enjoy the pre-­show banter before the top of the hour!
  • 2. The Briefing Room Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
  • 3. Twitter Tag: #briefr The Briefing Room Welcome Host: Eric Kavanagh [email protected] @eric_kavanagh
  • 4. Twitter Tag: #briefr The Briefing Room   Reveal the essential characteristics of enterprise software, good and bad   Provide a forum for detailed analysis of today s innovative technologies   Give vendors a chance to explain their product to savvy analysts   Allow audience members to pose serious questions... and get answers! Mission
  • 5. Twitter Tag: #briefr The Briefing Room Topics August: REAL-TIME DATA September: HADOOP 2.0 October: DATA MANAGEMENT
  • 6. Why Data Gets in a Jam: ETL is dated technology. New super-highways are needed. Data gravity is real.
  • 7. Analyst: Robin Bloor. Robin Bloor is Chief Analyst at The Bloor Group. robin.bloor@bloorgroup.com, @robinbloor
  • 8. Splice Machine: Splice Machine is a SQL-on-Hadoop database. The product is ACID-compliant and can power both OLAP and OLTP workloads. Splice Machine is built on Java-based Apache Derby and HBase/Hadoop.
  • 9. Guest: Rich Reimer. Rich Reimer, VP of Marketing and Product Management. Rich has over 15 years of sales, marketing and management experience in high-tech companies. Before joining Splice Machine, Rich worked at Zynga as the Treasure Isle studio head, where he used petabytes of data from millions of daily users to optimize the business in real-time. Prior to Zynga, he was the COO and co-founder of a social media platform named Grouply. Before founding Grouply, Rich held executive positions at Siebel Systems, Blue Martini Software and Oracle Corporation as well as sales and marketing positions at General Electric and Bell Atlantic.
  • 10. Splice Machine Proprietary and Confidential. ETL: Gatekeeper to Real-Time Big Data. Rich Reimer, VP, Product Management, [email protected]. August 11, 2015
  • 11. What Is Real-Time? Are We There Yet? Capture, Analyze, Act. Depends on where you are in the insight-to-action continuum. Current vs. Real-Time: Nightly ETL; Data Lakes; Interactive Reports on Old Data; Days for Data Scientists to Analyze; Millisecond Delay; Automated Machine Learning; Days to Update Rules; Months to Update Apps; Autonomic Applications. Crawl, Walk, Run.
  • 12. ETL: Boring, Unglamorous, Inevitable Burden. “ETL is something you do that nobody notices until you don’t do it.” - Author Unknown
  • 13. But It’s Killing You Slowly… Inertia and hidden costs dragging your business down. Diagram labels: Systems of Record (ERP, CRM, …); ODS; ETL; Data Warehouse. Expensive: Scale-up hardware and proprietary software. Tuning: Ongoing database tuning to address performance issues. Script Maintenance: Constant updating of ETL scripts to handle changing sources and reports. Unable to Meet Business Needs: Takes weeks or months to change or create new reports. Delayed Reports: Errors or performance issues cause miss of ETL window and delay reports. Data Too Old: Data is hours or days old, when business needs it near real-time. Too Slow: Can take hours or even days to finish ETL pipeline.
  • 14. Big Data Makes It Worse. ETL becomes bigger bottleneck as data grows. Diagram labels: ETL Bottleneck; Applications; Analysis. 30-40% data growth per year. Source: 2013 IBM Briefing Book.
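To make the 30-40% annual growth figure on slide 14 concrete, the following sketch shows how quickly data volume compounds at those rates. The 100 TB starting volume, the five-year horizon, and the function names are assumptions chosen for illustration; they do not come from the deck.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for data volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_volume(start_tb: float, annual_growth: float, years: int) -> float:
    """Volume after compounding growth for a whole number of years."""
    return start_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    # 100 TB is a hypothetical starting point, not a figure from the slides.
    for rate in (0.30, 0.40):
        print(
            f"{rate:.0%} growth: doubles in {doubling_time_years(rate):.1f} years; "
            f"100 TB becomes {projected_volume(100, rate, 5):.0f} TB after 5 years"
        )
```

If the nightly ETL window stays fixed while volume doubles every two to three years, a pipeline that cannot add capacity eventually overruns that window, which is the bottleneck the slide describes.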
  • 15. Scale-Out: The Future of Databases. Dramatic improvement in price/performance. Scale Up (increase server size) vs. Scale Out (more small servers).
  • 16. Fixing ETL: Incremental Approach. Incremental evolution to reduce lag from days to seconds. Approach, timing, and lag: ETL: Scale-up (Legacy; Days/Hours); ETL: Scale-out (Now; Hours/Minutes); ELT (Now; Minutes/Seconds); T Only (Future; No Lag). Architecture diagram labels: OLTP; Transform; OLAP; OLTP/OLAP.
  • 17. Reference Architecture: Typical Data Processing Pipeline. How do you reduce lag from days to minutes to seconds? Diagram labels: Systems of Record (ERP, CRM, Supply Chain, HR, …); Stream or Batch Updates; ODS; ETL (Extract, Transform, Load); Data Warehouse; Datamart; Mixed Workload Apps; Operational Reports & Analytics; Executive Business Reports; Ad Hoc Analytics.
  • 18. Reference Architecture: Scale-Out Data Processing Pipeline. Accelerate Data Processing Pipeline to minutes or even seconds. Diagram labels: Systems of Record (ERP, CRM, Supply Chain, HR, …); Stream or Batch Updates; Operational Data Lake; ETL (Extract, Transform, Load); Data Warehouse; Datamart; Mixed Workload Apps; Operational Reports & Analytics; Executive Business Reports; Ad Hoc Analytics. Benefits: 5-10x faster; 75% less cost; Elastic scalability; Unstructured data support.
  • 19. You Need More Than Hadoop By Itself For ETL. Errors or data quality issues force ETL restarts. Hadoop ETL: Restart ETL to fix errors or update records (hours). Hadoop RDBMS ETL: Use transaction to restart step or update records (seconds). Diagram labels: Apps; ETL; Analytics. Benefits: SQL-based transforms; Improved data quality; Faster recovery with transactions.
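The recovery pattern on slide 19 (use a transaction to restart one step instead of the whole ETL run) can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for any transactional SQL store; it is not Splice Machine's API, and the `orders` table, the batches, and `load_batch` are hypothetical.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list[tuple[int, str, float]]) -> bool:
    """Load one batch atomically; on a bad row, roll back only this batch."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL,"
    " amount REAL NOT NULL CHECK (amount >= 0))"
)

batch_1 = [(1, "acme", 120.0), (2, "globex", 75.5)]
batch_2 = [(3, "initech", 40.0), (4, "umbrella", -10.0)]  # bad row: negative amount

step_1_ok = load_batch(conn, batch_1)  # step 1 commits
step_2_ok = load_batch(conn, batch_2)  # step 2 fails and is rolled back as a unit

# Only step 2 is repaired and rerun; the rows from step 1 are left untouched.
batch_2_fixed = [(3, "initech", 40.0), (4, "umbrella", 10.0)]
retry_ok = load_batch(conn, batch_2_fixed)

loaded_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
print(step_1_ok, step_2_ok, retry_ok, loaded_ids)
```

Because the failed batch is rolled back as a unit, row 3 is not left half-loaded, so the retry does not collide with a partial insert and earlier steps never need to be replayed.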
  • 20. Streamlining the Structured Data Pipeline in Hadoop. Traditional Hadoop Pipeline: Source Systems (ERP, CRM, …); Sqoop; Apply Inferred Schema; Stored as flat files; SQL Query Engines; BI Tools. vs. Streamlined Hadoop Pipeline: Source Systems (ERP, CRM, …); Existing ETL Tool; Stored in same schema; BI Tools. Benefits: Less cost and complexity; Faster w/ fewer translations; Improved data quality; Better SQL support.
  • 21. Seamless Integration of Structured and Unstructured Data. Optimizing storage and querying of structured data as part of ELT or Hadoop query engines. Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR, …); Structured Data; Unstructured Data; HCATALOG; Pig. (1) SCHEMA ON INGEST: Streamlined, structured-to-structured integration. (2) SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data. (3) SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data.
  • 22. Case Study: Operational Data Lake. Overview: Computer technology corporation. Update database technology for: ODS layer replacement; ETL processing and analysis of Omniture data; Real-time OLTP for Global Tech Support app. Challenges: Oracle and Teradata too expensive to scale; Many Oracle queries couldn’t complete; Can only hold 7 days worth of data in Oracle; Missing ETL window with current Hadoop data lake. Solution Diagram: (400TB); OLTP Systems (ERP, CRM, Supply Chain). Benefits: 75% less cost with commodity scale out; Incremental ETL processing to gracefully handle data quality issues; 5x-10x faster, completing queries on which Oracle failed.
  • 23. Use Cases: Internet of Things; ETL/Operational Data Lake; Digital Marketing; Precision Medicine; Fraud Detection.
  • 24. Who Are We? THE HADOOP RDBMS. Replace Oracle with Splice Machine to scale out your applications. Affordable, Scale-Out: Commodity hardware. Elastic: Easy to expand or scale back. Transactional: Real-time updates & ACID Transactions. ANSI SQL: Leverage existing SQL code, tools, & skills. Flexible: Support operational and analytical workloads. 10x Better Price/Perf.
  • 25. Proven Building Blocks: Hadoop and Derby. APACHE DERBY: ANSI SQL-99 RDBMS; Java-based; ODBC/JDBC Compliant. APACHE HBASE/HDFS: Auto-sharding; Real-time updates; Fault-tolerance; Scalability to 100s of PBs; Data replication.
  • 26. Distributed, Parallelized Query Execution. Parallelized computation across cluster. Moves computation to the data. Utilizes HBase co-processors. No MapReduce. Legend: HBase Co-Processor; HBase Server Memory Space.
  • 27. ANSI SQL-99 Coverage. Data types: e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT. DDL: e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE. Predicates: e.g., IN, BETWEEN, LIKE, EXISTS. DML: e.g., INSERT, DELETE, UPDATE, SELECT. Query specification: e.g., GROUP BY, HAVING. SET functions: e.g., UNION, ABS, MOD, ALL. Aggregation functions: e.g., AVG, MAX, COUNT. String functions: e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH. Constraints: e.g., PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL. Conditional functions: e.g., CASE, searched CASE. Privileges: e.g., privileges for SELECT, DELETE, INSERT, EXECUTE. Joins: e.g., INNER JOIN, LEFT OUTER JOIN. Transactions: e.g., COMMIT, ROLLBACK, Snapshot Isolation. Sub-queries. Triggers. User-defined functions (UDFs). Views, including grouped views.
  • 28. Lockless, ACID transactions. Adds multi-row, multi-table transactions to HBase w/ rollback. Fast, lockless, high concurrency. Extends research from Google Percolator, Yahoo Labs, U of Waterloo. Patent pending technology.
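Slides 27 and 28 mention snapshot isolation and lockless multi-row transactions with rollback. The toy sketch below illustrates the general idea with a single-process multi-version store and a first-committer-wins conflict check. It is a teaching aid under simplifying assumptions (one process, no durability, an in-memory clock); it is not Splice Machine's or Percolator's actual protocol, and every class and method name is hypothetical.

```python
class WriteConflict(Exception):
    """Raised when another transaction committed the same key first."""


class SnapshotStore:
    """Toy multi-version store: every commit appends a new version per key."""

    def __init__(self) -> None:
        self.versions: dict[str, list[tuple[int, object]]] = {}
        self.clock = 0  # timestamp of the most recent commit

    def begin(self) -> "Transaction":
        return Transaction(self, start_ts=self.clock)


class Transaction:
    def __init__(self, store: SnapshotStore, start_ts: int) -> None:
        self.store = store
        self.start_ts = start_ts
        self.writes: dict[str, object] = {}  # buffered until commit

    def read(self, key: str):
        """Return own buffered write, else the newest version in the snapshot."""
        if key in self.writes:
            return self.writes[key]
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key: str, value: object) -> None:
        self.writes[key] = value  # no lock is taken; readers are never blocked

    def commit(self) -> None:
        """Apply all buffered writes together, or none if any key conflicts."""
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                self.writes.clear()  # roll back: nothing was ever made visible
                raise WriteConflict(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))


store = SnapshotStore()
t1, t2 = store.begin(), store.begin()
t1.write("row:1", "shipped")
t1.write("row:2", "billed")
t1.commit()  # both rows become visible together

snapshot_view = t2.read("row:1")  # t2 still reads from its older snapshot
t2.write("row:1", "cancelled")
try:
    t2.commit()
    conflict = False
except WriteConflict:
    conflict = True  # first committer wins; t2's writes are discarded

latest = store.begin().read("row:1")
print(snapshot_view, conflict, latest)
```

Writes are buffered and validated only at commit time, so no transaction waits on another's lock. The cost is that a transaction that loses the conflict check must be retried.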
  • 29. What People are Saying… Recognized as a key innovator in databases. Quotes: “Scaling out on Splice Machine presented some major benefits over Oracle ...automatic balancing between clusters...avoiding the costly licensing issues.” “An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop.” “The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop.” Awards.
  • 30. Initial Advisory Board. Advisory Board includes luminaries in databases and technology. Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC. Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; Founder of Apache Spark. Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle. Ken Rudin: Head of Analytics at Facebook; Former GM of Oracle Data Warehousing. Abhinav Gupta: Co-Founder, VP Engineering at Rocket Fuel; Runs 15PB HBase Cluster.
  • 31. The First Step to Real-Time Big Data Requires Fixing ETL. ETL on Hadoop: Drive lag down from hours → minutes → seconds; Start by replacing ODS with Operational Data Lake; 5-10x faster and ¼ cost. Splice Machine: Replace RDBMSs like Oracle and MySQL; Best of both worlds (SQL and transactions of RDBMSs, scale-out of NoSQL); 10x better price/performance. Approaches: ETL: Scale-up; ETL: Scale-out; ELT; T Only.
  • 32. ETL: Gatekeeper to Real-Time Big Data. Rich Reimer, VP, Product Management, [email protected]. August 11, 2015
  • 33. Focused on Operational Workloads
  • 34. Oracle vs. Splice Machine TCO Comparison
      Oracle RAC Costs | List Price | Unit | 3 Year Cost (Discounted 60%)
      Oracle Database Enterprise Edition with RAC | $37,750 | 64 | $966,400
      3 years DB Maintenance (22% list price/yr) | $24,915 | 64 | $637,824
      3 years Operating System Support (Oracle Linux) | $6,897 | 4 | $11,035
      Server Costs (mid-range, Intel Xeon-based) | $16,000 | 4 | $64,000
      Primary Storage | $143,360 | | $143,360
      TOTAL | $228,922 | | $1,822,619
      Assumes Oracle Enterprise Edition ($47.5K/CPU) and RAC ($23K/CPU)
      Splice Machine Costs | List Price | Unit | 3 Year Cost (without discount)
      Splice Machine Annual Subscription | $10,000 | 7 | $210,000
      Cloudera Enterprise Edition Annual Subscription | $7,500 | 8 | $180,000
      Server Costs with Storage | $5,000 | 8 | $40,000
      TOTAL | $22,500 | | $430,000
      76% TCO Reduction
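The totals on slide 34 can be re-derived from the per-unit figures. The sketch below does that arithmetic. Which rows carry the 60% discount is inferred from the numbers (the three Oracle software and support rows only reproduce the printed costs with it applied; hardware and storage match without it), so treat that mapping as an assumption rather than something the slide states. The `LineItem` class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    name: str
    list_price: float      # price per unit as printed on the slide
    units: int
    years: int = 1         # 3 for annual subscriptions, 1 for one-off or pre-summed prices
    discount: float = 0.0  # fraction taken off list price

    def three_year_cost(self) -> float:
        return self.list_price * self.units * self.years * (1 - self.discount)


ORACLE = [
    LineItem("Database Enterprise Edition with RAC", 37_750, 64, discount=0.60),
    LineItem("3 years DB maintenance", 24_915, 64, discount=0.60),
    LineItem("3 years OS support (Oracle Linux)", 6_897, 4, discount=0.60),
    LineItem("Servers", 16_000, 4),
    LineItem("Primary storage", 143_360, 1),
]

SPLICE = [
    LineItem("Splice Machine annual subscription", 10_000, 7, years=3),
    LineItem("Cloudera Enterprise annual subscription", 7_500, 8, years=3),
    LineItem("Servers with storage", 5_000, 8),
]

oracle_total = sum(item.three_year_cost() for item in ORACLE)
splice_total = sum(item.three_year_cost() for item in SPLICE)
reduction = 1 - splice_total / oracle_total

print(f"Oracle RAC 3-year cost: ${oracle_total:,.0f}")
print(f"Splice Machine 3-year cost: ${splice_total:,.0f}")
print(f"TCO reduction: {reduction:.0%}")
```

Under these assumptions the computed totals match the slide's $1,822,619 and $430,000, and the ratio rounds to the stated 76% reduction. The comparison also depends on the unequal node counts (4 Oracle servers versus 8 commodity servers) that the slide chose.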
  • 35. Perceptions & Questions. Analyst: Robin Bloor
  • 36. Life in the Data Lake Robin Bloor, Ph.D.
  • 37. Hadoop: One Ring to Rule Them All. Hadoop has become the de facto processing environment for big data. Is it going to become the de facto environment for ALL SERVER COMPUTING?
  • 38. Empires to Conquer: Big Data (✔); Analytics (✔); Real-time analytics (?); OLTP (?); Document shares (?); Office systems (?)
  • 39. Just A Few Years Ago
  • 41. Hadoop Possibilities? Hadoop is evolving faster than any equivalent technology I can remember. It has a very long way to go to become the “server OS for everything.” First it would need to become a genuine OS. It has no stated direction. It may vanish into the cloud. Nevertheless it is interesting to watch.
  • 42. The Net Net: Meanwhile, it has become a lab for server software.
  • 43. It’s not just ETL: it’s ETL, data cleansing, metadata capture, MDM, etc. How do you accommodate that? Do you have any ETL customer experiences to report? How’s your OLTP business going? (Is this ETL emphasis a complementary activity?) How well are you doing versus Oracle?
  • 44. How well does it integrate with other technologies? Who are your largest current customers? Do you have any direct competition on Hadoop?
  • 46. Upcoming Topics (www.insideanalysis.com). August: REAL-TIME DATA. September: HADOOP 2.0. October: DATA MANAGEMENT.
  • 47. THANK YOU for your ATTENTION! Some images provided courtesy of Wikimedia Commons and by basykes [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons (https://upload.wikimedia.org/wikipedia/commons/9/94/Beijing_traffic_jam.jpg)