Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

End User Panel on Real-Time Data Analytics
Building Predictive Applications with
Real-Time Data Pipelines and Streamliner
Eric Frenkiel, CEO and Co-Founder, MemSQL

Going Real-Time is the Next Phase for Big Data
More
Devices
More
Interconnectivity
More
User Demand
…and companies are at risk of being left behind

MemSQL Architecture
St ream in g Da ta W areh o u se
Streaming
Integrated streaming
with Streamliner
Database
High volume transactions
for structured and
unstructured data
Data Warehouse
Fast, scalable
SQL for immediate
analytics

Applications and Technology Trends
Real-Time Analytics Risk-Management Personalization
Portfolio Tracking
Monitoring and
Detection
Internet of Things | Real-Time Data Pipelines | Operationalizing Apache Spark

Put Apache Spark in the fast lane.
Persist. Perform. Perfect.

Changing the Way the World Invests
Noah Zucker, Vice President – Tactical Engineering, Novus Partners
Scalable Portfolio Intelligence with MemSQL

 100+ Investment Managers, $2 Trillion AUM
 Research Platform: 10,000+ Institutions
 Founded 2007, Privately Held
We help investors discover their true
investment acumen and risk
About Novus

True Investment Acumen and Risk…at Scale

 24/7 ETL Handholding
 Overnight Failure =
Business Hours Slowdown
 Scala worker pool limited
by the database
 Non-trivial code changes
needed to shard and scale
Before MemSQL…

Today’s Portfolio Intelligence…Right Now
Before MemSQL:
With MemSQL:
90 Min.
2 Min.
Customer
Data
Persistent
StoreETL Analytics
(Scala)

First-Class JSON Support…Happy Developers
memsql> select * from tasks t where t.task::uid::%clientId = 7;
+---------+---------------------------------------------------------------+
| task_id | task |
+---------+---------------------------------------------------------------+
| 3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"noahlz"} |
+---------+---------------------------------------------------------------+
1 row in set (0.00 sec)
Salat

 Client team focuses on
service, not ETL
 Predictable application
performance
 Scala workers: 12  126
 Add servers to scale –
No code changes needed
With MemSQL…

http://www.novus.com
http://tech.novus.com
@NovusCode

Ian Hansen, Software Engineering Manager
Digital Ocean
ETL Tools for Small Teams

Problem: Business Intelligence Slows as We Grow
 Data lives in SQL
 Easy to ask new questions in SQL
 But… Business Intelligence tasks taking longer
 Database isn’t built for quick aggregations

Solution: Scale-out SQL Database
 SQL team stays powerful
 Quick to iterate with quick answers
 Prepare for the future!

Problem: Data isn’t in MemSQL
Plus
 You don’t have an engineer on
your team
 It’s hard to get an engineer’s time
 You’ve got a job to do…
(which is taking more and more
time)

Solution: ETL Using REPLACE INTO
 MySQL SQL flavor (available in MemSQL)
 Handles new rows and updates on rows
 Easy to write
• Query source database then replace into target database
 Many other scale-out SQL databases don’t have
equivalent

Problem: Now Load JSON Event Data
 ~300K events per day
 Many different types of JSON events

Solution: MemSQL Loader + JSON Type
 Only loads new files (or files
whose content has changed)
 Parallelizes the process
 Transformation script
simple: return id and raw json data
 SQL team unaffected by new
JSON events
./memsql-loader load /opt/events/**
--table events
--script=/opt/events-etl
--file-id-column file_id
--columns id,data

Problem: Processing Data on Select
 Need computed value in SQL query
 Computing the value slows down queries
 Computed value used on many queries
• e.g. domain from a URL string

Solution: Persistent Columns
 Pre-compute result and
save it on the row
 Automatically updated if
row changes
 No need to alter ETL
pipeline
ALTER TABLE events
ADD COLUMN (
referring_domain AS
substring_index(substring(data::$re
ferrer, (locate('//',
data::$referrer)) + 2), '/', 1)
PERSISTED varchar(255)
)

Solution: Persistent Columns
Use pre-computed value in select
memsql> select data, referring_domain from events limit 2;
+-------------------------------------+------------------+
| data | referring_domain |
+-------------------------------------+------------------+
| {"referrer":"http://example.com/b"} | example.com |
| {"referrer":"http://example.com/a"} | example.com |
+-------------------------------------+------------------+

Tools
 REPLACE INTO syntax
 JSON native type
 MemSQL Loader
 Persistent columns
 Now, MemSQL Streamliner

Mike DePrizio, Senior Architect, Akamai Technologies
Unlocking Revenue with In-Memory Technology

We are the leading provider of
cloud services for delivering,
optimizing and securing online
content and business applications
$1.96B
Revenue
1,300
Locations
5,000+
Customers
5,100+
Employees
CORPORATE STATS (2014):
OUR HISTORY:
Founded 1998 and rooted in MIT
technology—solving Internet
congestion with math not hardware

The Business of Billing
Billing domino effect
 Akamai  Customers  Sub-customers
Daily billing requires:
 Fast data delivery
 Accurate data
Old Model New Model
Generating a bill at end of month for
customer services
Generating a bill at the end of every
day for sub-customer services

Current Billing Data Management
Gather logs from 190,000+ servers in 1400 locations in 110
countries
 Multiple PBs/day aggregate/reduce into relevant billing data feed
 Typical data record: 3 key fields plus metrics
 Load resulting data record into our RDBMS system

Greatest Challenges
 Current system cannot handle expected throughput
 Difficult to quickly scale up existing environments
 New model will generate 10x+ data

Deploying MemSQL
Application
Daily Sub-customer billing
Problem
Existing RDMS pipeline loads were maxed out at 150-
300K upserts/second, could not keep up with projected
size of new billing model
Results
MemSQL cluster performs at 1.9
million upserts/second, allowing
transition from monthly to daily billing
Billing Data resource
usage statistics
INSERT... ON
DUPLICATE KEY
UPDATE...
(1.9 million/sec)
Billing Application
• Compute sub-customer
charges daily
• Roll up sub-customer usage by
customer/cloud provider
• More sophisticated platform
offers customers better
service, partners new business
opportunities

Results Speak for Themselves
 2M upserts/second on AWS EC2
instances
 Scalability on commodity hardware
 Meeting our billing windows
 Unlocking revenue

 Adapt PoC for real-world
situations
 Continue scaling linearly
 Optimize results with small
cluster deployment
What Next?

Eric Frenkiel, MemSQL CEO and co-founder
September 30, 2015 • New York, NY
Introducing MemSQL Streamliner

 One click deployment of
integrated Apache Spark
 Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation
 Eliminates batch ETL
 Open source on GitHub
Introducing the MemSQL Streamliner

Simple Deployment Process
Application

1. Deploy MemSQL
Cluster
In-Memory | Distributed | Relational
Application

2. Deploy Spark
Cluster
Application

Kafka Connects to Each Node
Cluster
Application

Streamliner Architecture
First of many integrated Apache Spark solutions
Other
Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine
Learning Solution
STREAMLINER

Streamliner ETL Detail
Other
Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine
Learning Solution
STREAMLINER
STREAMLINER
Custom
Future Extractor
JSON
Custom
Future Transformer
Extract Transform Load

Building Predictive Applications
Streamliner
Input
User Jar
SAS Generated PMML
Industrial
Equipment
Sensor Data
S1 S2 S3 P1 P2 P3
Scoring Real-Time Data
with Predictive Models
Sensor 1 Predictive Model 1

Streamliner Benefits
 Build end-to-end data pipelines in minutes
 Reduce data latency from days or hours to ZERO
 Support thousands of concurrent users running real-time
queries
 Give users immediate access to fresh data via innovative
applications

THE GAME
See MemSQL Streamliner in Action at Booth #831

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Recommended

More Related Content

What's hot (19)

Similar to Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics (20)

More from SingleStore (20)

Recently uploaded (20)

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Editor's Notes