Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Data Structures for Statistical Computing in Python (Wes McKinney)
The document discusses statistical data structures in Python. It summarizes that structured arrays are commonly used to store statistical data sets but have limitations. The R data frame is introduced as a flexible alternative that inspired the pandas library in Python. Pandas aims to create intuitive data structures for statistical analysis with labeled axes and automatic data alignment. Its core data structure, the DataFrame, functions similarly to R's data frame.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
This document provides an overview of Continuum Analytics and Python for data science. It discusses how Continuum created two organizations, Anaconda and NumFOCUS, to support open source Python data science software. It then describes Continuum's Anaconda distribution, which brings together 200+ open source packages like NumPy, SciPy, Pandas, Scikit-learn, and Jupyter that are used for data science workflows involving data loading, analysis, modeling, and visualization. The document outlines how Continuum helps accelerate adoption of data science through Anaconda and provides examples of industries using Python for data science.
Data Wrangling and Visualization Using Python (MOHITKUMAR1379)
Python is open source and has many libraries for data wrangling and visualization that make data scientists' lives easier. For data wrangling, pandas is used: it represents tabular data and provides functions for parsing data from different sources, cleaning data, handling missing values, merging data sets, and more. To visualize data, the low-level matplotlib library can be used; it also serves as the base for higher-level packages such as seaborn, which draw well-customized plots in a single line of code. Python's Dash framework makes it possible to build interactive web applications in Python without JavaScript or HTML, and these Dash applications can be published on any server or cloud, such as Google Cloud, or for free on Heroku.
Scipy 2011 Time Series Analysis in Python (Wes McKinney)
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools.
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
Graph Analysis Trends and Opportunities -- CMG Performance and Capacity 2014 (Jason Riedy)
High-performance graph analysis is unlocking knowledge in problems like anomaly detection in computer security, community structure in social networks, and many other data integration areas. While graphs provide a convenient abstraction, real-world problems' sparsity and lack of locality challenge current systems. This talk will cover current trends ranging from massive scales to low-power, low-latency systems and summarize opportunities and directions for graphs and computing systems.
Structured Data Challenges in Finance and Statistics (Wes McKinney)
This document discusses structured data challenges in finance and statistics. It introduces Wes McKinney and his work developing pandas, an open-source Python library designed for working with structured and time series data. Pandas includes data structures like the DataFrame, which allows for fast and flexible data manipulation, indexing, and aggregation of tabular data. The document argues that existing tools are still lacking for working with structured data and that pandas was created to optimize ease-of-use, flexibility, and performance.
Chapter 8.3, Data Mining: Concepts and Techniques, 2nd Ed. slides, Han & Kamber (error007)
The document discusses sequential pattern mining algorithms. It begins by introducing sequential patterns and challenges in mining them from transaction databases. It then describes the Apriori-based GSP algorithm, which generates candidate sequences level-by-level and scans the database multiple times. The document also introduces pattern-growth methods like PrefixSpan that avoid candidate generation by projecting databases based on prefixes. Finally, it discusses optimizations like pseudo-projection that speed up sequential pattern mining.
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st... (huguk)
By Rob Claxton, Chief Researcher in Big Data
Video at: https://www.youtube.com/watch?v=gMUSDRljGM8&index=2&list=PL5OOLwV_m9vaoNt0wM9BVjd_gWyseq0IR
STRIP: stream learning of influence probabilities (Albert Bifet)
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs (Jason Riedy)
Graph-structured data in network security, social networks, finance, and other applications not only are massive but also under continual evolution. The changes often are scattered across the graph, permitting novel parallel and incremental analysis algorithms. We discuss analysis algorithms for streaming graph data to maintain both local and global metrics with low latency and high efficiency.
Data engineering and analytics using Python (Purna Chander)
This document provides an overview of data engineering and analytics using Python. It discusses Jupyter notebooks and commonly used Python modules for data science like Pandas, NumPy, SciPy, Matplotlib and Seaborn. It describes Anaconda distribution and the key features of Pandas including data loading, structures like DataFrames and Series, and core operations like filtering, mapping, joining, sorting, cleaning and grouping. It also demonstrates data visualization using Seaborn and a machine learning example of linear regression.
A walk through the maze of understanding Data Visualization using several tools such as Python, R, Knime and Google Data Studio.
This workshop is hands-on and this set of presentations is designed to be an agenda to the workshop
This document discusses updating PageRank for streaming graphs. It begins with background on PageRank and motivation for incremental PageRank on changing graphs. It then presents an approach for incrementally updating PageRank based on a backward error view, solving the residual regionally around changes. Performance results show the incremental method has better latency than restarting PageRank iteration from scratch. The document concludes with discussion of implementing the incremental algorithm using the GraphBLAS framework to reduce overhead.
High-performance graph analysis is unlocking knowledge in computer security, bioinformatics, social networks, and many other data integration areas. Graphs provide a convenient abstraction for many data problems beyond linear algebra. Some problems map directly to linear algebra. Others, like community detection, look eerily similar to sparse linear algebra techniques. And then there are algorithms that strongly resist attempts at making them look like linear algebra. This talk will cover recent results with an emphasis on streaming graph problems where the graph changes and results need updated with minimal latency. We’ll also touch on issues of sensitivity and reliability where graph analysis needs to learn from numerical analysis and linear algebra.
The document discusses the need for an analytics query engine that allows machine learning algorithms to be specified declaratively and executed using distributed operators and optimization techniques. It proposes a language with a SQL-like syntax and the use of Datalog to express machine learning algorithms declaratively. Key operators for tasks like linear algebra, aggregation, and iteration would be defined. The engine would optimize queries by rewriting operators and using techniques from databases and machine learning.
The document proposes improved MapReduce algorithms (UP-Growth and UP-Growth+) for mining high utility itemsets from transactional databases. Existing algorithms often generate a huge number of candidate itemsets, degrading performance. The proposed algorithms use a UP-Tree structure and MapReduce framework on Hadoop to more efficiently identify high utility itemsets from large datasets in distributed storage. Experimental results show the improved algorithms outperform other methods, especially for databases with long transactions or low minimum utility thresholds. The goal is to address limitations of existing approaches for low-memory systems and databases with null transactions.
The document discusses various algorithms for searching data structures, including serial search with average time complexity of Θ(n), binary search with average time complexity of Θ(log n), and hashing techniques that can provide constant time Θ(1) search by storing items in an array using a hash function. It provides pseudocode for binary search and discusses improvements like interpolation search that can achieve Θ(log log n) search time on average.
A Comparison of Different Strategies for Automated Semantic Document Annotation (Ansgar Scherp)
We introduce a framework for automated semantic document annotation that is composed of four processes, namely concept extraction, concept activation, annotation selection, and evaluation. The framework is used to implement and compare different annotation strategies motivated by the literature. For concept extraction, we apply entity detection with semantic hierarchical knowledge bases, Tri-gram, RAKE, and LDA. For concept activation, we compare a set of statistical, hierarchy-based, and graph-based methods. For selecting annotations, we compare top-k as well as kNN. In total, we define 43 different strategies including novel combinations like using graph-based activation with kNN. We have evaluated the strategies using three different datasets of varying size from three scientific disciplines (economics, politics, and computer science) that contain 100,000 manually labeled documents in total. We obtain the best results on all three datasets by our novel combination of entity detection with graph-based activation (e.g., HITS and Degree) and kNN. For the economic and political science datasets, the best F-measure is .39 and .28, respectively. For the computer science dataset, the maximum F-measure of .33 can be reached. These are by far the largest experiments on scholarly content annotation, where datasets typically contain only up to a few hundred documents.
Gregor Große-Bölting, Chifumi Nishioka, and Ansgar Scherp. 2015. A Comparison of Different Strategies for Automated Semantic Document Annotation. In Proceedings of the 8th International Conference on Knowledge Capture (K-CAP 2015). ACM, New York, NY, USA, Article 8, 8 pages. DOI: http://dx.doi.org/10.1145/2815833.2815838
Efficient Online Evaluation of Big Data Stream Classifiers (Albert Bifet)
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
Leveraging Bagging for Evolving Data Streams (Albert Bifet)
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
The document discusses different technologies for storing and querying large chemical datasets, known as "big chemical data". It evaluates PostgreSQL, SQLite, MessagePack, FlatBuffers, and Pandas on a test dataset of 4 million compounds from ZINC. For queries like retrieving atom counts for 50k molecules, counting molecules by atom number, and fingerprint lookups, SQLite and MessagePack performed the fastest, completing in under 50ms. PostgreSQL was also very fast with indices, finishing some queries in under 100ms. The document concludes no single technology is best and the complexity of the tool should match the task.
ffbase, statistical functions for large datasets (Edwin de Jonge)
This document introduces ffbase, an R package that adds statistical functions and utilities for working with large datasets stored in ff format. ffbase allows standard R code to be used on ff objects by rewriting expressions to operate chunkwise. It also connects ff data to other packages for large data analysis. The goal is to make working with large out-of-memory data more convenient and productive within the R environment.
Machine Learning with Apache Flink at Stockholm Machine Learning Group (Till Rohrmann)
This presentation presents Apache Flink's approach to scalable machine learning: Composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
The document discusses machine learning techniques for graphs and graph-parallel computing. It describes how graphs can model real-world data with entities as vertices and relationships as edges. Common machine learning tasks on graphs include identifying influential entities, finding communities, modeling dependencies, and predicting user behavior. The document introduces the concept of graph-parallel programming models that allow algorithms to be expressed by having each vertex perform computations based on its local neighborhood. It presents examples of graph algorithms like PageRank, product recommendations, and identifying leaders that can be implemented in a graph-parallel manner. Finally, it discusses challenges of analyzing large real-world graphs and how systems like GraphLab address these challenges through techniques like vertex-cuts and asynchronous execution.
Distributed approximate spectral clustering for large scale datasets (Bita Kazemi)
The document proposes a distributed approximate spectral clustering (DASC) algorithm to process large datasets in a scalable way. DASC uses locality sensitive hashing to group similar data points and then approximates the kernel matrix on each group to reduce computation. It implements DASC using MapReduce and evaluates it on real and synthetic datasets, showing it can achieve similar clustering accuracy to standard spectral clustering but with an order of magnitude better runtime by distributing the computation across clusters.
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S... (Pooyan Jamshidi)
https://arxiv.org/abs/1606.06543
Finding optimal configurations for Stream Processing Systems (SPS) is a challenging problem due to the large number of parameters that can influence their performance and the lack of analytical models to anticipate the effect of a change. To tackle this issue, we consider tuning methods where an experimenter is given a limited budget of experiments and needs to carefully allocate this budget to find optimal configurations. We propose in this setting Bayesian Optimization for Configuration Optimization (BO4CO), an auto-tuning algorithm that leverages Gaussian Processes (GPs) to iteratively capture posterior distributions of the configuration spaces and sequentially drive the experimentation. Validation based on Apache Storm demonstrates that our approach locates optimal configurations within a limited experimental budget, with an improvement of SPS performance typically of at least an order of magnitude compared to existing configuration algorithms.
Continuous Architecting of Stream-Based Systems (CHOOSE)
Pooyan Jamshidi CHOOSE Talk 2016-11-01
Big data architectures have been gaining momentum in recent years. For instance, Twitter uses stream processing frameworks like Storm to analyse billions of tweets per minute and learn the trending topics. However, architectures that process big data involve many different components interconnected via semantically different connectors making it a difficult task for software architects to refactor the initial designs. As an aid to designers and developers, we developed OSTIA (On-the-fly Static Topology Inference Analysis) that allows: (a) visualizing big data architectures for the purpose of design-time refactoring while maintaining constraints that would only be evaluated at later stages such as deployment and run-time; (b) detecting the occurrence of common anti-patterns across big data architectures; (c) exploiting software verification techniques on the elicited architectural models. In the lecture, OSTIA will be shown on three industrial-scale case studies.
See: http://www.choose.s-i.ch/events/jamshidi-2016/
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ... (Ian Foster)
This document discusses computing challenges posed by rapidly increasing data scales in scientific applications and high performance computing. It introduces the concept of online data analysis and reduction as an alternative to traditional offline analysis to help address these challenges. The key messages are that dramatic changes in HPC system geography due to different growth rates of technologies are driving new application structures and computational logistics problems, presenting exciting new computer science opportunities in online data analysis and reduction.
The document summarizes a presentation given by Chris Fregly on end-to-end real-time analytics using Apache Spark. It discusses topics like Spark streaming, machine learning, tuning Spark for performance, and demonstrates live demos of sorting, matrix multiplication, and thread synchronization optimized for CPU cache. The presentation emphasizes techniques like cache-friendly data layouts, prefetching, and lock-free algorithms to improve Spark performance.
This document discusses approximate query processing using sampling to enable interactive queries over large datasets. It describes BlinkDB, a framework that creates and maintains samples from underlying data to return fast, approximate query answers with error bars. BlinkDB verifies the correctness of the error bars it returns by periodically replacing samples and using diagnostics to check the accuracy without running many queries. The document discusses challenges like selecting appropriate samples, estimating errors, and verifying results to balance speed, accuracy and correctness for interactive analysis of big data.
This presentation focuses on Deep Learning (DL) concepts, such as neural neworks, backprop, activation functions, and Convolutional Neural Networks, with a short introduction to D3, and followed by a TypeScript-based code sample that replicates the TensorFlow playground. Basic knowledge of matrices is helpful.
Optimization for iterative queries on MapReduce (Makoto Onizuka)
This document discusses optimization techniques for iterative queries with convergence properties. It presents OptIQ, a framework that uses view materialization and incrementalization to remove redundant computations from iterative queries. View materialization reuses operations on unmodified attributes by decomposing tables into invariant and variant views. Incrementalization reuses operations on unmodified tuples by processing delta tables between iterations. The document evaluates OptIQ on Hive and Spark, showing it can improve performance of iterative algorithms like PageRank and k-means clustering by up to 5 times.
The document discusses using machine learning techniques like Gaussian processes (GPs) to optimize the configuration of software systems. It notes that software performance landscapes are often complex, with non-linear interactions between parameters and non-convex response surfaces. Measurements are also subject to noise. The document introduces an approach called TL4CO that uses multi-task Gaussian processes to model software performance across different versions/deployments, allowing it to leverage data from other versions to improve optimization. This helps address challenges in DevOps where new versions are continuously delivered.
This document provides an overview of deep learning, machine learning, and artificial intelligence. It discusses the differences between traditional AI, machine learning, and deep learning. Key deep learning concepts covered include neural networks, activation functions, cost functions, gradient descent, backpropagation, and hyperparameters. Convolutional neural networks and their applications are explained. Recurrent neural networks are also introduced. The document discusses TypeScript and how it can be used for deep learning applications.
This document discusses performing data science on HBase using the WibiData platform. It introduces WibiData Language (WDL), which allows analyzing data stored in HBase columns in a concise and interactive way using Scala and Apache Crunch. The document demonstrates building a histogram of editor metrics by reading user data from an HBase table, filtering and binning average edit deltas, and visualizing the results. WDL aims to make HBase data exploration more accessible for data scientists compared to other frameworks like Hive and Pig.
Lambda expressions allow code to be passed as data in Java 8. The talk discusses myths and mistakes around lambda expressions, providing an introduction and examples. It emphasizes that syntax is less important than functional thinking and addresses common issues like debugging, testing and compiler errors. Functional thinking focuses on inputs and outputs rather than steps and requires practice to learn.
Automatic Task-based Code Generation for High Performance DSEL (Joel Falcou)
Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques like Domain Specific Embedded Languages (DSELs) try to solve. In previous work, we investigated the design of such a DSEL, NT2, which provides a Matlab-like syntax for parallel numerical computations inside a C++ library.
The main issue addressed here is how the limitations of classical DSEL generation and multithreaded code generation can be overcome.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
Advanced Data Science on Spark (Reza Zadeh, Stanford), Spark Summit
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
The document discusses using machine learning techniques to learn vector representations of SQL queries that can then be used for various workload management tasks without requiring manual feature engineering. It shows that representations learned from SQL strings using models like Doc2Vec and LSTM autoencoders can achieve high accuracy for tasks like predicting query errors, auditing users, and summarizing workloads for index recommendation. These learned representations allow workload management to be database agnostic and avoid maintaining database-specific feature extractors.
This document discusses the responsible use of data science techniques and technologies. It describes data science as answering questions using large, noisy, and heterogeneous datasets that were collected for unrelated purposes. It raises concerns about the irresponsible use of data science, such as algorithms amplifying biases in data. The work of the DataLab group at the University of Washington is presented, which aims to address these issues by developing techniques to balance predictive accuracy with fairness, increase data sharing while protecting privacy, and ensure transparency in datasets and methods.
Brief remarks on big data trends and responsible data science at the Workshop on Science and Technology for Washington State: Advising the Legislature, October 4th 2017 in Seattle.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
This document discusses democratizing data science in the cloud. It describes how cloud data management involves sharing resources like infrastructure, schema, data, and queries between tenants. This sharing enables new query-as-a-service systems that can provide smart cross-tenant services by learning from metadata, queries, and data across all users. Examples of possible services discussed include automated data curation, query recommendation, data discovery, and semi-automatic data integration. The document also describes some cloud data systems developed at the University of Washington like SQLShare and Myria that aim to realize this vision.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
Urban data science activities at the University of Washington, presented at the Urban@UW kickoff event.
http://urban.uw.edu/
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
A talk at the Urban Science workshop at the Puget Sound Regional Council July 20 2014 organized by the Northwest Institute for Advanced Computing, a joint effort between Pacific Northwest National Labs and the University of Washington.
This document summarizes a presentation about Myria, a relational algorithmics-as-a-service platform developed by researchers at the University of Washington. Myria allows users to write queries and algorithms over large datasets using declarative languages like Datalog and SQL, and executes them efficiently in a parallel manner. It aims to make data analysis scalable and accessible for researchers across many domains by removing the need to handle low-level data management and integration tasks. The presentation provides an overview of the Myria architecture and compiler framework, and gives examples of how it has been used for projects in oceanography, astronomy, biology and medical informatics.
Talk delivered at High Performance Transaction Processing 2013
Myria is a new Big Data service being developed at the University of Washington. We feature high level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664
The University of Washington eScience Institute aims to help position UW at the forefront of eScience techniques and technologies. Its strategy includes hiring research scientists, adding faculty in key fields, and building a consultancy of students. The exponential growth of data is transitioning science from data-poor to data-rich. Techniques like sensors, data management, and cloud computing are important. The "long tail" of smaller science projects is also worthy of investment and can have high impact if properly supported.
A taxonomy for data science curricula; a motivation for choosing a particular point in the design space; an overview of some our activities, including a coursera course slated for Spring 2012
Relational databases remain underused in the long tail of science, despite a number of significant success stories and a natural correspondence between scientific inquiry and ad hoc database query. Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented approaches still dominate. At the University of Washington eScience Institute, we are exploring a new "delivery vector" for selected database features targeting researchers in the long tail: a web-based query-as-a-service system called SQLShare that eschews conventional database design, instead emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over "raw" tabular data. We augment the basic query interface with services for cleaning and integrating data, recommending and authoring queries, and automatically generating visualizations. We find that even non-programmers are able to create and share SQL views for a variety of tasks, including quality control, integration, basic analysis, and access control. Researchers in oceanography, molecular biology, and ecology report migrating data to our system from spreadsheets, from conventional databases, and from ASCII files. In this paper, we will provide some examples of how the platform has enabled science in other domains, describe our SQLShare system, and propose some emerging research directions in this space for the database community.
This document discusses the roles that cloud computing and virtualization can play in reproducible research. It notes that virtualization allows for capturing the full computational environment of an experiment. The cloud builds on this by providing scalable resources and services for storage, computation and managing virtual machines. Challenges include costs, handling large datasets, and cultural adoption issues. Databases in the cloud may help support exploratory analysis of large datasets. Overall, the cloud shows promise for improving reproducibility by enabling sharing of full experimental environments and resources for computationally intensive analysis.
This document discusses enabling end-to-end eScience through integrating query, workflow, visualization, and mashups at an ocean observatory. It describes using a domain-specific query algebra to optimize queries on unstructured grid data from ocean models. It also discusses enabling rapid prototyping of scientific mashups through visual programming frameworks to facilitate data integration and analysis.
The Other HPC: High Productivity Computing in Polystore Environments
1. The Other HPC: High Productivity Computing in Polystore Environments
Bill Howe, Ph.D.
Associate Professor, Information School
Adjunct Associate Professor, Computer Science & Engineering
Associate Director, eScience Institute
Director, Urbanalytics Group
8/7/2017, Bill Howe, UW
3. What is the rate-limiting step in data understanding?
[Figure: processing power vs. time. Processing power (Moore's Law) and the amount of data in the world grow rapidly, while human cognitive capacity stays flat. Idea adapted from "Less is More" by Bill Buxton (2001); slide src: Cecilia Aragon, UW HCDE]
4. Productivity: how long I have to wait for results
[Figure: a latency axis from milliseconds to months (milliseconds, seconds, minutes, hours, days, weeks, months), marking a feasibility threshold and an interactivity threshold, with HPC systems and databases positioned along it.]
Claim: Only these two performance thresholds are generally important; other performance requirements are application-specific.
5. HPC vs. DB/Dataflow
HPC: priority is machine efficiency; data manipulation considered pre-processing; batch.
DB/Dataflow: priority is developer efficiency; analysis considered post-processing; batch and interactive.
6. Observations
• Every interesting application has both a data manipulation component and an analytics component
• Different people like to express things different ways
• Different systems offer better performance at different things
• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra
• So we want full "anything anywhere" rewrites
8. Matrix Multiply
Matrix multiply in RA:
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Sparse means: |non-zero elements| < |rows|^1.2
Naïve sparse algorithm: |non-zero elements| * |rows|
Best-known dense algorithm: |rows|^2.38
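To make the correspondence concrete, here is a minimal sketch (mine, not from the talk) that runs the slide's RA formulation of sparse matrix multiply on a toy example with Python's sqlite3 and checks it against the naive definition C[i,k] = sum_j A[i,j]*B[j,k]; the (i, j, val) table layout follows the slide, everything else is illustrative.

# A minimal sketch (not from the talk): the slide's RA formulation of sparse
# matrix multiply on a toy example, checked against the naive definition.
import sqlite3

A = {(0, 0): 2.0, (0, 2): 1.0, (1, 1): 3.0}   # sparse matrix as {(i, j): val}
B = {(0, 0): 4.0, (2, 0): 5.0, (1, 1): 6.0}   # sparse matrix as {(j, k): val}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (i INT, j INT, val REAL)")
conn.execute("CREATE TABLE B (j INT, k INT, val REAL)")
conn.executemany("INSERT INTO A VALUES (?,?,?)", [(i, j, v) for (i, j), v in A.items()])
conn.executemany("INSERT INTO B VALUES (?,?,?)", [(j, k, v) for (j, k), v in B.items()])

# The query from the slide: matrix multiply as a join plus group-by aggregation.
rows = conn.execute("""
    SELECT A.i, B.k, SUM(A.val * B.val)
    FROM A, B
    WHERE A.j = B.j
    GROUP BY A.i, B.k
""").fetchall()
C_sql = {(i, k): v for i, k, v in rows}

# Naive reference: C[i,k] = sum over j of A[i,j] * B[j,k].
C_ref = {}
for (i, j), va in A.items():
    for (j2, k), vb in B.items():
        if j == j2:
            C_ref[(i, k)] = C_ref.get((i, k), 0.0) + va * vb

assert C_sql == C_ref
print(C_sql)   # {(0, 0): 13.0, (1, 1): 18.0}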
9. Complexity of matrix multiply
[Plot: complexity exponent vs. sparsity exponent r, where m = n^r; n = number of rows, m = number of non-zeros. Naïve sparse algorithm: mn. Best-known sparse algorithm: m^0.7 n^1.2 + n^2. Best-known dense algorithm: n^2.38. There is lots of room between these bounds. Slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication.]
18. CombBLAS vs. MyriaX (Real Data)
• CombBLAS 10X faster on one dataset
• MyriaX 1.5X faster on another!
19. A x B x C
group . join . group . join:
select AB.i, C.m, sum(AB.val*C.val)
from
  (select A.i, B.k, sum(A.val*B.val)
   from A, B
   where A.j = B.j
   group by A.i, B.k
  ) AB,
  C
where AB.k = C.k
group by AB.i, C.m

group . join . join:
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
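The two plans are equivalent because the nested aggregation can be flattened; writing the product out as a sum makes this explicit:

C[i,m] = \sum_{k} \left( \sum_{j} A[i,j]\,B[j,k] \right) C[k,m] = \sum_{j,k} A[i,j]\,B[j,k]\,C[k,m]

The first form materializes the intermediate AB; the second aggregates once over both join variables.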
20. Observations
• Every interesting application has both a data manipulation component and an analytics component
• Different people like to express things different ways
• Different systems offer better performance at different things
• …but in between people and systems, there is no real difference in expressiveness between linear and relational algebra
• So we want full "anything anywhere" rewrites
26. Example: Combine measurements from sensors, compute means & covariances
Preprocessing (easier to express in RA); analysis (easier to express in LA).
[Dylan Hutchison]
27. Example: Sensor Difference Mean & Covariance
Array of Things sensor data (https://arrayofthings.github.io/), collected in CSV files. Sample rows (t, c, v): (466, temp, 55.2); (466, hum, 40.1); (492, temp, 56.3); (492, hum, 35.0); (528, temp, 56.5).
[Dataflow diagram: each stream is filtered and binned onto common time buckets, the streams are subtracted, then the mean and covariance are computed. Preprocessing (easier to express in RA); analysis (easier to express in LA).]
[Dylan Hutchison]
28. Bin query: easy in RA, harder in LA
Input table (t, c, v): (466, temp, 55.2); (466, hum, 40.1); (492, temp, 56.3); (492, hum, 35.0); (528, temp, 56.5). Equivalently, a matrix with rows 466, 492, 528 and columns temp, hum.
Binned table (t', c, v): (460, temp, 55.2); (460, hum, 40.1); (520, temp, 56.4); (520, hum, 35.0). Equivalently, a matrix with rows 460, 520 and columns temp, hum.
bin(t) = t - t % 60 + 60 * ⌊t % 60 / 60 + .5⌋
RA:
SELECT bin(t) AS t', c, avg(v) AS v
GROUP BY t', c
LA: multiply by a 0/1 bucket-assignment matrix (rows 460, 520; columns 466, 492, 528; bucket 460 covers time 466, bucket 520 covers times 492 and 528), using avg on added elements.
[Dylan Hutchison]
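As a concrete illustration of the asymmetry, the sketch below (my own toy code, not the talk's) performs the binning both ways: the RA style as a group-by with avg, and the LA style as multiplication by a 0/1 bucket-assignment matrix with averaging of collided entries. The bin60 rule used here (round t to the nearest multiple of 60) is an assumed stand-in, not necessarily the slide's exact rule, so the bucket boundaries differ from the slide's example.

# Toy sketch: bin sensor readings onto common time buckets, RA style vs. LA style.
# The bin60 rule (round to the nearest multiple of 60) is an assumption for
# illustration; it is not necessarily the rule used on the slide.
import numpy as np
from collections import defaultdict

readings = [(466, "temp", 55.2), (466, "hum", 40.1),
            (492, "temp", 56.3), (492, "hum", 35.0),
            (528, "temp", 56.5)]

def bin60(t):
    return int(round(t / 60.0)) * 60

# RA style: SELECT bin60(t) AS t2, c, avg(v) ... GROUP BY t2, c
groups = defaultdict(list)
for t, c, v in readings:
    groups[(bin60(t), c)].append(v)
ra_result = {key: sum(vs) / len(vs) for key, vs in groups.items()}

# LA style: lay the readings out as a times x channels matrix, then left-multiply
# by a 0/1 bucket-assignment matrix and divide each bucket row by its size.
times = sorted({t for t, _, _ in readings})
chans = sorted({c for _, c, _ in readings})
X = np.full((len(times), len(chans)), np.nan)
for t, c, v in readings:
    X[times.index(t), chans.index(c)] = v

buckets = sorted({bin60(t) for t in times})
B = np.zeros((len(buckets), len(times)))
for j, t in enumerate(times):
    B[buckets.index(bin60(t)), j] = 1.0

counts = B.sum(axis=1, keepdims=True)
la_result = (B @ np.nan_to_num(X)) / counts   # averages entries that collide

# Note: the missing (528, hum) reading shows up as the default 0 in the LA view,
# which is exactly the associative-table semantics discussed on the later slides.
print(ra_result)
print(dict(zip(buckets, la_result.tolist())))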
29. Covariance query: easy in LA, harder in RA
X is an n × d matrix (n points, d attributes).
M = (1/n) 1^T X is a 1 × d matrix.
C = (1/n) X^T X - M^T M is a d × d matrix.
LA:
N = size(X, 1);
M = mean(X, 1);
C = X'*X / N - M'*M;
RA (generated SQL statements for each entry):
T = SELECT FROM X sum(1.0) AS N,
  sum(X1) AS M1, sum(X2) AS M2, …, sum(Xd) AS Md,
  sum(X1*X1) AS Q11, sum(X1*X2) AS Q12, …,
  sum(Xd-1*Xd) AS Q(d-1)d, sum(Xd*Xd) AS Qdd
C = SELECT FROM T
  (1 AS i, 1 AS j, Q11/N - M1*M1 AS v) UNION
  (1 AS i, 2 AS j, Q12/N - M1*M2 AS v) UNION
  …
Carlos Ordonez. Building Statistical Models and Scoring with UDFs. SIGMOD 2007.
[Dylan Hutchison]
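A quick numerical check of the LA formulation (my own sketch with numpy, not the talk's code): with X an n x d data matrix, M = (1/n)·1^T·X and C = (1/n)·X^T·X - M^T·M is the population covariance matrix, which matches what numpy computes.

# Check that C = X'X/n - M'M equals the population covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 points, d = 3 attributes

n = X.shape[0]
M = X.mean(axis=0, keepdims=True)      # 1 x d row vector of column means
C = X.T @ X / n - M.T @ M              # d x d covariance from one pass of sums

assert np.allclose(C, np.cov(X, rowvar=False, bias=True))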
32. [Architecture diagram: MyriaL on top; the Myria middleware compiles a logical algebra and applies rewrite rules into system-specific algebras (parallel, array, graph, key-value, serial); these target back-end APIs for Spark, Myria, CombBLAS, GEMS, Accumulo, and serial C. Shared services: visualization, logging, discovery, history, browsing. Orchestration and execution of the polystore plan.]
33. [The same architecture diagram, extended with the LARA algebra: a LARA API and LARA physical plans sit between the middleware and LaraDB (on Accumulo).]
34. LARA objects and operators
Objects: associative tables, i.e. total functions from keys to values with finite support, with a default value for each value attribute.
Example (keys k1, k2; values v1 [default 0], v2 [default '']):
(a, 37) -> (7, 'dan'); (a, 20) -> (0, ''); (b, 25) -> (0, 'dylan'); (b, 20) -> (2, 'bill')
Operators: Join ⋈⊗ ("horizontal concat"), Union ⋈⊕ ("vertical concat"), and Extension ext_f ("flatmap"; Ext is a restricted form of monadic bind).
UDFs: ⊗, ⊕, f. Think "semiring".
Join and Union adapted from: M. Spight and V. Tropashko. First steps in relational lattice. 2006.
35. Join: Horizontal Concat
A (keys a, c; values x [default 0], z [default 0]):
(a1, c1) -> (11, 1); (a1, c2) -> (12, 2); (a2, c1) -> (13, 3); (a3, c3) -> (14, 4)
B (keys c, b; values z [default 0'], y [default 0']):
(c1, b1) -> (5, 15); (c2, b1) -> (6, 16); (c2, b2) -> (7, 17); (c4, b1) -> (8, 18)
A ⋈⊗ B (keys a, c, b; value z [default 0 ⊗ 0']):
(a1, c1, b1) -> 1 ⊗ 5; (a2, c1, b1) -> 3 ⊗ 5; (a1, c2, b1) -> 2 ⊗ 6; …
(a row such as (a3, c3, b1) -> 4 ⊗ 0' = 0 ⊗ 0' falls back to the default)
Requires: vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'
36. Union: Vertical Concat
A (keys a, c; values x [default 0], z [default 0]):
(a1, c1) -> (11, 1); (a1, c2) -> (12, 2); (a2, c1) -> (13, 3); (a3, c3) -> (14, 4)
B (keys c, b; values z [default 0], y [default 0]):
(c1, b1) -> (5, 15); (c2, b1) -> (6, 16); (c2, b2) -> (7, 17); (c4, b1) -> (8, 18)
A ⋈⊕ B (key c; values x [default 0], z [default 0 ⊕ 0 = 0], y [default 0]):
c1 -> (11 ⊕ 13, 1 ⊕ 3 ⊕ 5, 15); c2 -> (12, 2 ⊕ 6 ⊕ 7, 16 ⊕ 17); c3 -> (14, 4, 0); c4 -> (0, 8, 18)
Requires: v ⊕ 0 = 0 ⊕ v = v
37. Ext: Flatmap
A (keys a, c; values x [default 0], z [default 0]):
(a1, c1) -> (11, 1); (a1, c2) -> (12, 2); (a2, c1) -> (13, 3)
f(a, c, x, z) = { k' = ac -> v' = x - z; k' = ca -> v' = z - x }
ext_f A (keys a, c, k'; value v' [default 0 - 0 = 0]):
(a1, c1, a1c1) -> 11 - 1; (a1, c1, c1a1) -> 1 - 11;
(a1, c2, a1c2) -> 12 - 2; (a1, c2, c2a1) -> 2 - 12;
(a2, c1, a2c1) -> 13 - 3; (a2, c1, c1a2) -> 3 - 13
Requires: …
38. Summary: Union, Join, Ext
Union (A ⋈⊕ B): key types = K_A ∩ K_B; value types = V_A ∪ V_B; support ⊆ S_A ∪ S_B
Join (A ⋈⊗ B): key types = K_A ∪ K_B; value types = V_A ∩ V_B; support ⊆ S_A ∩ S_B
Ext (ext_f A): key types extended by f; value types set by f; support ⊆ S_A × S_f
For Support, '⊆' becomes '=' if ⊕ is zero-sum-free or ⊗ has the zero-product property.
Duality.
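To make the summary concrete, here is a toy Python sketch (my own, not LaraDB code) of associative tables and the Join and Union operators tabulated above. A table is modeled as (key attributes, value defaults, rows); the ⊗ and ⊕ UDFs are passed in as plain functions; Ext and the requirements on defaults are omitted for brevity.

# Toy associative tables: (key_attrs, value_defaults, rows), where rows maps a
# key tuple to a dict of values and missing keys implicitly hold the defaults.
def lara_join(A, B, otimes):
    # Join: keys = K_A union K_B, values = V_A intersect V_B, combined with otimes.
    ka, da, ra = A
    kb, db, rb = B
    keys = ka + [k for k in kb if k not in ka]
    vals = [v for v in da if v in db]
    out = {}
    for key_a, row_a in ra.items():
        for key_b, row_b in rb.items():
            amap, bmap = dict(zip(ka, key_a)), dict(zip(kb, key_b))
            if all(amap[k] == bmap[k] for k in ka if k in kb):   # match shared keys
                merged = tuple({**amap, **bmap}[k] for k in keys)
                out[merged] = {v: otimes(row_a[v], row_b[v]) for v in vals}
    defaults = {v: otimes(da[v], db[v]) for v in vals}
    return keys, defaults, out

def lara_union(A, B, oplus):
    # Union: keys = K_A intersect K_B, values = V_A union V_B, collisions folded with oplus.
    ka, da, ra = A
    kb, db, rb = B
    keys = [k for k in ka if k in kb]
    defaults = {**da, **db}
    out = {}
    for src_keys, rows in ((ka, ra), (kb, rb)):
        for key, row in rows.items():
            kmap = dict(zip(src_keys, key))
            short = tuple(kmap[k] for k in keys)
            acc = out.setdefault(short, dict(defaults))
            for v, x in row.items():
                acc[v] = oplus(acc[v], x)
    return keys, defaults, out

# Example shaped like slides 35-36: A has keys (a, c) and value z; B has keys (c, b)
# and value z. Join with * looks like the inner step of a sparse matrix multiply.
A = (["a", "c"], {"z": 0}, {("a1", "c1"): {"z": 1}, ("a1", "c2"): {"z": 2}})
B = (["c", "b"], {"z": 0}, {("c1", "b1"): {"z": 5}, ("c2", "b1"): {"z": 6}})
print(lara_join(A, B, lambda x, y: x * y)[2])    # {('a1', 'c1', 'b1'): {'z': 5}, ('a1', 'c2', 'b1'): {'z': 12}}
print(lara_union(A, B, lambda x, y: x + y)[2])   # {('c1',): {'z': 6}, ('c2',): {'z': 8}}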
39. LARA Properties
> If ⊕ or ⊗ is associative, commutative, or idempotent, then so is Union or Join.
> (Push Aggregation into Join) If ⊗ distributes over ⊕, then … = sum(AB ⊗ C^T) [rewrite equation on the slide].
> (Distribute Join over Union) If …, then … [equations on the slide not captured in this transcript].
43. Microbenchmark: 1% selection, 20 GB. Avoid long code paths. [Benchmark chart; Brandon Myers, ICS 16]
44. Q2 of SP2Bench, 100M triples, multiple self-joins: communication optimization. (Brandon Myers, ICS 16)
45. Graph Patterns (RADISH)
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria's supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
(Brandon Myers, ICS 16)
47. Recap
• Productivity is the new performance
• …but this doesn't mean giving up orders of magnitude in performance by doing everything on one system
• Everything interesting is LA + RA
• There is no difference except syntax and systems
• We want to comprehensively optimize across them, generate code anywhere
48. Other Productivity Work
• Workload Analytics for SQL Data Lakes
– Shrainik Jain
• AI for Scientific Data Curation
– Maxim Grechkin, Hoifung Poon (MSR)
• Visualization Recommendation
– Kanit “Ham” Wongsuphasawat, Dom Moritz, Jeff
Heer
• Information Extraction from Scientific Figures
– Poshen Lee, Sean Yang
• Scalable Approximate Community Detection
– Seung-Hee Bae (Western Michigan)
49. The SQLShare Corpus: a multi-year log of hand-written SQL queries

Queries  24275
Views     4535
Tables    3891
Users      591

(Shrainik Jain, SIGMOD 2016)
https://uwescience.github.io/sqlshare
Workload Analytics for Data Lakes
50. Data "Grazing": short dataset lifetimes
lifetime = days between first and last access of a table
(Shrainik Jain, SIGMOD 2016)
https://uwescience.github.io/sqlshare/
52. Key idea: embed queries as vectors
• Learn query embeddings; use them for all workload analytics tasks:
– Query recommendation
– Workload summarization / index selection
– User behavior modeling
– Predicting heavy hitters
– Forensics
• Get rid of specialized feature engineering (see the sketch below)
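A minimal sketch with gensim (assuming gensim 4.x; illustrative, not the actual SQLShare pipeline) of learning fixed-length embeddings for SQL query strings; the toy queries, whitespace tokenizer, and hyperparameters are all assumptions for illustration:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

queries = [
    "SELECT name FROM users WHERE age > 30",
    "SELECT name FROM users WHERE age > 50",
    "SELECT count(*) FROM orders GROUP BY region",
]
corpus = [TaggedDocument(words=q.lower().split(), tags=[i])
          for i, q in enumerate(queries)]          # naive whitespace tokenization

model = Doc2Vec(vector_size=32, min_count=1, epochs=50)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# embed an unseen query and find its nearest neighbors in the workload
vec = model.infer_vector("SELECT name FROM users WHERE age > 40".lower().split())
print(model.dv.most_similar([vec], topn=2))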
53. Doc2Vec on SQL
Can we recover known patterns in the workload? TPC-H queries, generated with different parameters.

54. Doc2Vec on Templatized Query Plans
Can we recover known patterns in the workload? TPC-H queries, generated with different parameters.
58. Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing. (Maxim Grechkin, Hoifung Poon)
61. Can we curate algorithmically? (Maxim Grechkin, Hoifung Poon)
color = labels supplied as metadata
clusters = first two PCA dimensions on the gene expression data itself
The expression data and the text labels appear to disagree.
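A minimal sketch (scikit-learn and matplotlib, with a synthetic placeholder matrix rather than real GEO data) of the check described above: project expression profiles onto the first two principal components and color the points by the submitted metadata labels to see whether labels and expression clusters agree.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 5000))     # placeholder: samples x genes
labels = rng.integers(0, 4, size=200)         # placeholder: metadata label id per sample

coords = PCA(n_components=2).fit_transform(expression)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title("Expression clusters vs. submitted labels")
plt.show()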
63. Deep Curation (Maxim Grechkin, Hoifung Poon)
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other's results.
Free-text classifier
Expression classifier
NIPS 18 (review)
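A rough sketch of the co-learning loop described above (a hypothetical set-up, not the actual model from the submission): two sklearn-style classifiers over different views of the same samples, each retrained on labels the other is confident about.

import numpy as np

def co_train(text_clf, expr_clf, X_text, X_expr, y_seed, rounds=5, threshold=0.9):
    # y_seed: one integer label per sample, -1 where unlabeled (the distant-supervision seeds)
    y = y_seed.copy()
    for _ in range(rounds):
        labeled = y >= 0
        text_clf.fit(X_text[labeled], y[labeled])
        expr_clf.fit(X_expr[labeled], y[labeled])
        unlabeled = np.flatnonzero(~labeled)
        if unlabeled.size == 0:
            break
        # each classifier proposes labels for samples it is confident about
        for clf, X in ((text_clf, X_text), (expr_clf, X_expr)):
            proba = clf.predict_proba(X[unlabeled])
            confident = proba.max(axis=1) >= threshold
            y[unlabeled[confident]] = clf.classes_[proba.argmax(axis=1)[confident]]
    return text_clf, expr_clf, y

Any pair of probabilistic classifiers (for example, logistic regression over text features and over expression features) could stand in for text_clf and expr_clf.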
64. Deep Curation: our stuff wins, with ZERO training data (Maxim Grechkin, Hoifung Poon)
[Plot legend: state of the art; our reimplementation of the state of the art; our dueling-pianos NN. X-axis: amount of training data used]
NIPS 18 (review)
65. Viziometrics: Analysis of Visualization in the Scientific Literature (Poshen Lee)
[Plot: proportion of non-quantitative figures in a paper vs. paper impact, grouped into 5% percentiles]
70. "Philadelphia is grappling with the prospect of a racist computer algorithm"
The Special Committee on Criminal Justice Reform's hearing on reducing the pre-trial jail population. (Technical.ly, September 2016)

Any background signal of institutional racism in the data is
  amplified by the algorithm
  operationalized by the algorithm
  legitimized by the algorithm

"Should I be afraid of risk assessment tools?"
"No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?"
"And what's the culture of policing in the neighborhood in which I grew up in?"
71. Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)
72. Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)
73. Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)
74. The way I think about this… (1)
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
75. The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples
e.g., "prior arrests tend to increase likelihood of future arrests"
2. Societal constraints
e.g., "we must avoid racial discrimination"
We've become very good at automating the use of past examples.
We've only just started to think about incorporating societal constraints.
76. The way I think about this… (3)
How do we apply societal constraints to algorithmic decision-making?
Option 1: Rely on human oversight
Ex: the EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making
Ex: the Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models
Issues with scalability and prejudice
Option 2: Build systems to help enforce these constraints
This is the approach we are exploring
77. The way I think about this… (4)
On transparency vs. accountability:
• For human decision-making, explanations are sometimes required, improving transparency
– Supreme Court decisions
– Employee reprimands/terminations
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both transparency AND accountability
• "The buck stops where?"
#3: And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
#4: … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
#10:
This is the complexity of three matrix multiply algorithms plotted against the sparsity – a naïve sparse
#14: BLAS: fail to assign memory!!
spBLAS: # of outputs > 2B with r1.6 data.
overflow issues based on 32 bit integers
long ints may work
hyperDB: thrashing with r1.6.
#15: speedup = T_HyperDB / T_SpBLAS
benchmark datasets with r is 1.2 and the real data cases (the three largest datasets: 1.17 < r < 1.20)
on star (nTh = 12), on dragon (nTh = 60)
As n increases, the relative speedup of SpBLAS over HyperDB is reduced.
soc-Pokec: the relative speedup is only around 5 times.
on star, hyperDB stuck on thrashing with soc-Pokec data.
#27: Bridging the gap: towards optimization across linear and relational
#28: Array of Things sensor data from Argonne National Labs. First started as a project with the city of Chicago.
Suppose we have two such sensors and we wish to study the differences in their measurements. We would like to know what the mean differences are and how the measurement types covary. For example, when the first sensor records a higher temperature than the second, does that correlate with a larger humidity measurement as well?
Calibration, compare newly manufactured sensor to the golden standard
#29: We know how to optimize SQL – AGG ---- YOU KNOW THIS
Problems: Matrix needs indexing set; LA doesn't typically have methods to transform indices like this. LA operators work well when transforming the values. Matrix is infinite in general
Assoc(bin(Row(A)), Row(A), 1) *_avg A
#30: This approach stores each column of the matrix X as a separate attribute.
Another approach is (i, j, v). We could use Lara to write the equations for this.
To prove that the MATLAB implements the LA, use: (1/n) (X − 1M)ᵀ(X − 1M) = (1/n) XᵀX − MᵀM
Use n/(n-1) for bias correction.
U = X - repmat(M, N, 1);
C = U.' * U ./ (N - 1);
Bessel’s Correction for sample variance: N/(N – 1)
#32:
So our approach is to model this overlap in capabilities as its own language.
We start
#35: These operators are not totally novel; they are inspired by the following work.
Two syntaxes: COBOL/SQL-style for writing scripts, algebraic/combinatory-style for proofs.
INSPIRED BY INTERESTING GENERALIZATION OF UNION. Flatmap is monotonic in key type.
#36: Multiply matching values by UDF ⊗s, one for each value attribute
Each ⊗ has default value as annihilator: vA ⊗ 0' = 0 ⊗ vB = 0 ⊗ 0'
Keys: = K_A ∪ K_B
Values: = V_A ∩ V_B
Support: ⊆ S_A ∩ S_B (on common keys), with equality when ⊗ has the zero-product property
#37: Sum colliding values by UDF ⨁s, one for each value attribute
Default values must match
Each ⨁ has default value as identity: v ⨁ 0 = 0 ⨁ v = v
Agg: special case when one table is empty
example: aggregates down to key i
Keys: = K_A ∩ K_B
Values: = V_A ∪ V_B
Support: ⊆ S_A ∪ S_B (on common keys), with equality when ⊕ is zero-sum-free
#38: Run f on each row independently, replaces values
Appends new keys (never deletes)
Monotonic in key type; no collisions
No free variables
Use Join/Union to interface with external tables
Requirement for finite support:
Map: special case when k' = ()
Rename: map that changes attribute names
RA: CROSS APPLY
LA: EXPLODE
#42:
NOTES:
Optimizations enable?
with better semantics on a hash table join with UDFs, can do redundant computation elimination, code motion from UDF
#50: We want to not just build the system, we want to understand how people are using it
#51: Why do we care about lifetime?
Table usage predictions for caching and partitioning. Move from reactive to proactive physical design services.
Query idioms are consistent, while the data is fleeting. Not exact queries as in a streaming system, but the “methods” are reused over and over.
Extracting and optimizing these idioms across tenants is our goal.
#59: Google knowledge graph
Specialized Ontologies
#60: Google knowledge graph
Specialized Ontologies
#61: Google knowledge graph
Specialized Ontologies
#62: "HeLa", "K562", "MCF-7" and "brain tumor”
PCA on expression values
#63: Google knowledge graph – common knowledge, high redundancy, possibly crowdsourcing (visual: question answering via Google)
Text features:
presence of ontology terms
sibling of ontology term
Expression features
#71: LSI-R model
25 states use it
Most for targeted programs
Idaho and Colorado use this for sentencing
“As a Black male,” Cobb asked Penn statistician and resident expert Richard Berk, “should I be afraid of risk assessment tools?”
“No,” Berk said, without skipping a beat. “You gotta tell me a lot more about yourself. … At what age were you first arrested? What is the date of your most recent crime? What are you charged with?”
Cobb interjected: “And what’s the culture of policing in the neighborhood in which I grew up in?”
(emphasis mine)
That's exactly the point (and to Michael -- this is what I was arguing about with the guy from Comcast): a little bit of institutional racism has a triple effect:
a) institutional racism is amplified by the algorithm (a small signal can now dominate the model)
b) institutional racism is operationalized by the algorithm (it's far easier now to make impactful decisions based on bad data)
c) institutional racism is legitimized by the algorithm (so that everyone thinks "it's just data" and actively defends the algorithm's assumed objectivity, even when the racist results are staring you right in the face. This vigorous defense doesn't happen when a human is shown to be correlating their decisions perfectly with race.)
#75: On which projects should we engage?
How can we ensure fairness, accountability, and transparency for algorithmic decision-making?
How do we ensure privacy?
How do we avoid junk science?