Science in The Clouds: History, Challenges, and Opportunities
http://www.cse.nd.edu/~ccl
[Figure: cost over time, comparing the capital expense of ownership, the operational expense of ownership, and the operational expense of cloud computing (marked 2X).]
Clouds vs Grids
Grids provide a job execution interface:
Run program P on input A and return the output. This lets the system maximize utilization and hide failures, but it provides few performance guarantees and inaccurate metering.
[Diagram: the user submits 1M jobs; the grid computing layer provides job execution, dispatching the jobs and managing load across the underlying resources, which run the 1M jobs.]
Clusters, clouds, and grids give us access to unlimited CPUs. How do we write programs that can run effectively in large systems?
MapReduce ( S, M, R )
[Diagram: the map function M turns each item of set S into (key, value) pairs; the pairs are grouped by key (Key0 ... KeyN), and the reduce function R produces one output (O0, O1, O2, ...) per key.]
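As a rough illustration, here is a minimal serial sketch of the MapReduce abstraction in Python, assuming the map function M emits (key, value) pairs for one item and the reduce function R combines the values collected for one key; the word-count usage at the end is illustrative, not an example from the talk.

from collections import defaultdict

def mapreduce(S, M, R):
    """Apply map function M to every item in S, group the resulting
    (key, value) pairs by key, then apply reduce function R per key."""
    groups = defaultdict(list)
    for item in S:
        for key, value in M(item):          # M emits (key, value) pairs
            groups[key].append(value)
    return {key: R(key, values) for key, values in groups.items()}

# Illustrative usage: word count over a small set of lines.
lines = ["the quick brown fox", "the lazy dog"]
counts = mapreduce(
    lines,
    M=lambda line: [(word, 1) for word in line.split()],
    R=lambda word, ones: sum(ones),
)
print(counts)   # {'the': 2, 'quick': 1, ...}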
[Figure: matrix of pairwise comparison scores F for example iris images (values such as 0.05, 0.8, 0.97, 1.0).]
Challenge workload: 60,000 iris images of 1 MB each, with 0.02 s per evaluation of F. Comparing all pairs means 60,000 x 60,000 = 3.6 billion calls to F: roughly 833 CPU-days of computation and 600 TB of I/O.
I have 60,000 iris images acquired in my research lab. I want to reduce each one to a feature space, and then compare all of them to each other. I want to spend my time doing science, not struggling with computers.
I own a few machines. I have a laptop. I can buy time from Amazon or TeraGrid.
Now What?
Try 3: Bundle all files into one package. Failure: everyone loads 1 GB at once.
Observation
In a given field of study, many people repeat the same pattern of work many times, making slight changes to the data and algorithms. If the system knows the overall pattern in advance, then it can do a better job of executing it reliably and efficiently. If the user knows in advance what patterns are allowed, then they have a better idea of how to construct their workloads.
AllPairs( A, B, F )
[Diagram: the AllPairs abstraction runs the work on a cloud or grid.]
All-Pairs Abstraction
AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i,j
[Diagram: F is evaluated on every pairing of A1..An with B1..Bn; the user invokes the abstraction as: allpairs A B F.exe]
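A minimal serial sketch of the All-Pairs semantics defined above, in Python; it only shows what is computed, while the production system partitions and dispatches this work across a cloud or grid. The toy comparison function stands in for the real F.exe.

def allpairs(A, B, F):
    """Return matrix M with M[i][j] = F(A[i], B[j]) for all i, j."""
    return [[F(a, b) for b in B] for a in A]

# Illustrative usage with a toy similarity function in place of the
# real iris-comparison F.exe:
A = [1.0, 2.0, 3.0]
B = [1.0, 4.0]
M = allpairs(A, B, lambda x, y: abs(x - y))
print(M)   # [[0.0, 3.0], [1.0, 2.0], [2.0, 1.0]]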
All of these tasks are nearly impossible for arbitrary workloads, but are tractable (not trivial) to solve for a specific abstraction.
[Figure: resources consumed by All-Pairs runs.]
All-Pairs in Production
Our All-Pairs implementation has provided over 57 CPU-years of computation to the ND biometrics research group over the last year. Largest run so far: 58,396 irises from the Face Recognition Grand Challenge, the largest experiment ever run on publicly available data. Competing biometric research relies on samples of 100-1000 images, which can miss important population effects. All-Pairs reduced computation time from 833 days to 10 days, making it feasible to repeat the experiment multiple times for a graduate thesis. (We can go faster yet.)
Wavefront( matrix M, function F(x,y,d) ) returns matrix M such that M[i,j] = F( M[i-1,j], M[i,j-1], M[i-1,j-1] )
[Diagram: Wavefront(M,F) starts from the boundary cells M[0,*] and M[*,0]; each interior cell such as M[3,2] or M[4,3] is produced by F from its three neighbors, labeled x, y, and d (diagonal), so results sweep across the matrix as a diagonal wavefront.]
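A minimal serial sketch of the Wavefront recurrence in Python, assuming the first row and first column of M hold the boundary values; the distributed implementation evaluates the cells of each diagonal in parallel as their neighbors complete.

def wavefront(M, F):
    """Fill M in place: for all i, j >= 1,
    M[i][j] = F(M[i-1][j], M[i][j-1], M[i-1][j-1])."""
    for i in range(1, len(M)):
        for j in range(1, len(M[0])):
            M[i][j] = F(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M

# Illustrative usage: a 5x5 matrix whose boundary row and column are given.
N = 5
M = [[0] * N for _ in range(N)]
M[0] = list(range(N))            # boundary row M[0][*]
for i in range(N):
    M[i][0] = i                  # boundary column M[*][0]
print(wavefront(M, lambda x, y, d: max(x, y, d) + 1))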
Applications of Wavefront
Bioinformatics:
Compute the alignment of two large DNA strings in order to find similarities between species. Existing tools do not scale up to complete DNA strings.
Economics:
Simulate the interaction between two competing firms, each of which has an effect on resource consumption and market price. E.g. When will we run out of oil?
[Diagram: the master drives a lightweight worker process with a simple protocol: put F.exe; put in.txt; exec F.exe <in.txt >out.txt; get out.txt.]
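A minimal local sketch of the worker side of that protocol, in Python; the real worker receives these messages over a network connection from the master, while this sketch simply simulates the exchange from the slide on a POSIX system (the chmod step is an added assumption so the transferred script is executable).

import subprocess

def handle(command, payloads):
    """Handle one protocol message: 'put NAME' writes a transferred file,
    'exec CMD' runs a shell command, 'get NAME' returns a file's bytes.
    payloads maps file names to the bytes sent along with 'put'."""
    verb, _, rest = command.partition(" ")
    if verb == "put":
        with open(rest, "wb") as f:
            f.write(payloads[rest])
    elif verb == "exec":
        subprocess.run(rest, shell=True, check=True)
    elif verb == "get":
        with open(rest, "rb") as f:
            return f.read()

# The exchange from the slide, simulated locally with a trivial F.exe.
payloads = {"F.exe": b"#!/bin/sh\ncat\n", "in.txt": b"hello\n"}
for msg in ["put F.exe", "put in.txt",
            "exec chmod +x F.exe && ./F.exe <in.txt >out.txt",
            "get out.txt"]:
    result = handle(msg, payloads)
print(result)    # b'hello\n'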
Any delayed task in Wavefront has a cascading effect on the rest of the workload. Solution - Fast Abort: keep statistics on task runtimes, abort tasks whose runtimes lie significantly outside the mean, and prefer to assign jobs to machines with a history of fast completions.
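A minimal sketch of the Fast Abort heuristic in Python, keeping statistics on completed task runtimes and flagging running tasks whose elapsed time lies well above the mean; the threshold of three standard deviations is an assumption, not a number from the talk.

import statistics

def tasks_to_abort(completed_runtimes, running_elapsed, k=3.0):
    """Return ids of running tasks whose elapsed time is more than
    k standard deviations above the mean of completed runtimes."""
    if len(completed_runtimes) < 2:
        return []                         # not enough history yet
    mean = statistics.mean(completed_runtimes)
    stdev = statistics.stdev(completed_runtimes)
    limit = mean + k * stdev
    return [task for task, elapsed in running_elapsed.items()
            if elapsed > limit]

# Illustrative usage: one straggler among otherwise uniform tasks.
history = [10.1, 9.8, 10.3, 10.0, 9.9]
running = {"task7": 10.5, "task8": 45.0}
print(tasks_to_abort(history, running))   # ['task8']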
Computational Assembly
[Table: sample genomes (A. gambiae scaffold, A. gambiae complete, S. bicolor simulated) with the number of reads and data size for each; only partial values survive here, e.g. 101K reads and 80 MB of data.]
Some-Pairs Abstraction
SomePairs( set A, list (i,j), function F(x,y) ) returns list of F( A[i], A[j] )
[Diagram: given the index list (1,2) (2,1) (2,3) (3,3), F is evaluated only on those pairings of elements of A1..An.]
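A minimal serial sketch of the Some-Pairs semantics defined above, in Python; the index list mirrors the slide (shifted to 0-based indices) and the toy comparison function is illustrative.

def somepairs(A, pairs, F):
    """Return [F(A[i], A[j]) for each (i, j) in pairs]."""
    return [F(A[i], A[j]) for i, j in pairs]

# Illustrative usage with a toy comparison function:
A = ["a", "bb", "ccc"]
pairs = [(0, 1), (1, 0), (1, 2), (2, 2)]
print(somepairs(A, pairs, lambda x, y: len(x) + len(y)))   # [3, 3, 5, 6]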
[Diagram: 100s of workers dispatched to Notre Dame, Purdue, and Wisconsin, each exchanging in.txt and out.txt with the master.]
Makeflow
part1 part2 part3: input.data split.py
        ./split.py input.data

out1: part1 mysim.exe
        ./mysim.exe part1 >out1

out2: part2 mysim.exe
        ./mysim.exe part2 >out2

out3: part3 mysim.exe
        ./mysim.exe part3 >out3

result: out1 out2 out3 join.py
        ./join.py out1 out2 out3 > result
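These rules form a dependency graph: each rule names its output files, its input files, and the command that produces the outputs. Below is a minimal Python sketch that runs such rules serially in dependency order; the real Makeflow dispatches ready rules to clusters, clouds, and grids, and the rules here are written inline rather than parsed from a Makeflow file.

import subprocess

# Each rule mirrors the Makeflow above: (outputs, inputs, command).
rules = [
    (["part1", "part2", "part3"], ["input.data", "split.py"],
     "./split.py input.data"),
    (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 >out1"),
    (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 >out2"),
    (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 >out3"),
    (["result"], ["out1", "out2", "out3", "join.py"],
     "./join.py out1 out2 out3 > result"),
]

def run_dag(rules, initial_files):
    """Repeatedly run every rule whose inputs are all available,
    adding its outputs to the available set, until all rules are done."""
    available, pending = set(initial_files), list(rules)
    while pending:
        ready = [r for r in pending if set(r[1]) <= available]
        if not ready:
            raise RuntimeError("unsatisfiable dependencies")
        for outputs, inputs, command in ready:
            subprocess.run(command, shell=True, check=True)
            available.update(outputs)
            pending.remove((outputs, inputs, command))

# Requires the actual input.data, split.py, mysim.exe, and join.py:
# run_dag(rules, ["input.data", "split.py", "mysim.exe", "join.py"])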
Makeflow Implementation
bfile: afile prog
        prog afile >bfile

[Diagram: this rule is dispatched to a pool of workers; afile and prog are sent to a worker, which produces bfile.]
Two optimizations: cache inputs and outputs at the workers, and dispatch tasks to nodes that already have the data.
Conclusion
Grids, clouds, and clusters provide enormous computing power, but are very challenging to use effectively. An abstraction provides a robust, scalable solution to a narrow category of problems; each abstraction requires different kinds of optimizations. Limiting expressive power results in systems that are usable, predictable, and reliable. Is there a menu of abstractions that would satisfy many consumers of clouds?
Acknowledgments
Cooperative Computing Lab
http://www.cse.nd.edu/~ccl
Faculty:
Patrick Flynn, Nitesh Chawla, Kenneth Judd, Scott Emrich
Grad Students:
Chris Moretti, Hoang Bui, Li Yu, Mike Olson, Michael Albrecht
Undergrads:
Mike Kelly, Rory Carmichael, Mark Pasquier, Christopher Lyon, Jared Bulosan