
Supercomputer Architecture: The TeraFLOPS Race

Stephen Jenks
Scalable Parallel & Dist. Systems Lab
EECS Colloquium Feb. 16, 2005

Why Supercomputing?
• Some Problems Are Larger Than a Single Computer Can Process
  - Memory Space (>> 4-8 GB)
  - Computation Cost (O(n^3), for example)
  - More Iterations (100 years)
  - Data Sources (sensor processing)
• National Pride
• Technology Migrates to Consumers

Supercomputer Applications
• Weather Prediction
• Pollution Flow
• Fluid Dynamics
• Stress Analysis
• Protein Folding
• Chemistry Simulation
• Nuclear Simulation
• Equation Solving
• Code Breaking

How Fast Are Supercomputers?
• The Top Machines Can Perform Tens of Trillions of Floating-Point Operations per Second (TeraFLOPS)
• They Can Store Trillions of Data Items in RAM!
• Example: a 1 km grid over the USA (worked through in the sketch below)
  - 4000 x 2000 x 100 = 800 million grid points
  - If each point has 10 values, and each value takes 10 ops to compute => 80 billion ops per iteration
  - If we want 1-hour timesteps for 10 years, that is 87,600 iterations
  - More than 7 peta-ops total!
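The arithmetic behind that example is easy to make explicit. The following minimal C sketch uses exactly the assumed numbers from the bullets above (grid size, values per point, ops per value, timestep count) and reproduces the "more than 7 peta-ops" figure:

#include <stdio.h>

int main(void)
{
    double points    = 4000.0 * 2000.0 * 100.0;     /* 1 km grid over USA: 800 million points */
    double values    = 10.0;                        /* values per grid point (assumed) */
    double ops_each  = 10.0;                        /* ops per value per timestep (assumed) */
    double ops_iter  = points * values * ops_each;  /* ~80 billion ops per iteration */
    double iters     = 10.0 * 365.0 * 24.0;         /* 10 years of 1-hour timesteps = 87,600 */
    double total_ops = ops_iter * iters;            /* ~7e15 ops */

    printf("ops per iteration: %.3g\n", ops_iter);
    printf("iterations:        %.0f\n", iters);
    printf("total ops:         %.3g (%.2f peta-ops)\n", total_ops, total_ops / 1e15);
    return 0;
}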

How Fast is That?
• Cray-1 (1977)
  - 250 MFLOPS
  - 80 MHz
  - 1 MWord (64-bit)
• Original PC 8088 (1979)
  - 5 MHz
  - 1 MB RAM
• Modern PC (Pentium 4)
  - 3 GHz
  - 6 GFLOPS
  - 4 GB RAM

http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html
Lies, Damn Lies, and Statistics
• Manufacturers Claim Ideal Performance
  - 2 FP Units @ 3 GHz => 6 GFLOPS
  - Dependences mean we won't get that much!
• How Do We Know Real Performance? (a toy measurement sketch follows this list)
  - Top500.org Uses High-Performance LINPACK (HPL)
  - http://www.netlib.org/benchmark/hpl
  - Solves a Dense Set of Linear Equations
  - Lots of Communication and Parallelism
  - Not Necessarily Reflective of Target Apps
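The gap between peak and achieved performance is easy to see even on a desktop. The C sketch below is a toy, not HPL: it times a naive dense matrix multiply (the 512 x 512 size is an arbitrary assumption) and reports achieved GFLOPS, which can then be compared against the vendor's peak number.

/* Rough illustration of peak vs. achieved FLOPS: time a naive dense
   matrix multiply and report GFLOPS.  The real Top500 figure comes from
   the HPL LU solver at http://www.netlib.org/benchmark/hpl. */
#include <stdio.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    struct timespec t0, t1;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
        }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)          /* naive O(n^3) multiply */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 * (double)N * N * N;   /* one multiply + one add per k */
    printf("%.0f FLOPs in %.3f s => %.2f GFLOPS (compare to vendor peak)\n",
           flops, secs, flops / secs / 1e9);
    return 0;
}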

Who Makes Supercomputers?

Supercomputer Architectures
• All Have Some Parallelism; Most Have Several Types
  - Pipelining (overlapping execution of several instructions)
  - Shared Address Space Parallelism
  - Distributed Memory (Multicomputer)
  - Vector or SIMD
• Almost All Use the Single-Program, Multiple-Data (SPMD) Model (see the sketch after this list)
  - Same Program Runs on All CPUs
  - Unique Identifier Per Copy
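A minimal sketch of the SPMD model in MPI terms: every process runs this same program, and the rank returned by MPI_Comm_rank is the unique identifier each copy uses to pick its share of the work. The problem size below is invented for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID per copy */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of copies */

    int n = 1000000;                        /* hypothetical global problem size */
    int chunk = n / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? n : lo + chunk;

    printf("rank %d of %d handles elements [%d, %d)\n", rank, size, lo, hi);

    MPI_Finalize();
    return 0;
}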

Architecture Diagrams
[Two diagrams: a shared address space machine, with several CPUs attached to one shared memory, and a distributed memory machine, with each CPU having its own memory and NIC and all nodes joined by an interconnection network. Conceptual view only; real shared-memory machines have distributed memory.]
#1: IBM Blue Gene/L
• Prototype System with Only 32,768 CPUs
• Final System Will Have 4 Times That
• Each CPU Is 700 MHz
• Intended for Protein Folding and Massively Parallel Simulations
• Achieved 70.72 TFLOPS
• Networks:
  - 3D Toroidal Mesh (350 MB/s x 6 links per node)
  - Gigabit Ethernet for storage
  - Combining Tree for Global Operations (Reduce, etc.)
  - Barrier/interrupt network
Blue Gene/L Continued

[Figure of the Blue Gene/L system, from the Top500.org website]
#2: SGI Altix (NASA Columbia)
• 10,240 Itanium 2 Processors Grouped in Clusters of 512
  - 1.5 GHz, 6 MB Cache
  - Shared Memory Within Each 512-CPU Cluster
  - 20 TB Total Memory
• Runs Linux
• Networks
  - SGI NUMAlink (6.4 GB/s)
  - InfiniBand (10 Gb/s, 4 microsecond latency)
  - 10 Gigabit Ethernet
  - 1 Gigabit Ethernet
• 51.87 TFLOPS

Columbia Photo

[Photo of the Columbia system, from the NASA Ames Research Center website]
#3: Earth Simulator
• Was #1 for 3 Years, Until Nov. 2004
• 5,120 Processors
  - 640 Nodes with 8 Processors Each
  - 16 GB RAM per Node
  - NEC SX6 Vector Processors
• Full Crossbar Interconnect
  - Bidirectional 12.3 GB/s
  - 8 TB/s Total
• 35.86 TFLOPS
Earth Simulator Pictures

[Photos of a Processing Node and an Interconnect Node, from http://www.es.jamstec.go.jp/esc/eng/ES/hardware.html]
Beowulf Clusters
• Started as Networks of Low-Cost PCs
• Now Thousands of CPUs
  - Many single-processor
  - Some dual-processor or more
• Interconnection Network Is Key to Performance
  - Myrinet: 2 Gbps, 10 µs
  - InfiniBand: 10 Gbps, 5 µs
  - Quadrics: 9 Gbps, 4 µs
  - GigE: 1 Gbps, 40 µs

Top Clusters
Name/Org                   CPUs             Interconnect   Rpeak (GFLOPS)   Rmax (GFLOPS)
Barcelona MareNostrum*     2563 PPC970      Myrinet        31363            20530
LLNL Thunder               4096 Itanium2    Quadrics       22938            19940
LANL ASCI Q                8192 Alpha       Quadrics       20480            13880
VA Tech System X           2200 PPC970      InfiniBand     20240            12250

* See the Slashdot article today on the building of MareNostrum
From Top500.org Website
Top Machines Summary
[Bar chart comparing actual (Rmax) and peak (Rpeak) GFLOPS, on a 0-100,000 GFLOPS scale, for the top machines: Blue Gene/L, Columbia, Earth Simulator, Thunder, ASCI Q, System X, and Barcelona MareNostrum]
Cray X1 (Vector)
• Distributed Shared Memory Vector Multiprocessor
  - 4 CPUs per Node
  - 800 MHz, 16 ops/cycle
  - 16 Nodes per Cabinet
• 819 GFLOPS per Cabinet
• 512 GB RAM per Cabinet
  - Up to 64 Cabinets
• Modified 2D Torus Interconnect
http://www.cray.com/products/x1/specifications.html

Cray XD1 (Supercluster)
• Each Chassis
  - 12 Opterons
  - 2-way SMPs
  - 58 GFLOPS (Peak)
  - Virtex-II Pro FPGAs
  - RapidArray Interconnect
• Each Rack
  - 12 Chassis
  - RapidArray Interconnect
  - MPI Latency: 2.0 µs

http://www.cray.com/downloads/Cray_XD1_Datasheet.pdf

IBM Power Series
• 8 to 32 POWER4 or POWER5 CPUs
  - Multi-chip packages
  - Simultaneous Multithreading
• Multi-Gbps Interconnect Between Components
• Pictured: UCI's Earth System Modeling Facility, 88 CPUs
  - 7 x 8 CPUs
  - 1 x 32 CPUs

Trends
• What Are the Trends, Based on Current Machines?
• Commodity Processors
• Vector Machines Still Around
• Processors Moved Closer to Each Other
  - Nodes Composed of SMPs
  - From 2 to 512 CPUs Share Memory
• Interconnection Networks Getting Faster
  - But Not as Quickly as CPU Speed
• Machines Hot and Power Hungry
  - Exception: Blue Gene/L (1.2 MW)
Research Topics
• Programming Models
• Grid Computing
  - Combining resources / utility computing
• OptIPuter
  - High-Performance Computing, Storage, and Visualization Resources Connected by Fiber
  - WDM allows dedicated lambdas per application
  - UCSD (Larry Smarr, PI), UIC, USC, UCI

Shared Memory Programming Model
• Shared Memory Programming Looks Easy
  - Threads: POSIX, OpenMP, etc.
  - Implicit Parallelism (OpenMP):

    #pragma omp parallel for private (i,k)
    for (i = 0; i < nx; i++)
        for (k = 0; k < nz; k++) { /* front and back plates */
            ez[i][0][k] = 0.0; ez[i][1][k] = 0.0; …

• But Shared Resources Make Things Ugly (see the sketch after this list)
  - Shared Data => Locks
  - Memory Allocation => Hidden locks kill performance
  - Contention for Memory Regions
• So Many Shared-Memory Machines Are Programmed as if They Were Distributed
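As a concrete illustration of the "shared data => locks" point, here is a minimal OpenMP sketch (the array size and contents are made up) that sums an array in parallel. Updating one shared total naively from every thread would race and would need a critical section; OpenMP's reduction clause avoids the lock by giving each thread a private copy and combining them at the end.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 0.001 * i;

    double total = 0.0;
#pragma omp parallel for reduction(+:total)   /* no explicit lock needed */
    for (int i = 0; i < N; i++)
        total += x[i];   /* a plain shared "total += x[i]" would race
                            without a critical section */

    printf("sum = %f (max threads: %d)\n", total, omp_get_max_threads());
    return 0;
}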
Message Passing Programming Model
• Message Passing Interface (MPI)
  - High Performance, Relatively Simple
  - All Parallelism Managed by User
  - Explicit Send/Receive Operations:
MPI_Isend(&AR_INDEX(ex, 0, lowy, 0) /* lowest plane on node */,
          1 /* count */, XZ_PlaneType,
          neighbor_nodes[Y_DIMENSION][LOW_NEIGHBOR], TAG_EXXZ,
          MPI_COMM_WORLD, &requestArray[count++]);

MPI_Irecv(&AR_INDEX(ex, 0, 0, highz + 1) /* one past highz point */,
          1 /* count */, XY_PlaneType,
          neighbor_nodes[Z_DIMENSION][HIGH_NEIGHBOR] /* source */,
          TAG_EXXY, MPI_COMM_WORLD, &requestArray[count++]);
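For context, here is a self-contained, stripped-down sketch of the same nonblocking exchange pattern: a 1-D halo exchange in which each rank swaps one boundary value with its left and right neighbors. The field, plane datatypes, and neighbor tables of the code above are replaced with invented scalars for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* periodic neighbors */
    int right = (rank + 1) % size;

    double mine = (double)rank;             /* my boundary value */
    double from_left, from_right;
    MPI_Request req[4];

    MPI_Irecv(&from_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&from_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&mine,       1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&mine,       1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    printf("rank %d got %g from left, %g from right\n",
           rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}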

Debugging
• Parallel Debugging Is Mostly Awful
  - 10s or 100s of program states
  - GDB for threads is bad enough!
• Need a Way to Capture and Visualize Program State
  - Zero in on trouble spots
  - Deadlocks are common

Future Architecture Research
• IBM/Toshiba/Sony Cell Architecture
  - General-Purpose CPU with SMT
  - SIMD Units with Fast RAM
  - Said to Be Comparable to an Earth Simulator Node
• Stream Processors (& Media Processors)
• Quantum Computing
• Fault Tolerance
• Power Consumption Awareness
Conclusion
• Despite Our Home Computers Being Faster than Early Supercomputers:
  - Many supercomputers are still being built
  - Different architectures still abound
• Problem Sizes Keep Getting Larger
  - Finer meshes
  - More time steps
  - More precise calculations

