
Supercomputer Architecture: The TeraFLOPS Race

Stephen Jenks
Scalable Parallel & Dist. Systems Lab
EECS Colloquium Feb. 16, 2005

Why Supercomputing?
• Some Problems Are Larger Than a Single Computer Can Process
  - Memory Space (>> 4-8 GB)
  - Computation Cost (O(n^3), for example)
  - More Iterations (100 years)
  - Data Sources (sensor processing)
• National Pride
• Technology Migrates to Consumers

Supercomputer Applications
• Weather Prediction
• Pollution Flow
• Fluid Dynamics
• Stress Analysis
• Protein Folding
• Chemistry Simulation
• Nuclear Simulation
• Equation Solving
• Code Breaking

How Fast Are Supercomputers?
• The Top Machines Can Perform Tens of Trillions of Floating-Point Operations per Second (TeraFLOPS)
• They Can Store Trillions of Data Items in RAM!
• Example: a 1 km grid over the USA (worked through in the sketch below)
  - 4000 x 2000 x 100 = 800 million grid points
  - If each point has 10 values, and each value takes 10 ops to compute => 80 billion ops per iteration
  - If we want 1-hour timesteps for 10 years, that is 87,600 iterations
  - More than 7 peta-ops total!
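The arithmetic behind that example is easy to make explicit. The following minimal C sketch uses exactly the assumed numbers from the bullets above (grid size, values per point, ops per value, timestep count) and reproduces the "more than 7 peta-ops" figure:

#include <stdio.h>

int main(void)
{
    double points    = 4000.0 * 2000.0 * 100.0;     /* 1 km grid over USA: 800 million points */
    double values    = 10.0;                        /* values per grid point (assumed) */
    double ops_each  = 10.0;                        /* ops per value per timestep (assumed) */
    double ops_iter  = points * values * ops_each;  /* ~80 billion ops per iteration */
    double iters     = 10.0 * 365.0 * 24.0;         /* 10 years of 1-hour timesteps = 87,600 */
    double total_ops = ops_iter * iters;            /* ~7e15 ops */

    printf("ops per iteration: %.3g\n", ops_iter);
    printf("iterations:        %.0f\n", iters);
    printf("total ops:         %.3g (%.2f peta-ops)\n", total_ops, total_ops / 1e15);
    return 0;
}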

How Fast is That?
• Cray-1 (1977)
  - 250 MFLOPS
  - 80 MHz
  - 1 MWord (64-bit)
• Original PC 8088 (1979)
  - 5 MHz
  - 1 MB RAM
• Modern PC (Pentium 4)
  - 3 GHz
  - 6 GFLOPS
  - 4 GB RAM

http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html
Lies, Damn Lies, and Statistics
• Manufacturers Claim Ideal Performance
  - 2 FP Units @ 3 GHz => 6 GFLOPS
  - Dependences mean we won't get that much!
• How Do We Know Real Performance? (a toy measurement sketch follows this list)
  - Top500.org Uses High-Performance LINPACK (HPL)
  - http://www.netlib.org/benchmark/hpl
  - Solves a Dense Set of Linear Equations
  - Lots of Communication and Parallelism
  - Not Necessarily Reflective of Target Apps
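The gap between peak and achieved performance is easy to see even on a desktop. The C sketch below is a toy, not HPL: it times a naive dense matrix multiply (the 512 x 512 size is an arbitrary assumption) and reports achieved GFLOPS, which can then be compared against the vendor's peak number.

/* Rough illustration of peak vs. achieved FLOPS: time a naive dense
   matrix multiply and report GFLOPS.  The real Top500 figure comes from
   the HPL LU solver at http://www.netlib.org/benchmark/hpl. */
#include <stdio.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    struct timespec t0, t1;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
        }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)          /* naive O(n^3) multiply */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 2.0 * (double)N * N * N;   /* one multiply + one add per k */
    printf("%.0f FLOPs in %.3f s => %.2f GFLOPS (compare to vendor peak)\n",
           flops, secs, flops / secs / 1e9);
    return 0;
}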

Who Makes Supercomputers?

Supercomputer Architectures
• All Have Some Parallelism; Most Have Several Types
  - Pipelining (overlapping execution of several instructions)
  - Shared Address Space Parallelism
  - Distributed Memory (Multicomputer)
  - Vector or SIMD
• Almost All Use the Single-Program, Multiple-Data (SPMD) Model (see the sketch after this list)
  - Same Program Runs on All CPUs
  - Unique Identifier Per Copy
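A minimal sketch of the SPMD model in MPI terms: every process runs this same program, and the rank returned by MPI_Comm_rank is the unique identifier each copy uses to pick its share of the work. The problem size below is invented for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID per copy */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of copies */

    int n = 1000000;                        /* hypothetical global problem size */
    int chunk = n / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? n : lo + chunk;

    printf("rank %d of %d handles elements [%d, %d)\n", rank, size, lo, hi);

    MPI_Finalize();
    return 0;
}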

Architecture Diagrams
[Two diagrams: a shared address space machine, with several CPUs attached to one shared memory, and a distributed memory machine, with each CPU having its own memory and NIC and all nodes joined by an interconnection network. Conceptual view only; real shared-memory machines have distributed memory.]
#1: IBM Blue Gene/L
• Prototype System with Only 32,768 CPUs
• Final System Will Have 4 Times That
• Each CPU Is 700 MHz
• Intended for Protein Folding and Massively Parallel Simulations
• Achieved 70.72 TFLOPS
• Networks:
  - 3D Toroidal Mesh (350 MB/s x 6 links per node)
  - Gigabit Ethernet for storage
  - Combining Tree for Global Operations (Reduce, etc.)
  - Barrier/interrupt network
Blue Gene/L Continued

[Figure of the Blue Gene/L system, from the Top500.org website]
#2: SGI Altix (NASA Columbia)
• 10,240 Itanium 2 Processors Grouped in Clusters of 512
  - 1.5 GHz, 6 MB Cache
  - Shared Memory Within Each 512-CPU Cluster
  - 20 TB Total Memory
• Runs Linux
• Networks
  - SGI NUMAlink (6.4 GB/s)
  - InfiniBand (10 Gb/s, 4 microsecond latency)
  - 10 Gigabit Ethernet
  - 1 Gigabit Ethernet
• 51.87 TFLOPS

Columbia Photo

[Photo of the Columbia system, from the NASA Ames Research Center website]
#3: Earth Simulator
• Was #1 for 3 Years, Until Nov. 2004
• 5,120 Processors
  - 640 Nodes with 8 Processors Each
  - 16 GB RAM per Node
  - NEC SX6 Vector Processors
• Full Crossbar Interconnect
  - Bidirectional 12.3 GB/s
  - 8 TB/s Total
• 35.86 TFLOPS
Earth Simulator Pictures

[Photos of a Processing Node and an Interconnect Node, from http://www.es.jamstec.go.jp/esc/eng/ES/hardware.html]
Beowulf Clusters
• Started as Networks of Low-Cost PCs
• Now Thousands of CPUs
  - Many single-processor
  - Some dual-processor or more
• Interconnection Network Is Key to Performance
  - Myrinet: 2 Gbps, 10 µs
  - InfiniBand: 10 Gbps, 5 µs
  - Quadrics: 9 Gbps, 4 µs
  - GigE: 1 Gbps, 40 µs

Top Clusters
Name/Org                   CPUs             Interconnect   Rpeak (GFLOPS)   Rmax (GFLOPS)
Barcelona MareNostrum*     2563 PPC970      Myrinet        31363            20530
LLNL Thunder               4096 Itanium2    Quadrics       22938            19940
LANL ASCI Q                8192 Alpha       Quadrics       20480            13880
VA Tech System X           2200 PPC970      InfiniBand     20240            12250

* See the Slashdot article today on the building of MareNostrum
From Top500.org Website
Top Machines Summary
[Bar chart comparing actual (Rmax) and peak (Rpeak) GFLOPS, on a 0-100,000 GFLOPS scale, for the top machines: Blue Gene/L, Columbia, Earth Simulator, Thunder, ASCI Q, System X, and Barcelona MareNostrum]
Cray X1 (Vector)
• Distributed Shared Memory Vector Multiprocessor
  - 4 CPUs per Node
  - 800 MHz, 16 ops/cycle
  - 16 Nodes per Cabinet
• 819 GFLOPS per Cabinet
• 512 GB RAM per Cabinet
  - Up to 64 Cabinets
• Modified 2D Torus Interconnect
http://www.cray.com/products/x1/specifications.html

Cray XD1 (Supercluster)
• Each Chassis
  - 12 Opterons
  - 2-way SMPs
  - 58 GFLOPS (Peak)
  - Virtex-II Pro FPGAs
  - RapidArray Interconnect
• Each Rack
  - 12 Chassis
  - RapidArray Interconnect
  - MPI Latency: 2.0 µs

http://www.cray.com/downloads/Cray_XD1_Datasheet.pdf

IBM Power Series
• 8 to 32 POWER4 or POWER5 CPUs
  - Multi-chip packages
  - Simultaneous Multithreading
• Multi-Gbps Interconnect Between Components
• Pictured: UCI's Earth System Modeling Facility, 88 CPUs
  - 7 x 8 CPUs
  - 1 x 32 CPUs

Trends
• What Are the Trends, Based on Current Machines?
• Commodity Processors
• Vector Machines Still Around
• Processors Moved Closer to Each Other
  - Nodes Composed of SMPs
  - From 2 to 512 CPUs Share Memory
• Interconnection Networks Getting Faster
  - But Not as Quickly as CPU Speed
• Machines Hot and Power Hungry
  - Exception: Blue Gene/L (1.2 MW)
Research Topics
• Programming Models
• Grid Computing
  - Combining resources / utility computing
• OptIPuter
  - High-Performance Computing, Storage, and Visualization Resources Connected by Fiber
  - WDM allows dedicated lambdas per application
  - UCSD (Larry Smarr, PI), UIC, USC, UCI

Shared Memory Programming Model
• Shared Memory Programming Looks Easy
  - Threads: POSIX, OpenMP, etc.
  - Implicit Parallelism (OpenMP):

    #pragma omp parallel for private (i,k)
    for (i = 0; i < nx; i++)
        for (k = 0; k < nz; k++) { /* front and back plates */
            ez[i][0][k] = 0.0; ez[i][1][k] = 0.0; …

• But Shared Resources Make Things Ugly (see the sketch after this list)
  - Shared Data => Locks
  - Memory Allocation => Hidden locks kill performance
  - Contention for Memory Regions
• So Many Shared-Memory Machines Are Programmed as if They Were Distributed
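As a concrete illustration of the "shared data => locks" point, here is a minimal OpenMP sketch (the array size and contents are made up) that sums an array in parallel. Updating one shared total naively from every thread would race and would need a critical section; OpenMP's reduction clause avoids the lock by giving each thread a private copy and combining them at the end.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 0.001 * i;

    double total = 0.0;
#pragma omp parallel for reduction(+:total)   /* no explicit lock needed */
    for (int i = 0; i < N; i++)
        total += x[i];   /* a plain shared "total += x[i]" would race
                            without a critical section */

    printf("sum = %f (max threads: %d)\n", total, omp_get_max_threads());
    return 0;
}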
Message Passing Programming Model
• Message Passing Interface (MPI)
  - High Performance, Relatively Simple
  - All Parallelism Managed by User
  - Explicit Send/Receive Operations:
MPI_Isend(&AR_INDEX(ex, 0, lowy, 0) /* lowest plane on node */,
          1 /* count */, XZ_PlaneType,
          neighbor_nodes[Y_DIMENSION][LOW_NEIGHBOR], TAG_EXXZ,
          MPI_COMM_WORLD, &requestArray[count++]);

MPI_Irecv(&AR_INDEX(ex, 0, 0, highz + 1) /* one past highz point */,
          1 /* count */, XY_PlaneType,
          neighbor_nodes[Z_DIMENSION][HIGH_NEIGHBOR] /* source */,
          TAG_EXXY, MPI_COMM_WORLD, &requestArray[count++]);
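For context, here is a self-contained, stripped-down sketch of the same nonblocking exchange pattern: a 1-D halo exchange in which each rank swaps one boundary value with its left and right neighbors. The field, plane datatypes, and neighbor tables of the code above are replaced with invented scalars for illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* periodic neighbors */
    int right = (rank + 1) % size;

    double mine = (double)rank;             /* my boundary value */
    double from_left, from_right;
    MPI_Request req[4];

    MPI_Irecv(&from_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&from_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&mine,       1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&mine,       1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    printf("rank %d got %g from left, %g from right\n",
           rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}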

Debugging
• Parallel Debugging Is Mostly Awful
  - 10s or 100s of program states
  - GDB for threads is bad enough!
• Need a Way to Capture and Visualize Program State
  - Zero in on trouble spots
  - Deadlocks are common

Future Architecture Research
• IBM/Toshiba/Sony Cell Architecture
  - General-Purpose CPU with SMT
  - SIMD Units with Fast RAM
  - Said to Be Comparable to an Earth Simulator Node
• Stream Processors (& Media Processors)
• Quantum Computing
• Fault Tolerance
• Power Consumption Awareness
Conclusion
• Despite Our Home Computers Being Faster than Early Supercomputers:
  - Many supercomputers are still being built
  - Different architectures still abound
• Problem Sizes Keep Getting Larger
  - Finer meshes
  - More time steps
  - More precise calculations

