L1.0 HPC Overview

Reference Books

1) High Performance Computing: Modern Systems and Practices, Thomas Sterling, Matthew Anderson, Maciej Brodowicz, 2018, Elsevier
2) Introduction to High Performance Computing for Scientists and Engineers, Georg Hager, Gerhard Wellein, 2011, CRC Press
3) Next-Gen Computer Architecture: Till the End of Silicon, Smruti R. Sarangi, 2021, IIT Delhi
High Performance Computing: An Introduction

Contents:
□ Computational Science and Engineering
□ Basics of Computer Architecture:
  – What a normal programmer should know about it
□ High Performance Computers
  – What is this? What's on? Why should I care?
□ High Performance Applications
  – What can I expect? What are the tools? What should I learn?
□ Final Comments and Discussion


COMPUTATIONAL SCIENCE AND ENGINEERING

Computational Science and Engineering

□ In broad terms it is about using computers to analyze scientific problems.
□ Thus we distinguish it from computer science, which is the study of computers and computation, and from theory and experiment, the traditional forms of science.
□ Computational Science and Engineering seeks to gain understanding principally through the analysis of mathematical models on high performance computers.
Layered Structure of CSE

From: A SCIENCE-BASED CASE FOR LARGE-SCALE SIMULATION, DOE, 2003

BASICS OF COMPUTER ARCHITECTURE
Basics of Computer Architecture

□ Processors
□ Memory
□ Buses
□ I/O
□ Operating Systems
□ Performance Models
The main components of a computer system are:

· Processors
· Memory
· Communications Channels

These components of a computer architecture are often summarized in terms of a PMS diagram (P = "processors", M = "memory", S = "switches").
Processors
Fetch-Decode-Execute Cycle
The essential task of computer processors is to perform a Fetch-Decode-Execute cycle:
1. In the fetch phase the processor gets an instruction from memory; the address of the instruction is contained in an internal register called the Program Counter or PC.
2. While the instruction is being fetched from memory, the PC is incremented by one. Thus, in the next Fetch-Decode-Execute cycle the instruction will be fetched from the next sequential location in memory (unless the PC is changed by some other instruction in the interim).
3. In the decode phase of the cycle, the processor stores the information fetched from memory in an internal register called the Instruction Register or IR.
4. In the execution phase of the cycle, the processor carries out the instruction stored in the IR.
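As a rough illustration of this cycle, here is a minimal C sketch of a hypothetical accumulator machine; the opcodes, registers, and program contents are invented for illustration and are not taken from the slides.

    #include <stdio.h>

    /* Hypothetical 3-instruction machine: the opcodes are illustrative only. */
    enum { OP_HALT = 0, OP_ADD = 1, OP_LOAD = 2 };

    typedef struct { int opcode, operand; } Instr;

    int main(void) {
        Instr memory[] = { {OP_LOAD, 5}, {OP_ADD, 7}, {OP_ADD, 1}, {OP_HALT, 0} };
        int pc = 0;          /* Program Counter: address of the next instruction   */
        Instr ir;            /* Instruction Register: holds the fetched instruction */
        int acc = 0;         /* accumulator for arithmetic results                  */

        for (;;) {
            ir = memory[pc];              /* fetch: read the instruction addressed by the PC */
            pc = pc + 1;                  /* PC is incremented while the fetch completes     */
            switch (ir.opcode) {          /* decode + execute                                */
                case OP_LOAD: acc = ir.operand;  break;
                case OP_ADD:  acc += ir.operand; break;
                case OP_HALT: printf("acc = %d\n", acc); return 0;
            }
        }
    }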
Classification of Processor Instructions
Instructions for the processor may be classified
into three major types:
1. Arithmetic/Logic instructions apply
primitive functions to one or two arguments;
an example is the addition of two numbers.
2. Data Transfer instructions move data
from one location to another, for example,
from an internal processor register to a
location in the main memory.
3. Control instructions modify the order in which instructions are executed, for example, in loops or logical decisions.
Clock Cycles
Operations within a processor are controlled by an
external clock (a circuit generating a square wave of fixed
period).

The quantum unit of time is a clock cycle. The clock frequency (e.g., 3 GHz would be a common clock frequency for a modern workstation) is one measure of how fast a computer is, but the length of time to carry out an operation depends not only on how fast the processor cycles, but also on how many cycles are required to perform a given operation.
Computer Memory

Memory Classifications

Computers have hierarchies of memories that may be classified according to
· Function
· Capacity
· Response Times
Memory Function

"Reads" transfer information from the memory; "Writes" transfer information to the memory:
· Random Access Memory (RAM) performs both reads and writes.
· Read-Only Memory (ROM) contains information stored at the time of manufacture that can only be read.
Memory Capacity

bit = b
byte = 8 bits, abbreviated "B"
Common prefixes: k = kilo = 10^3, M = mega = 10^6, G = giga = 10^9, T = tera = 10^12; then P (peta), E (exa)

Memory Response

Memory response is characterized by two different measures:
· Access Time (also termed response time or latency) defines how quickly the memory can respond to a read or write request.
· Memory Cycle Time refers to the minimum period between two successive requests to the memory.

For memory chips in small personal computers the access time is about 10 ns or less.
Locality of Reference and Memory Hierarchies

In practice, processors tend to access memory in a patterned way. For example, in the absence of logical branches, the Program Counter is incremented by one after each instruction. Thus, if memory location x is accessed at time t, there is a high probability that the processor will request an instruction from memory location x+1 in the near future. This clustering of memory references into groups is termed Locality of Reference.

Locality of reference can be exploited by implementing the memory as a hierarchy of memories, with each level of the hierarchy having characteristic access times and capacity.
Memory Hierarchy

(Figure: hierarchy levels range from fast, expensive memory to slow, cheap memory.)

Effective Access Time

The performance of a hierarchical memory is characterized by an Effective Access Time. If T = effective access time, H = cache hit rate, T(cache) = cache access time, and T(main) = main memory access time,

T = H*T(cache) + (1-H)*T(main)

For example, if the hit rate is 98% (not uncommon on modern computers), cache speed is 10 ns, and main memory has a speed of 100 ns,

T = 0.98*10 ns + 0.02*100 ns = 11.8 ns

The memory behaves as if it were composed entirely of fast chips with 11.8 ns access time, even though it is composed mostly of cheap 100 ns chips!
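A tiny C sketch that reproduces this calculation; the 98% / 10 ns / 100 ns numbers are the example values from the slide.

    #include <stdio.h>

    /* Effective access time of a two-level memory: T = H*T(cache) + (1-H)*T(main). */
    static double effective_access_ns(double hit_rate, double t_cache_ns, double t_main_ns) {
        return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_main_ns;
    }

    int main(void) {
        /* The slide's example: 98% hit rate, 10 ns cache, 100 ns main memory. */
        printf("T = %.1f ns\n", effective_access_ns(0.98, 10.0, 100.0));  /* 11.8 ns */
        return 0;
    }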
Operating Systems

Most modern computers are Multitasking: they run several Processes or Tasks at the same time. The most common operating system for workstations and high-performance computers is Unix.

· Active Processes are being executed by the Processing Unit(s).
· Idle Processes are waiting to execute.
· Blocked Processes are waiting for some external event (e.g., the reading of data from a file). When this is accomplished, they become idle, waiting their turn for execution.
Performance Models

Measures of Machine Performance

The clock cycle time is a simple, but rather inadequate, measure of the performance of a modern computer:
· Processors must act in conjunction with memories and buses.
· The efficiency of executing instructions in each clock cycle can vary widely.

The basic performance of a single-processor computer system can be expressed as

T = n x CPI x t

where T is the time to execute, n is the number of instructions executed, CPI is the number of cycles per instruction, and t is the time per clock cycle.
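For instance, a short C sketch that evaluates the formula; the inputs (2×10^9 instructions, CPI of 1.5, a 3 GHz clock) are made up purely for illustration.

    #include <stdio.h>

    /* Execution time of a single-processor run: T = n * CPI * t.
     * The numbers below are illustrative inputs, not measurements. */
    int main(void) {
        double n   = 2.0e9;        /* instructions executed              */
        double cpi = 1.5;          /* average cycles per instruction     */
        double t   = 1.0 / 3.0e9;  /* seconds per cycle at a 3 GHz clock */
        printf("T = %.3f s\n", n * cpi * t);   /* 2e9 * 1.5 / 3e9 = 1.000 s */
        return 0;
    }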
RISC vs. CISC Architectures

RISC (Reduced Instruction Set Computer): implements a few very simple instructions.
CISC (Complex Instruction Set Computer): implements a larger instruction set that does more complicated things.

In recent years, the RISC architecture has proven a better match with modern developments in VLSI (Very Large Scale Integration) chip manufacture:

· Simple instructions allow powerful implementation techniques such as pipelining.
· Simple instructions leave more room on the chip: on-board cache, multiple arithmetic units, etc.
· Simple instruction sets mean cycle times can often be much faster.
· Simple instruction sets need less logic and less space on the chip, and smaller circuits run faster and cooler.
Overlapping Instructions: Pipelining
Some Performance Metrics

MIPS = "Millions of instructions per second"

MFLOP/s = "Millions of floating-point operations per second"; larger units: GigaFlop/s, TeraFlop/s, ExaFlop/s ...

Theoretical Peak MFLOP/s = the MFLOP/s the machine would deliver if it did nothing but numerical operations

Benchmarks = programs designed to determine performance metrics for machines.

Examples: HPL, the NAS Parallel Benchmarks (NPB), LINPACK
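As an illustration of how such a metric is obtained, here is a toy C sketch that times a simple floating-point loop and reports an approximate MFLOP/s figure. This is only a rough sketch, not HPL or the NAS Parallel Benchmarks; the array size and repetition count are arbitrary.

    #include <stdio.h>
    #include <time.h>

    /* A rough, illustrative MFLOP/s measurement: time a simple floating-point loop. */
    int main(void) {
        enum { N = 5000000, REPS = 10 };
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t start = clock();
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                a[i] = b[i] * c[i] + b[i];      /* 2 floating-point ops per iteration */
        double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

        double mflops = 2.0 * N * REPS / seconds / 1.0e6;
        printf("approx. %.1f MFLOP/s (a[0] = %g)\n", mflops, a[0]);
        return 0;
    }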


Parallel Processing

The basic performance of a single-processor computer system can be expressed as

T = n * CPI * t

where T is the time to execute, n is the number of instructions executed, CPI is the number of cycles per instruction, and t is the time per clock cycle. Decreasing the clock time t is a matter of engineering: generally, smaller, faster circuits lead to better clock speed. Decreasing the other two factors involves some version of parallelism. There are several levels of parallelism:

1. Job-Level Parallelism: The computer center purchases more computers so more jobs can be run in a given period.
2. Program-Level Parallelism: A single program is broken into constituent parts, and different processors compute each part.
3. Instruction-Level Parallelism: Techniques such as pipelining allow more throughput by the execution of overlapping instructions.
4. Arithmetic and Bit-Level Parallelism: Low-level parallelism primarily of interest to designers of the arithmetic logic units; relatively invisible to the user.
Granularity of Tasks

Parallel operations may be classified according to the size of the operations running in parallel.

Large-Grain System: operations running in parallel are large (of the order of program size).

Small-Grain System: operations running in parallel are small (of the order of a few instructions).
Resource Conflicts and Dependencies

Overlapping operations in a pipeline requires that the operations be independent of each other. There are various ways in which this condition may be violated. Such Resource Conflicts or Dependencies inhibit pipelining efficiency. For example, suppose a code implements the instructions

R2 = R0 + R1
R4 = R2 + R1

This is an example of a data dependency: the processor cannot send the second pair of operands to the pipelined adder until the result of the first addition has exited the pipeline (because only then will the correct value of R2 be known). Such dependencies lead to periods when the pipeline stages are empty, termed bubbles.

Instruction pipelines are used to speed up the fetch-decode-execute cycle. The pipeline constantly exploits locality of reference by "looking ahead" and fetching instructions it thinks the processor will soon need. If a branch or loop instruction in the program invalidates this look-ahead, a bubble appears in the instruction pipeline while the fetch stage goes to look for the new instructions. This is called a control dependency.
Memory Organization

In high-performance computing, it is important to match the (generally slower) memory accesses as well as possible with the (generally faster) processor cycling. This is particularly true for pipelined units that derive their efficiency from a constant supply of fresh operands for the pipeline. The primary difficulty is the memory cycle time, during which the memory is not accessible by subsequent operations.

For parallel systems there are two general memory designs:
· Shared Memory Systems, in which there is one large virtual memory to which all processors have equivalent access.
· Distributed Memory Systems, in which each processor has its own local memory, not directly accessible from other processors.
Interconnect Topologies for Parallel Systems

A major consideration for parallel systems is the manner in which the processors, memories, and switches communicate with each other. The connections among these define the topology of the machine.

Examples:
· Ring vs. Fully Connected
· Hypercube Networks
· Tree and Star Topologies
· Mesh Topologies
Basic Types of Parallel Architectures

(Figure: a shared memory machine — processors P0, P1, P2, ..., PN all connected to a single shared Memory.)

Flynn's Taxonomy of Parallel Architectures

http://csep1.phy.ornl.gov/csep.html
Vector Supercomputers

HIGH PERFORMANCE COMPUTERS

High Performance Computers or Supercomputers

"Supercomputers are the fastest and most powerful general purpose scientific computing systems available at any given time."
Dongarra et al., "Numerical Linear Algebra for High-Performance Computers", SIAM, 1998

Turing's Bombe, UK, 1941

Cray XT5 at Oak Ridge, USA, 2009
2.3 Petaflops, 224K AMD cores
The TOP500 List
www.top500.org

□ The main objective of TOP500 is to provide a ranked list of general purpose systems that are in common use for high end applications.
□ It is based on the LINPACK Benchmark, which solves a dense system of linear equations by LU factorization.
□ A parallel implementation of the LINPACK benchmark and instructions on how to run it can be found at http://www.netlib.org/benchmark/hpl/
□ TOP500 uses the benchmark version that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine.
LINPACK's Evolution

LINPACK 100 (serial)
LINPACK 1000
High Performance LINPACK n×n (HPL)

Other packages for dense linear algebra (www.netlib.org): LAPACK and ScaLAPACK
LINPACK's Building Blocks

□ BLAS (Basic Linear Algebra Subprograms): http://www.netlib.org/blas
  – Level 1: vector-vector operations; y = y + ax
  – Level 2: matrix-vector operations; y = y + Ax
  – Level 3: matrix-matrix operations; C = C + A*B
□ Highly optimized BLAS (see the Level 1 sketch below):
  – ATLAS (free optimized BLAS generator); http://www.netlib.org/atlas
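As a sketch of what a Level 1 BLAS call looks like from C, assuming a CBLAS-providing library such as ATLAS or OpenBLAS is installed and linked (e.g., with -lcblas):

    #include <stdio.h>
    #include <cblas.h>   /* C interface to BLAS; provided by ATLAS, OpenBLAS, etc. */

    int main(void) {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {1.0, 1.0, 1.0, 1.0};

        /* Level 1 BLAS: y = y + a*x (daxpy), with a = 2.0 and unit strides. */
        cblas_daxpy(4, 2.0, x, 1, y, 1);

        for (int i = 0; i < 4; i++)
            printf("y[%d] = %g\n", i, y[i]);   /* 3 5 7 9 */
        return 0;
    }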
HPL Benchmark Highlights

□ Extracts the MAXIMUM sustained performance of a given system
□ Results listed in TOP500:
  – Rmax = performance in Gflop/s for the biggest problem run
  – Nmax = size of the biggest problem run
  – N1/2 = problem size at which half of Rmax is sustained
  – Rpeak = theoretical peak performance
□ Factors affecting HPL performance:
  – Implementation; human effort; operating system; hardware; network; compiler; BLAS; etc.
The Power Wall

#1 on TOP500: Jaguar, 7 MW, MFLOPS/W = 251

Source: J. Dongarra, http://www.netlib.org/utk/people/JackDongarra/talks.html
Moore's Law Revisited

Clock frequency scaling has been replaced by scaling the number of cores per chip.
Source: J. Dongarra, http://www.netlib.org/utk/people/JackDongarra/talks.html

http://www.lanl.gov/roadrunner/

Coupling refers to the degree of direct knowledge that one element has of another.
Source: J. Dongarra, http://www.netlib.org/utk/people/JackDongarra/talks.html
Current HPC Systems – Exascale Computing

HIGH PERFORMANCE APPLICATIONS
Taxonomy of Parallel Applications

Few or no communication:            Embarrassingly Parallel (EP)
Explicit (neighbor) communication:  Explicit Structured (ES)  |  Explicit Unstructured (EU)
Implicit (global) communication:    Implicit Structured (IS)  |  Implicit Unstructured (IU)
(columns: structured communication vs. unstructured communication)

H. D. Simon. High Performance Computing: Architecture, Software, Algorithms. Technical Report RNR-93-018, NASA Ames Research Center, Moffett Field, CA 94035, December 1993.
Measuring Parallel Performance

□ T1 – serial execution time (1 processor)
□ TP – parallel execution time on p processors
□ Speed-up ➔ SP = T1/TP
□ Efficiency ➔ EP = T1/(p·TP)
□ Therefore: EP = SP/p, SP ≤ p, EP ≤ 1
□ Note: anomalies may happen as p increases due to other resources (e.g., cache) – superlinear speed-up
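A minimal C sketch that computes SP and EP from measured serial and parallel times; the times and processor count below are made-up illustrations.

    #include <stdio.h>

    /* Speed-up and efficiency: SP = T1/TP, EP = T1/(p*TP). */
    int main(void) {
        double t1 = 100.0;      /* serial time, seconds (illustrative) */
        double tp = 15.0;       /* time on p processors (illustrative) */
        int    p  = 8;

        double speedup    = t1 / tp;            /* 6.67 */
        double efficiency = t1 / (p * tp);      /* 0.83 */
        printf("SP = %.2f, EP = %.2f\n", speedup, efficiency);
        return 0;
    }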
Amdahl's Law (1967)

□ Serial fraction: s, 0 ≤ s ≤ 1
□ Parallel fraction on p processors: 1 - s
□ Then: SP = 1 / (s + (1-s)/p)
□ Corollary: SP ≤ 1/s, no matter how many processors are used
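A short C sketch that tabulates Amdahl speed-ups for an assumed 5% serial fraction, showing the saturation at 1/s:

    #include <stdio.h>

    /* Amdahl's law: SP = 1 / (s + (1-s)/p).  With a 5% serial fraction the
     * speed-up saturates at 1/s = 20 no matter how many processors are used. */
    int main(void) {
        double s = 0.05;                           /* serial fraction (assumed) */
        for (int p = 1; p <= 1024; p *= 4) {
            double sp = 1.0 / (s + (1.0 - s) / p);
            printf("p = %4d   SP = %6.2f\n", p, sp);
        }
        return 0;
    }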
Scalability

□ Scalability refers to how efficiently a given algorithm can use additional processors.
□ An algorithm is scalable if, as p increases, its efficiency remains constant while the problem size increases.
Code Optimization and Programming

Before even thinking of parallelizing a code:
□ Optimize your code for a given class of processors
  – This is what reduces CPU time
□ Use all optimization TOOLS existing in the compiler and in the system
□ Always verify that the code is working properly
□ Always use standard libraries such as BLAS, LAPACK, etc.
Basic Parallel Programming Models

□ Distributed Memory Machines
  – Message Passing: send/receive
□ Message Passing Library
  – Message Passing Interface http://www.mpi.com

    call MPI_Send( ... )
    call MPI_Recv( ... )
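A minimal C sketch of the same send/receive pattern; the payload, tag, and ranks are arbitrary illustrations.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal MPI send/receive: rank 0 sends one double to rank 1.
     * Compile with mpicc and run with at least 2 processes (e.g., mpirun -np 2 ./a.out). */
    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double value = 3.14;              /* arbitrary payload */
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %g\n", value);
        }

        MPI_Finalize();
        return 0;
    }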
Basic Parallel Programming Models

□ Threaded machines: compiler directives

  OpenMP (http://www.openmp.org):
    !$OMP PARALLEL DO PRIVATE (J)
    DO J=1,M
      ...
    ENDDO

  OpenCL: http://www.khronos.org/opencl/

  CUDA (http://www.nvidia.com/object/cuda_home.html#/):
    // send data from host to device: a_h to a_d
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
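The Fortran OpenMP directive above has a direct C counterpart; a minimal sketch assuming an OpenMP-capable compiler (e.g., gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    /* C counterpart of the OpenMP loop directive above: iterations of the
     * loop are distributed across the available threads. */
    int main(void) {
        enum { M = 1000000 };
        static double a[M];

        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            a[j] = 2.0 * j;               /* independent iterations, safe to parallelize */

        printf("threads available: %d, a[M-1] = %g\n",
               omp_get_max_threads(), a[M - 1]);
        return 0;
    }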
Development Tools

Numerical Libraries
– Netlib: http://www.netlib.org
– ACTS (Advanced Computational Testing and Simulation) Toolkit: http://acts.nersc.gov/
  • PETSc (Portable, Extensible Toolkit for Scientific Computation)
  • ScaLAPACK extends LAPACK's high-performance linear algebra software to distributed memory machines
Why computer simulation?

From: A SCIENCE-BASED CASE FOR LARGE-SCALE SIMULATION, DOE, 2003
Final Remarks
□ Computational Engineering and Science
changed the way we view engineering
□ There is no general approach
□ Integrated approach: HPC, Visualization,
Storage and Communications
□ Challenges:
– Managing complexity: programming models, data structures
and computer architecture ➔ performance
– Understanding the results of a computation: visualization,
data integration, knowledge extraction
– Collaboration: grid, web, data security
