L1.0 HPC Overview

Reference Books

1) High Performance Computing: Modern Systems and Practices, Thomas Sterling, Matthew Anderson, Maciej Brodowicz, 2018, Elsevier
2) Introduction to High Performance Computing for Scientists and Engineers, Georg Hager, Gerhard Wellein, 2011, CRC Press
3) Next-Gen Computer Architecture: Till the End of Silicon, Smruti R. Sarangi, 2021, IIT Delhi
High Performance Computing: An Introduction

Contents:
□ Computational Science and Engineering
□ Basics of Computer Architecture:
  – What a normal programmer should know about it
□ High Performance Computers
  – What is this? What's on? Why should I care?
□ High Performance Applications
  – What can I expect? What are the tools? What should I learn?
□ Final Comments and Discussion


COMPUTATIONAL SCIENCE AND ENGINEERING

Computational Science and Engineering

□ In broad terms it is about using computers to analyze scientific problems.
□ Thus we distinguish it from computer science, which is the study of computers and computation, and from theory and experiment, the traditional forms of science.
□ Computational Science and Engineering seeks to gain understanding principally through the analysis of mathematical models on high performance computers.
Layered Structure of CSE

From: A SCIENCE-BASED CASE FOR LARGE-SCALE SIMULATION, DOE, 2003

BASICS OF COMPUTER ARCHITECTURE
Basics of Computer Architecture

□ Processors
□ Memory
□ Buses
□ I/O
□ Operating Systems
□ Performance Models
The main components of a computer system are:

· Processors
· Memory
· Communications Channels

These components of a computer architecture are often summarized in terms of a PMS diagram (P = "processors", M = "memory", S = "switches").
Processors
Fetch-Decode-Execute Cycle
The essential task of computer processors is to perform a Fetch-Decode-Execute cycle:
1. In the fetch phase the processor gets an instruction from memory; the address of the instruction is contained in an internal register called the Program Counter or PC.
2. While the instruction is being fetched from memory, the PC is incremented by one. Thus, in the next Fetch-Decode-Execute cycle the instruction will be fetched from the next sequential location in memory (unless the PC is changed by some other instruction in the interim).
3. In the decode phase of the cycle, the processor stores the information fetched from memory in an internal register called the Instruction Register or IR.
4. In the execution phase of the cycle, the processor carries out the instruction stored in the IR.
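As a rough illustration of this cycle, here is a minimal C sketch of a hypothetical accumulator machine; the opcodes, registers, and program contents are invented for illustration and are not taken from the slides.

    #include <stdio.h>

    /* Hypothetical 3-instruction machine: the opcodes are illustrative only. */
    enum { OP_HALT = 0, OP_ADD = 1, OP_LOAD = 2 };

    typedef struct { int opcode, operand; } Instr;

    int main(void) {
        Instr memory[] = { {OP_LOAD, 5}, {OP_ADD, 7}, {OP_ADD, 1}, {OP_HALT, 0} };
        int pc = 0;          /* Program Counter: address of the next instruction   */
        Instr ir;            /* Instruction Register: holds the fetched instruction */
        int acc = 0;         /* accumulator for arithmetic results                  */

        for (;;) {
            ir = memory[pc];              /* fetch: read the instruction addressed by the PC */
            pc = pc + 1;                  /* PC is incremented while the fetch completes     */
            switch (ir.opcode) {          /* decode + execute                                */
                case OP_LOAD: acc = ir.operand;  break;
                case OP_ADD:  acc += ir.operand; break;
                case OP_HALT: printf("acc = %d\n", acc); return 0;
            }
        }
    }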
Classification of Processor Instructions
Instructions for the processor may be classified
into three major types:
1. Arithmetic/Logic instructions apply
primitive functions to one or two arguments;
an example is the addition of two numbers.
2. Data Transfer instructions move data
from one location to another, for example,
from an internal processor register to a
location in the main memory.
3. Control instructions modify the order in which instructions are executed, for example, in loops or logical decisions.
Clock Cycles
Operations within a processor are controlled by an
external clock (a circuit generating a square wave of fixed
period).

The quantum unit of time is a clock cycle. The clock frequency (e.g., 3 GHz would be a common clock frequency for a modern workstation) is one measure of how fast a computer is, but the length of time to carry out an operation depends not only on how fast the processor cycles, but also on how many cycles are required to perform a given operation.
Computer Memory

Memory Classifications

Computers have hierarchies of memories that may be classified according to
· Function
· Capacity
· Response Times
Memory Function

"Reads" transfer information from the memory; "Writes" transfer information to the memory:
· Random Access Memory (RAM) performs both reads and writes.
· Read-Only Memory (ROM) contains information stored at the time of manufacture that can only be read.
Memory Capacity

bit = b
byte = 8 bits, abbreviated "B"
Common prefixes: k = kilo = 10^3, M = mega = 10^6, G = giga = 10^9, T = tera = 10^12; then P (peta), E (exa)

Memory Response

Memory response is characterized by two different measures:
· Access Time (also termed response time or latency) defines how quickly the memory can respond to a read or write request.
· Memory Cycle Time refers to the minimum period between two successive requests to the memory.

For memory chips in small personal computers the access time is about 10 ns or less.
Locality of Reference and Memory Hierarchies

In practice, processors tend to access memory in a patterned way. For example, in the absence of logical branches, the Program Counter is incremented by one after each instruction. Thus, if memory location x is accessed at time t, there is a high probability that the processor will request an instruction from memory location x+1 in the near future. This clustering of memory references into groups is termed Locality of Reference.

Locality of reference can be exploited by implementing the memory as a hierarchy of memories, with each level of the hierarchy having characteristic access times and capacity.
Memory Hierarchy

(Figure: hierarchy levels range from fast, expensive memory to slow, cheap memory.)

Effective Access Time

The performance of a hierarchical memory is characterized by an Effective Access Time. If T = effective access time, H = cache hit rate, T(cache) = cache access time, and T(main) = main memory access time,

T = H*T(cache) + (1-H)*T(main)

For example, if the hit rate is 98% (not uncommon on modern computers), cache speed is 10 ns, and main memory has a speed of 100 ns,

T = 0.98*10 ns + 0.02*100 ns = 11.8 ns

The memory behaves as if it were composed entirely of fast chips with 11.8 ns access time, even though it is composed mostly of cheap 100 ns chips!
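A tiny C sketch that reproduces this calculation; the 98% / 10 ns / 100 ns numbers are the example values from the slide.

    #include <stdio.h>

    /* Effective access time of a two-level memory: T = H*T(cache) + (1-H)*T(main). */
    static double effective_access_ns(double hit_rate, double t_cache_ns, double t_main_ns) {
        return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_main_ns;
    }

    int main(void) {
        /* The slide's example: 98% hit rate, 10 ns cache, 100 ns main memory. */
        printf("T = %.1f ns\n", effective_access_ns(0.98, 10.0, 100.0));  /* 11.8 ns */
        return 0;
    }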
Operating Systems

Most modern computers are Multitasking: they run several Processes or Tasks at the same time. The most common operating system for workstations and high-performance computers is Unix.

· Active Processes are being executed by the Processing Unit(s).
· Idle Processes are waiting to execute.
· Blocked Processes are waiting for some external event (e.g., the reading of data from a file). When this is accomplished, they become idle, waiting their turn for execution.
Performance Models

Measures of Machine Performance

The clock cycle time is a simple, but rather inadequate, measure of the performance of a modern computer:
· Processors must act in conjunction with memories and buses.
· The efficiency of executing instructions in each clock cycle can vary widely.

The basic performance of a single-processor computer system can be expressed as

T = n x CPI x t

where T is the time to execute, n is the number of instructions executed, CPI is the number of cycles per instruction, and t is the time per clock cycle.
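For instance, a short C sketch that evaluates the formula; the inputs (2×10^9 instructions, CPI of 1.5, a 3 GHz clock) are made up purely for illustration.

    #include <stdio.h>

    /* Execution time of a single-processor run: T = n * CPI * t.
     * The numbers below are illustrative inputs, not measurements. */
    int main(void) {
        double n   = 2.0e9;        /* instructions executed              */
        double cpi = 1.5;          /* average cycles per instruction     */
        double t   = 1.0 / 3.0e9;  /* seconds per cycle at a 3 GHz clock */
        printf("T = %.3f s\n", n * cpi * t);   /* 2e9 * 1.5 / 3e9 = 1.000 s */
        return 0;
    }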
RISC vs. CISC Architectures

RISC (Reduced Instruction Set Computer): implements a few very simple instructions.
CISC (Complex Instruction Set Computer): implements a larger instruction set that does more complicated things.

In recent years, the RISC architecture has proven a better match with modern developments in VLSI (Very Large Scale Integration) chip manufacture:

· Simple instructions allow powerful implementation techniques such as pipelining.
· Simple instructions leave more room on the chip: on-board cache, multiple arithmetic units, etc.
· Simple instruction sets mean cycle times can often be much faster.
· Simple instruction sets need less logic and less space on the chip, and smaller circuits run faster and cooler.
Overlapping Instructions: Pipelining
Some Performance Metrics

MIPS = "Millions of instructions per second"

MFLOP/s = "Millions of floating-point operations per second"; larger units: GigaFlop/s, TeraFlop/s, ExaFlop/s ...

Theoretical Peak MFLOP/s = the MFLOP/s the machine would deliver if it did nothing but numerical operations

Benchmarks = programs designed to determine performance metrics for machines.

Examples: HPL, the NAS Parallel Benchmarks (NPB), LINPACK
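As an illustration of how such a metric is obtained, here is a toy C sketch that times a simple floating-point loop and reports an approximate MFLOP/s figure. This is only a rough sketch, not HPL or the NAS Parallel Benchmarks; the array size and repetition count are arbitrary.

    #include <stdio.h>
    #include <time.h>

    /* A rough, illustrative MFLOP/s measurement: time a simple floating-point loop. */
    int main(void) {
        enum { N = 5000000, REPS = 10 };
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t start = clock();
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                a[i] = b[i] * c[i] + b[i];      /* 2 floating-point ops per iteration */
        double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

        double mflops = 2.0 * N * REPS / seconds / 1.0e6;
        printf("approx. %.1f MFLOP/s (a[0] = %g)\n", mflops, a[0]);
        return 0;
    }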


Parallel Processing

The basic performance of a single-processor computer system can be expressed as

T = n * CPI * t

where T is the time to execute, n is the number of instructions executed, CPI is the number of cycles per instruction, and t is the time per clock cycle. Decreasing the clock time t is a matter of engineering: generally, smaller, faster circuits lead to better clock speed. Decreasing the other two factors involves some version of parallelism. There are several levels of parallelism:

1. Job-Level Parallelism: The computer center purchases more computers so more jobs can be run in a given period.
2. Program-Level Parallelism: A single program is broken into constituent parts, and different processors compute each part.
3. Instruction-Level Parallelism: Techniques such as pipelining allow more throughput by the execution of overlapping instructions.
4. Arithmetic and Bit-Level Parallelism: Low-level parallelism primarily of interest to designers of the arithmetic logic units; relatively invisible to the user.
Granularity of Tasks

Parallel operations may be classified according to the size of the operations running in parallel.

Large-Grain System: operations running in parallel are large (of the order of program size).

Small-Grain System: operations running in parallel are small (of the order of a few instructions).
Resource Conflicts and Dependencies

Overlapping operations in a pipeline requires that the operations be independent of each other. There are various ways in which this condition may be violated. Such Resource Conflicts or Dependencies inhibit pipelining efficiency. For example, suppose a code implements the instructions

R2 = R0 + R1
R4 = R2 + R1

This is an example of a data dependency: the processor cannot send the second pair of operands to the pipelined adder until the result of the first addition has exited the pipeline (because only then will the correct value of R2 be known). Such dependencies lead to periods when the pipeline stages are empty, termed bubbles.

Instruction pipelines are used to speed up the fetch-decode-execute cycle. The pipeline constantly exploits locality of reference by "looking ahead" and fetching instructions it thinks the processor will soon need. If a branch or loop instruction in the program invalidates this look-ahead, a bubble appears in the instruction pipeline while the fetch stage goes to look for the new instructions. This is called a control dependency.
Memory Organization

In high-performance computing, it is important to match the (generally slower) memory accesses as well as possible with the (generally faster) processor cycling. This is particularly true for pipelined units that derive their efficiency from a constant supply of fresh operands for the pipeline. The primary difficulty is the memory cycle time, during which the memory is not accessible by subsequent operations.

For parallel systems there are two general memory designs:
· Shared Memory Systems, in which there is one large virtual memory to which all processors have equivalent access.
· Distributed Memory Systems, in which each processor has its own local memory, not directly accessible from other processors.
Interconnect Topologies for Parallel Systems

A major consideration for parallel systems is the manner in which the processors, memories, and switches communicate with each other. The connections among these define the topology of the machine.

Examples:
· Ring vs. Fully Connected
· Hypercube Networks
· Tree and Star Topologies
· Mesh Topologies
Basic Types of Parallel Architectures

(Figure: a shared memory machine — processors P0, P1, P2, ..., PN all connected to a single shared Memory.)

Flynn's Taxonomy of Parallel Architectures

http://csep1.phy.ornl.gov/csep.html
Vector Supercomputers

HIGH PERFORMANCE COMPUTERS

High Performance Computers or Supercomputers

"Supercomputers are the fastest and most powerful general purpose scientific computing systems available at any given time."
Dongarra et al., "Numerical Linear Algebra for High-Performance Computers", SIAM, 1998

Turing's Bombe, UK, 1941

Cray XT5 at Oak Ridge, USA, 2009
2.3 Petaflops, 224K AMD cores
The TOP500 List
www.top500.org

□ The main objective of TOP500 is to provide a ranked list of general purpose systems that are in common use for high end applications.
□ It is based on the LINPACK Benchmark, which solves a dense system of linear equations by LU factorization.
□ A parallel implementation of the LINPACK benchmark and instructions on how to run it can be found at http://www.netlib.org/benchmark/hpl/
□ TOP500 uses the benchmark version that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine.
LINPACK's Evolution

LINPACK 100 (serial)
LINPACK 1000
High Performance LINPACK n×n (HPL)

Other packages for dense linear algebra (www.netlib.org): LAPACK and ScaLAPACK
LINPACK's Building Blocks

□ BLAS (Basic Linear Algebra Subprograms): http://www.netlib.org/blas
  – Level 1: vector-vector operations; y = y + ax
  – Level 2: matrix-vector operations; y = y + Ax
  – Level 3: matrix-matrix operations; C = C + A*B
□ Highly optimized BLAS (see the Level 1 sketch below):
  – ATLAS (free optimized BLAS generator); http://www.netlib.org/atlas
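As a sketch of what a Level 1 BLAS call looks like from C, assuming a CBLAS-providing library such as ATLAS or OpenBLAS is installed and linked (e.g., with -lcblas):

    #include <stdio.h>
    #include <cblas.h>   /* C interface to BLAS; provided by ATLAS, OpenBLAS, etc. */

    int main(void) {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        double y[4] = {1.0, 1.0, 1.0, 1.0};

        /* Level 1 BLAS: y = y + a*x (daxpy), with a = 2.0 and unit strides. */
        cblas_daxpy(4, 2.0, x, 1, y, 1);

        for (int i = 0; i < 4; i++)
            printf("y[%d] = %g\n", i, y[i]);   /* 3 5 7 9 */
        return 0;
    }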
HPL Benchmark Highlights

□ Extracts the MAXIMUM sustained performance of a given system
□ Results listed in TOP500:
  – Rmax = performance in Gflop/s for the biggest problem run
  – Nmax = size of the biggest problem run
  – N1/2 = problem size at which half of Rmax is sustained
  – Rpeak = theoretical peak performance
□ Factors affecting HPL performance:
  – Implementation; human effort; operating system; hardware; network; compiler; BLAS; etc.
The Power Wall

#1 on TOP500: Jaguar, 7 MW, MFLOPS/W = 251

Source: J. Dongarra, http://www.netlib.org/utk/people/JackDongarra/talks.html
Moore's Law Revisited

Clock frequency scaling has been replaced by scaling the number of cores per chip.
Source: J. Dongarra, http://www.netlib.org/utk/people/JackDongarra/talks.html

http://www.lanl.gov/roadrunner/

Coupling refers to the degree of direct knowledge that one element has of another.
Source: J. Dongarra, http://www.netlib.org/utk/people/JackDongarra/talks.html
Current HPC Systems – Exascale Computing

HIGH PERFORMANCE APPLICATIONS
Taxonomy of Parallel Applications

Few or no communication:            Embarrassingly Parallel (EP)
Explicit (neighbor) communication:  Explicit Structured (ES)  |  Explicit Unstructured (EU)
Implicit (global) communication:    Implicit Structured (IS)  |  Implicit Unstructured (IU)
(columns: structured communication vs. unstructured communication)

H. D. Simon. High Performance Computing: Architecture, Software, Algorithms. Technical Report RNR-93-018, NASA Ames Research Center, Moffett Field, CA 94035, December 1993.
Measuring Parallel Performance

□ T1 – serial execution time (1 processor)
□ TP – parallel execution time on p processors
□ Speed-up ➔ SP = T1/TP
□ Efficiency ➔ EP = T1/(p·TP)
□ Therefore: EP = SP/p, SP ≤ p, EP ≤ 1
□ Note: anomalies may happen as p increases due to other resources (e.g., cache) – superlinear speed-up
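A minimal C sketch that computes SP and EP from measured serial and parallel times; the times and processor count below are made-up illustrations.

    #include <stdio.h>

    /* Speed-up and efficiency: SP = T1/TP, EP = T1/(p*TP). */
    int main(void) {
        double t1 = 100.0;      /* serial time, seconds (illustrative) */
        double tp = 15.0;       /* time on p processors (illustrative) */
        int    p  = 8;

        double speedup    = t1 / tp;            /* 6.67 */
        double efficiency = t1 / (p * tp);      /* 0.83 */
        printf("SP = %.2f, EP = %.2f\n", speedup, efficiency);
        return 0;
    }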
Amdahl's Law (1967)

□ Serial fraction: s, 0 ≤ s ≤ 1
□ Parallel fraction on p processors: 1 - s
□ Then: SP = 1 / (s + (1-s)/p)
□ Corollary: SP ≤ 1/s, no matter how many processors are used
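A short C sketch that tabulates Amdahl speed-ups for an assumed 5% serial fraction, showing the saturation at 1/s:

    #include <stdio.h>

    /* Amdahl's law: SP = 1 / (s + (1-s)/p).  With a 5% serial fraction the
     * speed-up saturates at 1/s = 20 no matter how many processors are used. */
    int main(void) {
        double s = 0.05;                           /* serial fraction (assumed) */
        for (int p = 1; p <= 1024; p *= 4) {
            double sp = 1.0 / (s + (1.0 - s) / p);
            printf("p = %4d   SP = %6.2f\n", p, sp);
        }
        return 0;
    }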
Scalability

□ Scalability refers to how efficiently a given algorithm can use additional processors.
□ An algorithm is scalable if, as p increases, its efficiency remains constant while the problem size increases.
Code Optimization and Programming

Before even thinking of parallelizing a code:
□ Optimize your code for a given class of processors
  – This is what reduces CPU time
□ Use all optimization TOOLS existing in the compiler and in the system
□ Always verify that the code is working properly
□ Always use standard libraries such as BLAS, LAPACK, etc.
Basic Parallel Programming Models

□ Distributed Memory Machines
  – Message Passing: send/receive
□ Message Passing Library
  – Message Passing Interface http://www.mpi.com

    call MPI_Send( ... )
    call MPI_Recv( ... )
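A minimal C sketch of the same send/receive pattern; the payload, tag, and ranks are arbitrary illustrations.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal MPI send/receive: rank 0 sends one double to rank 1.
     * Compile with mpicc and run with at least 2 processes (e.g., mpirun -np 2 ./a.out). */
    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double value = 3.14;              /* arbitrary payload */
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %g\n", value);
        }

        MPI_Finalize();
        return 0;
    }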
Basic Parallel Programming Models

□ Threaded machines: compiler directives

  OpenMP (http://www.openmp.org):
    !$OMP PARALLEL DO PRIVATE (J)
    DO J=1,M
      ...
    ENDDO

  OpenCL: http://www.khronos.org/opencl/

  CUDA (http://www.nvidia.com/object/cuda_home.html#/):
    // send data from host to device: a_h to a_d
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
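The Fortran OpenMP directive above has a direct C counterpart; a minimal sketch assuming an OpenMP-capable compiler (e.g., gcc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    /* C counterpart of the OpenMP loop directive above: iterations of the
     * loop are distributed across the available threads. */
    int main(void) {
        enum { M = 1000000 };
        static double a[M];

        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            a[j] = 2.0 * j;               /* independent iterations, safe to parallelize */

        printf("threads available: %d, a[M-1] = %g\n",
               omp_get_max_threads(), a[M - 1]);
        return 0;
    }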
Development Tools

Numerical Libraries
– Netlib: http://www.netlib.org
– ACTS (Advanced Computational Testing and Simulation) Toolkit: http://acts.nersc.gov/
  • PETSc (Portable, Extensible Toolkit for Scientific Computation)
  • ScaLAPACK extends LAPACK's high-performance linear algebra software to distributed memory machines
Why computer simulation?

From: A SCIENCE-BASED CASE FOR LARGE-SCALE SIMULATION, DOE, 2003
Final Remarks
□ Computational Engineering and Science
changed the way we view engineering
□ There is no general approach
□ Integrated approach: HPC, Visualization,
Storage and Communications
□ Challenges:
– Managing complexity: programming models, data structures
and computer architecture ➔ performance
– Understanding the results of a computation: visualization,
data integration, knowledge extraction
– Collaboration: grid, web, data security
