
CS516: Parallelization of Programs

Introduction

Vishwesh Jatala
Assistant Professor
Department of CSE
Indian Institute of Technology Bhilai
[email protected]

2023-24 W
What is the output?
Outline of Today’s Lecture

■ Why?
■ What?
■ How?

Moore’s Law

■ The number of transistors on an IC doubles about every two years.
Moore’s Law Effect

Single-core performance scaling ran into power and heat limits, which drove the shift to multicore processors.
Parallel Architectures are Everywhere!

Parallel Hardware

■ Distributed CPUs
■ Multicores
■ GPUs
Hardware and Software

[Diagram: a sequential program runs on a single-core CPU (Core, L1 Cache, L2 Cache, DRAM) and produces its output.]
Hardware and Software

[Diagram: a multi-core CPU with four cores (Core1-Core4), each with a private L1 Cache, sharing an L2 Cache and DRAM.]
Will the same sequential program run any faster here?
Professor P

15 questions, 300 answer sheets
Professor P’s Teaching Assistants

TA#1, TA#2, TA#3
Benefits of Parallel Programming

■ Fast: less execution time
■ Saves money
■ Solves larger problems
Parallel Programming Applications

■ Deep learning and machine learning
■ Medical imaging
■ Climate modeling
■ Computational fluid dynamics (CFD)
Parallel Programming Applications

■ OpenAI used a supercomputer with more than 285,000 CPU cores, 10,000 GPUs, and 400 gigabits per second of network connectivity for each GPU server.
Challenges!

Example-1

Sequential Execution

for (int i = 0; i < 5; i++)
    A[i] = i;

Core-0: A[0]=0, A[1]=1, A[2]=2, A[3]=3, A[4]=4
Example-1

for (int i = 0; i < 5; i++)
    A[i] = i;

Parallel Execution

Core-0   Core-1   Core-2   Core-3   Core-4
A[0]=0   A[1]=1   A[2]=2   A[3]=3   A[4]=4
Example-2

Sequential Execution

int count = 0;
for (int i = 0; i < 5; i++)
    A[i] = count++;

Core-0: A[0]=0, A[1]=1, A[2]=2, A[3]=3, A[4]=4
Example-2

int count = 0;
for (int i = 0; i < 5; i++)
    A[i] = count++;

Parallel Execution (one possible outcome)

Core-0   Core-1   Core-2   Core-3   Core-4
A[0]=1   A[1]=2   A[2]=0   A[3]=4   A[4]=3
Challenges:

Detecting Parallelism is Hard!

Example-3: Sequential Version

int sum = 0;
for (i = 0; i < n; i++) {
    x = f(i);
    sum = sum + x;
}

A Sequential Program for Sum

Core-0: 1 4 3 9 2 8 5 1 1 6 2 7 2 5 0 4 1 8 6 5 1 2 3 9

sum = 95
Example-3: Parallel Version-1

my_sum = 0;
my_first_i = . . . ;
my_last_i = . . . ;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = f(my_i);
    my_sum += my_x;
}

Core-0   Core-1   Core-2   Core-3   Core-4   Core-5   Core-6   Core-7
1 4 3    9 2 8    5 1 1    6 2 7    2 5 0    4 1 8    6 5 1    2 3 9

my_sum:  8   19   7   15   7   13   12   14
Example-3: Parallel Version-1

if (I’m the master core) {
    sum = my_sum;
    for (each core other than myself) {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}

Core-0   Core-1   Core-2   Core-3   Core-4   Core-5   Core-6   Core-7
1 4 3    9 2 8    5 1 1    6 2 7    2 5 0    4 1 8    6 5 1    2 3 9

my_sum:  8   19   7   15   7   13   12   14

sum = 95 (8+19+7+15+7+13+12+14)
Example-3: Parallel Version-2

[Figure: the partial sums are combined pairwise in a tree rather than one at a time at the master core.]
Parallel Version-1 or Version-2?

■ Both have the same number of operations
■ Version-1’s final sum is sequential
■ Version-2 exposes parallelism in the final sum
Challenges:

Detecting Parallelism is Hard!

Communication

Synchronization

Example-4:

int sum = 0;
for (i = 0; i < n; i++) {
    x = f(i);
    sum = sum + x;
}
Challenges:

Detecting Parallelism is Hard!

Communication

Synchronization

Load Balancing

What did we learn so far?

■ Parallel hardware is inevitable
■ Sequential programs are inefficient
■ Parallelization is promising
■ Challenges!
Let’s discuss the details from the next class!
Course Outline (Part-1)

■ Introduction (this lecture)
■ Overview of parallel architectures and programming models
■ Amdahl's law and performance
■ Parallel programming
  ❑ GPUs and CUDA programming
  ❑ Optimizations
■ Case studies
■ Extracting parallelism from sequential programs automatically
Course Logistics

■ Lecture Hours:
❑ Mon, Tue, Thursday 10:30 am - 11:25 am

■ Course Website: Canvas platform


❑ Lecture notes
❑ Assignments
❑ Project
❑ Discussions
❑ Marks

Course Logistics: Evaluation Scheme

■ Evaluation scheme (subject to minor changes):


❑ Exams: ~40%
❑ Project: ~35%
❑ Attendance: ~10%
❑ Assignments (Paper presentation?): 15%

■ Attendance
❑ 0% - 50%: 0 Marks
❑ >50%: Marks will be awarded out of 10 accordingly.
❑ Example:
■ Total sessions: 16 (50% threshold = 8 sessions)
■ #sessions attended = 7 (<50%), marks = 0
■ #sessions attended = 10 (62.5%), marks = (10-8)*10/8 = 2.5
Course Logistics: References

■ Lecture notes will be available on Canvas
■ Reference material will be provided on Canvas
■ Textbook for extracting parallelism:
  ❑ Randy Allen, Ken Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann, 2001
Course Logistics: Tools

■ Platforms:
  ❑ Prefer Google Colab for GPUs and CUDA programming.
  ❑ A demo session will be held.
Course Logistics

■ Evaluation Policy:
❑ Acknowledge all the sources
❑ Do not cheat

Outcome of the Course?

■ State-of-the-art techniques in parallel computing
■ Develop parallel programming skills
■ Transferable skills:
  ❑ Parallel programming is used in multiple disciplines
  ❑ Industries
  ❑ Education and research
■ Handle projects
■ High-performance computing
Course Logistics

Office: B-010

[email protected]

References

■ https://www.cse.iitd.ac.in/~soham/COL380/page.html
■ https://s3.wp.wsu.edu/uploads/sites/1122/2017/05/6-
9-2017-slides-vFinal.pptx
■ Miscellaneous resources on the internet

Thank you!
