Advanced Computer Architecture
Program Partitioning and Scheduling
Program partitioning
The transformation of a sequentially coded program
into a parallel executable form can be done
manually by the programmer using explicit
parallelism, or automatically by a compiler
detecting implicit parallelism.
Program partitioning determines whether the given
program can be partitioned or split into pieces that
can be executed in parallel or follow a certain
prespecified order of execution.
Program Partitioning & Scheduling
The size of the parts or pieces of a program that
can be considered for parallel execution can vary.
The sizes are roughly classified using the term
“granule size,” or simply “granularity.”
The simplest measure, for example, is the number
of instructions in a program part.
Grain sizes are usually described as fine, medium
or coarse, depending on the level of parallelism
involved.
Latency
Latency is the time required for communication
between different subsystems in a computer.
Memory latency, for example, is the time required
by a processor to access memory.
Synchronization latency is the time required for two
processes to synchronize their execution.
Communication latency is the time required for
interprocessor communication.
Levels of Parallelism
[Figure: the five levels of program parallelism, from coarse grain to fine grain]
Level 5: Jobs or programs (coarse grain)
Level 4: Subprograms, job steps, or related parts of a program (medium to coarse grain)
Level 3: Procedures, subroutines, tasks, or coroutines (medium grain)
Level 2: Non-recursive loops or unfolded iterations (fine grain)
Level 1: Instructions or statements (fine grain)
Moving from coarse grain toward fine grain gives a
higher degree of parallelism, but also increasing
communication demand and scheduling overhead.
Instruction Level Parallelism
At this fine-grained (smallest granularity) level, a
grain typically contains fewer than 20 instructions.
The number of candidates for parallel execution
varies from two to thousands, with the average
parallelism at this level being about five
instructions or statements.
Advantages: The exploitation of fine-grain parallelism can
be assisted by an optimizing compiler, which should be able
to detect the parallelism automatically and translate the
source code into a parallel form that the run-time system
can recognize.
Loop-level Parallelism
A typical loop contains fewer than 500 instructions.
If the loop iterations are independent of one
another, the loop can be handled by a pipeline or by
a SIMD machine.
Loop-level parallelism is the most optimized
program construct for execution on a parallel or
vector machine.
Some loops (e.g., recursive loops) are difficult to handle.
Loop-level parallelism is still considered fine-grain
computation.
Procedure-level Parallelism
Medium-sized grain; usually less than 2000
instructions.
Detection of parallelism is more difficult than with
smaller grains; interprocedural dependence analysis
is difficult and history-sensitive.
The communication requirement is lower than at the
instruction level.
SPMD (single procedure, multiple data) execution
mode is a special case at this level.
Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; grain typically has thousands of
instructions; medium- or coarse-grain level.
Job steps can overlap across different jobs.
Multiprogramming conducted at this level.
In the past, parallelism at this level has been
exploited by algorithm designers or programmers
rather than by compilers.
Good compilers for exploiting medium- or
coarse-grain parallelism are not yet available.
Job or Program-Level Parallelism
Corresponds to execution of essentially
independent jobs or programs on a parallel
computer.
The grain size can be as large as tens of thousands
of instructions in a single program.
For supercomputers with a small number of very
powerful processors, such coarse-grain parallelism
is practical.
Time-sharing or space-sharing multiprocessors
exploit this level of parallelism.
Job-level parallelism is handled by the program
loader or by the operating system.
Summary
Fine-grain exploited at instruction or loop levels, assisted by
the compiler.
Medium-grain (task or job step) requires programmer and
compiler support.
Coarse-grain relies heavily on effective OS support.
Shared-variable communication used at fine- and medium-
grain levels.
Message passing can be used for medium- and coarse-
grain communication, but fine-grain parallelism really
needs a better technique because of its heavier
communication requirements.
Communication Latency
Balancing granularity and latency can yield better
performance.
The various latencies are attributable to the machine
architecture, the implementation technology, and the
communication patterns used.
Latency imposes a limiting factor on machine
scalability. For example, memory latency increases
as memory capacity increases, limiting the amount of
memory that can be used within a given tolerance for
communication latency.
Interprocessor Communication
Latency
Needs to be minimized by system designer
Affected by signal delays and communication
patterns
For example, n communicating tasks may require
n(n - 1)/2 communication links; this complexity grows
quadratically, effectively limiting the number of
processors in the system.
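As a quick illustration of this quadratic growth (a minimal Python sketch; the function name is ours, not from the slides):

def full_interconnect_links(n: int) -> int:
    """Links needed so each of n communicating tasks has a direct
    point-to-point link to every other task: n(n - 1)/2."""
    return n * (n - 1) // 2

# The link count grows quadratically with the number of tasks:
for n in (4, 16, 64, 256):
    print(n, full_interconnect_links(n))   # 6, 120, 2016, 32640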
Communication Patterns
Determined by algorithms used and architectural
support provided
Patterns include
permutations
broadcast
multicast
conference
Tradeoffs often exist between granularity of
parallelism and communication demand.
Grain Packing and Scheduling
Two questions:
How can I partition a program into parallel “pieces” to
yield the shortest execution time?
What is the optimal size of parallel grains?
There is an obvious tradeoff between the time
spent scheduling and synchronizing parallel grains
and the speedup obtained by parallel execution.
One approach to the problem is called “grain
packing.”
Grain packing and scheduling
The grain size problem requires determining both the
number of partitions and the size of the grains in a
parallel program.
The solution is both problem-dependent and machine-
dependent.
The goal is a short schedule for fast execution of
the subdivided program modules.
Grain determination and
scheduling optimization
Step 1: Construct a fine-grain program graph
Step 2: Schedule the fine-grain computation
Step 3: Grain packing to produce coarse grains
Step 4: Generate a parallel schedule based on
the packed graph
Program Graphs and Packing
A program graph is similar to a dependence graph
Nodes = { (n,s) }, where n = node name, s = size (larger
s = larger grain size).
Edges = { (v,d) }, where v = variable being
“communicated,” and d = communication delay.
Packing two (or more) nodes produces a node with
a larger grain size and possibly more edges to
other nodes.
Packing is done to eliminate unnecessary
communication delays or reduce overall scheduling
overhead.
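The sketch below renders this node/edge notation and the packing step in Python (a minimal, illustrative sketch; the class and method names are our own, and parallel edges between the same pair of packed nodes are merged for simplicity). Packing a set of nodes produces one coarse-grain node whose grain size is the sum of the member sizes; edges internal to the packed set are dropped, while edges crossing the boundary are kept.

from dataclasses import dataclass, field

@dataclass
class ProgramGraph:
    # nodes: name -> grain size s (basic machine cycles for the node)
    nodes: dict = field(default_factory=dict)
    # edges: (source, destination) -> (variable v, communication delay d)
    edges: dict = field(default_factory=dict)

    def pack(self, members, new_name):
        """Combine the given fine-grain nodes into one coarse-grain node."""
        members = set(members)
        # Grain size of the packed node is the sum of the member grain sizes.
        self.nodes[new_name] = sum(self.nodes.pop(m) for m in members)
        packed_edges = {}
        for (src, dst), (v, d) in self.edges.items():
            src = new_name if src in members else src
            dst = new_name if dst in members else dst
            if src == dst:
                continue  # delay internal to one coarse node: treated as negligible
            packed_edges[(src, dst)] = (v, d)
        self.edges = packed_edges

Calling pack on a group of fine-grain nodes therefore removes their mutual communication edges and leaves a single node that is scheduled as one unit on one processor.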
The basic concept of program partitioning is introduced here.
The figure shows an example program graph in two different
grain sizes.
A program graph shows the structure of a program.
It is very similar to a dependence graph.
Each node in the program graph corresponds to a computational unit
in the program.
The grain size is measured by the number of basic machine
cycles (including both processor and memory cycles) needed
to execute all the operations within the node.
We denote each node in the figure by a pair (n, s), where n
is the node name (id) and s is the grain size of the node.
Thus grain size reflects the number of computations involved in a
program segment.
Fine-grain nodes have a smaller grain size, and coarse-grain
nodes have a larger grain size.
The edge label (v, d) between two end nodes specifies the
output variable v from the source node (or the input variable
to the destination node) and the communication delay d
between them.
This delay includes all the path delays and the memory
latency involved.
There are 17 nodes in the fine-grain program graph (Fig. a)
and 5 in the coarse-grain program graph (Fig. b).
Each coarse-grain node is obtained by combining (grouping)
multiple fine-grain nodes.
The fine-grain graph corresponds to the following program:
Nodes 1, 2, 3, 4, 5, and 6 are memory reference (data fetch)
operations.
Each takes one cycle to address and six cycles to fetch
from memory.
All remaining nodes (7 to 17) are CPU operations, each
requiring two cycles to complete. After packing, the
coarse-grain nodes have larger grain sizes, ranging
from 4 to 8 as shown.
The node (A, 8) in Fig. (b) is obtained by combining the
nodes (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), and
(11, 2) in Fig. (a). The grain size, 8, of node A is the sum
of all the grain sizes being combined
(1 + 1 + 1 + 1 + 1 + 1 + 2 = 8).
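As a small check of that summation (a sketch using only the node sizes quoted above):

# (node name, grain size) pairs packed into coarse-grain node A, as listed above
members_of_A = [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (11, 2)]
grain_size_A = sum(size for _, size in members_of_A)
print(grain_size_A)   # 8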
The idea of grain packing is to apply fine-grain partitioning first in
order to achieve a higher degree of parallelism. Multiple fine-grain
nodes are then combined (packed) into a coarse-grain node if doing so
eliminates unnecessary communication delays or reduces the overall
scheduling overhead.
Usually, all fine-grain operations within a single coarse-grain node are
assigned to the same processor for execution. A fine-grain partition of a
program often demands more interprocessor communication than a
coarse-grain partition. Thus grain packing offers a tradeoff between
parallelism and scheduling/communication overhead.
Internal delays among fine-grain operations within the same coarse-
grain node are negligible, because communication delay is
contributed mainly by interprocessor delays rather than by delays
within the same processor.
The choice of the optimal grain size is meant to achieve the shortest
schedule for the nodes on a parallel computer system.
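To make the "shortest schedule" objective concrete, here is a minimal greedy list-scheduling sketch in Python (our own illustration, not the scheduling algorithm from the text): it assigns each ready node to the processor on which it can start earliest, charging an edge's communication delay only when producer and consumer land on different processors. The toy numbers in the usage example are invented for illustration.

def list_schedule(sizes, edges, num_procs):
    """Greedy list scheduling with communication delays.

    sizes: {node: grain size in cycles}
    edges: {(src, dst): delay}, charged only if src and dst run
           on different processors.
    Returns the schedule length (makespan) in cycles.
    """
    preds = {n: [] for n in sizes}
    for (src, dst) in edges:
        preds[dst].append(src)

    proc_free = [0] * num_procs          # time each processor becomes free
    finish = {}                          # node -> (finish time, processor)
    remaining = set(sizes)

    while remaining:
        # pick any node whose predecessors are all scheduled
        ready = [n for n in remaining if all(p in finish for p in preds[n])]
        node = min(ready)                # simple, deterministic priority
        # choose the processor giving the earliest start time
        best_start, best_proc = None, None
        for p in range(num_procs):
            data_ready = max(
                (finish[s][0] + (edges[(s, node)] if finish[s][1] != p else 0)
                 for s in preds[node]),
                default=0)
            start = max(proc_free[p], data_ready)
            if best_start is None or start < best_start:
                best_start, best_proc = start, p
        proc_free[best_proc] = best_start + sizes[node]
        finish[node] = (proc_free[best_proc], best_proc)
        remaining.remove(node)

    return max(proc_free)

# Toy example: two independent 2-cycle nodes feed a third 2-cycle node,
# with a 6-cycle communication delay if producer and consumer differ.
sizes = {1: 2, 2: 2, 3: 2}
edges = {(1, 3): 6, (2, 3): 6}
print(list_schedule(sizes, edges, num_procs=1))  # 6: everything runs sequentially
print(list_schedule(sizes, edges, num_procs=2))  # 10: the communication delay dominates

In this toy case, packing all three nodes into one grain on one processor removes both delays and restores the 6-cycle schedule, which is exactly the tradeoff grain packing exploits.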
THANK YOU