
CSCI 6461 Vinod Kumar (G39671299) FALL 2024 Semester

Computer System Architecture, HW 4


Dr. Lei He
Q. 3.14 Answer_a)
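For context, here is a minimal C sketch of the DAXPY-style loop that the assembly below implements. The function name and the element count of 100 are assumptions inferred from the code (the #800 upper bound with an 8-byte stride implies 100 double-precision elements).

/* DAXPY-style kernel: Y(i) = a * X(i) + Y(i).
 * Assumed: 100 elements, since the loop bound is x1 + 800 bytes with
 * 8-byte doubles. */
void daxpy(double a, const double *X, double *Y)
{
    for (int i = 0; i < 100; i++) {
        Y[i] = a * X[i] + Y[i];   /* fld, fmul.d, fld, fadd.d, fsd */
    }
}
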
1. Unscheduled Code:
Clock cycle Instruction
1 addi x4, x1, #800 // Set up loop upper bound
2 fld F2, 0(x1) // Load X(i) → F2
3 stall
4 fmul.d F4, F2, F0 // F4 = a * X(i)
5 fld F6, 0(x2) // Load Y(i) → F6
6 stall
7 fadd.d F6, F4, F6 // F6 = a * X(i) + Y(i)
8 stall
9 stall
10 stall
11 fsd F6, 0(x2) // Store Y(i) = a * X(i) + Y(i)
12 addi x1, x1, #8 // Increment X index
13 addi x2, x2, #8 // Increment Y index
14 sltu x3, x1, x4 // Test if X(i) < X upper bound
15 stall
16 bnez x3, foo // Branch if needed
17 stall
Total = 16 clock cycles per element (cycle 1 is one-time loop setup; the loop body occupies cycles 2–17).
2. Scheduled Code:
Clock cycle Instruction
1 addi x4, x1, #800 // Set up loop upper bound
2 fld F2, 0(x1) // Load X(i) → F2
3 fld F6, 0(x2) // Load Y(i) → F6 (parallel load)
4 fmul.d F4, F2, F0 // F4 = a * X(i)
5 addi x1, x1, #8 // Increment X index
6 addi x2, x2, #8 // Increment Y index
7 sltu x3, x1, x4 // Test if X(i) < X upper bound
8 fadd.d F6, F4, F6 // F6 = a * X(i) + Y(i)
9 stall // Floating-point addition delay
10 bnez x3, foo // Branch if needed
11 fsd F6, -8(x2) // Store Y(i) = a * X(i) + Y(i)
Total = 10 clock cycles per element (cycle 1 is one-time loop setup; the loop body occupies cycles 2–11).
Execution Time per Element:
Unscheduled Code = 16 clock cycles per element.
Scheduled Code = 10 clock cycles per element.
The speedup from the scheduled code can be calculated as:
Speedup = Execution Time (Unscheduled) / Execution Time (Scheduled) = 16 / 10 = 1.6
This means the scheduled code is 1.6 times faster than the unscheduled code, or, in percentage terms:
Percentage Speedup = (1.6 − 1) × 100 = 60%
So, the scheduled code is 60% faster than the unscheduled code.
Thus, the clock speed must be 60% faster for the unscheduled code to match the performance
of the scheduled code on the original hardware.

Answer_b)
Clock Cycle Instruction
1 addi x4, x1, #800 // Set up loop upper bound
2 fld F2, 0(x1) // Load X(i) → F2
3 fld F6, 0(x2) // Load Y(i) → F6 (parallel load)
4 fmul.d F4, F2, F0 // F4 = a * X(i)
5 fld F2, 8(x1) // Load X(i+1) → F2 for next iteration
6 fld F10, 8(x2) // Load Y(i+1) → F10 (parallel load)
7 fmul.d F8, F2, F0 // F8 = a * X(i+1)
8 fld F2, 16(x1) // Load X(i+2) → F2 for next iteration
9 fld F14, 16(x2) // Load Y(i+2) → F14 (parallel load)
10 fmul.d F12, F2, F0 // F12 = a * X(i+2)
11 fadd.d F6, F4, F6 // F6 = a * X(i) + Y(i) (complete the first iteration)
12 addi x1, x1, #24 // Increment X index for the next 3 iterations
13 fadd.d F10, F8, F10 // F10 = a * X(i+1) + Y(i+1)
14 addi x2, x2, #24 // Increment Y index for the next 3 iterations
15 sltu x3, x1, x4 // Test if X(i+3) < X upper bound for the loop
16 fadd.d F14, F12, F14 // F14 = a * X(i+2) + Y(i+2)
17 fsd F6, -24(x2) // Store the result Y(i) = a * X(i) + Y(i)
18 fsd F10, -16(x2) // Store the result Y(i+1) = a * X(i+1) + Y(i+1)
19 bnez x3, foo // Branch if the loop continues
20 fsd F14, -8(x2) // Store the result Y(i+2) = a * X(i+2) + Y(i+2)
The unrolled loop completes 3 iterations in 20 clock cycles.
Execution Time Per Element = 20 cycles / 3 elements ≈ 6.67 cycles per element.
• The loop must be unrolled three times to eliminate stalls and minimize loop overhead.
• The execution time per element after unrolling and scheduling is approximately 6.67
cycles per element.

Therefore, by unrolling the loop three times, we significantly reduce the loop overhead and
allow multiple independent operations to be executed in parallel, leading to a substantial
improvement in execution time.
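
For comparison, here is a source-level sketch of the same loop unrolled by a factor of three, mirroring the register assignments used in the table above. The function name and the cleanup loop for element counts not divisible by 3 are illustrative additions not shown in the assembly.

/* Loop unrolled by 3: three independent multiply-add chains per iteration,
 * so the FP multiply/add latency of one chain is hidden behind work from
 * the others. */
void daxpy_unrolled3(double a, const double *X, double *Y, int n)
{
    int i;
    for (i = 0; i + 3 <= n; i += 3) {
        double x0 = X[i],     y0 = Y[i];       /* fld F2,  fld F6  */
        double x1 = X[i + 1], y1 = Y[i + 1];   /* fld F2,  fld F10 */
        double x2 = X[i + 2], y2 = Y[i + 2];   /* fld F2,  fld F14 */
        Y[i]     = a * x0 + y0;                /* fmul.d F4,  fadd.d F6,  fsd F6  */
        Y[i + 1] = a * x1 + y1;                /* fmul.d F8,  fadd.d F10, fsd F10 */
        Y[i + 2] = a * x2 + y2;                /* fmul.d F12, fadd.d F14, fsd F14 */
    }
    for (; i < n; i++)                         /* cleanup when n is not a multiple of 3 */
        Y[i] = a * X[i] + Y[i];
}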
