CSA HW 4
Answer_b)
Clock Cycle Instruction
1 addi x4, x1, #800 // Set up loop upper bound
2 foo: fld F2, 0(x1) // Load X(i) → F2 (loop entry, target of the bnez in cycle 19)
3 fld F6, 0(x2) // Load Y(i) → F6 (parallel load)
4 fmul.d F4, F2, F0 // F4 = a * X(i)
5 fld F2, 8(x1) // Load X(i+1) → F2 for next iteration
6 fld F10, 8(x2) // Load Y(i+1) → F10 (parallel load)
7 fmul.d F8, F2, F0 // F8 = a * X(i+1)
8 fld F2, 16(x1) // Load X(i+2) → F2 for next iteration
9 fld F14, 16(x2) // Load Y(i+2) → F14 (parallel load)
10 fmul.d F12, F2, F0 // F12 = a * X(i+2)
11 fadd.d F6, F4, F6 // F6 = a * X(i) + Y(i) (complete the first iteration)
12 addi x1, x1, #24 // Increment X index for the next 3 iterations
13 fadd.d F10, F8, F10 // F10 = a * X(i+1) + Y(i+1)
14 addi x2, x2, #24 // Increment Y index for the next 3 iterations
15 sltu x3, x1, x4 // Test if X(i+3) < X upper bound for the loop
16 fadd.d F14, F12, F14 // F14 = a * X(i+2) + Y(i+2)
17 fsd F6, -24(x2) // Store the result Y(i) = a * X(i) + Y(i)
18 fsd F10, -16(x2) // Store the result Y(i+1) = a * X(i+1) + Y(i+1)
19 bnez x3, foo // Branch if the loop continues
20 fsd F14, -8(x2) // Store the result Y(i+2) = a * X(i+2) + Y(i+2)
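As a cross-check of what the schedule computes, the following is a minimal Python sketch of the same three-way unrolled update Y(i) = a * X(i) + Y(i). The function name `daxpy_unrolled3` and the cleanup loop for element counts that are not a multiple of 3 are assumptions added for illustration; the assembly listing above does not include a cleanup step.

```python
def daxpy_unrolled3(a, x, y):
    """Update y[i] = a * x[i] + y[i] in place, three elements per pass."""
    n = len(x)
    i = 0
    # Main body: one pass corresponds to one trip through the unrolled loop,
    # with three independent multiply/add chains (elements i, i+1, i+2).
    while i + 3 <= n:
        t0 = a * x[i]          # like fmul.d F4
        t1 = a * x[i + 1]      # like fmul.d F8
        t2 = a * x[i + 2]      # like fmul.d F12
        y[i] = t0 + y[i]           # like fadd.d F6 and its store
        y[i + 1] = t1 + y[i + 1]   # like fadd.d F10 and its store
        y[i + 2] = t2 + y[i + 2]   # like fadd.d F14 and its store
        i += 3                 # like the two addi pointer increments of 24 bytes
    # Cleanup loop (assumption, not in the listing): handle leftover elements.
    while i < n:
        y[i] = a * x[i] + y[i]
        i += 1
    return y
```

Because the three chains share no destination values, their loads, multiplies, and adds can be interleaved freely, which is what the schedule relies on.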
The unrolled loop completes 3 iterations in 20 clock cycles.
Execution Time Per Element = 20 cycles / 3 elements ≈ 6.67 cycles per element.
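The per-element figure is simple arithmetic on the cycle count and the unroll factor. A small Python sketch, with the helper name `cycles_per_element` chosen here purely for illustration:

```python
def cycles_per_element(cycles_per_pass, elements_per_pass):
    """Average clock cycles spent per array element in one unrolled pass."""
    return cycles_per_pass / elements_per_pass


# Values taken from the schedule: 20 cycles for 3 elements.
result = cycles_per_element(20, 3)
print(f"{result:.2f} cycles per element")  # prints "6.67 cycles per element"
```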
• The loop must be unrolled three times to eliminate stalls and minimize loop overhead.
• The execution time per element after unrolling and scheduling is approximately 6.67
cycles per element.
Therefore, unrolling the loop three times spreads the loop overhead (the two addi pointer
updates, sltu, and bnez) over three elements and lets independent loads, multiplies, and adds
from different iterations fill cycles that would otherwise be stalls, which lowers the
execution time per element.