EENG/CSCI 641 Computer Architecture 1
Vector Code Example
Name:
Grade:
Example:
Consider this piece of C code:
for (i=0; i< 128; i++)
{
z[i] = a*x[i] + y[i];
}
a) Develop the MIPS scalar assembly code for this C code.
        L.D     F0,a            ; load scalar a
        DADDUI  R10,R0,128      ; 128 elements to process
        DADDUI  R1,R0,1000      ; load address of array x
        DADDUI  R2,R0,2000      ; load address of array y
        DADDUI  R3,R0,3000      ; load address of array z
Loop:   L.D     F1,0(R1)        ; load x[i]
        MUL.D   F2,F1,F0        ; scalar-scalar multiply
        L.D     F3,0(R2)        ; load y[i]
        ADD.D   F4,F2,F3        ; add
        S.D     F4,0(R3)        ; store the result
        DADDUI  R1,R1,8         ; increment array pointer for x[]
        DADDUI  R2,R2,8         ; increment array pointer for y[]
        DADDUI  R3,R3,8         ; increment array pointer for z[]
        DADDUI  R10,R10,-1      ; decrement loop counter
        BNEZ    R10,Loop        ; branch if R10 != zero
b) Develop the VMIPS assembly code for this C code.
        L.D      F0,a           ; load scalar a
        DADDUI   R10,R0,128     ; 128 elements to process
        DADDUI   R1,R0,1000     ; load address of array x
        DADDUI   R2,R0,2000     ; load address of array y
        DADDUI   R3,R0,3000     ; load address of array z
Loop:   LV       V1,R1          ; load vector X
        MULVS.D  V2,V1,F0       ; vector-scalar multiply
        LV       V3,R2          ; load vector Y
        ADDVV.D  V4,V2,V3       ; add
        SV       R3,V4          ; store the result
        DADDUI   R1,R1,16*8     ; increment array pointer for x[] (16 elements x 8 bytes)
        DADDUI   R2,R2,16*8     ; increment array pointer for y[]
        DADDUI   R3,R3,16*8     ; increment array pointer for z[]
        DADDUI   R10,R10,-16    ; decrement loop counter
        BNEZ     R10,Loop       ; branch if R10 != zero
c) How many cycles does it take to execute the scalar code, assuming no memory latencies?
Instruction                                                               Number of times executed
        L.D     F0,a            ; load scalar a                           1
        DADDUI  R10,R0,128      ; 128 elements to process                 1
        DADDUI  R1,R0,1000      ; load address of array x                 1
        DADDUI  R2,R0,2000      ; load address of array y                 1
        DADDUI  R3,R0,3000      ; load address of array z                 1
Loop:   L.D     F1,0(R1)        ; load x[i]                               128
        MUL.D   F2,F1,F0        ; scalar-scalar multiply                  128
        L.D     F3,0(R2)        ; load y[i]                               128
        ADD.D   F4,F2,F3        ; add                                     128
        S.D     F4,0(R3)        ; store the result                        128
        DADDUI  R1,R1,8         ; increment array pointer for x[]         128
        DADDUI  R2,R2,8         ; increment array pointer for y[]         128
        DADDUI  R3,R3,8         ; increment array pointer for z[]         128
        DADDUI  R10,R10,-1      ; decrement loop counter                  128
        BNEZ    R10,Loop        ; branch if R10 != zero                   127*2 + 1*1
The branch is counted as 2 cycles each of the 127 times it is taken and 1 cycle the one time it falls through.
Total number of instruction cycles = 1(1+1+1+1+1) + 128(9) + 127*2 + 1 = 1412
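The tally above can be checked with a short C sketch. It assumes the counting model implied by the table: every instruction costs one cycle, except that a taken branch costs two. The function name and its parameters are illustrative and do not come from the handout.

```c
#include <assert.h>

/* Cycle count for the scalar loop under the handout's simple model.
 * Assumption: every instruction takes 1 cycle, except a taken branch,
 * which takes 2 (read off the "127*2 + 1*1" entry in the table). */
int scalar_cycles(int n_elements, int setup_instrs, int body_instrs)
{
    assert(n_elements > 0);
    int iterations = n_elements;               /* one element per iteration */
    int branch = (iterations - 1) * 2 + 1;     /* taken n-1 times, falls through once */
    return setup_instrs + iterations * body_instrs + branch;
}
```

Here `setup_instrs` is the number of instructions before the loop (5 in this example) and `body_instrs` is the number of loop instructions other than the branch (9 in this example), so `scalar_cycles(128, 5, 9)` reproduces the total of 1412.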
d) How many cycles does it take to execute the vector code, assuming no memory latencies?
Instruction                                                                Number of times executed
        L.D      F0,a           ; load scalar a                            1
        DADDUI   R10,R0,128     ; 128 elements to process                  1
        DADDUI   R1,R0,1000     ; load address of array x                  1
        DADDUI   R2,R0,2000     ; load address of array y                  1
        DADDUI   R3,R0,3000     ; load address of array z                  1
Loop:   LV       V1,R1          ; load vector X                            8
        MULVS.D  V2,V1,F0       ; vector-scalar multiply                   8
        LV       V3,R2          ; load vector Y                            8
        ADDVV.D  V4,V2,V3       ; add                                      8
        SV       R3,V4          ; store the result                         8
        DADDUI   R1,R1,16*8     ; increment array pointer for x[]          8
        DADDUI   R2,R2,16*8     ; increment array pointer for y[]          8
        DADDUI   R3,R3,16*8     ; increment array pointer for z[]          8
        DADDUI   R10,R10,-16    ; decrement loop counter                   8
        BNEZ     R10,Loop       ; branch if R10 != zero                    7*2 + 1*1
The loop runs 128/16 = 8 times, and each vector instruction is counted as a single cycle.
Total number of instruction cycles = 1(1+1+1+1+1) + 8(9) + 7*2 + 1 = 92
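The same count can be written as a small C sketch with the vector length as a parameter. It keeps the assumptions used in the table: one cycle per instruction (vector instructions included), two cycles for a taken branch, and an element count that is an exact multiple of the vector length. The function name and parameters are illustrative, not part of the handout.

```c
#include <assert.h>

/* Cycle count for the vector loop under the handout's simple model.
 * Assumptions: each instruction, vector or scalar, takes 1 cycle;
 * a taken branch takes 2; n_elements is a multiple of vector_len. */
int vector_cycles(int n_elements, int vector_len,
                  int setup_instrs, int body_instrs)
{
    assert(n_elements > 0 && vector_len > 0);
    assert(n_elements % vector_len == 0);
    int iterations = n_elements / vector_len;  /* vector_len elements per iteration */
    int branch = (iterations - 1) * 2 + 1;     /* taken all but the last time */
    return setup_instrs + iterations * body_instrs + branch;
}
```

With 5 setup instructions and 9 non-branch loop instructions, `vector_cycles(128, 16, 5, 9)` gives the 92 cycles computed above. Only `vector_len` needs to change to explore other vector lengths.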
e) What is the speedup?
1412/92 ≈ 15.35
f) What would be the speedup if the vector length is 32?
This is left for you to work out. Remember that each loop iteration now processes 32 elements instead of 16.
Exercise:
Repeat the previous example for this piece of C code:
for (i=0; i< 128; i++)
{
z[i] = (x[i] + y[i]) * w[i];
}
a) Develop the MIPS scalar assembly code for this C code.
        DADDUI  R10,R0,128      ; 128 elements to process
        DADDUI  R1,R0,1000      ; load address of array x
        DADDUI  R2,R0,2000      ; load address of array y
        DADDUI  R3,R0,3000      ; load address of array z
        DADDUI  R4,R0,4000      ; load address of array w
Loop:   L.D     F1,0(R1)        ; load x[i]
        L.D     F2,0(R2)        ; load y[i]
        L.D     F4,0(R4)        ; load w[i]
        ADD.D   F3,F1,F2        ; add x[i] and y[i]
        MUL.D   F5,F3,F4        ; z[i] = (x[i] + y[i]) * w[i]
        S.D     F5,0(R3)        ; store the result
        DADDUI  R1,R1,8         ; increment array pointer for x[]
        DADDUI  R2,R2,8         ; increment array pointer for y[]
        DADDUI  R3,R3,8         ; increment array pointer for z[]
        DADDUI  R4,R4,8         ; increment array pointer for w[]
        DADDUI  R10,R10,-1      ; decrement loop counter
        BNEZ    R10,Loop        ; branch if R10 != zero
b) Develop the VMIPS assembly code for this C code.
        DADDUI   R10,R0,128     ; 128 elements to process
        DADDUI   R1,R0,1000     ; load address of array x
        DADDUI   R2,R0,2000     ; load address of array y
        DADDUI   R3,R0,3000     ; load address of array z
        DADDUI   R4,R0,4000     ; load address of array w
Loop:   LV       V1,R1          ; load vector X
        LV       V2,R2          ; load vector Y
        LV       V4,R4          ; load vector W
        ADDVV.D  V3,V1,V2       ; vector-vector add
        MULVV.D  V5,V3,V4       ; vector-vector multiply
        SV       R3,V5          ; store the result
        DADDUI   R1,R1,16*8     ; increment array pointer for x[]
        DADDUI   R2,R2,16*8     ; increment array pointer for y[]
        DADDUI   R3,R3,16*8     ; increment array pointer for z[]
        DADDUI   R4,R4,16*8     ; increment array pointer for w[]
        DADDUI   R10,R10,-16    ; decrement loop counter
        BNEZ     R10,Loop       ; branch if R10 != zero
c) How many cycles does it take to execute the scalar code, assuming no memory latencies?
Instruction                                                               Number of times executed
Total number of instruction cycles =
d) How many cycles does it take to execute the vector code, assuming no memory latencies?
Instruction                                                               Number of times executed
Total number of instruction cycles =
e) What is the speedup?
f) What would be the speedup if the vector length is 32?