Chapter 03
Chapter 03
Chapter 03
5 D
Edition
th
Chapter 3
Arithmetic for
Computers
Operations on integers
3.1 Introduction
Integer Addition
Example: 7 + 6
Integer Subtraction
Saturating operations
3.3 Multiplication
Multiplication
multiplicand
multiplier
product
1000
1001
1000
0000
0000
1000
1001000
Length of product is
the sum of operand
lengths
Multiplication Hardware
Initially 0
Optimized Multiplier
Faster Multiplier
Cost/performance tradeoff
Can be pipelined
MIPS Multiplication
Instructions
mult rs, rt
multu rs, rt
mfhi rd
mflo rd
quotient
dividend
divisor
1001
1000 1001010
-1000
10
101
1010
-1000
10
remainder
Restoring division
Otherwise
3.4 Division
Division
Signed division
Division Hardware
Initially divisor
in left half
Initially dividend
Optimized Divider
Faster Division
MIPS Division
Instructions
2.34 1056
+0.002 104
+987.02 109
normalized
not normalized
In binary
Floating Point
1.xxxxxxx2 2yyyy
S Exponent
single: 23 bits
double: 52 bits
Fraction
Single-Precision Range
Exponent: 00000001
actual exponent = 1 127 = 126
Fraction: 00000 significand = 1.0
1.0 2126 1.2 1038
Largest value
exponent: 11111110
actual exponent = 254 127 = +127
Fraction: 11111 significand 2.0
2.0 2+127 3.4 10+38
Chapter 3 Arithmetic for Computers 20
Double-Precision Range
Exponent: 00000000001
actual exponent = 1 1023 = 1022
Fraction: 00000 significand = 1.0
1.0 21022 2.2 10308
Largest value
Exponent: 11111111110
actual exponent = 2046 1023 = +1023
Fraction: 11111 significand 2.0
2.0 2+1023 1.8 10+308
Chapter 3 Arithmetic for Computers 21
Floating-Point Precision
Relative precision
Floating-Point Example
Represent 0.75
S=1
Fraction = 1000002
Exponent = 1 + Bias
Single: 101111110100000
Double: 101111111110100000
Chapter 3 Arithmetic for Computers 23
Floating-Point Example
S=1
Fraction = 01000002
Fxponent = 100000012 = 129
Floating-Point Addition
2. Add significands
1.0015 102
1.002 102
Chapter 3 Arithmetic for Computers 27
Floating-Point Addition
2. Add significands
FP Adder Hardware
Can be pipelined
FP Adder Hardware
Step 1
Step 2
Step 3
Step 4
FP Arithmetic Hardware
Can be pipelined
Chapter 3 Arithmetic for Computers 33
FP Instructions in MIPS
FP hardware is coprocessor 1
Separate FP registers
FP Instructions in MIPS
Single-precision arithmetic
Double-precision arithmetic
bc1t, bc1f
FP Example: F to C
C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
fahr in $f12, result in $f0, literals in global memory
space
$f16,
$f18,
$f16,
$f18,
$f18,
$f0,
$ra
const5($gp)
const9($gp)
$f16, $f18
const32($gp)
$f12, $f18
$f16, $f18
Chapter 3 Arithmetic for Computers 36
X=X+YZ
C code:
void mm (double x[][],
double y[][], double z[][]) {
int i, j, k;
for (i = 0; i! = 32; i = i + 1)
for (j = 0; j! = 32; j = j + 1)
for (k = 0; k! = 32; k = k + 1)
x[i][j] = x[i][j]
+ y[i][k] * z[k][j];
}
Addresses of x, y, z in $a0, $a1, $a2, and
i, j, k in $s0, $s1, $s2
Chapter 3 Arithmetic for Computers 37
MIPS code:
li
li
L1: li
L2: li
sll
addu
sll
addu
l.d
L3: sll
addu
sll
addu
l.d
$t1, 32
$s0, 0
$s1, 0
$s2, 0
$t2, $s0, 5
$t2, $t2, $s1
$t2, $t2, 3
$t2, $a0, $t2
$f4, 0($t2)
$t0, $s2, 5
$t0, $t0, $s1
$t0, $t0, 3
$t0, $a2, $t0
$f16, 0($t0)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
Accurate Arithmetic
Subword Parallellism
x86 FP Architecture
x86 FP Instructions
Data transfer
Arithmetic
Compare
Transcendental
FILD mem/ST(i)
FISTP mem/ST(i)
FLDPI
FLD1
FLDZ
FIADDP
FISUBRP
FIMULP
FIDIVRP
FSQRT
FABS
FRNDINT
FICOMP
FIUCOMP
FSTSW AX/mem
FPATAN
F2XMI
FCOS
FPTAN
FPREM
FPSIN
FYL2X
mem/ST(i)
mem/ST(i)
mem/ST(i)
mem/ST(i)
Optional variations
I: integer operand
P: pop operand from stack
R: reverse operand order
But not all combinations allowed
Chapter 3 Arithmetic for Computers 43
Single-Instruction Multiple-Data
Unoptimized code:
Matrix Multiply
1.
2.
3.
4.
5.
6.
Matrix Multiply
Optimized C code:
1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4. for ( int i = 0; i < n; i+=4 )
5.
for ( int j = 0; j < n; j++ ) {
6.
__m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i]
[j] */
7.
for( int k = 0; k < n; k++ )
8.
c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
9.
_mm256_mul_pd(_mm256_load_pd(A+i+k*n),
10.
_mm256_broadcast_sd(B+k+j*n)));
11.
_mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
12. }
13. }
Matrix Multiply
1. vmovapd (%r11),%ymm0
# Load 4 elements of C into %ymm0
2. mov %rbx,%rcx
# register %rcx = %rbx
3. xor %eax,%eax
# register %eax = 0
4. vbroadcastsd (%rax,%r8,1),%ymm1 # Make 4 copies of B element
5. add $0x8,%rax
# register %rax = %rax + 8
6. vmulpd (%rcx),%ymm1,%ymm1 # Parallel mul %ymm1,4 A elements
7. add %r9,%rcx
# register %rcx = %rcx + %r9
8. cmp %r10,%rax
# compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0
10. jne 50 <dgemm+0x50>
# jump if not %r10 != %rax
11. add $0x1,%esi
# register % esi = % esi + 1
12. vmovapd %ymm0,(%r11)
# Store %ymm0 into 4 C elements
Matrix Multiply
Associativity
Concluding Remarks
Concluding Remarks
MIPS ISA