Ass 1
Ass 1
Ass 1
l of Mechanical and Aerospace Engineering Nanyang Technological University, Singapore 6 February 2012
Q1a. Floating Point Number Floating point is a method of representing real numbers on computer in a way that can support a wider range of values compared to that of fixed point and integer representation. In floating point, numbers are approximately represented to a fixed number of significant digits and scaled using an exponent. In computer, the base for the scaling is normally inherent to be 2. Although in other applications, it can be modified (e.g. to 10 or 16). It is called floating point due to the fact that the decimal point, or more commonly known in computers as binary point can "float". Float means it can be placed anywhere relative to the significant digits of the number. This position is indicated in the internal representation. Therefore, floating point representation (x) can be thought of as a computer realization of scientific notation (s). floating point representation : x = m bn scientific notation : s = m 10n, 1 m <10 Since the 1990s, the most commonly encountered floating point representation (x) is that defined by the IEEE 754 Standard, which uses base 2 (b=2): x = m 2n, 0.5 m < 1 Q1b. Mantissa and Exponent An integer mantissa (m) is used to store the represented significant digits. Length of the mantissa determines the precision, i.e. the maximum number of significant digit to which numbers can be represented. The decimal point position is assumed to always be somewhere within the mantissa. An integer exponent, also referred to as the characteristic or scale, is to modify the magnitude of the number by adjusting position of the decimal point: to the right if the exponent is positive or to the left if the exponent is negative.
Page 1 of 8
Both of them are necessary as to store, and later to derive, value of the floating point number. One must multiply the mantissa by the base raised to the power of the exponent. The base, however, needs not to be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred. Q1c. Normalized Representation Following IEEE 754 (IEEE Standard for Floating Point Arithmetic) the absolute value of mantissa, m is to be in fact greater than or equal to 0.12 (=0.5) and less than 12 (=1). This standard is given to have unique mantissa value for each floating point number representation. Unique mantissa value leads to unique floating point representation. For example, without restriction in value of the mantissa, 0.25 can be represented in several ways as follows: 0.25 = 0.5 2-1 (a) -4.3219 0.25 = 5 2 (b) -7.6439 0.25 = 502 (c)
Obj100
Therefore, by imposing limitations within one order magnitude of the base (between 2-1 and 20) only representation (a) can be implemented. This guarantees unique and uniform implementation among different computing machines. Another of its beauty is that it has no non-zero digits to the left of the decimal point and a non-zero digit just to the right of the decimal point, i.e. everything is inherently implemented in binary form as: x = 2n 0 . XX2 only XX2 is to be stored in mantissa Using this background, the answer for the given question is straight forward. The smallest number of mantissa for N-bit representation is 1000000, which essentially is 0.12=0.5. However, it is essential to take note that, despite violating the upper and lower limits of the mantissa, 0 is represented uniquely in IEEE 754 using m = 0 and n = 0.
Q2a. Sign Bit Representation Sign bit allocates one sign bit to represent the sign. The allocated bit, which is often the most significant bit, is normally assigned to 0 for a positive number, and to 1 for a negative number. The remaining bits indicate the magnitude, or the absolute value. Therefore, in a byte (8-bit) representation, apart from the sign bit, only 7 bits determines the precision, from 0000000 (0) to 1111111 (127). In other words, together with the sign bit, it can only represent numbers from 127
Page 2 of 8
to +127. A consequence of this representation is that there are two ways to represent zero, 00000000 (0) and 10000000 (0). Using 8-bit representation, 78 is represented as 11001110. Q2b. Ones Complement Representation Ones complement representation is obtained by reversing all the bits in the binary representation of the number, changing all the bits that are 1 to 0 and all the bits that are 0 to 1. Similar to sign bit representation, the range of signed numbers using ones complement is represented by (2N11) to (2N11) and 0. It can also be inferred that the most significant bit indirectly indicates sign of the represented number. For example, using 8-bit representation, 78 is represented as 01001110 and its ones complement 10110001 represents 78. Q2c. Twos Complement Representation In twos complement representation, a number is converted from positive to negative or vice versa by computing its ones complement and adding one. In other words, negative numbers are represented by the bit pattern which is one greater (in an unsigned sense) than the ones complement of the positive value. An N-bit twos complement numeral system can represent every integer in the range 2N1 to 2N11. Clearly, there will be only single representation of zer, obviating the negative zero. Ones complement of 78 is 10110001. Adding one to obtain its twos complement, 78 can be represented as 10110010. An easier alternative to get the negation of a number in twos complement is as follows: 1. Starting from the right of the positive number, find the first 1: 01001110 2. Invert all of the bits to the left of that one: 10110010
Q3a. Range of Representation i. 16-bit integer Excluding the 0, minimum positive number that can be represented using 16-bit integer is 1, while the maximum one is 2161=65,536. ii. 16-bit floating point with 12-bit mantissa and 4-bit exponent Assuming the 4-bit exponent is signed-bit exponent (using representation described in Q2a), the approximate range that can be expressed by using this exponent is 27 to 27. Using representation described in answer for Q1c for the mantissa (0.5 m < 1), the minimum positive number is therefore 0.12 27 = 0.00390625. The maximum one is 0.1111111111112 27.
Page 3 of 8
Assuming the 4-bit exponent is unsigned-bit exponent, but using representation described in Q2c to be able to represent negative exponent, the approximate range that can be expressed by using this exponent is 28 to 27. Using representation described in answer for Q1c for the mantissa, the minimum positive number is therefore 0.12 28 = 0.001953125. The maximum one is still 0.1111111111112 27. Assuming the 4-bit exponent is unsigned-bit exponent, but using IEEE Short Real Exponents to be able to represent negative exponent, the approximate range that can be expressed by using this exponent is 27 to 28. Using representation described in answer for Q1c for the mantissa, the minimum positive number is therefore 0.12 27 = 0.00390625. The maximum one is now 0.1111111111112 28. Assuming the 4-bit exponent is unsigned-bit exponent, which does not represent any negative exponent, the approximate range that can be expressed by using this exponent is 20 to 215. Using representation described in answer for Q1c for the mantissa, the minimum positive number is therefore 0.12 20 = 0.5. The maximum one is 0.1111111111112 215. Q3b. Minimum difference The smallest possible difference using two such float point numbers is again dependent upon the type of 4-bit exponent. The minimum differences are summarized in the table below: Range of 4-bit exponent 27 to 27 28 to 27 27 to 28 20 to 215 Minimum Difference 0.0000000000012 27 1.9073486328125 106 0.0000000000012 28 9.5367431640625 107 0.0000000000012 27 1.9073486328125 106 0.0000000000012 2.44140625 104
Q3c. Reliable Significant Digit The number of reliable significant can be assessed based on its precision, i.e. length of its mantissa. As this number representation has 12-bit mantissa, 0.XXXXXXXXXXXX, X can be 0 or 1, the reliable significant digit in decimal is 12 log10(2) 3.6124. Therefore, only 3 significant digits are reliable. Q4. Epsilon Test and Its Alternatives
Page 4 of 8
Floating point is very often not an exact representation of one value. Simple value such as 0.1 cannot be precisely represented using binary floating point number because of the limited precision. If one does some calculation and then compares the results against some expected value, it is highly unlikely that the intended result will be obtained. In other words, if a calculation is performed where a is the result from calculation and b is the expected result, then this comparison:
if (a==b), then return a and b are equal is unlikely to be true. It is important to highlight that either one of a or b comes from calculation process, while another is defined as the expected variable. If both a and b come from variable definitions, e.g. it is defined that a=0.1 and b=0.1, than the above statement must always be true.
Since floating point calculations involve certain degree of uncertainty, one can try to give some consideration in evaluating if two numbers are so close to each other that they can be considered equal. This consideration can be reflected in a maximum difference/error between the two variables, defined as epsilon, so that:
if fabs(a-b)<epsilon, then return a and b are equal epsilon is normally defined to be a very small positive number, e.g. 0.001.
There are several issues with performing epsilon test using absolute error value, specifically pertaining to how large or small the epsilon value should be, relative to the range of the two compared values: The epsilon value may be relatively not small enough compared to the a and b values. For example, if epsilon is defined to be 0.001, then the epsilon test is reasonable if a and b are about between 1 to 10. However, it a and b are about 0.01 or less, the epsilon test can be misleading. The epsilon value may be relatively very small compared to a and b and it can be so small that the finite precision of the floats may not be able to represent such small value. For example, one does a calculation that has an expected answer of 10,000. Because of limitation in the floating number representation, one may not get an answer of exactly 10,000. In a considerably good approximation, the calculation may be off by just one or two in the least significant bits. Assuming 4-byte float is used and it gets off by only one in the least significant bit. Instead of 10,000 one gets 10,000.000977. Absolute difference between the two numbers is 0.000977, which is 97.7 times larger than epsilon. The epsilon test will conclude that the numbers are not nearly equal, even though they are in fact adjacent floats. Thus, using very small epsilon, the test can be meaningless, as no pair of values will be considered equal. Alternatively, a more generic way of comparing two numbers that works regardless of their range, can be performed by checking the relative error. Relative error is measured by comparing the difference between the compared numbers to, let say, the expected number b. The relative error can then be compared with maximum relative error, epsilon_relative to evaluate the equality:
relative_error=fabs((a-b)/a) if relative_error<epsilon_relative, then return a and b are equal
Page 5 of 8
However, there are several issues with this method. Firstly, if a and b are happen to be both zero then the relative_error calculation will calculate 0.0/0.0. Zero divided by zero is undefined. Secondly, the relative error can be measured by comparing the error to the expected number (i.e. b) or to the calculated number (i.e. a). In other word, there is ambiguity in choosing the divisor/denominator in the relative_error calculation. This can be improved by having the calculation always divided by the smaller number so that the largest possible relative error is considered. Thirdly, due to the usage of absolute difference, the relative_error calculation will behave poorly for numbers around zero. The positive number closest to zero (e.g. 0.001) and the negative number closest to zero are very close to each other (e.g. -0.001), yet the calculation will calculate that both of them have a huge relative error of 200%. Therefore, especially for numbers that are near zero but of opposite sign, calculation of absolute difference is still necessary. Based on these three issues, the following hybrid algorithm, combining absolute and relative difference/error, is proposed:
if fabs(a-b)<epsilon, then return a and b are equal if fabs(b)> fabs(a), then relative_error=fabs((a-b)/a) else relative_error=fabs((a-b)/b) if relative_error<epsilon_relative, then return a and b are equal
Q5. Catastrophic Cancellation Catastrophic cancellation can be understood as devastating loss of precision when small numbers are resulted from operation involving large number, while the large numbers themselves are subject to round-off error. For example, if a and b agree in all but the last few digits, then if one computes e = a-b, then e may only have a few significant digits, much fewer than that of a and b, i.e. accuracy of the result is catastrophically reduced. Assuming a=123.40 (5 significant digits) and b=123.35 (also 5 significant digits), the difference e = a-b=0.05 has only 1 significant digit. If e is subsequently used in calculations, the result may only have a very few digits of accuracy. Moreover, due to limited precision in floating number representation, the real difference between a and b may not be correctly reflected in e. Assuming a is represented in floating number by af with associated round-off error a due to limitation in precision so that af =a (1+a). Similarly, for b, bf =b(1+b). Error in subtracting a and b can be calculated as:
Obj101
Obj102
Page 6 of 8
Therefore, as a and b get very close to each other, the subtraction error b increases. This becomes prominent as the subtraction result e is of the same order of the round-off error a and/or b. Error in multiplication of a with itself (quadratic function) can be represented as:
Obj103
Obj104
Obj106
it indicates that the error increases as the order increases. Therefore, if one would like to calculate en = an-bn, for a and b are very close, using limited (e.g. 5) significant digits, the calculation error c can grow significantly as n increases. Q6. Evaluation of Series in Computer Sine function as sum of power series, where x is in radian:
Obj107
For small values of x, only the first few terms are needed. However, as x approaches /2, more terms are needed. While it is possible to use the series for the entire range from - to , it is practical to limit the range to -/2 < x < /2. For angles outside this range, one can use the periodicity and symmetry of sine function to perform extrapolation. The problem with direct evaluation of this series on a computer is the possibility of catastrophic cancellation as the series numerically involve subtractions among the increasing order terms. The higher order terms are getting closer to each other and the subtractions will lead to increase in error as discussed in Q5. Similarly, that behavior is observed in the binomial expansion below:
Page 7 of 8
Obj108
Obj110
In this form, only x (for binomial expansion) or x2 (for sine function) needs to be calculated, so that effect of the error associated to subtraction among high order multiplication can be significantly reduced.
Page 8 of 8