Difference between revisions of "Numeric Limits/en"

From expecco Wiki (Version 2.x)
Latest revision as of 20:12, 29 November 2019

Integer Arithmetic

Expecco supports arbitrary precision integer arithmetic, arbitrary precision fractions and limited precision floating point numbers.

For integer operations, there is no overflow or error in the result for any legal operation. I.e. operations on two big numbers deliver a correct result:

4294967295 (0xFFFFFFFF) + 1 -> 4294967296 (0x100000000)
18446744073709551615 (0xFFFFFFFFFFFFFFFF) + 1 -> 18446744073709551616 (0x10000000000000000)
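As a cross-check outside of expecco, the two additions above can be reproduced in Python, whose built-in integers are also arbitrary precision. This is only an illustrative sketch and not expecco code; the variable names are made up for the example.

```python
# Python integers grow as needed, so these sums do not wrap around
# at the 32 bit or 64 bit boundary.
max_u32 = 0xFFFFFFFF
max_u64 = 0xFFFFFFFFFFFFFFFF

after_u32 = max_u32 + 1
after_u64 = max_u64 + 1

print(after_u32, hex(after_u32))
print(after_u64, hex(after_u64))
```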


When dividing integers, the "/" operator will deliver an exact result, possibly as a fraction:

5 / 3 -> 5/3

and the truncating division "//" will deliver an integer, truncated towards negative infinity (i.e. the next smaller integer):

5 // 3 -> 1
-5 // 3 -> -2

The modulo operator "\\" provides the remainder, such that:

(a // b) * b + (a \\ b) = a

There is also a truncating division operator which truncates towards zero, and a corresponding remainder operator, for which:

(a quo: b) * b + (a rem: b) = a

For positive a and b, the two operator pairs deliver the same result.
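The difference between the two operator pairs only shows up with negative operands. The following Python sketch (not expecco code) mirrors them: Python's own "//" and "%" also round towards negative infinity, while the helper trunc_div_rem is a hypothetical stand-in for "quo:" and "rem:".

```python
def floor_div_mod(a, b):
    # Quotient rounded towards negative infinity, with matching remainder.
    return a // b, a % b

def trunc_div_rem(a, b):
    # Quotient rounded towards zero, with matching remainder.
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    r = a - q * b
    return q, r

for a, b in [(5, 3), (-5, 3)]:
    fq, fr = floor_div_mod(a, b)
    tq, tr = trunc_div_rem(a, b)
    # Both pairs reconstruct a from quotient * b + remainder.
    print(a, b, (fq, fr), (tq, tr), fq * b + fr == a, tq * b + tr == a)
```

For 5 and 3 both pairs give quotient 1 and remainder 2; for -5 and 3 the flooring pair gives -2 and 1, the truncating pair -1 and -2.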


In addition, the usual ceiling, floor and rounding operations are available (both on fractions and on limited precision reals):

(5 / 3) ceiling -> 2
(5 / 3) floor -> 1
(5 / 3) rounded -> 2
(5 / 3) roundTo: 0.1  -> 1.7
(5 / 3) roundTo: 0.01  -> 1.67
(-5 / 3) ceiling -> -1
(-5 / 3) floor -> -2
(-5 / 3) rounded -> -2
(-5 / 3) roundTo: 0.1 -> -1.7
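The same values can be reproduced with exact fractions in Python. This is a sketch for illustration only: round_to is a hypothetical helper standing in for "roundTo:", and it is assumed here that ties are rounded towards positive infinity, which may differ from expecco's rule for exact halves.

```python
from fractions import Fraction
import math

def round_to(value, step):
    # Round value to the nearest multiple of step, using exact fractions.
    # Assumption: exact halves are rounded towards positive infinity.
    quotient = Fraction(value) / Fraction(step)
    return math.floor(quotient + Fraction(1, 2)) * Fraction(step)

x = Fraction(5, 3)
print(math.ceil(x), math.floor(x), round(x))
print(float(round_to(x, Fraction(1, 10))))    # nearest tenth
print(float(round_to(x, Fraction(1, 100))))   # nearest hundredth
print(math.ceil(-x), math.floor(-x), round(-x))
print(float(round_to(-x, Fraction(1, 10))))
```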

Inexact Float and Double Numbers

Floating point numbers are inherently inexact. This is not a problem specific to expecco, but inherent to the way floating point numbers are represented (see https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html): many numbers cannot be represented exactly as a sum of powers of 2, so the stored floating point number has an error in the last significant bit.

Such errors accumulate with every math operation performed on the number. For example, the decimal 0.1 cannot be represented exactly as a floating point number; the stored value is only close to 0.1 (as an IEEE double it is about 0.1000000000000000055), with an error in the last bit. Adding it multiple times results in a larger error in the final result:

1.0 - (0.1 + 0.1 + 0.1 ...(10 times)... + 0.1) -> 1.11022302462516E-16

The print functions will usually try to compensate for an error in the last bit, showing "0.1" although the stored value is not exactly 0.1.

No such rounding errors are encountered if fractions are used:

1 - ((1/10) + (1/10) + (1/10) ...(10 times)... + (1/10)) -> 0
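The effect is not specific to expecco and can be reproduced on any system using IEEE doubles. The Python sketch below (illustrative only, with made-up variable names) adds 0.1 ten times as a float and as an exact fraction.

```python
from fractions import Fraction

# Ten additions of the double closest to 0.1; the error accumulates.
float_sum = 0.0
for _ in range(10):
    float_sum += 0.1
float_error = 1.0 - float_sum

# The same computation with exact fractions leaves no error.
exact_sum = Fraction(0)
for _ in range(10):
    exact_sum += Fraction(1, 10)
exact_error = 1 - exact_sum

print(float_error)   # a tiny non-zero value
print(exact_error)   # 0
```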

Be aware of such errors, and do not compare floating point numbers for equality or inequality. Instead, use range comparisons or the special "compare-almost-equal" functions, where the number of bits of acceptable error can be specified. Expecco provides such functions both for elementary code and in the standard block library.
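The idea behind an almost-equal comparison can be sketched in Python as follows. The helper almost_equal and its error_bits parameter are hypothetical and are not expecco's functions; the sketch assumes 53 bit mantissas (IEEE doubles) and treats the lowest error_bits bits as noise by turning them into a relative tolerance.

```python
import math

def almost_equal(a, b, error_bits=2):
    # Ignore differences in the lowest error_bits of a 53 bit mantissa.
    rel_tol = 2.0 ** (error_bits - 53)
    return math.isclose(a, b, rel_tol=rel_tol)

def in_range(value, low, high):
    # Plain range comparison as an alternative to equality.
    return low <= value <= high

total = sum([0.1] * 10)

print(total == 1.0)                    # exact comparison fails
print(almost_equal(total, 1.0))        # tolerant comparison succeeds
print(in_range(total, 0.999, 1.001))   # range comparison succeeds
```

A purely relative tolerance does not work for values very close to zero; math.isclose accepts an additional abs_tol argument for that case.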

Also, for this reason, do not compute money values using floats or doubles. Instead, use instances of FixedPoint, which are fractions with a power-of-10 denominator and which compute exact results. You will lose a cent/penny here and there if you use floats/doubles on big budgets.
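To illustrate why decimal fixed point arithmetic is preferable for money, the Python sketch below uses the standard decimal module as a stand-in for FixedPoint. It is an analogy only: Decimal is not the expecco class, and the price is invented for the example.

```python
from decimal import Decimal

# Three items at 10 cents each, computed with binary doubles.
float_total = 0.1 * 3

# The same computation with decimal arithmetic is exact.
price = Decimal("0.10")
decimal_total = price * 3

print(float_total == 0.3)                # False: binary rounding error
print(decimal_total == Decimal("0.30"))  # True
```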

Trigonometric and other Math Functions

Trigonometric and other math functions (sqrt, log, exp) will first convert the number to a limited precision real number (a C-double), and therefore may have a limited input value range and also generate inexact results.

For example, you will not get a valid result for:

10000 factorial sin

because such a large number cannot be represented as a limited precision real number. (Expecco will signal a domain error, as the input to sin will be +INF.)
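The same limitation can be observed in Python, shown here only as an illustration. Note one difference: Python refuses to convert the huge integer to a double (OverflowError) instead of producing +INF, and it reports a domain error (ValueError) when sin is applied to +INF directly. The function names are made up for this sketch.

```python
import math

def sin_of_factorial(n):
    # Return sin(n!) as a float, or None if n! does not fit into a double.
    big = math.factorial(n)        # exact, arbitrary precision integer
    try:
        as_double = float(big)     # conversion to a limited precision double
    except OverflowError:
        return None                # the value is outside the double range
    return math.sin(as_double)

def sin_of_infinity_fails():
    # math.sin rejects +INF with a domain error (ValueError in Python).
    try:
        math.sin(math.inf)
    except ValueError:
        return True
    return False

print(sin_of_factorial(3))        # sin(6.0), a regular double
print(sin_of_factorial(10000))    # None: no double can hold 10000!
print(sin_of_infinity_fails())    # True
```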

Also, the result from:

(9 / 64) sqrt

will be the limited precision 0.375 (a double), instead of the exact 3/8 (a fraction). For inputs whose root cannot be represented exactly as a double, the result additionally carries a rounding error in the last bit.
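If an exact root is needed, it can be computed on the numerator and denominator separately whenever both are perfect squares. The Python sketch below shows the idea; exact_sqrt is a hypothetical helper and not a function provided by expecco.

```python
from fractions import Fraction
import math

def exact_sqrt(frac):
    # Return the exact square root as a Fraction, or None if the
    # numerator or denominator is not a perfect square.
    n, d = frac.numerator, frac.denominator
    if n < 0:
        return None
    root_n, root_d = math.isqrt(n), math.isqrt(d)
    if root_n * root_n == n and root_d * root_d == d:
        return Fraction(root_n, root_d)
    return None

x = Fraction(9, 64)
print(math.sqrt(x))    # converted to a double first, result is a float
print(exact_sqrt(x))   # exact result, stays a fraction
print(exact_sqrt(Fraction(1, 2)))   # None: no exact rational root
```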

Different Results on Different CPUs

Since floating point arithmetic is performed by the underlying CPU hardware, different results (in the least significant bit) can be returned from math operations on different CPUs or even different versions (steppings) of the same CPU architecture. This applies especially to trigonometric and other math functions. Be prepared for this, and use the "almost-equal" comparison functions when results are to be verified.

Higher Precision Numbers

Expecco supports various inexact real formats, with different precision (i.e. number of mantissa bits):

Name                          Overall size  Exponent bits  Mantissa bits  Decimal digits  Smalltalk class

IEEE single precision floats  32 bit        8              24             6               ShortFloat
IEEE double precision floats  64 bit        11             53             15              Float
IEEE extended prec. floats    80/128 bit    15             64/112         19/34           LongFloat (1)
quad double                   4*64 bit      11             200            60              QDouble (2)
arbitrary                     any           any            any            any             LargeFloat (2)

(1) on x86/x64 machines, LongFloats are represented as 80bit extended floats with 64bit mantissa; on other CPUs, these might be represented as 128bit quadFloats with 112 bit mantissa.
(2) these are currently provided as a preview feature without warranty.

Notice that the use of any but double precision floats (which are directly supported by the machine) comes at a price. The speed of operations degrades from double -> single -> extended -> quad double -> largeFloat.
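The figures for double precision, and the precision lost when going down to single precision, can be inspected from Python. This sketch is illustrative only and assumes a platform where Python floats are IEEE doubles; to_single is a hypothetical helper that rounds a value through the 32 bit format.

```python
import struct
import sys

# Mantissa bits and guaranteed decimal digits of the native double.
double_mantissa_bits = sys.float_info.mant_dig
double_decimal_digits = sys.float_info.dig

def to_single(x):
    # Round a double to IEEE single precision and convert it back.
    return struct.unpack("f", struct.pack("f", x))[0]

# Storing 0.1 in 32 bits loses the digits beyond single precision.
single_error = abs(to_single(0.1) - 0.1)

print(double_mantissa_bits, double_decimal_digits)
print(to_single(0.1), single_error)
```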



Copyright © 2014-2018 eXept Software AG