Numeric Limits
Version as of 1 April 2025, 20:05
This page provides some computer science basics, which are not specific to expecco. However, in the past some users encountered problems and it is useful to provide some insight on number representations.
Expecco supports arbitrary precision integer arithmetic, arbitrary precision fractions and limited precision floating point numbers in various precisions.
Contents
- 1 Exact Integer Numbers
- 2 Exact Fractions, ScaledDecimals and FixedDecimals
- 3 Inexact Float and Double Numbers
- 4 Trigonometric and other Math Functions
- 5 Complex Results
- 6 Undefined Results
- 7 Overflow
- 8 Different Results on Different CPUs
- 9 Higher Precision Numbers
- 10 Constants
- 11 Examples
- 12 See Also
Exact Integer Numbers
Integer operations never overflow: every legal operation delivers an exact result, even when both operands are very big numbers.
This is a feature of the underlying Smalltalk runtime environment and in contrast to many other programming languages (especially: Java and C) which provide int (usually 32bit) and long (usually 64bit) integer types.
In expecco, you can write both in Smalltalk and in the builtin JavaScript syntax (1):
2147483647 "(0x7FFFFFFF)" + 1 -> 2147483648 "(0x80000000)"
4294967295 "(0xFFFFFFFF)" + 1 -> 4294967296 "(0x100000000)"
18446744073709551615 "(0xFFFFFFFFFFFFFFFF)" + 1 -> 18446744073709551616 "(0x10000000000000000)"
Very large values can be computed:
10000 factorial -> a huge number beginning with: 284625968091705451890641321211986889014....
Smalltalk will automatically convert any result which is too large to fit into a machine-integer into a LargeInteger (with an arbitrary number of bits) and also automatically convert back to a small representation, if possible.
Thus, although the two operands to the division in the following example are large integers,
rslt := (1000 factorial) / (999 factorial)
the result will be a small integer (since the value 1000 fits easily into a machine word).
As a user, you do not have to care about these internals.
Hint: Therefore, you can use a Workspace (Notepad) window as a calculator with arbitrary precision.
1) Be aware that this only computes correct results if the elementary action is written in Smalltalk or in the builtin JavaScript syntax. Depending on the version, results may or may not be correct when using Java/Groovy, Python, C/C++, Node.js etc.
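As a cross-check outside of expecco: Python's built-in integers are also arbitrary precision, so the same idea can be reproduced there. This is an illustration of the concept only; expecco itself relies on the Smalltalk runtime's automatic LargeInteger handling.

```python
import math

# no 32- or 64-bit overflow; results simply grow as needed:
print(0x7FFFFFFF + 1)               # 2147483648
print(0xFFFFFFFFFFFFFFFF + 1)       # 18446744073709551616

# huge intermediate results that reduce back to a small integer:
result = math.factorial(1000) // math.factorial(999)
print(result)                       # 1000
```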
Exact Fractions, ScaledDecimals and FixedDecimals
Fractions
When dividing integers, the "/" operator will deliver an exact result, possibly as a fraction:
5 / 3 -> 5/3
and reduce the result (possibly returning an Integer):
(5/3) * (3/2)                  -> 5/2
(5/3) / (3/2)                  -> 10/9
(5/3) * (9/3)                  -> 5
1000 factorial / 999 factorial -> 1000
There is also an integer division operator "//", which delivers an integer truncated towards negative infinity (i.e. the next smaller integer). Note that for negative operands this differs from Java or C, whose integer division truncates towards zero:
5 // 3  -> 1
-5 // 3 -> -2
The corresponding modulo operator "\\" provides the remainder, such that:
(a // b) * b + (a \\ b) = a
The "\\" is the standard Smalltalk modulo operator; Smalltalk/X also provides "%" as an alias (for users with a C/Java background).
Thus you can also write:
(a // b) * b + (a % b) = a
There is also a division operator ("quo:") which truncates towards zero (matching Java/C integer division), and a corresponding remainder operator ("rem:"), for which:
(a quo: b) * b + (a rem: b) = a
For positive a and b, the two operator pairs deliver the same result. For negative arguments, these are different. Be aware and think about the domain of your arguments.
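The difference between the two operator pairs can be demonstrated in any language; here is a sketch in Python, whose "//" and "%" floor like Smalltalk's "//" and "\\" (the truncating pair is emulated by hand, since Python has no built-in operator for it):

```python
a, b = -5, 3

# floor division and modulo (like Smalltalk "//" and "\\"):
q_floor, r_floor = a // b, a % b          # -2 and 1
assert q_floor * b + r_floor == a

# truncation towards zero (like Smalltalk "quo:" and "rem:"):
q_trunc = int(a / b)                      # -1
r_trunc = a - q_trunc * b                 # -2 (same sign as a)
assert q_trunc * b + r_trunc == a

print(q_floor, r_floor, q_trunc, r_trunc)   # -2 1 -1 -2
```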
In addition, the usual ceiling, floor and rounding operations are available (both on fractions and on limited precision reals):
(5 / 3) ceiling       -> 2     "the next larger integer"
(5 / 3) floor         -> 1     "the next smaller integer"
(5 / 3) truncated     -> 1     "truncate towards zero"
(5 / 3) rounded       -> 2     "round; fraction >= 0.5 rounds up"
(5 / 3) roundTo: 0.1  -> 1.7
(5 / 3) roundTo: 0.01 -> 1.67
(-5 / 3) ceiling      -> -1    "the next larger integer"
(-5 / 3) floor        -> -2    "the next smaller integer"
(-5 / 3) truncated    -> -1    "truncate towards zero"
(-5 / 3) rounded      -> -2
(-5 / 3) roundTo: 0.1 -> -1.7
Fractions print themselves as "(numerator / denominator)".
ScaledDecimal
If you prefer a decimal representation with a defined number of fractional digits, use ScaledDecimals (which for backward compatibility are also called "FixedPoint" (1)).
These are also exact fractions, but print differently: you specify the number of digits to be printed, and the number prints itself rounded on the last digit. In other words: the computation and internal value are exact (as with Fractions), so no rounding errors accumulate. Only when printed is the external representation rounded to the specified number of decimal places.
(5 / 3) asScaledDecimal:2       -> 1.67
(5 / 3) asScaledDecimal:4       -> 1.6667
((5 / 3) asScaledDecimal:2) * 3 -> 5.00
1.2 asScaledDecimal:3           -> 1.200
Float pi asScaledDecimal:5      -> 3.14159
1) the class was previously called "FixedPoint" and the converters "asFixedPoint:". For compatibility with other Smalltalk dialects, these now have the aliases "ScaledDecimal" and "asScaledDecimal:".
Both the old class name and the old operators remain supported as aliases for backward compatibility,
but you should use the new names, both for compatibility with other Smalltalk dialects and to avoid confusion with FixedDecimal numbers.
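The ScaledDecimal idea (exact internal value, rounding only in the printed representation) can be sketched in Python using exact fractions plus formatted output. This illustrates the principle only; it is not expecco's implementation:

```python
from fractions import Fraction

v = Fraction(5, 3)            # the exact internal value

# rounding happens only in the printed representation:
print(f"{float(v):.2f}")      # 1.67
print(f"{float(v):.4f}")      # 1.6667

# the exact value is kept, so no rounding error accumulates:
print(v * 3)                  # 5
```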
FixedDecimal
As mentioned above, a ScaledDecimal keeps the exact value internally, but prints itself rounded to a given number of decimal digits. Smalltalk/X provides an alternative class called "FixedDecimal" (1), which always keeps a rounded value internally. These may be better suited for monetary values, especially in computed additive sums which are printed in a table, as the printed sum of two FixedDecimals always equals the sum of their printed values.
For example:
v := 50.004 asScaledDecimal:2.
v printString  -> '50.00'     "is actually 50.004"
v2 := v * 2.
v2 printString -> '100.01'    "is actually 100.008"
This leads to confusion if such numbers represent monetary values and are printed e.g. in a summed-up table.
With FixedDecimals, you'll get:
v := 50.004 asFixedDecimal:2.
v printString  -> '50.00'     "is actually 50.00"
v2 := v * 2.
v2 printString -> '100.00'    "is actually 100.00"
Be aware that mixed arithmetic operations will usually return an instance of the class with the higher generality, and that Floats have a higher generality than FixedDecimals, which in turn have a higher generality than ScaledDecimals.
Thus, when multiplying a float and a fixed decimal, you'll get a float result, whereas if you multiply an integer and a fixed decimal, the result will be a fixed decimal. Finally, when multiplying a fixed decimal and a scaled decimal, the result will be a fixed decimal.
1) both names "ScaledDecimal" and "FixedDecimal" were chosen a bit unwisely and are easily confused. However, they cannot easily be changed for backward compatibility reasons. We apologize.
Inexact Float and Double Numbers
Floating point numbers are inherently inexact and almost always represent an approximated value. The error depends on the floating point number's precision, which is the number of bits with which the value is approximated.
This is not a problem specific to expecco,
but inherent to the way floating point numbers are represented (in the machine).
See "What Every Computer Scientist Should Know About Floating-Point Arithmetic",
"Mindless Assessments of Roundoff in Floating-Point Computation" and "Some disasters attributable to bad numerical computing". A very impressive example of how wrong double precision IEEE arithmetic can be is described in "Do_not_trust_Floating_Point" (1).
Floating point numbers are represented as a sum of powers of 2 (actually 1/2 + 1/4 + 1/8 +...) called the "mantissa" then multiplied by 2 raised to an exponent. I.e.
value = mantissa * (2 ** exponent)
with the mantissa normalized to be a sum in the interval 0.5..1 (as listed above), and the exponent stored with an offset (called "ebias"). The minimum exponent (0) is reserved for zero and non-normalized tiny numbers (called "subnormals"); the maximum exponent is reserved for infinities and NaNs ("not a number"). These might be returned from some operations if invalid arguments are provided (for example, trying to take the logarithm of a negative number).
The number of exponent bits determines the largest and smallest representable magnitudes, the number of mantissa bits determines the relative error. The error depends on the value of the last mantissa bit, which depends on the exponent. This value is called "Unit in the Last Place" or "ULP" (see Wikipedia).
For a large number like 1e100, one ULP is the very large 1.94266889222573e+84, whereas for a small number like 0.5, it is 1.11022302462516e-16.
Floating point formats differ in the number of bits (single/double/extended precision).
A double precision IEEE float has 11 bits for the exponent and 53 for the mantissa (see IEEE floating point formats).
As a rule of thumb, the error in the last bit of a double precision IEEE float is roughly 15 to 16 orders of magnitudes smaller than the magnitude of the (double precision) floating point number. The error is larger for single precision (32bit) floats and smaller for extended floats (80, 128 or more bits).
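These gaps can be inspected directly in e.g. Python via math.ulp; the values match the ones quoted above:

```python
import math

# the absolute gap between neighboring doubles grows with magnitude:
print(math.ulp(1e100))    # ~1.94e+84
print(math.ulp(0.5))      # ~1.11e-16  (= 2**-53)

# the *relative* gap stays roughly the same (~1e-16 for doubles):
print(math.ulp(1e100) / 1e100)
```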
1): If you do not believe me, try the following example from one of the papers mentioned above, in e.g. Excel or your favourite programming language:
v := 4/3.       "/ or maybe (4.0/3.0)
w := v - 1.
x := w*3.
y := x - 1.
z := (y*2)**52.
Limited Precision
Due to the limited number of bits in the mantissa, different values may end up with the same floating point representation. For example, both 9223372036854776000 and 9223372036854775808 are represented by the same float when converted from integer to float: there are simply not enough bits in the mantissa to distinguish them.
For example:
9223372036854776000 asFloat = 9223372036854775808 asFloat
will return "true", and the difference will be zero in:
9223372036854776000 asFloat - 9223372036854775808 asFloat
in contrast to the correct result being returned when comparing/subtracting them as integers:
9223372036854776000 = 9223372036854775808.
Also, many numbers (actually: most numbers) cannot be exactly represented by a finite sum of powers of 2. Such numbers will have an error in the last significant bit (actually half the last bit). When floating point numbers are added or multiplied, the result is usually computed internally with a few more mantissa bits, and then rounded in the last bit to fit the mantissa's number of bits. Notice that this may not be immediately obvious, because the print functions (such as printf) cheat and round again on the last bit. Thus, a result such as 0.9999999 would be printed as "1.0".
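The same collision can be reproduced with IEEE doubles in e.g. Python (9223372036854775808 is 2^63; the other number is only 192 away, which is less than half a ULP at that magnitude):

```python
# both integers round to the same 64-bit double:
print(float(9223372036854776000) == float(9223372036854775808))   # True
print(float(9223372036854776000) - float(9223372036854775808))    # 0.0

# as exact integers they differ, of course:
print(9223372036854776000 - 9223372036854775808)                  # 192
```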
The situation may be relaxed slightly by using more bits for the mantissa: expecco gives you a choice of 32 bit (called "ShortFloat"), 64 bit ("Float"), 80 bit ("LongFloat") and even more, which are mapped to the corresponding IEEE floats (single, double and extended).
Repeating the above example with long floats, there are now enough mantissa bits, and the two numbers are no longer mapped to the same representation nor considered equal,
thus:
9223372036854776000 asLongFloat = 9223372036854775808 asLongFloat
yields "false" as answer, and the computation:
9223372036854776000 asLongFloat - 9223372036854775808 asLongFloat
will give 192.0 as answer.
However, even with more bits, the fundamental restriction remains, although appearing less frequently with higher precision. But be aware that many numbers (such as 1/10, 1/5, 1/3) can never be represented exactly, no matter how many bits are used. So even the innocent looking "0.1" is actually an approximation and wrong in the last bit.
The limited precision may lead to "strange" results, especially when operands are far apart; for example, when subtracting a very small value from a much larger one, as in:
2.15e12 - 1.25e-5
Here, the operands differ by 17 orders of magnitude, and there are not enough bits to represent the result, which will be rounded to give you 2.15e12 again.
Thus, the comparison "2.15e12 - 1.25e-5 = 2.15e12" returns true, as does "2.15e12 - (2.15e12 - 1.25e-5) = 0.0"; both results are obviously wrong.
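This cancellation is easy to reproduce with IEEE doubles in any language; a short Python sketch:

```python
big, small = 2.15e12, 1.25e-5

# "small" is below half a ULP of "big", so subtracting it has no effect:
print(big - small == big)      # True
print(big - (big - small))     # 0.0 instead of 1.25e-5
```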
In this special case, a better result is obtained when operating with extended precision:
2.15e12 asLongFloat - 1.25e-5 asLongFloat
which gives 2149999999999.999987 as result (still incorrect due to its 64 bit mantissa, but much better).
If you are willing to trade speed for precision, you can use one of expecco's builtin higher precision representations, or even the arbitrary precision representation, and compute with more bits of precision. The QDouble class provides a compromise between speed and precision, providing roughly 200 bits of precision; seen alternatively, it represents a combination of up to 4 arbitrarily valued doubles (i.e. it can represent the sum of a very large and a small number):
(2.15e12 asQDouble) - (1.25e-5 asQDouble)
(2.15e12 asQDouble) - ((2.15e12 asQDouble) - (1.25e-5 asQDouble))
Another representation supports an arbitrary number of precision bits (here 200):
(2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200)
(2.15e12 asLargeFloatPrecision:200) - ((2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200))
Both return the correct results 2149999999999.9999875 and 1.25e-5 respectively.
Be aware that the higher precision and arbitrary precision operations are much slower than the ones directly supported by the processor (which has special hardware, usually for single, double and extended precision). Also, they are currently under development and not yet released for official use (meaning they may contain bugs at the time of writing, especially in their trigonometric and other math functions).
You may use fractions,
2150000000000 - (1 / 125000)
to compute the exact result: (268749999999999999/125000). Of course, these are less convenient to read, and should probably be presented to the end user as ScaledDecimals.
Also, do not forget that a conversion to a higher precision number cannot magically generate the missing bits. If a 32 bit floating point number is already an approximation (i.e. the real value cannot be represented as an exact sum of powers of 2), then the conversion will give you another such approximation, with the same error. Thus, "0.25 asLongFloat" will give you an exact 0.25 (because 0.25 is representable), whereas "0.1 asLongFloat" will not give an exact "0.1". Actually, the result of such a conversion will usually not give you the full possible precision. If you need a constant with maximum precision, either enter it as such (i.e. "0.1q" instead of "0.1 asLongFloat") or read it from a string (i.e. LongFloat fromString:'0.1').
Floating Point Errors Propagate
The above rounding and last-bit errors accumulate with every math operation performed on them (and may even do so wildly).
For example, the decimal 0.1 cannot be exactly represented as a floating point number, and is actually 0.099999... with an error in the last bit (half a ULP). Adding this multiple times will result in a larger error in the final result:
1.0 - (0.1 + 0.1 + 0.1 ...(10 times)... + 0.1) -> 1.11022302462516E-16
The print functions will usually try to compensate for an error in the last bit(s), showing "0.1" although in reality, it is "0.09999..." (it rounds before printing). Thus, even though the printed representation of such a number might look ok, it will inject more and more error when the value is used in further operations (up to the point when print will no longer be able to cheat, and the error becomes visible).
This is especially inconvenient, when monetary values or counts are represented as floats and a final sum is wrong in the penny value
(and therefore, a real programmer will never ever use floating point numbers to represent monetary values!).
As an example, try to sum an array consisting of 10 values:
(Array new:10 withAll:0.1) sum printString
which results in "1.0" due to print's cheating,
whereas:
(Array new:100 withAll:0.1) sum printString
will show '9.99999999999998' (i.e. the error accumulated to a value too big for print's cheating to compensate).
Expecco (actually the underlying Smalltalk) provides additional number representations which are better suited for such computations: Fraction, ScaledDecimal and FixedDecimal (in other systems, ScaledDecimals are also called "FixedPoint" numbers, and expecco knows that as an alias).
These are exact fractions internally, but use different print strategies: Fractions print as such (i.e. '1/3', '2/17' etc.), whereas ScaledDecimal and FixedDecimal numbers print themselves as a decimal expansion (i.e. '0.33' or '0.20'). ScaledDecimal constants can be entered by using "s" instead of "e" (i.e. '1.23s' defines a scaled decimal with 2 digits, and '1.23s4' one which prints 4 valid digits after the decimal point).
No such rounding errors are encountered, if fractions are used:
1 - ((1/10) + (1/10) + (1/10) ...(10 times)... + (1/10)) -> 0
or if ScaledDecimal numbers are used:
(Array new:100 withAll:0.1s) sum printString -> '10.0'
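Both the accumulation and the exact-fraction fix can be reproduced in e.g. Python, with fractions.Fraction playing the role of Smalltalk's exact fractions:

```python
from fractions import Fraction

# ten float additions of 0.1 already miss 1.0 by one ULP:
print(1.0 - sum([0.1] * 10))            # ~1.11e-16
print(sum([0.1] * 100))                 # ~9.99999999999998

# exact fractions accumulate no error at all:
print(1 - sum([Fraction(1, 10)] * 10))  # 0
```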
Floating Point Number Comparison
Be aware of such errors, and do not compare floating point numbers for equality/inequality.
As a concrete example, try:
(0.2 + 0.1 - 0.3) = 0
which will return false
and if you print "0.2 + 0.1 - 0.3", you might get something like: "5.55111512312578e-17".
Even increasing the precision does not really help; if we went to 200bits precision, we'd still get a small error:
(0.2QL + 0.1QL - 0.3QL) printString
gives "-3.111507638930570853572...e-61"
The problem also occurs when comparing numbers with different precision. For example, consider that a float32 value is to be compared against a constant. The float32 might be read from a file or provided by a measurement device via any communication mechanism. If we compare it against a higher precision value, the missing bits in the shorter float are filled with zeros. Thus:
(Float32 readFrom:'0.125') = 0.125
leads to a true value, whereas:
(Float32 readFrom:'0.123') = 0.123
returns false.
The reason is that 0.125 can be represented exactly in both float32 and float64 formats, whereas 0.123 is inexact, with a binary expansion that does not terminate.
Its representation as float64 is:
0 01111111011 1111011111001110110110010001011010000111001010110000
and:
0 01111011 11110111110011101101101
as float32. When comparing, the float32 is expanded to:
0 01111011 111101111100111011011010000000000000000000000000000000
which is obviously different.
Instead of comparing against a constant, either use range-compares and/or use the special "compare-almost-equal" functions, where the number of bits of acceptable error can be specified (so called: "ULPs"). Expecco provides such functions both for elementary code and in the standard action block library.
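The same advice applies outside of expecco; Python, for example, offers math.isclose as a tolerance-based ("almost-equal") compare:

```python
import math

x = 0.2 + 0.1 - 0.3
print(x == 0)                                      # False (x is ~5.55e-17)

# range compare, or tolerance-based compare, instead of "=":
print(abs(x) < 1e-12)                              # True
print(math.isclose(0.2 + 0.1, 0.3, rel_tol=1e-9))  # True
```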
Also, for this reason, do not compute monetary values using floats or doubles. Instead, use instances of ScaledDecimal. Otherwise you might lose a cent/penny here and there when using floats/doubles on big budgets.
With scaled decimals, the result is correct:
0.1s + 0.2s - 0.3s = 0 -> true
and:
(0.1 asScaledDecimal + 0.2 asScaledDecimal - 0.3 asScaledDecimal) printString -> '0.00'
Limited Range of Float and Double Numbers
Floating point numbers also have a limited range. In expecco, the default float format is IEEE double precision format (called "Float" or "Float64" in expecco). Numbers with an absolute value greater than 1.79769313486232E+308 will lead to a +INF/-INF (infinite) result, and numbers with absolute value smaller than 2.2250738585072E-308 will be zero.
For IEEE single precision floats (called "ShortFloat" or "Float32" in expecco), the range is much smaller, and for IEEE extended precision (called "LongFloat" or "Float80" in expecco), the range is larger.
You can ask the classes for their limits, with:
- fmin (smallest representable number larger than zero)
- fmax (largest representable number)
- emin (smallest exponent; binary)
- emax (largest exponent; binary)
- precision (bits in mantissa, incl. any hidden bit)
- decimalPrecision (digits when printed)
"remember: Float is the same as Float64"
Float fmin -> 2.2250738585072E-308
Float fmax -> 1.79769313486232E+308
Float emin -> -1022
Float emax -> 1023
Float precision -> 53
Float decimalPrecision -> 15

"remember: ShortFloat is the same as Float32"
ShortFloat fmin -> 1.175494e-38
ShortFloat fmax -> 3.402823e+38
ShortFloat emin -> -126
ShortFloat emax -> 127
ShortFloat precision -> 24
ShortFloat decimalPrecision -> 7

"remember: LongFloat is the same as Float80"
LongFloat fmin -> 3.362103143112093506E-4932
LongFloat fmax -> 1.189731495357231765E+4932
LongFloat emin -> -16382
LongFloat emax -> 16383
LongFloat precision -> 64
LongFloat decimalPrecision -> 19

Float128 fmin -> 3.36210314311209350626267781732e-4932
Float128 fmax -> 1.18973149535723176508575932662e+4932
Float128 emin -> -16382
Float128 emax -> 16383
Float128 precision -> 113
Float128 decimalPrecision -> 34

Float256 fmin -> 2.48242795146434978829932822291387172367768770607964686927095329791379e-78913
Float256 fmax -> 1.61132571748576047361957211845200501064402387454966951747637125049607e+78913
Float256 emin -> -262142
Float256 emax -> 262143
Float256 precision -> 237
Float256 decimalPrecision -> 71

QDouble fmin -> same as Float64
QDouble fmax -> same as Float64
QDouble emin -> same as Float64
QDouble emax -> same as Float64
QDouble precision -> 204
QDouble decimalPrecision -> 61

LargeFloat fmin -> 0.0 (arbitrarily small)
LargeFloat fmax -> inf (arbitrarily large)
LargeFloat emin -> -inf (arbitrarily small)
LargeFloat emax -> inf (arbitrarily large)
LargeFloat precision -> 200 (default; configurable)
LargeFloat decimalPrecision -> 60 (default; configurable)
Remember that the name "Float" refers to "Float64", which is called "double" in the C language. And also remember that the name "ShortFloat" refers to "Float32", which is called "float" in C. And finally, the name "LongFloat" refers to "Float80", which is called "long double" in C.
As a consequence, you cannot compute very large numbers using any of the CPU supported floats, and you will have to use one of the software computed float representations.
For example trying to compute the number of decimal digits of a huge number:
10000 factorial asFloat log10 -> INF
I.e. it returns infinity, because "10000 factorial asFloat" already returns INF. (The value could have been converted with "asFloatChecked", which raises an exception in that situation and is probably a good idea here. However, the regular asFloat conversion uses the underlying CPU's float support, which returns INF, similar to the behavior of other programming languages.)
In contrast, the exact integer computation succeeds:
10000 factorial log10 -> 35659.454274
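The same contrast can be seen in e.g. Python, where converting the huge integer to a float fails (Python raises OverflowError where the CPU float operation would return INF), while operating on the exact integer works:

```python
import math

n = math.factorial(10000)        # exact, arbitrary precision integer

try:
    float(n)                     # far too large for a 64-bit double
except OverflowError:
    print("float conversion overflows")

print(math.log10(n))             # ~35659.454274 (log10 accepts big ints)
print(len(str(n)))               # 35660 decimal digits
```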
Again: this is not a problem specific to expecco, but inherent to the way floating point numbers are represented in the CPU.
Speed of Operations
Machines have builtin floating point math operations, which usually work fastest in single or double precision (actually, some modern machines work faster in double than in single precision).
Unless you have special precision needs, it is best to stick with double precision, which is also portable across machines. Therefore, double precision (aka "double") is the default float format and is simply named "Float" in Smalltalk/X (and therefore also in expecco).
Trigonometric and other Math Functions
Some trigonometric and other math functions (sqrt, log, exp) will first convert the number to a limited precision real number (a C-double), and therefore may have a limited input value range and also generate inexact results.
For example, you will not get a valid result for:
10000 factorial sin
because it is not possible to represent that large a number as a machine real number (expecco will signal a domain error, as the input to sin will be +INF).
Also, the result from:
(9 / 64) sqrt
will be the (inexact) 0.375 (a double), instead of the exact 3/8 (a fraction). (This might change in a future release to provide exact results when both numerator and denominator are perfect squares.)
Complex Results
By default, Smalltalk/X will raise an error if the result of a function with a real operand would return a complex result.
For example, computing the square root of a negative number as in:
-2 sqrt
will raise an "ImaginaryResultError".
However, this is a proceedable exception, which can be caught; if the handler proceeds, a complex result is returned:
rslt := ImaginaryResultError ignoreIn:[ -2 sqrt ].
For readability, there is also an alias called "trapImaginary:" in the Number class:
rslt := Number trapImaginary:[ -2 sqrt ]
Both of the above would return the complex result:
(0+1.4142135623731 i)
Thus,
rslt := -2 sqrt squared
will result in an exception, whereas:
rslt := Number trapImaginary:[ -2 sqrt squared]
will generate a result of "-2.0".
All operations within "[" .. "]" which would produce an ImaginaryResultError will return a complex. Thus you can write:
Number trapImaginary:[
    |num1 num2|
    num1 := -2 sqrt.
    num2 := -3 sqrt.
    Transcript showCR: (num1 + num2).
]
and "(0+3.14626436994197i)" will be shown.
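For comparison: Python separates the two behaviors by module rather than by exception handling; math.sqrt raises for negative input, while cmath.sqrt returns complex results (roughly analogous to computing inside a trapImaginary: block):

```python
import cmath

print(cmath.sqrt(-2))                     # ~1.4142135623730951j
print(cmath.sqrt(-2) ** 2)                # ~(-2+0j)
print(cmath.sqrt(-2) + cmath.sqrt(-3))    # ~3.14626436994197j
```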
Undefined Results
Similar to the way imaginary results are handled, some operations are not defined for certain values (values outside the function's domain).
For example, the receiver of the arcSin operation must be in [-1 .. 1].
By default, these situations are also reported by raising an error, and therefore:
-2 arcSin
will raise such a "DomainError".
Similar to the above, this can be handled, although no useful value will be provided (in the above case, NaN (Not a Number) will be returned):
rslt := DomainError ignoreIn:[ -2 arcSin ]
or:
rslt := Number trapDomainError:[ -2 arcSin ].
Both of the above would generate a NaN as result.
Notice that if such a NaN is used in other arithmetic operations, either more exceptions or other NaNs will usually be generated (depending on the exception being handled or not).
Thus:
rslt := Number trapDomainError:[ -2 arcSin sin ].
will also generate a NaN as result.
You can check for valid results with:
aNumber isNaN              - true for NaN
aNumber isInfinite         - true for infinities
aNumber isPositiveInfinity - true for +inf
aNumber isNegativeInfinity - true for -inf
aNumber isFinite           - false for NaN or infinities (i.e. true for valid numbers)
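Equivalent checks exist in most environments; e.g. in Python:

```python
import math

nan, inf = math.nan, math.inf

print(math.isnan(nan))                          # True
print(math.isinf(inf), math.isinf(-inf))        # True True
print(math.isfinite(nan), math.isfinite(1.0))   # False True

# NaN propagates through further operations:
print(math.isnan(math.sin(nan)))                # True
```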
Overflow
When an operation's arguments are OK, but the result falls outside the range of representable numbers [fmin..fmax], you will get an infinite result (1). Further operations on these might produce more infinities or a NaN ("Not a Number"). This may be especially troublesome, if a final result gets corrupted due to an intermediate computation, as in:
a := 1e+10.
b := 1e+300.
c := 1e+20.
d := 1e+300.
rslt := (a*b) / (c*d)    -> NaN
Here, the final result is certainly representable, but the intermediate values (1e10 * 1e300) are out of the Float range [2.225E-308 .. 1.796e+308]. Thus, the computation will be:
a := 1e+10.
b := 1e+300.
c := 1e+20.
d := 1e+300.
(a*b)     -> INF
(c*d)     -> INF
INF / INF -> NaN
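The same intermediate overflow occurs with IEEE doubles in e.g. Python; note that simply reordering the operations keeps every intermediate in range, which is an alternative to switching to higher precision:

```python
import math

a, b, c, d = 1e+10, 1e+300, 1e+20, 1e+300

print(a * b)                 # inf  (1e310 overflows the double range)
print(c * d)                 # inf
print((a * b) / (c * d))     # nan  (inf/inf)

# reordering avoids the overflow entirely:
print((a / c) * (b / d))     # ~1e-10
```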
If the above intermediates are computed with a higher precision, the final result will be correct:
a := 1e+10.
b := 1e+300.
c := 1e+20.
d := 1e+300.
t := (a asLongFloat * b asLongFloat) / (c asLongFloat * d asLongFloat).
rslt := t asFloat    -> 1e-10
Of course, with LongFloats, the problem is only shifted towards larger numbers; as soon as the temporary result cannot be represented by a LongFloat:
a := 1q+100.
b := 1q+4900.
c := 1q+200.
d := 1q+4900.
rslt := (a * b) / (c * d)    -> NaN
then, you may use arbitrary precision LargeFloat numbers:
t := (a asLargeFloat * b asLargeFloat) / (c asLargeFloat * d asLargeFloat).
rslt := t asFloat    -> 1e-100
1) that is the current default behavior. Future versions may allow enabling exceptions in this situation, if customers request it. However, as most other programming languages behave similarly in these situations, most programmers are aware of these pitfalls and avoid such problems.
Different Results on Different CPUs
Since FLOAT32 and FLOAT64 arithmetic is performed by the underlying CPU hardware, different results (in the least significant bit) may be returned from math operations on different CPUs or even different versions (steppings) of the same CPU architecture.
This applies especially to trigonometric and other math functions. These are computed by power series, Taylor series or Newton approximations, with different algorithms on different systems, leading to different results.
Be prepared for this, and use the "almost-equal" comparison functions when results are to be verified.
Higher Precision Numbers
Expecco supports various inexact real formats, with different precision (i.e. number of mantissa bits). Some of those classes have alias names; these are provided to make Smalltalk/X code portable to other Smalltalk dialects (however, within expecco, you will probably not care, as it is not planned to port it to another dialect).
Name            Overall  Exponent  Mantissa  Decimal    fmin              Smalltalk/X           ANSI-Smalltalk
                Size     Size      Size (1)  Precision  fmax              Name                  Name (4)
                Bit      Bit       Bit       Digits
IEEE single     32       8         24        6          1.175494e-038     ShortFloat, FloatE,   FloatE
                                                        3.402823e+038     Float32
IEEE double     64       11        53        15         2.225074e-308     Float, FloatD,        FloatD
                                                        1.797693e+308     Float64               Double (VW)
IEEE extended   80/128   15        64/112    19/34      3.362103e-4932    LongFloat, FloatQ,    FloatQ (2)
                                                        1.189731e+4932    Float80 or Float128
quad double     4*64     11        200       60         1.175494e-038     QDouble               - (3)
                                                        3.402823e+038
IEEE quadruple  128      15        112       34         3.362103e-4932    QuadFloat, Float128   - (3)
                                                        1.189731e+4932
IEEE octuple    256      19        236       71         2.482427e-78913   OctaFloat, Float256   - (3)
                                                        1.611325e+78913
IEEE arbitrary  any      any       any       any        any               IEEEFloat             - (3)
large float     any      any       any       any        any               LargeFloat            - (3)
(1) mantissa incl. any hidden bit (normalized floats)
(2) LongFloats use the underlying CPU's long double format.
On x86/x64 machines, LongFloats are represented as 80bit extended floats with 64bit mantissa; on other CPUs, these might be represented as 128bit quadFloats with 112 bit mantissa.
(3) these are currently being developed and provided as a preview feature without warranty (meaning: they may be buggy at the moment; let us know if you need them).
(4) different Smalltalk dialects use different precisions for their floating point numbers: ST80/VW Floats are IEEE singles, V'Age Floats are IEEE doubles and ST/X Floats are IEEE doubles. VW refers to Float64 as Double.
Later, the ANSI standard defined FloatE, FloatD and FloatQ as aliases. You can use either interchangeably in expecco.
Notice that using any float format other than double precision (which is directly supported by the machine) may come at a performance price. Operation speed degrades in the order double -> single -> extended -> quad double -> IEEE arbitrary -> LargeFloat.
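As a cross-check outside of expecco (this is Python, not expecco/Smalltalk code), the standard library's <code>sys.float_info</code> exposes exactly the limits of the native IEEE double format described above:

```python
import sys

# Python's float is the machine's native IEEE double, so sys.float_info
# mirrors the "IEEE double" characteristics listed above.
print(sys.float_info.max)       # fmax, about 1.797693e+308
print(sys.float_info.min)       # smallest normalized value, about 2.225074e-308
print(sys.float_info.mant_dig)  # 53 mantissa bits (including the hidden bit)
print(sys.float_info.dig)       # 15 decimal digits of precision
```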
Constants
Some well known and common constants can be acquired by asking a number class such as the Float class:
 Float pi     -> pi (3.14159...)
 Float halfPi -> pi / 2
 Float twoPi  -> pi * 2
 Float phi    -> phi (golden ratio 1.6180...)
 Float sqrt2  -> square root of 2 (1.4142...)
 Float sqrt5  -> square root of 5 (2.2360...)
 Float ln2    -> natural log of 2 (0.69314...)
 Float ln10   -> natural log of 10 (2.30258...)
 Float e      -> e (2.718281...)
Each number class returns a representation of that constant at its own precision. I.e. if you ask the Float class for the constant "pi", you get a pi with roughly 15 digits of precision, whereas the QDouble class returns a more accurate representation.
 Float pi      -> 3.14159265358979
 ShortFloat pi -> 3.141593
 LongFloat pi  -> 3.141592653589793238
 QDouble pi    -> 3.1415926535897932384626433832795028841971693993751058209749446
 QuadFloat pi  -> 3.1415926535897932384626433832795027
 OctaFloat pi  -> 3.14159265358979323846264338327950288419716939937510582097494459230781639
 LargeFloat pi -> 3.1415926535897932384626433832795028841971693993751058209749445923078164 [many more digits...]
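The same precision limit can be observed in any language whose float is an IEEE double; a small Python sketch (the 50-digit pi literal is the well-known reference value):

```python
import math

# 50 digits of pi (well-known reference value) for comparison
PI_50 = "3.14159265358979323846264338327950288419716939937510"

# repr of an IEEE double keeps just enough digits to round-trip
print(repr(math.pi))   # 3.141592653589793

# the double agrees with the true value only to about 15-16 significant digits
assert PI_50.startswith(repr(math.pi)[:16])
```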
Examples
Code examples in Smalltalk syntax. Notice the float precision qualifiers:
 'q'  -> longFloat (= IEEE extended precision = Float80)
 'Q'  -> quadFloat (= IEEE quadruple precision = Float128)
 'QO' -> octaFloat (= IEEE octuple precision = Float256)
 'QD' -> qDouble
 'QL' -> largeFloat (arbitrary precision)
Square Root:
 2.0 sqrt asShortFloat      -> 1.414214
 2.0 sqrt                   -> 1.4142135623731
 2.0q sqrt                  -> 1.414213562373095049
 2.0Q sqrt                  -> 1.4142135623730950488016887242097
 2.0QD sqrt                 -> 1.4142135623730950488016887242096980785696718753769
 2.0QO sqrt                 -> 1.414213562373095048801688724209698078569671875376948073176679737990732
 (2.0QL precision:200) sqrt -> 1.414213562373095048801688724209698078569671875376948073176679
 (2.0QL precision:400) sqrt -> 1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297025
 (2.0QL precision:800) sqrt -> 1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297024924836055850737212644121497099935831413222665927505592755799950501152782060571470109559971605970274...

 Precision value from Wolfram:
 1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727350138462309122970249248360...
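Readers without expecco at hand can reproduce such arbitrary-precision results with Python's stdlib <code>decimal</code> module; a sketch where <code>prec</code> plays roughly the same role as the precision: argument of the LargeFloat examples:

```python
from decimal import Decimal, getcontext

# compute sqrt(2) to 60 significant digits
getcontext().prec = 60
r = Decimal(2).sqrt()
print(r)  # 1.4142135623730950488016887242096980785696718753769...
```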
Cubic Root:
 2.0 cbrt asShortFloat      -> 1.259921
 2.0 cbrt                   -> 1.25992104989487
 2.0q cbrt                  -> 1.259921049894873165
 2.0Q cbrt                  -> 1.25992104989487316476721060727823
 2.0QD cbrt                 -> 1.2599210498948731647672106072782283505702514647015
 2.0QO cbrt                 -> 1.2599210498948731647672106072782283505702514647015079800819751121553
 (2.0QL precision:200) cbrt -> 1.259921049894873164767210607278228350570251464701507980081975
 (2.0QL precision:400) cbrt -> 1.259921049894873164767210607278228350570251464701507980081975112155299676513959483729396562436255094154310256035615665259
 (2.0QL precision:800) cbrt -> 1.259921049894873164767210607278228350570251464701507980081975112155299676513959483729396562436255094154310256035615665259399024040613737228459110304269355246960642616625000977474526565480306867185405518689245872516764199373709695098382...

 Wolfram:
 1.2599210498948731647672106072782283505702514647015079800819751121552996765139594837293965624362550941543102560356156652593990240...
Exponentiation:
 2.0 exp asShortFloat      -> 7.389056
 2.0 exp                   -> 7.38905609893065
 2.0q exp                  -> 7.389056098930650227
 2.0Q exp                  -> 7.38905609893065022723042746057499
 2.0QD exp                 -> 7.3890560989306502272304274605750078131803155705518
 2.0QO exp                 -> 7.38905609893065022723042746057500781318031557055184732408712782252257266
 (2.0QL precision:200) exp -> 7.389056098930650227230427460575007813180315570551847324087123
 (2.0QL precision:400) exp -> 7.389056098930650227230427460575007813180315570551847324087127822522573796079057763384312485079121794773753161265478866123
 (2.0QL precision:800) exp -> 7.389056098930650227230427460575007813180315570551847324087127822522573796079057763384312485079121794773753161265478866123884603692781273374478392213398077774900122895607410753702391330947550682086581820269647868208404220982255234875742...

 Wolfram:
 7.3890560989306502272304274605750078131803155705518473240871278225225573796079057763384312485079121794773753161265478866123884603692781273374478...
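As with the square root example, the stdlib <code>decimal</code> module offers a way to cross-check the high-precision exponentiation results outside of expecco:

```python
from decimal import Decimal, getcontext

# e^2 to 50 significant digits, mirroring the 2.0QD exp example above
getcontext().prec = 50
e2 = Decimal(2).exp()
print(e2)  # 7.3890560989306502272304274605750078131803155705518
```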