Numeric Limits
Version as of 1 April 2025, 20:05
This page provides some computer science basics, which are not specific to expecco. However, in the past some users encountered problems and it is useful to provide some insight on number representations.
Expecco supports arbitrary precision integer arithmetic, arbitrary precision fractions and limited precision floating point numbers in various precisions.
Contents
- 1 Exact Integer Numbers
- 2 Exact Fractions, ScaledDecimals and FixedDecimals
- 3 Inexact Float and Double Numbers
- 4 Trigonometric and other Math Functions
- 5 Complex Results
- 6 Undefined Results
- 7 Overflow
- 8 Different Results on Different CPUs
- 9 Higher Precision Numbers
- 10 Constants
- 11 Examples
- 12 See Also
Exact Integer Numbers
Integer operations never overflow: every legal operation delivers an exact result, even when both operands are very big numbers.
This is a feature of the underlying Smalltalk runtime environment and in contrast to many other programming languages (especially: Java and C) which provide int (usually 32bit) and long (usually 64bit) integer types.
In expecco, you can write both in Smalltalk and in the builtin JavaScript syntax (1):
2147483647 "(0x7FFFFFFF)" + 1 -> 2147483648 "(0x80000000)"
4294967295 "(0xFFFFFFFF)" + 1 -> 4294967296 "(0x100000000)"
18446744073709551615 "(0xFFFFFFFFFFFFFFFF)" + 1 -> 18446744073709551616 "(0x10000000000000000)"
Very large values can be computed:
10000 factorial -> a huge number beginning with: 284625968091705451890641321211986889014....
Smalltalk will automatically convert any result which is too large to fit into a machine-integer into a LargeInteger (with an arbitrary number of bits) and also automatically convert back to a small representation, if possible.
Thus, although the two operands to the division in the following example are large integers,
rslt := (1000 factorial) / (999 factorial)
the result will be a small integer (since the value 1000 fits easily into a machine word).
As a user, you do not have to care about these internals.
Hint: Therefore, you can use a Workspace (Notepad) window as a calculator with arbitrary precision.
1) Be aware that this only computes correct results if the elementary action is written in Smalltalk or in the builtin JavaScript syntax. Depending on the version, results may or may not be correct when using Java/Groovy, Python, C/C++, Node.js etc.
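As a cross-check outside of expecco: Python's built-in integers are also arbitrary precision, so the same idea can be reproduced there. This is an illustration of the concept only; expecco itself relies on the Smalltalk runtime's automatic LargeInteger handling.

```python
import math

# no 32- or 64-bit overflow; results simply grow as needed:
print(0x7FFFFFFF + 1)               # 2147483648
print(0xFFFFFFFFFFFFFFFF + 1)       # 18446744073709551616

# huge intermediate results that reduce back to a small integer:
result = math.factorial(1000) // math.factorial(999)
print(result)                       # 1000
```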
Exact Fractions, ScaledDecimals and FixedDecimals
Fractions
When dividing integers, the "/" operator will deliver an exact result, possibly as a fraction:
5 / 3 -> 5/3
and reduce the result (possibly returning an Integer):
(5/3) * (3/2)                  -> 5/2
(5/3) / (3/2)                  -> 10/9
(5/3) * (9/3)                  -> 5
1000 factorial / 999 factorial -> 1000
There is also an integer division operator "//", which delivers an integer truncated towards negative infinity (i.e. the next smaller integer). Note that for negative operands this differs from Java or C, whose integer division truncates towards zero:
5 // 3  -> 1
-5 // 3 -> -2
The corresponding modulo operator "\\" provides the remainder, such that:
(a // b) * b + (a \\ b) = a
The "\\" is the standard Smalltalk modulo operator; Smalltalk/X also provides "%" as an alias (for users with a C/Java background).
Thus you can also write:
(a // b) * b + (a % b) = a
There is also a division operator ("quo:") which truncates towards zero (matching Java/C integer division), and a corresponding remainder operator ("rem:"), for which:
(a quo: b) * b + (a rem: b) = a
For positive a and b, the two operator pairs deliver the same result. For negative arguments, these are different. Be aware and think about the domain of your arguments.
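The difference between the two operator pairs can be demonstrated in any language; here is a sketch in Python, whose "//" and "%" floor like Smalltalk's "//" and "\\" (the truncating pair is emulated by hand, since Python has no built-in operator for it):

```python
a, b = -5, 3

# floor division and modulo (like Smalltalk "//" and "\\"):
q_floor, r_floor = a // b, a % b          # -2 and 1
assert q_floor * b + r_floor == a

# truncation towards zero (like Smalltalk "quo:" and "rem:"):
q_trunc = int(a / b)                      # -1
r_trunc = a - q_trunc * b                 # -2 (same sign as a)
assert q_trunc * b + r_trunc == a

print(q_floor, r_floor, q_trunc, r_trunc)   # -2 1 -1 -2
```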
In addition, the usual ceiling, floor and rounding operations are available (both on fractions and on limited precision reals):
(5 / 3) ceiling       -> 2     "the next larger integer"
(5 / 3) floor         -> 1     "the next smaller integer"
(5 / 3) truncated     -> 1     "truncate towards zero"
(5 / 3) rounded       -> 2     "round; fraction >= 0.5 rounds up"
(5 / 3) roundTo: 0.1  -> 1.7
(5 / 3) roundTo: 0.01 -> 1.67
(-5 / 3) ceiling      -> -1    "the next larger integer"
(-5 / 3) floor        -> -2    "the next smaller integer"
(-5 / 3) truncated    -> -1    "truncate towards zero"
(-5 / 3) rounded      -> -2
(-5 / 3) roundTo: 0.1 -> -1.7
Fractions print themselves as "(numerator / denominator)".
ScaledDecimal
If you prefer a decimal representation with a defined number of fractional digits, use ScaledDecimals (which for backward compatibility are also called "FixedPoint" (1)).
These are also exact fractions, but print differently: you specify the number of digits to be printed, and the number prints itself rounded on the last digit. In other words: the computation and internal value are exact (as with Fractions), so no rounding errors accumulate. Only when printed is the external representation rounded to the specified number of decimal places.
(5 / 3) asScaledDecimal:2       -> 1.67
(5 / 3) asScaledDecimal:4       -> 1.6667
((5 / 3) asScaledDecimal:2) * 3 -> 5.00
1.2 asScaledDecimal:3           -> 1.200
Float pi asScaledDecimal:5      -> 3.14159
1) the class was previously called "FixedPoint" and the converters "asFixedPoint:". For compatibility with other Smalltalk dialects, these now have the aliases "ScaledDecimal" and "asScaledDecimal:".
Both the old class name and the old operators remain supported as aliases for backward compatibility,
but you should use the new names, both for compatibility with other Smalltalk dialects and to avoid confusion with FixedDecimal numbers.
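The ScaledDecimal idea (exact internal value, rounding only in the printed representation) can be sketched in Python using exact fractions plus formatted output. This illustrates the principle only; it is not expecco's implementation:

```python
from fractions import Fraction

v = Fraction(5, 3)            # the exact internal value

# rounding happens only in the printed representation:
print(f"{float(v):.2f}")      # 1.67
print(f"{float(v):.4f}")      # 1.6667

# the exact value is kept, so no rounding error accumulates:
print(v * 3)                  # 5
```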
FixedDecimal
As mentioned above, a ScaledDecimal keeps the exact value internally, but prints itself rounded to a given number of decimal digits. Smalltalk/X provides an alternative class called "FixedDecimal" (1), which always keeps a rounded value internally. These may be better suited for monetary values, especially in computed additive sums which are printed in a table, as the printed sum of two FixedDecimals always equals the sum of their printed values.
For example:
v := 50.004 asScaledDecimal:2.
v printString  -> '50.00'     "is actually 50.004"
v2 := v * 2.
v2 printString -> '100.01'    "is actually 100.008"
This leads to confusion if such numbers represent monetary values and are printed e.g. in a summed-up table.
With FixedDecimals, you'll get:
v := 50.004 asFixedDecimal:2.
v printString  -> '50.00'     "is actually 50.00"
v2 := v * 2.
v2 printString -> '100.00'    "is actually 100.00"
Be aware that mixed arithmetic operations will usually return an instance of the class with the higher generality, and that Floats have a higher generality than FixedDecimals, which in turn have a higher generality than ScaledDecimals.
Thus, when multiplying a float and a fixed decimal, you'll get a float result, whereas if you multiply an integer and a fixed decimal, the result will be a fixed decimal. Finally, when multiplying a fixed decimal and a scaled decimal, the result will be a fixed decimal.
1) both names "ScaledDecimal" and "FixedDecimal" were chosen a bit unwisely and are easily confused. However, they cannot easily be changed for backward compatibility reasons. We apologize.
Inexact Float and Double Numbers
Floating point numbers are inherently inexact and almost always represent an approximated value. The error depends on the floating point number's precision, which is the number of bits with which the value is approximated.
This is not a problem specific to expecco,
but inherent to the way floating point numbers are represented (in the machine).
See "What Every Computer Scientist Should Know About Floating-Point Arithmetic",
"Mindless Assessments of Roundoff in Floating-Point Computation" and "Some disasters attributable to bad numerical computing". A very impressive example of how wrong double precision IEEE arithmetic can be is described in "Do_not_trust_Floating_Point" (1).
Floating point numbers are represented as a sum of powers of 2 (actually 1/2 + 1/4 + 1/8 +...) called the "mantissa" then multiplied by 2 raised to an exponent. I.e.
value = mantissa * (2 ** exponent)
with the mantissa normalized to be a sum in the interval 0.5..1 (as listed above), and the exponent stored with an offset (called "ebias"). The minimum exponent (0) is reserved for zero and non-normalized tiny numbers (called "subnormals"); the maximum exponent is reserved for infinities and NaNs ("not a number"). These might be returned from some operations if invalid arguments are provided (for example, trying to take the logarithm of a negative number).
The number of exponent bits determines the largest and smallest representable magnitudes, the number of mantissa bits determines the relative error. The error depends on the value of the last mantissa bit, which depends on the exponent. This value is called "Unit in the Last Place" or "ULP" (see Wikipedia).
For a large number like 1e100, one ULP is the very large 1.94266889222573e+84, whereas for a small number like 0.5, it is 1.11022302462516e-16.
Floating point formats differ in the number of bits (single/double/extended precision).
A double precision IEEE float has 11 bits for the exponent and 53 for the mantissa (see IEEE floating point formats).
As a rule of thumb, the error in the last bit of a double precision IEEE float is roughly 15 to 16 orders of magnitudes smaller than the magnitude of the (double precision) floating point number. The error is larger for single precision (32bit) floats and smaller for extended floats (80, 128 or more bits).
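These gaps can be inspected directly in e.g. Python via math.ulp; the values match the ones quoted above:

```python
import math

# the absolute gap between neighboring doubles grows with magnitude:
print(math.ulp(1e100))    # ~1.94e+84
print(math.ulp(0.5))      # ~1.11e-16  (= 2**-53)

# the *relative* gap stays roughly the same (~1e-16 for doubles):
print(math.ulp(1e100) / 1e100)
```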
1): If you do not believe me, try the following example from one of the papers mentioned above, in e.g. Excel or your favourite programming language:
v := 4/3.       "/ or maybe (4.0/3.0)
w := v - 1.
x := w*3.
y := x - 1.
z := (y*2)**52.
Limited Precision
Due to the limited number of bits in the mantissa, different values may end up with the same floating point representation. For example, both 9223372036854776000 and 9223372036854775808 are represented by the same float when converted from integer to float: there are simply not enough bits in the mantissa to distinguish them.
For example:
9223372036854776000 asFloat = 9223372036854775808 asFloat
will return "true", and the difference will be zero in:
9223372036854776000 asFloat - 9223372036854775808 asFloat
in contrast to the correct result being returned when comparing/subtracting them as integers:
9223372036854776000 = 9223372036854775808.
Also, many numbers (actually: most numbers) cannot be exactly represented by a finite sum of powers of 2. Such numbers will have an error in the last significant bit (actually half the last bit). When floating point numbers are added or multiplied, the result is usually computed internally with a few more mantissa bits, and then rounded in the last bit to fit the mantissa's number of bits. Notice that this may not be immediately obvious, because the print functions (such as printf) cheat and round again on the last bit. Thus, a result such as 0.9999999 would be printed as "1.0".
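The same collision can be reproduced with IEEE doubles in e.g. Python (9223372036854775808 is 2^63; the other number is only 192 away, which is less than half a ULP at that magnitude):

```python
# both integers round to the same 64-bit double:
print(float(9223372036854776000) == float(9223372036854775808))   # True
print(float(9223372036854776000) - float(9223372036854775808))    # 0.0

# as exact integers they differ, of course:
print(9223372036854776000 - 9223372036854775808)                  # 192
```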
The situation may be relaxed slightly by using more bits for the mantissa: expecco gives you a choice of 32 bit (called "ShortFloat"), 64 bit ("Float"), 80 bit ("LongFloat") and even more, which are mapped to the corresponding IEEE floats (single, double and extended).
Repeating the above example with long floats, there are now enough mantissa bits, and the two numbers are no longer mapped to the same representation nor considered equal,
thus:
9223372036854776000 asLongFloat = 9223372036854775808 asLongFloat
yields "false" as answer, and the computation:
9223372036854776000 asLongFloat - 9223372036854775808 asLongFloat
will give 192.0 as answer.
However, even with more bits, the fundamental restriction remains, although appearing less frequently with higher precision. But be aware that many numbers (such as 1/10, 1/5, 1/3) can never be represented exactly, no matter how many bits are used. So even the innocent looking "0.1" is actually an approximation and wrong in the last bit.
The limited precision may lead to "strange" results, especially when operands are far apart; for example, when subtracting a very small value from a much larger one, as in:
2.15e12 - 1.25e-5
Here, the operands differ by 17 orders of magnitude, and there are not enough bits to represent the result, which will be rounded to give you 2.15e12 again.
Thus, the comparison "2.15e12 - 1.25e-5 = 2.15e12" returns true, as does "2.15e12 - (2.15e12 - 1.25e-5) = 0.0"; both results are obviously wrong.
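This cancellation is easy to reproduce with IEEE doubles in any language; a short Python sketch:

```python
big, small = 2.15e12, 1.25e-5

# "small" is below half a ULP of "big", so subtracting it has no effect:
print(big - small == big)      # True
print(big - (big - small))     # 0.0 instead of 1.25e-5
```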
In this special case, a better result is obtained when operating with extended precision:
2.15e12 asLongFloat - 1.25e-5 asLongFloat
which gives 2149999999999.999987 as result (still incorrect due to its 64 bit mantissa, but much better).
If you are willing to trade speed for precision, you can use one of expecco's builtin higher precision representations, or even the arbitrary precision representation, and compute with more bits of precision. The QDouble class provides a compromise between speed and precision, providing roughly 200 bits of precision; seen alternatively, it represents a combination of up to 4 arbitrarily valued doubles (i.e. it can represent the sum of a very large and a small number):
(2.15e12 asQDouble) - (1.25e-5 asQDouble)
(2.15e12 asQDouble) - ((2.15e12 asQDouble) - (1.25e-5 asQDouble))
Another representation supports an arbitrary number of precision bits (here 200):
(2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200)
(2.15e12 asLargeFloatPrecision:200) - ((2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200))
Both return the correct results 2149999999999.9999875 and 1.25e-5 respectively.
Be aware that the higher precision and arbitrary precision operations are much slower than the ones directly supported by the processor (which has special hardware, usually for single, double and extended precision). Also, they are currently under development and not yet released for official use (meaning they may contain bugs at the time of writing, especially in their trigonometric and other math functions).
You may use fractions,
2150000000000 - (1 / 125000)
to compute the exact result: (268749999999999999/125000). Of course, these are less convenient to read, and should probably be presented to the end user as ScaledDecimals.
Also, do not forget that a conversion to a higher precision number cannot magically generate the missing bits. If a 32 bit floating point number is already an approximation (i.e. the real value cannot be represented as an exact sum of powers of 2), then the conversion will give you another such approximation, with the same error. Thus, "0.25 asLongFloat" will give you an exact 0.25 (because 0.25 is representable), whereas "0.1 asLongFloat" will not give an exact "0.1". Actually, the result of such a conversion will usually not give you the full possible precision. If you need a constant with maximum precision, either enter it as such (i.e. "0.1q" instead of "0.1 asLongFloat") or read it from a string (i.e. LongFloat fromString:'0.1').
Floating Point Errors Propagate
The above rounding and last-bit errors accumulate with every math operation performed on them (and may even do so wildly).
For example, the decimal 0.1 cannot be exactly represented as a floating point number, and is actually 0.099999... with an error in the last bit (half a ULP). Adding this multiple times will result in a larger error in the final result:
1.0 - (0.1 + 0.1 + 0.1 ...(10 times)... + 0.1) -> 1.11022302462516E-16
The print functions will usually try to compensate for an error in the last bit(s), showing "0.1" although in reality, it is "0.09999..." (it rounds before printing). Thus, even though the printed representation of such a number might look ok, it will inject more and more error when the value is used in further operations (up to the point when print will no longer be able to cheat, and the error becomes visible).
This is especially inconvenient, when monetary values or counts are represented as floats and a final sum is wrong in the penny value
(and therefore, a real programmer will never ever use floating point numbers to represent monetary values!).
As an example, try to sum an array consisting of 10 values:
(Array new:10 withAll:0.1) sum printString
which results in "1.0" due to print's cheating,
whereas:
(Array new:100 withAll:0.1) sum printString
will show '9.99999999999998' (i.e. the error accumulated to a value too big for print's cheating to compensate).
Expecco (actually the underlying Smalltalk) provides additional number representations which are better suited for such computations: Fraction, ScaledDecimal and FixedDecimal (in other systems, ScaledDecimals are also called "FixedPoint" numbers, and expecco knows that as an alias).
These are exact fractions internally, but use different print strategies: Fractions print as such (i.e. '1/3', '2/17' etc.), whereas ScaledDecimal and FixedDecimal numbers print themselves as a decimal expansion (i.e. '0.33' or '0.20'). ScaledDecimal constants can be entered by using "s" instead of "e" (i.e. '1.23s' defines a scaled decimal with 2 digits, and '1.23s4' one which prints 4 valid digits after the decimal point).
No such rounding errors are encountered, if fractions are used:
1 - ((1/10) + (1/10) + (1/10) ...(10 times)... + (1/10)) -> 0
or if ScaledDecimal numbers are used:
(Array new:100 withAll:0.1s) sum printString -> '10.0'
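Both the accumulation and the exact-fraction fix can be reproduced in e.g. Python, with fractions.Fraction playing the role of Smalltalk's exact fractions:

```python
from fractions import Fraction

# ten float additions of 0.1 already miss 1.0 by one ULP:
print(1.0 - sum([0.1] * 10))            # ~1.11e-16
print(sum([0.1] * 100))                 # ~9.99999999999998

# exact fractions accumulate no error at all:
print(1 - sum([Fraction(1, 10)] * 10))  # 0
```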
Floating Point Number Comparison
Be aware of such errors, and do not compare floating point numbers for equality/inequality.
As a concrete example, try:
(0.2 + 0.1 - 0.3) = 0
which will return false
and if you print "0.2 + 0.1 - 0.3", you might get something like: "5.55111512312578e-17".
Even increasing the precision does not really help; if we went to 200bits precision, we'd still get a small error:
(0.2QL + 0.1QL - 0.3QL) printString
gives "-3.111507638930570853572...e-61"
The problem also occurs when comparing numbers with different precision. For example, consider that a float32 value is to be compared against a constant. The float32 might be read from a file or provided by a measurement device via any communication mechanism. If we compare it against a higher precision value, the missing bits in the shorter float are filled with zeros. Thus:
(Float32 readFrom:'0.125') = 0.125
leads to a true value, whereas:
(Float32 readFrom:'0.123') = 0.123
returns false.
The reason is that 0.125 can be represented exactly in both float32 and float64 formats, whereas 0.123 is inexact, with a binary expansion that does not terminate.
Its representation as float64 is:
0 01111111011 1111011111001110110110010001011010000111001010110000
and:
0 01111011 11110111110011101101101
as float32. When comparing, the float32 is expanded to:
0 01111011 111101111100111011011010000000000000000000000000000000
which is obviously different.
Instead of comparing against a constant, either use range-compares and/or use the special "compare-almost-equal" functions, where the number of bits of acceptable error can be specified (so called: "ULPs"). Expecco provides such functions both for elementary code and in the standard action block library.
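The same advice applies outside of expecco; Python, for example, offers math.isclose as a tolerance-based ("almost-equal") compare:

```python
import math

x = 0.2 + 0.1 - 0.3
print(x == 0)                                      # False (x is ~5.55e-17)

# range compare, or tolerance-based compare, instead of "=":
print(abs(x) < 1e-12)                              # True
print(math.isclose(0.2 + 0.1, 0.3, rel_tol=1e-9))  # True
```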
Also, for this reason, do not compute monetary values using floats or doubles. Instead, use instances of ScaledDecimal. Otherwise you might lose a cent/penny here and there when using floats/doubles on big budgets.
With scaled decimals, the result is correct:
0.1s + 0.2s - 0.3s = 0 -> true
and:
(0.1 asScaledDecimal + 0.2 asScaledDecimal - 0.3 asScaledDecimal) printString -> '0.00'
Limited Range of Float and Double Numbers
Floating point numbers also have a limited range. In expecco, the default float format is IEEE double precision format (called "Float" or "Float64" in expecco). Numbers with an absolute value greater than 1.79769313486232E+308 will lead to a +INF/-INF (infinite) result, and numbers with absolute value smaller than 2.2250738585072E-308 will be zero.
For IEEE single precision floats (called "ShortFloat" or "Float32" in expecco), the range is much smaller, and for IEEE extended precision (called "LongFloat" or "Float80" in expecco), the range is larger.
You can ask the classes for their limits, with:
- fmin (smallest representable number larger than zero)
- fmax (largest representable number)
- emin (smallest exponent; binary)
- emax (largest exponent; binary)
- precision (bits in mantissa, incl. any hidden bit)
- decimalPrecision (digits when printed)
"remember: Float is the same as Float64"
Float fmin -> 2.2250738585072E-308
Float fmax -> 1.79769313486232E+308
Float emin -> -1022
Float emax -> 1023
Float precision -> 53
Float decimalPrecision -> 15

"remember: ShortFloat is the same as Float32"
ShortFloat fmin -> 1.175494e-38
ShortFloat fmax -> 3.402823e+38
ShortFloat emin -> -126
ShortFloat emax -> 127
ShortFloat precision -> 24
ShortFloat decimalPrecision -> 7

"remember: LongFloat is the same as Float80"
LongFloat fmin -> 3.362103143112093506E-4932
LongFloat fmax -> 1.189731495357231765E+4932
LongFloat emin -> -16382
LongFloat emax -> 16383
LongFloat precision -> 64
LongFloat decimalPrecision -> 19

Float128 fmin -> 3.36210314311209350626267781732e-4932
Float128 fmax -> 1.18973149535723176508575932662e+4932
Float128 emin -> -16382
Float128 emax -> 16383
Float128 precision -> 113
Float128 decimalPrecision -> 34

Float256 fmin -> 2.48242795146434978829932822291387172367768770607964686927095329791379e-78913
Float256 fmax -> 1.61132571748576047361957211845200501064402387454966951747637125049607e+78913
Float256 emin -> -262142
Float256 emax -> 262143
Float256 precision -> 237
Float256 decimalPrecision -> 71

QDouble fmin -> same as Float64
QDouble fmax -> same as Float64
QDouble emin -> same as Float64
QDouble emax -> same as Float64
QDouble precision -> 204
QDouble decimalPrecision -> 61

LargeFloat fmin -> 0.0 (arbitrarily small)
LargeFloat fmax -> inf (arbitrarily large)
LargeFloat emin -> -inf (arbitrarily small)
LargeFloat emax -> inf (arbitrarily large)
LargeFloat precision -> 200 (default; configurable)
LargeFloat decimalPrecision -> 60 (default; configurable)
Remember that the name "Float" refers to "Float64", which is called "double" in the C language. And also remember that the name "ShortFloat" refers to "Float32", which is called "float" in C. And finally, the name "LongFloat" refers to "Float80", which is called "long double" in C.
As a consequence, you cannot compute very large numbers using any of the CPU supported floats, and you will have to use one of the software computed float representations.
For example trying to compute the number of decimal digits of a huge number:
10000 factorial asFloat log10 -> INF
I.e. it returns infinity, because "10000 factorial asFloat" already returns INF. (The value could have been converted with "asFloatChecked", which raises an exception in that situation and is probably a good idea here. However, the regular asFloat conversion uses the underlying CPU's float support, which returns INF, similar to the behavior of other programming languages.)
In contrast, the exact integer computation succeeds:
10000 factorial log10 -> 35659.454274
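The same contrast can be seen in e.g. Python, where converting the huge integer to a float fails (Python raises OverflowError where the CPU float operation would return INF), while operating on the exact integer works:

```python
import math

n = math.factorial(10000)        # exact, arbitrary precision integer

try:
    float(n)                     # far too large for a 64-bit double
except OverflowError:
    print("float conversion overflows")

print(math.log10(n))             # ~35659.454274 (log10 accepts big ints)
print(len(str(n)))               # 35660 decimal digits
```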
Again: this is not a problem specific to expecco, but inherent to the way floating point numbers are represented in the CPU.
Speed of Operations
Machines have builtin floating point math operations, which usually work fastest in single or double precision (actually, some modern machines work faster in double than in single precision).
Unless you have special precision needs, it is best to stick with double precision, which is also portable across machines. Therefore, double precision (aka "double") is the default float format and is simply named "Float" in Smalltalk/X (and therefore also in expecco).
Trigonometric and other Math Functions
Some trigonometric and other math functions (sqrt, log, exp) will first convert the number to a limited precision real number (a C-double), and therefore may have a limited input value range and also generate inexact results.
For example, you will not get a valid result for:
10000 factorial sin
because it is not possible to represent that large a number as a machine real number (expecco will signal a domain error, as the input to sin will be +INF).
Also, the result from:
(9 / 64) sqrt
will be the (inexact) 0.375 (a double), instead of the exact 3/8 (a fraction). (This might change in a future release to provide exact results when both numerator and denominator are perfect squares.)
Complex Results
By default, Smalltalk/X will raise an error if the result of a function with a real operand would return a complex result.
For example, computing the square root of a negative number as in:
-2 sqrt
will raise an "ImaginaryResultError".
However, this is a proceedable exception, which can be caught; if the handler proceeds, a complex result is returned:
rslt := ImaginaryResultError ignoreIn:[ -2 sqrt ].
For readability, there is also an alias called "trapImaginary:" in the Number class:
rslt := Number trapImaginary:[ -2 sqrt ]
Both of the above would return the complex result:
(0+1.4142135623731 i)
Thus,
rslt := -2 sqrt squared
will result in an exception, whereas:
rslt := Number trapImaginary:[ -2 sqrt squared]
will generate a result of "-2.0".
All operations within "[" .. "]" which would produce an ImaginaryResultError will return a complex. Thus you can write:
Number trapImaginary:[
    |num1 num2|
    num1 := -2 sqrt.
    num2 := -3 sqrt.
    Transcript showCR: (num1 + num2).
]
and "(0+3.14626436994197i)" will be shown.
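For comparison: Python separates the two behaviors by module rather than by exception handling; math.sqrt raises for negative input, while cmath.sqrt returns complex results (roughly analogous to computing inside a trapImaginary: block):

```python
import cmath

print(cmath.sqrt(-2))                     # ~1.4142135623730951j
print(cmath.sqrt(-2) ** 2)                # ~(-2+0j)
print(cmath.sqrt(-2) + cmath.sqrt(-3))    # ~3.14626436994197j
```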
Undefined Results
Similar to the way imaginary results are handled, some operations are not defined for certain values (values outside the function's domain).
For example, the receiver of the arcSin operation must be in [-1 .. 1].
By default, these situations are also reported by raising an error, and therefore:
-2 arcSin
will raise such a "DomainError".
Similar to the above, this can be handled, although no useful value will be provided (in the above case, NaN (Not a Number) will be returned):
rslt := DomainError ignoreIn:[ -2 arcSin ]
or:
rslt := Number trapDomainError:[ -2 arcSin ].
Both of the above would generate a NaN as result.
Notice that if such a NaN is used in other arithmetic operations, either more exceptions or other NaNs will usually be generated (depending on the exception being handled or not).
Thus:
rslt := Number trapDomainError:[ -2 arcSin sin ].
will also generate a NaN as result.
You can check for valid results with:
aNumber isNaN              - true for NaN
aNumber isInfinite         - true for infinities
aNumber isPositiveInfinity - true for +inf
aNumber isNegativeInfinity - true for -inf
aNumber isFinite           - false for NaN or infinities (i.e. true for valid numbers)
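Equivalent checks exist in most environments; e.g. in Python:

```python
import math

nan, inf = math.nan, math.inf

print(math.isnan(nan))                          # True
print(math.isinf(inf), math.isinf(-inf))        # True True
print(math.isfinite(nan), math.isfinite(1.0))   # False True

# NaN propagates through further operations:
print(math.isnan(math.sin(nan)))                # True
```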
Overflow
When an operation's arguments are OK, but the result falls outside the range of representable numbers [fmin..fmax], you will get an infinite result (1). Further operations on these might produce more infinities or a NaN ("Not a Number"). This may be especially troublesome, if a final result gets corrupted due to an intermediate computation, as in:
a := 1e+10.
b := 1e+300.
c := 1e+20.
d := 1e+300.
rslt := (a*b) / (c*d)    -> NaN
Here, the final result is certainly representable, but the intermediate values (1e10 * 1e300) are out of the Float range [2.225E-308 .. 1.796e+308]. Thus, the computation will be:
a := 1e+10.
b := 1e+300.
c := 1e+20.
d := 1e+300.
(a*b)     -> INF
(c*d)     -> INF
INF / INF -> NaN
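The same intermediate overflow occurs with IEEE doubles in e.g. Python; note that simply reordering the operations keeps every intermediate in range, which is an alternative to switching to higher precision:

```python
import math

a, b, c, d = 1e+10, 1e+300, 1e+20, 1e+300

print(a * b)                 # inf  (1e310 overflows the double range)
print(c * d)                 # inf
print((a * b) / (c * d))     # nan  (inf/inf)

# reordering avoids the overflow entirely:
print((a / c) * (b / d))     # ~1e-10
```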
If the above intermediates are computed with a higher precision, the final result will be correct:
a := 1e+10.
b := 1e+300.
c := 1e+20.
d := 1e+300.
t := (a asLongFloat * b asLongFloat) / (c asLongFloat * d asLongFloat).
rslt := t asFloat    -> 1e-10
Of course, with LongFloats, the problem is only shifted towards larger numbers; as soon as the temporary result cannot be represented by a LongFloat:
a := 1q+100.
b := 1q+4900.
c := 1q+200.
d := 1q+4900.
rslt := (a * b) / (c * d)    -> NaN
then, you may use arbitrary precision LargeFloat numbers:
t := (a asLargeFloat * b asLargeFloat) / (c asLargeFloat * d asLargeFloat).
rslt := t asFloat    -> 1e-100
1) that is the current default behavior. Future versions may allow enabling exceptions in this situation, if customers request it. However, as most other programming languages behave similarly in these situations, most programmers are aware of these pitfalls and avoid such problems.
Different Results on Different CPUs
Since FLOAT32 and FLOAT64 arithmetic is performed by the underlying CPU hardware, different results (in the least significant bit) may be returned from math operations on different CPUs or even different versions (steppings) of the same CPU architecture.
This applies especially to trigonometric and other math functions. These are computed by power series, Taylor series or Newton approximations, with different algorithms on different systems, leading to different results.
Be prepared for this, and use the "almost-equal" comparison functions when results are to be verified.
Higher Precision Numbers
Expecco supports various inexact real formats, with different precision (i.e. number of mantissa bits). Some of those classes have alias names; these are provided to make Smalltalk/X code portable to other Smalltalk dialects (however, within expecco, you will probably not care, as it is not planned to port it to another dialect).
Name            Overall  Exponent  Mantissa  Decimal    fmin              Smalltalk/X           ANSI-Smalltalk
                Size     Size      Size (1)  Precision  fmax              Name                  Name (4)
                Bit      Bit       Bit       Digits
IEEE single     32       8         24        6          1.175494e-038     ShortFloat, FloatE,   FloatE
                                                        3.402823e+038     Float32
IEEE double     64       11        53        15         2.225074e-308     Float, FloatD,        FloatD
                                                        1.797693e+308     Float64               Double (VW)
IEEE extended   80/128   15        64/112    19/34      3.362103e-4932    LongFloat, FloatQ,    FloatQ (2)
                                                        1.189731e+4932    Float80 or Float128
quad double     4*64     11        200       60         1.175494e-038     QDouble               - (3)
                                                        3.402823e+038
IEEE quadruple  128      15        112       34         3.362103e-4932    QuadFloat, Float128   - (3)
                                                        1.189731e+4932
IEEE octuple    256      19        236       71         2.482427e-78913   OctaFloat, Float256   - (3)
                                                        1.611325e+78913
IEEE arbitrary  any      any       any       any        any               IEEEFloat             - (3)
large float     any      any       any       any        any               LargeFloat            - (3)
(1) mantissa incl. any hidden bit (normalized floats)
(2) LongFloats use the underlying CPU's long double format.
On x86/x64 machines, LongFloats are represented as 80bit extended floats with 64bit mantissa; on other CPUs, these might be represented as 128bit quadFloats with 112 bit mantissa.
(3) these are currently being developed and provided as a preview feature without warranty (meaning: they may be buggy at the moment; let us know if you need them).
(4) different Smalltalk dialects use different precisions for their floating point numbers: ST80/VW Floats are IEEE singles, V'Age Floats are IEEE doubles and ST/X Floats are IEEE doubles. VW refers to Float64 as Double.
Later, the ANSI standard defined FloatE, FloatD and FloatQ as aliases. You can use either interchangeably in expecco.
Notice that using any float format other than double precision (which is directly supported by the machine) may come at a performance price. Operation speed degrades in the order double -> single -> extended -> quad double -> IEEE arbitrary -> LargeFloat.
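As a cross-check outside of expecco (this is Python, not expecco/Smalltalk code), the standard library's <code>sys.float_info</code> exposes exactly the limits of the native IEEE double format described above:

```python
import sys

# Python's float is the machine's native IEEE double, so sys.float_info
# mirrors the "IEEE double" characteristics listed above.
print(sys.float_info.max)       # fmax, about 1.797693e+308
print(sys.float_info.min)       # smallest normalized value, about 2.225074e-308
print(sys.float_info.mant_dig)  # 53 mantissa bits (including the hidden bit)
print(sys.float_info.dig)       # 15 decimal digits of precision
```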
Constants
Some well known and common constants can be acquired by asking a number class such as the Float class:
 Float pi     -> pi (3.14159...)
 Float halfPi -> pi / 2
 Float twoPi  -> pi * 2
 Float phi    -> phi (golden ratio 1.6180...)
 Float sqrt2  -> square root of 2 (1.4142...)
 Float sqrt5  -> square root of 5 (2.2360...)
 Float ln2    -> natural log of 2 (0.69314...)
 Float ln10   -> natural log of 10 (2.30258...)
 Float e      -> e (2.718281...)
Each number class returns a representation of that constant at its own precision. I.e. if you ask the Float class for the constant "pi", you get a pi with roughly 15 digits of precision, whereas the QDouble class returns a more accurate representation.
 Float pi      -> 3.14159265358979
 ShortFloat pi -> 3.141593
 LongFloat pi  -> 3.141592653589793238
 QDouble pi    -> 3.1415926535897932384626433832795028841971693993751058209749446
 QuadFloat pi  -> 3.1415926535897932384626433832795027
 OctaFloat pi  -> 3.14159265358979323846264338327950288419716939937510582097494459230781639
 LargeFloat pi -> 3.1415926535897932384626433832795028841971693993751058209749445923078164 [many more digits...]
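The same precision limit can be observed in any language whose float is an IEEE double; a small Python sketch (the 50-digit pi literal is the well-known reference value):

```python
import math

# 50 digits of pi (well-known reference value) for comparison
PI_50 = "3.14159265358979323846264338327950288419716939937510"

# repr of an IEEE double keeps just enough digits to round-trip
print(repr(math.pi))   # 3.141592653589793

# the double agrees with the true value only to about 15-16 significant digits
assert PI_50.startswith(repr(math.pi)[:16])
```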
Examples
Code examples in Smalltalk syntax. Notice the float precision qualifiers:
 'q'  -> longFloat (= IEEE extended precision = Float80)
 'Q'  -> quadFloat (= IEEE quadruple precision = Float128)
 'QO' -> octaFloat (= IEEE octuple precision = Float256)
 'QD' -> qDouble
 'QL' -> largeFloat (arbitrary precision)
Square Root:
 2.0 sqrt asShortFloat      -> 1.414214
 2.0 sqrt                   -> 1.4142135623731
 2.0q sqrt                  -> 1.414213562373095049
 2.0Q sqrt                  -> 1.4142135623730950488016887242097
 2.0QD sqrt                 -> 1.4142135623730950488016887242096980785696718753769
 2.0QO sqrt                 -> 1.414213562373095048801688724209698078569671875376948073176679737990732
 (2.0QL precision:200) sqrt -> 1.414213562373095048801688724209698078569671875376948073176679
 (2.0QL precision:400) sqrt -> 1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297025
 (2.0QL precision:800) sqrt -> 1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297024924836055850737212644121497099935831413222665927505592755799950501152782060571470109559971605970274...

 Precision value from Wolfram:
 1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727350138462309122970249248360...
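Readers without expecco at hand can reproduce such arbitrary-precision results with Python's stdlib <code>decimal</code> module; a sketch where <code>prec</code> plays roughly the same role as the precision: argument of the LargeFloat examples:

```python
from decimal import Decimal, getcontext

# compute sqrt(2) to 60 significant digits
getcontext().prec = 60
r = Decimal(2).sqrt()
print(r)  # 1.4142135623730950488016887242096980785696718753769...
```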
Cubic Root:
 2.0 cbrt asShortFloat      -> 1.259921
 2.0 cbrt                   -> 1.25992104989487
 2.0q cbrt                  -> 1.259921049894873165
 2.0Q cbrt                  -> 1.25992104989487316476721060727823
 2.0QD cbrt                 -> 1.2599210498948731647672106072782283505702514647015
 2.0QO cbrt                 -> 1.2599210498948731647672106072782283505702514647015079800819751121553
 (2.0QL precision:200) cbrt -> 1.259921049894873164767210607278228350570251464701507980081975
 (2.0QL precision:400) cbrt -> 1.259921049894873164767210607278228350570251464701507980081975112155299676513959483729396562436255094154310256035615665259
 (2.0QL precision:800) cbrt -> 1.259921049894873164767210607278228350570251464701507980081975112155299676513959483729396562436255094154310256035615665259399024040613737228459110304269355246960642616625000977474526565480306867185405518689245872516764199373709695098382...

 Wolfram:
 1.2599210498948731647672106072782283505702514647015079800819751121552996765139594837293965624362550941543102560356156652593990240...
Exponentiation:
 2.0 exp asShortFloat      -> 7.389056
 2.0 exp                   -> 7.38905609893065
 2.0q exp                  -> 7.389056098930650227
 2.0Q exp                  -> 7.38905609893065022723042746057499
 2.0QD exp                 -> 7.3890560989306502272304274605750078131803155705518
 2.0QO exp                 -> 7.38905609893065022723042746057500781318031557055184732408712782252257266
 (2.0QL precision:200) exp -> 7.389056098930650227230427460575007813180315570551847324087123
 (2.0QL precision:400) exp -> 7.389056098930650227230427460575007813180315570551847324087127822522573796079057763384312485079121794773753161265478866123
 (2.0QL precision:800) exp -> 7.389056098930650227230427460575007813180315570551847324087127822522573796079057763384312485079121794773753161265478866123884603692781273374478392213398077774900122895607410753702391330947550682086581820269647868208404220982255234875742...

 Wolfram:
 7.3890560989306502272304274605750078131803155705518473240871278225225573796079057763384312485079121794773753161265478866123884603692781273374478...
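As with the square root example, the stdlib <code>decimal</code> module offers a way to cross-check the high-precision exponentiation results outside of expecco:

```python
from decimal import Decimal, getcontext

# e^2 to 50 significant digits, mirroring the 2.0QD exp example above
getcontext().prec = 50
e2 = Decimal(2).exp()
print(e2)  # 7.3890560989306502272304274605750078131803155705518
```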