Numeric Limits
Expecco supports arbitrary precision integer arithmetic, arbitrary precision fractions, limited and arbitrary precision floating point numbers in various precisions, and complex numbers. In addition, there are special purpose numbers for monetary and decimal representations.

Being based on Smalltalk/X, expecco provides a complete set of number classes.
== Syntax (In Smalltalk and FreezeValues) ==

See also: "[[Smalltalk_Syntax_Cheat_Sheet]]".
====Integers (arbitrary size)====

 1234567
 0xFF00AA (base 16)
 0b01010101 (base 2)
 -0xAFFE
 3r121212 (base 3)
 4r123123 (base 4)

====Fractions (arbitrary integral numerator and denominator)====

 (1/2)
 (1/101)
 (-1/3)

====ScaledDecimals====

 123s2
 123.456s2

====Floats / Float64 (actually 64bit IEEE doubles)====

 12.456
 1e17
 1.0e23
 12e
 .5 -- illegal
 .5e -- illegal

====LongFloats / Float80 (80bit IEEE long doubles)====

 12.456q
 1e17q
 1.0e23q
 23q

====QuadFloats / Float128 (128bit IEEE quadruple floats)====

 12.456Q
 1e17Q
 1.0e23Q

====OctupleFloats / Float256 (256bit IEEE octuple floats)====

 12.456QO
 1e17QO
 1.0e23QO

====LargeFloats (arbitrary precision (defaults to 200bit) software floats)====

 12.456QL
 1e17QL
 1.0e23QL

====QDoubles (4 IEEE doubles combined)====

 12.456QD
 1e17QD
 1.0e23QD

====Complex====

 1+5i
 4i
 1.23+5.67i
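For readers more familiar with mainstream languages, here is a rough Python analogue of some of these literal forms. Python has no built-in fraction or radix-3 literals, so library calls stand in for them; this is an illustration only, not expecco syntax:

```python
from fractions import Fraction

# Integer literals in various bases (arbitrary size, as in expecco)
print(0xFF00AA)          # base 16 -> 16711850
print(0b01010101)        # base 2  -> 85
print(int('121212', 3))  # base 3, analogous to Smalltalk's 3r121212 -> 455

# Exact fractions (no fraction literal in Python; use the fractions module)
print(Fraction(1, 3))    # analogous to (1/3)

# Floats and complex numbers (Python writes j instead of i)
print(1.0e23)
print(1 + 5j)            # analogous to 1+5i
```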
== Exact Integer Numbers ==

For integer operations, there is no overflow or error in the result for any legal operation. I.e. operations on two big integers deliver a correct and exact result.

This is a feature of the underlying Smalltalk runtime environment and in contrast to many other programming languages (especially: Java and C) which provide int (usually 32bit) and long (usually 64bit) integer types.<br>In expecco, you can write both in Smalltalk and in the builtin JavaScript syntax:<sup>1</sup>

 2147483647 "(0x7FFFFFFF)" + 1
 -> a huge number beginning with: 284625968091705451890641321211986889014....

Smalltalk will automatically convert any result which is too large to fit into a machine-integer into a LargeInteger (with an arbitrary number of bits) and also automatically return results converted back to a small representation, if possible.<sup>2</sup>

<br>Thus, although the two operands to the division in the following example are large integers,

 rslt := (1000 factorial) / (999 factorial)

Hint: Therefore, you can use a [[Tools_Notepad/en |Workspace (Notepad) window]] as a calculator with arbitrary precision.
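The same exactness can be tried in Python, whose integers are also arbitrary precision (a sketch of the idea, not expecco code):

```python
import math

# No overflow: adding 1 to the largest 32-bit signed value just works
print(2147483647 + 1)  # -> 2147483648

# Operations on two huge integers deliver an exact result
big = math.factorial(1000)         # a 2568-digit integer
print(big // math.factorial(999))  # -> exactly 1000
```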
<sup>1</sup>) be aware that this only computes correct results if the elementary action is written in either Smalltalk or in the builtin JavaScript syntax.<br>Depending on the version of the external language interpreter, it may or may not be correct if using Java/Groovy, Python, C/C++, Node.js etc.

<sup>2</sup>) the integer class provides functions which operate in the limited 32 or 64 bit range. These might be useful if you have to verify results or repeat computations as returned by corresponding C or Java operations.
== Exact Fractions, ScaledDecimals and FixedDecimals ==

 1000 factorial / 999 factorial -> 1000

There are also ''truncating division'' operators: "//", which returns an integer truncated towards negative infinity (i.e. the next smaller integer), and "quo:", which truncates towards zero. "quo:" is what you'd get in Java or C.
 5 // 3 -> 1
 -5 // 3 -> -2

The corresponding modulo operators "\\" and "rem:" provide the remainder, such that:

 (a // b) * b + (a \\ b) = a

The "\\" is the standard Smalltalk remainder operator; Smalltalk/X also provides "%" as an alias.<br>Thus you can also write:

 (a // b) * b + (a % b) = a

For positive a and b, the two operator pairs deliver the same result. For negative arguments, these are different. Be aware and think about the domain of your arguments.

Be reminded that the Smalltalk remainder "%" returns different results than C or Java, when operands are negative.
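Python's "//" and "%" happen to follow the same floor-based convention as Smalltalk's "//" and "\\", while the C/Java behaviour (Smalltalk's "quo:" and "rem:") can be emulated with truncation. A sketch of the difference, assuming that mapping:

```python
import math

a, b = -5, 3

# Floor division/remainder (like Smalltalk "//" and "\\"): towards negative infinity
print(a // b, a % b)   # -> -2 1

# Truncating quotient/remainder (like Smalltalk "quo:"/"rem:", and C/Java "/" and "%")
quo = math.trunc(a / b)
rem = a - quo * b
print(quo, rem)        # -> -1 -2

# Both pairs satisfy the identity (quotient * b) + remainder = a
assert (a // b) * b + (a % b) == a
assert quo * b + rem == a
```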
In addition, the usual ceiling, floor and rounding operations are available (both on fractions and on limited precision reals):

use ScaledDecimals (which for backward compatibility are also called "FixedPoint" <sup>(1)</sup>). These are also exact fractions, but print differently: you can specify the number of digits to be printed and it will print itself rounded on the last digit. In other words: the computation and internal value will be exact (as with Fractions), and therefore, no rounding errors will accumulate. Only when printed will the external representation be rounded to the specified number of decimal places.

 (5 / 3) asScaledDecimal:2 -> 1.67
 1.2 asScaledDecimal:3 -> 1.200
 Float pi asScaledDecimal:5 -> 3.14159
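The same idea (exact internal value, rounding only at print time) can be sketched in Python with the fractions module, formatting the exact value only for display:

```python
from fractions import Fraction

x = Fraction(5, 3)        # exact internal value, like a ScaledDecimal

# The value stays exact through arithmetic; no rounding error accumulates
y = x + x + x             # exactly 5

# Rounding happens only in the printed representation
print(f"{float(x):.2f}")  # -> 1.67 (rounded on the last printed digit)
print(y)                  # -> 5
```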
1) the class was previously called "FixedPoint" and the converters were called "asFixedPoint:". For compatibility with other Smalltalk dialects, these have aliases "ScaledDecimal" and "asScaledDecimal:".<br>Both the old class name and the old operators are and will be supported in the future for backward compatibility (as aliases),
=== FixedDecimal ===

As mentioned above, a ScaledDecimal keeps the exact value internally, but prints itself rounded to a given number of decimal digits.

Smalltalk/X provides an alternative class called "''FixedDecimal''" (1), which always keeps a rounded value internally. These may be better suited for monetary values, especially in computed additive sums which are printed in a table, as the sum of two FixedDecimals will always be the presented (printed) sum of the two FixedDecimals. In contrast, with ScaledDecimals, you may see a sum which differs from what the presented table values suggest.

For example:
Thus, when multiplying a float and a fixed decimal, you'll get a float result, whereas if you multiply an integer and a fixed decimal, the result will be a fixed decimal. Finally, when multiplying a fixed decimal and a scaled decimal, the result will be a fixed decimal.
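The difference between the two behaviours can be sketched in Python: a FixedDecimal-like value rounds after every operation (emulated here with decimal.Decimal and quantize), while a ScaledDecimal-like value stays exact and is rounded only when printed, so the printed column may not add up. This is an illustration of the idea, not the expecco classes:

```python
from decimal import Decimal
from fractions import Fraction

CENT = Decimal('0.01')

# "FixedDecimal"-like: round to cents after each step
a = (Decimal(1) / 3).quantize(CENT)  # 0.33
b = (Decimal(1) / 3).quantize(CENT)  # 0.33
fixed_sum = (a + b).quantize(CENT)   # 0.66: exactly the sum of the printed values

# "ScaledDecimal"-like: keep exact values, round only for printing
ea, eb = Fraction(1, 3), Fraction(1, 3)
printed_sum = f"{float(ea + eb):.2f}"  # "0.67": differs from 0.33 + 0.33

print(fixed_sum, printed_sum)
```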
1) both names "ScaledDecimal" and "FixedDecimal" have been chosen a bit unwisely, and may be confusing. However, these cannot easily be changed for backward and cross Smalltalk dialect compatibility reasons. We apologize.
== Inexact Float and Double Numbers ==

Floating point numbers are inherently inexact and almost always represent an approximated value. The error depends on the floating point number's precision, which is the number of bits with which the value is approximated. There are numbers which can never be represented as a floating point number, whatever precision is used. Even innocent looking numbers (e.g. "0.1") are of this kind.

This is not a problem specific to expecco, but inherent to the way floating point numbers are represented (in the machine). <br>See [https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html "What Every Computer Scientist Should Know About Floating-Point Arithmetic"], [https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf "Mindless Assessments of Roundoff in Floating-Point Computation"] and [https://www-users.math.umn.edu/~arnold/disasters "Some disasters attributable to bad numerical computing"].<br>A very impressive example of how wrong double precision IEEE arithmetic can be is described in "[[Do_not_trust_Floating_Point]]" <sup>(1)</sup>.

Floating point numbers are represented as a sum of powers of 2 (actually 1/2 + 1/4 + 1/8 + ...) called the "''mantissa''", then multiplied by 2 raised to an exponent. I.e.

 value = mantissa * (2 ** exponent)
with the mantissa being normalized to be a sum in the interval 0.5..1 (as listed above), and the exponent stored with an offset (called "ebias"). The minimum exponent (0) is reserved for the zero number and non-normalized tiny numbers (called "''subnormals''"); the maximum exponent is reserved for infinities and NaNs ("''not a number''"). These might be returned from some operations if invalid arguments are provided (for example, trying to take the logarithm of a negative number).

The number of exponent bits determines the largest and smallest representable magnitudes, the number of mantissa bits determines the relative error. The error depends on the value of the last mantissa bit, which depends on the exponent. This value is called "''Unit in the Last Place''" or "ULP" (see [https://en.wikipedia.org/wiki/Unit_in_the_last_place Wikipedia]).<br>For a large number like 1e100, one ULP is the very large 1.94266889222573e+84, whereas for a small number like 0.5, it is 1.11022302462516e-16.

Floating point formats differ in the number of bits (single/double/extended precision etc.).<br>A double precision IEEE float has 11 bits for the exponent and 53 for the mantissa (see [https://en.wikipedia.org/wiki/IEEE_754 IEEE floating point formats]). As a rule of thumb, the error in the last bit of a double precision IEEE float is roughly 15 to 16 orders of magnitude smaller than the magnitude of the (double precision) floating point number. The error is larger for single precision (32bit) floats and smaller for extended floats (80, 128 or more bits).
<sup>1)</sup>: If you do not believe me, try the following example from one of the mentioned papers in (e.g.) Excel or your favourite programming language:

 v := 4/3.    "/ or maybe (4.0/3.0)
 w := v - 1.
 x := w*3.
 y := x - 1.
 z := y * (2**52).
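In Python (which uses binary doubles), the same experiment makes the rounding error of 4/3 visible: with exact arithmetic the final value would be 0, but scaling the tiny residual by 2**52 blows it up to a whole -1.0 (a sketch, assuming the final step is meant to scale the residual error y):

```python
v = 4 / 3          # rounded to 53 bits: not exactly 4/3
w = v - 1          # should be 1/3
x = w * 3          # should be 1
y = x - 1          # should be 0, but is -2**-52
z = y * (2 ** 52)  # the "invisible" last-bit error, scaled up
print(z)           # -> -1.0
```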
=== Limited Precision ===

Due to the limited number of bits in the mantissa, different values may end up in the same floating point representation. For example, both 9223372036854776000 and 9223372036854775808 will end up being represented as the same float when converted from integer to float. The reason is that there are simply not enough bits in the mantissa.

For example:

 9223372036854776000 asFloat = 9223372036854775808 asFloat

will return "true", and the difference will be zero in:

 9223372036854776000 asFloat - 9223372036854775808 asFloat

in contrast to the correct result being returned when comparing/subtracting them as integers:

 9223372036854776000 = 9223372036854775808.

<small>=> false</small>

 9223372036854776000 - 9223372036854775808.

<small>=> 192</small>

Try it in a workspace window.
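You can try the same in Python, where int is arbitrary precision and float is an IEEE double; the two integers collapse to the same double because one ULP near 2**63 is 2048:

```python
a = 9223372036854776000
b = 9223372036854775808  # 2**63

# As doubles, both round to 2**63: the difference of 192 is below half a ULP (1024)
print(float(a) == float(b))  # -> True
print(float(a) - float(b))   # -> 0.0

# As exact integers, the comparison and difference are correct
print(a == b)                # -> False
print(a - b)                 # -> 192
```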
Also, many numbers (actually: most numbers) cannot be exactly represented by a finite sum of powers of 2. Such numbers will have an error in the last significant bit (actually half the last bit). When floating point numbers are added or multiplied, the result is usually computed internally with a few more mantissa bits, and then rounded on the last bit, to fit the mantissa's number of bits. Notice that this may not be immediately obvious, because the print functions (such as printf) cheat, and round again on the last bit. Thus, a result such as 0.9999999... would still be printed as "1.0".

The situation may be relaxed slightly by using more bits for the mantissa: expecco gives you a choice of 32bit (called "''ShortFloat''"), 64bit ("''Float''"), 80bit ("''LongFloat''") and even more, which are mapped to corresponding IEEE floats (single, double and extended).
Repeating the above example with long floats, there are enough mantissa bits and the numbers are no longer represented or considered equal,<br>thus:

 9223372036854776000 asLongFloat = 9223372036854775808 asLongFloat

yields "false" as answer, and the computation:

 9223372036854776000 asLongFloat - 9223372036854775808 asLongFloat

will give 192.0 as answer.

However, even with more bits, the fundamental restriction remains, although appearing less frequently with higher precision. But be aware that many numbers (such as 1/10, 1/5, 1/3) can '''never''' be represented exactly, no matter how many bits are used. So even the above mentioned innocent looking "0.1" is actually an approximation and wrong in the last bit.
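Python can show the exact value that is actually stored for the literal 0.1 (the nearest 53-bit double), confirming that it is not 1/10:

```python
from fractions import Fraction

stored = Fraction(0.1)  # the exact binary value behind the literal 0.1
print(stored)           # -> 3602879701896397/36028797018963968
print(stored == Fraction(1, 10))  # -> False: 0.1 is only approximated
```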
The limited precision may lead to "strange" results, especially when operands are far apart; for example, when subtracting a very small value from a much larger one, as in:

which gives 2149999999999.999987 as result (still incorrect due to its 64 bit mantissa, but much better).

If you are willing to trade speed for precision, you can use one of expecco's builtin higher precision representations or even the arbitrary precision representation, and compute with more bits of precision. The QDouble class provides a compromise between speed and precision, providing roughly 200 bits of precision; alternatively, it represents a combination of up to 4 arbitrarily valued doubles (i.e. it can represent the sum of a very large and a small number):
 (2.15e12 asQDouble) - (1.25e-5 asQDouble)
 (2.15e12 asQDouble) - ((2.15e12 asQDouble) - (1.25e-5 asQDouble))

Another representation supports an arbitrary number of precision bits (here 200):

 (2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200)
 (2.15e12 asLargeFloatPrecision:200) - ((2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200))

Both return the correct results 2149999999999.9999875 and 1.25e-5 respectively.
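A comparable trade of speed for precision is available in Python's decimal module, where the working precision is configurable; with enough digits, the far-apart subtraction comes out exact (a sketch of the same idea, not expecco's QDouble/LargeFloat classes):

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # roughly comparable to 200 bits of precision

big, small = Decimal('2.15e12'), Decimal('1.25e-5')
print(big - small)          # -> 2149999999999.9999875
print(big - (big - small))  # -> 0.0000125
```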
Be aware that the higher precision and arbitrary precision operations are much slower than the ones which are directly supported by the processor (which has special hardware, usually for single, double and extended precision). Also, these extended classes are still being developed and not yet released for official use (meaning they may contain bugs, especially in their trigonometric and other math functions, at the time of writing).

You may use fractions,

 2150000000000 - (1 / 125000)

to compute the exact result: (268749999999999999/125000)

(of course, these are less convenient to read, and should probably be presented to the end-user as ScaledDecimals.)

Also, do not forget that a conversion of one number to a higher precision number cannot ''magically'' generate missing bits. <br>For example, given a 32 bit floating point number which is already an approximation (i.e. the real value cannot be represented as an exact sum of powers of 2), the conversion will give you another such approximation, with the same error. Thus, "0.25 asLongFloat" will give you an exact 0.25 (because 0.25 is representable), whereas "0.1 asLongFloat" will not give an exact "0.1". Actually, the result of such a conversion will usually not give you the full possible precision. If you need a constant with the max. precision, either enter it as such (i.e. "0.1q" instead of "0.1 asLongFloat") or read it from a string (i.e. LongFloat fromString:'0.1').
=== Floating Point Errors Propagate ===

The above rounding and last bit errors will accumulate with every math operation performed (and may even do so wildly).

For example, the already mentioned 0.1 cannot be exactly represented as a floating point number, and is actually 0.099999..X with an error in the last bit (half a ULP). <br>Adding this multiple times will result in a larger and larger error in the final result:

 1.0 - (0.1 + 0.1 + 0.1 ...(10 times)... + 0.1) -> 1.11022302462516E-16
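The accumulation is easy to reproduce in Python (same IEEE doubles):

```python
total = 0.0
for _ in range(10):
    total += 0.1    # each addition injects a little more rounding error

print(total)        # -> 0.9999999999999999, not 1.0
print(1.0 - total)  # -> 1.1102230246251565e-16
```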
The print functions will try to compensate for an error in the last bit(s), showing "0.1" although in reality it is "0.09999..." (they round before printing).

Thus, even though the printed representation of such a number might look ok, it will inject more and more error when the value is used in further operations (up to the point when the error accumulates out of the last bit, print is no longer able to cheat, and the error becomes visible).

This is especially inconvenient when monetary values or counts are represented as floats and a final sum is wrong in the penny value<br>(and therefore, a ''real programmer'' will '''never ever use floating point numbers to represent monetary values'''!).

<br>As an example, try to sum an array consisting of 10 values:
As a concrete example, try:

 (0.2 + 0.1 - 0.3) = 0

which will return <code>false</code>, and if you print "0.2 + 0.1 - 0.3", you might get something like: "5.55111512312578e-17".
<br>Even increasing the precision does not really help; if we went to 200bits precision, we'd still get a small error:

 (0.2QL + 0.1QL - 0.3QL) printString

gives "-3.111507638930570853572...e-61"
The problem also occurs when comparing numbers with different precision. For example, consider that a float32 value is to be compared against a constant. The float32 might be read from a file or provided by a measurement device via any communication mechanism. If we compare it against a higher precision value, the missing bits in the shorter float are filled with zeros. Thus:

 (Float32 readFrom:'0.125') = 0.125

leads to a <code>true</code> value, whereas:

 (Float32 readFrom:'0.123') = 0.123

returns <code>false</code>.

<br>The reason for this is that 0.125 can be represented as an exact float in both float32 and float64 formats, whereas 0.123 is non-exact and has repeating binary digits. Its representation as float64 is:

 0 01111111011 1111011111001110110110010001011010000111001010110000

and:

 0 01111011 11110111110011101101101

as float32. When comparing, the float32 is expanded to:

 0 01111011 111101111100111011011010000000000000000000000000000000

which is obviously different.
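The effect can be reproduced in Python, which has no float32 type of its own but can round a value to float32 precision via the struct module (an illustration of the same comparison problem):

```python
import struct

def as_float32(x):
    """Round a Python float (a float64) to float32 precision and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

print(as_float32(0.125) == 0.125)  # -> True: 0.125 is exact in both formats
print(as_float32(0.123) == 0.123)  # -> False: the float32 lost mantissa bits
print(as_float32(0.123))           # -> a slightly different value near 0.123
```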
Instead of comparing against a constant, either use range-compares and/or use the special "''compare-almost-equal''" functions, where the number of bits of acceptable error can be specified (so called: "''ULPs''"). Expecco provides such functions both for elementary code and in the standard action block library.
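Python offers comparable tools: math.isclose for range-compares with a relative tolerance, and math.ulp (Python 3.9+) to reason about acceptable error in units of the last place — shown here as an analogy to expecco's compare-almost-equal functions:

```python
import math

a, b = 0.1 + 0.2, 0.3

print(a == b)                            # -> False: exact compare fails
print(math.isclose(a, b, rel_tol=1e-9))  # -> True: range-compare succeeds

# The actual error is within one ULP of 0.3
print(abs(a - b) <= math.ulp(0.3))       # -> True
```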
Again, for this reason, do not compute money values using floats or doubles. Instead, use instances of ScaledDecimal. Otherwise you might lose a cent/penny here and there, if you use floats/doubles on big budgets.
and:

 (0.1 asScaledDecimal + 0.2 asScaledDecimal - 0.3 asScaledDecimal) printString => 0.00

Find further insight [https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ here].
=== Limited Range of Float and Double Numbers ===

Floating point numbers also have a limited range; there are smallest and largest representable values.

<br>In expecco, the default float format is IEEE double precision format (called "''Float''" or "''Float64''" in expecco). Numbers with an absolute value greater than 1.79769313486232E+308 will lead to a +INF/-INF (infinite) result, and numbers with an absolute value smaller than 2.2250738585072E-308 will be zero.

For IEEE single precision floats (called "''ShortFloat''" or "''Float32''" in expecco), the range is much smaller, and for IEEE extended precision (called "''LongFloat''" or "''Float80''" in expecco), the range is larger.

You can ask the classes (or their instances <sup>1</sup>) for their limits, with:
* fmin (smallest representable number larger than zero)
* fmax (largest representable number),
* decimalPrecision (digits when printed),

<small>remember Float is the same as Float64</small>
 Float fmin -> 2.2250738585072E-308
 Float fmax -> 1.79769313486232E+308
 Float decimalPrecision -> 15

<small>remember ShortFloat is an alias for Float32</small>

 ShortFloat fmin -> 1.175494e-38
 ShortFloat fmax -> 3.402823e+38
 ShortFloat precision -> 24
 ShortFloat decimalPrecision -> 7
<small>remember LongFloat is an alias for Float80</small>

 LongFloat fmin -> 3.362103143112093506E-4932
 LongFloat fmax -> 1.189731495357231765E+4932
 LongFloat precision -> 64
 LongFloat decimalPrecision -> 19

<small>QuadFloat is an alias for Float128</small>

 Float128 fmin -> 3.36210314311209350626267781732e-4932
 Float128 fmax -> 1.18973149535723176508575932662e+4932
 Float128 precision -> 113
 Float128 decimalPrecision -> 34
||
<small>OctaFloat is an alias for Float256</small>
 Float256 fmin -> 2.48242795146434978829932822291387172...5329791379e-78913
 Float256 fmax -> 1.61132571748576047361957211845200501...7125049607e+78913
 Float256 emin -> -262142
 Float256 emax -> 262143
 Float256 precision -> 237
 Float256 decimalPrecision -> 71
 QDouble fmin -> <small>same as float64</small>
 QDouble fmax -> <small>same as float64</small>
 QDouble emin -> <small>same as float64</small>
 QDouble emax -> <small>same as float64</small>
 QDouble precision -> 204
 QDouble decimalPrecision -> 61
 aLargeFloat fmin -> 0.0 <small>arbitrary small</small> <sup>1</sup>
 aLargeFloat fmax -> inf <small>arbitrary large</small>
 aLargeFloat emin -> -inf <small>arbitrary small</small>
 aLargeFloat emax -> inf <small>arbitrary large</small>
 aLargeFloat precision -> 200 <small>default; configurable</small>
 aLargeFloat decimalPrecision -> 60 <small>default; configurable</small>
Remember that the name "Float" refers to "Float64", which is called "double" in the C language.
<br>The name "ShortFloat" refers to "Float32", which is called "float" in C.
<br>And finally, the name "LongFloat" refers to "Float80", which is called "long double" in C.

As a consequence of the limits, you cannot compute very large numbers using any of the CPU supported floats, and you will have to use one of the software computed float representations.
<br>For example, trying to compute the number of decimal digits of a huge number:
 10000 factorial asFloat log10 -> INF

Again: this is not a problem specific to expecco,
but inherent to the way floating point numbers are represented in the CPU.

<sup>1)</sup>because LargeFloats have an individual precision (per instance), you should ask the instance, not the class. The class will return the default values (which are valid for 200 bits of resolution).
=== Speed of Operations ===

Machines have builtin floating point math operations, which usually work fastest in single or double precision (actually, some modern machines work faster in double than in single precision).
Unless you have special precision needs, it is best to stick with double precision, which is also portable across machines.
Therefore, double precision floats (aka "''double''") are the default and simply named "''Float''" in Smalltalk/X (and therefore also in expecco) <sup>1</sup>.

<sup>1)</sup>The reason for calling them "Float" is historic. There exist Smalltalk dialects where Floats are 32bit IEEE floats, and others where they are 64bit. To be able to import code from either dialect, Smalltalk/X uses double precision for "Float".<br>To make your intention clear, it is recommended to use the explicit names "Float32", "Float64" etc.
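To get a feeling for the relative cost on your own machine, you can time the same loop in different representations. This is only a sketch: the loop count is arbitrary, and "millisecondsToRun:" is assumed to be the usual Smalltalk/X timing utility:
 |x t1 t2|
 x := 1.0.
 t1 := Time millisecondsToRun:[ 100000 timesRepeat:[ x := x * 1.000001 ]].
 x := 1.0 asLargeFloat.
 t2 := Time millisecondsToRun:[ 100000 timesRepeat:[ x := x * 1.000001 asLargeFloat ]].
 Transcript showCR:('Float: ' , t1 printString , ' ms; LargeFloat: ' , t2 printString , ' ms').
Expect the LargeFloat loop to be slower by several orders of magnitude.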
== Trigonometric and other Math Functions ==

will be the (inexact) 0.375 (a double), instead of the exact 3/4 (a fraction).
(this might change in a future release and provide exact results if both numerator and denominator are perfect squares)

You can however first convert to a higher precision float and then apply the function. These will compute using a Taylor series or Newton approximation, taking the number's precision into account:
 2 sqrt
<small>=> 1.4142135623731 (computed with float64 precision)</small>
 2 asFloat128 sqrt
<small>=> 1.41421356237309504880168872421</small>
 2 asFloat256 sqrt
<small>=> 1.41421356237309504880168872420969807856967187537694807317667973799073</small>
 (2 asLargeFloatPrecision:500) sqrt
<small>=> 1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727350138462309122970249248360558507372126441214971</small>
== Complex Results==

Taking the square root of a negative number, as in "-2 sqrt",
will raise an "ImaginaryResultError".
However, this is a ''proceedable exception'' <sup>1</sup>, which can be caught and if the handler proceeds, a complex result is returned:
 rslt := ImaginaryResultError ignoreIn:[ -2 sqrt ].
For readability, there is also an alias called "trapImaginary:" in the number class:
 rslt := Number trapImaginary:[ -2 sqrt ].
Squaring such a result, as in:
 rslt * rslt
will generate a result of "-2.0".

All operations within "[" .. "]" which would produce an ImaginaryResultError will return a complex. Thus you can write:
 Number trapImaginary:[
     |num1 num2|
     num1 := -2 sqrt.
     num2 := -3 sqrt.
     Transcript showCR: (num1 + num2).
 ]
and "(0+3.14626436994197i)" will be shown.

<sup>1)</sup>Proceedable exceptions are among the unique features of the Smalltalk programming language; exceptions may be raised as being proceedable, and an exception handler may then "proceed" and provide an alternative return value from the failed operation. This mechanism is used here, where the exception handler - if present - can decide to return an imaginary result.
== Undefined Results, NaN and Domain Errors ==

Similar to the way imaginary results are handled, some operations are not defined for certain values (values outside the function's domain).
<br>For example, the receiver of the <code>arcSin</code> operation must be in [-1 .. 1].

The following predicates can be used to check for such results:
 ''aNumber'' isNaN               - true for NaN ("''Not a Number''")
 ''aNumber'' isInfinite          - true for infinities
 ''aNumber'' isPositiveInfinity  - true for +inf
 ''aNumber'' isNegativeInfinity  - true for -inf
 ''aNumber'' isFinite            - false for NaN or infinities (i.e. true for valid numbers)
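For example, a quick check in a workspace (a sketch, assuming the usual Smalltalk/X class messages "Float infinity" and "Float nan" to obtain such values):
 Float infinity isInfinite                  -> true
 Float infinity isPositiveInfinity          -> true
 Float infinity negated isNegativeInfinity  -> true
 Float nan isNaN                            -> true
 Float nan isFinite                         -> false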
== Overflow ==

When an operation's arguments are OK, but the result falls outside the range of representable numbers [fmin..fmax], you will get an infinite result <sup>1)</sup>.<br>Further operations on these might produce more infinities or a NaN ("''Not a Number''").
This may be especially troublesome, if a final result gets corrupted due to an intermediate computation, as in:
 a := 1e+10.
 b := 1e+300.    "example values; both a*b and c*d overflow the float range"
 c := 1e+200.
 d := 1e+210.
 rslt := (a * b) / (c * d).
 -> NaN
Performing the intermediate computation with higher precision avoids this:
 t := (a asLargeFloat * b asLargeFloat) / (c asLargeFloat * d asLargeFloat).
 rslt := t asFloat.
 -> 1e-100
<sup>1)</sup> that is the current default behavior. Future versions may allow enabling exceptions in this situation, if there are customer requests. However, as most other programming languages behave similarly in these situations, most programmers are aware of these pitfalls and avoid such problems.
== Different Results on Different CPUs ==

Due to differences in rounding, in the internal precision of intermediate results and in the math libraries, floating point computations may deliver slightly different results on different CPUs.
<br>Be prepared for this, and use the "''almost-equal''" comparison functions when results are to be verified.
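To see what such a comparison does, a relative-tolerance check can be written with plain arithmetic; the epsilon below is an arbitrary choice for illustration, not a recommended value:
 |a b eps|
 a := (2 sqrt) * (2 sqrt).     "mathematically exactly 2"
 b := 2.0.
 eps := 1e-12.
 (a - b) abs <= (eps * (a abs max: b abs))     -> true
whereas the exact comparison "a = b" may or may not be true, depending on rounding.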
== Summary of Higher Precision Numbers ==

Expecco supports various inexact real formats, with different precision (i.e. number of mantissa bits).
The class names differ between Smalltalk dialects
(however, within expecco, you will probably not care, as it is not planned to port it to another dialect).
 Name            Overall  Exponent  Mantissa            Decimal    fmin             ST/X         ANSI-ST             ST80/VW             IBM VA-ST
                 Size     Size      Size <sup>1)</sup>  Precision  fmax             Name         Name <sup>4)</sup>  Name <sup>4)</sup>  Name <sup>4)</sup>
                 Bit      Bit       Bit                 Digits
 IEEE single     32       8         24                  6          1.175494e-038    ShortFloat   FloatE              Float               -
                                                                   3.402823e+038    FloatE
                                                                                    Float32
 IEEE double     64       11        53                  15         2.225074e-308    Float        FloatD              Double              Float
                                                                   1.797693e+308    FloatD
                                                                                    Float64
                                                                                    Double
 IEEE extended   80/128   15        64/112              19/34      3.362103e-4932   LongFloat    FloatQ <sup>2)</sup> -                  -
                                                                   1.189731e+4932   FloatQ
                                                                                    Float80
                                                                                    or Float128
 quad double     4*64     11        204                 60         1.175494e-038    QDouble      - <sup>3)</sup>     -                   -
                                                                   3.402823e+038
 IEEE quadruple  128      15        112                 34         3.362103e-4932   QuadFloat    - <sup>3)</sup>     -                   -
                                                                   1.189731e+4932   Float128
 IEEE octuple    256      19        236                 71         2.482427e-78913  OctaFloat    - <sup>3)</sup>     -                   -
                                                                   1.611325e+78913  Float256
 IEEE arbitrary  any      any       any                 any        any              IEEEFloat    - <sup>3)</sup>     -                   -
                                                                   any
 large float     any      any       any                 any        any              LargeFloat   - <sup>3)</sup>     -                   -
                                                                   any
(1) mantissa incl. any hidden bit (normalized floats)

(2) LongFloats use the underlying CPU's long double format.<br>On x86/x64 machines, LongFloats are represented as 80bit extended floats with 64bit mantissa; on other CPUs these might be represented as 128bit quadFloats with 112 bit mantissa (eg. a SPARC CPU does this).

(3) these are still being developed and provided as a preview feature without warranty (meaning: they may be buggy at the moment; let us know if you need them).

(4) different Smalltalk dialects use different precisions for their floating point numbers: ST80/VW Floats are IEEE singles, V'Age Floats are IEEE doubles and ST/X Floats are IEEE doubles. VW refers to Float64 as Double.<br>Later, the ANSI standard defined FloatE, FloatD and FloatQ as aliases. You can use either interchangeably in expecco.
Notice that the use of any but double precision floats (which are directly supported by the machine) may come at a performance price.
<br>The speed of operations degrades from double -> single -> extended -> ieee128 -> quad double -> ieee256 -> ieee arbitrary -> largeFloat.
<br>This is especially true for the trigonometric and math functions, where more iterations are needed to get to the desired precision in the series computations, and the individual operations are also much slower.

Be aware that LargeFloats are super precise, but also super slow.
== Constants ==

The usual constants, such as "pi" and "e", can be asked from the number classes.
Each number class will return a representation of that constant with its precision.<br>I.e. if you ask the Float class for the constant "pi", you'll get a pi with roughly 15 digits precision, whereas if you ask QDoubles, a more accurate representation will be returned.
 ShortFloat pi (= Float32 pi) -> 3.141593
 Float pi (= Float64 pi)      -> 3.14159265358979
 LongFloat pi (= Float80 pi)  -> 3.141592653589793238
 QuadFloat pi (= Float128 pi) -> 3.1415926535897932384626433832795027
 QDouble pi                   -> 3.1415926535897932384626433832795028841971693993751058209749446
 OctaFloat pi (= Float256 pi) -> 3.14159265358979323846264338327950288419716939937510582097494459230781639
 LargeFloat pi                -> 3.1415926535897932384626433832795028841971693993751058209749445923078164 [many more digits...]
== Examples ==
Current version as of 23 February 2026, 10:49
This page provides some computer science basics, which are not specific to expecco. However, in the past some users encountered problems and it is useful to provide some insight on number representations.
Exact Integer Numbers
For integer operations, there is no overflow or error in the result for any legal operation. I.e. operations on two big integers deliver a correct and exact result.
This is a feature of the underlying Smalltalk runtime environment, in contrast to many other programming languages (especially: Java and C) which provide int (usually 32bit) and long (usually 64bit) integer types.
In expecco, you can write both in Smalltalk and in the builtin JavaScript syntax: 1)
2147483647 "(0x7FFFFFFF)" + 1
-> 2147483648 "(0x80000000)"
4294967295 "(0xFFFFFFFF)" + 1
-> 4294967296 "(0x100000000)"
18446744073709551615 "(0xFFFFFFFFFFFFFFFF)" + 1
-> 18446744073709551616 "(0x10000000000000000)"
Very large values can be computed:
10000 factorial
-> a huge number beginning with: 284625968091705451890641321211986889014....
Smalltalk will automatically convert any result which is too large to fit into a machine-integer into a LargeInteger (with an arbitrary number of bits) and also automatically return results converted back to a small representation, if possible. 2)
Thus, although the two operands to the division in the following example are large integers,
rslt := (1000 factorial) / (999 factorial)
the result will be a small integer (since the value 1000 fits easily into a machine word).
As a user, you do not have to care about these internals.
Hint: Therefore, you can use a Workspace (Notepad) window as a calculator with arbitrary precision.
1) be aware that this only computes correct results if the elementary action is written in either Smalltalk or in the builtin JavaScript syntax.
Depending on the version of the external language interpreter, it may or may not be correct, if using Java/Groovy, Python, C/C++, Node.js etc.
2) the integer class provides functions which operate in the limited 32 or 64 bit range. These might be useful if you have to verify results or repeat computations as returned by corresponding C or Java operations.
Exact Fractions, ScaledDecimals and FixedDecimals
Fractions
When dividing integers, the "/" operator will deliver an exact result, possibly as a fraction:
5 / 3 -> 5/3
and reduce the result (possibly returning an Integer):
(5/3)*(3/2) -> 5/2
(5/3)/(3/2) -> 10/9
(5/3)*(9/3) -> 5
1000 factorial / 999 factorial -> 1000
There are also truncating division operators: "//", which returns an integer truncated towards negative infinity (i.e. the next smaller integer), and "quo:", which truncates towards zero. "quo:" is what you'd get in Java or C.
5 // 3 -> 1
-5 // 3 -> -2
The corresponding modulo operators "\\" and "rem:" provide the remainder, such that:
((a // b) * b) + (a \\ b) = a
The "\\" is the standard Smalltalk remainder operator; Smalltalk/X also provides "%" as an alias.
Thus you can also write:
((a // b) * b) + (a % b) = a
There is also a division operator ("quo:") which truncates towards zero, and a corresponding remainder operator ("rem:") , for which:
((a quo: b) * b) + (a rem: b) = a
For positive a and b, the two operator pairs deliver the same result. For negative arguments, they differ. Be aware of this and think about the domain of your arguments. Be reminded that the Smalltalk remainder "%" returns different results than C or Java, when operands are negative.
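Side by side, the floored pair ("//", "\\") and the truncating pair ("quo:", "rem:") agree for positive operands and differ for negative ones:

5 // 3 -> 1        5 \\ 3 -> 2
-5 // 3 -> -2      -5 \\ 3 -> 1
5 quo: 3 -> 1      5 rem: 3 -> 2
-5 quo: 3 -> -1    -5 rem: 3 -> -2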
In addition, the usual ceiling, floor and rounding operations are available (both on fractions and on limited precision reals):
(5 / 3) ceiling -> 2          "the next larger integer"
(5 / 3) floor -> 1            "the next smaller integer"
(5 / 3) truncated -> 1        "truncate towards zero"
(5 / 3) rounded -> 2          "round wrt. fraction >= 0.5"
(5 / 3) roundTo: 0.1 -> 1.7
(5 / 3) roundTo: 0.01 -> 1.67
(-5 / 3) ceiling -> -1        "the next larger integer"
(-5 / 3) floor -> -2          "the next smaller integer"
(-5 / 3) truncated -> -1      "truncate towards zero"
(-5 / 3) rounded -> -2
(-5 / 3) roundTo: 0.1 -> -1.7
Fractions print themself as "(numerator / denominator)".
ScaledDecimal
If you prefer a decimal representation with a defined number of fractional digits, use ScaledDecimals (which for backward compatibility are also called "FixedPoint" (1)).
These are also exact fractions, but print differently: you can specify the number of digits to be printed, and the number will print itself rounded on the last digit. In other words: the computation and internal value will be exact (as with Fractions), and therefore, no rounding errors will accumulate. Only when printed will the external representation be rounded to the specified number of decimal places.
(5 / 3) asScaledDecimal:2 -> 1.67
(5 / 3) asScaledDecimal:4 -> 1.6667
((5 / 3) asScaledDecimal:2) * 3 -> 5.00
1.2 asScaledDecimal:3 -> 1.200
Float pi asScaledDecimal:5 -> 3.14159
1) the class was previously called "FixedPoint" and the converters were called "asFixedPoint:". For compatibility with other Smalltalk dialects, these have aliases "ScaledDecimal" and "asScaledDecimal:".
Both the old class name and the old operators are and will be supported in the future for backward compatibility (as aliases),
but you should use the new name, both for compatibility with other Smalltalk dialects, and to avoid confusion with FixedDecimal numbers.
FixedDecimal
As mentioned above, a ScaledDecimal keeps the exact value internally, but prints itself rounded to a given number of decimal digits. Smalltalk/X provides an alternative class called "FixedDecimal" (1), which always keeps a rounded value internally. These may be better suited for monetary values, especially in computed additive sums which are printed in a table, as the sum of two FixedDecimals will always be the presented (printed) sum of two FixedDecimals. In contrast, with ScaledDecimals, you may see a sum which differs from what presented table values suggest.
For example:
v := 50.004 asScaledDecimal:2.
v printString -> '50.00'. "is actually 50.004"
v2 := v * 2.
v2 printString -> '100.01'. "is actually 100.008"
This leads to confusion if such numbers represent monetary values and are printed e.g. in a summed-up table.
With FixedDecimals, you'll get:
v := 50.004 asFixedDecimal:2.
v printString -> '50.00'. "is actually 50.00"
v2 := v * 2.
v2 printString -> '100.00'. "is actually 100.00"
Be aware that mixed arithmetic operations will usually return an instance of the class with a higher generality, and that Floats do have a higher generality than FixedDecimals which have a higher generality than ScaledDecimals.
Thus, when multiplying a float and a fixed decimal, you'll get a float result, whereas if you multiply an integer and a fixed decimal, the result will be a fixed decimal. Finally, when multiplying a fixed decimal and a scaled decimal, the result will be a fixed decimal.
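As a sketch of these coercion rules (using the converters shown above; the "class" message simply reports the result's class):

(1.5 * (2.50 asFixedDecimal:2)) class -> Float
(3 * (2.50 asFixedDecimal:2)) class -> FixedDecimal
(1.23s2 * (2.50 asFixedDecimal:2)) class -> FixedDecimal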
1) both names "ScaledDecimal" and "FixedDecimal" have been chosen a bit unwise, and may be confusing. However, these cannot easily be changed for backward and cross Smalltalk dialect compatibility reasons. We apologize.
Inexact Float and Double Numbers
Floating point numbers are inherently inexact and almost always represent an approximated value. The error depends on the floating point number's precision, which is the number of bits with which the value is approximated. There are numbers which cannot ever be represented as a floating point number, whatever precision is used. Even innocent looking numbers (eg. "0.1") are of this kind.
This is not a problem specific to expecco,
but inherent to the way floating point numbers are represented (in the machine).
See "What Every Computer Scientist Should Know About Floating-Point Arithmetic",
"Mindless Assessments of Roundoff in Floating-Point Computation" and "Some disasters attributable to bad numerical computing".
A very impressive example of how wrong double precision IEEE arithmetic can be is described in "Do_not_trust_Floating_Point" (1).
Floating point numbers are represented as a sum of powers of 2 (actually 1/2 + 1/4 + 1/8 +...) called the "mantissa" then multiplied by 2 raised to an exponent. I.e.
value = mantissa * (2 ** exponent)
with the mantissa being normalized to be a sum in the interval 0.5..1 (as listed above). And the exponent stored with an offset (called "ebias"). The minimum exponent (0) is reserved for the zero number and non-normalized tiny numbers (called "subnormals"); the maximum exponent is reserved for infinities and NaNs ("not a number"). These might be returned from some operations if invalid arguments are provided (for example, trying to take a logarithm of a negative number).
The number of exponent bits determines the largest and smallest representable magnitudes, the number of mantissa bits determines the relative error. The error depends on the value of the last mantissa bit, which depends on the exponent. This value is called "Unit in the Last Place" or "'ULP'" (see Wikipedia).
For a large number like 1e100, one ULP is the very large 1.94266889222573e+84, whereas for a small number like 0.5, it is 1.11022302462516e-16.
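You can compute one ULP yourself from a float's exponent. This is only a sketch, assuming the usual Smalltalk float messages "exponent" (with the fraction normalized to [1, 2)) and "timesTwoPower:"; 52 is the number of stored mantissa bits of a Float64:

(1.0 timesTwoPower: 1e100 exponent - 52) -> 1.94266889222573e+84
(1.0 timesTwoPower: 0.5 exponent - 52) -> 1.11022302462516e-16

which reproduces the two ULP values mentioned above.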
Floating point formats differ in the number of bits (single/double/extended precision etc.).
A double precision IEEE float has 11 bits for the exponent and 53 for the mantissa (see IEEE floating point formats).
As a rule of thumb, the error in the last bit of a double precision IEEE float is roughly 15 to 16 orders of magnitudes smaller than the magnitude of the (double precision) floating point number. The error is larger for single precision (32bit) floats and smaller for extended floats (80, 128 or more bits).
1): If you do not believe me, try the following example from one of the mentioned papers in (eg. excel) or your favourite programming language:
v := 4/3.    "/ or maybe (4.0/3.0)
w := v - 1.
x := w*3.
y := x - 1.
z := (y*2)**52.
Limited Precision
Due to the limited number of bits in the mantissa, different values may end in the same floating point representation. For example, both 9223372036854776000 and 9223372036854775808 will end up being represented as the same float when converted from integer to float. The reason is that there are simply not enough bits in the mantissa.
For example:
9223372036854776000 asFloat = 9223372036854775808 asFloat
will return "true", and the difference will be zero in:
9223372036854776000 asFloat - 9223372036854775808 asFloat
in contrast to the correct result being returned when comparing/subtracting them as integers:
9223372036854776000 = 9223372036854775808. => false
9223372036854776000 - 9223372036854775808. => 192
Try it in a workspace window.
Also, many numbers (actually: most numbers) cannot be exactly represented by a finite sum of powers of 2. Such numbers will have an error in the last significant bit (actually half the last bit). When floating point numbers are added or multiplied, the result is usually computed internally with a few more bits of mantissa, and then rounded on the last bit, to fit the mantissa's number of bits. Notice that this may not be immediately obvious, because the print functions (such as printf) cheat, and round again on the last bit. Thus, a result such as 0.9999999 would still be printed as "1.0".
The situation may be relaxed slightly by using more bits for the mantissa: expecco gives you a choice of 32bit (called "ShortFloat"), 64bit ("Float"), 80bit ("LongFloat") and even more, which are mapped to corresponding IEEE floats (single, double and extended).
Repeating the above example with long floats, there are enough mantissa bits and the numbers are no longer represented or considered equal,
thus:
9223372036854776000 asLongFloat = 9223372036854775808 asLongFloat
yields "false" as answer, and the computation:
9223372036854776000 asLongFloat - 9223372036854775808 asLongFloat
will give 192.0 as answer.
However, even with more bits, the fundamental restriction remains, although appearing less frequently with higher precision. But be aware that many numbers (such as 1/10, 1/5, 1/3) can never be represented exactly, no matter how many bits are used. So even the above mentioned innocent looking "0.1" is actually an approximation and wrong in the last bit.
The limited precision may lead to "strange" results, especially when operands are far apart; for example, when subtracting a very small value from a much larger one, as in:
2.15e12 - 1.25e-5
Here, the operands differ by 17 orders of magnitude, and there are not enough bits to represent the result, which will be rounded to give you 2.15e12 again.
Thus, the comparison "2.15e12 - 1.25e-5 = 2.15e12" returns true, and "2.15e12 - (2.15e12 - 1.25e-5)" evaluates to 0.0; both results are obviously wrong.
In this special case, a better result is obtained when operating with extended precision:
2.15e12 asLongFloat - 1.25e-5 asLongFloat
which gives 2149999999999.999987 as result (still incorrect due to its 64 bit mantissa, but much better).
If you are willing to trade speed for precision, you can use one of expecco's builtin higher precision representations or even the arbitrary precision representation, and compute with more bits of precision. The QDouble class provides a compromise between speed and precision, providing roughly 200 bits of precision; it represents a combination of up to 4 arbitrarily valued doubles (i.e. it can represent the sum of a very large and a small number):
(2.15e12 asQDouble) - (1.25e-5 asQDouble)
(2.15e12 asQDouble) - ((2.15e12 asQDouble) - (1.25e-5 asQDouble))
Another representation supports an arbitrary number of precision bits (here 200):
(2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200)
(2.15e12 asLargeFloatPrecision:200) - ((2.15e12 asLargeFloatPrecision:200) - (1.25e-5 asLargeFloatPrecision:200))
Both return the correct results 2149999999999.9999875 and 1.25e-5 respectively.
Be aware that the higher precision and arbitrary precision operations are much slower than the ones which are directly supported by the processor (which has special hardware, usually for single, double and extended precision). Also, these extended classes are still being developed and not yet released for official use (meaning they may contain bugs, especially in their trigonometric and other math functions, at the time of writing).
You may use fractions,
2150000000000 - (1 / 125000)
to compute the exact result: (268749999999999999/125000) (of course, these are less convenient to read, and should probably be presented to the end-user as ScaledDecimals).
Also, do not forget that a conversion of one number to a higher precision number cannot magically generate missing bits.
For example, given a 32 bit floating point number which is already an approximation (i.e. the real value cannot be represented as an exact sum of powers of 2),
then the conversion will give you another such approximation, with the same error.
Thus, "0.25 asLongFloat" will give you an exact 0.25 (because 0.25 is representable), whereas "0.1 asLongFloat" will not give an exact "0.1".
Actually, the result of such a conversion will usually not give you the full possible precision.
If you need a constant with the max. precision, either enter it as such (i.e. "0.1q" instead of "0.1 asLongFloat") or read it from a string (i.e. LongFloat fromString:'0.1').
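The same effect can be observed in any IEEE-754 environment; a small Python sketch (illustrative only, expecco itself uses the Smalltalk syntax shown above) makes the difference between 0.25 and 0.1 visible by recovering the exact stored value:

```python
from fractions import Fraction

# Fraction(aFloat) recovers the exact binary value stored in the double,
# so it reveals whether the decimal literal was representable.
assert Fraction(0.25) == Fraction(1, 4)   # 0.25 = 2^-2 is exactly representable
assert Fraction(0.1) != Fraction(1, 10)   # 0.1 is only an approximation
print(Fraction(0.1))                      # the exact stored value
```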
== Floating Point Errors Propagate ==
The above rounding and last-bit errors accumulate with every math operation performed (and may even do so wildly).
For example, the already mentioned 0.1 cannot be exactly represented as a floating point number, and is actually 0.099999..X with an error in the last bit (half a ULP).
Adding this multiple times will result in a larger and larger error in the final result:
1.0 - (0.1 + 0.1 + 0.1 ...(10 times)... + 0.1) -> 1.11022302462516E-16
The print functions will try to compensate for an error in the last bit(s), showing "0.1" although in reality it is "0.09999..." (they round before printing). Thus, even though the printed representation of such a number might look ok, the value injects more and more error when used in further operations (up to the point where the error grows out of the last bits, print can no longer cheat, and the error becomes visible).
This is especially inconvenient, when monetary values or counts are represented as floats and a final sum is wrong in the penny value
(and therefore, a real programmer will never ever use floating point numbers to represent monetary values!).
As an example, try to sum an array consisting of 10 values:
(Array new:10 withAll:0.1) sum printString
which results in "1.0" due to print's cheating,
whereas:
(Array new:100 withAll:0.1) sum printString
will show '9.99999999999998' (i.e. the error accumulated to a value too big for print's cheating to compensate).
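The same accumulation can be reproduced in any IEEE-754 double environment; a Python sketch (shown only to illustrate that this is not an expecco-specific effect) demonstrates it, along with compensated summation, which keeps the error down:

```python
import math

s = sum([0.1] * 100)       # naive left-to-right summation
assert s != 10.0           # the accumulated error is now visible
print(repr(s))

# compensated (error-tracking) summation recovers the correctly rounded sum:
assert math.fsum([0.1] * 100) == 10.0
```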
Expecco (actually the underlying Smalltalk) provides additional number representations which are better suited for such computations: Fraction, ScaledDecimal and FixedDecimal (in other systems, ScaledDecimals are also called "FixedPoint" numbers, and expecco knows that as an alias).
These are exact fractions internally, but use different print strategies: Fractions print as such (i.e. '1/3', '2/17' etc.), whereas ScaledDecimal and FixedDecimal numbers print themselves as a decimal expansion (i.e. '0.33' or '0.20'). ScaledDecimal constants can be entered by using "s" instead of "e": '1.23s' defines a scaled decimal with 2 digits after the decimal point, whereas '1.23s4' will print 4 valid digits after the decimal point.
No such rounding errors are encountered, if fractions are used:
1 - ((1/10) + (1/10) + (1/10) ...(10 times)... + (1/10)) -> 0
or if ScaledDecimal numbers are used:
(Array new:100 withAll:0.1s) sum printString -> '10.0'
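Python's decimal module plays the same role as expecco's ScaledDecimal here; a minimal comparison sketch:

```python
from decimal import Decimal

# Decimal('0.1') holds an exact decimal value, like 0.1s in expecco
assert sum([Decimal('0.1')] * 100) == Decimal('10.0')
# the binary-float version accumulates an error instead
assert sum([0.1] * 100) != 10.0
```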
== Floating Point Number Comparison ==
Be aware of such errors, and do not compare floating point numbers for equality/inequality.
As a concrete example, try:
(0.2 + 0.1 - 0.3) = 0
which will return false and if you print "0.2 + 0.1 - 0.3", you might get something like: "5.55111512312578e-17".
Even increasing the precision does not really help; if we went to 200bits precision, we'd still get a small error:
(0.2QL + 0.1QL - 0.3QL) printString
gives "-3.111507638930570853572...e-61"
The problem also occurs when comparing numbers with different precision. For example, consider that a float32 value is to be compared against a constant. The float32 might be read from a file or provided by a measurement device via any communication mechanism. If we compare it against a higher precision value, the missing bits in the shorter float are filled with zeros. Thus:
(Float32 readFrom:'0.125') = 0.125
leads to a true value, whereas:
(Float32 readFrom:'0.123') = 0.123
returns false.
The reason for this is that 0.125 can be represented exactly in both float32 and float64 formats, whereas 0.123 is non-exact and has an infinitely repeating binary expansion.
Its representation as float64 is:
0 01111111011 1111011111001110110110010001011010000111001010110000
and:
0 01111011 11110111110011101101101
as float32. When comparing, the float32 is expanded to:
0 01111011 111101111100111011011010000000000000000000000000000000
which is obviously different.
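The widening effect can be reproduced by round-tripping a value through single precision; here is a sketch in Python using the struct module (the helper name `as_float32` is illustrative, not an expecco API):

```python
import struct

def as_float32(x):
    """Round an IEEE double to single precision and widen it back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

assert as_float32(0.125) == 0.125   # exact in both widths: compares equal
assert as_float32(0.123) != 0.123   # truncated mantissa is zero-filled: compares unequal
```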
Instead of comparing against a constant, either use range-compares and/or use the special "compare-almost-equal" functions, where the number of bits of acceptable error can be specified (so called: "ULPs"). Expecco provides such functions both for elementary code and in the standard action block library.
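In environments without such library functions, the same idea can be expressed with an absolute tolerance or a ULP-based compare; a Python sketch (the helper `almost_equal` and its tolerances are illustrative, not the expecco functions):

```python
import math

diff = 0.2 + 0.1 - 0.3
assert diff != 0.0                             # exact equality fails
assert math.isclose(diff, 0.0, abs_tol=1e-12)  # range compare succeeds

# ULP-based compare: accept results differing only in the last bit(s)
def almost_equal(a, b, ulps=2):
    return abs(a - b) <= ulps * math.ulp(max(abs(a), abs(b)))

assert almost_equal(0.1 + 0.2, 0.3)
```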
Again, for this reason, do not compute money values using floats or doubles. Instead, use instances of ScaledDecimal. Otherwise you might lose a cent/penny here and there, if you use floats/doubles on big budgets.
With scaled decimals, the result is correct:
 (0.1s + 0.2s - 0.3s) = 0    => true
and:
 (0.1 asScaledDecimal + 0.2 asScaledDecimal - 0.3 asScaledDecimal) printString    => '0.00'
== Limited Range of Float and Double Numbers ==
Floating point numbers also have a limited range; there are smallest and largest representable values.
In expecco, the default float format is IEEE double precision format (called "Float" or "Float64" in expecco).
Numbers with an absolute value greater than 1.79769313486232E+308 will lead to a +INF/-INF
(infinite) result, and numbers with absolute value smaller than 2.2250738585072E-308 will be zero.
For IEEE single precision floats (called "ShortFloat" or "Float32" in expecco), the range is much smaller, and for IEEE extended precision (called "LongFloat" or "Float80" in expecco), the range is larger.
You can ask the classes (or their instances 1) for their limits, with:
- fmin (smallest representable number larger than zero)
- fmax (largest representable number),
- emin (smallest exponent; binary)
- emax (largest exponent binary),
- precision (bits in mantissa, incl. any hidden bit),
- decimalPrecision (digits when printed),
 "remember: Float is the same as Float64"
 Float fmin              -> 2.2250738585072E-308
 Float fmax              -> 1.79769313486232E+308
 Float emin              -> -1022
 Float emax              -> 1023
 Float precision         -> 53
 Float decimalPrecision  -> 15

 "remember: ShortFloat is an alias for Float32"
 ShortFloat fmin              -> 1.175494e-38
 ShortFloat fmax              -> 3.402823e+38
 ShortFloat emin              -> -126
 ShortFloat emax              -> 127
 ShortFloat precision         -> 24
 ShortFloat decimalPrecision  -> 7

 "remember: LongFloat is an alias for Float80"
 LongFloat fmin              -> 3.362103143112093506E-4932
 LongFloat fmax              -> 1.189731495357231765E+4932
 LongFloat emin              -> -16382
 LongFloat emax              -> 16383
 LongFloat precision         -> 64
 LongFloat decimalPrecision  -> 19

 "QuadFloat is an alias for Float128"
 Float128 fmin              -> 3.36210314311209350626267781732e-4932
 Float128 fmax              -> 1.18973149535723176508575932662e+4932
 Float128 emin              -> -16382
 Float128 emax              -> 16383
 Float128 precision         -> 113
 Float128 decimalPrecision  -> 34

 "OctaFloat is an alias for Float256"
 Float256 fmin              -> 2.48242795146434978829932822291387172...5329791379e-78913
 Float256 fmax              -> 1.61132571748576047361957211845200501...7125049607e+78913
 Float256 emin              -> -262142
 Float256 emax              -> 262143
 Float256 precision         -> 237
 Float256 decimalPrecision  -> 71

 QDouble fmin              -> "same as Float64"
 QDouble fmax              -> "same as Float64"
 QDouble emin              -> "same as Float64"
 QDouble emax              -> "same as Float64"
 QDouble precision         -> 204
 QDouble decimalPrecision  -> 61

 aLargeFloat fmin              -> 0.0    "arbitrarily small 1)"
 aLargeFloat fmax              -> inf    "arbitrarily large"
 aLargeFloat emin              -> -inf   "arbitrarily small"
 aLargeFloat emax              -> inf    "arbitrarily large"
 aLargeFloat precision         -> 200    "default; configurable"
 aLargeFloat decimalPrecision  -> 60     "default; configurable"
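For comparison, the corresponding Float64 limits can be queried in other environments too, e.g. via Python's sys.float_info (shown here only to confirm the Float/Float64 values above):

```python
import sys

fi = sys.float_info
assert fi.max == 1.7976931348623157e+308   # Float fmax
assert fi.min == 2.2250738585072014e-308   # Float fmin (smallest normalized value)
assert fi.mant_dig == 53                   # Float precision (incl. hidden bit)
assert fi.max_exp - 1 == 1023              # Float emax
assert fi.min_exp - 1 == -1022             # Float emin
assert fi.dig == 15                        # Float decimalPrecision
```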
Remember that the name "Float" refers to "Float64", which is called "double" in the C language. And also remember that the name "ShortFloat" refers to "Float32", which is called "float" in C. And finally, the name "LongFloat" refers to "Float80", which is called "long double" in C.
As a consequence of the limits, you cannot compute very large numbers using any of the CPU supported floats, and you will have to use one of the software computed float representations.
For example trying to compute the number of decimal digits of a huge number:
10000 factorial asFloat log10 -> INF
I.e. it returns infinity, because "10000 factorial asFloat" already returns INF. (It could have been converted with "asFloatChecked", which raises an exception in that situation; that is probably a good idea. However, the regular asFloat conversion uses the underlying machine's CPU support, which returns INF, similar to the behavior of other programming languages.)
In contrast, the exact integer computation succeeds:
10000 factorial log10 -> 35659.454274
Again: this is not a problem specific to expecco, but inherent to the way floating point numbers are represented in the CPU.
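Other languages hit the same limit; for example Python, which raises an OverflowError on the int-to-float conversion instead of returning INF, also has to stay in integer arithmetic:

```python
import math

n = math.factorial(10000)          # an exact, very large integer
try:
    float(n)                       # out of Float64 range
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed

# log10 of a big integer works without an intermediate float conversion
assert abs(math.log10(n) - 35659.4542745) < 0.01
```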
1) because each LargeFloat instance carries its own precision, you should ask the instance, not the class. The class will return the default values (which are valid for 200 bits of resolution).
== Speed of Operations ==
Machines have builtin floating point math operations, which usually work fastest in single or double precision (actually, some modern machines work faster in double than in single precision).
Unless you have special precision needs, it is best to stick with double precision, which is also portable across machines. Therefore, the double precision float (aka "double") is the default and simply named "Float" in Smalltalk/X (and therefore also in expecco) 1).
1)The reason for calling them "Float" is historic. There exist smalltalk dialects where Floats are 32bit IEEE floats, and others where they are 64bit. To be able to import code from either dialect, Smalltalk/X uses double precision for "Float".
To make your intention clear, it is recommended to use the explicit names "Float32", "Float64" etc.
== Trigonometric and other Math Functions ==
Some trigonometric and other math functions (sqrt, log, exp) will first convert the number to a limited precision real number (a C-double), and therefore may have a limited input value range and also generate inexact results.
For example, you will not get a valid result for:
10000 factorial sin
because it is not possible to represent that large a number as a real number (expecco will signal a domain error, as the input to sin will be +INF).
Also, the result from:
(9 / 64) sqrt
will be the (inexact) 0.375 (a double), instead of the exact (3/8) (a fraction). (This might change in a future release and provide exact results if both numerator and denominator are perfect squares.)
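Python's math.sqrt behaves the same way: a Fraction argument is converted to a double first, so the exact rational answer has to be computed on the rational side. A small sketch:

```python
from fractions import Fraction
import math

r = math.sqrt(Fraction(9, 64))    # the fraction is converted to a double first
assert isinstance(r, float)
assert r == 0.375                 # numerically right here, but a float, not a Fraction

# the exact answer stays on the rational side:
assert Fraction(3, 8) ** 2 == Fraction(9, 64)
```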
You can however first convert to a higher precision float and then apply the function. These will compute using a Taylor series or Newton approximation taking the number's precision into account:
 2 sqrt                             => 1.4142135623731    "computed with float64 precision"
 2 asFloat128 sqrt                  => 1.41421356237309504880168872421
 2 asFloat256 sqrt                  => 1.41421356237309504880168872420969807856967187537694807317667973799073
 (2 asLargeFloatPrecision:500) sqrt => 1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727350138462309122970249248360558507372126441214971
== Complex Results ==
By default, Smalltalk/X will raise an error if the result of a function with a real operand would return a complex result.
For example, computing the square root of a negative number as in:
-2 sqrt
will raise an "ImaginaryResultError".
However, this is a proceedable exception 1, which can be caught and if the handler proceeds, a complex result is returned:
rslt := ImaginaryResultError ignoreIn:[ -2 sqrt ].
for readability, there is also an alias called "trapImaginary:" in the number class:
rslt := Number trapImaginary:[ -2 sqrt ]
Both of the above would return the complex result:
(0+1.4142135623731 i)
Thus,
rslt := -2 sqrt squared
will result in an exception, whereas:
rslt := Number trapImaginary:[ -2 sqrt squared]
will generate a result of "-2.0".
All operations within "[" .. "]" which would produce an ImaginaryResultError will return a complex. Thus you can write:
 Number trapImaginary:[
     |num1 num2|

     num1 := -2 sqrt.
     num2 := -3 sqrt.
     Transcript showCR: (num1 + num2).
 ]
and "(0+3.14626436994197i)" will be shown.
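Python makes the same distinction with two separate modules instead of a proceedable exception: math.sqrt raises an error for a negative input, while cmath.sqrt opts in to complex results (analogous to trapImaginary:):

```python
import math, cmath

try:
    math.sqrt(-2)                 # like the default behavior: an error is raised
    raised = False
except ValueError:
    raised = True
assert raised

z = cmath.sqrt(-2)                # like trapImaginary:, a complex result
assert abs(z - 1.4142135623730951j) < 1e-12
# sqrt(-2) + sqrt(-3), as in the Transcript example above:
assert abs(cmath.sqrt(-2) + cmath.sqrt(-3) - 3.1462643699419726j) < 1e-9
```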
1) Proceedable exceptions are among the unique features of the Smalltalk programming language: exceptions may be raised as proceedable, and an exception handler may then "proceed" and provide an alternative return value for the failed operation. This mechanism is used here, where the exception handler - if present - can decide to return an imaginary result.
== Undefined Results, NaN and Domain Errors ==
Similar to the way imaginary results are handled, some operations are not defined for certain values (values outside the function's domain).
For example, the receiver of the arcSin operation must be in [-1 .. 1].
By default, these situations are also reported by raising an error, and therefore:
-2 arcSin
will raise such a "DomainError".
Similar to the above, this can be handled, although no useful value will be provided; in the above case, NaN (Not a Number) will be returned:
rslt := DomainError ignoreIn:[ -2 arcSin ]
or:
rslt := Number trapDomainError:[ -2 arcSin ].
Both of the above would generate a NaN as result.
Notice that if such a NaN is used in other arithmetic operations, either more exceptions or other NaNs will usually be generated (depending on the exception being handled or not).
Thus:
rslt := Number trapDomainError:[ -2 arcSin sin ].
will also generate a NaN as result.
You can check for valid results with:
 aNumber isNaN               - true for NaN
 aNumber isInfinite          - true for infinities
 aNumber isPositiveInfinity  - true for +inf
 aNumber isNegativeInfinity  - true for -inf
 aNumber isFinite            - false for NaN or infinities (i.e. true for valid numbers)
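The same checks exist in most IEEE-754 environments; in Python they are math.isnan / isinf / isfinite. Note that NaN never compares equal, not even to itself:

```python
import math

nan, inf = math.nan, math.inf
assert math.isnan(nan)
assert math.isinf(inf) and math.isinf(-inf)
assert not math.isfinite(nan) and not math.isfinite(inf)
assert math.isfinite(1.0)

assert nan != nan             # NaN is unequal to everything, including itself
assert math.isnan(nan + 1.0)  # and it propagates through arithmetic
```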
== Overflow ==
When an operation's arguments are OK, but the result falls outside the range of representable numbers [fmin..fmax], you will get an infinite result 1).
Further operations on these might produce more infinities or a NaN ("Not a Number").
This may be especially troublesome, if a final result gets corrupted due to an intermediate computation, as in:
 a := 1e+10.
 b := 1e+300.
 c := 1e+20.
 d := 1e+300.
 rslt := (a*b) / (c*d)    -> NaN
Here, the final result is certainly representable, but the intermediate values (1e10 * 1e300) are out of the Float range [2.225E-308 .. 1.796e+308]. Thus, the computation will be:
 a := 1e+10. b := 1e+300. c := 1e+20. d := 1e+300.
 (a*b)      -> INF
 (c*d)      -> INF
 INF / INF  -> NaN
If the above intermediates are computed with a higher precision, the final result will be correct:
 a := 1e+10. b := 1e+300. c := 1e+20. d := 1e+300.
 t := (a asLongFloat * b asLongFloat) / (c asLongFloat * d asLongFloat).
 rslt := t asFloat    -> 1e-10
Of course, with LongFloats, the problem is only shifted towards larger numbers; as soon as the temporary result cannot be represented by a LongFloat:
 a := 1q+100. b := 1q+4900. c := 1q+200. d := 1q+4900.
 rslt := (a * b) / (c * d).    -> NaN
then, you may use arbitrary precision LargeFloat numbers:
 t := (a asLargeFloat * b asLargeFloat) / (c asLargeFloat * d asLargeFloat).
 rslt := t asFloat.    -> 1e-100
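Where no wider float type is available, reassociating the expression so that every intermediate stays in range achieves the same effect; a Python sketch of the first example:

```python
import math

a, b, c, d = 1e+10, 1e+300, 1e+20, 1e+300
assert math.isinf(a * b)               # intermediate overflows to +INF
assert math.isnan((a * b) / (c * d))   # INF / INF -> NaN

# pairing large with large keeps every intermediate representable:
r = (a / c) * (b / d)
assert r == 1e-10
```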
1) that is the current default behavior. Future versions may allow enabling exceptions in this situation, if there are customer requests. However, as most other programming languages behave similarly in these situations, most programmers are aware of these pitfalls and avoid such problems.
== Different Results on Different CPUs ==
Since FLOAT32 and FLOAT64 arithmetic is performed by the underlying CPU hardware, different results (in the least significant bit) may be returned from math operations on different CPUs or even different versions (steppings) of the same CPU architecture.
This applies especially to trigonometric and other math functions. These are computed by power series (Taylor series) or Newton approximations, with different algorithms on different systems, leading to different results.
Be prepared for this, and use the "almost-equal" comparison functions when results are to be verified.
== Summary of Higher Precision Numbers ==
Expecco supports various inexact real formats, with different precision (i.e. number of mantissa bits). Some of those classes have alias names; these are provided to make Smalltalk/X code portable to other Smalltalk dialects (however, within expecco, you will probably not care, as it is not planned to port it to another dialect).
{| class="wikitable"
! Name !! Overall Size (Bit) !! Exponent Size (Bit) !! Mantissa Size 1) (Bit) !! Decimal Precision (Digits) !! fmin / fmax !! ST/X Name !! ANSI-ST Name 4) !! ST80/VW Name 4) !! IBM VA-ST Name 4)
|-
| IEEE single || 32 || 8 || 24 || 6 || 1.175494e-038 / 3.402823e+038 || ShortFloat, FloatE, Float32 || FloatE || Float || -
|-
| IEEE double || 64 || 11 || 53 || 15 || 2.225074e-308 / 1.797693e+308 || Float, FloatD, Float64, Double || FloatD || Double || Float
|-
| IEEE extended || 80/128 || 15 || 64/112 || 19/34 || 3.362103e-4932 / 1.189731e+4932 || LongFloat, FloatQ, Float80 or Float128 || FloatQ 2) || - || -
|-
| quad double || 4*64 || 11 || 204 || 60 || 2.225074e-308 / 1.797693e+308 || QDouble 3) || - || - || -
|-
| IEEE quadruple || 128 || 15 || 112 || 34 || 3.362103e-4932 / 1.189731e+4932 || QuadFloat, Float128 3) || - || - || -
|-
| IEEE octuple || 256 || 19 || 236 || 71 || 2.482427e-78913 / 1.611325e+78913 || OctaFloat, Float256 3) || - || - || -
|-
| IEEE arbitrary || any || any || any || any || any || IEEEFloat 3) || - || - || -
|-
| large float || any || any || any || any || any || LargeFloat 3) || - || - || -
|}
(1) mantissa incl. any hidden bit (normalized floats)
(2) LongFloats use the underlying CPU's long double format.
On x86/x64 machines, LongFloats are represented as 80bit extended floats with 64bit mantissa; on other CPUs these might be represented as 128bit quadFloats with 112 bit mantissa (eg. a SPARC CPU does this).
(3) these are still being developed and provided as a preview feature without warranty (meaning: they may be buggy at the moment; let us know if you need them).
(4) different Smalltalk dialects use different precisions for their floating point numbers: ST80/VW Floats are IEEE singles, V'Age Floats are IEEE doubles and ST/X Floats are IEEE doubles. VW refers to Float64 as Double.
Later, the ANSI standard defined FloatE, FloatD and FloatQ as aliases. You can use either interchangeably in expecco.
Notice that the use of any but double precision floats (which are directly supported by the machine) may come at a performance price.
The speed of operations degrades from double -> single -> extended -> ieee128 -> quad double -> ieee256 -> ieee arbitrary -> largeFloat.
This is especially true for the trigonometric and math functions, where both more iterations are needed to get to the desired precision in the series computations and the individual operations are also much slower.
Be aware that LargeFloats are super precise, but also super slow.
== Constants ==
Some well known and common constants can be acquired by asking a number class such as the Float class:
 Float pi      -> pi (3.14159...)
 Float halfPi  -> pi / 2
 Float twoPi   -> pi * 2
 Float phi     -> phi (golden ratio 1.6180...)
 Float sqrt2   -> square root of 2 (1.4142...)
 Float sqrt5   -> square root of 5 (2.2360...)
 Float ln2     -> natural log of 2 (0.69314...)
 Float ln10    -> natural log of 10 (2.30258...)
 Float e       -> e (2.718281...)
Each number class will return a representation of that constant with its precision.
I.e. if you ask the Float class for the constant "pi", you'll get a pi with roughly 15 digits precision, whereas if you ask QDoubles, a more accurate representation will be returned.
 ShortFloat pi (= Float32 pi)   -> 3.141593
 Float pi (= Float64 pi)        -> 3.14159265358979
 LongFloat pi (= Float80 pi)    -> 3.141592653589793238
 QuadFloat pi (= Float128 pi)   -> 3.1415926535897932384626433832795027
 QDouble pi                     -> 3.1415926535897932384626433832795028841971693993751058209749446
 OctaFloat pi (= Float256 pi)   -> 3.14159265358979323846264338327950288419716939937510582097494459230781639
 LargeFloat pi                  -> 3.1415926535897932384626433832795028841971693993751058209749445923078164 [many more digits...]
== Examples ==
Code examples in Smalltalk syntax. Notice the float precision qualifiers:
 'q'  -> longFloat (= IEEE extended precision = Float80)
 'Q'  -> quadFloat (= IEEE quadruple precision = Float128)
 'QO' -> octaFloat (= IEEE octuple precision = Float256)
 'QD' -> qDouble
 'QL' -> largeFloat (arbitrary precision)
Square Root:
2.0 sqrt asShortFloat -> 1.414214
2.0 sqrt -> 1.4142135623731
2.0q sqrt -> 1.414213562373095049
2.0Q sqrt -> 1.4142135623730950488016887242097
2.0QD sqrt -> 1.4142135623730950488016887242096980785696718753769
2.0QO sqrt -> 1.41421356237309504880168872420969807
8569671875376948073176679737990732
(2.0QL precision:200) sqrt -> 1.41421356237309504880168872420969807
8569671875376948073176679
(2.0QL precision:400) sqrt -> 1.41421356237309504880168872420969807
85696718753769480731766797379907324
78462107038850387534327641572735013
846230912297025
(2.0QL precision:800) sqrt -> 1.41421356237309504880168872420969807
85696718753769480731766797379907324
78462107038850387534327641572735013
84623091229702492483605585073721264
41214970999358314132226659275055927
55799950501152782060571470109559971
605970274...
Precision value from Wolfram: 1.41421356237309504880168872420969807
85696718753769480731766797379907324
78462107038850387534327641572735013
8462309122970249248360...
Cubic Root:
2.0 cbrt asShortFloat -> 1.259921
2.0 cbrt -> 1.25992104989487
2.0q cbrt -> 1.259921049894873165
2.0Q cbrt -> 1.25992104989487316476721060727823
2.0QD cbrt -> 1.25992104989487316476721060727822835
05702514647015
2.0QO cbrt -> 1.25992104989487316476721060727822835
05702514647015079800819751121553
(2.0QL precision:200) cbrt -> 1.25992104989487316476721060727822835
0570251464701507980081975
(2.0QL precision:400) cbrt -> 1.25992104989487316476721060727822835
05702514647015079800819751121552996
76513959483729396562436255094154310
256035615665259
(2.0QL precision:800) cbrt -> 1.25992104989487316476721060727822835
05702514647015079800819751121552996
76513959483729396562436255094154310
25603561566525939902404061373722845
91103042693552469606426166250009774
74526565480306867185405518689245872
516764199373709695098382...
Wolfram: 1.25992104989487316476721060727822835
05702514647015079800819751121552996
76513959483729396562436255094154310
2560356156652593990240...
Exponentiation:
2.0 exp asShortFloat -> 7.389056
2.0 exp -> 7.38905609893065
2.0q exp -> 7.389056098930650227
2.0Q exp -> 7.38905609893065022723042746057499
2.0QD exp -> 7.38905609893065022723042746057500781
31803155705518
2.0QO exp -> 7.38905609893065022723042746057500781
318031557055184732408712782252257266
(2.0QL precision:200) exp -> 7.38905609893065022723042746057500781
3180315570551847324087123
(2.0QL precision:400) exp -> 7.38905609893065022723042746057500781
31803155705518473240871278225225737
96079057763384312485079121794773753
161265478866123
(2.0QL precision:800) exp -> 7.38905609893065022723042746057500781
31803155705518473240871278225225737
96079057763384312485079121794773753
16126547886612388460369278127337447
83922133980777749001228956074107537
02391330947550682086581820269647868
208404220982255234875742...
Wolfram: 7.38905609893065022723042746057500781
31803155705518473240871278225225737
96079057763384312485079121794773753
16126547886612388460369278...