Do not trust Floating Point


Current version as of 17 June 2024, 14:42

== Introduction ==

The essence of this article in one sentence is "''Do not blindly trust any result when computing with floating point numbers''".

This may sound strange, but you must always be aware that computations involving floating point numbers are inherently inexact and sometimes completely (absurdly) wrong. It is highly recommended that you check the expected error, precision and value ranges when doing floating point math.
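As a quick warm-up, even the simplest decimal constants already misbehave. For example, in JavaScript:

```javascript
// Even the simplest decimal literals are not exactly representable
// in binary floating point:
console.log(0.1 + 0.2);          // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3);  // false
```

Neither 0.1 nor 0.2 has a finite binary representation, so each is rounded on input, and the rounding errors surface in the sum.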

== Example 1 ==

=== Rump's Royal Pain<sup>1)</sup> ===

This example shows absurdly wrong results from a seemingly innocent floating point computation. It produces a completely wrong result not only with double precision IEEE floats, but also with longer precisions (80 bit or 128 bit).

Try to compute the following expression (in your favorite programming language):

333.75y<sup>6</sup> + x<sup>2</sup>(11x<sup>2</sup>y<sup>2</sup> - y<sup>6</sup> - 121y<sup>4</sup> - 2) + 5.5y<sup>8</sup> + x/(2y)

with:

 x = 77617 and y = 33096.

In Smalltalk, this could be written as:

x := 77617.
y := 33096.

(333.75 * (y ** 6)) 
+ ((x ** 2)
   * ((11 * (x ** 2) * (y ** 2))
      - (y ** 6)
      - (121 * (y ** 4))
      - 2))
+ (5.5 * (y ** 8))
+ (x / (2 * y))

or in JavaScript as:

x = 77617;
y = 33096;

(333.75 * (y ** 6)) 
+ ((x ** 2)
   * ((11 * (x ** 2) * (y ** 2))
      - (y ** 6)
      - (121 * (y ** 4))
      - 2))
+ (5.5 * (y ** 8))
+ (x / (2 * y))

When evaluated, the result will be 1.18059162071741e+21. Even when using higher precision IEEE floats (by using "x := 77617q" and "y := 33096q"), we get wrong results.
This is completely incorrect. The correct (approximated) result is: -0.8273960599...

So not even the sign is correct - and the IEEE value is off by '''21 orders of magnitude'''!

In Smalltalk/X, you can compute this by using large precision floats; simply write:

x := 77617QL.
y := 33096QL.
...

to get:

-0.827396059946821368141165095479816291999033115784384819

It could also be computed exactly by using fractions instead:

|x y|

x := 77617.
y := 33096.

((333 + (3/4)) * (y ** 6)) 
+ ((x ** 2)
   * ((11 * (x ** 2) * (y ** 2))
      - (y ** 6)
      - (121 * (y ** 4))
      - 2))
+ ((11/2) * (y ** 8))
+ (x / (2 * y))
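For comparison, the same exact computation can be sketched in JavaScript with BigInt-based rationals. The `frac`, `add`, `mul` and `pow` helpers below are hypothetical, written only for this illustration:

```javascript
// Exact evaluation of Rump's expression using BigInt rationals.
// A fraction is {n, d} with the sign kept in the numerator.
function gcd(a, b) { while (b) [a, b] = [b, a % b]; return a < 0n ? -a : a; }
function frac(n, d) {
  const g = gcd(n, d) || 1n;            // reduce to lowest terms
  const s = d < 0n ? -1n : 1n;          // normalize the sign
  return { n: s * n / g, d: s * d / g };
}
function add(a, b) { return frac(a.n * b.d + b.n * a.d, a.d * b.d); }
function mul(a, b) { return frac(a.n * b.n, a.d * b.d); }
function pow(a, k) { let r = frac(1n, 1n); while (k--) r = mul(r, a); return r; }

const x = frac(77617n, 1n), y = frac(33096n, 1n);
const r = [
  mul(frac(1335n, 4n), pow(y, 6)),          // 333.75 * y^6
  mul(pow(x, 2),                            // x^2 * (11x^2y^2 - y^6 - 121y^4 - 2)
      [mul(mul(frac(11n, 1n), pow(x, 2)), pow(y, 2)),
       mul(frac(-1n, 1n), pow(y, 6)),
       mul(frac(-121n, 1n), pow(y, 4)),
       frac(-2n, 1n)].reduce(add)),
  mul(frac(11n, 2n), pow(y, 8)),            // 5.5 * y^8
  frac(77617n, 2n * 33096n)                 // x / (2y)
].reduce(add);

console.log(`${r.n}/${r.d}`);               // -54767/66192
```

Because every intermediate value is an exact rational, no cancellation can occur, and the result matches the exact fraction -54767/66192 given in the table below.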

By the way, in expecco (and in many other programming languages), the above expression delivers the following results as per precision used:

{| class="wikitable"
! Precision !! Code !! Result !! Verdict
|-
| IEEE Single || x := 77617f.<br>y := 33096f. || -1.18059162071741e+21 || WRONG (ABSURDLY)
|-
| IEEE Double || x := 77617.<br>y := 33096. || 1.18059162071741e+21 || WRONG (ABSURDLY)
|-
| IEEE Extended || x := 77617q.<br>y := 33096q. || 576460752303423489.2 || WRONG (still way off)
|-
| IEEE Quadruple || x := 77617Q.<br>y := 33096Q. || 1.17260394005317863185883490452 || WRONG (coming closer)
|-
| IEEE Octuple || x := 77617QO.<br>y := 33096QO. || -0.8273960599468213681411650... || OK
|-
| QDouble || x := 77617QD.<br>y := 33096QD. || -0.8273960599468213681327577... || ALMOST
|-
| LargeFloat || x := 77617QL.<br>y := 33096QL. || -0.8273960599468213681411650... || OK
|-
| Fractions || x := 77617.<br>y := 33096. || (-54767/66192)<br>which asLargeFloat is -0.8273960599468213681411650954... || EXACT
|}

1) see [https://books.google.de/books?id=ZqnuDwAAQBAJ&pg=PT244&lpg=PT244&dq=rumps+royal+pain&source=bl&ots=mQeqSQWHtM&sig=ACfU3U3DLU3vpu4yMlx99xFDeorzlXLGEA&hl=de&sa=X&ved=2ahUKEwiEhIjmlPL0AhVdS_EDHSw8A10Q6AF6BAgPEAM#v=onepage&q=rumps%20royal%20pain&f=false Rump's Pain in books.google]

== Example 2 ==

Trigonometric functions are computed by Taylor or power series, which also introduce errors (sometimes more than one ULP). When combined with divisions, the error usually gets bigger.

As an example, take

( tan(sin(x)) - sin(tan(x)) ) / x<sup>7</sup>
with:
 x = 0.02

(example taken from <sup>2)</sup>)

In Smalltalk you'd write:

x := 0.02.
rslt := (x sin tan - x tan sin ) / (x ** 7).

or in JavaScript:

var x = 0.02;
var rslt = ( tan(sin(x)) - sin(tan(x)) ) / (x ** 7);

to get "0.0333473483202229" as the result. If we ask Wolfram, we get: "0.0333486812981771543916836365420930826209673623104313000386284441..."
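The reason for the digit loss is catastrophic cancellation: tan(sin(x)) and sin(tan(x)) agree in their leading dozen or so digits, so the subtraction wipes out most of the significant digits, and the division by x<sup>7</sup> then amplifies whatever rounding noise survives. A small JavaScript sketch of what happens:

```javascript
// Catastrophic cancellation at x = 0.02:
// the two terms agree in their leading ~13 digits.
const x = 0.02;
const a = Math.tan(Math.sin(x));   // ~0.0200013...
const b = Math.sin(Math.tan(x));   // ~0.0200013...
console.log(a - b);                // ~4.27e-14, only a few digits survive
console.log((a - b) / x ** 7);     // division amplifies the rounding noise
```

Since the difference is around 4e-14 while each term is around 0.02, only a handful of significant digits remain in the subtraction, which matches the roughly six correct digits seen in the double-precision result below.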

As you can see, the result computed with IEEE double precision is already wrong in the 6th decimal place. Using different representations for x ("0.02q" for float80, "0.02Q" for float128, "0.02QD" for QDouble, "0.02QO" for float256, and "0.02QL" for LargeFloat, which defaults to 200 bits), we get:

x := 0.02.                         "/ Float: 64 bits
    => 0.03334|73483202229
       0.03334|86812981771543916836365420930826209673623104313000386284441
       
x := 0.02q.                        "/ LongFloat: 80 bits
    => 0.033348681|07362584807
       0.033348681|2981771543916836365420930826209673623104313000386284441

x := QuadFloat fromString:'0.02'.  "/ QuadFloat: 128 bits
    => 0.03334868129817715439168|58584082
       0.03334868129817715439168|36365420930826209673623104313000386284441

x := LargeFloat fromString:'0.02'. "/ LargeFloat: defaults to 200 bits  
    => 0.0333486812981771543916836365420930826209673623104|45685286582
       0.0333486812981771543916836365420930826209673623104|313000386284441

x := 0.02QD.                       "/ QDouble: 204 bits 
    => 0.03334868129817715439168363654209308262096|68260470426318400894 
       0.03334868129817715439168363654209308262096|73623104313000386284441

x := OctaFloat fromString:'0.02'.  "/ OctaFloat: 256 bits   
    => 0.033348681298177154391683636542093082620967362310431300038628|8528889811
       0.033348681298177154391683636542093082620967362310431300038628|4441...

x := LargeFloat readFrom:'0.02' precision:400.  "/ LargeFloat-400 
    => 0.0333486812981771543916836365420930826209673623104313000386284441111102503364815348...
       0.0333486812981771543916836365420930826209673623104313000386284441...

Notice that (currently?) QDouble is less accurate than LargeFloat-200, despite its slightly higher number of bits. This seems to result from accumulating internal rounding errors, whereas LargeFloat first computes the trigonometric function with a higher precision and rounds at the end. This will likely remain as-is, as the open-source code for QDouble is used unmodified.

(However, be reminded that LargeFloat and OctaFloat arithmetic is ''much'' slower.)

2) see [https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf "How Java's Floating-Point Hurts Everyone Everywhere", pg 35]



Copyright © 2014-2024 eXept Software AG