~/ntl-11.4.2/doc/quad_float.cpp.html


/**************************************************************************\

MODULE: quad_float

SUMMARY:

The class quad_float is used to represent quadruple precision numbers.
Thus, with standard IEEE floating point, you should get the equivalent
of about 106 bits of precision (but actually just a bit less).

The interface allows you to treat quad_floats more or less as if they were
"ordinary" floating point types.

See below for more implementation details.


\**************************************************************************/

#include <NTL/ZZ.h>


class quad_float {
public:

quad_float(); // = 0

quad_float(const quad_float& a);  // copy constructor

explicit quad_float(double a);  // promotion constructor


quad_float& operator=(const quad_float& a);  // assignment operator
quad_float& operator=(double a);

~quad_float();


static void SetOutputPrecision(long p);
// This sets the number of decimal digits to be output.  Default is
// 10.


static long OutputPrecision();
// returns current output precision.


};


/**************************************************************************\

                             Arithmetic Operations

\**************************************************************************/




quad_float operator +(const quad_float& x, const quad_float& y);
quad_float operator -(const quad_float& x, const quad_float& y);
quad_float operator *(const quad_float& x, const quad_float& y);
quad_float operator /(const quad_float& x, const quad_float& y);


// PROMOTIONS: operators +, -, *, / promote double to quad_float
// on (x, y).

quad_float operator -(const quad_float& x);

quad_float& operator += (quad_float& x, const quad_float& y);
quad_float& operator += (quad_float& x, double y);

quad_float& operator -= (quad_float& x, const quad_float& y);
quad_float& operator -= (quad_float& x, double y);

quad_float& operator *= (quad_float& x, const quad_float& y);
quad_float& operator *= (quad_float& x, double y);

quad_float& operator /= (quad_float& x, const quad_float& y);
quad_float& operator /= (quad_float& x, double y);

quad_float& operator++(quad_float& a); // prefix
void operator++(quad_float& a, int); // postfix

quad_float& operator--(quad_float& a); // prefix
void operator--(quad_float& a, int); // postfix



/**************************************************************************\

                                  Comparison

\**************************************************************************/


long operator> (const quad_float& x, const quad_float& y);
long operator>=(const quad_float& x, const quad_float& y);
long operator< (const quad_float& x, const quad_float& y);
long operator<=(const quad_float& x, const quad_float& y);
long operator==(const quad_float& x, const quad_float& y);
long operator!=(const quad_float& x, const quad_float& y);

long sign(const quad_float& x);  // sign of x, -1, 0, +1
long compare(const quad_float& x, const quad_float& y); // sign of x - y

// PROMOTIONS: operators >, ..., != and function compare
// promote double to quad_float on (x, y).


/**************************************************************************\

                               Input/Output
Input Syntax:

<number>: [ "-" ] <unsigned-number>
<unsigned-number>: <dotted-number> [ <e-part> ] | <e-part>
<dotted-number>: <digits> | <digits> "." <digits> | "." <digits> | <digits> "."
<digits>: <digit> <digits> | <digit>
<digit>: "0" | ... | "9"
<e-part>: ( "E" | "e" ) [ "+" | "-" ] <digits>

Examples of valid input:

17 1.5 0.5 .5  5.  -.5 e10 e-10 e+10 1.5e10 .5e10 .5E10

Note that the number of decimal digits of precision that are used
for output can be set to any number p >= 1 by calling
the routine quad_float::SetOutputPrecision(p).  
The default value of p is 10.
The current value of p is returned by a call to quad_float::OutputPrecision().



\**************************************************************************/


istream& operator >> (istream& s, quad_float& x);
ostream& operator << (ostream& s, const quad_float& x);


/**************************************************************************\

                                  Miscellaneous

\**************************************************************************/



quad_float sqrt(const quad_float& x);
quad_float floor(const quad_float& x);
quad_float ceil(const quad_float& x);
quad_float trunc(const quad_float& x);
quad_float fabs(const quad_float& x);
quad_float exp(const quad_float& x);
quad_float log(const quad_float& x);


void power(quad_float& x, const quad_float& a, long e); // x = a^e
quad_float power(const quad_float& a, long e);

void power2(quad_float& x, long e); // x = 2^e
quad_float power2_quad_float(long e);

quad_float ldexp(const quad_float& x, long e);  // return x*2^e

long IsFinite(quad_float *x); // checks if x is "finite"   
                              // pointer is used for compatability with
                              // IsFinite(double*)


void random(quad_float& x);
quad_float random_quad_float();
// generate a random quad_float x with 0 <= x <= 1





/***********************************************************************\

IMPLEMENTATION DETAILS

A quad_float x is represented as a pair of doubles, x.hi and x.lo,
such that the number represented by x is x.hi + x.lo, where

   |x.lo| <= 0.5*ulp(x.hi),  (*)

and ulp(y) means "unit in the last place of y".  

For the software to work correctly, IEEE Standard Arithmetic is sufficient.  
However, there are a couple subtle points (see below).

This is a rather weird representation;  although it gives one
essentially twice the precision of an ordinary double, it is
not really the equivalent of quadratic precision (despite the name).
For example, the number 1 + 2^{-200} can be represented exactly as
a quad_float.  Also, there is no real notion of "machine precision".

Note that overflow/underflow for quad_floats does not follow any particularly
useful rules, even if the underlying floating point arithmetic is IEEE
compliant.  Generally, when an overflow/underflow occurs, the resulting value
is unpredicatble, although typically when overflow occurs in computing a value
x, the result is non-finite (i.e., IsFinite(&x) == 0).  Note, however, that
some care is taken to ensure that the ZZ to quad_float conversion routine
produces a non-finite value upon overflow.



THE EXTENDED PRECISION PROBLEM

Some machines may compute intermediate floating point results in an extended
double precision format.  The only machines I know that do this are old
(pre-SSE) x86 machines that rely on the x87 coprocessor instruction set, so
this is mainly of historical interest these days.  On these old machines,
intermediate results are stored in the 80-bit x87 floating point registers.
These registers have 53 or 64 bits of precision---this can be set at run-time
by modifying the cpu "control word" (either in assembly or through special
compiler intrinsics).  If the preicsion is set to 64 bits, the quad_float
routines will produce incorrect results.


The best way to fix this problem is to set the precsion of the x87 registers to
53 at the beginning of program execution and to leave it.  This is what Windows
with MSVC++ does.  On Linux with gcc (and other compilers), however, this is
not the case: the precision of the registers is set to 64 bits.

To fix this problem, the build process for NTL automatically checks for
extended double preceision, and if detected, arranges for the quad_float
routines to locally set the precision to 53 bits on entry and back to 64 bits
on exit.  This only works with GCC and GCC-compatible compilers (including
CLANG and ICC) that support GCC's inline assembly syntax.  The configuration
flags  NTL_FIX_X86 or NTL_NO_FIX_X86 can be set to override NTL's default
behavior (but I've never really seen a need to use them).

If all else fails, NTL will revert to a strategy of  forcing all intermediate
floating point results into memory.  This is not an ideal fix, since it is not
fully equivalent to 53-bit precision (because of double rounding), but it works
(although to be honest, I've never seen a full proof of correctness in this
case).  NTL's quad_float code does this by storing intermediate results in
local variables declared to be 'volatile'.  This is the solution to the problem
that NTL uses if it detects the problem and can't fix it using the control word
hack mentioned above.  In fact, I've never seen a platform where this strategy
is required.

To reiterate: this is all mainly of historical interest, as the x87 coprocessor
is pretty much not used any more.


THE FMA PROBLEM

Some CPUs come equipped with a fused-multiply-add (FMA) instruction, which
computes x + a*b with just a single rounding.  While this generally is faster
and more precise than performing this using two instructions and two roundings,
FMA instructions can break the logic of quad_float.  

To mitigate this problem, NTL's configuration script will attempt to detect the
existence of FMA's, and if detected, to supress them using a compiler flag.
Right now, NTL's configuration script only attempts to fix this problem for the
GCC, CLANG, and ICC compilers.  GCC,  this is done with the flag
-ffp-contract=off (or the flag -mno-fused-madd, which is obsolete in new
versions of GCC).  On CLANG, this is also done with the flag -ffp-contract=off,
although by default, CLANG will not emit FMA's, so this should not be
necessary.  On ICC, this is done by the fp_contract(off) pragma.  This is
accomplished by passing information through the make variable NOCONTRACT, which
is only used in the compilation of quad_float.cpp.



THE FLOATING POINT REASSOCIATION PROBLEM

The C++ standard says that compilers must issue instructions that respect the
grouping of floating point operations.  So the compiler is not allowed to
compile (a+b)+c as a+(b+c).  Most compilers repect this rule by default, but
some do not (such as Intel's ICC compiler).  The logic of quad_float relies
crucially on this rule.

In fact, NTL attempts to ensure that all of its source files get compiled with
this rule in force.  To this end, if the configuration script detects an ICC
compiler, it will pass the "-fp-model precise" flag through to the compiler
(via the CXXAUTOFLAGS make variable) when compiling *all* source files (not
just quad_float.cpp).


BACKGROUND INFO

The code NTL uses algorithms designed by Knuth, Kahan, Dekker, and
Linnainmaa.  The original transcription to C++ was done by Douglas
Priest.  Enhancements and bug fixes were done by Keith Briggs ---
see http://keithbriggs.info/doubledouble.html.  The NTL version is a
stripped down version of Briggs' code, with some bug fixes and
portability improvements.  

Here is a brief annotated bibliography (compiled by Priest) of papers 
dealing with DP and similar techniques, arranged chronologically.


Kahan, W., Further Remarks on Reducing Truncation Errors,
  {\it Comm.\ ACM\/} {\bf 8} (1965), 40.

M{\o}ller, O., Quasi Double Precision in Floating-Point Addition,
  {\it BIT\/} {\bf 5} (1965), 37--50.

  The two papers that first presented the idea of recovering the
  roundoff of a sum.

Dekker, T., A Floating-Point Technique for Extending the Available
  Precision, {\it Numer.\ Math.} {\bf 18} (1971), 224--242.

  The classic reference for DP algorithms for sum, product, quotient,
  and square root.

Pichat, M., Correction d'une Somme en Arithmetique \`a Virgule
  Flottante, {\it Numer.\ Math.} {\bf 19} (1972), 400--406.

  An iterative algorithm for computing a protracted sum to working
  precision by repeatedly applying the sum-and-roundoff method.

Linnainmaa, S., Analysis of Some Known Methods of Improving the Accuracy
  of Floating-Point Sums, {\it BIT\/} {\bf 14} (1974), 167--202.

  Comparison of Kahan and M{\o}ller algorithms with variations given
  by Knuth.

Bohlender, G., Floating-Point Computation of Functions with Maximum
  Accuracy, {\it IEEE Trans.\ Comput.} {\bf C-26} (1977), 621--632.

  Extended the analysis of Pichat's algorithm to compute a multi-word
  representation of the exact sum of n working precision numbers.
  This is the algorithm Kahan has called "distillation".

Linnainmaa, S., Software for Doubled-Precision Floating-Point Computations,
  {\it ACM Trans.\ Math.\ Soft.} {\bf 7} (1981), 272--283.

  Generalized the hypotheses of Dekker and showed how to take advantage
  of extended precision where available.

Leuprecht, H., and W.~Oberaigner, Parallel Algorithms for the Rounding-Exact
  Summation of Floating-Point Numbers, {\it Computing} {\bf 28} (1982), 89--104.

  Variations of distillation appropriate for parallel and vector
  architectures.

Kahan, W., Paradoxes in Concepts of Accuracy, lecture notes from Joint
  Seminar on Issues and Directions in Scientific Computation, Berkeley, 1989.

  Gives the more accurate DP sum I've shown above, discusses some
  examples.

Priest, D., Algorithms for Arbitrary Precision Floating Point Arithmetic,
  in P.~Kornerup and D.~Matula, Eds., {\it Proc.\ 10th Symposium on Com-
  puter Arithmetic}, IEEE Computer Society Press, Los Alamitos, Calif., 1991.

  Extends from DP to arbitrary precision; gives portable algorithms and
  general proofs.

Sorensen, D., and P.~Tang, On the Orthogonality of Eigenvectors Computed
  by Divide-and-Conquer Techniques, {\it SIAM J.\ Num.\ Anal.} {\bf 28}
  (1991), 1752--1775.

  Uses some DP arithmetic to retain orthogonality of eigenvectors
  computed by a parallel divide-and-conquer scheme.

Priest, D., On Properties of Floating Point Arithmetics: Numerical Stability
  and the Cost of Accurate Computations, Ph.D. dissertation, University
  of California at Berkeley, 1992.

  More examples, organizes proofs in terms of common properties of fp
  addition/subtraction, gives other summation algorithms.

Another relevant paper: 

X. S. Li, et al.
Design, implementation, and testing of extended and mixed 
precision BLAS.  ACM Trans. Math. Soft., 28:152-205, 2002.



\***********************************************************************/