A real number $x$ is said to be effective if there exists an algorithm which, given a required tolerance $\varepsilon$, returns a binary approximation $\tilde{x}$ for $x$ with $|\tilde{x} - x| \leq \varepsilon$. Effective real numbers are interesting in areas of numerical analysis where numerical instability is a major problem.
One key problem with effective real numbers is to perform intermediate computations at the smallest precision which is sufficient to guarantee an exact end-result. In this paper we first review two classical techniques to achieve this: a priori error estimates and interval analysis. We next present two new techniques: “relaxed evaluations” reduce the amount of re-evaluations at larger precisions and “balanced error estimates” automatically provide good tolerances for intermediate computations.
Keywords: effective real number, algorithm, interval analysis, error estimates.
A dyadic number is a number of the form $m 2^e$ with $m, e \in \mathbb{Z}$. We denote by $\mathbb{D} = \mathbb{Z}\, 2^{\mathbb{Z}}$ the set of dyadic numbers and $\mathbb{D}^{>} = \{\varepsilon \in \mathbb{D} : \varepsilon > 0\}$. Given $x \in \mathbb{R}$ and $\varepsilon \in \mathbb{D}^{>}$, an $\varepsilon$-approximation for $x$ is a number $\tilde{x} \in \mathbb{D}$ with $|\tilde{x} - x| \leq \varepsilon$. An approximation algorithm for $x$ is an algorithm which takes a tolerance $\varepsilon \in \mathbb{D}^{>}$ on input and which returns an $\varepsilon$-approximation for $x$. A real number $x$ is said to be effective, if it admits an approximation algorithm. The aim of this paper is to review several recent techniques for computations with effective real numbers and to present a few new ones.
Effective real numbers can be useful in areas of numerical analysis where numerical instability is a major problem, like singularity theory. In such areas, multiple precision arithmetic is often necessary and the required precisions for auxiliary computations may vary heavily. We hope that the development of an adequate computational theory for effective real numbers will allow us both to automatically perform the error analysis and to compute the required precisions at which computations have to take place.
We recall that there exists no general zero-test for effective real numbers. Nevertheless, exact or heuristically reliable zero-tests do exist for interesting subfields, which contain transcendental constants. However, this topic will not be studied in this paper and we refer to [1, 2] for some recent work on this matter.
A first classical technique relies on a priori error estimates: if we wish to evaluate an expression $f(x_1, \ldots, x_n)$ with precision $\varepsilon$, then we determine tolerances $\varepsilon_1, \ldots, \varepsilon_n$ such that the multiple precision evaluation of $f$ at any $\varepsilon_i$-approximations for the $x_i$ always yields an $\varepsilon$-approximation for $f(x_1, \ldots, x_n)$. This technique will be reviewed in section 2.
In some cases, a priori error estimates are quite pessimistic.
An alternative technique for computing with effective real numbers is interval arithmetic. In this case, we approximate an effective real number $x$ by an interval $[\underline{x}, \overline{x}]$ which contains $x$. Then the evaluation of $y = f(x_1, \ldots, x_n)$ comes down to the determination of an interval $[\underline{y}, \overline{y}]$ such that $y \in [\underline{y}, \overline{y}]$ whenever $x_i \in [\underline{x}_i, \overline{x}_i]$ for each $i$. For continuous functions $f$, when starting with a very precise approximation of $x_1, \ldots, x_n$ (i.e. $\overline{x}_i - \underline{x}_i$ is very small), the obtained a posteriori error estimate $\overline{y} - \underline{y}$ for $y$ will also become as small as desired. In section 3, this technique of a posteriori error estimates will be reviewed in more detail, as well as an efficient representation for high-precision intervals.
Unfortunately, in some cases, both a priori and a posteriori error estimates are quite pessimistic. In sections 4 and 5, we will therefore present two new techniques for the computation of "adaptive error estimates". These techniques combine the advantages of a priori and a posteriori error estimates, while eliminating their major disadvantages. Moreover, these new techniques remain reasonably easy to implement in a general purpose system for computations with effective real numbers. Under certain restrictions, the new techniques are also close to being optimal. This point will be discussed in section 6.
The approaches presented in sections 2 and 3 have been proposed, often independently, by several authors [3, 4, 5] and we notice the existence of several similarities with the theory of effective power series [6, 7]. In the past, we experimentally implemented the approaches of sections 2 and 3 in the case of power series (not officially distributed) resp. real numbers [8]. We are currently working on a more robust C++ library based on the ideas in this paper [9]. Currently, this library contains the basic arithmetic operations. Some other implementations with similar objectives are currently known to the author [10, 11]. All these implementations are free software and they are mainly based on the GMP [12] and MPFR [13] libraries.
A natural way to represent an effective real number $x$ is by its approximation algorithm. Conceptually speaking, this means that we view $x$ as a black box which can be asked for approximations up to any desired precision.
In an object oriented language like C++, this can be implemented using an abstract base class real_rep with a virtual method for the approximation algorithm. Effective real numbers will then be pointers to real_rep. For instance,
class real_rep {
public:
virtual dyadic approx (const dyadic& tol) = 0;
};
typedef real_rep* real;
Here dyadic stands for the class of dyadic numbers. In practice, these may be taken to be arbitrary precision floating point numbers. For simplicity, we also do not care about memory management. In a real implementation, one would need a mechanism for reference counting or conservative garbage collection.
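For concreteness, a minimal toy version of such a class might look as follows. This sketch is ours, not the library's actual dyadic type: the mantissa is a fixed-size machine integer instead of an arbitrary precision one, and overflow and signed-shift caveats are ignored, but it provides the few operations (exact addition, comparison, halving via >>) used by the examples in this paper.

#include <cstdint>

// Toy dyadic number m * 2^e.  A real implementation would store an arbitrary
// precision mantissa (e.g. on top of a bignum library); here we use a 64 bit
// integer, which is enough to illustrate the interface.
struct dyadic {
  std::int64_t m;  // mantissa
  int e;           // exponent

  dyadic (std::int64_t m2 = 0, int e2 = 0): m (m2), e (e2) {}

  // Halving a tolerance k times: (m * 2^e) >> k  ==  m * 2^(e-k).
  dyadic operator >> (int k) const { return dyadic (m, e - k); }

  // Exact addition: align the exponents, then add the mantissas.
  friend dyadic operator + (const dyadic& a, const dyadic& b) {
    if (a.e >= b.e) return dyadic ((a.m << (a.e - b.e)) + b.m, b.e);
    else            return dyadic (a.m + (b.m << (b.e - a.e)), a.e);
  }

  // Comparison, again by aligning the exponents.
  friend bool operator < (const dyadic& a, const dyadic& b) {
    if (a.e >= b.e) return (a.m << (a.e - b.e)) < b.m;
    else            return a.m < (b.m << (b.e - a.e));
  }
};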
Now assume that we want to evaluate $y = f(x_1, \ldots, x_n)$ for a given tolerance $\varepsilon$, where $x_1, \ldots, x_n$ are effective real numbers and $f$ is an operation like $+$, $-$ or $\times$. In order to make $y$ effective, we need to construct an approximation algorithm for $y$ as a function of the approximation algorithms for $x_1, \ldots, x_n$. This involves both the dyadic approximation of the evaluation of $f$ at dyadic approximations of the $x_i$ and the determination of tolerances $\varepsilon_1, \ldots, \varepsilon_n$ for the dyadic approximations of the $x_i$. More precisely, $\varepsilon_1, \ldots, \varepsilon_n$ should be such that for any $\tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{D}$ with $|\tilde{x}_i - x_i| \leq \varepsilon_i$, we have

$$|\tilde{f}_{\delta}(\tilde{x}_1, \ldots, \tilde{x}_n) - f(x_1, \ldots, x_n)| \leq \varepsilon,$$

where $\tilde{f}_{\delta}$ stands for a dyadic approximation algorithm for $f$ which depends on the tolerance $\delta$. For instance, in the case when $f$ is the addition, we may use exact arithmetic on $\mathbb{D}$ (so that $\tilde{f}_{\delta} = f$ for all $\delta$) and take $\varepsilon_1 = \varepsilon_2 = \varepsilon / 2$. This yields the following class for representing sums of real numbers:
class add_real_rep: public real_rep {
  real x, y;
public:
  add_real_rep (const real& x2, const real& y2):
    x (x2), y (y2) {}
  dyadic approx (const dyadic& tol) {
    return x->approx (tol >> 1) + y->approx (tol >> 1); }
};
The addition can now be implemented as follows:
inline real operator + (const real& x, const real& y) {
  return new add_real_rep (x, y); }
Notice that, in a sense, we have really represented the sum of $x$ and $y$ by the expression $x + y$ (more generally, such expressions are dags, i.e. directed acyclic graphs). Nevertheless, the representation using an abstract class real_rep provides additional flexibility. For instance, we may attach additional information to the class real_rep, like the best currently known approximation for the number (thereby avoiding unnecessary recomputations). In practice, it is also good to provide an additional abstract method for computing a rough upper bound for the number. This gives a fine-grained control over potential cancellations.
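As an illustration of the previous paragraph, a cached variant of real_rep might look as follows. This is our own sketch (the names cached_real_rep, compute and bound are not from the text): the best approximation computed so far is remembered, and a virtual method returns a rough upper bound for the absolute value of the number.

// Variant of real_rep which caches the best approximation computed so far
// and exposes a rough upper bound on the number (for cancellation control).
class cached_real_rep {
  dyadic cache;      // best approximation computed so far
  dyadic cache_tol;  // tolerance for which 'cache' was computed
  bool   cached;
public:
  cached_real_rep (): cached (false) {}
  virtual ~cached_real_rep () {}

  // The approximation algorithm proper, provided by each subclass.
  virtual dyadic compute (const dyadic& tol) = 0;

  // Rough upper bound for the absolute value of the represented number.
  virtual dyadic bound () = 0;

  // Front end: only recompute when a strictly better tolerance is required.
  dyadic approx (const dyadic& tol) {
    if (!cached || tol < cache_tol) {
      cache     = compute (tol);
      cache_tol = tol;
      cached    = true;
    }
    return cache;
  }
};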
The above approach heavily relies on the computation of a priori error estimates (i.e. the computation of the tolerances $\varepsilon_i$). If no additional techniques are used, then this leads to the following disadvantages:

P1. We do not take advantage of the fact that the numeric evaluation of $f$ may lead to an approximation of $f(x_1, \ldots, x_n)$ which is far better than the required tolerance $\varepsilon$. Indeed, multiple precision computations are usually done with a precision which is a multiple of the number of bits $w$ in a machine word. "On average", we therefore gain something like $w/2$ bits of precision. In section 4, we will show that it may actually be profitable to systematically compute more than necessary.
P2. The error estimates may be pessimistic, due to badly balanced expressions. For instance, consider the $\varepsilon$-approximation of a sum $x_1 + (x_2 + (\cdots + x_n) \cdots)$, which corresponds to a completely unbalanced tree. Then the above technique would lead to the computation of an $\varepsilon / 2^i$-approximation of $x_i$ for each $i < n$ and an $\varepsilon / 2^{n-1}$-approximation of $x_n$. If $n$ is large, then $\varepsilon / 2^{n-1}$ is unnecessarily small, since a mere $\varepsilon / n$-approximation for each $x_i$ would do. In section 5 we will consider a general technique for computing "balanced error estimates". Badly balanced expressions naturally occur when evaluating polynomials using Horner's rule (see the sketch below).
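For instance, evaluating a polynomial with Horner's rule over the real type of the previous paragraphs produces exactly such a comb-shaped dag. The snippet below is ours and assumes that an operator * on real has been defined analogously to operator +:

// Evaluate c[0] + c[1]*x + ... + c[n]*x^n with Horner's rule.  The additions
// alone form a chain of depth n, so the tolerance-halving strategy of this
// section already requests a tolerance of about eps/2^i from the part of
// degree >= i, even though eps/(n+1) per term would suffice.
real horner (const real* c, int n, const real& x) {
  real r = c[n];
  for (int i = n - 1; i >= 0; i--)
    r = c[i] + r * x;
  return r;
}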
An alternative technique for computing with effective real numbers is interval arithmetic. The idea is to systematically compute interval approximations instead of floating point approximations. These intervals must be such that the real numbers we are interested in are certified to lie in their respective interval approximations.
More precisely, given $x \in \mathbb{R}$ and $\varepsilon \in \mathbb{D}^{>}$, an $\varepsilon$-interval for $x$ is a closed interval $\mathbf{x} = [\underline{x}, \overline{x}]$ with $x \in \mathbf{x}$ and radius $r(\mathbf{x}) = (\overline{x} - \underline{x})/2 \leq \varepsilon$. Concretely speaking, we may represent such intervals by their endpoints $\underline{x}$ and $\overline{x}$. Alternatively, if the precisions of $\underline{x}$ and $\overline{x}$ are large, then it may be more efficient to represent the interval by its center $c(\mathbf{x}) = (\underline{x} + \overline{x})/2$ and its radius $r(\mathbf{x})$. Indeed, the exact endpoints of the interval are not that important. Hence, modulo a slight increase of the radius, we may always assume that the radius can be stored in a "single precision dyadic number" $m 2^e$ with $|m| < 2^w$, where $w$ is the number of bits in a machine word. This trick allows us to reduce the number of multiple precision computations by a factor of two.
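A center-radius representation along these lines could be sketched as follows. This is our code, named centered_interval to avoid confusion with the interval class used later in the paper; a full implementation would also round the center and enlarge the radius so as to keep the enclosure certified.

// Interval represented by its center and its radius.  The center is a
// multiple precision dyadic number, while a "single precision" dyadic number
// (small mantissa, arbitrary exponent) suffices for the radius.
struct centered_interval {
  dyadic center;
  dyadic radius;

  centered_interval (const dyadic& c, const dyadic& r):
    center (c), radius (r) {}

  // Addition: the centers add exactly and the radii add; only the radius
  // operation would need upward rounding in a certified implementation.
  friend centered_interval operator + (const centered_interval& a,
                                       const centered_interval& b) {
    return centered_interval (a.center + b.center, a.radius + b.radius);
  }
};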
Now assume that we want to compute an $\varepsilon$-approximation for $y = f(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ are effective real numbers and where $f$ is a continuous function. Assume also that we have reasonable initial intervals for the $x_i$. Then, starting with a low precision $p$, we first compute $2^{-p}$-intervals $\mathbf{x}_1, \ldots, \mathbf{x}_n$ for the $x_i$. We next evaluate $f$ using interval arithmetic. This yields an interval $\mathbf{y}$ with $y \in \mathbf{y}$. If $r(\mathbf{y}) \leq \varepsilon$, then $c(\mathbf{y})$ is an $\varepsilon$-approximation for $y$. Otherwise, we increase the precisions of the approximations and repeat the computations. Under relatively mild assumptions on the way we evaluate $f$, this procedure will eventually stop, since $f$ is continuous.
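Schematically, the loop may be written as follows. This is our own sketch: it assumes an interval class with free functions radius and center, the usual comparison operators on dyadic, and argument objects which can return a $2^{-p}$-interval at a requested precision $p$; it simply doubles the precision at every round, which is only one possible schedule (see problem P3 below).

#include <functional>
#include <vector>

// A posteriori evaluation of y = f(x_1, ..., x_n): increase the working
// precision until the computed enclosure of y is tight enough.
dyadic approx_a_posteriori (
    const std::vector<std::function<interval (int)> >& args, // 2^-p intervals
    const std::function<interval (const std::vector<interval>&)>& f,
    const dyadic& eps)
{
  for (int p = 64; ; p *= 2) {                // naive doubling schedule (P3)
    std::vector<interval> xs;
    for (const auto& arg : args)
      xs.push_back (arg (p));                 // 2^-p interval for each x_i
    interval y = f (xs);                      // certified interval evaluation
    if (radius (y) <= eps)
      return center (y);                      // tight enough: done
  }
}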
Although this technique of a posteriori error estimates does solve the problems P1 and P2 raised in the previous section, it also induces some new problems. Most importantly, we have lost the fine-grained control over the precisions in intermediate computations during the evaluation of $f$. Indeed, we only have control over the overall starting precision $p$. This disadvantage is reflected in two ways:
P3. It is not clear how to increase the precision $p$. Ideally speaking, $p$ should be increased in such a way that the computation time of $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ is doubled at each iteration. In that case (see section 4), the overall computation time is bounded by a constant times the computation time w.r.t. the least precision $p$ which leads to an $\varepsilon$-approximation for $y$. Now the problem is that this "overall computation time" of $f$ can be estimated well by hand for elementary operations like $+$, $\times$, $\exp$, etc., but not necessarily for more complicated functions.
P4. Consider the case when, somewhere during the evaluation of $f$, we need to compute the sum $x + y$ of a very large number $x$ and a very small number $y$. Assume moreover that $x$ can be approximated very fast, but that the approximation of $y$ requires a lot of time. Since $y$ is very small w.r.t. $x$, it will then be possible to skip the computation of $y$, by setting $\tilde{y} = 0$, unless the required precision is very large. However, the above technique of a posteriori error estimates does not allow for this optimization.
Relaxed evaluation can be used in order to "combine the good ideas" of sections 2 and 3. The idea is to endow real_rep with an extra field interval best which corresponds to the best currently known interval approximation of the real number. Moreover, when requesting a better approximation, we will actually compute a much better approximation, so as to avoid expensive recomputations when we repeatedly ask for slightly better approximations. This anticipating strategy was first introduced in the case of power series, where it leads to important algorithmic gains [14, 7]. In our context, the proposed method is quite different though, because of the problem with carries.
More precisely, assume that we have an $n$-ary operation $f$, such that the $d$-digit approximation of $f$ at dyadic numbers with $d$ digits has time complexity $T(d)$. The complexity $T(d)$ for an elementary operation is usually a well-understood, regular function (in general, $T(d)/d$ is increasing, etc.). For instance, we have $T(d) = O(d)$ for addition and $T(d) = O(d^{\alpha})$ for multiplication, where $\alpha = \alpha(d)$ depends on the precision and decreases slowly from $2$ to $1$.
Now assume that we have a $p$-digit approximation of $f(x_1, \ldots, x_n)$ and that we request a slightly better approximation. Then we let $p'$ be such that $T(p') \geq 2\, T(p)$ and we replace our $p$-digit approximation by a $p'$-digit approximation. This strategy has the property that the successive evaluation of $f$ at digits $p_1 < \cdots < p_k$ requires a time $T(n_1) + \cdots + T(n_l)$, where $n_1 < \cdots < n_l$ are such that $T(n_{i+1}) \geq 2\, T(n_i)$ for $i < l$ and $T(n_l) \leq 2\, T(p_k)$. Consequently

$$T(n_1) + \cdots + T(n_l) \leq 2\, T(n_l) \leq 4\, T(p_k).$$

More generally, the evaluation of $f$ at different precisions $p_1 \leq \cdots \leq p_k$ requires at most four times as much time as the evaluation of $f$ at precision $p_k$.
The combination of a priori and a posteriori error estimates in the relaxed strategy clearly solves problems P1 and P4. Furthermore, at the level of the class real_rep, we have a fine-grained control over how to increase the computational precision. Moreover, we have shown how to perform this increase in an efficient way, as a function of the cost function $T$. This will avoid expensive recomputations when the same number occurs many times in a dag. In other words, the relaxed technique also solves problem P3.
Let us illustrate the relaxed strategy on the concrete example of the computation of the sine function. We assume that we have implemented a suitable class interval for certified arbitrary precision computations with intervals. First of all, real_rep now becomes:
class real_rep {
protected:
interval best;
public:
inline real_rep (): best (interval::fuzzy) {}
virtual interval approx (const dyadic& tol) = 0;
};
Here interval::fuzzy stands for the interval $[-\infty, +\infty]$. Sines of numbers are represented
by instances of the following class:
class sin_real_rep: public real_rep {
real x;
public:
inline sin_real_rep (const real& x2): x (x2) {
best= interval (-1, 1); }
interval approx (const dyadic& tol);
};
The approximation algorithm is given by
interval sin_real_rep::approx (const dyadic& tol) {
if (tol < radius (best)) {
interval xa= x->approx (tol * (1 - DELTA));
int required_prec= 2 - expo (tol);
int proposed_prec= next_prec (-expo (radius (best)));
xa= truncate (xa, max (required_prec, proposed_prec));
best= sin (xa);
}
return best;
}
This algorithm requires some explanations. First of all, DELTA stands for a small number like $2^{-w}$, where $w$ is the machine word size. We use DELTA for countering the effect of rounding errors in the subsequent computations. Next, expo stands for the function which gives the exponent of a dyadic number; we have $2^{\operatorname{expo}(x) - 1} \leq |x| < 2^{\operatorname{expo}(x)}$ for all non-zero $x \in \mathbb{D}$. The function next_prec (see below) applied to a precision $p$ computes the precision $p'$ such that $T(p') \approx 2\, T(p)$, where $T$ stands for the complexity of the evaluation of the sine
function. Now we recall that xa may contain an
approximation for x which is much better than the
approximation we need (this is typically the case if a more precise
value for the argument is needed in another part of the global
computation). We therefore truncate the precision of xa
at the precision we need, before computing the next approximation for
the sine of x. The function next_prec
may for instance be implemented as follows:
int next_prec (int prec) {
if (prec <= (16 * WORD_SIZE)) return 1.41 * prec +
WORD_SIZE;
else if (prec <= 256 * WORD_SIZE) return 1.55 * prec;
else if (prec <= 4096 * WORD_SIZE) return 1.71 * prec;
else return 2 * prec;
}
The different thresholds correspond to the precisions where we use naive multiplication, Karatsuba multiplication and two ranges of F.F.T. multiplication; the growth factors are chosen so that each step roughly doubles the cost of a multiplication at the corresponding precision (for instance, $1.41^2 \approx 2$ in the naive range and $1.55^{\log_2 3} \approx 2$ in the Karatsuba range). We finally have the following algorithm for computing sines:
inline real sin (const real& x) {
return new sin_real_rep (x); }
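The text does not show how leaves of the dag are represented. For completeness, here is one possible minimal leaf class together with a small usage example; this is our code, and it assumes that interval can be constructed from its two dyadic endpoints and that dyadic has constructors taking a mantissa and an exponent.

// Leaf node: an exactly known dyadic constant, represented by the
// degenerate interval [c, c], which satisfies any tolerance.
class dyadic_real_rep: public real_rep {
public:
  inline dyadic_real_rep (const dyadic& c) {
    best= interval (c, c); }
  interval approx (const dyadic& tol) {
    return best; }
};

inline real constant (const dyadic& c) {
  return new dyadic_real_rep (c); }

// Usage: an enclosure of sin 1 with tolerance 2^-100.
//   real x= constant (dyadic (1));
//   real y= sin (x);
//   interval enc= y->approx (dyadic (1, -100));   // the dyadic 1 * 2^-100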
Remark 1. In the above algorithm, it may happen that the precision $p'$ of xa is only slightly higher than the precision $p$ of best. In that case, the next approximation of $\sin x$ is only computed at precision $p'$ and not at a precision (like next_prec ($p$)) which doubles the computation time.
When the above problem occurs repeatedly, we thereby lose the principal advantage of the relaxed approach. Nevertheless, it should be noticed that the slight redundancy in the precision $p'$ can only occur if this higher precision was requested by another computation than the one of $\sin x$ (except when the operation has an exponential computational complexity). In fact, the author currently does not know of any natural example of a computation where the above problem occurs systematically (so as to make the computation time increase by more than a constant factor).
One remedy is to delay the actual evaluation best=sin(xa) until the very end and first gradually increase precisions until we are sure that the total re-evaluation cost is about twice as large as the previous evaluation cost (considering only those nodes which are re-evaluated). However, this is only possible if we do not need the actual result of a re-evaluation during the “prospective phase” in which we increase the precisions.
Yet another approach would be to systematically use the real number analogue of relaxed power series evaluations [7]. However, this requires the re-implementation of all basic operations on floating point numbers in a relaxed way. It also involves an additional overhead.
One of the main problems with a priori error estimates that we mentioned in section 2 is that of badly balanced expressions, which lead to overly pessimistic bounds. In this section we introduce the general technique of "balanced error estimates", which solves this problem as well as possible in a general purpose system. The idea behind balanced error estimates is to "distribute the required tolerance over the subexpressions in a way which is proportional to their weights". Here leaves have weight $1$ and the weight of an expression $e = f(e_1, \ldots, e_n)$ is given by $w(e) = w(e_1) + \cdots + w(e_n)$.
Remark 2. […], where $w$ is the size of a machine word.
Let us first illustrate the strategy of balanced error estimates in the case of addition. So assume that we want to compute an $\varepsilon$-interval for a sum $x + y$, where $x$ has weight $p$ as an expression and $y$ has weight $q$. Then we will compute a $\frac{p}{p+q}\varepsilon$-interval for $x$ and a $\frac{q}{p+q}\varepsilon$-interval for $y$. For example, in the case of the expression $(x_1 + x_2) + x_3$ with leaves of weight $1$, a tolerance $\varepsilon$ at the root is distributed as $\frac{2}{3}\varepsilon$ for $x_1 + x_2$ and $\frac{1}{3}\varepsilon$ for $x_3$, so that each leaf ends up with a tolerance of $\frac{1}{3}\varepsilon$.
Subtraction is treated in a similar way as addition.
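In code, the weighted splitting for addition could be implemented along the following lines. This sketch is ours: the weight fields are not part of the classes shown before, and it assumes that dyadic supports multiplication and division by machine integers (any rounding of the division towards zero only makes the first tolerance smaller, hence remains safe).

// Balanced addition node: the tolerance is split between the two operands
// proportionally to their weights instead of being halved.
class balanced_add_real_rep: public real_rep {
  real x, y;
  long wx, wy;                            // weights of the subexpressions
public:
  balanced_add_real_rep (const real& x2, const real& y2, long wx2, long wy2):
    x (x2), y (y2), wx (wx2), wy (wy2) {}
  interval approx (const dyadic& tol) {
    if (tol < radius (best)) {
      dyadic tx= (tol * wx) / (wx + wy);  // share of x: (wx/(wx+wy)) * tol
      dyadic ty= tol - tx;                // y receives the remainder
      best= x->approx (tx) + y->approx (ty);
    }
    return best;
  }
};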
Remark 3. Consider an expression $e$ represented by a dag with one root and only additions and subtractions. If a tolerance $\varepsilon$ is specified at the root, then the above algorithm requires a tolerance $\frac{w(e')}{w(e)}\varepsilon$ at each node $e'$ of weight $w(e')$. Notice that any correct algorithm should require a tolerance of at most $\varepsilon$ at such a node instead of $\frac{w(e')}{w(e)}\varepsilon$. In other words, the precision with respect to which we compute never exceeds the absolutely necessary precision by more than $\log_2 w(e)$ bits.
Remark 4. The precision loss $\log_2 w(e)$ can be linear in the size of the dag, as we saw in the case of repeated squarings (and more generally in the case of loops or recursive functions). Nevertheless, if $w(e)$ is exponential in the size, then the precision loss generally only causes the overall complexity to increase by a constant factor. Let us illustrate this point by an example: consider the computation of a $2^{-n}$-approximation of $x_k$, where $x_0 = x$ and $x_{i+1} = x_i + x_i$ for $0 \leq i < k$, so that $w(x_k) = 2^k$. Then our algorithm requires the computation of a $2^{-n-k+i}$-approximation of each $x_i$, which takes a time $O(k (n + k))$. The optimal algorithm would only require a $2^{-n-k}$-approximation of the leaf $x_0 = x$ (the intermediate results then being exact dyadic doublings), which again leads to a global complexity of $O(k (n + k))$. This example should be compared to the computation of a $2^{-n}$-approximation of a chain $x_1 + (x_2 + (\cdots + x_k) \cdots)$ with $k$ distinct leaves. Our balanced algorithm only requires a time $O(k (n + \log k))$, which is optimal in this case, whereas a non-balanced algorithm would yield a quadratic complexity $O(k (n + k))$.
In the case of multiplication, the above additive distribution of tolerances can no longer be used and one has to resort to "relative tolerances". Here a relative tolerance $\delta$ for $x$ corresponds to an absolute tolerance of $\delta |x|$. Now let $x$ and $y$ be effective real numbers of weights $p$ and $q$. Assume for simplicity that we are in the regular case when we have good starting approximations $\tilde{x}$ and $\tilde{y}$ for $x$ and $y$, say of relative tolerances $2^{-w}$, where $w$ is the machine word size. Then, in order to compute an $\varepsilon$-interval for $x y$, we first distribute the relative tolerance $\delta = \varepsilon / |\tilde{x}\, \tilde{y}|$ over $x$ and $y$ proportionally to their weights, and then compute a $\frac{p}{p+q}\delta\, |\tilde{x}|$-interval for $x$ and a $\frac{q}{p+q}\delta\, |\tilde{y}|$-interval for $y$. The product of these approximations yields an $\varepsilon$-interval for $x y$.
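To see why relative tolerances are the natural quantities to distribute here, one may use the following elementary bound (in our notation, not taken from the original text): if $|\tilde{x} - x| \leq \delta_x |x|$ and $|\tilde{y} - y| \leq \delta_y |y|$, then

$$|\tilde{x}\,\tilde{y} - x y| \;\leq\; |\tilde{x}|\,|\tilde{y} - y| + |y|\,|\tilde{x} - x| \;\leq\; (\delta_x + \delta_y + \delta_x \delta_y)\, |x y|,$$

so the relative error of a product is essentially the sum of the relative errors of its factors; the second order term $\delta_x \delta_y$ is absorbed by a small safety margin in an actual implementation.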
More generally, assume that $x_1, \ldots, x_n$ are effective real numbers of weights $w_1, \ldots, w_n$ and that we want to compute an $\varepsilon$-interval for $f(x_1, \ldots, x_n)$, where $f$ is a differentiable $n$-ary operation for which we are given an arbitrary precision evaluation algorithm at dyadic intervals. Before balancing tolerances as above, we first have to determine the numbers which have to be subdivided (i.e. $\varepsilon$ in the case of addition and the relative tolerance $\delta$ (really the quantities $\varepsilon/(2 M_1)$ and $\varepsilon/(2 M_2)$ introduced below) in the case of multiplication).
We define the optimal tolerances for $x_1, \ldots, x_n$ to be the maximal numbers $\eta_1, \ldots, \eta_n$, such that for any intervals $\mathbf{x}_1 \ni x_1, \ldots, \mathbf{x}_n \ni x_n$ with

$$r(\mathbf{x}_i) \leq \eta_i \qquad (1 \leq i \leq n) \qquad\qquad (1)$$

we have $r(f(\mathbf{x}_1, \ldots, \mathbf{x}_n)) \leq \varepsilon$, where $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ denotes the exact image interval. Let us assume for simplicity that we are in the regular case when there exist (computable) bounds $M_1, \ldots, M_n$ with $M_i \geq |\partial f / \partial x_i|$, which are valid on all intervals satisfying (1). Then (1) implies in particular that $r(f(\mathbf{x}_1, \ldots, \mathbf{x}_n)) \leq M_1\, r(\mathbf{x}_1) + \cdots + M_n\, r(\mathbf{x}_n)$ for such intervals. Conversely, $r(\mathbf{x}_i) \leq \varepsilon/(n M_i)$ for all $i$ implies $r(f(\mathbf{x}_1, \ldots, \mathbf{x}_n)) \leq \varepsilon$. This proves that $\eta_i \geq \varepsilon/(n M_i)$ for all $i$. For instance, in the case of addition, we may take $M_1 = M_2 = 1$ and we obtain $\eta_i \geq \varepsilon/2$. In the case of multiplication, we may take $M_1 = |x_2| + \eta_2$ and $M_2 = |x_1| + \eta_1$ and obtain $\eta_1 \geq \frac{\varepsilon}{2(|x_2| + \eta_2)}$ and $\eta_2 \geq \frac{\varepsilon}{2(|x_1| + \eta_1)}$.
Now, assuming that we have sharp bounds $M_i$ of this kind (so that $\varepsilon/(n M_i)$ is a sharp lower bound for the optimal tolerance $\eta_i$), we compute a $\frac{w_i}{w_1 + \cdots + w_n} \frac{\varepsilon}{M_i}$-interval for each $x_i$. Then we obtain the requested $\varepsilon$-interval for $f(x_1, \ldots, x_n)$ by evaluating $f$ at these intervals at a sufficient precision.
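For multiplication, a node implementing this recipe might look as follows. This is again our own sketch: upper_bound is a hypothetical helper returning a dyadic upper bound for the absolute values contained in an interval, divisions of dyadic numbers are assumed to be rounded downwards, the coarse enclosures obtained with tolerance $1$ are assumed to contain all later, finer enclosures, and safety margins for rounding and for the second order term of the product are omitted.

// Balanced multiplication node (regular case): bounds M1 >= max |y| and
// M2 >= max |x| are read off from coarse enclosures, and the tolerance is
// then distributed over the two factors proportionally to their weights.
class balanced_mul_real_rep: public real_rep {
  real x, y;
  long wx, wy;                                 // weights of the subexpressions
public:
  balanced_mul_real_rep (const real& x2, const real& y2, long wx2, long wy2):
    x (x2), y (y2), wx (wx2), wy (wy2) {}
  interval approx (const dyadic& tol) {
    if (tol < radius (best)) {
      dyadic M1= upper_bound (y->approx (dyadic (1)));   // |d(xy)/dx| <= M1
      dyadic M2= upper_bound (x->approx (dyadic (1)));   // |d(xy)/dy| <= M2
      dyadic tx= ((tol * wx) / (wx + wy)) / M1;          // weighted share of x
      dyadic ty= ((tol * wy) / (wx + wy)) / M2;          // weighted share of y
      best= x->approx (tx) * y->approx (ty);             // certified product
    }
    return best;
  }
};

In the purely irregular case discussed below, M1 or M2 may vanish or be a gross overestimate, which is precisely where this simple strategy breaks down.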
Remark 5. In the regular case, the precision with respect to which we compute never exceeds the absolutely necessary precision by more than $\log_2 (w_1 + \cdots + w_n) + O(1)$ bits.
The approach of balanced error estimates is more delicate in the irregular case. It is instructive to first gain a good understanding of the case of multiplication; the other operations can probably be treated in a similar way.
So assume that we want to compute a product $x y$. Let us first consider the semi-regular case when we still have a good initial approximation $\tilde{x}$ for one of the arguments, say for $x$, but not for the other. Then we first compute an $\frac{\varepsilon}{2(|\tilde{x}| + 1)}$-interval for $y$. In case when this yields an approximation $\tilde{y}$ for $y$ with a good relative tolerance, then we may proceed as in the regular case. Otherwise, we obtain at least an upper bound for $|y|$ and we let this upper bound play the role of $|\tilde{y}|$.
This leaves us with the purely irregular case when we neither have strictly positive lower bounds for $|x|$ nor for $|y|$. In this case, we have the choice between computing a more precise approximation for $x$ or for $y$, but it is not clear a priori which choice is optimal from the complexity point of view. One solution to this problem may be to continuously improve the precisions of the approximations for both $x$ and $y$, while distributing the available computation time equally over $x$ and $y$. It is not clear yet to us how to estimate the complexity of such an algorithm. Another solution, which is easier to implement, is to use rough upper bounds for $|x|$ and $|y|$ instead of their actual values, while increasing the precision continuously. However, this strategy is certainly not optimal in some cases.
We have presented two new “adaptive error estimating techniques” for computing with effective real numbers. We believe these techniques to be optimal in virtually all situations which occur in practice. It remains interesting to study the theoretical optimality in more detail. This first requires the introduction of a suitable notion of optimality.
Assume that we have a library of effective real functions $f_1, \ldots, f_l$, some of which may have arity $0$ (i.e. constants). Then any computation using this library involves only a finite number $N$ of expressions $x_1, \ldots, x_N$ constructed using the $f_i$. These expressions actually correspond to a directed acyclic graph, since they may share common subexpressions.
Assume that the aim of the computation is to obtain an $\varepsilon_i$-approximation for each $x_i$ with $i \in I$, for some subset $I \subseteq \{1, \ldots, N\}$. A certified solution to this problem consists of assigning a dyadic interval $\mathbf{x}_j$ to each node $x_j$ of the dag, such that

The interval associated to each $x_i$ with $i \in I$ satisfies $x_i \in \mathbf{x}_i$ and $r(\mathbf{x}_i) \leq \varepsilon_i$.

For each subexpression $x_j = f(x_{j_1}, \ldots, x_{j_k})$ with $f$ in the library, the interval associated to $x_j$ is obtained by evaluating $f$ at the intervals associated to $x_{j_1}, \ldots, x_{j_k}$.
The certified solution is said to be optimal, if the time needed to compute all interval labels is minimal. Modulo rounding errors, finding such an optimal solution corresponds to determining optimal tolerances for the leaves of the dag (which is equivalent to determining the lengths of the intervals associated to the leaves).
Remark 6. In practice, the tolerances $\varepsilon_i$ may themselves depend on the way we compute things. Also, we may require the $\varepsilon_i$-approximations for the $x_i$ during the intermediate computations, and in an arbitrary order. A more elaborate model would therefore consider a sequence of problems of the above type, with growing dags, growing sets $I$ and decreasing tolerances $\varepsilon_i$.
Clearly, it is difficult to efficiently find optimal certified solutions. In many cases, it should nevertheless be possible to find almost optimal solutions for which the cost of the $\varepsilon_i$-approximations exceeds the optimal cost by at most a constant factor.
We have already pointed out that our strategies are non-optimal in general (see remarks 1 and 4, and the discussion on the irregular case in section 5). Nevertheless, in some special cases, we do approach optimality.
First of all, if we are computing with expression trees (and not dags), then the problem in remark 1 cannot occur and the precision loss in remark 4 is at most logarithmic in the size of the expression.
Secondly, assume that all approximation algorithms for the $f_i$ have a polynomial time complexity $T_i$, and consider a computation which essentially spends its time in high precision arithmetic. Then the precision loss in remark 4 only causes a slow-down by a constant factor, since $T_i(O(d)) = O(T_i(d))$ for polynomial $T_i$.
D. Richardson, How to recognize zero, JSC 24 (1997) 627–645.
J. van der Hoeven, Zero-testing, witness conjectures and differential diophantine approximation, Tech. Rep. 2001-62, Prépublications d'Orsay (2001).
J. Blanck, V. Brattka, P. Hertling (Eds.), Computability and complexity in analysis, Vol. 2064 of Lect. Notes in Comp. Sc., Springer, 2001.
J. Blanck, General purpose exact real arithmetic, Tech. Rep. CSR 21-200, Luleå University of Technology, Sweden, http://www.sm.luth.se/~jens/ (2002).
M. Grimmer, K. Petras, N. Revol, Multiple precision interval packages: Comparing different approaches, Tech. Rep. RR 2003-32, LIP, École Normale Supérieure de Lyon (2003).
J. van der Hoeven, Automatic asymptotics, Ph.D. thesis, École polytechnique, France (1997).
J. van der Hoeven, Relax, but don't be too lazy, JSC 34 (2002) 479–542.
J. van der Hoeven, GMPX, a C-extension library for gmp, http://www.math.u-psud.fr/~vdhoeven/, no longer maintained (1999).
J. van der Hoeven, Mmxlib, a C++ core library for Mathemagix, http://www.math.u-psud.fr/~vdhoeven/ (2003–2004).
N. Müller, iRRAM, exact arithmetic in C++, http://www.informatik.uni-trier.de/iRRAM/ (2000–2004).
N. Revol, MPFI, a multiple precision interval arithmetic library, http://perso.ens-lyon.fr/nathalie.revol/software.html (2001–2004).
T. Granlund, Others, GMP, the GNU multiple precision arithmetic library, http://www.swox.com/gmp (1991–2004).
G. Hanrot, V. Lefèvre, K. Ryde, P. Zimmermann, MPFR, a C library for multiple-precision floating-point computations with exact rounding, http://www.mpfr.org (2000–2004).
J. van der Hoeven, Lazy multiplication of formal power series, in: W. W. Küchlin (Ed.), Proc. ISSAC '97, Maui, Hawaii, 1997, pp. 17–20.