A Tour of NTL: NTL Implementation and Portability

NTL is designed to be portable, fast, and relatively easy to use and extend.

To make NTL portable, no assembly code is used (well, almost none, see below). This is highly desirable, as architectures are constantly changing and evolving, and maintaining assembly code is quite costly. By avoiding assembly code, NTL should remain usable, with virtually no maintenance, for many years.

Minimal platform requirements

When the configuration flag NTL_CLEAN_INT is on (this is not the default, see below), NTL makes two requirements of its platform, neither of which are guaranteed by the C++ language definition, but are essentially universal:

int and long quantities, respectively, are represented using a 2's complement representation whose width is equal to the width of unsigned int and unsigned long, respectively.
Double precision floating point conforms to the IEEE floating point standard.

NTL makes very conservative requirements of the C++ compiler:

it is assumed that the C++ compiler conforms to the 1998 standard, including the basic of templates;
it does not assume any features not in the 1998 standard, unless compiled with a flag that requires features from the 2011 standard (NTL_THREADS, NTL_EXCEPTIONS, NTL_SAFE_VECTORS) or a flag that explicitly requests a standard (NTL_STD_CXX11, NTL_STD_CXX14).

At some point in the future, it is expected that NTL will move to requiring the 2011 standard or later.

The `NTL_CLEAN_INT` flag

The configuration flag NTL_CLEAN_INT is currently off by default.

When this flag is off, NTL makes a couple of other requirements of its platform.

First, that conversions from unsigned long to long convert the bit pattern without change to the corresponding 2's complement signed integer. Note that the C++ standard defines the behavior of converting unsigned to signed values as implementation defined when the value cannot be represented in the range of nonnegative signed values. Nevertheless, this behavior is essentially universal, and more importantly, is is not undefined behavior: implementation-defined behavior must be documented and respected by the compiler, while undefined behavior can be exploited by the compiler in some surprising ways.

Second, right shifts of signed integers are consistent, in the sense that if it is sometimes an arithmetic shift, then it is always an arithmetic shift (the installation scripts check if right shift appears to be arithmetic, and if so, this assumption is made elsewhere). Arithmetic right shift is also implementation defined behavior that is essentially universal.

It seems fairly unlikely that one would ever have to turn the NTL_CLEAN_INT flag on, but it seems a good idea to make this possible, and at the very least to identify and isolate the code that relies on these asumptions. Actually, the most recent versions of NTL (especially since v10.0), there is very little such code remaining, and it is not really all that critical to performance any more. Eventually, all such code may disappear completely.

Some floating point issues

NTL uses double precision floating point arithmetic in a number of places, including a number of exact computations, where one might not expect to see floating point. Relying on floating point may seem prone to errors, but with the guarantees provided by the IEEE 754 standard, one can prove the correctness of the NTL code that uses floating point.

Generally, NTL assumes that the IEEE standard is correctly implemented. In particular, it assumes that the compiler issues instructions that respect the grouping of floating point operations. For example, the compiler is not allowed to compile (a+b)+c as a+(b+c), or a*b + a*c as a*(b+c).

By default, most compilers will satisfy this assumption at normal optimization levels (for example, at the -O2 optimization level of most compilers). One should avoid higher optimization levels like -O3 or special flags like -Ofast. In practice, these optimization flags do not help much, and may result in incorrect code. One important compiler which does not satisfy this assumption by default is Intel's icc compiler: one has to pass the -fp-model precise flag to icc to correct this problem. NTL's configuration script will take care of this automatically, by including -fp-model precise in the CXXAUTOFLAGS make variable, so you shouldn't have to worry about it.

It should be said that most floating point code in NTL is quite robust, and will still work correctly under somewhat weaker assumptions. Moreover, NTL's header files are structured so that programs that use NTL do not need to be compiled in a way that satisfies these assumptions. For example, NTL client code could be compiled using icc's defaults or using gcc with the -Ofast flag.

There are a few other issues to address.

Extended precision. Standard compliant compilers may compute intermediate results in extended precision. The only machine for which the extended precision is an issue are old (pre-SSE) x86 machines that rely exclusively on the x87 coprocessor instruction set.
Contractions and FMA. Standard compliant compilers may "contract" certain sequences of operations into a single operation with just one rounding. The main issue here is that many newer machines have a "fused multiply-add (FMA)" instruction that computes a*b+c with just a single rounding, and compilers may choose generate this instruction.
Except for one module, NTL works perfectly well even with extended precision and/or contractions. The one exception is the quad_float, which requires strict adherence to the above floating point assumptions, and in addition, requires that all computations are carried out without extended precision or contractions. NTL goes to great lengths to detect and work around these issues, so you should not have to worry about it. More details can be found in quad_float.txt.
Special values. The compiler should handle special values (infinities, NaN's, signed zeroes, denormalized numbers) correctly.
NTL does not rely very much on these for its own computations. For certain operations (like conversions) it assumes that and infinities and NaN's behave as expected so they can be detected. It in some bounds checking code, it assumes that infinities basically work the right way. NTL does not make any assumptions about signed zeros or denormalized numbers.
Conversions. The compiler should convert double's to long's by truncation. It should should convert long's to double's exactly if possible, and otherwise, to either the next lower or higher representable value. This behavior is required by the C++ standard.
Rounding modes. On some platforms, it is possible to dynamically change the rounding mode from the default round-to-nearest mode to other modes (e.g., round to zero).
The core algorithms in NTL that use floating point to assist in computing exact integer results are conservatively designed so that they will work in any rounding mode. However, it is not recommended to run NTL code in non-default rounding modes. First, NTL has not been thoroughly tested in these other modes. Second, some higher-level floating-point code (such as quad_float) may not give expected results in other modes.

Algorithms

NTL makes fairly consistent use of asymptotically fast algorithms.

Long integer multiplication is implemented using the classical algorithm, crossing over to Karatsuba for very big numbers. Long integer division is currently only implemented using the classical algorithm -- unless you use NTL with GMP (version 3 or later), which employs an algorithm that is about twice as slow as multiplication for very large numbers.

Polynomial multiplication and division is carried out using a combination of the classical algorithm, Karatsuba, the FFT using small primes, and the FFT using the Schoenhagge-Strassen approach. The choice of algorithm depends on the coefficient domain.

Many algorithms employed throughout NTL are inventions of the author (Victor Shoup) and his colleagues Joachim von zur Gathen and Erich Kaltofen, as well as John Abbott and Paul Zimmermann.

Thread safety

As of v7.0, NTL is thread safe. That said, there are several things to be aware of:

To use this feature, you have to enable NTL_THREADS in the configuration script. Also, you will need a compiler and runtime library that implements several key C++11 features, including thread_local storage.
- NOTE: as of v9.8, the requirements have been relaxed, so that for gcc and gcc-compatible compilers (such as clang and icc) only support of the gcc __thread storage specifier is required.
- With these relaxed requirements, it is possible to build a thread safe version of NTL on Linux using gcc 4.8 and above, or on Mac OSX 10.10 and above.
Prior to v10.0 of NTL, you had to use NTL with GMP to get thread safety. Since v10.0, this requirement has been dropped.
As of v11.0, NTL_THREADS is on by default.
If you use the external gf2x library, you must use version 1.2 of that library or later. Earlier versions are not thread safe.

To obtain thread safety, I used the following strategies:

In places where NTL's interface demands global variables, such as the "current modulus" for the ZZ_p class, these global variables have been made thread local. So, you can pass around various ZZ_pContext objects among threads, and individual threads can install these locally. Thus, different threads can concurrently use the same or different moduli, and it all just works, with no changes to NTL's interface.
In places where NTL used static variables to hold on to space for scratch variables, I make these variables thread local, and I also make sure the storage used by these variables get released when the thread terminates. In all NTL builds (thread safe or not), I try to make sure that fairly large chunks of memory get released immediately.
In places where NTL uses a lazy strategy to build various tables (such as FFT primes), I uses a "double checked locking" strategy to grow these tables in a way that (a) the tables can be shared among different threads, and (b) taking a lock on a mutex is very rare. The new C++11 concurrent memory model is essential here.
Smart pointers (for things like ZZ_pContext's) are designed to do the necesary reference counting in a thread-safe manner.
For psuedo-random number generation, the internal state of the PRG is thread local, and the default initial seed is guaranteed to be unique among all threads in a given process (and an attempt is made to make the seed globally unique among all processes and threads, but this is hard to do in a completely portable way).

The overall structure of the code has been modified so that the code base is nearly identical for regular and thread-safe builds: there are just a few ifdef's on the NTL_THREADS flag.

Thread Boosting

As of v9.5.0, NTL provides a thread boosting feature. With this feature, certain code within NTL will use available threads to speed up computations on a multicore machine. This feature is enabled by setting NTL_THREAD_BOOST=on during configuration. See BasicThreadPool.txt for more information.

As of v11.0, NTL_THREAD_BOOST is on by default.

This feature is a work in progress. As time goes on, more and more code within NTL is being thread boosted.

Error Handling and Exceptions

As of v8.0, NTL provides error handling through exceptions. To enable exptions, you have to configure NTL with NTL_EXCEPTIONS flag turned on. By default, exceptions are not enabled, and NTL reverts to its old error handling method: abort with an error message.

If exceptions are enabled, then instead of aborting your program, and appropriate exception is thrown. More details ion the programming interface of this feature are available here.

If you enable exceptions, you must use a C++11 compiler. Specifically, your compiler will need support for lambdas (which are used to conveniently implement the "scope guard" idiom), and your compiler should implement the new default exception specification semantics (namely, that destructors are "noexcept" by default).

Implementation of this required a top-to-bottom scrub of NTL's code, replacing a lot of old-fashioned code with more modern, RAII-oriented code (RAII = "resource acquisition is initialization").