### NTL 10.3.0

Posted:

**Mon Nov 21, 2016 4:04 pm**Go to http://shoup.net/ntl for full details.

Below is the latest entry in the change log.

I've also updated the NTL/FLINT comparison document:

http://shoup.net/ntl/benchmarks.pdf

The only significant change is to the ZZ_pX factoring benchmarks

(the new code in NTL speeds up mat_ZZ_p multiplication and

indirectly CompMod and CanZass).

======= Changes in 10.3.0 =======

* Implementation of a multi-modular strategy for matrix multiplication

over ZZ_p. Here are some benchmarks that compare the new strategy to

the old (naive) one. Here, we consider n-by-n matrices modulo an

n-bit prime.

o n=128, speedup=6.8x

o n=256, speedup=8.6x

o n=512, speedup=9.3x

o n=1024, speedup=18x

o n=2048, speedup=37x

I also compared NTL's new mat_ZZ_p multiplication to FLINT's

fmpz_mat multiplication. The latter also uses a multi-modular

strategy. For n=128,256,512,1024 as above, NTL's code is between

2.7 and 3.1 times faster than FLINT's (and we did not count the time

it would take to reduce the entries mod p for the FLINT code).

Part of this may be due to the AVX-enhanced small-prime matrix

multiplication code used by NTL, and part may be due to better CRT

code.

* I also instrumented both the plain and multi-modular matrix

multiplication for ZZ_p so that they are both "thread boosted"

(i.e., will automatcally exploit multiple cores when possible).

* As an initial application of this faster matrix multiplication, I

implemented a new version of Brent/Kung modular composition for

ZZ_pX, which is now between 2 and 5 times faster than the old one

(depending on parameters). This is done with a new class called

ZZ_pXNewArgument, which supersedes ZZ_pXArgument (retained for

compatibility). This also makes the CanZass factoring algorithm for

ZZ_pX faster (sometimes by a factor of almost 2, depending on

parameters). It also makes CanZass more memory intensive (but while

the overall memory usage increases, the memory access pattern is

fairly cache friendly).

* I would like to see if this faster matrix multiplication can be used

to get faster linear algebra (determinants, inverses, more general

Gaussian elimination) via reduction to matrix multiplication. If

anyone wants to volunteer to work on this, please let me know.

Presumably, the FLINT nmod_mat code could be repurposed for this. I

won't have time to work on this for a few months, but would be glad

to discuss ideas with anyone who wanted to do the work. Note that

the "plain" versions of these routines also need some work.

* I also added move methods to the Vec and Mat classes, and made a

slight tweak to the Lazy class.

Below is the latest entry in the change log.

I've also updated the NTL/FLINT comparison document:

http://shoup.net/ntl/benchmarks.pdf

The only significant change is to the ZZ_pX factoring benchmarks

(the new code in NTL speeds up mat_ZZ_p multiplication and

indirectly CompMod and CanZass).

======= Changes in 10.3.0 =======

* Implementation of a multi-modular strategy for matrix multiplication

over ZZ_p. Here are some benchmarks that compare the new strategy to

the old (naive) one. Here, we consider n-by-n matrices modulo an

n-bit prime.

o n=128, speedup=6.8x

o n=256, speedup=8.6x

o n=512, speedup=9.3x

o n=1024, speedup=18x

o n=2048, speedup=37x

I also compared NTL's new mat_ZZ_p multiplication to FLINT's

fmpz_mat multiplication. The latter also uses a multi-modular

strategy. For n=128,256,512,1024 as above, NTL's code is between

2.7 and 3.1 times faster than FLINT's (and we did not count the time

it would take to reduce the entries mod p for the FLINT code).

Part of this may be due to the AVX-enhanced small-prime matrix

multiplication code used by NTL, and part may be due to better CRT

code.

* I also instrumented both the plain and multi-modular matrix

multiplication for ZZ_p so that they are both "thread boosted"

(i.e., will automatcally exploit multiple cores when possible).

* As an initial application of this faster matrix multiplication, I

implemented a new version of Brent/Kung modular composition for

ZZ_pX, which is now between 2 and 5 times faster than the old one

(depending on parameters). This is done with a new class called

ZZ_pXNewArgument, which supersedes ZZ_pXArgument (retained for

compatibility). This also makes the CanZass factoring algorithm for

ZZ_pX faster (sometimes by a factor of almost 2, depending on

parameters). It also makes CanZass more memory intensive (but while

the overall memory usage increases, the memory access pattern is

fairly cache friendly).

* I would like to see if this faster matrix multiplication can be used

to get faster linear algebra (determinants, inverses, more general

Gaussian elimination) via reduction to matrix multiplication. If

anyone wants to volunteer to work on this, please let me know.

Presumably, the FLINT nmod_mat code could be repurposed for this. I

won't have time to work on this for a few months, but would be glad

to discuss ideas with anyone who wanted to do the work. Note that

the "plain" versions of these routines also need some work.

* I also added move methods to the Vec and Mat classes, and made a

slight tweak to the Lazy class.