1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
/*
For arithmetic mod n, the product a*b is represented exactly as the usual
double-double format h + l where h and l are prec-bit floating point
numbers. a*b will be roughly the size of n^2, but since one or both of a or
b may not be reduced modulo n, it is common for the product a*b to be a
reasonable multiple of n^2. Since this is calculated as
h = mul(a, b)
l = fmsub(a, b, h),
if we have |a*b| < 2^e <= 2^(2*prec), then
|l| <= 2^(e - prec - 1).
First, ninv is an floating approximation of 1/n with some error epsilon
ninv := 1/n + epsilon.
Next, we would like to approximate the true quotient a*b/n as
q = round_int(h*ninv),
so that h + l can be reduced mod n. Since
h*ninv = (a*b - l)*(1/n + epsilon)
= a*b/n + a*b*epsilon - l*ninv,
we have (with |h*ninv - q| <= q_error)
(*) |a*b/n - q| <= |a*b*epsilon| + 2^(e - prec - 1)*ninv + q_error.
When the quantity on the rhs is <= k, we can guarantee that
|h + l - q*n| <= k*n,
while h + l - q*n can be calculated exactly as add(l, fnmadd(q, n, h)).
*** q_error ***
when mul(h, ninv) does rounding B bits after the units place, the additional
rounding done by round(mul(h, ninv)) can introduce an error as large as
q_error = 1/2 + 1/2^(B+1)
For example, suppose
h*ninv = XXXXXX1.0100001....
mul(h, ninv) = XXXXXX1.1 rounded to 1 bit to the right of the dot
Now mul(h,ninv) rounds up although h*ninv rounds down. The difference between
the final rounded integer and the original h*ninv is at most 1/2 + 1/4 = 3/4
We can get the optimal q_error = 1/2 by calculating round_int(h*ninv) as
sub(fmadd(h, ninv, r), r) with r = 3*2^(prec - 2)
However, this requires |h*innv| < 2^(prec - 2), which is not always true.
Furthermore, round(mul(.,.)) is fast on AMD :)
Illustration of the truth of 1/2 + 1/2^(B+1): suppose mul(h, ninv) rounds
to B=2 bits after the dot (XXXX means some non-zero bits):
h*ninv mul(h*ninv) round(mul(h*ninv)) bound on |h*ninv - round(mul(h*ninv))|
0.000 0.00 0. 0
0.000XXXX 0.00 0. 1/8
0.001 0.00 0. 1/8
0.001XXXX 0.01 0. 2/8
0.010 0.01 0. 2/8
0.010XXXX 0.01 0. 3/8
0.011 0.10 0. or 1. 3/8 or 5/8
0.011XXXX 0.10 0. or 1. 4/8 or 5/8
0.100 0.10 0. or 1. 4/8
0.100XXXX 0.10 0. or 1. 5/8 or 4/8
0.101 0.10 0. or 1. 5/8 or 3/8
0.101XXXX 0.11 1. 3/8
0.110 0.11 1. 2/8
0.110XXXX 0.11 1. 2/8
0.111 1.00 1. 1/8
0.111XXXX 1.00 1. 1/8
We see q_error = 5/8 = 1/2 + 1/2^(B+1)
*/
/*
Analysis for 2^(prec - 4) < n < 2^(prec - 3) (aka 50 bit primes)
----------------------------------------------------------------
In this case we have |epsilon| < 2^(3 - 2*prec) and ninv < 2^(4 - prec).
For |a*b| < 2*n^2, we can take e = 2*prec - 5.
Then, |h*ninv| < 2^(prec-2) and mul(h, ninv) does rounding at least
two bits to the right of the dot. Then, we make take q_error = 1/2 + 2^-3:
|a*b*epsilon| + 2^(e - prec - 1)*ninv + 1/2+2^-3 < 2^-2 + 2^-2 + 1/2 + 2^-3 = 1+1/8
Similarly for |a*b| < 4*n^2, we can take e = 2*prec - 4 and q_error = 1/2 + 2^-2:
|a*b*epsilon| + 2^(e - prec - 1)*ninv + 1/2+2^-2 < 2^-1 + 2^-1 + 1/2 + 2^-2 = 3/2+1/4
Therefore,
Products in the range (-2*n^2, 2*n^2) are reduced to the range (-9/8*n, 9/8*n).
Products in the range (-4*n^2, 4*n^2) are reduced to the range (-7/4*n, 7/4*n).
These pessimistic bounds are not good enough. For specific values of n, epsilon can be
much smaller, and it not very restrictive to assume what mulmod needs:
(1) Products in the range (-2*n^2, 2*n^2) are reduced to the range (-1*n, 1*n).
(2) Products in the range (-4*n^2, 4*n^2) are reduced to the range (-3/2*n, 3/2*n).
*/
/*
define e as |a*b| < 2^e
bound = |a*b*(ninv-1/n)| + 2^(e - prec - 1)*ninv + q_error
want bound < 1 for |a*b| < 2*n^2 where q_error = 1/2 + 2^-3
want bound < 1.5 for |a*b| < 4*n^2 where q_error = 1/2 + 2^-2
*/
int
/*
assuming satisfies_bounds(n):
(1) Products in the range (-2*n^2, 2*n^2) are reduced to the range (-1*n, 1*n).
(2) Products in the range (-4*n^2, 4*n^2) are reduced to the range (-3/2*n, 3/2*n).
Analysis of the forward butterfiles for 50 bit primes
-----------------------------------------------------
b0 = 1*a0 + w^2*a2 + w*(a1 + w^2*a3)
b1 = 1*a0 + w^2*a2 - w*(a1 + w^2*a3)
b2 = 1*a0 - w^2*a2 + i*w*(a1 - w^2*a3)
b3 = 1*a0 - w^2*a2 - i*w*(a1 - w^2*a3)
Each of w^2, w, and i*w is in (-n/2, n/2). As part of this operation, the
multiplication 1*a0 reduces a0 to the range (-n, n) and the other
multiplications use the above. In practice, a0 does not get large even
without this reduction, but this is needed to prove things stay bounded.
The following claim is trivial from (1):
(*) If the ai in (-3*n, 3*n), then the bi are also in (-3*n, 3*n).
Analysis of the truncated forward butterfiles for 50 bit primes
---------------------------------------------------------------
Trivial because some of the ai are just 0.
Analysis of the pointwise muls
------------------------------
The output of one fft in the range (-3*n, 3*n) is multiplied with a
multiplier (2^-depth) in the range (-n/2, n/2). This produces a product x
in the range (-n, n). Then, x is multiplied by the output of another fft
in the range (-3*n, 3*n). Thus,
(*) The outputs of the pointwise muls are in the range (-3/2*n, 3/2*n).
Analysis of the reverse butterflies for 50 bit primes
-----------------------------------------------------
4a0 = 1*( (b0 + b1) + (b2 + b3))
4a2 = w^-2*( (b0 + b1) - (b2 + b3))
4a1 = w^-1*(b0 - b1) - i*w^-1*(b2 - b3)
4a3 = w^-2*(w^-1*(b0 - b1) + i*w^-1*(b2 - b3))
As before, each of w^-2, w^-1, and i*w^-1 is in (-n/2, n/2). As part of
this operation, 1*(b0 + b1 + b2 + b3) reduces 4a0 to the range (-n, n),
and the other multiplications are as before.
The following claim is trivial from (1) and (2):
(*) If the bi are in (-2*n, 2*n), then the 4ai are also in (-2*n, 2*n).
The bound n < 2^50 guarantees that |(b0 + b1) +- (b2 + b3)| < 2^53 so that
no bits are lost.
Analysis of the truncated reverse butterflies for 50 bit primes
---------------------------------------------------------------
*** TODO ***
A bit tedious because there are so many formulas, but we just need make
sure that output is in the range (-2*n, 2*n) if input is.
*/