[math-fun] One more reciprocal
This is mostly for the old-timers.

Once upon a time, roughly <1965, division was expensive. Some computers even came without a Divide instruction, and had to use a subroutine. With floating point numbers, the division X/Y was sometimes done as X times 1/Y. A programming trick from that era came to be known as "one more reciprocal".

The idea of 1MR is to replace several reciprocal operations with a single one, minimizing the total execution time. The magic formula: assume you need to compute 1/X and 1/Y. Compute R = 1/(XY), and then 1/X = RY and 1/Y = RX.

The idea extends to more reciprocals; potentially an arbitrary number of reciprocals can be replaced with a single one. Each avoided reciprocal costs an extra three multiplications. Whether this is worthwhile depends on the relative cost of Divide versus Multiplication, and some side costs: code complexity, storage, concern for over/underflow, and a slight loss of accuracy.

As division has cheapened, 1MR has mostly fallen out of use. 1MR is still used in modular arithmetic, where the relative costs favor it. Peter Montgomery introduced it for elliptic curve calculations c. 1985. It's also usable for matrices, if you are careful about non-commutativity.

There was a brief vogue for possible generalizations, perhaps when computing several square roots, or several multiplications -- a multiply can be replaced with two squarings. AFAIK, these didn't go anywhere.

I'm looking for some mention of 1MR, either in print or old memories. Google is unhelpful, and my Numerical Recipes is too recent.

Rich
These might be helpful:

http://ipa.ece.illinois.edu/mif/pubs/web-only/Frank-RawMemo12-1999.html
https://homepage.divms.uiowa.edu/~jones/bcd/divide.html

On Fri, Dec 22, 2017 at 11:15 AM, <rcs@xmission.com> wrote:
[...]
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
-- Mike Stay - metaweta@gmail.com http://www.cs.auckland.ac.nz/~mike http://reperiendi.wordpress.com
On Fri, Dec 22, 2017 at 1:15 PM, <rcs@xmission.com> wrote:
[...]
The idea extends to more reciprocals; potentially an arbitrary number of reciprocals can be replaced with a single one. Each avoided reciprocal costs an extra three multiplications.
It's not obvious to me that it's exactly 3 additional multiplications per reciprocal. Can you show me how to work this out?

To compute reciprocals of A, B, and C: R = 1/(ABC), then 1/A = RBC, 1/B = RAC, and 1/C = RAB. So this looks like 8 multiplications. But when we computed ABC we could store the AB partial result, and then computing RAB takes only one more. And if we save RC when calculating RBC, we can reuse it when calculating RAC, so down to 6.

So 6 multiplications total, and the second saved reciprocal cost us 3 additional multiplications. How about the next one?

R = 1/(ABCD), 1/A = RBCD, 1/B = RACD, 1/C = RABD, 1/D = RABC

Again, naively this is 15 multiplications, but by reusing partial results, we can compute

AB, ABC, ABCD, BC, BCD, RBCD, AC, ACD, RACD, ABD, RABD, RABC,

or 12 multiplications, so it took us an extra 6. Maybe I could be cleverer in the intermediate results I calculate:

AB, ABC, ABCD, CD, BCD, RBCD, ACD, RACD, ABD, RABD, RABC,

so down to 11 multiplications, so now it only took an extra 5. Let's try yet again:

AB, CD, ABCD, BCD, RBCD, ACD, RACD, ABD, RABD, ABC, RABC -- still 11 multiplications. One more try:

AB, CD, ABCD, RCD, RBCD, RACD, RAB, RABD, RABC. OK, I got down to 9 multiplications. Another possible order:

AB, ABC, ABCD, RABC, RAB, RABD, CD, RCD, RACD, RBCD takes 10.

Is there some easy way to see what the order of multiplications is that guarantees only 3 extra for each new reciprocal?

Andy
[...]
-- Andy.Latto@pobox.com
One way to get exactly 3M per replaced R:

Given A, B, C, D; required 1/A, 1/B, 1/C, 1/D.

    compute AB, (AB)C, ((AB)C)D.               cost 3M
    reciprocate: R = 1/(ABCD)                  cost R
    multiply: (ABC)R = 1/D; S = DR = 1/(ABC)   cost 2M
    multiply: (AB)S  = 1/C; T = CS = 1/(AB)    cost 2M
    multiply: AT = 1/B;     BT = 1/A           cost 2M
                                    total cost 9M + R

You can also organize it as a tree, to cut down on accumulated errors. The way to look at it is as a free-form binary tree, with the inputs at the leaves. Each node in the tree receives two inputs, either from leaves, deeper nodes, or one leaf and one deeper node. The node multiplies the two inputs and (except for the root) forwards the product up the tree, and waits for the reciprocal to come back down. It multiplies the reciprocal by the two (swapped) inputs and sends the results down to the child nodes/leaves. The root node does the single reciprocal, but otherwise acts like the other parent nodes. For a tree with N leaves, there are N-1 nodes; each node does 3 multiplies.

Rich

-------- Quoting Andy Latto <andy.latto@pobox.com>:
[...]
That's an interesting trick. It reminds me of something I once did to reduce divides in a loop on a machine with multiple pipelined multipliers and adders, but only one high-latency, unpipelined divider. Eliminating divides, even at the cost of performing more multiplications, was generally a win.

Suppose you have a loop that's accumulating some floating point value in a variable v. For example, each iteration might do:

    v = (v*r/s + t)/u

where r, s, t, and u change on each loop iteration. This is one multiply, one add, and two divides per iteration.

To transform this, I'd replace v with n and d, and initialize them outside the loop as:

    n = v
    d = 1.0

These are treated as the numerator and denominator of a ratio, where v = n/d throughout the loop. Inside the loop, we replace addition, multiplication, and division of v with:

    v = v + x   ->   n = n + x*d
    v = v * x   ->   n = n * x
    v = v / x   ->   d = d * x

So an addition becomes an addition and a multiplication, a multiplication remains a multiplication, and a division becomes a multiplication. (And subtraction is like addition.)

Upon loop exit, you do:

    v = n / d

The problem is that, after several iterations, n and d may become too large (or too small). This can be addressed by unrolling the loop, say 4 times, and renormalizing at the end of the unrolled loop body:

    n = n / d
    d = 1.0

Tom

rcs@xmission.com writes:
[...]
This generalises to multiple variables: if you have (v_1, ..., v_n) in K^n for some ground field K, you can projectivise by adding an extra variable to obtain (d, x_1, x_2, ..., x_n) where v_i := x_i / d. And if K is a finite field (rather than the reals, for instance), then you have no numerical issues whatsoever.

When K is, for instance, the finite field with 2^256 - 2^32 - 977 elements, then multiplication is much more expensive than addition, and division is costlier still. Hence, in elliptic curve cryptography, people often use projective coordinates:

https://en.wikibooks.org/wiki/Cryptography/Prime_Curve/Standard_Projective_C...

...or, even better, Jacobian coordinates (which have the same property of deferring division until the very end of the calculation).

Best wishes,

Adam P. Goucher
Sent: Saturday, December 23, 2017 at 3:09 AM From: "Tom Karzes" <karzes@sonic.net> To: math-fun <math-fun@mailman.xmission.com> Subject: Re: [math-fun] One more reciprocal
[...]
* rcs@xmission.com <rcs@xmission.com> [Dec 23. 2017 09:21]:
This is mostly for the old-timers.
[...]
As division has cheapened, 1MR has mostly fallen out of use.
Writing L for latency (in cycles) and T for throughput (in instructions/cycle), we roughly have

    add/sub   L=1    T=3 (!)
    mul       L=4    T=1/2
    div       L=60   T=1/60

for both floating point and integer operands. So saving divisions is very much worth it. The usual suspect is when vectors are normalized:

    double N = V.norm();
    for (unsigned j=0; j<len; ++j)  V[j] /= N;

should better be

    double N1 = 1.0 / V.norm();
    for (unsigned j=0; j<len; ++j)  V[j] *= N1;

The compiler is not allowed to do this (unless there is some --ariane5 optimization option).

Best regards, jj
[...]
participants (6):
- Adam P. Goucher
- Andy Latto
- Joerg Arndt
- Mike Stay
- rcs@xmission.com
- Tom Karzes