That's an interesting trick. It reminds me of something I once did to reduce divides in a loop on a machine with multiple pipelined multipliers and adders, but only one high-latency, unpipelined divider. Eliminating divides, even at the cost of performing more multiplications, was generally a win.

Suppose you have a loop that's accumulating some floating-point value in a variable v. For example, each iteration might do:

    v = (v*r/s + t)/u

where r, s, t, and u change on each loop iteration. This is one multiply, one add, and two divides per iteration.

To transform this, I'd replace v with n and d, and initialize them outside the loop as:

    n = v
    d = 1.0

These are treated as the numerator and denominator of a ratio, with v = n/d throughout the loop. Inside the loop, we replace addition, multiplication, and division of v with:

    v = v + x   ->   n = n + x*d
    v = v * x   ->   n = n * x
    v = v / x   ->   d = d * x

So an addition becomes an addition and a multiplication, a multiplication remains a multiplication, and a division becomes a multiplication. (And subtraction is like addition.) Upon loop exit, you do:

    v = n / d

The problem is that, after several iterations, n and d may become too large (or too small). This can be addressed by unrolling the loop, say 4 times, and renormalizing at the end of the unrolled loop body:

    n = n / d
    d = 1.0

Tom
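A minimal C sketch of that rewrite, with illustrative names (the 4x unroll is replaced here by a counter test, which has the same renormalizing effect):

    /* Sketch of the n/d rewrite above.  The function name, the arrays
       r,s,t,u, and the counter test are illustrative; the point is the
       bookkeeping.  Original per-iteration update:
           v = (v*r[i]/s[i] + t[i]) / u[i];     1 mul, 1 add, 2 divides
       Transformed: 4 muls, 1 add, one divide every 4th iteration. */
    double accumulate(const double *r, const double *s,
                      const double *t, const double *u,
                      double v, int iters)
    {
        double n = v;        /* numerator:  v == n/d throughout */
        double d = 1.0;      /* denominator                     */

        for (int i = 0; i < iters; i++) {
            n = n * r[i];        /* v *= r[i]  ->  n *= r[i]    */
            d = d * s[i];        /* v /= s[i]  ->  d *= s[i]    */
            n = n + t[i] * d;    /* v += t[i]  ->  n += t[i]*d  */
            d = d * u[i];        /* v /= u[i]  ->  d *= u[i]    */

            if ((i & 3) == 3) {  /* stand-in for the 4x unroll:  */
                n = n / d;       /* renormalize so n and d don't */
                d = 1.0;         /* over/underflow               */
            }
        }
        return n / d;            /* the one divide on exit */
    }

rcs@xmission.com writes: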
One way to get exactly 3M per replaced R. Given A, B, C, D; required: 1/A, 1/B, 1/C, 1/D.

    compute     AB, (AB)C, ((AB)C)D                  cost 3M
    reciprocate:  R = 1/(ABCD)                       cost R
    multiply:   (ABC)R = 1/D;   S = DR = 1/(ABC)     cost 2M
    multiply:   (AB)S  = 1/C;   T = CS = 1/(AB)      cost 2M
    multiply:   AT = 1/B;       BT = 1/A             cost 2M

    total cost 9M + R
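In C, one way to code that ladder for general n (the function and variable names are just illustrative): 3(n-1) multiplies and one reciprocal, so n = 4 reproduces the 9M + R count above, with the successive values of r playing the roles of R, S, and T.

    /* Batch reciprocal via a running product: 3*(n-1) multiplies plus
       one reciprocal in place of n reciprocals. */
    void batch_recip(const double *a, double *inv, int n)
    {
        double prod[n];                      /* prod[i] = a[0]*...*a[i]  (C99 VLA) */
        prod[0] = a[0];
        for (int i = 1; i < n; i++)          /* n-1 multiplies */
            prod[i] = prod[i-1] * a[i];

        double r = 1.0 / prod[n-1];          /* the single reciprocal */

        for (int i = n - 1; i > 0; i--) {    /* 2 multiplies per pass */
            inv[i] = r * prod[i-1];          /* = 1/a[i]              */
            r = r * a[i];                    /* = 1/(a[0]*...*a[i-1]) */
        }
        inv[0] = r;
    }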
You can also organize it as a tree, to cut down on accumulated errors. The way to look at it is as a free-form binary tree, with the inputs at the leaves. Each node in the tree receives two inputs, from two leaves, two deeper nodes, or one leaf and one deeper node. The node multiplies the two inputs, and (except for the root) forwards the product up the tree, then waits for the reciprocal to come back down. It multiplies the reciprocal by the two (swapped) inputs and sends the results down to the child nodes/leaves. The root node does the single reciprocal, but otherwise acts like the other parent nodes. For a tree with N leaves, there are N-1 nodes; each node does 3 multiplies.
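A recursive C sketch of that tree arrangement (names and the scratch-array indexing are mine): the upward pass stores each node's product; the downward pass hands each child the incoming reciprocal times its sibling's product.

    #include <stdlib.h>

    /* Up pass: each internal node stores the product of its leaves
       (1 multiply).  Down pass: each internal node receives the
       reciprocal of its own product and passes each child
       (reciprocal * other child's product) (2 multiplies).
       N-1 internal nodes, 3 multiplies each, one reciprocal at the root. */
    static void up(const double *a, double *p, int node, int lo, int hi)
    {
        if (lo == hi) { p[node] = a[lo]; return; }       /* leaf */
        int mid = (lo + hi) / 2;
        up(a, p, 2*node,   lo,    mid);
        up(a, p, 2*node+1, mid+1, hi);
        p[node] = p[2*node] * p[2*node+1];               /* 1 multiply */
    }

    static void down(double *inv, const double *p, int node, int lo, int hi, double r)
    {
        if (lo == hi) { inv[lo] = r; return; }           /* r == 1/a[lo] here */
        int mid = (lo + hi) / 2;
        down(inv, p, 2*node,   lo,    mid, r * p[2*node+1]);   /* 1 multiply */
        down(inv, p, 2*node+1, mid+1, hi, r * p[2*node]);      /* 1 multiply */
    }

    void tree_recip(const double *a, double *inv, int n)
    {
        double *p = malloc(4 * n * sizeof *p);           /* per-node products */
        up(a, p, 1, 0, n - 1);
        down(inv, p, 1, 0, n - 1, 1.0 / p[1]);           /* the one reciprocal */
        free(p);
    }

With a balanced split, each output passes through roughly log N rounded multiplies on the way up and again on the way down, instead of roughly N in the purely sequential layout.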
Rich
-------- Quoting Andy Latto <andy.latto@pobox.com>:
On Fri, Dec 22, 2017 at 1:15 PM, <rcs@xmission.com> wrote:
This is mostly for the old-timers.
Once upon a time, roughly <1965, division was expensive. Some computers even came without a Divide instruction, and had to use a subroutine. With floating point numbers, the division X/Y was sometimes done as X times 1/Y. A programming trick from that era came to be known as "one more reciprocal".
The idea of 1MR is to replace several reciprocal operations with a single one, minimizing the total execution time. The magic formula: Assume you need to compute 1/X and 1/Y. Compute R = 1/(XY), and then 1/X = RY and 1/Y = RX.
The idea extends to more reciprocals; potentially an arbitrary number of reciprocals can be replaced with a single one. Each avoided reciprocal costs an extra three multiplications.
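A minimal C sketch of the two-reciprocal formula (the function name is just for illustration):

    /* One reciprocal and three multiplies in place of two reciprocals. */
    void recip_pair(double x, double y, double *inv_x, double *inv_y)
    {
        double r = 1.0 / (x * y);   /* R = 1/(XY) */
        *inv_x = r * y;             /* 1/X = R*Y  */
        *inv_y = r * x;             /* 1/Y = R*X  */
    }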
It's not obvious to me that it's exactly 3 additional multiplications per reciprocal. Can you show me how to work this out?
To compute reciprocals of A, B, and C: R = 1/ABC; then 1/A = RBC, 1/B = RAC, and 1/C = RAB. So this looks like 8 multiplications. But when we computed ABC, we could store the AB partial result, and then computing RAB takes only one more. And if we save RC when calculating RBC, we can reuse it when calculating RAC, so down to 6.
So that's 6 multiplications total (AB, ABC, R*C, (R*C)*B, (R*C)*A, R*AB), and the second saved reciprocal cost us 3 additional multiplications. How about the next one?
R = 1/ABCD, 1/A = RBCD, 1/B = RACD, 1/C = RABD, 1/D = RABC
Again, naively this is 15 multiplications, but by reusing partial results, we can compute
AB, ABC, ABCD, BC, BCD, RBCD, AC, ACD, RACD, ABD, RABD, RABC, or 12 multiplications, so it took us an extra 6. Maybe I could be cleverer in the intermediate results I calculate:
AB, ABC, ABCD, CD, BCD, RBCD, ACD, RACD, ABD, RABD, RABC, so down to 11 multiplications, so now it only took an extra 5. Let's try yet again:
AB, CD, ABCD, BCD, RBCD, ACD, RACD, ABD, RABD, ABC, RABC. still 11 multiplications. One more try:
AB, CD, ABCD, RCD, RBCD, RACD, RAB, RABD, RABC. OK, I got down to 9 multiplications.
Another possible order:
AB, ABC, ABCD, RABC, RAB, RABD, CD, RCD, RACD, RBCD takes 10.
Is there some easy way to see what the order of multiplications is that guarantees only 3 extra for each new reciprocal?
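One inductive way to count it: suppose the reciprocals of A, B, C, ... are already obtained from their full product P and its reciprocal 1/P, and a new value E is added. Then

    compute PE          1 multiply
    R' = 1/(PE)         the single reciprocal, now taken here instead of 1/P
    1/E = R'*P          1 multiply
    1/P = R'*E          1 multiply

and the old scheme proceeds unchanged from 1/P. That is the two-number formula applied to P and E, and it costs exactly three extra multiplies per added value.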
Andy
Whether this is worthwhile depends on the relative cost of Divide versus Multiplication, and some side costs: code complexity, storage, concern for over/underflow, and a slight loss of accuracy.
As division has cheapened, 1MR has mostly fallen out of use. 1MR is still used in modular arithmetic, where the relative costs favor it. Peter Montgomery introduced it for elliptic curve calculations c. 1985. It's also usable for matrices, if you are careful about non-commutativity.
There was a brief vogue for possible generalizations, perhaps when computing several square roots, or several multiplications -- a multiply can be replaced with two squarings. AFAIK, these didn't go anywhere.
I'm looking for some mention of 1MR, either in print or old memories. Google is unhelpful, and my Numerical Recipes is too recent.
Rich