That's an interesting trick. It reminds me of something I once did to reduce divides in a loop on a machine with multiple pipelined multipliers and adders, but only one high-latency, unpipelined divider. Eliminating divides, even at the cost of performing more multiplications, was generally a win.

Suppose you have a loop that's accumulating some floating-point value in a variable v. For example, each iteration might do:

    v = (v*r/s + t)/u

where r, s, t, and u change on each loop iteration. This is one multiply, one add, and two divides per iteration.

To transform this, I'd replace v with n and d, and initialize them outside the loop as:

    n = v
    d = 1.0

These are treated as the numerator and denominator of a ratio, with v = n/d throughout the loop. Inside the loop, we replace addition, multiplication, and division of v with:

    v = v + x   ->   n = n + x*d
    v = v * x   ->   n = n * x
    v = v / x   ->   d = d * x

So an addition becomes an addition and a multiplication, a multiplication remains a multiplication, and a division becomes a multiplication. (And subtraction is like addition.) Upon loop exit, you do:

    v = n / d

The problem is that, after several iterations, n and d may become too large (or too small). This can be addressed by unrolling the loop, say 4 times, and renormalizing at the end of the unrolled loop body:

    n = n / d
    d = 1.0

Tom
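A minimal C sketch of that rewrite, with illustrative names (the 4x unroll is replaced here by a counter test, which has the same renormalizing effect):

    /* Sketch of the n/d rewrite above.  The function name, the arrays
       r,s,t,u, and the counter test are illustrative; the point is the
       bookkeeping.  Original per-iteration update:
           v = (v*r[i]/s[i] + t[i]) / u[i];     1 mul, 1 add, 2 divides
       Transformed: 4 muls, 1 add, one divide every 4th iteration. */
    double accumulate(const double *r, const double *s,
                      const double *t, const double *u,
                      double v, int iters)
    {
        double n = v;        /* numerator:  v == n/d throughout */
        double d = 1.0;      /* denominator                     */

        for (int i = 0; i < iters; i++) {
            n = n * r[i];        /* v *= r[i]  ->  n *= r[i]    */
            d = d * s[i];        /* v /= s[i]  ->  d *= s[i]    */
            n = n + t[i] * d;    /* v += t[i]  ->  n += t[i]*d  */
            d = d * u[i];        /* v /= u[i]  ->  d *= u[i]    */

            if ((i & 3) == 3) {  /* stand-in for the 4x unroll:  */
                n = n / d;       /* renormalize so n and d don't */
                d = 1.0;         /* over/underflow               */
            }
        }
        return n / d;            /* the one divide on exit */
    }

rcs@xmission.com writes: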
One way to get exactly 3M per replaced R. Given A, B, C, D; required: 1/A, 1/B, 1/C, 1/D.

    compute     AB, (AB)C, ((AB)C)D                  cost 3M
    reciprocate:  R = 1/(ABCD)                       cost R
    multiply:   (ABC)R = 1/D;   S = DR = 1/(ABC)     cost 2M
    multiply:   (AB)S  = 1/C;   T = CS = 1/(AB)      cost 2M
    multiply:   AT = 1/B;       BT = 1/A             cost 2M

    total cost 9M + R
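In C, one way to code that ladder for general n (the function and variable names are just illustrative): 3(n-1) multiplies and one reciprocal, so n = 4 reproduces the 9M + R count above, with the successive values of r playing the roles of R, S, and T.

    /* Batch reciprocal via a running product: 3*(n-1) multiplies plus
       one reciprocal in place of n reciprocals. */
    void batch_recip(const double *a, double *inv, int n)
    {
        double prod[n];                      /* prod[i] = a[0]*...*a[i]  (C99 VLA) */
        prod[0] = a[0];
        for (int i = 1; i < n; i++)          /* n-1 multiplies */
            prod[i] = prod[i-1] * a[i];

        double r = 1.0 / prod[n-1];          /* the single reciprocal */

        for (int i = n - 1; i > 0; i--) {    /* 2 multiplies per pass */
            inv[i] = r * prod[i-1];          /* = 1/a[i]              */
            r = r * a[i];                    /* = 1/(a[0]*...*a[i-1]) */
        }
        inv[0] = r;
    }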
You can also organize it as a tree, to cut down on accumulated errors. The way to look at it is as a free-form binary tree, with the inputs at the leaves. Each node in the tree receives two inputs, from two leaves, two deeper nodes, or one leaf and one deeper node. The node multiplies the two inputs, and (except for the root) forwards the product up the tree, then waits for the reciprocal to come back down. It multiplies the reciprocal by the two (swapped) inputs and sends the results down to the child nodes/leaves. The root node does the single reciprocal, but otherwise acts like the other parent nodes. For a tree with N leaves, there are N-1 nodes; each node does 3 multiplies.
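A recursive C sketch of that tree arrangement (names and the scratch-array indexing are mine): the upward pass stores each node's product; the downward pass hands each child the incoming reciprocal times its sibling's product.

    #include <stdlib.h>

    /* Up pass: each internal node stores the product of its leaves
       (1 multiply).  Down pass: each internal node receives the
       reciprocal of its own product and passes each child
       (reciprocal * other child's product) (2 multiplies).
       N-1 internal nodes, 3 multiplies each, one reciprocal at the root. */
    static void up(const double *a, double *p, int node, int lo, int hi)
    {
        if (lo == hi) { p[node] = a[lo]; return; }       /* leaf */
        int mid = (lo + hi) / 2;
        up(a, p, 2*node,   lo,    mid);
        up(a, p, 2*node+1, mid+1, hi);
        p[node] = p[2*node] * p[2*node+1];               /* 1 multiply */
    }

    static void down(double *inv, const double *p, int node, int lo, int hi, double r)
    {
        if (lo == hi) { inv[lo] = r; return; }           /* r == 1/a[lo] here */
        int mid = (lo + hi) / 2;
        down(inv, p, 2*node,   lo,    mid, r * p[2*node+1]);   /* 1 multiply */
        down(inv, p, 2*node+1, mid+1, hi, r * p[2*node]);      /* 1 multiply */
    }

    void tree_recip(const double *a, double *inv, int n)
    {
        double *p = malloc(4 * n * sizeof *p);           /* per-node products */
        up(a, p, 1, 0, n - 1);
        down(inv, p, 1, 0, n - 1, 1.0 / p[1]);           /* the one reciprocal */
        free(p);
    }

With a balanced split, each output passes through roughly log N rounded multiplies on the way up and again on the way down, instead of roughly N in the purely sequential layout.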
Rich
-------- Quoting Andy Latto <andy.latto@pobox.com>:
On Fri, Dec 22, 2017 at 1:15 PM, <rcs@xmission.com> wrote:
This is mostly for the old-timers.
Once upon a time, roughly <1965, division was expensive. Some computers even came without a Divide instruction, and had to use a subroutine. With floating point numbers, the division X/Y was sometimes done as X times 1/Y. A programming trick from that era came to be known as "one more reciprocal".
The idea of 1MR is to replace several reciprocal operations with a single one, minimizing the total execution time. The magic formula: Assume you need to compute 1/X and 1/Y. Compute R = 1/(XY), and then 1/X = RY and 1/Y = RX.
The idea extends to more reciprocals; potentially an arbitrary number of reciprocals can be replaced with a single one. Each avoided reciprocal costs an extra three multiplications.
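A minimal C sketch of the two-reciprocal formula (the function name is just for illustration):

    /* One reciprocal and three multiplies in place of two reciprocals. */
    void recip_pair(double x, double y, double *inv_x, double *inv_y)
    {
        double r = 1.0 / (x * y);   /* R = 1/(XY) */
        *inv_x = r * y;             /* 1/X = R*Y  */
        *inv_y = r * x;             /* 1/Y = R*X  */
    }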
It's not obvious to me that it's exactly 3 additional multiplications per reciprocal. Can you show me how to work this out?
To compute reciprocals of A, B, and C: R = 1/ABC; then 1/A = RBC, 1/B = RAC, and 1/C = RAB. So this looks like 8 multiplications. But when we computed ABC, we could store the AB partial result, and then computing RAB takes only one more. And if we save RC when calculating RBC, we can reuse it when calculating RAC, so down to 6.
So that's 6 multiplications total (AB, ABC, R*C, (R*C)*B, (R*C)*A, R*AB), and the second saved reciprocal cost us 3 additional multiplications. How about the next one?
R = 1/ABCD, 1/A = RBCD, 1/B = RACD, 1/C = RABD, 1/D = RABC
Again, naively this is 15 multiplications, but by reusing partial results, we can compute
AB, ABC, ABCD, BC, BCD, RBCD, AC, ACD, RACD, ABD, RABD, RABC, or 12 multiplications, so it took us an extra 6. Maybe I could be cleverer in the intermediate results I calculate:
AB, ABC, ABCD, CD, BCD, RBCD, ACD, RACD, ABD, RABD, RABC, so down to 11 multiplications, so now it only took an extra 5. Let's try yet again:
AB, CD, ABCD, BCD, RBCD, ACD, RACD, ABD, RABD, ABC, RABC. still 11 multiplications. One more try:
AB, CD, ABCD, RCD, RBCD, RACD, RAB, RABD, RABC. OK, I got down to 9 multiplications.
Another possible order:
AB, ABC, ABCD, RABC, RAB, RABD, CD, RCD, RACD, RBCD takes 10.
Is there some easy way to see what the order of multiplications is that guarantees only 3 extra for each new reciprocal?
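One inductive way to count it: suppose the reciprocals of A, B, C, ... are already obtained from their full product P and its reciprocal 1/P, and a new value E is added. Then

    compute PE          1 multiply
    R' = 1/(PE)         the single reciprocal, now taken here instead of 1/P
    1/E = R'*P          1 multiply
    1/P = R'*E          1 multiply

and the old scheme proceeds unchanged from 1/P. That is the two-number formula applied to P and E, and it costs exactly three extra multiplies per added value.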
Andy
Whether this is worthwhile depends on the relative cost of Divide versus Multiplication, and some side costs: code complexity, storage, concern for over/underflow, and a slight loss of accuracy.
As division has cheapened, 1MR has mostly fallen out of use. 1MR is still used in modular arithmetic, where the relative costs favor it. Peter Montgomery introduced it for elliptic curve calculations c. 1985. It's also usable for matrices, if you are careful about non-commutativity.
There was a brief vogue for possible generalizations, perhaps when computing several square roots, or several multiplications -- a multiply can be replaced with two squarings. AFAIK, these didn't go anywhere.
I'm looking for some mention of 1MR, either in print or old memories. Google is unhelpful, and my Numerical Recipes is too recent.
Rich