A potential problem with dropping the priority queues is the amount of random memory access. If your problem won't fit in the memory cache, things slow down a lot. There's a way of organizing priority queues that drops most of the (heavy) reorganization overhead. The key idea is to use a larger base, like 256, instead of 2. Bins come in a few sizes: the stuff to be processed in the immediate future (about 256 cells), the less-than-immediate future (256 cells, each holding unsorted values in a width-256 range), and the far future (256 cells, each covering a width-65536 range).

You spend the bulk of your time doing whatever processing is required for individual values. When a low-level 256-range is completed, you open up the next higher-level cell and spread its contents out into the low-level 256 array. There's never any sorting, and only one or two levels of rebinning into smaller bins. You can tinker with the bin-size ratios at each level to tune for your problem; there's nothing sacred about 256. The advantage of this scheme is that it's friendly to cache memory, and it will even work tolerably if disk files are involved.

One interesting gimmick with divisor lists: the insertion of a next-number-with-this-divisor item (item: N+D, divisor D) into slot N+D can always be at the top of the list. When the time comes to process N+D, the list of divisors will already be in sorted (increasing) order.

Rich

-----Original Message-----
From: math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com [mailto:math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com] On Behalf Of Joerg Arndt
Sent: Tuesday, April 25, 2006 2:00 AM
To: math-fun
Subject: [math-fun] arctan: update sieving and a fun relation

Sieving (finding all X <= M so that X^2+1 factors into primes <= 761):

First version: factor X^2+1 and test smoothness. Takes 11,000 cycles per test. Improvements (the gcd trick) accelerate the method, but it is still not good enough.
Second version: add logarithms (as described in an earlier mail) and use priority queues. Takes 310 cycles. Tedious optimizations bring it down to 240 cycles. Inspection of the machine code shows that rescheduling the binary heaps eats most of the cycles.

Third version: same, but _discard_ the priority queues. Careful but straightforward optimization brings the method down to just over 11 cycles per test! Analyzing the memory throughput and instructions involved: both are reasonably close to peak performance. Any further speedup would need completely new ideas.

<clip>
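The hierarchical-bin scheme Rich describes can be sketched as follows. This is a minimal illustration with invented names (`BinQueue`, `advance`, `prime_divisor_lists`), a small radix of 16 rather than 256, and only two levels — the "far future" is collapsed into a dict of width-R blocks. Driving it as a divisor-list sieve also demonstrates the push-to-front gimmick: for a fixed slot N+D, larger divisors D arrive earlier (from smaller N), so front insertion leaves every slot's divisor list already in increasing order when it is processed.

```python
R = 16  # bin radix; Rich suggests 256, but "there's nothing sacred about 256"

class BinQueue:
    """Two-level bin queue (hypothetical sketch). Level 0 ("fine") is R
    width-1 bins covering keys in [base, base+R); everything further out
    lives in a dict of unsorted width-R blocks, standing in for the
    coarser 256-cell levels of the text."""
    def __init__(self):
        self.base = 0
        self.fine = [[] for _ in range(R)]
        self.coarse = {}  # block index key // R -> list of (key, item)

    def insert(self, key, item):
        if key < self.base + R:
            # push to the front of the list: later (smaller-divisor)
            # arrivals land first, so each bin reads out in increasing order
            self.fine[key - self.base].insert(0, item)
        else:
            self.coarse.setdefault(key // R, []).append((key, item))

    def advance(self):
        # level 0 exhausted: open the next coarse block and spread it out
        self.base += R
        self.fine = [[] for _ in range(R)]
        for key, item in self.coarse.pop(self.base // R, []):
            self.fine[key - self.base].insert(0, item)

def prime_divisor_lists(limit):
    """Slot N collects the prime divisors of N; an empty slot means N is
    prime. Each divisor D is re-inserted at slot N+D, as in the text."""
    q, out = BinQueue(), {}
    for n in range(2, limit + 1):
        if n >= q.base + R:
            q.advance()
        divisors = q.fine[n - q.base] or [n]  # empty slot: n is prime
        out[n] = divisors
        for d in divisors:
            q.insert(n + d, d)
    return out
```

No sorting ever happens: each processed slot yields its divisors already ordered, and rebinning is a single spread of one coarse cell per R steps.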
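Joerg's log-accumulation idea (the second version, before the queue machinery is optimized away) can be sketched roughly as below. This is an illustrative reconstruction, not his code, and all names are mine: an odd prime p divides X^2+1 only when p = 1 (mod 4), at X = +-r (mod p) where r^2 = -1 (mod p); 2 divides X^2+1 exactly when X is odd. The 0.5 threshold is generous slack (prime powers are not sieved here), and candidates are confirmed by trial division, so the slack costs only speed, never correctness.

```python
import math

def primes_up_to(B):
    sieve = [True] * (B + 1)
    primes = []
    for p in range(2, B + 1):
        if sieve[p]:
            primes.append(p)
            for m in range(p * p, B + 1, p):
                sieve[m] = False
    return primes

def sqrt_of_minus_one(p):
    # for p = 1 (mod 4): r = a^((p-1)/4) mod p, a any quadratic nonresidue
    for a in range(2, p):
        if pow(a, (p - 1) // 2, p) == p - 1:
            return pow(a, (p - 1) // 4, p)

def is_smooth(n, primes):
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def smooth_x2p1(M, B):
    """All X <= M with X^2+1 B-smooth, found by adding logarithms."""
    primes = primes_up_to(B)
    logs = [0.0] * (M + 1)
    for x in range(1, M + 1, 2):      # 2 | X^2+1 exactly when X is odd
        logs[x] += math.log(2)
    for p in primes:
        if p % 4 == 1:
            r = sqrt_of_minus_one(p)
            for root in {r, p - r}:
                for x in range(root, M + 1, p):
                    logs[x] += math.log(p)
    # half the full logarithm suffices to flag a candidate, since prime
    # powers were not sieved; exact smoothness is then verified
    return [x for x in range(1, M + 1)
            if logs[x] > 0.5 * math.log(x * x + 1)
            and is_smooth(x * x + 1, primes)]
```

In the real problem the bound is B = 761 and X runs far beyond toy range; here the inner position loops are where a priority-queue or bin-queue schedule of "next X hit by p" would replace the per-prime scans.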