A potential problem with dropping the priority queues is the amount of random memory access. If your problem won't fit in the memory cache, things slow down a lot. There's a way of organizing priority queues that drops most of the (heavy) reorganization overhead. The key idea is to use a larger base, like 256, instead of 2. Bins come in a few sizes: the stuff to be processed in the immediate future (about 256 cells), the less-than-immediate future (256 cells, each holding unsorted values in a width-256 range), and the far future (256 cells, each covering a width-65536 range).

You spend the bulk of your time doing whatever processing is required for individual values. When a low-level 256-range is completed, you open up the next higher-level cell and spread its contents out into the low-level 256 array. There's never any sorting, and only one or two levels of rebinning into smaller bins. You can tinker with the bin-size ratios at each level to tune for your problem; there's nothing sacred about 256. The advantage of this scheme is that it's friendly to cache memory, and it will even work tolerably if disk files are involved.

One interesting gimmick with divisor lists: the insertion of a next-number-with-this-divisor item (item: N+D, divisor D) into slot N+D can always be at the top of the list. When the time comes to process N+D, the list of divisors will already be in sorted (increasing) order.

Rich

-----Original Message-----
From: math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com [mailto:math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com] On Behalf Of Joerg Arndt
Sent: Tuesday, April 25, 2006 2:00 AM
To: math-fun
Subject: [math-fun] arctan: update sieving and a fun relation

Sieving (finding all X <= M so that X^2+1 factors into primes <= 761):

First version: factor X^2+1 and test smoothness. Takes 11,000 cycles per test. Improvements (the gcd trick) accelerate the method, but it is still not good enough.
Second version: add logarithms (as described in an earlier mail) and use priority queues. Takes 310 cycles. Tedious optimizations bring it down to 240 cycles. Inspection of the machine code shows that rescheduling the binary heaps eats most of the cycles.

Third version: same, but _discard_ the priority queues. Careful but straightforward optimization brings the method down to just over 11 cycles per test! Analyzing the memory throughput and instructions involved: both are reasonably close to peak performance. Any further speedup would need completely new ideas.

<clip>
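The hierarchical-bin scheme Rich describes can be sketched as follows. This is a minimal illustration with invented names (`BinQueue`, `advance`, `prime_divisor_lists`), a small radix of 16 rather than 256, and only two levels — the "far future" is collapsed into a dict of width-R blocks. Driving it as a divisor-list sieve also demonstrates the push-to-front gimmick: for a fixed slot N+D, larger divisors D arrive earlier (from smaller N), so front insertion leaves every slot's divisor list already in increasing order when it is processed.

```python
R = 16  # bin radix; Rich suggests 256, but "there's nothing sacred about 256"

class BinQueue:
    """Two-level bin queue (hypothetical sketch). Level 0 ("fine") is R
    width-1 bins covering keys in [base, base+R); everything further out
    lives in a dict of unsorted width-R blocks, standing in for the
    coarser 256-cell levels of the text."""
    def __init__(self):
        self.base = 0
        self.fine = [[] for _ in range(R)]
        self.coarse = {}  # block index key // R -> list of (key, item)

    def insert(self, key, item):
        if key < self.base + R:
            # push to the front of the list: later (smaller-divisor)
            # arrivals land first, so each bin reads out in increasing order
            self.fine[key - self.base].insert(0, item)
        else:
            self.coarse.setdefault(key // R, []).append((key, item))

    def advance(self):
        # level 0 exhausted: open the next coarse block and spread it out
        self.base += R
        self.fine = [[] for _ in range(R)]
        for key, item in self.coarse.pop(self.base // R, []):
            self.fine[key - self.base].insert(0, item)

def prime_divisor_lists(limit):
    """Slot N collects the prime divisors of N; an empty slot means N is
    prime. Each divisor D is re-inserted at slot N+D, as in the text."""
    q, out = BinQueue(), {}
    for n in range(2, limit + 1):
        if n >= q.base + R:
            q.advance()
        divisors = q.fine[n - q.base] or [n]  # empty slot: n is prime
        out[n] = divisors
        for d in divisors:
            q.insert(n + d, d)
    return out
```

No sorting ever happens: each processed slot yields its divisors already ordered, and rebinning is a single spread of one coarse cell per R steps.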
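Joerg's log-accumulation idea (the second version, before the queue machinery is optimized away) can be sketched roughly as below. This is an illustrative reconstruction, not his code, and all names are mine: an odd prime p divides X^2+1 only when p = 1 (mod 4), at X = +-r (mod p) where r^2 = -1 (mod p); 2 divides X^2+1 exactly when X is odd. The 0.5 threshold is generous slack (prime powers are not sieved here), and candidates are confirmed by trial division, so the slack costs only speed, never correctness.

```python
import math

def primes_up_to(B):
    sieve = [True] * (B + 1)
    primes = []
    for p in range(2, B + 1):
        if sieve[p]:
            primes.append(p)
            for m in range(p * p, B + 1, p):
                sieve[m] = False
    return primes

def sqrt_of_minus_one(p):
    # for p = 1 (mod 4): r = a^((p-1)/4) mod p, a any quadratic nonresidue
    for a in range(2, p):
        if pow(a, (p - 1) // 2, p) == p - 1:
            return pow(a, (p - 1) // 4, p)

def is_smooth(n, primes):
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def smooth_x2p1(M, B):
    """All X <= M with X^2+1 B-smooth, found by adding logarithms."""
    primes = primes_up_to(B)
    logs = [0.0] * (M + 1)
    for x in range(1, M + 1, 2):      # 2 | X^2+1 exactly when X is odd
        logs[x] += math.log(2)
    for p in primes:
        if p % 4 == 1:
            r = sqrt_of_minus_one(p)
            for root in {r, p - r}:
                for x in range(root, M + 1, p):
                    logs[x] += math.log(p)
    # half the full logarithm suffices to flag a candidate, since prime
    # powers were not sieved; exact smoothness is then verified
    return [x for x in range(1, M + 1)
            if logs[x] > 0.5 * math.log(x * x + 1)
            and is_smooth(x * x + 1, primes)]
```

In the real problem the bound is B = 761 and X runs far beyond toy range; here the inner position loops are where a priority-queue or bin-queue schedule of "next X hit by p" would replace the per-prime scans.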