I have been extremely pedantic about machine-specific issues like (level-one) cache size, alignment, and (CPU-)internal parallelism. On top of that I carefully inspected the machine code of every version of my code and benchmarked each.

I had up to 3 priority queues:
 1) very frequent events
 2) events that occur reasonably often
 3) rare events
I was very surprised that even dropping the last one brought a speedup, even though it held only _very_ rare events. I have only vague ideas about why this is: the code is so fast already that one sees code and data dependencies that stay hidden in less optimized code.

For example, I now have two tables of events, rare and not-so-rare, plus an unrolled prefill phase (length ten, taking care of log(2), log(5), and log(10)). Initially I had the sequence
 prefill, not-so-rare, rare
Changing that to
 prefill, rare, not-so-rare
gives a small but consistent speedup.

The maximum gain possible through smarter event handling for the rare events can be tested by simply leaving them out: you don't get below 11 cycles (!).

Btw. the "compute the log only after the sum exceeds that of the last found X" trick reduces the number of actual log computations to exactly the number of smooth X found (a sketch follows at the end of this message). Note what a devastating impact the log computations (which take 150 cycles or so) would otherwise have on performance.

One might get down to 10 cycles using SIMD instructions. However, my experiences with (hand-written) SSE2 code have been quite discouraging. Besides, SSE is probably the least orthogonal instruction set to date, i.e. painful to code with, and even more painful when producing fast code. So I do the lazy thing and state: "it's fast enough!"

Anyway, thanks for the multi-bin idea!

For those still reading, I am using
  % gcc --version
  gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux)
which produces scalar but breathtakingly well optimized (and often beautiful!) machine code. Producing hand-coded assembler that beats the gcc machine code is, if possible at all, extremely hard.

The machine is an AMD64 @ 2.2GHz, socket 939, dual-channel DDR 200; memory transfer is 2GB/sec when far beyond the cache. Important CPU characteristics are

  Name: AMD Athlon(tm) 64 Processor 3400+
  Family: 15, Model: 15, Stepping: 0
  Level 1 cache (data):  64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 1 cache (instr): 64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 2 cache: 512 kB, 16-way associative. 64 bytes per line, lines per tag: 1.
  fpu:  x87 FPU
  sse:  Streaming SIMD Extensions
  sse2: Streaming SIMD Extensions-2
  cmov: CMOV instruction (plus FPU FCMOVCC and FCOMI)

If you are considering an x86 system, go for AMD64. The document http://swox.com/doc/x86-timing.pdf tells you why, in every little detail. End of advertisement :o)
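Here is a minimal C sketch of the lazy-log trick. All names (logsum[], report_smooth) are hypothetical stand-ins, assuming a sieve that accumulates the precomputed logs of each candidate's small prime divisors; x is smooth exactly when that sum equals log(x) up to rounding. Since candidates are scanned in increasing order, log(x) can never be below the log of the last smooth number found, so that value serves as a cheap filter and the expensive log() call is reached (in practice) only once per smooth x found:

  #include <math.h>
  #include <stdio.h>

  #define EPS 1e-9  /* rounding slack for the floating-point compares */

  /* logsum[x]: sum of the logs of x's small prime divisors, filled in
   * elsewhere by the cheap per-event additions of the sieve. */
  void report_smooth(const double *logsum, long n)
  {
      double loglast = 0.0;  /* log of the last smooth x found */
      for (long x = 2; x < n; ++x)
      {
          if (logsum[x] <= loglast + EPS)  /* cheap compare, almost always rejects */
              continue;
          double lx = log((double)x);      /* expensive (~150 cycles), but rare */
          if (logsum[x] >= lx - EPS)       /* full smoothness test */
          {
              printf("smooth: %ld\n", x);
              loglast = lx;                /* raise the filter threshold */
          }
      }
  }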
* Schroeppel, Richard <rschroe@sandia.gov> [Apr 25. 2006 18:17]:

A potential problem with dropping the priority queues is the amount of random memory access. If your problem won't fit in the memory cache, things slow down a lot.

There's a way of organizing priority queues that drops most of the (heavy) reorganization overhead. The key idea is to use a larger base like 256, instead of 2. Bins come in a few sizes: the stuff to be processed in the immediate future (about 256 cells), the less-than-immediate future (256 cells, each holding unsorted values in a width-256 range), and the far future (256 cells, each holding a width-65536 range). You spend the bulk of your time doing whatever processing is required for individual values. When a low-level 256-range is completed, you open up the next-higher-level cell and spread out its contents into the low-level 256 array. There's never any sorting, and only one or two levels of rebinning into smaller bins. You can tinker with the bin-size ratios at each level to optimize for your problem; there's nothing sacred about 256. The advantage of this scheme is that it's compatible with cache memory, and it will even work tolerably if disk files are involved.
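A C sketch of this three-level scheme, under stated assumptions: the names (pq3, pq_insert, pq_pop_due) and the power-of-two widths are illustrative only, keys are 32-bit event times never scheduled more than 2^24 ahead of the current time, and each level is just an array of 256 head pointers to unsorted singly linked lists:

  #include <stdint.h>
  #include <stddef.h>

  typedef struct item { uint32_t key; struct item *next; } item;

  typedef struct {
      uint32_t base;     /* current time; nearq[] covers its 256-block    */
      item *nearq[256];  /* one cell per individual key                   */
      item *midq[256];   /* width-256 bins covering the current 64K-block */
      item *farq[256];   /* width-65536 bins for the far future           */
  } pq3;

  static void push(item **head, item *it) { it->next = *head; *head = it; }

  /* file an item by how far its key lies beyond the current windows
   * (assumes base <= key < base + 2^24) */
  void pq_insert(pq3 *q, item *it)
  {
      if ((it->key >> 8) == (q->base >> 8))
          push(&q->nearq[it->key & 0xff], it);        /* immediate future */
      else if ((it->key >> 16) == (q->base >> 16))
          push(&q->midq[(it->key >> 8) & 0xff], it);  /* same 64K-block   */
      else
          push(&q->farq[(it->key >> 16) & 0xff], it); /* far future       */
  }

  /* detach and return all items due exactly now, then advance the clock;
   * rebinning happens only when a 256-block (or 64K-block) is used up */
  item *pq_pop_due(pq3 *q)
  {
      item *due = q->nearq[q->base & 0xff];
      q->nearq[q->base & 0xff] = NULL;
      q->base++;
      if ((q->base & 0xff) == 0) {              /* 256-block exhausted */
          if ((q->base & 0xffff) == 0) {        /* 64K-block exhausted: */
              item **fb = &q->farq[(q->base >> 16) & 0xff];
              for (item *p = *fb, *nx; p; p = nx) {   /* scatter far -> mid */
                  nx = p->next;
                  push(&q->midq[(p->key >> 8) & 0xff], p);
              }
              *fb = NULL;
          }
          item **mb = &q->midq[(q->base >> 8) & 0xff];
          for (item *p = *mb, *nx; p; p = nx) {       /* scatter mid -> near */
              nx = p->next;
              push(&q->nearq[p->key & 0xff], p);
          }
          *mb = NULL;
      }
      return due;
  }

As Rich says, there's nothing sacred about 256; here the bin widths only need to be powers of two so that shift-and-mask indexing works.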
One interesting gimmick with divisor lists is that the insertion of the next-number-with-this-divisor item (value N+D, divisor D) into slot N+D can always be at the top of the list. When the time comes to process N+D, its list of divisors will already be in sorted (increasing) order: a larger divisor D reaches slot N+D from an earlier N, so it gets buried deeper by the later insertions of smaller divisors.
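A small C sketch of this gimmick, written as an incremental sieve over a circular window (the names and the window size W are illustrative; W only has to exceed the largest divisor tracked). Processing slot n moves each divisor d forward to slot n+d, always pushing at the head, and each slot's list then reads off smallest-first by itself:

  #include <stdio.h>
  #include <stdlib.h>

  #define W 64  /* circular window; must exceed the largest divisor */

  typedef struct node { long d; struct node *next; } node;
  static node *slot[W];

  static void push(long n, node *it)  /* insert at the top of slot n's list */
  {
      it->next = slot[n % W];
      slot[n % W] = it;
  }

  int main(void)
  {
      long divs[] = { 2, 3, 5, 7 };   /* divisors to track */
      for (int i = 0; i < 4; ++i) {
          node *it = malloc(sizeof *it);
          it->d = divs[i];
          push(2 * divs[i], it);       /* seed at the first proper multiple */
      }
      for (long n = 4; n < 100; ++n) {
          node *lst = slot[n % W];
          slot[n % W] = NULL;          /* detach; reinsertions go elsewhere */
          if (!lst) continue;
          printf("%ld:", n);
          while (lst) {
              node *nx = lst->next;    /* save: push() overwrites ->next */
              printf(" %ld", lst->d);  /* comes out in increasing order */
              push(n + lst->d, lst);   /* next multiple, at the top */
              lst = nx;
          }
          printf("\n");
      }
      return 0;
  }

For example, slot 30 receives 5 (from 25), then 3 (from 27), then 2 (from 28); head insertion makes it print "30: 2 3 5" with no sorting anywhere.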
Rich
[...]