I have been extremely pedantic about machine-specific issues like (level-one) cache size, alignment, and (CPU-)internal parallelism. On top of that I carefully inspected the machine code of every version of my code and benchmarked each.

I had up to 3 priority queues:
 1) very frequent events
 2) events that occur reasonably often
 3) rare events
I was very surprised that even dropping the last one brought a speedup, even though it held only _very_ rare events. I have only vague ideas about why this is: the code is so fast already that one sees code and data dependencies that stay hidden in less optimized code.

For example, I now have two tables of events, rare and not-so-rare, plus an unrolled prefill phase (length ten, taking care of log(2), log(5), and log(10)). Initially I had the sequence
 prefill, not-so-rare, rare
Changing that to
 prefill, rare, not-so-rare
gives a small but consistent speedup.

The maximum gain possible through smarter event handling for the rare events can be tested by simply leaving them out: you don't get below 11 cycles (!).

Btw. the "compute the log only after the sum exceeds that of the last found X" trick reduces the number of actual log computations to exactly the number of smooth X found (a sketch follows at the end of this message). Note what a devastating impact the log computations (which take 150 cycles or so) would otherwise have on performance.

One might get down to 10 cycles using SIMD instructions. However, my experiences with (hand-written) SSE2 code have been quite discouraging. Besides, SSE is probably the least orthogonal instruction set to date, i.e. painful to code with, and even more painful when producing fast code. So I do the lazy thing and state: "it's fast enough!"

Anyway, thanks for the multi-bin idea!

For those still reading, I am using
  % gcc --version
  gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux)
which produces scalar but breathtakingly well optimized (and often beautiful!) machine code. Producing hand-coded assembler that beats the gcc machine code is, if possible at all, extremely hard.

The machine is an AMD64 @ 2.2GHz, socket 939, dual-channel DDR 200; memory transfer is 2GB/sec when far beyond the cache. Important CPU characteristics are

  Name: AMD Athlon(tm) 64 Processor 3400+
  Family: 15, Model: 15, Stepping: 0
  Level 1 cache (data):  64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 1 cache (instr): 64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 2 cache: 512 kB, 16-way associative. 64 bytes per line, lines per tag: 1.
  fpu:  x87 FPU
  sse:  Streaming SIMD Extensions
  sse2: Streaming SIMD Extensions-2
  cmov: CMOV instruction (plus FPU FCMOVCC and FCOMI)

If you are considering an x86 system, go for AMD64. The document http://swox.com/doc/x86-timing.pdf tells you why, in every little detail. End of advertisement :o)
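Here is a minimal C sketch of the lazy-log trick. All names (logsum[], report_smooth) are hypothetical stand-ins, assuming a sieve that accumulates the precomputed logs of each candidate's small prime divisors; x is smooth exactly when that sum equals log(x) up to rounding. Since candidates are scanned in increasing order, log(x) can never be below the log of the last smooth number found, so that value serves as a cheap filter and the expensive log() call is reached (in practice) only once per smooth x found:

  #include <math.h>
  #include <stdio.h>

  #define EPS 1e-9  /* rounding slack for the floating-point compares */

  /* logsum[x]: sum of the logs of x's small prime divisors, filled in
   * elsewhere by the cheap per-event additions of the sieve. */
  void report_smooth(const double *logsum, long n)
  {
      double loglast = 0.0;  /* log of the last smooth x found */
      for (long x = 2; x < n; ++x)
      {
          if (logsum[x] <= loglast + EPS)  /* cheap compare, almost always rejects */
              continue;
          double lx = log((double)x);      /* expensive (~150 cycles), but rare */
          if (logsum[x] >= lx - EPS)       /* full smoothness test */
          {
              printf("smooth: %ld\n", x);
              loglast = lx;                /* raise the filter threshold */
          }
      }
  }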
* Schroeppel, Richard <rschroe@sandia.gov> [Apr 25. 2006 18:17]:

A potential problem with dropping the priority queues is the amount of random memory access. If your problem won't fit in the memory cache, things slow down a lot.

There's a way of organizing priority queues that drops most of the (heavy) reorganization overhead. The key idea is to use a larger base like 256, instead of 2. Bins come in a few sizes: the stuff to be processed in the immediate future (about 256 cells), the less-than-immediate future (256 cells, each holding unsorted values in a width-256 range), and the far future (256 cells, each holding a width-65536 range). You spend the bulk of your time doing whatever processing is required for individual values. When a low-level 256-range is completed, you open up the next-higher-level cell and spread out its contents into the low-level 256 array. There's never any sorting, and only one or two levels of rebinning into smaller bins. You can tinker with the bin-size ratios at each level to optimize for your problem; there's nothing sacred about 256. The advantage of this scheme is that it's compatible with cache memory, and it will even work tolerably if disk files are involved.
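A C sketch of this three-level scheme, under stated assumptions: the names (pq3, pq_insert, pq_pop_due) and the power-of-two widths are illustrative only, keys are 32-bit event times never scheduled more than 2^24 ahead of the current time, and each level is just an array of 256 head pointers to unsorted singly linked lists:

  #include <stdint.h>
  #include <stddef.h>

  typedef struct item { uint32_t key; struct item *next; } item;

  typedef struct {
      uint32_t base;     /* current time; nearq[] covers its 256-block    */
      item *nearq[256];  /* one cell per individual key                   */
      item *midq[256];   /* width-256 bins covering the current 64K-block */
      item *farq[256];   /* width-65536 bins for the far future           */
  } pq3;

  static void push(item **head, item *it) { it->next = *head; *head = it; }

  /* file an item by how far its key lies beyond the current windows
   * (assumes base <= key < base + 2^24) */
  void pq_insert(pq3 *q, item *it)
  {
      if ((it->key >> 8) == (q->base >> 8))
          push(&q->nearq[it->key & 0xff], it);        /* immediate future */
      else if ((it->key >> 16) == (q->base >> 16))
          push(&q->midq[(it->key >> 8) & 0xff], it);  /* same 64K-block   */
      else
          push(&q->farq[(it->key >> 16) & 0xff], it); /* far future       */
  }

  /* detach and return all items due exactly now, then advance the clock;
   * rebinning happens only when a 256-block (or 64K-block) is used up */
  item *pq_pop_due(pq3 *q)
  {
      item *due = q->nearq[q->base & 0xff];
      q->nearq[q->base & 0xff] = NULL;
      q->base++;
      if ((q->base & 0xff) == 0) {              /* 256-block exhausted */
          if ((q->base & 0xffff) == 0) {        /* 64K-block exhausted: */
              item **fb = &q->farq[(q->base >> 16) & 0xff];
              for (item *p = *fb, *nx; p; p = nx) {   /* scatter far -> mid */
                  nx = p->next;
                  push(&q->midq[(p->key >> 8) & 0xff], p);
              }
              *fb = NULL;
          }
          item **mb = &q->midq[(q->base >> 8) & 0xff];
          for (item *p = *mb, *nx; p; p = nx) {       /* scatter mid -> near */
              nx = p->next;
              push(&q->nearq[p->key & 0xff], p);
          }
          *mb = NULL;
      }
      return due;
  }

As Rich says, there's nothing sacred about 256; here the bin widths only need to be powers of two so that shift-and-mask indexing works.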
One interesting gimmick with divisor lists is that the insertion of the next-number-with-this-divisor item (value N+D, divisor D) into slot N+D can always be at the top of the list. When the time comes to process N+D, its list of divisors will already be in sorted (increasing) order: a larger divisor D reaches slot N+D from an earlier N, so it gets buried deeper by the later insertions of smaller divisors.
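A small C sketch of this gimmick, written as an incremental sieve over a circular window (the names and the window size W are illustrative; W only has to exceed the largest divisor tracked). Processing slot n moves each divisor d forward to slot n+d, always pushing at the head, and each slot's list then reads off smallest-first by itself:

  #include <stdio.h>
  #include <stdlib.h>

  #define W 64  /* circular window; must exceed the largest divisor */

  typedef struct node { long d; struct node *next; } node;
  static node *slot[W];

  static void push(long n, node *it)  /* insert at the top of slot n's list */
  {
      it->next = slot[n % W];
      slot[n % W] = it;
  }

  int main(void)
  {
      long divs[] = { 2, 3, 5, 7 };   /* divisors to track */
      for (int i = 0; i < 4; ++i) {
          node *it = malloc(sizeof *it);
          it->d = divs[i];
          push(2 * divs[i], it);       /* seed at the first proper multiple */
      }
      for (long n = 4; n < 100; ++n) {
          node *lst = slot[n % W];
          slot[n % W] = NULL;          /* detach; reinsertions go elsewhere */
          if (!lst) continue;
          printf("%ld:", n);
          while (lst) {
              node *nx = lst->next;    /* save: push() overwrites ->next */
              printf(" %ld", lst->d);  /* comes out in increasing order */
              push(n + lst->d, lst);   /* next multiple, at the top */
              lst = nx;
          }
          printf("\n");
      }
      return 0;
  }

For example, slot 30 receives 5 (from 25), then 3 (from 27), then 2 (from 28); head insertion makes it print "30: 2 3 5" with no sorting anywhere.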
Rich
[...]