RE: [math-fun] arctan: update sieving and a fun relation
You can use very rough approximate logarithms. In my factor sieving code, I used base cbrt2, and my logs all fit in a 9-bit field. Computing these logs is pretty fast: find the highest bit of N, left normalize, and compare to the normalized 2^(1/6) and 2^(3/6). The final quotient in your factorization of X^2+1 is either close to 1 or very far away, so you can tolerate a lot of rounding noise from subtracting approximate logs. On today's processors there's likely no benefit to using bytes, but you still only need a couple of bits of mantissa precision for your logs. You can compute log(prime) with a table lookup on the high-order few bits. But as you say, "fast enough".

Rich

-----Original Message-----
From: math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com [mailto:math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com] On Behalf Of Joerg Arndt
Sent: Tuesday, April 25, 2006 11:22 AM
To: math-fun
Subject: Re: [math-fun] arctan: update sieving and a fun relation

I have been extremely pedantic about machine-specific issues like (level-one) cache size, alignment, and (CPU) internal parallelism. In addition, I carefully inspected the machine code of every version of my code and benchmarked each.

I had up to 3 priority queues:
1) very frequent events
2) events that occur reasonably often
3) rare events

I was very surprised that even dropping the last brought a speedup, even though it held only _very_ rare events. I have only vague ideas about why this is: as the code is already so fast, one sees code and data dependencies that do not show up in less optimized code.

For example, I now have two tables of events, rare and not-so-rare, plus an unrolled prefill phase (length ten, taking care of log(2), log(5), and log(10)). Initially I had the sequence

  prefill, not-so-rare, rare

Changing that to

  prefill, rare, not-so-rare

gives a small but consistent speedup.
The maximum gain possible with smarter event handling for the rare events can be tested by simply leaving them out: you don't get below 11 cycles (!)

Btw. the "compute the log only after the sum is greater than that of the last found X" trick reduces the number of actual log computations to exactly the number of smooth X found. Note what a devastating impact the log computations (which take 150 cycles or so) would otherwise have on performance.

One might get down to 10 cycles using SIMD instructions. However, my experience with (hand-written) SSE2 code has been quite discouraging. Plus, the SSE set is probably the most anti-orthogonal instruction set to date (i.e., painful to code in, and even more painful to produce fast code for). So I do the lazy thing and state: "it's fast enough!"

Anyway, thanks for the multi-bin idea!

For those still reading, I am using

  % gcc --version
  gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux)

which produces scalar but breathtakingly well optimized (and often beautiful!) machine code. Producing hand-coded assembler that beats the gcc machine code is, if possible at all, extremely hard.

The machine is an AMD64 @ 2.2GHz, socket 939, dual-channel DDR 200; memory transfer is 2 GB/sec when far beyond cache. Important CPU characteristics are

  Name: AMD Athlon(tm) 64 Processor 3400+
  Family: 15, Model: 15, Stepping: 0
  Level 1 cache (data):  64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 1 cache (instr): 64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 2 cache: 512 kB, 16-way associative. 64 bytes per line, lines per tag: 1.
  fpu:  x87 FPU
  sse:  Streaming SIMD Extensions
  sse2: Streaming SIMD Extensions-2
  cmov: CMOV instruction (plus FPU FCMOVCC and FCOMI)

If you consider an x86 system, then go for AMD64. The document http://swox.com/doc/x86-timing.pdf tells you why, in every little detail. End of advertisement :o)

* Schroeppel, Richard <rschroe@sandia.gov> [Apr 25. 2006 18:17]:
A potential problem with dropping the priority queues is the amount of random memory access. If your problem won't fit in the memory cache, things slow down a lot.

There's a way of organizing priority queues that drops most of the (heavy) reorganization overhead. The key idea is to use a larger base, like 256 instead of 2. Bins come in a few sizes: the stuff to be processed in the immediate future (about 256 cells), the less-than-immediate future (256 cells, each holding unsorted values in a width-256 range), and the far future (256 cells, each holding a width-65536 range). You spend the bulk of your time doing whatever processing is required for individual values. When a low-level 256-range is completed, you open up the next higher-level cell and spread its contents out into the low-level 256 array. There's never any sorting, and only one or two levels of rebinning into smaller bins. You can tinker with the bin-size ratios at each level to optimize for your problem; there's nothing sacred about 256. The advantage of this scheme is that it's compatible with cache memory, and it will even work tolerably if there are disk files involved.
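[A minimal sketch of the multi-bin scheme described above, with two levels only (a third, far-future level would follow the same pattern). All names and the intrusive linked-list representation are illustrative assumptions, not the code from the thread.]

```c
/* Two-level bucket priority queue: one cell per key in the current
 * width-256 block, plus 256 coarser cells each covering a width-256
 * range of the future.  Events are intrusive singly linked lists. */
#include <string.h>

typedef struct Event {
    unsigned long key;          /* time at which the event fires */
    struct Event *next;
} Event;

enum { NEAR = 256, FAR = 256 };

typedef struct {
    unsigned long base;         /* current time; lowest key still pending */
    Event *near[NEAR];          /* near[k % NEAR]: events in the current
                                   width-NEAR block, one cell per key    */
    Event *far[FAR];            /* far[j]: unsorted events with key in
                                   [blk + NEAR*(j+1), blk + NEAR*(j+2)),
                                   blk = base rounded down to NEAR       */
} BucketQueue;

/* Insert an event with e->key >= q->base (keys beyond the far horizon
 * would go into the omitted third level). */
static void bq_insert(BucketQueue *q, Event *e)
{
    unsigned long blk = q->base & ~(unsigned long)(NEAR - 1);
    if (e->key < blk + NEAR) {
        e->next = q->near[e->key % NEAR];
        q->near[e->key % NEAR] = e;
    } else {
        unsigned long j = (e->key - blk) / NEAR - 1;  /* assumed < FAR */
        e->next = q->far[j];
        q->far[j] = e;
    }
}

/* Return all events with key == q->base and advance the clock by one.
 * Crossing a block boundary spreads one far cell into the near array;
 * that is the only reorganization this scheme ever performs. */
static Event *bq_step(BucketQueue *q)
{
    Event *hits = q->near[q->base % NEAR];
    q->near[q->base % NEAR] = 0;
    q->base++;
    if (q->base % NEAR == 0) {          /* crossed into the next block */
        Event *spill = q->far[0];
        memmove(q->far, q->far + 1, (FAR - 1) * sizeof q->far[0]);
        q->far[FAR - 1] = 0;
        while (spill) {                 /* rebin into the near array */
            Event *nx = spill->next;
            bq_insert(q, spill);
            spill = nx;
        }
    }
    return hits;
}
```

Per step, all work is a couple of loads and stores; the memmove and rebinning cost is amortized over 256 steps, and both arrays together fit comfortably in a level-one cache.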
One interesting gimmick with divisor lists is that the insertion of the next-number-with-this-divisor item (item: N+D, divisor D) into slot N+D can always be at the top of the list. When the time comes to process N+D, its list of divisors will already be in sorted (increasing) order.
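[A minimal sketch of this front-insertion trick, as a rolling sieve over prime divisors; all names are illustrative. Slot N+D receives its larger divisors earlier (they come from smaller N), so pushing each new item onto the front leaves every list in increasing order without any sorting.]

```c
/* slot[n] holds the prime divisors of n as a linked list.  When divisor
 * D of N is handled, (N+D, D) is pushed onto the FRONT of slot N+D. */
#include <assert.h>
#include <stdlib.h>

enum { LIMIT = 1000 };

typedef struct Div {
    unsigned long d;
    struct Div *next;
} Div;

static Div *slot[2 * LIMIT];        /* big enough for n + d with d <= n */

static void push_front(unsigned long n, unsigned long d)
{
    Div *e = malloc(sizeof *e);
    e->d = d;
    e->next = slot[n];
    slot[n] = e;
}

/* Roll through 2..LIMIT-1; afterwards slot[n] lists the prime divisors
 * of each composite n in increasing order. */
static void sieve_to_limit(void)
{
    unsigned long n, prev;
    Div *e;
    for (n = 2; n < LIMIT; n++) {
        if (!slot[n]) {             /* nothing arrived: n is prime */
            push_front(2 * n, n);
        } else {
            for (prev = 0, e = slot[n]; e; e = e->next) {
                assert(e->d > prev);    /* already sorted, for free */
                prev = e->d;
                push_front(n + e->d, e->d);
            }
        }
    }
}
```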
Rich
[...]
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
* Schroeppel, Richard <rschroe@sandia.gov> [Apr 26. 2006 10:01]:
> You can use very rough approximate logarithms.
Yes, the code is extremely robust; you can even (in the prefill phase) just write zeros.
> In my factor sieving code,
Guess why I am so eager to have fast sieving code! Which factoring algorithm do you use it in? I'll look into factoring in the near future.
> I used base cbrt2, and my logs all fit in a 9-bit field. Computing these logs is pretty fast: find the highest bit of N, left normalize, compare to the normalized 2^(1/6) and 2^(3/6).
However fast your rough-log is, my trick gives you more: I search all X < 10^14. A logarithm costs, say, 250 cycles, and a logarithm is computed _only_ with each finding. I find fewer than 50,000 X. So, per test, the cost is (in cycles) less than 250*50000/10^14 == 1/8000000. Unless your rough-log computes more than 8 million rough-logs _per_cycle_, your code will be slower. But then, what is the set of primes you allow? Is it rather huge? And what is the density of findings? (mine: 50000/10^14)
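[For reference, the rough logarithm Rich describes can be sketched as below. The details (16-bit normalized mantissa, the gcc builtin, a 64-bit unsigned long as on AMD64) are my assumptions; the constants are 2^(1/6) and 2^(3/6) scaled by 2^15.]

```c
/* Rough log to base cbrt2, so rough_log(n) ~ 3*log2(n): take the index
 * of the highest set bit, left-normalize, and compare the high mantissa
 * bits against the normalized 2^(1/6) and 2^(3/6) to pick a third. */
static unsigned rough_log_cbrt2(unsigned long n)    /* n >= 1 */
{
    unsigned hb = 63 - (unsigned)__builtin_clzl(n); /* floor(log2 n) */
    /* left-normalize so the top bit lands at position 63,
       then keep the 16 high-order mantissa bits */
    unsigned m = (unsigned)((n << (63 - hb)) >> 48);
    unsigned thirds = 3 * hb;
    if (m >= 36781) thirds++;       /* m >= 2^(1/6) * 2^15 */
    if (m >= 46341) thirds++;       /* m >= 2^(3/6) * 2^15 */
    /* a third compare against 2^(5/6) * 2^15 (= 58386) would also round
       mantissas just below a power of two to the nearest third; the
       quoted scheme tolerates that extra half-unit of noise */
    return thirds;
}
```

Two compares and a shift per value; the 9-bit result field then covers multiprecision N up to roughly 2^170.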
> The final quotient in your factorization of X^2+1 is either close to 1, or very far away; so you can tolerate a lot of rounding noise from subtracting approximate logs. On today's processors, there's likely no benefit to using bytes, but you still only need a couple of bits of mantissa precision for your logs. You can compute log(prime) with a table lookup on the high order few bits.
Note that the float instructions go completely in parallel with the integer instructions. Further, any conversion from float to int (or back) is still quite expensive; such conversions should _never_ occur in loops. I tried different data types for the quantities I use; it turns out unsigned long (machine word!) and double (machine word!) do best.
> But as you say, "fast enough".
> Rich
Thanks again for your very helpful advice!