RE: [math-fun] arctan: update sieving and a fun relation
You can use very rough approximate logarithms. In my factor sieving code, I used base cbrt2, and my logs all fit in a 9-bit field. Computing these logs is pretty fast: find the highest bit of N, left normalize, and compare to the normalized 2^(1/6) and 2^(3/6). The final quotient in your factorization of X^2+1 is either close to 1 or very far away, so you can tolerate a lot of rounding noise from subtracting approximate logs. On today's processors there's likely no benefit to using bytes, but you still only need a couple of bits of mantissa precision for your logs. You can compute log(prime) with a table lookup on the high-order few bits. But as you say, "fast enough".

Rich

-----Original Message-----
From: math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com [mailto:math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com] On Behalf Of Joerg Arndt
Sent: Tuesday, April 25, 2006 11:22 AM
To: math-fun
Subject: Re: [math-fun] arctan: update sieving and a fun relation

I have been extremely pedantic about machine-specific issues like (level-one) cache size, alignment, and (CPU) internal parallelism. In addition, I carefully inspected the machine code of every version of my code and benchmarked each.

I had up to 3 priority queues:
1) very frequent events
2) events that occur reasonably often
3) rare events

I was very surprised that even dropping the last brought a speedup, even though it held only _very_ rare events. I have only vague ideas about why this is: as the code is already so fast, one sees code and data dependencies that do not show up in less optimized code.

For example, I now have two tables of events, rare and not-so-rare, plus an unrolled prefill phase (length ten, taking care of log(2), log(5), and log(10)). Initially I had the sequence

  prefill, not-so-rare, rare

Changing that to

  prefill, rare, not-so-rare

gives a small but consistent speedup.
The maximum gain possible with smarter event handling for the rare events can be tested by simply leaving them out: you don't get below 11 cycles (!)

Btw. the "compute the log only after the sum is greater than that of the last found X" trick reduces the number of actual log computations to exactly the number of smooth X found. Note what a devastating impact the log computations (which take 150 cycles or so) would otherwise have on performance.

One might get down to 10 cycles using SIMD instructions. However, my experience with (hand-written) SSE2 code has been quite discouraging. Plus, the SSE set is probably the most anti-orthogonal instruction set to date (i.e., painful to code in, and even more painful to produce fast code for). So I do the lazy thing and state: "it's fast enough!"

Anyway, thanks for the multi-bin idea!

For those still reading, I am using

  % gcc --version
  gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux)

which produces scalar but breathtakingly well optimized (and often beautiful!) machine code. Producing hand-coded assembler that beats the gcc machine code is, if possible at all, extremely hard.

The machine is an AMD64 @ 2.2GHz, socket 939, dual-channel DDR 200; memory transfer is 2 GB/sec when far beyond cache. Important CPU characteristics are

  Name: AMD Athlon(tm) 64 Processor 3400+
  Family: 15, Model: 15, Stepping: 0
  Level 1 cache (data):  64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 1 cache (instr): 64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 2 cache: 512 kB, 16-way associative. 64 bytes per line, lines per tag: 1.
  fpu:  x87 FPU
  sse:  Streaming SIMD Extensions
  sse2: Streaming SIMD Extensions-2
  cmov: CMOV instruction (plus FPU FCMOVCC and FCOMI)

If you consider an x86 system, then go for AMD64. The document http://swox.com/doc/x86-timing.pdf tells you why, in every little detail. End of advertisement :o)

* Schroeppel, Richard <rschroe@sandia.gov> [Apr 25. 2006 18:17]:
A potential problem with dropping the priority queues is the amount of random memory access. If your problem won't fit in the memory cache, things slow down a lot.

There's a way of organizing priority queues that drops most of the (heavy) reorganization overhead. The key idea is to use a larger base, like 256 instead of 2. Bins come in a few sizes: the stuff to be processed in the immediate future (about 256 cells), the less-than-immediate future (256 cells, each holding unsorted values in a width-256 range), and the far future (256 cells, each holding a width-65536 range). You spend the bulk of your time doing whatever processing is required for individual values. When a low-level 256-range is completed, you open up the next higher-level cell and spread its contents out into the low-level 256 array. There's never any sorting, and only one or two levels of rebinning into smaller bins. You can tinker with the bin-size ratios at each level to optimize for your problem; there's nothing sacred about 256. The advantage of this scheme is that it's compatible with cache memory, and it will even work tolerably if there are disk files involved.
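[A minimal sketch of the multi-bin scheme described above, with two levels only (a third, far-future level would follow the same pattern). All names and the intrusive linked-list representation are illustrative assumptions, not the code from the thread.]

```c
/* Two-level bucket priority queue: one cell per key in the current
 * width-256 block, plus 256 coarser cells each covering a width-256
 * range of the future.  Events are intrusive singly linked lists. */
#include <string.h>

typedef struct Event {
    unsigned long key;          /* time at which the event fires */
    struct Event *next;
} Event;

enum { NEAR = 256, FAR = 256 };

typedef struct {
    unsigned long base;         /* current time; lowest key still pending */
    Event *near[NEAR];          /* near[k % NEAR]: events in the current
                                   width-NEAR block, one cell per key    */
    Event *far[FAR];            /* far[j]: unsorted events with key in
                                   [blk + NEAR*(j+1), blk + NEAR*(j+2)),
                                   blk = base rounded down to NEAR       */
} BucketQueue;

/* Insert an event with e->key >= q->base (keys beyond the far horizon
 * would go into the omitted third level). */
static void bq_insert(BucketQueue *q, Event *e)
{
    unsigned long blk = q->base & ~(unsigned long)(NEAR - 1);
    if (e->key < blk + NEAR) {
        e->next = q->near[e->key % NEAR];
        q->near[e->key % NEAR] = e;
    } else {
        unsigned long j = (e->key - blk) / NEAR - 1;  /* assumed < FAR */
        e->next = q->far[j];
        q->far[j] = e;
    }
}

/* Return all events with key == q->base and advance the clock by one.
 * Crossing a block boundary spreads one far cell into the near array;
 * that is the only reorganization this scheme ever performs. */
static Event *bq_step(BucketQueue *q)
{
    Event *hits = q->near[q->base % NEAR];
    q->near[q->base % NEAR] = 0;
    q->base++;
    if (q->base % NEAR == 0) {          /* crossed into the next block */
        Event *spill = q->far[0];
        memmove(q->far, q->far + 1, (FAR - 1) * sizeof q->far[0]);
        q->far[FAR - 1] = 0;
        while (spill) {                 /* rebin into the near array */
            Event *nx = spill->next;
            bq_insert(q, spill);
            spill = nx;
        }
    }
    return hits;
}
```

Per step, all work is a couple of loads and stores; the memmove and rebinning cost is amortized over 256 steps, and both arrays together fit comfortably in a level-one cache.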
One interesting gimmick with divisor lists is that the insertion of the next-number-with-this-divisor item (item: N+D, divisor D) into slot N+D can always be at the top of the list. When the time comes to process N+D, its list of divisors will already be in sorted (increasing) order.
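[A minimal sketch of this front-insertion trick, as a rolling sieve over prime divisors; all names are illustrative. Slot N+D receives its larger divisors earlier (they come from smaller N), so pushing each new item onto the front leaves every list in increasing order without any sorting.]

```c
/* slot[n] holds the prime divisors of n as a linked list.  When divisor
 * D of N is handled, (N+D, D) is pushed onto the FRONT of slot N+D. */
#include <assert.h>
#include <stdlib.h>

enum { LIMIT = 1000 };

typedef struct Div {
    unsigned long d;
    struct Div *next;
} Div;

static Div *slot[2 * LIMIT];        /* big enough for n + d with d <= n */

static void push_front(unsigned long n, unsigned long d)
{
    Div *e = malloc(sizeof *e);
    e->d = d;
    e->next = slot[n];
    slot[n] = e;
}

/* Roll through 2..LIMIT-1; afterwards slot[n] lists the prime divisors
 * of each composite n in increasing order. */
static void sieve_to_limit(void)
{
    unsigned long n, prev;
    Div *e;
    for (n = 2; n < LIMIT; n++) {
        if (!slot[n]) {             /* nothing arrived: n is prime */
            push_front(2 * n, n);
        } else {
            for (prev = 0, e = slot[n]; e; e = e->next) {
                assert(e->d > prev);    /* already sorted, for free */
                prev = e->d;
                push_front(n + e->d, e->d);
            }
        }
    }
}
```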
Rich
[...]
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
* Schroeppel, Richard <rschroe@sandia.gov> [Apr 26. 2006 10:01]:
> You can use very rough approximate logarithms.
Yes, the code is extremely robust; you can even (in the prefill phase) just write zeros.
> In my factor sieving code,
Guess why I am so eager to have fast sieving code! Which factoring algorithm do you use it in? I'll look into factoring in the near future.
> I used base cbrt2, and my logs all fit in a 9-bit field. Computing these logs is pretty fast: find the highest bit of N, left normalize, compare to the normalized 2^(1/6) and 2^(3/6).
However fast your rough-log is, my trick gives you more: I search all X < 10^14. A logarithm costs, say, 250 cycles, and a logarithm is computed _only_ with each finding. I find fewer than 50,000 X. So, per test, the cost is (in cycles) less than 250*50000/10^14 == 1/8000000. Unless your rough-log computes more than 8 million rough-logs _per_cycle_, your code will be slower. But then, what is the set of primes you allow? Is it rather huge? And what is the density of findings? (mine: 50000/10^14)
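[For reference, the rough logarithm Rich describes can be sketched as below. The details (16-bit normalized mantissa, the gcc builtin, a 64-bit unsigned long as on AMD64) are my assumptions; the constants are 2^(1/6) and 2^(3/6) scaled by 2^15.]

```c
/* Rough log to base cbrt2, so rough_log(n) ~ 3*log2(n): take the index
 * of the highest set bit, left-normalize, and compare the high mantissa
 * bits against the normalized 2^(1/6) and 2^(3/6) to pick a third. */
static unsigned rough_log_cbrt2(unsigned long n)    /* n >= 1 */
{
    unsigned hb = 63 - (unsigned)__builtin_clzl(n); /* floor(log2 n) */
    /* left-normalize so the top bit lands at position 63,
       then keep the 16 high-order mantissa bits */
    unsigned m = (unsigned)((n << (63 - hb)) >> 48);
    unsigned thirds = 3 * hb;
    if (m >= 36781) thirds++;       /* m >= 2^(1/6) * 2^15 */
    if (m >= 46341) thirds++;       /* m >= 2^(3/6) * 2^15 */
    /* a third compare against 2^(5/6) * 2^15 (= 58386) would also round
       mantissas just below a power of two to the nearest third; the
       quoted scheme tolerates that extra half-unit of noise */
    return thirds;
}
```

Two compares and a shift per value; the 9-bit result field then covers multiprecision N up to roughly 2^170.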
> The final quotient in your factorization of X^2+1 is either close to 1, or very far away; so you can tolerate a lot of rounding noise from subtracting approximate logs. On today's processors, there's likely no benefit to using bytes, but you still only need a couple of bits of mantissa precision for your logs. You can compute log(prime) with a table lookup on the high order few bits.
Note that the float instructions go completely in parallel with the integer instructions. Further, any conversion from float to int (or back) is still quite expensive; such conversions should _never_ occur in loops. I tried different data types for the quantities I use; it turns out unsigned long (machine word!) and double (machine word!) do best.
> But as you say, "fast enough".
> Rich
Thanks again for your very helpful advice!