For those still reading, I am using

  % gcc --version
  gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux)

which produces scalar but breathtakingly well optimized (and often beautiful!) machine code. Producing hand-coded assembler that beats the gcc machine code is, if possible at all, extremely hard.

The machine is an AMD64 @ 2.2GHz, socket 939, dual channel DDR 200; memory transfer is 2GB/sec when the data is far beyond cache size. Important CPU characteristics are

  Name: AMD Athlon(tm) 64 Processor 3400+
  Family: 15, Model: 15, Stepping: 0
  Level 1 cache (data): 64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 1 cache (instr): 64 kB, 2-way associative. 64 bytes per line, lines per tag: 1.
  Level 2 cache: 512 kB, 16-way associative. 64 bytes per line, lines per tag: 1.
  fpu: x87 FPU
  sse: Streaming SIMD Extensions
  sse2: Streaming SIMD Extensions-2
  cmov: CMOV instruction (plus FPU FCMOVCC and FCOMI)

If you are considering an x86 system, then go for AMD64. The document http://swox.com/doc/x86-timing.pdf tells you why, in every little detail.
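For anyone who wants to check a memory transfer figure like the 2GB/sec quoted above on their own box, here is a rough sketch of one way to do it. This is not the program used for the number above; the function name copy_rate and its parameters are made up for illustration, and it simply times repeated memcpy() calls with the standard clock() function. Results will vary with compiler, flags and libc.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Volatile sink so the compiler cannot discard the copies as unused. */
static volatile unsigned char sink;

/*
 * Hypothetical helper (not from the original post): copy a buffer of
 * `bytes` bytes `reps` times and return the observed copy rate in
 * bytes per second. Returns -1.0 on bad arguments, allocation failure,
 * or if the elapsed time was too short to measure.
 */
double copy_rate(size_t bytes, int reps)
{
    unsigned char *src, *dst;
    clock_t start, stop;
    double seconds;
    int r;

    if (bytes == 0 || reps <= 0)
        return -1.0;

    src = malloc(bytes);
    dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    start = clock();
    for (r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;   /* change the input on every pass */
        memcpy(dst, src, bytes);
        sink ^= dst[bytes - 1];      /* consume the result */
    }
    stop = clock();

    free(src);
    free(dst);

    seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;

    return (double)bytes * reps / seconds;
}
```

Calling something like copy_rate(64UL * 1024 * 1024, 10) uses buffers far larger than the 512 kB L2 cache, so the result should mostly reflect main memory. Note that each copied byte is read once and written once, so the raw memory traffic is about twice the returned rate.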
Have you tried the PathScale compiler (www.pathscale.com)? It often reduces execution time by something on the order of 30% compared to gcc. The Intel compiler (icc) should also be faster than gcc. If you want, I could compile it for you with both compilers on an AMD64 (Opteron) machine and send you the binaries for benchmarking...

Christoph