I changed main() to take a couple of arguments for the loop counts and
made statevec[] volatile to (try to) ensure that the compiler would not
optimize the loop out.

With that, 9999 warmup loops and 9999999 output loops take 1.56 seconds
on my 2.8 GHz Phenom II, compiled with:

    :; gcc-4.6.2 -O3 -ggdb3 -march=amdfam10 -floop-interchange \
         -floop-strip-mine -floop-block -flto -fwhole-program \
         warren-rand.c -o warren-rand

That makes ~6416665 loops/second (the warmup loops are obviously faster
than the output loops, given their lack of printf(3) calls; I'm ignoring
that for now), or 436 ticks per loop. Most of those ticks are spent in
printf(3).

Going the other route, a warmup of 9999999 loops and just one output
loop takes 0.05 seconds. That is 200M loops/s, or 14 ticks per loop.
999999999 + 1 loops take 5.72 s, or about 16 ticks per loop.

I did not unroll the loop, and neither did the -ftree-vectorize pass
implied by -O3. In theory it should be possible to reduce the number of
ticks per loop by using vector instructions.

-JimC
--
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6