[math-fun] OK, I programmed my proposed psu-random generator....
Will it pass the SuperCrush U01 tests? How fast will it be? http://www.iro.umontreal.ca/~simardr/testu01/tu01.html I have not run it thru randomness tests, but I crudely timed it for NumWordsInState=4 (period=2^256) on 64-bit iMac 2.0GHz, finding it produced random 64-bit words at a rate of 1 billion in 12 seconds, which is 3 cycles per random byte. Perhaps fiddling with the code and compiler would make it go faster? I manually unrolled loop and replaced the state 4-vector by 4 scalars; that sped it up to under 11 seconds. Another idea is to output more than one random word each time, which might mean you need a larger state vector but still in net could yield a speedup... --- #include <stdint.h> #include <stdio.h> //Warren D. Smith attempt at making psu-random generator. Compile with // gcc WarrenRand.c -Wall -O6 -o WarrenRand #define uint unsigned int #define uint32 uint32_t #define uint64 uint64_t #define uint128 __uint128_t #define RandStrideConst 14923729446516375051ull //must be odd //The below is written for a 64-bit machine. //If you have 32-bit machine then text-replace 64-->32 and then 128-->64 throughout the following: //period will be 2^(64 * NumWordsInState), increase for more randomness but slower runtime: #define NumWordsInState 4 inline uint64 UpdatorF(uint64 x){ uint128 y; y = x; x >>= (64-1); y *= y; x += y ^ (y>>64); return(x); } uint64 statevec[NumWordsInState]; uint64 Rand64(){ uint i; uint64 v; v = statevec[0]; for(i=NumWordsInState-1; i>0; i--){ //can be unrolled and all loop iterations run parallel v ^= statevec[i]; statevec[i] += UpdatorF(statevec[i-1]); } statevec[0] += RandStrideConst; return(v); //returns the XOR of all words in state vector } main(){ uint i; for(i=0; i<9999; i++){ Rand64(); } //warm it up for(i=0; i<99; i++){ printf("%llx\n", Rand64()); } //print some random machine words }
I changed main() to take a couple of arguments to use for the loop and made statevec[] volatile to (try to) ensure that the compiler would not optimize the loop out. With that, 9999 loops to warm up and 9999999 loops output, it takes 1.56 seconds on my 2.8 GHz phenomII, compiled with: :; gcc-4.6.2 -O3 -ggdb3 -march=amdfam10 -floop-interchange \ -floop-strip-mine -floop-block -flto -fwhole-program \ warren-rand.c -o warren-rand That makes ~ 6416665 loops/second (the warmup loops are obviously faster than the output loops, given their lack of printf(3) calls; I'm ignoring that for now) or 436 ticks per loop. Most of those are in printf(3). Going the other route, a warmup of 9999999 loops and just one output takes 0.05 seconds. That is 200M loops/s or 14 ticks per loop. 999999999 + 1 loops takes 5.72 s. I did not unroll the loop, and neither did the -ftree-vectorize implicit in -O3. In theory it should be possible to reduce the number of ticks per loop by using vector instructions. -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* James Cloos <cloos@jhcloos.com> [Feb 07. 2012 11:40]:
[...]
For benchmarking any printf() will make results meaningless ( 2000 cycles or so per printf() ). You have to stop your compiler from optimizing things away. A good way usually is to XOR every output into an (unsigned long) integer, which you'll need to print at the very end (else the compiler kann kill that as well). 200M/sec already sounds quite good btw. At Warren: if you do not believe in publishing you still can go for arxive (which I recommend to do).
On Monday 06 February 2012 18:47:40 Warren Smith wrote:
Will it pass the SuperCrush U01 tests? How fast will it be?
I ran the 32-bit version (with 4 words of state) through TestU01's SmallCrush battery, which completes in seconds, and it passed (with one marginal-looking p-value which evaporated on repeated testing). I'm running a BigCrush (~5 hours) on it now. -- g
On Tuesday 07 February 2012 02:17:23 I wrote:
On Monday 06 February 2012 18:47:40 Warren Smith wrote:
Will it pass the SuperCrush U01 tests? How fast will it be?
I ran the 32-bit version (with 4 words of state) through TestU01's SmallCrush battery, which completes in seconds, and it passed (with one marginal-looking p-value which evaporated on repeated testing). I'm running a BigCrush (~5 hours) on it now.
========= Summary results of BigCrush ========= Version: TestU01 1.2.3 Generator: warren/32/4 Number of statistics: 160 Total CPU time: 06:19:48.03 The following tests gave p-values outside [0.001, 0.9990]: (eps means a value < 1.0e-300): (eps1 means a value < 1.0e-15): Test p-value ---------------------------------------------- 7 CollisionOver, t = 7 3.1e-4 ---------------------------------------------- All other tests were passed (This is with a revised version of Warren's code -- he found a bug -- and for a 32-bit machine, with 4 words of state.) -- g
With 160 statistics, we should expect the smallest p-value to be around 3 * 10^-3, I suppose? I think there's something called Bonferroni that we can use to figure out whether a p-value as small as this one, in one out of 160 tests, should actually make us suspicious. Or run the test again and see what we get. This one looks a bit unusually small but not really surprisingly small when there are so many tests being done. One crude calculation would be to take (1 - 3.1 * 10^-4)^160 to see how likely it is to get at least one p-value this small or smaller, and that calculation comes out right around 5%, which is slightly surprising but not very surprising if the numbers generated were truly random. --Joshua Zucker On Tue, Feb 7, 2012 at 12:07 PM, Gareth McCaughan <gareth.mccaughan@pobox.com> wrote:
Number of statistics: 160 Total CPU time: 06:19:48.03 The following tests gave p-values outside [0.001, 0.9990]: (eps means a value < 1.0e-300): (eps1 means a value < 1.0e-15):
Test p-value ---------------------------------------------- 7 CollisionOver, t = 7 3.1e-4 ----------------------------------------------
participants (5)
-
Gareth McCaughan -
James Cloos -
Joerg Arndt -
Joshua Zucker -
Warren Smith