Re: [math-fun] compressed bitmaps of the primes?
Very interesting! Thanks! I would imagine that the "zip"-type programs would do better with larger dictionaries -- especially on larger files -- so perhaps this would help to improve these numbers. Perhaps there is an additional argument to these programs to allow for larger dictionaries? (Many of these compression programs were developed in an earlier era, when PCs had main memories in the megabyte range instead of the gigabyte range; so perhaps their idea of a "large" dictionary needs to be revised.) I wonder if "zip"-type programs would do even as well as your 2-level scheme; i.e., could these programs notice the patterns of 0's when the number is divisible by 2, 3, or 5? At 11:35 AM 5/22/2012, Tom Rokicki wrote:
Well, I do have a sieve program that outputs bitmaps, but these bitmaps represent each sequence of 30 integers with a single byte (one bit for each of the eight values mod 30 that are not divisible by 2, 3, or 5).
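For concreteness, here is a small sketch (mine, not Tom's actual program) of the packing he describes; `WHEEL`, `simple_sieve`, and `pack_mod30` are illustrative names. Note that 2, 3, and 5 themselves get no bit in this scheme and must be assumed known.

```python
# The 8 residues mod 30 that are not divisible by 2, 3, or 5.
WHEEL = (1, 7, 11, 13, 17, 19, 23, 29)

def simple_sieve(limit):
    """Sieve of Eratosthenes: returns a bytearray with is_p[n] == 1 iff n is prime."""
    is_p = bytearray([1]) * limit
    is_p[0:2] = b'\x00\x00'
    for i in range(2, int(limit ** 0.5) + 1):
        if is_p[i]:
            is_p[i * i::i] = bytearray(len(is_p[i * i::i]))
    return is_p

def pack_mod30(limit):
    """One byte per block of 30 integers; bit j is set iff 30*k + WHEEL[j] is prime."""
    is_p = simple_sieve(limit)
    out = bytearray()
    for base in range(0, limit, 30):
        byte = 0
        for j, r in enumerate(WHEEL):
            if base + r < limit and is_p[base + r]:
                byte |= 1 << j
        out.append(byte)
    return bytes(out)

bitmap = pack_mod30(300)  # 10 bytes cover the integers below 300
```

In the first byte, only the bit for residue 1 is clear (7, 11, 13, 17, 19, 23, 29 are all prime); in the second, only the bit for 49 = 7^2 is clear.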
Running this out to a 1 GB data file (representing all primes less than 32,212,254,720) and compressing the resulting file, I see the following sizes:
1073741824 primes.dat
 798639547 primes.dat.bz2
 761814389 primes.dat.gz
 761814523 primes.zip
I'm a bit surprised the compressors don't do better; just using arithmetic coding and the probability of a zero or one bit should give about a 0.639 compression ratio, but the best we see above is 0.709.
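A figure like the 0.639 above can be reproduced from the zeroth-order entropy of the packed bit stream. A sketch (my own back-of-envelope, using the prime-count estimate pi(N) ~ N / (ln N - 1)):

```python
import math

def entropy_ratio(p):
    """Zeroth-order Shannon entropy (output bits per input bit) of a
    Bernoulli(p) bit stream; an ideal arithmetic coder approaches this."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

N = 32_212_254_720
# Only 8 of every 30 integers get a bit, so the fraction of one-bits is
# roughly pi(N) / (8 * N / 30) = 30 / (8 * (ln N - 1)).
p = 30 / (8 * (math.log(N) - 1))
ratio = entropy_ratio(p)  # close to the 0.639 figure quoted above
```

The exact one-bit probability from the real file would differ slightly, since prime density varies across the range.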
On Tue, May 22, 2012 at 11:00 AM, Henry Baker <hbaker1@pipeline.com> wrote:
Has anyone analyzed the expected size of bitmaps of the primes?
Specifically, consider the set of primes less than some number N, and then represent those primes as "1" bits in a bit vector of length N.
Then compress this bit vector using common techniques such as zip, gzip, 7zip, etc.
What is the growth rate of such compressed bitmaps as a function of N? Are these schemes any better than a simple run-length encoding?
Has anyone even performed any experiments?
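One way to run such an experiment at small scale (a sketch; the helper name is mine) is to build the one-bit-per-integer bitmap and feed it to zlib:

```python
import zlib

def prime_bitmap(n):
    """Bit vector for the integers below n, bit i set iff i is prime,
    packed 8 bits per byte, least-significant bit first."""
    is_p = bytearray([1]) * n                 # sieve of Eratosthenes
    is_p[0:2] = b'\x00\x00'
    for i in range(2, int(n ** 0.5) + 1):
        if is_p[i]:
            is_p[i * i::i] = bytearray(len(is_p[i * i::i]))
    out = bytearray((n + 7) // 8)
    for i, flag in enumerate(is_p):
        if flag:
            out[i >> 3] |= 1 << (i & 7)
    return bytes(out)

bitmap = prime_bitmap(1_000_000)   # 125000 bytes
packed = zlib.compress(bitmap, 9)  # DEFLATE at maximum effort
ratio = len(packed) / len(bitmap)  # comfortably below 1
```

Swapping in `bz2.compress` or `lzma.compress` gives the other families of compressors for comparison.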
Thanks in advance for any info.
P.S. Yes, I know, you can encode the entire infinite bitmap in a fixed/finite number of bits, where these bits encode a Turing Machine that spits out all of the primes. That's not what I'm asking here.
-- -- http://cube20.org/ -- http://golly.sf.net/ --
Right. Anything based on the zlib package (e.g. gzip) has a control that you can set from 1 to 9 (where 1 = faster, 9 = better). You can spend a _lot_ more time looking for the very best dictionary. - John On 05/22/2012 02:43 PM, Henry Baker wrote:
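The control John describes is exposed directly in Python's zlib bindings; a minimal sketch (the sample data is arbitrary):

```python
import zlib

# zlib-based tools take an effort level from 1 (fastest) to 9 (best
# compression); higher levels search harder for matches within the
# 32 KB DEFLATE window.
data = bytes(range(256)) * 1024   # 256 KB of repetitive sample data
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)
```

Both round-trip losslessly; level 9 simply spends more time hunting for longer back-references.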
I would imagine that the "zip"-type programs would do better with larger dictionaries -- especially on larger files -- so perhaps this would help to improve these numbers. Perhaps there is an additional argument to these programs to allow for larger dictionaries?
As near as I can tell, bzip2 defaults to -9 (and there's no higher setting, even though it is fast). gzip -9 is very slow. I also added 7z.

First, for the 30-integers-in-one-byte representation, they all fall far short of arithmetic coding, with not too much variation. The slightly improved compression of 7z is offset by its greatly increased CPU requirements in this case:

 745501238 primes.7z
 761333494 primes.dat.gz9
 761814389 primes.dat.gz
 761814523 primes.zip
 798639547 primes.dat.bz2
1073741824 primes.dat

When using a one-bit-per-integer representation, all the compressed files increase substantially in size. I was surprised by this; I had expected such a simple, almost symbol-for-symbol transformation not to affect the final sizes of the compressed files very much. But in this case the greater redundancy appears to let the different compression programs show their differences:

 886586498 primesb.7z
 975278536 primesb.bz2
1038286643 primesb.gz9
1130861176 primesb.gz
1130861312 primesb.zip
4026531840 primesb.dat

If it were indeed important to store a compressed bitmap of the primes, a simple technique of omitting multiples of 2, 3, 5, 7, 11, 13, 17, 19, 23 (or some such) would do better than any of these programs, showing once again that domain-specific compression that takes advantage of domain knowledge usually wins.

I often use this technique when looking for patterns in things like pruning tables for permutation puzzles and the like: try compressing different representations and see what happens. Almost always the results are obvious once you understand why, but gaining that understanding through empirical compression can help the process of enlightenment.

-tom
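The wheel idea Tom mentions can be sized up without building any tables: a wheel mod the product M of the chosen primes keeps one bit only for residues coprime to M, i.e. phi(M)/M bits per integer. A small sketch (mine, not from the thread):

```python
def wheel_density(primes):
    """Fraction of integers a wheel must keep a bit for:
    phi(M)/M = product over the wheel primes p of (1 - 1/p)."""
    d = 1.0
    for p in primes:
        d *= 1 - 1 / p
    return d

mod30 = wheel_density([2, 3, 5])                         # 8/30, about 0.267
mod23 = wheel_density([2, 3, 5, 7, 11, 13, 17, 19, 23])  # about 0.164
```

So a wheel over the primes through 23 stores roughly 0.164 bits per integer, already below the roughly 0.22 ratio the best general-purpose compressor achieved on the one-bit-per-integer file above, before any entropy coding is applied.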
participants (3)
- Henry Baker
- John Aspinall
- Tom Rokicki