Re: [math-fun] bzip compression; Move2Front algorithm
Thanks, Victor, for the reference & link to the paper. I looked through it, and Elias also ignores the Belady OPT "paging" algorithm. But it is forgivable for Elias -- he comes from a probability & coding background, where the forwards and backwards sequences from a source won't be distinguishable using their probability distributions. But Bentley et al. are computer scientists: they know that running a program backwards is very different from running it forwards, and they probably taught Belady's OPT algorithm in their classes!

Basically, I'm trying to build a "unified theory" for a computer's instruction set, one that incorporates both the register set (and/or stack) and the LRU cache, and that shows a "stack-like" architecture is optimal, at least as far as instruction-encoding compactness is concerned. To the extent that moving items out of (cache) memory to the top of the stack is a form of "recency rank" encoding, such a stack machine is already performing a "move2front" compression algorithm.

See this paper for some of my earlier thoughts on this problem; it maps a version of Lisp onto a Forth/Postscript stack machine interpreter.

http://home.pipeline.com/~hbaker1/LinearLisp.html

At 06:17 AM 7/20/2012, Victor Miller wrote:
Peter Elias gives a nice analysis of these schemes in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9361 .
Victor
On Wed, Jul 18, 2012 at 12:15 AM, Henry Baker <hbaker1@pipeline.com> wrote:
I just saw a video lecture on data compression, and it claimed that Bentley et al.'s compression scheme in CACM 29, 4 (April 1986), 320-330, is used in current bzip (bzip2?).
Basically, Bentley's scheme assumes that both the sender & receiver keep an ordered list (hopefully the same list!), and when a symbol is found on the list, the _index_ into the list is sent, and both parties move that symbol to the front of the list. If the symbol is not on the list, then "n+1" is sent, indicating that the new symbol itself follows in the stream, and both sides then grow their lists to n+1 entries.
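Here is a minimal sketch of such a move-to-front codec (my own illustration, not Bentley et al.'s code), assuming single-character symbols, 1-based indices, and "n+1" used as the new-symbol escape:

def mtf_encode(symbols):
    lst = []                              # the shared ordered list, initially empty
    out = []
    for s in symbols:
        if s in lst:
            out.append(lst.index(s) + 1)  # 1-based index into the list
            lst.remove(s)
        else:
            out.append(len(lst) + 1)      # "n+1" escape: literal symbol follows
            out.append(s)
        lst.insert(0, s)                  # move (or place) the symbol at the front
    return out

def mtf_decode(codes):
    lst = []
    out = []
    codes = iter(codes)
    for c in codes:
        if c == len(lst) + 1:             # escape code: next item is the literal symbol
            s = next(codes)
        else:
            s = lst[c - 1]
            lst.remove(s)
        out.append(s)
        lst.insert(0, s)
    return out

codes = mtf_encode("aaabbbaaaccc")        # locality => mostly small indices
assert "".join(mtf_decode(codes)) == "aaabbbaaaccc"
print(codes)

On an input with locality of reference, most of the transmitted indices are small (here mostly 1s), which is exactly what a back-end entropy coder such as Huffman can exploit.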
If the file to be compressed has "locality of reference", then Bentley's scheme can do better than any static probability scheme -- e.g., Huffman coding.
It's bizarre, but Bentley doesn't mention the relationship between his algorithm and Belady's OPT optimal page replacement algorithm (used in Knuth's TeX, for example, for "paging" character font info). Traditionally, Belady's OPT algorithm isn't used, because it basically does LRU on the _reversed_ reference string, so as to kick out the page to be used the furthest into the future. Only in certain restricted cases (e.g., multi-pass analysis on media files) can OPT be used, because the future isn't normally known in advance.
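For anyone who wants to play with that claim, here is a rough simulation of my own (not from either paper): it counts page faults for Belady OPT run forwards and for LRU run both forwards and on the reversed reference string; the trace and cache size are arbitrary choices for illustration.

import random

def lru_faults(trace, size):
    cache, faults = [], 0                 # front of list = most recently used
    for page in trace:
        if page in cache:
            cache.remove(page)
        else:
            faults += 1
            if len(cache) == size:
                cache.pop()               # evict the least recently used page
        cache.insert(0, page)
    return faults

def opt_faults(trace, size):
    cache, faults = set(), 0
    for i, page in enumerate(trace):
        if page not in cache:
            faults += 1
            if len(cache) == size:
                def next_use(p):          # position of p's next use (inf if never used again)
                    for j in range(i + 1, len(trace)):
                        if trace[j] == p:
                            return j
                    return float('inf')
                cache.remove(max(cache, key=next_use))   # Belady: evict the farthest next use
            cache.add(page)
    return faults

random.seed(0)
trace = [random.choice("abcdefgh") for _ in range(2000)]  # a random trace over 8 pages
k = 4
print("OPT, forwards:        ", opt_faults(trace, k))
print("LRU on reversed trace:", lru_faults(trace[::-1], k))
print("LRU, forwards:        ", lru_faults(trace, k))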
The takeaway from Belady's algorithm & paper (as I recall) is that (forwards) LRU with twice the memory incurs at most twice as many page faults as Belady OPT (backwards LRU).
But file compression would seem to be oblivious to the direction of processing, so whether the file is processed forwards or backwards shouldn't matter very much; if it does, do both & output the better result.
I realize that it's been 25 years since Bentley published this paper, so _somebody_ must have noticed this similarity, especially because in the interim, the operating systems guys have stored _compressed_ page reference traces for later analysis.
If Bentley & Belady are right, then any bzip-compressed reference traces are _already_ compressed if they've been through an LRU (i.e., "stack") page-replacement algorithm!