[math-fun] bzip compression; Move2Front algorithm
I just saw a video lecture on data compression, and it claimed that Bentley et al.'s compression scheme in CACM 29, 4 (April 1986), 320-330, is used in current bzip (bzip2?).

Basically, Bentley's scheme assumes that both the sender & receiver keep an ordered list (hopefully the same list!), and when a symbol is found on the list, the _index_ into the list is sent, and both parties move that symbol to the front of the list. If the symbol is not on the list, then "n+1" is sent (where n is the current list length), indicating that the new symbol itself follows, and both sides then grow their lists to n+1. (A small sketch of this scheme appears at the end of this message.)

If the file to be compressed has "locality of reference", then Bentley's scheme can do better than any static probability scheme -- e.g., Huffman coding.

It's bizarre, but Bentley doesn't mention the relationship between his algorithm and Belady's OPT optimal page-replacement algorithm (used in Knuth's TeX, for example, for "paging" character font info). Traditionally, Belady's OPT algorithm isn't used, because it basically does LRU on the _reversed_ reference string, so as to kick out the page that will be used furthest in the future. Only in certain restricted cases (e.g., multi-pass analysis of media files) can OPT be used, because the future isn't normally known in advance. (A small LRU-vs-OPT simulation also appears at the end of this message.)

The takeaway from Belady's algorithm & paper (as I recall) is that forwards LRU with twice the memory incurs at most twice as many page faults as Belady's OPT (backwards LRU).

But file compression would seem to be oblivious to the direction of processing, so whether the file is processed forwards or backwards shouldn't matter very much; if it does matter, do both & output the better result.

I realize that it's been 25 years since Bentley published this paper, so _somebody_ must have noticed this similarity, especially because in the interim, the operating-systems folks have stored _compressed_ page-reference traces for later analysis.

If Bentley & Belady are right, then any reference traces are _already_ (partially) bzip-compressed if they've been through an LRU (i.e., "stack") page-replacement algorithm!
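Here's a minimal sketch (in Python) of the move-to-front scheme described above -- my own illustration, not necessarily Bentley et al.'s exact encoding. It assumes an initially empty list, 1-based indices, and "n+1" (one past the current list length) as the escape code announcing a new symbol:

    def mtf_encode(symbols):
        table = []                      # the shared ordered list, initially empty
        out = []
        for s in symbols:
            if s in table:
                i = table.index(s)
                out.append(i + 1)       # send the 1-based index...
                table.pop(i)
                table.insert(0, s)      # ...and move the symbol to the front
            else:
                out.append(len(table) + 1)  # escape "n+1": a new symbol follows
                out.append(s)               # the literal new symbol
                table.insert(0, s)          # both sides grow their lists to n+1
        return out

    def mtf_decode(codes):
        table = []
        out = []
        it = iter(codes)
        for c in it:
            if c == len(table) + 1:     # escape: the next item is the literal symbol
                s = next(it)
            else:
                s = table.pop(c - 1)    # 1-based index into the shared list
            out.append(s)
            table.insert(0, s)          # mirror the encoder's move-to-front
        return out

    msg = list("abracadabra")
    assert mtf_decode(mtf_encode(msg)) == msg

Note that under locality of reference the emitted indices are mostly small, which is exactly what a back-end variable-length coder can exploit.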
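And a small simulation sketch (again Python, with a made-up demo trace) contrasting the two paging policies discussed above: LRU evicts the page unused for the longest time, while Belady's OPT evicts the page whose next use lies furthest in the future -- which is why OPT needs the whole future in advance:

    from collections import OrderedDict

    def lru_faults(trace, cache_size):
        cache = OrderedDict()               # keys kept in LRU order, oldest first
        faults = 0
        for page in trace:
            if page in cache:
                cache.move_to_end(page)     # refresh: now most recently used
            else:
                faults += 1
                if len(cache) >= cache_size:
                    cache.popitem(last=False)   # evict the least recently used
                cache[page] = True
        return faults

    def opt_faults(trace, cache_size):
        cache = set()
        faults = 0
        for t, page in enumerate(trace):
            if page in cache:
                continue
            faults += 1
            if len(cache) >= cache_size:
                # evict the page whose next use is furthest in the future
                def next_use(p):
                    return next((u for u in range(t + 1, len(trace))
                                 if trace[u] == p), float('inf'))
                cache.remove(max(cache, key=next_use))
            cache.add(page)
        return faults

    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print(lru_faults(trace, 3), opt_faults(trace, 3))   # prints "10 7": LRU never beats OPT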
Peter Elias gives a nice analysis of these schemes in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9361 .

Victor
participants (2)
- Henry Baker
- Victor Miller