[math-fun] bzip compression; Move2Front algorithm
I just saw a video lecture on data compression, and it claimed that Bentley et al.'s compression scheme in CACM 29, 4 (April 1986), 320-330, is used in current bzip (bzip2?).

Basically, Bentley's scheme assumes that both the sender & receiver keep an ordered list (hopefully the same list!), and when a symbol is found on the list, the _index_ into the list is sent, and both parties move that symbol to the front of the list. If the symbol is not on the list, then "n+1" is sent (where n is the current list length), indicating that the new symbol itself follows, and both sides then grow their lists to n+1. (A small sketch of this scheme appears at the end of this message.)

If the file to be compressed has "locality of reference", then Bentley's scheme can do better than any static probability scheme -- e.g., Huffman coding.

It's bizarre, but Bentley doesn't mention the relationship between his algorithm and Belady's OPT optimal page-replacement algorithm (used in Knuth's TeX, for example, for "paging" character font info). Traditionally, Belady's OPT algorithm isn't used, because it basically does LRU on the _reversed_ reference string, so as to kick out the page that will be used furthest in the future. Only in certain restricted cases (e.g., multi-pass analysis of media files) can OPT be used, because the future isn't normally known in advance. (A small LRU-vs-OPT simulation also appears at the end of this message.)

The takeaway from Belady's algorithm & paper (as I recall) is that forwards LRU with twice the memory incurs at most twice as many page faults as Belady's OPT (backwards LRU).

But file compression would seem to be oblivious to the direction of processing, so whether the file is processed forwards or backwards shouldn't matter very much; if it does matter, do both & output the better result.

I realize that it's been 25 years since Bentley published this paper, so _somebody_ must have noticed this similarity, especially because in the interim, the operating-systems folks have stored _compressed_ page-reference traces for later analysis.

If Bentley & Belady are right, then any reference traces are _already_ (partially) bzip-compressed if they've been through an LRU (i.e., "stack") page-replacement algorithm!
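Here's a minimal sketch (in Python) of the move-to-front scheme described above -- my own illustration, not necessarily Bentley et al.'s exact encoding. It assumes an initially empty list, 1-based indices, and "n+1" (one past the current list length) as the escape code announcing a new symbol:

    def mtf_encode(symbols):
        table = []                      # the shared ordered list, initially empty
        out = []
        for s in symbols:
            if s in table:
                i = table.index(s)
                out.append(i + 1)       # send the 1-based index...
                table.pop(i)
                table.insert(0, s)      # ...and move the symbol to the front
            else:
                out.append(len(table) + 1)  # escape "n+1": a new symbol follows
                out.append(s)               # the literal new symbol
                table.insert(0, s)          # both sides grow their lists to n+1
        return out

    def mtf_decode(codes):
        table = []
        out = []
        it = iter(codes)
        for c in it:
            if c == len(table) + 1:     # escape: the next item is the literal symbol
                s = next(it)
            else:
                s = table.pop(c - 1)    # 1-based index into the shared list
            out.append(s)
            table.insert(0, s)          # mirror the encoder's move-to-front
        return out

    msg = list("abracadabra")
    assert mtf_decode(mtf_encode(msg)) == msg

Note that under locality of reference the emitted indices are mostly small, which is exactly what a back-end variable-length coder can exploit.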
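And a small simulation sketch (again Python, with a made-up demo trace) contrasting the two paging policies discussed above: LRU evicts the page unused for the longest time, while Belady's OPT evicts the page whose next use lies furthest in the future -- which is why OPT needs the whole future in advance:

    from collections import OrderedDict

    def lru_faults(trace, cache_size):
        cache = OrderedDict()               # keys kept in LRU order, oldest first
        faults = 0
        for page in trace:
            if page in cache:
                cache.move_to_end(page)     # refresh: now most recently used
            else:
                faults += 1
                if len(cache) >= cache_size:
                    cache.popitem(last=False)   # evict the least recently used
                cache[page] = True
        return faults

    def opt_faults(trace, cache_size):
        cache = set()
        faults = 0
        for t, page in enumerate(trace):
            if page in cache:
                continue
            faults += 1
            if len(cache) >= cache_size:
                # evict the page whose next use is furthest in the future
                def next_use(p):
                    return next((u for u in range(t + 1, len(trace))
                                 if trace[u] == p), float('inf'))
                cache.remove(max(cache, key=next_use))
            cache.add(page)
        return faults

    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print(lru_faults(trace, 3), opt_faults(trace, 3))   # prints "10 7": LRU never beats OPT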
Peter Elias gives a nice analysis of these schemes in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9361 .

Victor
participants (2)
- Henry Baker
- Victor Miller