Re: [math-fun] bzip compression; Move2Front algorithm
Thanks, Victor, for the reference & link to the paper. I looked through it, and Elias also ignores the Belady OPT "paging" algorithm. But it is forgivable for Elias -- he comes from a probability & coding background, where the forwards and backwards sequences from a source won't be distinguishable using their probability distributions. But Bentley et al. are computer scientists: they know that running a program backwards is very different from running it forwards, and they probably taught Belady's OPT algorithm in their classes!

Basically, I'm trying to build a "unified theory" for a computer's instruction set, one that incorporates both the register set (and/or stack) and the LRU cache, and that shows a "stack-like" architecture is optimal, at least as far as instruction-encoding compactness is concerned. To the extent that moving items out of (cache) memory to the top of the stack is a form of "recency rank" encoding, such a stack machine is already performing a "move2front" compression algorithm.

See this paper for some of my earlier thoughts on this problem; it maps a version of Lisp onto a Forth/Postscript stack machine interpreter.

http://home.pipeline.com/~hbaker1/LinearLisp.html

At 06:17 AM 7/20/2012, Victor Miller wrote:
Peter Elias gives a nice analysis of these schemes in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9361 .
Victor
On Wed, Jul 18, 2012 at 12:15 AM, Henry Baker <hbaker1@pipeline.com> wrote:
I just saw a video lecture on data compression, and it claimed that Bentley et al.'s compression scheme in CACM 29, 4 (April 1986), 320-330, is used in current bzip (bzip2?).
Basically, Bentley's scheme assumes that both the sender & receiver keep an ordered list (hopefully the same list!), and when a symbol is found on the list, the _index_ into the list is sent, and both parties move that symbol to the front of the list. If the symbol is not on the list, then "n+1" is sent, indicating that the new symbol itself follows in the stream, and both sides then grow their lists to n+1 entries.
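Here is a minimal sketch of such a move-to-front codec (my own illustration, not Bentley et al.'s code), assuming single-character symbols, 1-based indices, and "n+1" used as the new-symbol escape:

def mtf_encode(symbols):
    lst = []                              # the shared ordered list, initially empty
    out = []
    for s in symbols:
        if s in lst:
            out.append(lst.index(s) + 1)  # 1-based index into the list
            lst.remove(s)
        else:
            out.append(len(lst) + 1)      # "n+1" escape: literal symbol follows
            out.append(s)
        lst.insert(0, s)                  # move (or place) the symbol at the front
    return out

def mtf_decode(codes):
    lst = []
    out = []
    codes = iter(codes)
    for c in codes:
        if c == len(lst) + 1:             # escape code: next item is the literal symbol
            s = next(codes)
        else:
            s = lst[c - 1]
            lst.remove(s)
        out.append(s)
        lst.insert(0, s)
    return out

codes = mtf_encode("aaabbbaaaccc")        # locality => mostly small indices
assert "".join(mtf_decode(codes)) == "aaabbbaaaccc"
print(codes)

On an input with locality of reference, most of the transmitted indices are small (here mostly 1s), which is exactly what a back-end entropy coder such as Huffman can exploit.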
If the file to be compressed has "locality of reference", then Bentley's scheme can do better than any static probability scheme -- e.g., Huffman coding.
It's bizarre, but Bentley doesn't mention the relationship between his algorithm and Belady's OPT optimal page replacement algorithm (used in Knuth's TeX, for example, for "paging" character font info). Traditionally, Belady's OPT algorithm isn't used, because it basically does LRU on the _reversed_ reference string, so as to kick out the page to be used the furthest into the future. Only in certain restricted cases (e.g., multi-pass analysis on media files) can OPT be used, because the future isn't normally known in advance.
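For anyone who wants to play with that claim, here is a rough simulation of my own (not from either paper): it counts page faults for Belady OPT run forwards and for LRU run both forwards and on the reversed reference string; the trace and cache size are arbitrary choices for illustration.

import random

def lru_faults(trace, size):
    cache, faults = [], 0                 # front of list = most recently used
    for page in trace:
        if page in cache:
            cache.remove(page)
        else:
            faults += 1
            if len(cache) == size:
                cache.pop()               # evict the least recently used page
        cache.insert(0, page)
    return faults

def opt_faults(trace, size):
    cache, faults = set(), 0
    for i, page in enumerate(trace):
        if page not in cache:
            faults += 1
            if len(cache) == size:
                def next_use(p):          # position of p's next use (inf if never used again)
                    for j in range(i + 1, len(trace)):
                        if trace[j] == p:
                            return j
                    return float('inf')
                cache.remove(max(cache, key=next_use))   # Belady: evict the farthest next use
            cache.add(page)
    return faults

random.seed(0)
trace = [random.choice("abcdefgh") for _ in range(2000)]  # a random trace over 8 pages
k = 4
print("OPT, forwards:        ", opt_faults(trace, k))
print("LRU on reversed trace:", lru_faults(trace[::-1], k))
print("LRU, forwards:        ", lru_faults(trace, k))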
The takeaway from Belady's algorithm & paper (as I recall) is that (forwards) LRU with twice the memory incurs at most twice as many page faults as Belady OPT (backwards LRU).
But file compression would seem to be oblivious to the direction of processing, so whether the file is processed forwards or backwards shouldn't matter very much; if it does, do both & output the better result.
I realize that it's been 25 years since Bentley published this paper, so _somebody_ must have noticed this similarity, especially because in the interim, the operating systems guys have stored _compressed_ page reference traces for later analysis.
If Bentley & Belady are right, then any bzip-compressed reference traces are _already_ compressed if they've been through an LRU (i.e., "stack") page-replacement algorithm!