From: Henry Baker <hbaker1@pipeline.com>
I just saw a video lecture on data compression, and it claimed that Bentley et al.'s compression scheme in CACM 29,4 (April 1986), 320-330, is used in current bzip (bzip2?).
Yes. It is an LRU-style scheme -- usually called move-to-front (MTF) coding -- and it's used as *part of* the bzip2 scheme. The main part is called the Burrows-Wheeler Transform (BWT).
Basically, Bentley's scheme assumes that both the sender & receiver keep an ordered list (hopefully the same list!) and when a symbol is found on the list, the _index_ into the list is sent, and both parties move that symbol to the front of the list. If the symbol is not on the list, then "n+1" is sent, indicating that the new symbol itself follows the escape code, and both sides then grow their lists to n+1 entries.
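A minimal Python sketch of the scheme as described above -- list starts empty, indices are 1-based so that "n+1" is the escape, and the `('new', s)` marker standing in for the literal symbol is my own notation, not Bentley's:

```python
def mtf_encode(msg):
    """Move-to-front encode with a growing list and an n+1 escape code."""
    lst, out = [], []
    for s in msg:
        if s in lst:
            i = lst.index(s)
            out.append(i + 1)          # send 1-based index into the list
            lst.pop(i)
        else:
            out.append(len(lst) + 1)   # escape: "n+1" for current list size n
            out.append(('new', s))     # ...followed by the new symbol itself
        lst.insert(0, s)               # both sides move it to the front
    return out

def mtf_decode(codes):
    """Inverse: maintain the same list, so indices resolve identically."""
    lst, out = [], []
    it = iter(codes)
    for c in it:
        if c == len(lst) + 1:          # escape code: literal symbol follows
            _, s = next(it)
        else:
            s = lst.pop(c - 1)
        out.append(s)
        lst.insert(0, s)
    return ''.join(out)
```

Note that valid indices run 1..n, so the escape value n+1 can never be confused with a real index; that is what makes the growing-list variant self-describing.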
Or you start the list full of all the possibilities. In bzip2's case, 257.
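The preloaded-list variant in miniature, with the list started full of all 256 byte values (bzip2's real pipeline runs its MTF on the BWT output and differs in detail; this only illustrates the "start the list full" idea, and the function name is mine):

```python
def mtf_fixed(data):
    """Move-to-front over a list preloaded with all 256 byte values."""
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)    # position of this byte in the current list
        out.append(i)
        table.pop(i)
        table.insert(0, b)    # move it to the front
    return out
```

With the list preloaded there is no escape code to send: every byte already has an index, and repeated bytes encode as runs of zeros.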
If the file to be compressed has "locality of reference", then Bentley's scheme can do better than any static probability scheme -- e.g., Huffman coding.
...when combined with run-length and Huffman coding of the list indexes. (Otherwise you're just encoding bytes as bytes.) So, the BWT does this (not quite true): say you have a message of N bytes.

1. Append a unique EOF character.
2. Treat the message as a loop of N+1 bytes.
3. Take the length-(N+1) string starting at each position of the loop: N+1 strings of N+1 bytes each.
4. Sort the list lexically.
5. Read down the *last* column of the sorted list; that string is the BWT. The process is reversible.
6. Encode that with the LRU trick, then with run-length and Huffman codes.

Or, as Burrows and Wheeler say: transform the input into a form that has locality, then transform that into a skewed distribution, then use a compression method that takes advantage of the skewed distribution.
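The transform and its inverse as a naive Python sketch -- this builds and sorts all N+1 rotations explicitly, so it is O(N^2 log N); real implementations use suffix sorting, and bzip2 handles the EOF issue differently, but the shape of the algorithm is the one described above:

```python
def bwt(s, eof='\0'):
    """Burrows-Wheeler Transform: last column of the sorted rotations."""
    assert eof not in s
    s += eof                                   # append a unique EOF character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)   # read down the last column

def ibwt(last, eof='\0'):
    """Invert by repeatedly prepending the last column and re-sorting:
    after k rounds, table holds the sorted length-k prefixes of the rotations."""
    table = [''] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(eof))  # the original rotation
    return row[:-1]                                  # strip the EOF character
```

Sorting groups rotations by their leading context, so the last column -- the character *preceding* each context -- comes out with exactly the locality of reference the MTF step then exploits.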
Belady's OPT optimal page replacement algorithm... basically does LRU on the _reversed_ reference string, so as to kick out the page to be used the furthest into the future.
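A toy LRU fault counter for playing with the reversal idea on small traces -- nothing here is Belady's actual formulation, just a sketch you can run forwards and backwards over the same reference string:

```python
from collections import OrderedDict

def lru_faults(trace, capacity):
    """Count page faults for an LRU cache of the given capacity."""
    cache = OrderedDict()   # keys kept in least- to most-recently-used order
    faults = 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)        # hit: mark most recently used
        else:
            faults += 1                    # miss
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return faults
```

The connection to the text: "least recently used" in the reversed string is "used furthest in the future" in the forward string, which is exactly the page OPT evicts.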
Not familiar with.
But file compression would seem to be oblivious to the direction of processing, so whether the file is processed forwards or backwards shouldn't matter very much; if it does, do both & output the better result.
(Or do both on the BWT of the file, which isn't really time-like.)
If Bentley & Belady are right, then any bzip-compressed reference traces are _already_ compressed if they've been through an LRU (i.e., "stack") page-replacement algorithm!
This may be foiled by having the BWT between the two LRUs. --Steve