[Back to FAQ SWAG index] [Back to Main SWAG index] [Original]
{
Robert Rothenburg
Ok, since a few people have requested info about compression routines I
thought some actual descriptions of the algorithm(s) involved would be a
good idea. Rather than dig out someone else's text file or mail (which
might be copyrighted or arcane) I'll try my hand at explaining a few
methods.
It seems better that programmers have at least a rudimentary understanding
of the algortihms if they plan on using (or not using :) them.
DISCLAIMER: Please pardon any innacuracies: What I know is based on other
people's mail, text files or from what I've gleaned from spending a few
hours at the library (adivce: your local college research-library is a
wonderful resource of algorithms and source-code. Old magazines as well
as academic papers like the IEEE Transactions are worth examining).
In this insanely long post:
I. "Garbage Collection"
II. "Keywords"
III. Run-Length Encoding (RLE)
IV. Static and Dynamic Huffmann Codes <--(BTW, is it one or two n's?)
V. Lampev-Ziv (LZ)
VI. Lampev-Ziv-Welch (LZW) et al.
I. "Garbage Collection"
The simplest methods of compression in weeding out unnecessary data,
especially from text files. Spaces at the end of a line, or extra
carriage returns at the end of a file. (Especially with MS-DOS text
files, which use a Carriage Return *and* Line Feed--many editors and
file-browsers do not need both. Eliminating one of them can cut a large
text file's size by a few percent!)
I've also seen some utilities that "clean up" the headers of EXE
files (such as UNP) by eliminating `unnecessary' info. Other
utilities will clean up graphics files so that they'll compress
better.
In fact, removing excess "garbage" from any file will probably
improve its compressability.
II. "Keywords"
Another way to compress text files is to use a "keyword" for each word.
I.E. we can assume most text files will have less than 65,536 different
words (16 bits=two bytes), and thus we can assign a different value for
each word: since most words are longer than three letters (more than two
bytes), we will save space in the file. However, we'll need some form of
look-up table for each keyword--one way around this might be to use a
reference to a standard dictionary file that is included with an operating
system or word processor.
(This has other advantages as well. Many BASIC interpreters will store
commands like PRINT or INPUT as a character code rather than the entire
name in ASCII text. Not only does this save memory but improves the
run-speed of the program.)
This method can be adapted to other (non-text) files, of course.
III. Run-Length Encoding (RLE)
With RLE, repeated characters (such as leading spaces in text, or large
areas of one color in graphics) are expressed with a code noting which
byte is repeated and how many times it's repeated.
IV. Static and Dynamic Huffmann Codes
The logic behind this method is that certain characters occur more often
than others, and thus the more common characters can be expressed with
fewer bits. For example, take a text file: 'e' might be the most common
character, then 'a', then 's', then 'i', etc....
We can express these characters in a bit-stream like so:
'e' = 1
'a' = 01 <--These are Static Huffmann Codes.
's' = 001
'i' = 0001
etc...
Since these characters normally take up 8-bits, the first (most common)
seven will save space when expressed this way, if they occur often enough
in relation to the other 249 characters.
This is often represented as a (Static) Huffmann Tree:
/\ Note that a compression routine
'e' = 1 0 will have to first scan the data
/ \ and count which characters are
'a' = 1 0 more common and assign the codes
/ \ appropriately.
etc... 1 0
/ \
etc...
Notice that if we use the full 256 ASCII characters we'll run into
a problem with long strings of 0's. We can get around that by using
RLE (Which is what the compressor SQUEEZE uses).
Again, since most files use a large range of the character set, and
their occurences are much more "even", we can use use Dynamic Huffmann
Codes as an alternative. They are like their static cousins, only after
the string of n zeros they are followed by n (binary) digits:
1 = 1 1 character
01x = 010, 011 2 chars
001xx = 00100, 00101, 00110, 0011 4 chars
0001xxx = 0001000, 0001001 ... 0001111 8 chars
etc...
As you can guess, the Huffmann Tree would look something like:
/\ It's a bit too complicated to
1 0 express with ASCII <g>
/ \
1 0
/ \ /\
1 0 etc...
Huffmann Coding is based on how often an item (such as an ASCII
character) occurs, and not where or in relation to other items.
It's not the msot efficient mothod, but it is one of the simplest.
One only needs to make an initial pass to count the characters and
then re-pass translating them into the appropriate codes. No
pattern-matching or search routines are required.
V. Lampev-Ziv (LZ) Encoding.
This method _does_ require some fast pattern-matching/search routines.
It takes advantage of the fact that in most files (especially text)
whole strings of characters repeat themselves.
Take this sample line of text:
"THAT WHICH IS, IS. THAT WHICH IS NOT, IS NOT. IS IT? IT IS!"
(Ok, so "Flowers for Algernon" was on TV the other night... :)
With LZ, we read in from the beginning of the file and keep adding
groups ("Windows") of characters that have not already occurred.
So we add the first sentence, then get to the second "IS"...instead
of repeating it we note that it's a duplicate, with a reference
to where the original occurred and how long the string is. (I.E. we note
that the "IS" occurred 4 characters previously, and is two characters
long.)
This method can be tweaked by using a "Sliding Window" approach where the
largest matches are found...we can compress the second "IS" with the first,
but when don't gain much. However the two "IS NOT"s, when matched against
each other, would compress better.
Our compressed file will have two types of data: the raw characters and
the LZ-references (containing the pointer and length) to the raw
characters, or even to other LZ-references.
One way to tweak this further is to compress the LZ-References using
Huffmann codes. (I think this is what's done with the ZIP algorithm.)
VI. Lampev-Ziv-Welch (LZW) -- Not to be confused with LZ compression.
This is the method used by Unix COMPRESS, as well as the GIF-graphics
file format.
Like LZ, LZW compression can compress a stream of data on one-pass
relatively quickly (assuming you've got the fast search routines!).
However, LZW uses a hash table (essentially it's an array of strings,
also known as a string table), and a "match" is expressed as its index
in the hash table rather than a reference to where and how long the
match is.
The input stream is read in, strings are assembled and sent out when
they are already in the hash table, or they are added to the hash table
and sent out in such a way that a decoder can still re-assemble the
file.
Needless to say, this method is a memory hog. Most compressors that
use this method are usually limited to a certain number of entries
in the hash-table (12- or 16-bits, or 4096 and 65536 respectively).
Here's an outline of the LZW Compression Algorithm:
0. a. Initialize the hash table. (That means for each possible
character there is one "root" entry in the table. Some
flavors of LZW may actually have a "null" non-character
at the base of the table.)
b. Initialize the "current" string (S) to null.
1. Read a character (K) from the input stream.
(If there are none left, then we're done, of course.)
2. Is the string S+K in the string-table?
3. If yes, then a. S := S+K
b. Go to Step 1.
If no, then a. add S+K to the string table.
b. output the code for S
c. S := K
d. Go to Step 1.
Decompression is very similar.
0. a. Initialize the string table (the same as with compression).
b. Read the first code from the compressed file into C
c. Then output the equivalent string for code C
d. Let code O := code C (The code, not the string!)
1. Read the next code from the input stream into C
2. Does code C exist in the string/hash table?
3. If yes, then a. output the string for code C
b. S := equivalent string for code O
c. K := first character of the string for code C
d. add S+K to the string table
If no, then a. S := equivalent string for code O
b. K := first character of S
c. output S+K
d. add S+K to the table
4. code O := code C
5. Go to Step 1.
It may seem psychotic at first, but it works. Just keep reading it and
thinking about it. The best way is with a pencil and pad experimenting
with a character set of four characters (A, B, C, and D) and a "word"
like "ABACADABA" and following through the steps.
(An actual walk-through is not included here, since it will take up
*way* too much space. I recommend getting further, more detailed
descriptions of LZW if you plan on writing an implementation.)
One important thing is to *not* confuse between the `code' for a string
(it's index in the hash table) and the string itself.
Two points about LZW implementations:
1. The code stream (i.e. the compressed output) is not usually
a set number of bits, but reflects how many entries are in
the string table. This is another way to further compress
the data.
Usually, LZW schemes will output the code in the number of bits
exactly reflecting the string table. Thus for the first 512
codes the output is 9-bits. Once the 513th string is added to
the table, it's 10-bits and so on.
Some implementations will use a constant output until a threshold
is reached. For example, the code output will always be 12-bits
until the 4097th string is added, then the code output is in
16-bit segments etc...
If these scenarios are used, the decompression routine *must*
"think ahead" so as to read in the proper number of bits from
the encoded data.
2. Because full string tables are incredible memory hogs, many
flavors of LZW will use a "hash" table (see, string and hash
table aren't quite the same).
Instead of a string of characters like "ABC", there'll be
a table containing the last character of the string, and a
reference to the previous character(s). Thus we'll have
"A", "[A]B" and "[[A]B]C" where [x] is the reference (or the
actual "code") for that character/string. Using this method
may be slower, but it saves memory and makes the implement-
ation a little easier.
There are many variations on LZW using other sets of strings to
compare with the string table--this is based on the assumption that
the more entries there are in the table, the more efficient the
compression will be.
One variation (Lampev-Ziv-Yakoo) would add the ButFirst(S+K) every
time S+K was added. "ButFirst" means all characters but the first
one. So ButFirst("ABCD") = "BCD".
Anyhow, those are the compression algorithms that I know about which I
can even remotely attempt to explain.
I apologize for any (glaring?) mistakes. It's late...what started as a
meaningless reply to somebody asking something about SQUEEZE exploded
into this post. I hope it's useful to anybody interested.
}
[Back to FAQ SWAG index] [Back to Main SWAG index] [Original]