		     Data Compression Techniques
			  Using	SQ and USQ

			   Mark	DiVecchio
		  San Diego IBM	PC Users Group

	 In any	computer system, efficient management of the file
	 storage space is important.  The two programs SQ and USQ
	 reduce	the size of data files by using	the Huffman Data
	 Compression Algorithm.

	 A file	is composed of a sequence of bytes.  A byte is
	 composed of 8 bits.  That byte	can have a decimal value
	 between 0 and 255.  A typical text file, like a C language
	 program, is composed of the ASCII character set (decimal 0 to
	 127).	This character set only	requires seven of the eight
	 bits in a byte.  Just think --	if there were a	way to use
	 that extra bit	for another character, you could reduce	the
	 size of the file by 12.5%.  For those of you who use upper
	 case characters only, you use only about 100 of the 256
	 possible byte values, so roughly 60% of the values a byte
	 can hold go unused.  What if you could encode the most frequently
	 used characters in the	file with only one bit?	 For example,
	 if you	could store the	letter "e" (the	most commonly used
	 letter	in the English language) using only one	bit :  "1",
	 you could save	87.5% of the space that	the "e"	would normally
	 take if it were stored	as the ASCII "0110 0101" binary.
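
	 The percentages above can be checked with a few lines of
	 arithmetic.  This is only an illustrative sketch in Python
	 (not the language SQ is written in), and the function name
	 percent_saved is made up for this example.

```python
def percent_saved(original_bits, encoded_bits):
    """Percentage of space saved when encoded_bits replace original_bits."""
    return 100.0 * (original_bits - encoded_bits) / original_bits

# Dropping the unused eighth bit of every 7-bit ASCII character.
seven_bit_saving = percent_saved(8, 7)

# Storing the letter "e" as the single bit "1" instead of 0110 0101.
one_bit_saving = percent_saved(8, 1)

print(seven_bit_saving)   # 12.5
print(one_bit_saving)     # 87.5
```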

	 The answer to these questions is the SQ and USQ programs.

	 The SQ program counts how often each of the 256 possible byte
	 values occurs in the file and then uses the Huffman coding
	 technique to assign a bit string to each character that
	 appears.  All of these bit strings are
	 placed	end to end and written onto a disk file.  The encoding
	 information is	also put on the	file since the USQ program
	 needs to know the character distribution of the original
	 file.

	 The USQ program reads in the encoding information and then
	 reads in the encoded file.  It	is a simple matter to scan the
	 encoded file and produce an output file which is identical to
	 the file that SQ started with.

	 Huffman Coding	Technique

	 This is by far	the most popular encoding technique in use
	 today.  The Huffman encoding replaces fixed-length characters
	 with variable-length bit strings.  The more frequently a
	 character occurs, the shorter its bit string.  For those of
	 you inclined to such symbolism:

	 Length of bit string ~= -log2 (character probability)
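
	 As a sketch of that relationship: information theory gives
	 the ideal length of a character's bit string as the negative
	 base-2 logarithm of its probability, so likelier characters
	 get shorter strings.  The function name ideal_code_length is
	 hypothetical, chosen only for this illustration.

```python
import math

def ideal_code_length(probability):
    """Ideal length in bits for a symbol with the given probability."""
    return -math.log2(probability)

half = ideal_code_length(0.5)        # a symbol seen half the time
rare = ideal_code_length(1 / 256)    # one of 256 equally likely bytes

print(half)   # 1.0
print(rare)   # 8.0
```

	 A real Huffman code must use a whole number of bits per
	 character, so actual lengths only approximate these values.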

	 The implementation of the algorithm which we will discuss
	 encodes fixed bit strings of length 8.

	 This algorithm	requires two passes through the	input file.
	 The first pass	builds a table of 256 entries showing the
	 frequency of occurrence of each of the 256 possible
	 values	for a byte of information.
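
	 A minimal sketch of this first pass, assuming the whole file
	 fits in memory as a Python bytes object.  The name
	 count_bytes and the sample data are invented for the example.

```python
def count_bytes(data):
    """Return a 256-entry table: table[v] = number of times byte v occurs."""
    table = [0] * 256
    for value in data:          # iterating over bytes yields ints 0..255
        table[value] += 1
    return table

sample = b"hello, world"
table = count_bytes(sample)

print(table[ord("l")])   # 3
print(table[ord("o")])   # 2
print(table[ord("z")])   # 0
```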

	 Once the counting is complete,	the algorithm must decide
	 which bit strings to associate with each of the byte values
	 that were found in the file.  Note that if a
	 particular byte value was never used, no string association
	 is needed.

	 The second pass through the input file	converts each byte
	 into its encoded string.  Remember that when the output file
	 is created, the information used for encoding must also be
	 written on the	file for use by	the decoding program.

	 The decoding program reads the	encoding information from the
	 file and then starts reading the bit strings.	As soon	as
	 enough	bits are read to interpret a character,	that character
	 is written onto the final output file.	 See the next two
	 sections on how SQ and	USQ actually implement this.

	 Even though this article has primarily addressed ASCII input
	 files,	there is nothing which restricts this algorithm	to
	 ASCII.	 It will work on binary	files (.COM or .EXE) as	well.
	 But since the length of the encoded bit string depends on
	 the frequency of occurrence of each 8 bit byte, a binary
	 file may not compress very much.  This is because a binary
	 file is more likely to have a nearly uniform distribution
	 over the 256 values in a byte.  A
	 machine language program is not like the English language
	 where the letter "e" is used far more than other letters.  If
	 the distribution is uniform, the encoded bit strings will all
	 be the	same length and	the encoded file could be longer than
	 the original (because of the encoding information on the
	 front of the file).  All of this has to be qualified, because
	 machine language programs tend	to use a lot of	"MOV"
	 instructions and have a lot of	bytes of zeros so that
	 encoding .COM and .EXE	files does save	some disk space.
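
	 One way to see how much room a file leaves for this kind of
	 coding is to compute the Shannon entropy of its byte
	 distribution.  The sketch below is an illustration only; the
	 function name and the sample data are assumptions, not part
	 of SQ.

```python
import math
from collections import Counter

def entropy_bits_per_byte(data):
    """Shannon entropy of the byte distribution, in bits per byte."""
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

uniform = bytes(range(256)) * 4      # every byte value equally often
text = b"it was the best of times, it was the worst of times"

print(entropy_bits_per_byte(uniform))   # 8.0, so nothing to gain
print(entropy_bits_per_byte(text))      # well under 8 bits per byte
```

	 A result close to 8 means the byte values are spread almost
	 evenly and per-byte Huffman coding will save little.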

		SQ

	 The SQ program is an implementation of the Huffman algorithm.

	 The first thing that SQ does is read through the input	file
	 and create a distribution array for the 256 possible
	 characters.  This array contains counts of the	number of
	 occurrences of	each of	the 256	characters.  The program
	 counts	these values in	a 16 bit number.  It makes sure	that,
	 if you	are encoding a big file, counts	do not exceed a	16 bit
	 value.	 This is highly	unlikely, but it must be accounted
	 for.
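
	 The article does not spell out how SQ keeps its counts within
	 16 bits.  One plausible approach, shown here purely as an
	 assumption, is to divide every count by a common factor while
	 making sure that a byte which did occur never drops to zero.
	 The names MAX_COUNT and scale_counts are hypothetical.

```python
MAX_COUNT = 0xFFFF  # largest value an unsigned 16-bit counter can hold

def scale_counts(counts, limit=MAX_COUNT):
    """Shrink counts by a common divisor until the largest fits under limit.

    Bytes that occurred at least once keep a count of at least 1, so
    they are not dropped from the table.
    """
    largest = max(counts)
    if largest <= limit:
        return list(counts)
    divisor = -(-largest // limit)           # ceiling division
    return [0 if c == 0 else max(1, c // divisor) for c in counts]

counts = [200_000, 70_000, 3, 0]
scaled = scale_counts(counts)

print(scaled)   # [50000, 17500, 1, 0]
```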

	 At the same time, SQ removes strings of identical characters
	 and replaces each run with the repeated character, the ASCII
	 character DLE, and a character count of 2-255.  A DLE that
	 occurs in the input itself is replaced with the
	 pair of characters:  DLE DLE.  This is not related to the
	 Huffman algorithm but just serves to compress the file	a
	 little	more.

	 Once SQ has scanned the input file, it	creates	a binary tree
	 structure containing this frequency of occurrence information.
	 The most frequently occurring characters have the shortest
	 path from the root to their leaf, and the least frequently
	 occurring characters have the longest path.  For example, if
	 your file were:

	 ABRACADABRA   (a very simple and
			magical	example)

	 The table of frequency	of occurrences would be:

	 Letter		# of Occurrences
	 ------		---------------
	   A		     5
	   B		     2
	   C		     1
	   D		     1
	   R		     2
	   all the rest	     0

	 Since the letter "A" occurs most often, it should have	the
	 shortest encoded bit string.  The letters "C" and "D" should
	 have the longest.  The	other characters which don't appear in
	 the input file	don't need to be considered.

	 SQ would create a binary tree to represent this information.
	 The tree might	look something like this (for purposes of
	 discussion only):

	          root                <-- Computer trees are
	         /    \                   always upside down!
	        1      0              <-- This is called a node.
	       /      /  \
	      A      1    0
	            /    /  \         <-- This is called a branch.
	           B    1    0
	               /    /  \
	              C    1    0
	                  /      \
	                 D        R   <-- This is a leaf.


	 From this tree, our encoded bit strings, which are kept in a
	 translation table, would be:

	   Table Entry	Character  Binary
	   -----------	---------  ------
		1	    A	   1
		2	    B	   01
		3	    C	   001
		4	    D	   0001
		5	    R	   0000


	 The output file would be:


	     A B  R    A C   A D    A B	 R    A
	     ----------------------------------
	     1 01 0000 1 001 1 0001 1 01 0000 1
	     (binary)

	     A1	31 A1
	     (hex)
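
	 The bit string and hex bytes above can be reproduced with a
	 short sketch.  The translation table is copied from the
	 example; the names CODES, encode and pack_bits are invented
	 for this illustration and are not taken from SQ.

```python
# Translation table copied from the article's example.
CODES = {"A": "1", "B": "01", "C": "001", "D": "0001", "R": "0000"}

def encode(text, codes):
    """Concatenate the bit string for each character."""
    return "".join(codes[ch] for ch in text)

def pack_bits(bits):
    """Pack a string of '0'/'1' into bytes, zero-padding the last byte."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

bits = encode("ABRACADABRA", CODES)
packed = pack_bits(bits)

print(bits)                                   # 101000010011000110100001
print(" ".join(f"{b:02X}" for b in packed))   # A1 31 A1
```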


	 We have reduced the size of your file from eleven bytes to
	 three bytes for a 73% savings.  For this simple example,
	 things aren't that well off, since we must put the binary
	 tree encoding information onto the file as well.  So the
	 file size actually grew.  But consider a file with the word
	 ABRACADABRA repeated 100,000 times.  Now the encoding
	 information is going to be a very small percentage of the
	 output file and the file will shrink tremendously.

	 SQ opens the output file and writes out the binary tree
	 information.  Then SQ rewinds the input file and rereads it
	 from the beginning.  As it reads each character, it looks
	 into the translation table and	outputs	the corresponding bit
	 string.

	 SQ is a little	more complicated than what I have outlined
	 since it must operate in the real world of hardware, but this
	 is a fairly complete description of the algorithm.
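
	 The tree-building step can be sketched with a priority
	 queue: repeatedly join the two least frequent subtrees until
	 one tree remains.  This is a generic textbook construction,
	 not SQ's actual code, and huffman_code_lengths is a
	 hypothetical name.  The example tree shown earlier was for
	 discussion only and spends 24 bits on ABRACADABRA; the
	 construction below needs 23.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code of data."""
    counts = Counter(data)
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                    # a one-symbol file still needs 1 bit
        return {sym: 1 for sym in heap[0][2]}
    tie = len(heap)                       # unique tie-breaker for merged nodes
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)    # the two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "ABRACADABRA"
lengths = huffman_code_lengths(text)
total_bits = sum(lengths[ch] for ch in text)

print(lengths["A"])   # 1
print(total_bits)     # 23
```

	 Only the code lengths are computed here.  Which branch is
	 labelled 0 and which 1 is a free choice, as long as the
	 decoder builds the same tree.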

	     USQ

	 The USQ program is very straightforward.  It reads in the
	 encoding information written out by SQ	and builds the
	 identical binary tree that SQ used to encode the file.

	 USQ then reads	the input file as if it	were a string of bits.
	 Starting at the root of the tree, it traverses	one branch of
	 the tree with each input bit.	If it has reached a leaf, it
	 has a character and that character is written to the output
	 file.	USQ then starts	at the root again with the next	bit
	 from the input	file.
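
	 A minimal sketch of this tree walk, using the translation
	 table from the ABRACADABRA example.  The nested-dictionary
	 tree and the names build_tree and decode are assumptions
	 made for illustration; USQ itself stores its tree
	 differently.

```python
# Translation table copied from the article's example.
CODES = {"A": "1", "B": "01", "C": "001", "D": "0001", "R": "0000"}

def build_tree(codes):
    """Build a nested-dict tree: inner nodes map '0'/'1' to children."""
    root = {}
    for symbol, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = symbol           # a leaf holds the character
    return root

def decode(bits, root):
    """Follow one branch per bit; emit a character at each leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):         # reached a leaf
            out.append(node)
            node = root                   # restart at the root
    return "".join(out)

tree = build_tree(CODES)
decoded = decode("101000010011000110100001", tree)

print(decoded)   # ABRACADABRA
```

	 A real decoder also has to know where the data ends, since
	 the last byte is usually padded with extra bits.  This sketch
	 leaves that out.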

	 What does it all mean?

	 Now that we understand	the algorithm and a little about how
	 the SQ	and USQ	programs work, we can use that knowledge to
	 help us run our systems a little more efficiently.  (So all
	 of this theory	is worth something after all!).

	 1.  Files must	be above a threshold size, or else the output
	 file will be longer than the input file because of the
	 decoding information put at the beginning of the compressed
	 data.	We don't know the exact size of the threshold because
	 the encoding binary tree information depends on the
	 distribution of the characters	in a file.  At least we	know
	 to check the size of the encoded file after we	run SQ to make
	 sure our file didn't grow.

	 2.  Some files	will not compress well if they have a uniform
	 distribution of byte values, for example, .COM	or .EXE	files.
	 This is because of the	way SQ builds the tree.	 Remember that
	 bytes with the	same frequency of occurrence are at the	same
	 depth (usually) in the	tree.  So if all of the	bytes have the
	 same depth, the output	strings	are all	the same length.

	 3.  SQ	reads the input	file twice.  If	you can, use RAM disk
	 at least for the input	file and for both files	if you have
	 the room.  The	next best case is to use two floppy drives,
	 one for input and one for output.  This will cause a lot of
	 disk starts and stops but not much head movement.  Worst case
	 is to use one floppy drive for	both input and output.	This
	 will cause a lot of head movement as the programs alternate
	 between the input and output files.

	 Other Compression Techniques

	 RUN-LENGTH ENCODING

	 Run-length encoding is	a technique whereby sequences of
	 identical bytes are replaced by the repeated byte and a byte
	 count.	 As you	might guess, this method is effective only on
	 very specialized files.  One good candidate is	a screen
	 display buffer.  A screen is made up mostly of	"spaces".  A
	 completely blank line could be	reduced	from 80	bytes of
	 spaces	to one space followed by a value of 80.	 To go from 80
	 bytes down to two bytes is a savings of almost	98%.  You
	 might guess that for text files or binary files, this
	 technique does	not work well at all.
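
	 The simplest form of this idea stores every run as a pair of
	 bytes: the value and a count.  The sketch below assumes that
	 layout (it is not the DLE scheme SQ uses), and the names
	 rle_encode and rle_decode are invented for the example.

```python
def rle_encode(data):
    """Encode bytes as (value, count) pairs, with counts capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)

def rle_decode(encoded):
    """Expand (value, count) pairs back into the original bytes."""
    out = bytearray()
    for i in range(0, len(encoded), 2):
        out += bytes([encoded[i]]) * encoded[i + 1]
    return bytes(out)

blank_line = b" " * 80
packed = rle_encode(blank_line)

print(len(blank_line), "->", len(packed))                           # 80 -> 2
# Text with no repeated neighbours doubles in size under this scheme.
print(len(b"ABRACADABRA"), "->", len(rle_encode(b"ABRACADABRA")))   # 11 -> 22
```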

	 ADAPTIVE COMPRESSION

	 This technique replaces strings of characters with a code.
	 For example, the string "ABRACADABRA" would be replaced by a
	 code.  Typical algorithms use a 12 bit code.  The algorithm
	 is unique in that it only requires a single pass through the
	 input file as the encoding is taking place.  The current
	 incarnation of this procedure is called the LZW method (after
	 its co-inventors, A. Lempel, J. Ziv, and T. Welch).  This
	 algorithm claims a savings of 66% on machine language files
	 and up	to 83% on COBOL	files.
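
	 A compact sketch of the compression half of LZW, following
	 the usual textbook form: the table starts with the 256
	 single bytes and grows by one string per emitted code until
	 the 12 bit code space is full.  The function name and the
	 choice to simply stop adding entries when the table is full
	 are assumptions for illustration.

```python
def lzw_compress(data, max_code_bits=12):
    """Return a list of integer codes for data (a bytes object)."""
    table = {bytes([i]): i for i in range(256)}   # single bytes: codes 0..255
    next_code = 256
    limit = 1 << max_code_bits                    # 4096 codes for 12 bits
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate                   # keep extending the match
        else:
            codes.append(table[current])
            if next_code < limit:                 # table full: stop adding
                table[candidate] = next_code
                next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes

print(lzw_compress(b"ABABABA"))   # [65, 66, 256, 258]

data = b"ABRACADABRA" * 3
codes = lzw_compress(data)
print(len(data), "bytes ->", len(codes), "codes")
```

	 Note that the table is built while the input is read, in a
	 single pass, and never has to be written to the output file:
	 the decoder can rebuild it from the codes themselves.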

	 Other Reading

	 If you	are interested in reading more about data compression
	 techniques, you may be	interested in these articles:


	 H.K. Reghbati, "An Overview of Data Compression Techniques,"
	 Computer Magazine, Vol. 14, No. 4, April 1981, pp. 71-76.

	 T.A. Welch, "A Technique for High-Performance Data
	 Compression," Computer Magazine, Vol. 17, No. 6, June 1984,
	 pp. 8-19.

	 J. Ziv and A. Lempel, "A Universal Algorithm for Sequential
	 Data Compression," IEEE Transactions on Information Theory,
	 Vol. IT-23, No. 3, May 1977, pp. 337-343.

