CSCI 241 - Homework 8:
Huffman's Algorithm

Due by 11:59:59pm, Sunday, May 02

Introduction

For this assignment, you will create two programs (encode and decode) that perform the calculations needed for simple file compression. (For small files, it might make things a little bigger.)

Things to note

The repository URL for this assignment is https://classroom.github.com/a/4FAA5lBL
This project is trickier than most. Get started on it early!

Program behavior

Your encode program should read a text file specified on the command line and write a Huffman encoded version of that file to the specified output file. Similarly, the decode program will read a file generated by encode and write a decoded version of that file to a specified output file.

If no output file is specified, write to stdout.

% ./encode book.txt   book.huf      # encodes book.txt and writes it to
                                    # book.huf

% ./encode book.txt > book.huf      # encodes book.txt and writes it to
                                    # stdout (redirected to book.huf)

% ./decode book.huf   book.txt.2    # decodes the file, writing to book.txt.2

% diff -q book.txt book.txt.2       # check to see if files are the same
                                    # should print nothing if they are

Getting started

You will likely need to divide your code into three different parts, which should therefore be stored in three different files:

Functions needed by the encode program
Functions needed by the decode program
Functions used by both programs

Now, it is possible to have only a single program that can do both encoding and decoding based on the filename, but to handle that, you'd need to check the value of argv[0] and determine which function to perform. It's probably easier to just make 2 separate programs.
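
If you do want to try the single-program route, the check on argv[0] could look roughly like the sketch below. The name invoked_as_decode is made up for illustration, and the sketch assumes the decoder is installed under the exact name "decode".

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the program was started under the
 * name "decode" (with or without a leading directory), 0 otherwise. */
int invoked_as_decode(const char *argv0)
{
    const char *base = strrchr(argv0, '/');   /* last '/' in the path, if any */

    base = (base != NULL) ? base + 1 : argv0;
    return strcmp(base, "decode") == 0;
}
```

main() would call this with argv[0] and branch to the encoding or decoding routine accordingly.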

Recall that you can make an object file by using the "-c" flag when compiling. Then you can link the various object files together to make actual programs.

Program Design

Encoding

In order to encode a file, you will first need to construct a Huffman tree based on the frequency of letters in the file. Your first step should be to read the file from start to end and calculate either absolute or relative frequency of all the characters encountered. You should include the frequency of EOF (which should always be 1) and we will store that at the beginning of our list of nodes (logical index of -1).
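
A minimal sketch of the counting pass, assuming absolute counts kept in an array that is shifted by one so that slot 0 stands for the logical index -1. The names count_frequencies and NUM_SYMBOLS are illustrative, not required by the assignment.

```c
#include <limits.h>
#include <stdio.h>

/* One slot per possible byte value, plus one slot for EOF. */
#define NUM_SYMBOLS (UCHAR_MAX + 2)

/* Hypothetical helper: fills counts[] from the stream.  Slot 0 is the
 * logical index -1 (EOF); byte value c is counted in slot c + 1. */
void count_frequencies(FILE *in, unsigned long counts[])
{
    int c;   /* int, not char, so that EOF can be told apart from byte 255 */

    for (int i = 0; i < NUM_SYMBOLS; i++)
        counts[i] = 0;
    while ((c = fgetc(in)) != EOF)
        counts[c + 1]++;
    counts[0] = 1;   /* EOF always occurs exactly once */
}
```

Because the file is read again later, the caller would rewind() the stream (or reopen the file) after this pass.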

You will then need to create a sorted list of nodes based on ascending frequency. To do this, I recommend that you use an insertion sort on a linked list. Insert new nodes starting with the index value of -1 and going up to index value of 255. Insert before items of equal value. Skip nodes with a frequency count of 0.
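
The insertion rule (ascending frequency, new node goes before items of equal value) might be sketched as follows. The struct layout and the name insert_sorted are assumptions made for this example; your own node type may differ.

```c
#include <stddef.h>

/* Illustrative node type: usable both as a list element and a tree node. */
struct node {
    int value;                   /* byte value, or -1 for EOF */
    unsigned long count;         /* frequency */
    struct node *next;           /* linked-list pointer */
    struct node *left, *right;   /* tree pointers */
};

/* Insert n into the list at *head, keeping ascending count order.
 * n is placed BEFORE any existing node with an equal count. */
void insert_sorted(struct node **head, struct node *n)
{
    struct node **link = head;

    /* Skip only nodes that are strictly smaller than n. */
    while (*link != NULL && (*link)->count < n->count)
        link = &(*link)->next;
    n->next = *link;
    *link = n;
}
```

Inserting the "cheese" nodes in index order (-1, 10, 99, 101, 104, 115) with this rule yields the initial list shown in the sample run: s, h, c, \n, EOF, e.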

Then, you will need to convert this sorted list into a Huffman tree. While there are at least 2 nodes in your list, you should create a new node, attach the head item in the list as the left child and the second item in the list as the right child, set the new node's frequency count to the sum of its children's counts, and insert it into your linked list. Be sure you've removed the two nodes that are now children before re-inserting. When only one node remains, it is the root of the tree.
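
One way the merging loop could look, as a sketch only. It repeats an illustrative node type and sorted insert so that it compiles on its own; the names build_tree and insert_sorted are assumptions, and the loop keeps merging until a single node (the root) is left, as in the final pass of the sample run.

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;                   /* byte value, or -1 for EOF; unused in internal nodes */
    unsigned long count;
    struct node *next;
    struct node *left, *right;
};

/* Sorted insert: ascending count, before items of equal count. */
void insert_sorted(struct node **head, struct node *n)
{
    struct node **link = head;

    while (*link != NULL && (*link)->count < n->count)
        link = &(*link)->next;
    n->next = *link;
    *link = n;
}

/* Hypothetical helper: turns a sorted list into a tree, returns the root. */
struct node *build_tree(struct node *head)
{
    while (head != NULL && head->next != NULL) {
        struct node *left = head;
        struct node *right = head->next;
        struct node *parent = malloc(sizeof *parent);

        if (parent == NULL) {
            fprintf(stderr, "out of memory\n");
            exit(EXIT_FAILURE);
        }
        head = right->next;            /* remove both children from the list */
        left->next = NULL;
        right->next = NULL;

        parent->value = 0;
        parent->count = left->count + right->count;
        parent->left = left;
        parent->right = right;
        parent->next = NULL;
        insert_sorted(&head, parent);  /* re-insert the merged node */
    }
    return head;
}
```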

Now traverse the tree keeping track of the string needed to reach that node based on using a character '0' for a left branch and '1' for a right branch. When you reach a leaf node, you will know what string is to be used to represent that character.
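
A recursive traversal along these lines might look like the sketch below. It assumes a codes[] array indexed by value + 1 (so EOF, value -1, lands in slot 0) and a scratch buffer big enough for the deepest path; assign_codes is a made-up name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    int value;                   /* byte value, or -1 for EOF */
    struct node *left, *right;
};

/* Hypothetical helper: records the '0'/'1' path to every leaf.
 * path is scratch space; depth is the number of branches taken so far.
 * codes[value + 1] receives a malloc'd copy that the caller must free. */
void assign_codes(const struct node *n, char *path, size_t depth, char *codes[])
{
    if (n->left == NULL && n->right == NULL) {   /* leaf: path is complete */
        char *copy = malloc(depth + 1);

        if (copy == NULL) {
            fprintf(stderr, "out of memory\n");
            exit(EXIT_FAILURE);
        }
        path[depth] = '\0';
        memcpy(copy, path, depth + 1);
        codes[n->value + 1] = copy;
        return;
    }
    path[depth] = '0';                           /* left branch */
    assign_codes(n->left, path, depth + 1, codes);
    path[depth] = '1';                           /* right branch */
    assign_codes(n->right, path, depth + 1, codes);
}
```

Note that a tree consisting of a single leaf (an empty input file, which has only EOF) would give that leaf an empty string here; you may want to decide how your programs treat that case.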

Now re-read the input file from the beginning and, for each character encountered, output the bit string that corresponds to that character. Be sure to output the string corresponding to the EOF marker too, and stop after you do so.

File Format

The files to be encoded can be treated as simple 8-bit character files (but use CHAR_BIT instead of 8). What I mean by this is that if you call fgetc() you will get a character until you reach the end, at which point you will have EOF returned (which we'll treat as if it has a character value of -1 -- which it should). By treating these files as byte-oriented rather than printable ASCII, you should be able to encode both text and binary files.

The Huffman encoded output files will be a bit different. You need to include the binary trie representing the Huffman prefix codes. To do this, you will do a pre-order traversal of the trie using a 0-bit to indicate that it is an internal node and therefore has left and right children, or a 1-bit to indicate that it is a leaf node. Immediately following the 1-bit you will write the CHAR_BIT bits from most to least significant that make up the value of the character at that location in the tree.
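
To check your understanding of the pre-order format, it can help to first produce it as readable '0' and '1' characters, which can be compared against the "Tree:" line of the sample run. The sketch below does only that; it is a debugging aid under assumed names (tree_to_bits, a simplified node type), and a real encoder would hand each bit to its bit-packing routine instead.

```c
#include <limits.h>
#include <stddef.h>

struct node {
    int value;                   /* byte value, or -1 for EOF */
    struct node *left, *right;
};

/* Hypothetical helper: appends the pre-order description of the subtree
 * at n to out, starting at position len, and returns the new length.
 * out must be large enough; the caller adds the terminating '\0'. */
size_t tree_to_bits(const struct node *n, char *out, size_t len)
{
    if (n->left == NULL && n->right == NULL) {
        out[len++] = '1';                        /* leaf marker */
        /* Low CHAR_BIT bits of the value, most significant first.
         * For EOF (-1) this produces all 1-bits. */
        for (int i = CHAR_BIT - 1; i >= 0; i--)
            out[len++] = (((unsigned)n->value >> i) & 1u) ? '1' : '0';
        return len;
    }
    out[len++] = '0';                            /* internal node marker */
    len = tree_to_bits(n->left, out, len);
    return tree_to_bits(n->right, out, len);
}
```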

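Immediately following the pre-order traversal of the tree, you will write an initial bit-string representing EOF. (The EOF value of -1 does not fit in CHAR_BIT bits, so its leaf is written into the tree as all 1-bits, which is 255 when CHAR_BIT is 8. The decoder will later use this bit-string to find the leaf that really represents EOF and correct the value there.)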

After the tree and the EOF bit-string, you should output the individual bits that are needed to represent the input file. You'll have to buffer the bits until you get CHAR_BIT of them and then output them as a single character. (The most significant bit is the first bit, and then they progress downward.) Hint: you might want to look at what you wrote for homework 4 encode_bits and decode_bits.

Pad out the last incomplete character in the file with 0 bits. If you write (CHAR_BIT-1) 0-bits out then it will flush any remaining bits without creating a new character.
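
The buffering and padding described above could be sketched like this. The bit_writer struct and the function names are assumptions for illustration, not the homework 4 interface.

```c
#include <limits.h>
#include <stdio.h>

/* Illustrative bit-output state. */
struct bit_writer {
    FILE *out;
    unsigned char pending;   /* bits collected so far, newest in the low end */
    int used;                /* how many bits of pending are in use */
};

/* Append one bit; write a character once CHAR_BIT bits are collected. */
void put_bit(struct bit_writer *w, int bit)
{
    w->pending = (unsigned char)((w->pending << 1) | (bit & 1));
    if (++w->used == CHAR_BIT) {
        fputc(w->pending, w->out);
        w->pending = 0;
        w->used = 0;
    }
}

/* Append a code given as a string of '0' and '1' characters. */
void put_code(struct bit_writer *w, const char *code)
{
    for (; *code != '\0'; code++)
        put_bit(w, *code == '1');
}

/* Pad with CHAR_BIT-1 zero bits: this completes a partly filled
 * character but never starts a new one. */
void flush_bits(struct bit_writer *w)
{
    for (int i = 0; i < CHAR_BIT - 1; i++)
        put_bit(w, 0);
    w->pending = 0;          /* leftover zero bits are dropped */
    w->used = 0;
}
```

Because the first bit written ends up in the most significant position, the output lines up with what xxd -b displays.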

Decoding

To decode the file, you should first open the file specified on the command line. You then can read in the pre-order traversal of the tree, assembling it as you go. A 0-bit indicates an internal node which has both a left and right child. A 1-bit indicates it is a leaf and the next CHAR_BIT bits represent the value at that node from MSB to LSB. (I found a recursive function to work nicely for this.)
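
A sketch of what such a recursive reader might look like, assuming a small bit-input struct. All names here (bit_reader, get_bit, read_tree, free_tree) are hypothetical, and the error handling is deliberately simple.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *left, *right;
};

struct bit_reader {
    FILE *in;
    int current;   /* most recently read character */
    int left;      /* unread bits remaining in current */
};

/* Returns the next bit, most significant first; exits if the file ends. */
int get_bit(struct bit_reader *r)
{
    if (r->left == 0) {
        r->current = fgetc(r->in);
        if (r->current == EOF) {
            fprintf(stderr, "unexpected end of encoded file\n");
            exit(EXIT_FAILURE);
        }
        r->left = CHAR_BIT;
    }
    r->left--;
    return (r->current >> r->left) & 1;
}

/* Rebuilds the tree from its pre-order description. */
struct node *read_tree(struct bit_reader *r)
{
    struct node *n = malloc(sizeof *n);

    if (n == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    n->value = 0;
    n->left = NULL;
    n->right = NULL;
    if (get_bit(r) == 1) {                   /* leaf: CHAR_BIT value bits follow */
        for (int i = 0; i < CHAR_BIT; i++)
            n->value = (n->value << 1) | get_bit(r);
    } else {                                 /* internal: left subtree, then right */
        n->left = read_tree(r);
        n->right = read_tree(r);
    }
    return n;
}

/* Releases every node of the tree. */
void free_tree(struct node *n)
{
    if (n == NULL)
        return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

After read_tree() returns, the reader is positioned at the first bit of the EOF bit-string. At this point the EOF leaf still holds the all-ones value (255 when CHAR_BIT is 8).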

Now you need to fix the value of EOF in the tree. Switch over to a bitwise read/tree traversal routine where 0 indicates to go left and 1 indicates to go right. Once you hit the first leaf, you now have the location for the actual EOF marker and you should update the value there accordingly.

Now you continue with a bitwise read/tree traversal routine and use those to determine if you should go left on 0 or right on 1 in the tree. Once you reach a leaf, you should be at a letter. Print it and move back to the root. When you reach the EOF marker you should stop reading/printing and close both files. Nothing is printed for the EOF marker.

NOTE: You should not print out anything when you reach the EOF marker, and you should never reach the actual end of the encoded file.
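
The EOF fix-up and the main decoding loop can share one "walk to the next leaf" helper. The sketch below illustrates the control flow only: to stay short it takes its bits from a string of '0' and '1' characters and writes into a caller-supplied buffer, whereas a real decoder would read bits from the file and write characters to the output file. The names next_leaf and decode_bits are made up.

```c
#include <stddef.h>

struct node {
    int value;                   /* byte value; -1 once marked as EOF */
    struct node *left, *right;
};

/* Follow bits from the root until a leaf is reached, advancing *bits.
 * Returns NULL if the input runs out first. */
struct node *next_leaf(struct node *root, const char **bits)
{
    struct node *n = root;

    while (n->left != NULL || n->right != NULL) {
        char bit = **bits;

        if (bit == '\0')
            return NULL;
        (*bits)++;
        n = (bit == '0') ? n->left : n->right;
    }
    return n;
}

/* Decodes bits into out and returns the number of characters produced,
 * or -1 if the input ends before the EOF marker is seen. */
long decode_bits(struct node *root, const char *bits, char *out)
{
    long produced = 0;
    struct node *leaf = next_leaf(root, &bits);   /* first code locates EOF */

    if (leaf == NULL)
        return -1;
    leaf->value = -1;                             /* fix the EOF leaf */

    for (;;) {
        leaf = next_leaf(root, &bits);            /* always restart at the root */
        if (leaf == NULL)
            return -1;
        if (leaf->value == -1)                    /* EOF marker: print nothing */
            break;
        out[produced++] = (char)leaf->value;
    }
    return produced;
}
```

Fed the tree from the sample run and the bits that follow it (110, then 1110 101 0 0 100 0 1111 110, then the padding), this produces the seven characters of "cheese" plus the newline and stops before the padding.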

Sample run -- with internal state

INPUT:
cheese

Frequency Counts:
-1     EOF     1
10     \n      1
99     c       1
101    e       3
104    h       1
115    s       1

Linked List (initial):
s(1) -> h(1) -> c(1) -> \n(1) -> EOF(1) -> e(3)

First pass:
c(1) -> \n(1) -> EOF(1) -> (2) -> e(3)
                          /   \
                      s(1)     h(1)    

Second pass:
EOF(1) -> (2)   -->   (2) -> e(3)
         /   \       /   \
      c(1)  \n(1)  s(1)   h(1)    


Third pass:
   (2)  ------>  (3)  ---> e(3)
  /   \         /   \
s(1)   h(1) EOF(1)  (2)
                   /   \
                 c(1) \n(1)

Fourth pass:
e(3) -----------> (5)
               /       \
           (2)           (3)
          /   \         /   \
        s(1)   h(1) EOF(1)  (2)
                           /   \
                         c(1) \n(1)

Fifth (and final) pass:
         (8)
    /           \
e(3)              (5)
               /       \
           (2)           (3)
          /   \         /   \
        s(1)   h(1) EOF(1)  (2)
                           /   \
                         c(1) \n(1)

Internal data: (including padding)
char       count bitstring
----       ----- ---------
 -1  EOF       1 110
 10  \n        1 1111
 99  c         1 1110
101  e         3 0
104  h         1 101
115  s         1 100

Tree: (with added spaces for clarity)
0 1 01100101 0 0 1 01110011 1 01101000 0 1 11111111 0 1 01100011 1 00001010

EOF: 110

Remainder of file: (spaces added, includes EOF)
1110 101 0 0 100 0 1111 110

Remainder is padding to make it a full char:
000000

You can also use some Unix tools to examine your output files:

File passed through xxd:
0000000: 594b 9da1 ff58 e15b a91f 80              YK...X.[...

File passed through xxd -b: (bits)
0000000: 01011001 01001011 10011101 10100001 11111111 01011000  YK...X
0000006: 11100001 01011011 10101001 00011111 10000000           .[...

Design Ideas

You'll need to be dynamically creating nodes, so malloc() and free() are your friends. Be sure to free() all the allocated data once you are done with it, and fclose() all files you opened. Valgrind should report that there were no memory leaks.

You might want to create a node struct that can be used in both a linked list and a tree simultaneously. So, you'll want to have both "left" and "right" pointers as well as a "next" pointer.

You can create an array that has 256-buckets for your counts and hard-code the fact that your EOF node has a frequency of 1. This is the most straightforward approach.

If you want, you can create an array that has a valid position at index -1 by dynamically allocating an array of N items and then setting a pointer to the address of the second item in the array (the one at index 1). If you then use the pointer in an array context, you can go from index -1 to N-2. Just remember that you need to free() from the actual start of the array.
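
A small sketch of the offset-pointer idea, with made-up names alloc_counts and free_counts. The returned pointer refers to the second allocated item, so index -1 is the first one.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates n zeroed counters and returns a pointer
 * whose valid indices run from -1 to n-2.  Returns NULL on failure. */
unsigned long *alloc_counts(size_t n)
{
    unsigned long *base = calloc(n, sizeof *base);

    if (base == NULL)
        return NULL;
    return base + 1;         /* counts[-1] is base[0] */
}

/* Frees an array obtained from alloc_counts(). */
void free_counts(unsigned long *counts)
{
    if (counts != NULL)
        free(counts - 1);    /* step back to the real start of the block */
}
```

With n set to 257, counts[-1] can hold the EOF count and counts[0] through counts[255] the byte counts.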

You should print a message and exit if you attempt to malloc something and it fails. Rather than just cutting and pasting this throughout your code, why not write a function that does the malloc(), the check for failure, and perhaps some initialization.
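
Such a wrapper might look like the following sketch. The names checked_malloc and new_node, the message text, and the node layout are all illustrative choices.

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    unsigned long count;
    struct node *next;
    struct node *left, *right;
};

/* Hypothetical wrapper: malloc() that prints a message and exits on failure. */
void *checked_malloc(size_t size)
{
    void *p = malloc(size);

    if (p == NULL) {
        fprintf(stderr, "out of memory (requested %zu bytes)\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Allocates a node and initializes every field. */
struct node *new_node(int value, unsigned long count)
{
    struct node *n = checked_malloc(sizeof *n);

    n->value = value;
    n->count = count;
    n->next = NULL;
    n->left = NULL;
    n->right = NULL;
    return n;
}
```

Initializing the pointers here means the rest of the code can test for a leaf with a simple NULL check on left and right.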

There are also sample binaries for you to play with in ~rhoyle/pub/cs241/hw06/

handin

README

Create a file called README that contains

Your name and partner's name (if any)
A description of the programs
A listing of the files with a short one line description of the contents
Any known bugs or incomplete functions
An estimate of the amount of time you spent completing this assignment
Any interesting design decisions you'd like to share
Describe any unresolved warnings that are generated by valgrind and what you believe them to be caused by.

man pages

Also, create encode(1) and decode(1) man pages. See HW 3 for a refresher. These should be named encode.1 and decode.1 since these are user commands.

Now you should make clean to get rid of your executables/object files and handin your folder containing your source files, Makefile, and README.

Extra Credit

In this compression algorithm, we are looking at single characters to create our encoding tree. For extra credit, experiment with 2-3 character sequences, and see if they create a better compression tree.

Grading

Here is what I am looking for in this assignment:

A working Makefile with your program, all, and clean as targets
A program that will encode files using Huffman's algorithm as described above.
A program that will decode files encoded with Huffman's algorithm as described above.
An internal linked-list representation using structs
Output matching the sample program
Appropriately modular code
Good comments
Runs under valgrind with no errors or warnings
Man pages for each program
A README with the information requested above. The listing of known bugs is important.

Last Modified: April 19, 2017 - Roberto Hoyle from material created by Ben Kuperman
