1. Data Compression: Huffman Codes.

Data compression is fundamental to digital communication and storage. Consider a data file with 100,000 characters. What is the best way to store or transmit this file? Assume that the cost of storage or transmission is proportional to the number of bits required. In this example file, there are only 6 different characters, with their frequencies (in thousands) as shown below.

   Char      a    b    c    d    e    f
   Freq(K)   45   13   12   16   9    5

We want to design binary codes to achieve maximum compression. Suppose we use fixed length codes. Clearly, we need 3 bits to represent six characters. One possible such set of codes is:

   Char      a    b    c    d    e    f
   Code      000  001  010  011  100  101

Storing the 100K characters requires 300K bits using this code. Is it possible to improve upon this?

2. Huffman Codes.

We can improve on this using Variable Length Codes. Motivation: use shorter codes for more frequent letters, and longer codes for infrequent letters. One such set of codes is shown below.

   Char      a    b    c    d    e     f
   VLC       0    101  100  111  1101  1100

Note that some codes are shorter (1 bit), while others are longer (4 bits) than the fixed length code. Still, using this variable length code, the file requires

   1*45 + 3*13 + 3*12 + 3*16 + 4*9 + 4*5 Kbits,

which is 224 Kbits. The improvement is about 25% over fixed length codes. In general, variable length codes can give 20-90% savings, depending on the data.

3. Variable Length Codes.

However, we have a potential problem with variable length codes. With fixed length coding, decoding is trivial. This is not the case with variable length codes. Suppose 0 and 000 are the codes for letters x and y; what should the decoder do upon receiving 00000? We could insert special marker codes, but that reduces efficiency. Instead we consider PREFIX CODES: no codeword is a prefix of another codeword. So, (0, 000) is not a prefix code, but (0, 101, 100, 111, 1101, 1100), the example shown earlier, is a prefix code.
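The bit counts and the prefix property discussed so far can be checked with a short Python sketch. The names FREQ_K, FIXED, VLC, cost_kbits and is_prefix_code are assumptions introduced for illustration; only the frequencies and codewords come from the notes.

```python
# Frequencies (in thousands) and codewords copied from the tables in the notes.
FREQ_K = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
VLC = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def cost_kbits(code, freq_k):
    """Total file length in Kbits: sum of frequency * codeword length."""
    return sum(freq_k[ch] * len(word) for ch, word in code.items())


def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of (or equal to) another."""
    words = sorted(codewords)
    # In sorted order, a word that is a prefix of any other word is also
    # a prefix of its immediate successor, so adjacent pairs suffice.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


if __name__ == "__main__":
    print(cost_kbits(FIXED, FREQ_K), cost_kbits(VLC, FREQ_K))
    print(is_prefix_code(VLC.values()), is_prefix_code(["0", "000"]))
```

Run on the example, the two costs come out as 300 and 224 Kbits, and the pair (0, 000) is rejected as a prefix code while the variable length code is accepted.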
To encode, just concatenate the codes for each letter of the file; to decode, extract the first valid codeword, and repeat. Example: the code for `abc' is 0101100, and `001011101' uniquely decodes to `aabe' (0, 0, 101, 1101).

4. Representing Codes by a Tree.

Decoding is best represented by a binary tree, with letters as leaves. The code for a letter is the sequence of bits on the path from the root to that leaf.

[Figure: two binary trees with the letters at the leaves, one for the fixed length code and one for the Huffman code.]

5. Optimality.

An optimal tree must be full: each internal node has two children. Otherwise we can improve the code: an internal node with a single child can be replaced by that child, which shortens every codeword below it. Thus, by inspection, the fixed length code above is not optimal! (No codeword begins with 11, so the node for the prefix 1 has only one child.)

6. Measuring Optimality.

Let C be the alphabet. Let f(x) be the frequency of a letter x in C. Let T be the tree for a prefix code; let d_T(x) be the depth of x in T. The number of bits needed to encode our file using this code is:

   B(T) = \sum_{x in C} f(x) d_T(x)

We want a code whose tree T makes the quantity B(T) as small as possible.

7. Greedy Strategies.

Ideas for optimal coding?

(David Huffman developed his coding procedure in a term paper he wrote while a graduate student at MIT. He joined the faculty of MIT in 1953. In 1967, he became the founding faculty member of the Computer Science Department at UCSC. He died in 1999. Excerpt from a Scientific American article about this:

In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman's professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer's memory.
Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. "It was the most singular moment of my life," Huffman says. "There was the absolute lightning of sudden realization."

Huffman says he might never have tried his hand at the problem--much less solved it at the age of 25--if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. "It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem," he says.

"Huffman Codes" are used in nearly every application that involves the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television.)

HUFFMAN's ALGORITHM.

Initially, each letter is represented by a single-node tree. The weight of the tree is the letter's frequency. Huffman's algorithm repeatedly chooses the two smallest trees (by weight) and merges them. The new tree's weight is the sum of the two children's weights. If there are n letters in the alphabet, there are n-1 merges.

Pseudo-Code:

   Q <- C
   for i = 1 to n-1 do
      z <- allocateNode()
      x <- left[z] <- DeleteMin(Q)
      y <- right[z] <- DeleteMin(Q)
      f[z] <- f[x] + f[y]
      Insert(Q, z)
   return FindMin(Q)

8. Illustration of Huffman Algorithm.

   Initial sort:    f:5   e:9   c:12   b:13   d:16   a:45
   Merge, reorder:  c:12   b:13   f+e:14   d:16   a:45
   next:            f+e:14   d:16   c+b:25   a:45
   next:            c+b:25   (f+e)+d:30   a:45
   next:            a:45   (c+b)+((f+e)+d):55
   last:            a+((c+b)+((f+e)+d)):100

Final tree (weights of internal nodes in parentheses; 0 = left branch, 1 = right branch):

   (100)
     0: a:45
     1: (55)
          0: (25)
               0: c:12
               1: b:13
          1: (30)
               0: (14)
                    0: f:5
                    1: e:9
               1: d:16

9. Analysis of Huffman.

Time complexity is O(n log n): building the initial priority queue, plus n-1 merges, each of which uses a constant number of O(log n) heap operations. We now prove that the prefix code generated is optimal.
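As an aside before the proof, the pseudocode of section 7 and the merges of section 8 can be sketched in Python. This is a minimal sketch, assuming the standard library heapq module as the priority queue Q; the function names huffman_code and decode, the tuple-based tree representation, and the tie-breaking counter are assumptions of this sketch and are not part of the notes. Symbols are assumed to be strings.

```python
import heapq
from itertools import count


def huffman_code(freq):
    """Return {symbol: codeword}, merging the two lightest trees n-1 times."""
    tie = count()  # tie-breaker so the heap never has to compare trees
    # A tree is either a symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):
        fx, _tx, x = heapq.heappop(heap)  # DeleteMin
        fy, _ty, y = heapq.heappop(heap)  # DeleteMin
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))  # Insert
    root = heap[0][2]

    code = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # left branch
            walk(node[1], prefix + "1")  # right branch
        else:
            code[node] = prefix or "0"  # one-letter alphabet still needs a bit

    walk(root, "")
    return code


def decode(bits, code):
    """Prefix decoding: emit a symbol as soon as the buffer is a codeword."""
    inverse = {word: sym for sym, word in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)


if __name__ == "__main__":
    code = huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
    print(code)
    print(decode("001011101", code))
```

For the example frequencies there are no ties between weights, so the sketch performs the same merges as the illustration and yields the variable length code of section 2. With ties, a different but equally cheap tree may result.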
It is a greedy algorithm, and we use the standard swapping argument.

Lemma: Suppose x and y are the two letters of lowest frequency. Then there is an optimal prefix code in which the codewords for x and y have the same (and maximum) length and differ only in the last bit.

[Figure: three trees T, T', T''. In T, b and c are sibling leaves of maximum depth and x, y lie elsewhere. T' is T with x and b exchanged. T'' is T' with y and c exchanged, so x and y are sibling leaves of maximum depth.]

Proof. The idea of the proof is to take the tree T representing an optimal prefix code, and modify it to make a tree representing another optimal prefix code in which the characters x and y appear as sibling leaves of max depth. In that case, x and y will have the same code length, with only the last bit different.

Suppose b and c are the two characters that are sibling leaves of max depth in T. Without loss of generality, assume that f(b) <= f(c), and also that f(x) <= f(y). Because f(x) and f(y) are the two lowest frequencies, in order, they must satisfy f(x) <= f(b) and f(y) <= f(c).

We first transform T into T' by swapping the positions of x and b. Since d_T(b) >= d_T(x) (b is at max depth) and f(b) >= f(x), the swap does not increase the frequency * depth cost. Specifically,

   B(T) - B(T') = \sum_{p in C} f(p) d_T(p) - \sum_{p in C} f(p) d_T'(p)
                = [f(x) d_T(x) + f(b) d_T(b)] - [f(x) d_T'(x) + f(b) d_T'(b)]
                = [f(x) d_T(x) + f(b) d_T(b)] - [f(x) d_T(b) + f(b) d_T(x)]
                = [f(b) - f(x)] * [d_T(b) - d_T(x)]
                >= 0

Thus, this transformation does not increase the total bit cost. Similarly, we then transform T' into T'' by exchanging y and c, which again does not increase the cost. So, we get that B(T'') <= B(T') <= B(T). If T was optimal, so is T'', but in T'' x and y are sibling leaves and they are at the max depth.

10. Completing the proof.

The rest of the argument follows by induction. When x and y are merged, we pretend a new character z arises, with f(z) = f(x) + f(y). We compute the optimal code/tree for these n-1 letters: C' = C + {z} - {x,y}. Call this tree T'. (The names T and T' are reused from here on; they no longer refer to the trees in the lemma.) We then attach two new leaves to the node z, corresponding to x and y, obtaining the tree T.
This is now the Huffman Code tree for character set C.

Proof of optimality. The cost B(T) can be expressed in terms of the cost B(T'), as follows. For each c not equal to x or y, its depth is the same in both trees, so there is no difference. Furthermore, d_T(x) = d_T(y) = d_T'(z) + 1, so we have

   f(x) d_T(x) + f(y) d_T(y) = [f(x) + f(y)] * [d_T'(z) + 1]
                             = f(z) d_T'(z) + [f(x) + f(y)]

So, B(T) = B(T') + f(x) + f(y).

We now prove the optimality of Huffman's algorithm by contradiction. Suppose T is not an optimal prefix code. Then there exists an optimal tree T'' with B(T'') < B(T). By the earlier lemma, we may assume that T'' has x and y as siblings. Let T''' be the tree T'' with the common parent of x and y replaced by a leaf z, whose frequency is f(z) = f(x) + f(y). Then,

   B(T''') = B(T'') - f(x) - f(y)
           < B(T) - f(x) - f(y)
           = B(T')

which contradicts the assumption that T' is an optimal prefix code for the character set C' = C + {z} - {x,y}! End of proof.

11. Some other examples of greedy algorithms.

A. Knapsack Problem. We are given a knapsack of size K and a set S = {1, 2, ..., n} of items, not all of which can fit in the knapsack. Item i has value v_i and size s_i. The KNAPSACK problem is to choose the subset of S of highest value (objective) that fits in the knapsack (constraint).

Example. K = 110. Items (value, size): ($20, 100), ($15, 50), ($15, 50). Taking items in order of most to least valuable doesn't work: it takes the $20 item and then nothing else fits, while the two $15 items together are worth $30.

Example. K = 50. Items (value, size): ($60, 10), ($100, 20), ($120, 30). Even taking items in order of descending value/size doesn't work: it takes the first two items for $160, while the last two together are worth $220.

Fractional: However, if we are allowed to take *fractions* of the items, then the second greedy scheme does give the optimal solution.
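The two knapsack examples can be checked with a short Python sketch. The function names greedy_01, best_01 and greedy_fractional are assumptions introduced for illustration. The brute-force optimum is only meant for tiny inputs like these, and the 0/1 greedy is assumed to skip items that do not fit and keep going.

```python
from itertools import combinations


def greedy_01(items, capacity, key):
    """Take whole items in descending order of `key`, skipping any that do not fit."""
    value = 0
    for v, s in sorted(items, key=key, reverse=True):
        if s <= capacity:
            capacity -= s
            value += v
    return value


def best_01(items, capacity):
    """Brute-force optimum over all subsets (tiny examples only)."""
    return max(
        sum(v for v, _ in subset)
        for r in range(len(items) + 1)
        for subset in combinations(items, r)
        if sum(s for _, s in subset) <= capacity
    )


def greedy_fractional(items, capacity):
    """Take items by descending value/size, splitting the last one if needed."""
    value = 0.0
    for v, s in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        take = min(s, capacity)
        value += v * take / s
        capacity -= take
        if capacity == 0:
            break
    return value


if __name__ == "__main__":
    ex1 = [(20, 100), (15, 50), (15, 50)]   # (value, size), K = 110
    ex2 = [(60, 10), (100, 20), (120, 30)]  # (value, size), K = 50
    print(greedy_01(ex1, 110, key=lambda it: it[0]), best_01(ex1, 110))
    print(greedy_01(ex2, 50, key=lambda it: it[0] / it[1]), best_01(ex2, 50))
    print(greedy_fractional(ex2, 50))
```

On the first example, greedy by value collects $20 against an optimum of $30. On the second, greedy by value/size collects $160 against an optimum of $220, while the fractional version reaches $240 by taking two thirds of the last item.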