Huffman Coding

Huffman coding is a method of lossless data compression that assigns variable-length codes to input symbols based on their frequencies: frequent symbols receive short codes and rare symbols receive longer ones. It produces an optimal prefix code, meaning that no other prefix code can achieve a smaller average code length for the same symbol frequencies. Huffman coding is widely used in file compression formats, such as the DEFLATE compression used in ZIP files, to reduce the size of data for storage or transmission.

Written by Perlego with AI-assistance
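
To make the idea above concrete, here is a minimal sketch in Python of building a Huffman code table from symbol frequencies and using it to encode a string. The function names and structure are illustrative only, not any particular library's API.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a code table {symbol: bitstring} from the symbol frequencies in text."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, tree); leaves are plain symbols,
    # internal nodes are (left, right) tuples.
    heap = [(w, i, s) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:                    # merge the two least frequent trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tie, (left, right)))
        tie += 1
    codes = {}
    def walk(node, prefix):                 # 0 for the left branch, 1 for the right
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

text = "abracadabra"
codes = huffman_code(text)
encoded = "".join(codes[c] for c in text)
print(codes)                                # the frequent 'a' gets the shortest code
print(len(encoded), "bits vs", 8 * len(text), "bits as 8-bit characters")
```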

6 Key excerpts on "Huffman Coding"

Index pages curate the most relevant extracts from our library of academic textbooks. They have been created using an in-house natural language model (NLM); each page adds context and meaning to a key research topic.
  • Industry 4.0 Interoperability, Analytics, Security, and Case Studies
    • G. Rajesh, X. Mercilin Raajini, Hien Dang(Authors)
    • 2021(Publication Date)
    • CRC Press
      (Publisher)

    ...A binary Huffman tree is created from the secret data using the frequency of alphabets in it. The nodes of the tree are assigned a default code (0 to the left node and 1 to the right node) which could be used to represent the characters. This tree is used for encoding each character with the codes from root node to child node. The advantage of Huffman tree encoding is that it converts the data to binary format, which requires less storage space, so it works as a lossless data compression tool. Another benefit is that Huffman codes which represent the characters are such that there is no ambiguity in codes. As a default security mechanism, the decoding of messages cannot take place without using the exact same tree.

    3.5.4 Creation of Huffman Tree

    Huffman tree creation starts by assigning a weight to each character used in the message. This weight can be assigned by identifying the frequency of the character in the text or by using some other logic. Initially, all the characters are assumed to be individual trees and are sorted in increasing order. The tree pair having the least frequency is combined and a new tree is created which has a weight equal to the sum of weights of combined trees. This process is repeated until all the individual trees are combined to form one big tree. After all the nodes are combined, default codes are assigned to each node connection starting from the root node all the way up to the leaf nodes (Figure 3.1).

    FIGURE 3.1 Creation of Huffman tree using weights of characters

    3.5.4.1 Swapping Using Variable Length Key

    For swapping the codes a mutation key is required which is used to change the default codes on the tree branches. The key length should be a multiple of the height of the tree. If it is not, then it is made a multiple by adding an appropriate number of 0s in the end. Steps for performing swapping:
    i. Calculate the height of the tree, say n.
    ii. Compute the length of the key, make it a multiple of n by adding trailing 0s.
    iii...
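
A rough Python sketch of the tree construction and tree-based decoding described in this excerpt follows. The class and function names are illustrative, and the key-based code swapping of Section 3.5.4.1 is not implemented here.

```python
import heapq
from collections import Counter

class Node:
    """A Huffman tree node: leaves carry a symbol, internal nodes carry children."""
    def __init__(self, weight, symbol=None, left=None, right=None):
        self.weight, self.symbol, self.left, self.right = weight, symbol, left, right

def build_tree(message):
    # Start with one single-node tree per character, weighted by its frequency.
    heap = [(w, i, Node(w, symbol=s)) for i, (s, w) in enumerate(Counter(message).items())]
    heapq.heapify(heap)
    tie = len(heap)
    # Repeatedly combine the two lightest trees into one whose weight is their sum.
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tie, Node(w1 + w2, left=a, right=b)))
        tie += 1
    return heap[0][2]

def decode(bits, root):
    # Decoding needs the same tree: 0 follows the left branch, 1 the right,
    # and reaching a leaf emits its character and restarts from the root.
    out, node = [], root
    for b in bits:
        node = node.left if b == "0" else node.right
        if node.symbol is not None:
            out.append(node.symbol)
            node = root
    return "".join(out)
```

Encoding would walk from the root down to each character's leaf, recording 0 or 1 per branch taken, exactly as the excerpt describes.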

  • Digital Image Processing with Application to Digital Cinema
    • KS Thyagarajan(Author)
    • 2005(Publication Date)
    • Routledge
      (Publisher)

    ...Huffman codes are also called prefix codes because no code can be a prefix to any other code in the code set. The Huffman Coding procedure can be described as follows. Assume a data source with a finite alphabet and corresponding symbol probabilities.
    1. Sort the symbol probabilities in descending order. Treat each symbol in the ordered list as a terminal node.
    2. Merge the two least probable nodes to form a new node whose probability is the sum of the two probabilities of the merged nodes. Assign 1 and 0 to the two branches that lead to a single node. This assignment of 1 and 0 is arbitrary.
    3. Repeat step 2 until left with a single node.
    4. The Huffman codes for the symbols are then obtained by reading the branch digits sequentially from the root node to the terminal node pointing to the symbols.

    Example 8.1 Find the Huffman codes for the source with an alphabet of five symbols with probabilities of 0.5, 0.25, 0.15, 0.07, and 0.03. Determine the source entropy and the average code length.

    Solution Let the source symbols be denoted by s_i, 1 ≤ i ≤ 5, and the corresponding probabilities by p_i. First, sort the symbols and probabilities according to a descending order of the probabilities, as shown in Figure 8-5. Consider the sorted symbols as the terminal nodes of the code tree. Next, combine the two nodes with symbols s_5 and s_3 with corresponding probabilities of 0.03 and 0.07 to form a single node s'_3 whose sum equals 0.1. Assign 1 to the top branch and 0 to the bottom branch, as shown in Figure 8-5. This leaves us with the three symbols s_1, s_4, and s_2 and the newly formed node for a total of four nodes. Now combine the symbol nodes s_2 and s'_3 to form a single node s'_2 whose probability is the sum of 0.1 and 0.15. Assign a 1 to the top branch and a 0 to the bottom branch. At this stage we have three symbols, namely s_1, s_4, and s'_2. Note that the number of nodes decreases by 1 at each stage of the procedure...
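
Carrying Example 8.1 through to the end, the procedure yields code lengths of 1, 2, 3, 4 and 4 bits for the symbols taken in descending order of probability. A few lines of Python (a quick check, not taken from the book) confirm the resulting entropy and average code length:

```python
from math import log2

# Probabilities from Example 8.1, in descending order.
p = [0.5, 0.25, 0.15, 0.07, 0.03]
# Code lengths the procedure produces for these symbols: 1, 2, 3, 4, 4 bits.
lengths = [1, 2, 3, 4, 4]

entropy = -sum(pi * log2(pi) for pi in p)              # source entropy
avg_len = sum(pi * li for pi, li in zip(p, lengths))   # average code length
print(f"entropy             = {entropy:.2f} bits/symbol")   # about 1.83
print(f"average code length = {avg_len:.2f} bits/symbol")   # 1.85
```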

  • Introduction to Digital Audio
    • John Watkinson(Author)
    • 2013(Publication Date)
    • Routledge
      (Publisher)

    ...Variable-length coding is used in which frequently used letters are allocated short codes and letters which occur infrequently are allocated long codes. This results in a lossless code. The well-known Morse code used for telegraphy is an example of this approach. The letter e is the most frequent in English and is sent with a single dot. An infrequent letter such as z is allocated a long complex pattern. It should be clear that codes of this kind which rely on a prior knowledge of the statistics of the signal are only effective with signals actually having those statistics. If Morse code is used with another language, the transmission becomes significantly less efficient because the statistics are quite different; the letter z, for example, is quite common in Czech. The Huffman code is one which is designed for use with a data source having known statistics and shares the same principles with the Morse code. The probability of the different code values to be transmitted is studied, and the most frequent codes are arranged to be transmitted with short wordlength symbols. As the probability of a code value falls, it will be allocated longer wordlength. The Huffman code is used in conjunction with a number of compression techniques and is shown in Figure 5.4. The input or source codes are assembled in order of descending probability. The two lowest probabilities are distinguished by a single code bit and their probabilities are combined. The process of combining probabilities is continued until unity is reached and at each stage a bit is used to distinguish the path. The bit will be a zero for the most probable path and one for the least. The compressed output is obtained by reading the bits which describe which path to take going from right to left.

    Figure 5.4 The Huffman code achieves compression by allocating short codes to frequent values...
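
Because the resulting codes are prefix-free (no codeword is a prefix of another), a decoder can consume the bit stream greedily, one codeword at a time. A small Python sketch with a purely illustrative codebook (not the actual values in Figure 5.4):

```python
# Toy prefix-free codebook, illustrative values only.
codebook = {"e": "0", "t": "10", "z": "110", "q": "111"}
decode_table = {bits: sym for sym, bits in codebook.items()}

def decode(bitstream):
    """Since no code is a prefix of another, bits can be consumed greedily."""
    out, buf = [], ""
    for b in bitstream:
        buf += b
        if buf in decode_table:          # a complete codeword has been read
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

encoded = "".join(codebook[s] for s in "teetz")
print(encoded)          # 100010110
print(decode(encoded))  # teetz
```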

  • Compression for Great Video and Audio: Master Tips and Common Sense
    • Ben Waggoner(Author)
    • 2013(Publication Date)
    • Routledge
      (Publisher)

    ...The bit, a single 0 or 1, is the basis of all math on computers, and of information theory itself. Any number can be turned into a series of bits, be it an integer like “3,” a fraction like 7/9ths, or an irrational constant like pi (3.1415926535…). This means we can use bits as the basis for measuring everything that can be sampled and quantized. The More Redundancy in the Content, the More It Can Be Compressed English in plain text is rather predictable. For example, I’m much more likely to use the word “codec” at the end of this sentence than XYZZY. The more you understand the distribution of elements within the information you need to compress, the more efficient the compression can be. For example, the distribution of letters in English isn’t random: the letter “e” appears a lot more often than “q,” so it is more efficient to design a system where “e” can be coded in fewer bits than “q.” A table that translates from the original characters to shorter symbols is called a codebook. Shannon’s grad student David A. Huffman devised a method for generating an optimal codebook for a particular set of content (for a term paper, no less). Huffman Coding uses shorter codes for more common symbols and longer codes for less common ones. Different codebooks work better or worse for different kinds of content. A codebook for English won’t work well for Polish, which has a very different distribution of letters. For an English codebook, the Polish distribution of letters would require more bits to encode, possibly even more bits than the uncompressed original file. The more you know about what’s going to be compressed, the better a codebook can be generated. Taken to the next level, the order of letters is also not random...
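
To illustrate the codebook mismatch the author describes, here is a toy Python comparison of the expected bits per symbol when a code tuned to one letter distribution is applied to a different one. The distributions and the four-symbol codebook are made up purely for illustration.

```python
# Hypothetical letter distributions over a four-symbol alphabet (illustrative numbers).
p_language_a = {"e": 0.50, "t": 0.25, "z": 0.15, "q": 0.10}
p_language_b = {"e": 0.10, "t": 0.15, "z": 0.25, "q": 0.50}

# A prefix code tuned to language A: its most frequent symbol gets the shortest code.
code_for_a = {"e": "0", "t": "10", "z": "110", "q": "111"}

def expected_bits(distribution, codebook):
    """Average codeword length when this codebook encodes symbols drawn from distribution."""
    return sum(p * len(codebook[s]) for s, p in distribution.items())

print(expected_bits(p_language_a, code_for_a))  # 1.75 bits/symbol on matching statistics
print(expected_bits(p_language_b, code_for_a))  # 2.65 bits/symbol on mismatched statistics
```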

  • Art of Digital Audio
    • John Watkinson(Author)
    • 2013(Publication Date)
    • Routledge
      (Publisher)

    ...The well-known Morse code used for telegraphy is an example of this approach. The letter e is the most frequent in English and is sent with a single dot. An infrequent letter such as z is allocated a long complex pattern. It should be clear that codes of this kind which rely on a prior knowledge of the statistics of the signal are only effective with signals actually having those statistics. If Morse code is used with another language, the transmission becomes significantly less efficient because the statistics are quite different; the letter z, for example, is quite common in Czech. The Huffman code is one which is designed for use with a data source having known statistics and shares the same principles with the Morse code. The probability of the different code values to be transmitted is studied, and the most frequent codes are arranged to be transmitted with short wordlength symbols. As the probability of a code value falls, it will be allocated longer wordlength. The Huffman code is used in conjunction with a number of compression techniques and is shown in Figure 5.4.

    Figure 5.4 The Huffman code achieves compression by allocating short codes to frequent values. To aid deserializing the short codes are not prefixes of longer codes.

    The input or source codes are assembled in order of descending probability. The two lowest probabilities are distinguished by a single code bit and their probabilities are combined. The process of combining probabilities is continued until unity is reached and at each stage a bit is used to distinguish the path. The bit will be a zero for the most probable path and one for the least. The compressed output is obtained by reading the bits which describe which path to take going from right to left. In the case of computer data, there is no control over the data statistics. Data to be recorded could be instructions, images, tables, text files and so on; each having their own code value distributions...
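
One way to see how much a Huffman-style code could gain on arbitrary computer data is to measure the data's own byte-value statistics: the entropy of that measured distribution is a lower bound on the average code length any symbol-by-symbol prefix code can reach. A short, illustrative Python sketch:

```python
from collections import Counter
from math import log2

def bits_per_byte(data: bytes) -> float:
    """Entropy of the byte-value distribution of data, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(bits_per_byte(b"AAAAAAABBBCC"))    # skewed data: well under 8 bits per byte
print(bits_per_byte(bytes(range(256))))  # uniform data: exactly 8.0, nothing to gain
```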

  • Information Theory Meets Power Laws: Stochastic Processes and Language Models
    • Lukasz Debowski(Author)
    • 2020(Publication Date)
    • Wiley
      (Publisher)

    ...7 Coding and Computation As we have mentioned in the introduction to Chapter 5, there are two standard approaches to the definition of the amount of information: one initiated by Shannon, which applies probability, and another proposed by Kolmogorov, which uses algorithms. It turns out that in both approaches, the amount of information contained in an object is roughly equal to the minimal number of binary digits that we need to describe the object. In fact, we have not touched this interpretation yet and we would like to discuss it in this chapter. In a single package, we will present the coding interpretation of the Shannon entropy and its close links with Kolmogorov's algorithmic complexity, which implements the idea of the minimal coding in a direct way. This chapter develops formally ideas sketched in the half‐formal Sections 1.7 and 1.10. The general idea of coding, an ingenious invention of the twentieth century, is to represent arbitrary objects from a countable set – such as letters, words, sentences, or whole finite texts – as unique finite sequences of binary digits. With the advent of personal computers and mobile devices, this idea seems as obvious as the idea of the alphabet, but it took some effort to discover profound implications of this idea for foundations of mathematics, computer science, and statistics. In particular, the idea of coding led to discovery of unprovable statements in mathematics by Gödel (1931), which was further strengthened by Chaitin (1975b) as unprovability of randomness. Of course, the abstract idea of coding without computation would not be so important for the general public. As computers have become ubiquitous and indispensable for human civilization, scientists have started applying the metaphor of computation everywhere: to living organisms, to laws of physics, and to human brain...