Huffman Coding

Huffman coding for data compression: optimizing memory usage and transmission time.
Information Theory
Coding
Huffman coding
Author

Kishan Ved

Published

August 27, 2023

Unveiling Huffman Coding’s Data Compression Magic

In the world of information and data, efficiency is paramount. As we generate and consume vast amounts of digital content, the need to transmit and store this data in the most efficient manner has led to the development of various data compression techniques. One such remarkable technique is Huffman coding, which not only reduces data size but also plays a crucial role in modern data storage, transmission, and encryption. In this blog, we will delve into the mechanics of Huffman coding, its significance, and its applications across various domains.

Need for data compression

In a world inundated with data, efficient storage and transmission matter: whether it's sending files over the internet, storing data on devices with limited space, or optimizing database structures, the ability to compress data without losing information has become essential. A naive approach is to assign the same number of bits to every letter; for a 26-letter alphabet that means \(\lceil \log_2 26 \rceil = 5\) bits per letter, regardless of how often each letter occurs. Huffman coding does better by exploiting the letter frequencies, and is in fact optimal among prefix codes.

Huffman Coding

Huffman coding, named after its inventor David A. Huffman, is a variable-length prefix coding technique used for data compression. Unlike fixed-length codes, where every symbol is represented by the same number of bits, Huffman coding assigns shorter codes to frequently occurring symbols and longer codes to rarer ones. Because it is a prefix code, no code word is the prefix of another, so the encoded bit stream can be decoded unambiguously, and the frequency-aware assignment reduces the overall length of the encoded data.
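
As a small illustration (with a made-up four-symbol alphabet and hypothetical frequencies, not the quote analyzed below), a prefix code might assign a → 0, b → 10, c → 110, d → 111. The short sketch below shows that such a bit stream decodes unambiguously, with no separators needed between code words:

# Hypothetical prefix code for a tiny alphabet (frequencies are made up).
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def decode(bits, table):
    # Invert the table and scan the bit string; the prefix property guarantees
    # that the first code word matching the buffer is the correct one.
    inverse = {v: k for k, v in table.items()}
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

encoded = ''.join(code[ch] for ch in 'abacad')
print(encoded)                 # 01001100111
print(decode(encoded, code))   # abacad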

Algorithm

The Huffman coding algorithm follows a simple yet powerful process:

Frequency Calculation:

Determine the frequency of each symbol (character) in the data to be encoded.

Consider this long quote by Albert Einstein:

Life is like riding a bicycle. To keep your balance, you must keep moving. The important thing is not to stop questioning. Curiosity has its own reason for existing. One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery each day. The important thing is not to stop questioning; never lose a holy curiosity.

s = "Life is like riding a bicycle. To keep your balance, you must keep moving. The important thing is not to stop questioning. Curiosity has its own reason for existing. One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery each day. The important thing is not to stop questioning; never lose a holy curiosity."
s = s.lower()
s_len = 0
arr = [0]*26
d = {}
for i in range(len(s)):
  if(s[i]==' ' or s[i]==',' or s[i]=='.' or s[i]==';'):
    continue
  s_len+=1
  arr[ord(s[i])-ord('a')]+=1
for i in range(len(arr)):
  print(chr(i+ord('a')), arr[i])
  if arr[i]!=0:
    d[chr(i+ord('a'))] = arr[i]
a 16
b 4
c 10
d 3
e 43
f 9
g 9
h 14
i 36
j 0
k 3
l 14
m 10
n 28
o 34
p 9
q 2
r 19
s 23
t 41
u 12
v 3
w 3
x 1
y 13
z 0
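
Before moving on, a quick sanity check: the letter count s_len should equal the sum of the per-letter frequencies, and for this quote both are 359.

# Sanity check: number of encoded letters vs. sum of the individual frequencies.
print(s_len, sum(d.values()))   # 359 359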

Creating nodes for every letter:

Each symbol (letter) gets its own node, which stores the letter's frequency (the number of times it appears in the data), the bit assigned to it while the tree is built, and pointers to its left and right children.

class Node:
    def __init__(self, freq, symbol, left=None, right=None):
        self.freq = freq      # frequency of the symbol(s) in this subtree
        self.symbol = symbol  # the symbol (or concatenation of symbols for internal nodes)
        self.left = left      # left child
        self.right = right    # right child
        self.huff = ''        # bit ('0' or '1') assigned when this node is merged into a parent

    def __lt__(self, nxt):
        # Order nodes by frequency so heapq keeps the least frequent one at the top.
        return self.freq < nxt.freq
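
The __lt__ method is what allows heapq to compare nodes directly; a tiny check (reusing the Node class just defined):

# Nodes compare by frequency, so the heap always pops the least frequent one first.
print(Node(3, 'd') < Node(16, 'a'))   # True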

Priority Queue / Min Heap:

Create a priority queue (min-heap) keyed on the symbol frequencies. In a min-heap, the value at each child node is greater than or equal to the value at its parent, so the lowest-frequency node is always at the root.

import heapq

chars = list(d.keys())
freq = list(d.values())

nodes = []
for x in range(len(chars)):
    heapq.heappush(nodes, Node(freq[x], chars[x]))

Building the Huffman Tree:

Merge nodes:

Repeatedly remove the two nodes with the lowest frequencies from the priority queue and create a new node whose frequency is the sum of the frequencies of the two nodes. Insert this new node back into the priority queue. Continue this process until there is only one node left in the priority queue. This final node represents the root of the Huffman tree.

Construct the Huffman tree:

The last node remaining in the priority queue is the root of the Huffman tree. Traverse from the root down to the leaves, building the binary code for each character as you go: append '0' when taking a left branch and '1' when taking a right branch. When you reach a leaf, the accumulated bits form the Huffman code for that character.

while len(nodes) > 1:
    # Pop the two lowest-frequency nodes from the min-heap.
    left = heapq.heappop(nodes)
    right = heapq.heappop(nodes)
    # Record which branch each one becomes under the new parent.
    left.huff = '0'
    right.huff = '1'
    # Merge them into a new internal node and push it back onto the heap.
    new_symbol = left.symbol + right.symbol
    new_freq = left.freq + right.freq
    new_node = Node(new_freq, new_symbol, left, right)
    heapq.heappush(nodes, new_node)
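
As a quick sanity check on the merging step, exactly one node should remain on the heap, and its frequency should be the total number of encoded letters (359 for this quote):

# After merging, the heap holds only the root, whose frequency is the total letter count.
print(len(nodes), nodes[0].freq)   # 1 359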

Generating Codes:

Traverse the Huffman tree to assign codes to each symbol based on its position in the tree.

codes = {}
huffman_tree_root = nodes[0]  # the single node left on the heap is the root

def huffman_codes(node, val=''):
    # Append the bit stored on this node to the code built so far.
    newVal = val + node.huff
    if not node.left and not node.right:
        # Leaf node: the accumulated bits are this symbol's Huffman code.
        codes[node.symbol] = newVal
    else:
        huffman_codes(node.left, newVal)
        huffman_codes(node.right, newVal)

huffman_codes(huffman_tree_root)

for sym, code in codes.items():
  print(sym, code)
o 000
i 001
r 0100
c 01010
m 01011
t 011
e 100
s 1010
u 10110
d 1011100
v 1011101
w 1011110
x 10111110
q 10111111
y 11000
l 11001
n 1101
h 11100
k 1110100
b 1110101
f 111011
a 11110
g 111110
p 111111
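
To see the codes in action, here is a small sketch (reusing s, codes and huffman_tree_root from the cells above) that encodes every letter of the quote and then decodes the bit string by walking the tree; the prefix property guarantees that decoding is unambiguous:

# Encode: concatenate the code of every letter (spaces and punctuation are
# skipped, exactly as in the frequency count above).
letters = [ch for ch in s if 'a' <= ch <= 'z']
encoded = ''.join(codes[ch] for ch in letters)
print(len(encoded), "bits")   # 1485 for this quote

# Decode: walk the tree from the root, branching left on '0' and right on '1'.
decoded = []
node = huffman_tree_root
for bit in encoded:
    node = node.left if bit == '0' else node.right
    if not node.left and not node.right:   # reached a leaf
        decoded.append(node.symbol)
        node = huffman_tree_root
print(''.join(decoded) == ''.join(letters))   # True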

Visualizing the Huffman tree

The Huffman tree can be visualized using the graphviz Python library. The following code saves an image, huffman_tree.png, containing the Huffman tree.

import graphviz

def generate_graph(node, dot=None):
    if dot is None:
        dot = graphviz.Digraph(format='png')
        dot.attr(dpi='300', bgcolor='white')  # Set background color

    dot.node(str(id(node)), f"{node.symbol}:{node.freq}", style="filled", fillcolor="lightblue")  # Node color
    if node.left:
        dot.node(str(id(node.left)), f"{node.left.symbol}:{node.left.freq}", style="filled", fillcolor="lightgreen")  # Node color
        dot.edge(str(id(node)), str(id(node.left)), label="0", color="blue")  # Edge color
        generate_graph(node.left, dot)
    if node.right:
        dot.node(str(id(node.right)), f"{node.right.symbol}:{node.right.freq}", style="filled", fillcolor="lightgreen")  # Node color
        dot.edge(str(id(node)), str(id(node.right)), label="1", color="red")  # Edge color
        generate_graph(node.right, dot)

    return dot

# Uncomment the following 2 lines before running this cell
# graph = generate_graph(huffman_tree_root)
# graph.render("huffman_tree", cleanup=True)

The Huffman Tree obtained in our case is:

Why is Huffman coding better?

Let’s analyze the total number of bits needed to transmit the encoded message: total number of bits = \(\sum_{i}\) (frequency of letter \(i\)) \(\times\) (length of the Huffman code for letter \(i\)).

print("Without Huffman coding, we need 5 bits to represent every letter.\nThis means, to transmit all data, we need to transmit a total of: ")
l1 = len(s)*5
print(l1,"bits")

l2 = 0
for i in range(len(codes)):
  l2 += len(codes[list(codes.keys())[i]]) * d[list(codes.keys())[i]]
print("However, using Huffman coding, we need only:")
print(l2,"bits")
Without Huffman coding, we need 5 bits to represent every letter.
This means, to transmit all data, we need to transmit a total of: 
2240 bits
However, using Huffman coding, we need only:
1485 bits

Compression ratio:

The compression ratio is the ratio of the number of bits needed before compression to the number needed after compression; the higher the compression ratio, the better. For the quote above, Huffman coding achieves:

print("Compression ratio: ",l1," : ",l2," = ",(l1/l2))
Compression ratio:  2240  :  1485  =  1.5084175084175084

Entropy

First, we calculate the probability of every letter in the string:

prob = {}  # probability of each letter among the encoded letters

for sym in codes:
  prob[sym] = d[sym] / s_len

prob
{'o': 0.0947075208913649,
 'i': 0.10027855153203342,
 'r': 0.052924791086350974,
 'c': 0.027855153203342618,
 'm': 0.027855153203342618,
 't': 0.11420612813370473,
 'e': 0.11977715877437325,
 's': 0.06406685236768803,
 'u': 0.033426183844011144,
 'd': 0.008356545961002786,
 'v': 0.008356545961002786,
 'w': 0.008356545961002786,
 'x': 0.002785515320334262,
 'q': 0.005571030640668524,
 'y': 0.036211699164345405,
 'l': 0.03899721448467967,
 'n': 0.07799442896935933,
 'h': 0.03899721448467967,
 'k': 0.008356545961002786,
 'b': 0.011142061281337047,
 'f': 0.025069637883008356,
 'a': 0.04456824512534819,
 'g': 0.025069637883008356,
 'p': 0.025069637883008356}
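
As a quick sanity check, the probabilities should sum to 1 (up to floating-point rounding):

print(sum(prob.values()))   # ≈ 1.0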

Calculating the entropy: \(H(p) = -\sum_{i=1}^{n} prob[i]\,\log_2(prob[i])\) (where \(i\) denotes every character).

import numpy as np 

sym = list(d.keys())
entropy = 0

for i in range(len(sym)):
  entropy += -1*prob[sym[i]]*np.log2(prob[sym[i]])
print("Entropy of Huffman coding is:",entropy)
Entropy of Huffman coding is: 4.102837721940702

Cross Entropy

Calculating the cross entropy: \(H(p, q) = -\sum_{i=1}^{n} prob[i]\,\log_2(q[i])\) (where \(i\) denotes every character), and \(q[i] = \frac{1}{2^{l}}\) is the probability implied by a Huffman code of length \(l\).

Finding the \(q\) values (the code below actually stores \(\log_2 q[i]\), which is what the cross-entropy formula needs):

q = {} # stores log2 of the probability implied by each code length

for sym, code in codes.items():
  q[sym] = np.log2(1 / (2 ** len(code)))   # equals -len(code)
q
q
{'o': -3.0,
 'i': -3.0,
 'r': -4.0,
 'c': -5.0,
 'm': -5.0,
 't': -3.0,
 'e': -3.0,
 's': -4.0,
 'u': -5.0,
 'd': -7.0,
 'v': -7.0,
 'w': -7.0,
 'x': -8.0,
 'q': -8.0,
 'y': -5.0,
 'l': -5.0,
 'n': -4.0,
 'h': -5.0,
 'k': -7.0,
 'b': -7.0,
 'f': -6.0,
 'a': -5.0,
 'g': -6.0,
 'p': -6.0}

Calculating cross entropy:

cross_entropy = 0
sym = list(q.keys())

for i in range(len(sym)):
  cross_entropy += -1*prob[sym[i]]*q[sym[i]]

print("Cross entropy is:",cross_entropy)
Cross entropy is: 4.13649025069638
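
This cross entropy has a direct interpretation: it is the average Huffman code length in bits per letter, since each probability is just the letter's frequency divided by s_len, so \(\sum_i prob[i]\, l_i\) equals l2/s_len. A quick check (reusing l2 and s_len from above):

# Average number of bits per encoded letter: 1485 / 359 ≈ 4.136,
# which matches the cross entropy computed above.
print("Average code length:", l2 / s_len, "bits per letter")

Huffman coding guarantees that this average length lies within one bit of the entropy, which is consistent with the values 4.103 and 4.136 obtained here.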

KL Divergence

KL divergence is the difference between the cross entropy and the entropy: it measures the extra bits per letter incurred because the Huffman code lengths can only approximate the ideal lengths \(-\log_2 prob[i]\).

kl_divergence = cross_entropy - entropy
print("KL Divergence is:",kl_divergence)
KL Divergence is: 0.03365252875567837
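
Equivalently, the KL divergence can be computed directly as \(D_{KL}(p\,\|\,q) = \sum_{i} prob[i]\,(\log_2 prob[i] - \log_2 q[i])\); a small check reusing prob and q from above (recall that q already stores \(\log_2 q[i]\)):

# Direct computation: sum_i p_i * (log2(p_i) - log2(q_i)).
# q[sym] already holds log2(q_i) = -len(code), so no exponentiation is needed.
kl_direct = sum(p * (np.log2(p) - q[sym]) for sym, p in prob.items())
print("KL Divergence (direct):", kl_direct)   # should agree with cross_entropy - entropy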