Huffman Coding

Given character frequencies, build a prefix-free binary code that minimizes total encoded length. Huffman’s greedy algorithm repeatedly merges the two smallest frequencies into a binary tree. This lesson covers the algorithm, tree construction trace, code assignment, and implementations in C, Java, and Python.

Overview

Variable-length codes save space: frequent symbols get shorter bit strings, rare symbols get longer ones. Huffman coding guarantees an optimal prefix code — no codeword is a prefix of another, so decoding is unambiguous.

Frequencies:  f:5  e:9  c:12  b:13  d:16  a:45

Greedy merges (two smallest each step):
  f+e → 14,  c+b → 25,  14+d → 30,  25+30 → 55,  a+55 → 100 (root)

Sample codes (left=0, right=1):
  a:0   c:100   b:101   f:1100   e:1101   d:111
  Weighted bit cost = 224 (optimal for these frequencies)

Problem Statement

Input: A set of characters (or symbols) with non-negative frequencies freq[c].

Output: A binary prefix code for each symbol minimizing
Σ freq[c] × code_length[c] (total weighted path length in the Huffman tree).

Prefix property: For any two symbols, neither codeword is a prefix of the other.

Complexity target: O(n log n) with a min-heap (n = distinct symbols); O(n) space for the tree and code table.

Greedy Algorithm

  1. Create a leaf node for each symbol with its frequency.
  2. Insert all nodes into a min-priority queue keyed by frequency.
  3. While more than one node remains:
    • Extract the two nodes with smallest frequency.
    • Create a new internal node with those two as children; its frequency = sum of child frequencies.
    • Insert the new node back into the queue.
  4. The last remaining node is the root of the Huffman tree.
  5. Assign 0 to left edges and 1 to right edges; leaf paths form codewords.
Key insight: The two least frequent symbols should have the longest codes — merge them first at the deepest level. Greedy bottom-up merging achieves minimum total cost.

Why Greedy Works

Exchange argument (sketch): In an optimal tree, the two symbols with lowest frequency can be assumed to be sibling leaves at maximum depth (swap them with deepest leaves without increasing cost). Removing these two leaves and replacing them with one parent reduces the problem size by one with adjusted frequencies — the same greedy step. By induction, Huffman’s merge rule is optimal.

Tie-breaking: When frequencies tie, different merge orders may produce different codewords, but the total encoded length remains optimal.

Code (C, Java, Python)

Keywords Types Functions Variables Strings Numbers Comments

Python

import heapq

class Node:
    __slots__ = ("freq", "char", "left", "right")
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq = freq
        self.char = char
        self.left = left
        self.right = right
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(freq_map):
    """Greedy: repeatedly merge two smallest frequencies. O(n log n)."""
    heap = [Node(f, c) for c, f in freq_map.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        parent = Node(left.freq + right.freq, None, left, right)
        heapq.heappush(heap, parent)

    return heap[0]  # root

def build_codes(root, prefix="", codes=None):
    if codes is None:
        codes = {}
    if root.char is not None:
        codes[root.char] = prefix or "0"
        return codes
    build_codes(root.left, prefix + "0", codes)
    build_codes(root.right, prefix + "1", codes)
    return codes


if __name__ == "__main__":
    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
    root = build_huffman_tree(freq)
    print(build_codes(root))

Java

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanCoding {

    static class Node implements Comparable<Node> {
        int freq;
        Character ch;
        Node left, right;
        Node(int f, Character c) { freq = f; ch = c; }
        Node(int f, Node l, Node r) { freq = f; left = l; right = r; }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    /** Build Huffman tree by merging two smallest nodes each step. */
    public static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        freq.forEach((c, f) -> pq.offer(new Node(f, c)));

        while (pq.size() > 1) {
            Node left = pq.poll();
            Node right = pq.poll();
            pq.offer(new Node(left.freq + right.freq, left, right));
        }
        return pq.peek();
    }

    public static void buildCodes(Node node, String path, Map<Character, String> codes) {
        if (node.ch != null) {
            codes.put(node.ch, path.isEmpty() ? "0" : path);
            return;
        }
        buildCodes(node.left, path + "0", codes);
        buildCodes(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = Map.of(
            'a', 45, 'b', 13, 'c', 12, 'd', 16, 'e', 9, 'f', 5);
        Map<Character, String> codes = new HashMap<>();
        buildCodes(buildTree(freq), "", codes);
        System.out.println(codes);
    }
}

C

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

typedef struct Node {
    char ch;
    int freq;
    struct Node *left, *right;
} Node;

Node *new_node(char ch, int freq, Node *l, Node *r) {
    Node *n = (Node *)malloc(sizeof(Node));
    n->ch = ch; n->freq = freq; n->left = l; n->right = r;
    return n;
}

/* Find two smallest nodes in array list; merge and compact. */
Node *build_huffman(Node *nodes[], int *n) {
    while (*n > 1) {
        int i, min1 = 0, min2 = 1;
        if (nodes[min2]->freq < nodes[min1]->freq) { int t = min1; min1 = min2; min2 = t; }
        for (i = 2; i < *n; i++) {
            if (nodes[i]->freq < nodes[min1]->freq) { min2 = min1; min1 = i; }
            else if (nodes[i]->freq < nodes[min2]->freq) min2 = i;
        }
        Node *parent = new_node('\0', nodes[min1]->freq + nodes[min2]->freq,
                               nodes[min1], nodes[min2]);
        nodes[min1] = parent;
        nodes[min2] = nodes[*n - 1];
        (*n)--;
    }
    return nodes[0];
}

void print_codes(Node *root, char *path, int depth) {
    if (!root) return;
    if (!root->left && !root->right) {
        path[depth] = '\0';
        printf("%c:%s ", root->ch, depth ? path : "0");
        return;
    }
    path[depth] = '0'; print_codes(root->left, path, depth + 1);
    path[depth] = '1'; print_codes(root->right, path, depth + 1);
}

int main(void) {
    Node *nodes[] = {
        new_node('a', 45, NULL, NULL), new_node('b', 13, NULL, NULL),
        new_node('c', 12, NULL, NULL), new_node('d', 16, NULL, NULL),
        new_node('e', 9, NULL, NULL),  new_node('f', 5, NULL, NULL)
    };
    int n = 6;
    Node *root = build_huffman(nodes, &n);
    char path[64];
    print_codes(root, path, 0);
    printf("\n");
    return 0;
}

Output & Trace

Input frequencies

Symbol Frequency Fixed 3-bit code Huffman code (sample) Code length
a4500001
b130011013
c120101003
d160111113
e910011014
f510111004

Fixed 3-bit encoding cost: 100 × 3 = 300 bits. Huffman weighted cost: 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.

Greedy merge trace

Step Merge (two smallest) New node freq Active multiset
0f5, e9, c12, b13, d16, a45
1f + e14c12, b13, d16, 14, a45
2c + b25d16, 14, 25, a45
314 + d3025, 30, a45
425 + 3055a45, 55
5a + 55100100 (root) ← done

Tree sketch (left = 0, right = 1)

                 (100)
                /     \
             a:45     (55)
                     /    \
                  (25)    (30)
                 /   \    /   \
              c:12 b:13 (14) d:16
                       /  \
                    f:5  e:9

Encoding & Decoding

  1. Encode: Replace each symbol with its Huffman codeword; concatenate bits.
  2. Decode: Walk from root — bit 0 goes left, 1 goes right; on reaching a leaf, emit symbol and restart at root.
  3. Store tree or code table with the compressed data (header overhead matters for small files).
Huffman is lossless — decoded text exactly matches the original. It works best when frequency skew is high (common in natural text and images).

Variants & Related Topics

Topic Description Notes
Canonical HuffmanStandardized code ordering for fast decodeUsed in DEFLATE/gzip
Adaptive HuffmanUpdate tree as symbols stream inNo upfront frequency table
Shannon-FanoSplit symbols by cumulative probabilityNear-optimal, not always minimal
Arithmetic codingEncode whole message as one fractionOften beats Huffman on small alphabets
Run-length + HuffmanTwo-stage compression pipelineJPEG, PNG use similar ideas

Approach Comparison

Approach Time Space Notes
Huffman greedy (min-heap)O(n log n)O(n)Optimal prefix code — standard
Fixed-length codesO(n)O(1)Simple but usually longer output
Brute force codeword assignmentExponentialImpractical
Shannon-FanoO(n log n)O(n)Good but not always optimal

Real-Life Applications

Domain Scenario Mapping
gzip / DEFLATECompress text and files for storageHuffman stage after LZ77
JPEGCompress DCT coefficientsHuffman tables in image headers
MP3 / AACEncode audio frequency dataEntropy coding layer
Fax machinesEncode run lengths of black/white pixelsModified Huffman codes
Network protocolsCompress HTTP headers (HPACK)Huffman-coded static/dynamic tables
DNA / bioinformaticsCompress symbol streams (A,C,G,T)Frequency-based prefix codes

Interview Tips

  1. State the greedy rule clearly: always merge the two smallest frequencies.
  2. Use a min-heap — mention O(n log n) time and why.
  3. Explain the prefix property and why it enables greedy decoding.
  4. Draw a small merge trace (4–6 symbols) if asked — interviewers love the step-by-step table.
  5. Distinguish from 0-1 knapsack and fractional knapsack — Huffman is a tree-building greedy, not ratio sorting.
  6. Mention real usage: gzip, JPEG — shows practical awareness.

Key Takeaway

Huffman coding = repeatedly merge two least frequent nodes into a binary tree, then assign 0/1 to edges. Produces an optimal prefix code minimizing total weighted bit length. Use a min-heap for efficiency.

Next up: Job sequencing · Fractional knapsack · Greedy introduction