Mathematical Preliminaries for Lossless Compression

In the modern digital era, the exponential growth of data has made efficient storage and transmission essential. Lossless compression, a critical technique in data management, achieves this efficiency by reducing file sizes without altering the original content. Unlike lossy compression, which trades accuracy for smaller sizes, lossless compression ensures the exact restoration of the original data, making it indispensable in fields such as scientific computing, medical imaging, and file archiving.

The foundation of lossless compression lies in mathematics, where concepts from information theory, probability, and algorithm design converge to optimize data representation. These mathematical principles enable the identification and elimination of redundancies, ensuring that data is stored and transmitted in a compact yet precise manner.

A Brief Introduction to Information Theory

Lossless compression, a crucial aspect of data management, aims to reduce file size while preserving the original data’s integrity. At the heart of this process lies the field of information theory, introduced by Claude Shannon. This field provides a mathematical foundation for understanding the limits and mechanisms of data compression. Key concepts such as self-information, entropy, and modeling are essential for designing and analyzing compression algorithms.

The concept of self-information quantifies the informativeness of an event. Mathematically, the self-information $i(A)$ of an event $A$ with probability $P(A)$ is given by:

$$i(A) = -\log_b P(A),$$

where $b$ is the base of the logarithm. The choice of $b$ determines the unit of information: base 2 for bits, base $e$ for nats, and base 10 for hartleys. The formula reflects that less probable events are more informative. For example, in a fair coin toss where $P(\text{Heads}) = 0.5$ and $P(\text{Tails}) = 0.5$, the self-information for either outcome is $i = -\log_2(0.5) = 1 \,\text{bit}$. However, for a biased coin where $P(\text{Heads}) = \frac{1}{8}$ and $P(\text{Tails}) = \frac{7}{8}$, the self-information for heads is $i(\text{Heads}) = -\log_2\left(\frac{1}{8}\right) = 3 \,\text{bits}$, while for tails it is $i(\text{Tails}) = -\log_2\left(\frac{7}{8}\right) \approx 0.193 \,\text{bits}$. This demonstrates that less likely outcomes convey more information.
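A minimal Python sketch (the `self_information` helper name is ours) reproduces the coin-toss numbers above:

```python
import math

def self_information(p, base=2):
    """Self-information -log_b(p), in bits when base = 2."""
    return -math.log(p, base)

print(self_information(0.5))    # fair coin, either outcome: 1.0 bit
print(self_information(1 / 8))  # biased coin, heads: 3.0 bits
print(self_information(7 / 8))  # biased coin, tails: ~0.193 bits
```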

Entropy extends the idea of self-information to measure the average information content of a random variable. For a source $X$ with possible outcomes $x_i$ and probabilities $P(x_i)$, the entropy $H(X)$ is defined as:

$$H(X) = -\sum_{i} P(x_i) \log_b P(x_i).$$

Entropy represents the theoretical lower bound for the average number of bits needed to encode the source. For example, in a fair coin toss, the entropy is:

$$H = -\left( P(\text{Heads}) \log_2 P(\text{Heads}) + P(\text{Tails}) \log_2 P(\text{Tails}) \right) = 1 \,\text{bit}.$$

For the sequence $S = \{1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10\}$, probabilities of each symbol can be calculated based on their frequency. For instance, $P(1) = P(6) = P(7) = P(10) = \frac{1}{16}$ and $P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = \frac{2}{16}$. Using the entropy formula, the entropy for this sequence is approximately $H \approx 3.25 \,\text{bits/symbol}$.
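The figure above can be checked with a short, self-contained Python sketch (the `entropy` helper is our own illustrative code) that estimates probabilities from symbol frequencies:

```python
from collections import Counter
from math import log2

def entropy(seq):
    """First-order entropy of a symbol sequence, in bits/symbol."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

S = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
print(entropy(S))  # ~3.25 bits/symbol
```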

Modeling is another key concept in lossless compression, where patterns or structures in data are exploited to improve compression. For example, consider the 20-symbol sequence $S = 1\,2\,1\,2\,3\,3\,3\,3\,1\,2\,3\,3\,3\,3\,1\,2\,3\,3\,1\,2$. If each symbol is treated independently, the entropy is $H = 1.5 \,\text{bits/symbol}$, requiring 30 bits to represent the sequence. However, if symbols are grouped into pairs drawn from $\{12, 33\}$, the probabilities simplify to $P(12) = P(33) = 0.5$, reducing the entropy to $H = 1 \,\text{bit}$ per pair, so the sequence can be encoded in just 10 bits. This demonstrates how identifying structure in data reduces redundancy and improves compression.
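The same frequency-based calculation, applied once per symbol and once per pair, reproduces the 30-bit versus 10-bit comparison (again a self-contained sketch with our own helper):

```python
from collections import Counter
from math import log2

def entropy(seq):
    """First-order entropy of a symbol sequence, in bits/symbol."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

S = "12123333123333123312"
print(entropy(S) * len(S))          # per-symbol model: ~30 bits total

pairs = [S[i:i + 2] for i in range(0, len(S), 2)]
print(entropy(pairs) * len(pairs))  # pair model: ~10 bits total
```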

Shannon also introduced the concept of source coding, showing that the best achievable compression encodes data with an average number of bits equal to the source’s entropy. While this limit cannot be perfectly achieved in practice, it serves as a benchmark for evaluating compression algorithms. For instance, in a sequence modeled as independent and identically distributed (i.i.d.), the first-order entropy calculation provides an estimate of the minimal bits required per symbol. For more complex sequences, higher-order models or adaptive techniques can further refine this estimate.

Derivation of Average Information

The derivation of the formula for average information, commonly referred to as entropy, follows directly from the properties we expect in a measure of information. This derivation, as first presented by Claude Shannon, is both elegant and rigorous.

Properties of Average Information

Given a set of independent events $A_1, A_2, \ldots, A_n$ with probabilities $p_i = P(A_i)$, the average information $H$ should satisfy the following properties:

  1. Continuity:
    $H$ should be a continuous function of the probabilities $p_i$. A small change in $p_i$ should lead to a small change in $H$. This ensures stability and mathematical soundness.

  2. Monotonicity:
    If all events are equally likely ($p_i = 1/n$), $H$ should be a monotonically increasing function of $n$. This reflects the intuitive idea that more possible outcomes increase the uncertainty or information content.

  3. Group Decomposition:
    If outcomes are grouped, the total average information $H$ should be the same whether calculated directly or in stages. For example, consider grouping outcomes $A_1, A_2,$ and $A_3$ into two groups $B_1 = \{A_1\}$ and $B_2 = \{A_2, A_3\}$. The total information should equal the information from identifying the group (e.g., $B_1$ or $B_2$) plus the information from identifying the specific event within the group.

    Mathematically:

    $$H = H(q_1, q_2) + q_1 H\left(\frac{p_1}{q_1}\right) + q_2 H\left(\frac{p_2}{q_2}, \frac{p_3}{q_2}\right),$$

    where $q_1 = p_1$ and $q_2 = p_2 + p_3$.
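A quick numerical check of this grouping identity, using the illustrative distribution $(p_1, p_2, p_3) = (0.2, 0.3, 0.5)$ rather than anything from the text:

```python
from math import log2

def H(*probs):
    """Entropy in bits of a probability distribution given as arguments."""
    return -sum(p * log2(p) for p in probs if p > 0)

p1, p2, p3 = 0.2, 0.3, 0.5
q1, q2 = p1, p2 + p3

direct = H(p1, p2, p3)
staged = H(q1, q2) + q1 * H(p1 / q1) + q2 * H(p2 / q2, p3 / q2)
print(abs(direct - staged) < 1e-12)  # True: both routes give the same value
```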

Step 1: Equally Likely Events

For $n$ equally likely outcomes ($p_i = 1/n$ for all $i$), the average information $H$ depends only on $n$. Denote this dependency as $A(n)$, so:

$$H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right) = A(n).$$

Using the group decomposition property, if we have $n = k^m$ outcomes, the occurrence of any event can be indicated by $m$ sequential choices, each with $k$ equally likely possibilities. Therefore:

$$A(k^m) = m A(k).$$

Let $n = k^m$. Taking the logarithm of $n$:

$$\log n = m \log k \quad\Rightarrow\quad m = \frac{\log n}{\log k}.$$

Substituting $m$ into the equation $A(k^m) = m A(k)$:

$$A(n) = \frac{\log n}{\log k} A(k).$$

Since this holds for any choice of $k$, the ratio $A(k)/\log k$ must be the same constant for every $k$; write $A(k) = K \log k$, where $K$ is a positive constant. Then:

$$A(n) = K \log n.$$

Thus, for equally likely events:

$$H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right) = K \log n.$$
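A small check (with arbitrary illustrative values of $n$) that the entropy of $n$ equally likely outcomes equals $\log_2 n$, i.e. $K = 1$ when the logarithm is taken base 2:

```python
from math import log2

def H_uniform(n):
    """Entropy in bits of n equally likely outcomes."""
    return -sum((1 / n) * log2(1 / n) for _ in range(n))

for n in (2, 4, 8, 16):
    print(n, H_uniform(n), log2(n))  # the last two columns agree: A(n) = log2(n)
```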

Step 2: Unequally Likely Events

Now consider the general case where events have unequal probabilities $p_1, p_2, \ldots, p_n$. Assume that these probabilities are rational and represent them as:

$$p_i = \frac{n_i}{\sum_{j=1}^n n_j}.$$

Interpret the $p_i$ as arising from $N = \sum_{j=1}^n n_j$ equally likely elementary outcomes, partitioned into $n$ groups where group $i$ contains $n_i$ outcomes, so that the probability of landing in group $i$ is $p_i$. Identifying an elementary outcome can then be done in two stages: first identify the group, then identify the outcome within it. By the group decomposition property,

$$H\left(\frac{1}{N}, \frac{1}{N}, \ldots, \frac{1}{N}\right) = H(p_1, p_2, \ldots, p_n) + \sum_{i=1}^n p_i H\left(\frac{1}{n_i}, \frac{1}{n_i}, \ldots, \frac{1}{n_i}\right).$$

From the equally likely case, the left-hand side equals $K \log N$ and

$$H\left(\frac{1}{n_i}, \frac{1}{n_i}, \ldots, \frac{1}{n_i}\right) = K \log n_i.$$

Solving for $H(p_1, \ldots, p_n)$ and using $p_i = n_i / N$:

$$H(p_1, p_2, \ldots, p_n) = K \log N - K \sum_{i=1}^n p_i \log n_i = -K \sum_{i=1}^n p_i \log \frac{n_i}{N} = -K \sum_{i=1}^n p_i \log p_i.$$

Step 3: Choosing the Constant $K$

By convention, we choose $K = 1$; the choice of $K$ simply fixes the unit of information (equivalently, the base of the logarithm). Thus, the final expression for entropy is:

$$H = -\sum_{i=1}^n p_i \log p_i.$$

Interpretation

The derived formula satisfies all three properties:

  1. Continuity: The function $-p_i \log p_i$ is continuous for $0 < p_i \leq 1$.
  2. Monotonicity: For equally likely probabilities ($p_i = 1/n$), $H$ increases with $n$.
  3. Group Decomposition: The grouping of events into subgroups does not affect the total entropy.

This derivation shows that the entropy formula is a natural consequence of the properties we require for a measure of average information. Like the laws of physics, these relationships emerge from the inherent structure of probabilities and information.

Advanced Concepts in Information Theory

1.1 Physical Models

Physical models in information theory are used to represent the real-world communication process. These models encapsulate both signal generation and the effects of noise on signal propagation, often relying on stochastic processes.

Example: Gaussian Channel Model
The additive white Gaussian noise (AWGN) channel is a common physical model:

$$Y = X + N,$$

where $Y$ is the received signal, $X$ is the transmitted signal, and $N \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with zero mean and variance $\sigma^2$. The capacity of this channel, derived using Shannon's formula, is:

$$C = \frac{1}{2} \log_2\left(1 + \frac{P}{\sigma^2}\right) \,\text{bits per channel use},$$

where $P$ is the power of the signal.
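The capacity formula is easy to evaluate; the signal-to-noise ratios below are purely illustrative:

```python
from math import log2

def awgn_capacity(snr):
    """AWGN capacity C = 0.5 * log2(1 + P / sigma^2), with snr = P / sigma^2."""
    return 0.5 * log2(1 + snr)

print(awgn_capacity(1))   # P equal to the noise variance: 0.5 bits per channel use
print(awgn_capacity(15))  # P = 15 * sigma^2: 2.0 bits per channel use
```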

1.2 Probability Models

Advanced probability models capture correlations and dependencies among events, essential for complex systems such as natural language processing or image analysis.

Example: Joint Entropy and Mutual Information
For two random variables $X$ and $Y$, the joint entropy is:

$$H(X, Y) = -\sum_{x, y} P(x, y) \log P(x, y).$$

Mutual information, quantifying the information shared between $X$ and $Y$, is:

$$I(X; Y) = H(X) + H(Y) - H(X, Y).$$

This measure is pivotal in evaluating the efficiency of coding schemes in noisy channels.
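A minimal sketch computing $I(X;Y)$ from a joint probability table with NumPy; the 2x2 toy distribution is ours, chosen only for illustration:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint probability table."""
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint.ravel())

# Toy joint distribution over two binary variables (rows: X, columns: Y).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))  # ~0.278 bits shared between X and Y
```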

1.3 Markov Models

Markov models generalize to higher-order models and to hidden Markov models (HMMs), which handle sequences driven by underlying latent processes.

Example: Hidden Markov Model (HMM)
An HMM assumes a sequence of observations $O = \{o_1, o_2, \ldots, o_T\}$ is generated by hidden states $S = \{s_1, s_2, \ldots, s_T\}$, with:

  • Transition probabilities: $P(s_t \mid s_{t-1})$,
  • Emission probabilities: $P(o_t \mid s_t)$.

Applications include speech recognition and bioinformatics.
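As a sketch, the forward algorithm below evaluates the likelihood of an observation sequence under a toy two-state HMM; the transition, emission, and initial-state values are illustrative, not taken from the text:

```python
import numpy as np

A = np.array([[0.7, 0.3],   # transition probabilities P(s_t | s_{t-1})
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],   # emission probabilities P(o_t | s_t)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])   # initial state distribution

def forward_likelihood(obs):
    """P(O) via the forward algorithm, summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 1, 1, 0]))  # likelihood of observing 0, 1, 1, 0
```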

1.4 Composite Source Models

Composite source models combine multiple sources to represent data heterogeneity. Advanced techniques involve mixing distributions for applications in data compression and machine learning.

Example: Mixture Models
A Gaussian Mixture Model (GMM) represents a composite source as:

$$P(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

where $\pi_k$ are mixing coefficients, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ are Gaussian components. The Expectation-Maximization (EM) algorithm is used for parameter estimation.
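A minimal fitting example using scikit-learn's `GaussianMixture` (the library pointed to in the references); the synthetic data and parameter choices are ours:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two Gaussian components.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 500),
                    rng.normal(3.0, 1.0, 500)]).reshape(-1, 1)

# Fit a two-component GMM; the parameters are estimated with EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_)  # estimated mixing coefficients pi_k
print(gmm.means_)    # estimated component means mu_k
```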

2. Advanced Coding Techniques

2.1 Uniquely Decodable Codes

Uniquely decodable codes extend to variable-length codes optimized for specific symbol distributions.

Example: Huffman Coding
For symbols $A, B, C, D$ with probabilities $0.4, 0.3, 0.2, 0.1$, Huffman coding constructs a prefix tree:

  • $A = 0$, $B = 10$, $C = 110$, $D = 111$. The expected code length is minimized:
$$L = \sum_{i} P(x_i)\, l(x_i) = (0.4)(1) + (0.3)(2) + (0.2)(3) + (0.1)(3) = 1.9 \,\text{bits}.$$
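A compact sketch that derives only the Huffman code lengths (all the expected-length calculation needs); the `huffman_lengths` helper is our own illustrative implementation:

```python
import heapq

def huffman_lengths(probs):
    """Code lengths from repeatedly merging the two least probable groups."""
    heap = [(p, [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    while len(heap) > 1:
        p1, syms1 = heapq.heappop(heap)
        p2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:    # every merge adds one bit to each member
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, syms1 + syms2))
    return lengths

probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
lengths = huffman_lengths(probs)
print(lengths)                                    # {'A': 1, 'B': 2, 'C': 3, 'D': 3}
print(sum(probs[s] * lengths[s] for s in probs))  # expected length: 1.9 bits
```

The resulting lengths match the codewords listed above, and the expected length of 1.9 bits compares against a first-order entropy of roughly 1.85 bits for this distribution.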

2.2 Prefix Codes

Advanced prefix codes consider redundancy and error-correcting properties. For instance, a Golomb code is a prefix code optimized for geometric distributions.

Example: Golomb Code
For integers $k$ with geometric probabilities $P(k) = (1-p)p^{k-1}$, a Golomb code with parameter $M$ encodes the quotient $\lfloor k/M \rfloor$ in unary and the remainder $k \bmod M$ in (truncated) binary; choosing $M$ to match $p$ yields an efficient representation.
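A sketch of one common Golomb-coding convention (quotient as a run of ones terminated by a zero, remainder in truncated binary); other bit conventions exist:

```python
from math import ceil, log2

def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m."""
    q, r = divmod(n, m)
    b = ceil(log2(m))
    cutoff = (1 << b) - m          # remainders below this need only b - 1 bits
    if r < cutoff:
        remainder = format(r, "b").zfill(b - 1)
    else:
        remainder = format(r + cutoff, "b").zfill(b)
    return "1" * q + "0" + remainder

for n in range(6):
    print(n, golomb_encode(n, m=3))  # 00, 010, 011, 100, 1010, 1011
```

When $M$ is a power of two the truncated-binary part reduces to a plain fixed-width remainder, the Rice-code special case.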

2.3 The Kraft-McMillan Inequality

The Kraft-McMillan inequality states that the codeword lengths $l_1, \ldots, l_n$ of any uniquely decodable binary code must satisfy $\sum_{i=1}^n 2^{-l_i} \leq 1$, and conversely that any lengths satisfying this inequality can be realized by a prefix code. Beyond checking the feasibility of a prefix code, it helps derive bounds on code length in advanced coding problems.

Example:
For an optimal code, equality holds:

$$\sum_{i=1}^n 2^{-l_i} = 1.$$

This result is foundational in proving the optimality of Huffman codes.
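A two-line check of the Kraft sum, first for the Huffman lengths from the earlier example (where the sum is exactly 1) and then for an infeasible set of lengths:

```python
def kraft_sum(lengths):
    """Kraft sum for binary codeword lengths; must be <= 1 to be decodable."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0   -> feasible, and in fact complete
print(kraft_sum([1, 2, 2, 3]))  # 1.125 -> no uniquely decodable code exists
```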

3. Algorithmic Information Theory

3.1 Minimum Description Length Principle (MDL)

The MDL principle balances the complexity of a model against its fit to data, directly linking to Kolmogorov complexity.

Example:
Given two models $M_1$ and $M_2$ for a dataset $D$:

  • $M_1$: low complexity (small $L(M_1)$) but poor fit (large $L(D \mid M_1)$),
  • $M_2$: higher complexity (large $L(M_2)$) but better fit (small $L(D \mid M_2)$).

The MDL selects the model minimizing total description length:

$$L(M) + L(D \mid M).$$
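As an illustration, the sketch below scores polynomial models of a noisy linear dataset with a crude two-part description length, roughly $(k/2)\log_2 n$ bits for $k$ parameters plus $(n/2)\log_2(\text{RSS}/n)$ bits for the residuals. This is only one simple MDL-style approximation, and all data and constants here are ours:

```python
import numpy as np

def mdl_score(x, y, degree):
    """Crude two-part description length (in bits) of a polynomial fit."""
    n, k = len(y), degree + 1
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    return 0.5 * k * np.log2(n) + 0.5 * n * np.log2(rss / n)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, x.size)  # truly linear data plus noise

for d in (1, 4, 8):
    print(d, round(mdl_score(x, y, d), 1))  # the linear model gets the lowest score
```

Only differences between scores matter here; the residual term omits constants that are identical across the candidate models.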

Advanced Applications:
MDL is used in selecting latent variables in Bayesian networks and in determining the optimal number of clusters in unsupervised learning.

4. Connecting the Pieces: Unified Insights

Advanced information theory combines models, coding, and algorithmic principles to optimize performance across diverse domains:

  1. Communication: Efficient data encoding (Huffman, Arithmetic codes) and reliable transmission (error-correcting codes).
  2. Machine Learning: Model selection (MDL) and representation learning (Markov and mixture models).
  3. Complex Systems: Quantifying interactions (mutual information) and uncovering patterns in biological, social, and physical systems.

References

  1. Books:
    • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience.
    • MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
    • Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press.
  2. Research Papers:
    • Shannon, C. E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27(3): 379–423, 623–656.
    • Rissanen, J. (1978). “Modeling by Shortest Data Description.” Automatica, 14(5): 465–471.
  3. Research Software and Tools:
    • Scikit-learn documentation for practical applications of Gaussian Mixture Models and Markov models: https://scikit-learn.org
    • Python libraries such as pymc3 for probabilistic programming and Bayesian inference.
  4. Additional Readings:
    • Berger, T. (1971). Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall.
    • Kullback, S., & Leibler, R. A. (1951). “On Information and Sufficiency.” Annals of Mathematical Statistics, 22(1): 79–86.
