
UNIT-5

Entropy encoding

An entropy encoding is a coding scheme that assigns codes to symbols so that
code lengths match the probabilities of the symbols. Typically, entropy encoders
compress data by replacing fixed-length codes with codes whose lengths are
proportional to the negative logarithm of each symbol's probability, so the
most common symbols receive the shortest codes.
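
A small illustration of this relationship (not part of the original unit; the probability values below are assumed purely for demonstration): each symbol's ideal code length is -log2(p) bits, so frequent symbols get the shortest codes.

```python
import math

# Assumed example distribution (not from the notes).
probabilities = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Ideal code length per symbol: -log2(p) bits.
for symbol, p in probabilities.items():
    print(f"{symbol}: p = {p:<6} ideal length = {-math.log2(p):.1f} bits")

# Entropy: the minimum achievable average code length for this source.
entropy = -sum(p * math.log2(p) for p in probabilities.values())
print(f"entropy = {entropy:.2f} bits/symbol")  # 1.75 for this distribution
```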

Repetitive character encoding


Repetitive character encoding is a technique used in computer science and data compression
where sequences of the same character are encoded to save space or simplify representation.
It is commonly associated with Run-Length Encoding (RLE), but the concept applies in
other areas as well.
Run-Length Encoding (RLE)
RLE compresses data by replacing each run of consecutive repeated characters with a
repetition count followed by a single copy of the character. For example:
● Input: AAAABBBCCDAA
● Encoded Output: 4A3B2C1D2A
This approach is most effective when the data contains many repeated characters.
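
The sketch below is a minimal Python illustration of this (not the notes' own code); it assumes plain character strings and reproduces the example above.

```python
import re

def rle_encode(text):
    """Replace each run of identical characters with <count><char>."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)

def rle_decode(encoded):
    """Expand <count><char> pairs back into the original string."""
    return "".join(ch * int(count)
                   for count, ch in re.findall(r"(\d+)(\D)", encoded))

coded = rle_encode("AAAABBBCCDAA")
print(coded)              # 4A3B2C1D2A
print(rle_decode(coded))  # AAAABBBCCDAA
```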

Other Applications of Repetitive Character Encoding


1. Text Processing:
o When analyzing or searching for patterns in text, repetitive encoding helps
standardize the representation of sequences.
o Example: Replace sequences of whitespace in text with a single space.
2. Error Detection and Correction:
o Repeating characters can add redundancy to help detect and correct
transmission errors.
o For instance, in telecommunication, the data stream 111000 could be
transmitted as 111111000000 to make errors easier to identify (see the sketch
after this list).
3. Bioinformatics:
o DNA sequences often contain repetitive patterns (e.g., AAAA or GGGG).
Encoding these sequences can help analyze genetic data efficiently.
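
The following is a tiny sketch of the repetition idea from item 2 above (an illustration only; real telecommunication systems use stronger error-detecting and error-correcting codes). Each bit is sent twice, so a mismatched pair signals an error.

```python
def repeat_encode(bits, n=2):
    """Transmit every bit n times, e.g. 111000 -> 111111000000."""
    return "".join(b * n for b in bits)

def find_errors(received, n=2):
    """Return indices of n-bit groups whose bits do not all agree."""
    groups = [received[i:i + n] for i in range(0, len(received), n)]
    return [i for i, g in enumerate(groups) if len(set(g)) != 1]

sent = repeat_encode("111000")            # "111111000000"
corrupted = sent[:4] + "0" + sent[5:]     # flip one transmitted bit
print(sent, find_errors(corrupted))       # error detected in group 2
```
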
Pros and Cons
Pros:
● Reduces storage requirements for repetitive data.
● Simplifies pattern recognition in certain datasets.
Cons:
● Ineffective for non-repetitive data (might even increase size).
● Can be computationally expensive for encoding/decoding large datasets.

Blank/Zero Encoding


Blank/Zero Encoding refers to methods of representing sequences of blank values (e.g.,
spaces or nulls) or zero-valued data in a compressed or efficient format. These
methods are commonly used in data compression, storage optimization, or
transmission efficiency. Here's an overview of the concept and its applications:

Key Concepts of Blank/Zero Encoding


1. Run-Length Encoding (RLE):
o Similar to repetitive character encoding, RLE is also used for blank/zero
encoding.
o Example:
▪ Input: 00000001000000
▪ Encoded: (7,0)(1,1)(6,0) (representing seven zeros, one one, and six
zeros).
2. Sparse Matrix Representation:
o Blank/zero encoding is often used in sparse matrices, where most of the
elements are zeros.
o Instead of storing all elements, only the non-zero elements and their positions
are recorded.
o Example:
▪ Matrix:
0 0 3
0 4 0
0 0 0
▪ Encoded as: [(0,2,3), (1,1,4)] (row, column, value); a small sketch of this
idea follows this list.
3. Bitmaps and Flags:
o Use a binary representation to encode blank or zero data.
o Example:
▪ Data: 0 0 0 1 0 0 0
▪ Encoded bitmap: 0001000.
4. Huffman Coding:
o Assign shorter codes to blanks/zeros if they appear frequently.
o Example:
▪ If 0 is the most common value in data, assign it a short Huffman code
(e.g., 1).
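
The sketch below (illustrative, not from the notes) shows the sparse-triple and bitmap ideas from items 2 and 3 and reproduces their examples.

```python
def sparse_triples(matrix):
    """Return [(row, col, value)] for the non-zero entries of a 2-D list."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]

def bitmap(values):
    """Mark non-zero positions with '1' and zeros/blanks with '0'."""
    return "".join("1" if v else "0" for v in values)

print(sparse_triples([[0, 0, 3],
                      [0, 4, 0],
                      [0, 0, 0]]))    # [(0, 2, 3), (1, 1, 4)]
print(bitmap([0, 0, 0, 1, 0, 0, 0]))  # 0001000
```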

Applications of Blank/Zero Encoding


1. Data Compression:
o Especially effective for datasets with a high frequency of zeros or blank spaces
(e.g., image compression, audio silences).
2. Transmission Efficiency:
o Reduces bandwidth requirements by omitting redundant zero data in
communication protocols.
3. Database Storage:
o Helps optimize storage for sparse tables or fields with many null/zero entries.
4. Machine Learning:
o In one-hot encoding or sparse feature representations, blank/zero encoding can
optimize memory usage.

Advantages
● Reduces memory and storage usage.
● Improves data transmission speed for repetitive or sparse datasets.
Disadvantages
● Encoding and decoding can add computational overhead.
● Not effective for dense datasets without a high prevalence of blanks/zeros.

Statistical Encoding

Statistical Encoding is a data compression technique that leverages the statistical
properties of data—such as the frequency of occurrence of symbols—to achieve
efficient encoding. The fundamental idea is to assign shorter codes to more frequently
occurring symbols and longer codes to less frequent ones, reducing the overall storage
or transmission cost.

Key Types of Statistical Encoding


1. Huffman Coding:
o Constructs a binary tree where frequently occurring symbols are assigned
shorter binary codes (a worked sketch follows this list).
o Example:
▪ Frequencies: A: 45, B: 13, C: 12, D: 16, E: 9, F: 5
▪ Encoded Output: A: 0, B: 101, C: 100, D: 111, E: 1101, F: 1100.
2. Arithmetic Coding:
o Encodes the entire message as a single fractional value in a range [0,1), based
on symbol probabilities.
o Example:
▪ Message: ABBA
▪ Probabilities: A=0.6, B=0.4
▪ Encoded value: a single fraction inside the final subinterval for the
message. With A mapped to [0, 0.6) and B to [0.6, 1), ABBA narrows the
interval to [0.504, 0.5616), so any value in it (e.g., 0.52) represents the
entire message.
3. Shannon-Fano Coding:
o Symbols are ordered by frequency, then hierarchically divided into groups,
and binary codes are assigned.
o Example:
▪ Frequencies: A: 8, B: 3, C: 1
▪ Encoded Output: A: 0, B: 10, C: 11.
4. Entropy Coding:
o Based on the concept of entropy, which measures the amount of information
or uncertainty in data.
o Typically used in combination with other statistical methods, e.g., Huffman or
Arithmetic coding.
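
As a worked sketch of the Huffman construction (an illustration, not a reference implementation), the code below builds a code table for the frequencies used in the example above; with this particular tie-breaking it reproduces the same codewords.

```python
import heapq

def huffman_codes(freqs):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    # Heap entries: (total frequency, tie-breaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}        # left branch
        merged.update({s: "1" + c for s, c in codes2.items()})  # right branch
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}))
# {'A': '0', 'C': '100', 'B': '101', 'F': '1100', 'E': '1101', 'D': '111'}
```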

Applications of Statistical Encoding


1. File Compression:
o Algorithms like ZIP and GZIP use statistical encoding methods.
2. Multimedia Data:
o Formats like JPEG, MP3, and MPEG use Huffman coding or similar
techniques for image and audio compression.
3. Network Protocols:
o Statistical encoding is used to minimize bandwidth for transmitting data with
predictable patterns.
4. Natural Language Processing (NLP):
o Encoding words based on their frequency in text corpora.

Advantages
● Optimized for datasets with skewed distributions, reducing average code length.
● Can achieve near-optimal compression ratios for well-understood distributions.
Disadvantages
● Ineffective for uniformly distributed data (no significant frequency difference).
● Requires preprocessing to calculate symbol frequencies, which can be
computationally expensive.

Source Encoding
Source Encoding, also known as data compression or source coding, is the process
of converting data from its original format into a compressed representation to reduce
redundancy and optimize storage or transmission. It is widely used in
telecommunications, multimedia, and data processing.

Key Concepts in Source Encoding


1. Redundancy Reduction:
o Removes unnecessary or repetitive information from data.
o Example: Compressing a text file by removing extra spaces or repeated
characters.
2. Entropy:
o Based on Shannon's Information Theory, entropy measures the minimum
average number of bits needed to encode symbols based on their probabilities.
o Example:
▪ Symbol probabilities: A=0.5, B=0.25, C=0.25.
▪ Minimum average bits per symbol = 0.5×1 + 0.25×2 + 0.25×2 = 1.5 bits.
3. Lossless vs. Lossy Compression:
o Lossless Compression: No information is lost; the original data can be
perfectly reconstructed (e.g., ZIP, PNG).
o Lossy Compression: Irrelevant or less noticeable data is discarded to achieve
higher compression ratios (e.g., JPEG, MP3).

Methods of Source Encoding


1. Statistical Encoding:
o Encodes data based on symbol frequencies.
o Examples:
▪ Huffman Coding: Assigns shorter codes to frequent symbols.
▪ Arithmetic Coding: Encodes entire data as a fractional number based
on probabilities.
2. Dictionary-Based Encoding:
o Builds a dictionary of patterns (substrings) and replaces patterns with
references.
o Examples:
▪ Lempel-Ziv (LZ) Coding: Replaces repeated sequences with pointers.
▪ LZW Compression: A variant of LZ78 used in formats like GIF (a small
sketch of the dictionary idea follows this list).
3. Run-Length Encoding (RLE):
o Replaces consecutive identical symbols with a single symbol and a count.
o Example: AAAABBB → 4A3B.
4. Transform-Based Compression:
o Converts data to a different domain where it is easier to compress.
o Examples:
▪ Discrete Cosine Transform (DCT): Used in JPEG.
▪ Fourier-related transforms (e.g., the modified DCT): Used in audio compression.
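
A compact sketch of the dictionary-based idea in item 2 (illustrative only; the real LZW used in GIF also manages code widths and clear codes, which are omitted here): repeated substrings are replaced by indices into a growing dictionary.

```python
def lzw_encode(data):
    """Encode a string as a list of dictionary indices (simplified LZW)."""
    dictionary = {chr(i): i for i in range(256)}  # single-character seeds
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[candidate] = next_code  # remember the new pattern
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABABABA"))  # [65, 66, 256, 258]
```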

Applications of Source Encoding


1. File Compression:
o Tools like ZIP, RAR, and 7z use source encoding to reduce file sizes.
2. Multimedia Formats:
o Used in MP3, JPEG, PNG, and MPEG for audio, image, and video
compression.
3. Data Transmission:
o Reduces bandwidth requirements in telecommunications (e.g., cellular data,
satellite communication).

Advantages
● Saves storage space.
● Reduces transmission time and bandwidth costs.
● Optimizes performance in systems with limited resources.
Disadvantages
● Encoding/decoding can be computationally expensive.
● Lossy methods may degrade data quality.

What is Vector Quantization?


Vector Quantization (VQ) is a technique used in signal processing, data compression,
and pattern recognition that involves quantizing continuous or discrete data into a
finite set of representative vectors, known as codebook vectors or centroids. The goal
of Vector Quantization is to minimize the distortion between the input data and the
codebook vectors, thereby achieving a compact representation of the data while
preserving as much information as possible.

What does Vector Quantization do?

Vector Quantization performs the following tasks:


● Clustering: Vector Quantization groups similar data points together based on a
similarity metric, such as Euclidean distance or cosine similarity, in order to create
clusters of similar data.
● Codebook generation: Vector Quantization creates a codebook, which is a set of
representative vectors (centroids) for each cluster. The codebook serves as a
compressed representation of the original data.
● Quantization: Vector Quantization replaces each data point with the index of the
closest codebook vector, effectively quantizing the data and reducing its size (see
the sketch below).
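
A minimal sketch of these three steps (an illustration under assumptions: NumPy and scikit-learn are available, k-means is used as one common way to build the codebook, and the data sizes are made up).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))       # 1000 two-dimensional input vectors

# Clustering + codebook generation: k-means centroids act as the codebook.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(data)
codebook = kmeans.cluster_centers_      # 16 representative vectors

# Quantization: each vector is replaced by the index of its nearest centroid.
indices = kmeans.predict(data)

# Reconstruction and distortion (mean squared error of the quantized data).
reconstructed = codebook[indices]
distortion = np.mean(np.sum((data - reconstructed) ** 2, axis=1))
print(codebook.shape, indices[:10], round(float(distortion), 3))
```
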
Some benefits of using Vector Quantization
Vector Quantization offers several benefits for data analysis and compression tasks:
● Data compression: Vector Quantization can achieve significant data compression with
minimal loss of information, making it suitable for applications like image and audio
compression.
● Noise reduction: Vector Quantization can help reduce noise in the data by replacing
individual data points with representative codebook vectors, leading to smoother and
more robust representations.
● Pattern recognition: Vector Quantization can be used to identify patterns or structures
in the data, which can be useful for tasks like classification, clustering, and feature
extraction.
