0% found this document useful (0 votes)

0 views

algorithm

The document describes the deflation algorithm used by zip and gzip, which is based on Lempel-Ziv 1977 and utilizes a hash table to find duplicated strings in input data. It also outlines the gzip file format standardized in RFC 1952, detailing its structure, compression methods, and optional fields. The gzip format is designed for efficient single-pass compression without needing prior knowledge of the uncompressed input size.

Uploaded by

en.artesgraficas

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as TXT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

0 views

algorithm

Uploaded by

en.artesgraficas

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as TXT, PDF, TXT or read online on Scribd

You are on page 1/ 3

1.

Algorithm

The deflation algorithm used by zip and gzip is a variation of

Lempel-Ziv 1977 [LZ77]. It finds duplicated strings in
the input data. The second occurrence of a string is replaced by a
pointer to the previous string, in the form of a pair (distance,
length). Distances are limited to 32K bytes, and lengths are limited
to 258 bytes. When a string does not occur anywhere in the previous
32K bytes, it is emitted as a sequence of literal bytes. (In this
description, 'string' must be taken as an arbitrary sequence of bytes,
and is not restricted to printable characters.)

Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block. The blocks can have any
size (except that the compressed data for one block must fit in
available memory). A block is terminated when zip determines that it
would be useful to start another block with fresh trees. (This is
somewhat similar to compress.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for
the next 3 bytes. If the hash chain for this index is not empty, all
strings in the chain are compared with the current input string, and
the longest match is selected.

The hash chains are searched starting with the most recent strings, to
favor small distances and thus take advantage of the Huffman encoding.
The hash chains are singly linked. There are no deletions from the
hash chains, the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily

truncated at a certain length, determined by a runtime option (zip -1
to -9). So zip does not always find the longest possible match but
generally finds a match which is long enough.

zip also defers the selection of matches with a lazy evaluation

mechanism. After a match of length N has been found, zip searches for a
longer match at the next input byte. If a longer match is found, the
previous match is truncated to a length of one (thus producing a single
literal byte) and the longer match is emitted afterwards. Otherwise,
the original match is kept, and the next match search is attempted only
N steps later.

The lazy match evaluation is also subject to a runtime parameter. If

the current match is long enough, zip reduces the search for a longer
match, thus speeding up the whole process. If compression ratio is more
important than speed, zip attempts a complete second search even if
the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression
modes (speed options -1 to -3). For these fast modes, new strings
are inserted in the hash table only when no match was found, or
when the match is not too long. This degrades the compression ratio
but saves time since there are both fewer insertions and fewer searches.

2. gzip file format

The gzip file format was standardized in Internet RFC 1952 [RFC1952].
This section briefly describes the format and comments on some
implementation details.

The pkzip format imposes a lot of overhead in various headers, which

are useful for an archiver but not necessary when only one file is
compressed. gzip uses a much simpler structure. Numbers are in little
endian format, and bit 0 is the least significant bit.
A gzip file is a sequence of compressed members. Each member has the
following structure:

2 bytes magic header 0x1f, 0x8b (\037 \213)

1 byte compression method (0..7 reserved, 8 = deflate)
1 byte flags
bit 0 set: file probably ascii text
bit 1 set: header CRC-16 present
bit 2 set: extra field present
bit 3 set: original file name present
bit 4 set: file comment present
bit 5,6,7: reserved
4 bytes file modification time in Unix format
1 byte extra flags (depend on compression method)
1 byte operating system on which compression took place

2 bytes optional part number (second part=1)

2 bytes optional extra field length
? bytes optional extra field
? bytes optional original file name, zero terminated
? bytes optional file comment, zero terminated
2 bytes optional 16-bit header CRC
? bytes compressed data
4 bytes crc32
4 bytes uncompressed input size modulo 2^32

The format was designed to allow single pass compression without any
backwards seek, and without a priori knowledge of the uncompressed
input size or the available size on the output media. If input does
not come from a regular disk file, the file modification time is set
to the time at which compression started.

The timestamp is useful mainly when one gzip file is transferred over
a network. In this case it would not help to keep ownership
attributes. In the local case, the ownership attributes are preserved
by gzip when compressing/decompressing the file. A timestamp of zero
is ignored.

Bit 0 in the flags is only an optional indication, which can be set by

a small lookahead in the input data. In case of doubt, the flag is
cleared indicating binary data. For systems which have different
file formats for ascii text and binary data, the decompressor can
use the flag to choose the appropriate format.

The extra field, if present, must consist of one or more subfields,

each with the following format:

subfield id : 2 bytes
subfield size : 2 bytes (little-endian format)
subfield data
The subfield id can consist of two letters with some mnemonic value.
Please send any such id to <[email protected]>. Ids with a zero second
byte are reserved for future use. The following ids are defined:

Ap (0x41, 0x70) : Apollo file type information

The subfield size is the size of the subfield data and does not
include the id and the size itself. The field 'extra field length' is
the total size of the extra field, including subfield ids and sizes.

It must be possible to detect the end of the compressed data with any
compression format, regardless of the actual size of the compressed
data. If the compressed data cannot fit in one file (in particular for
diskettes), each part starts with a header as described above, but
only the last part has the crc32 and uncompressed size. A decompressor
may prompt for additional data for multi-part compressed files. It is
desirable but not mandatory that multiple parts be extractable
independently so that partial data can be recovered if one of the
parts is damaged. This is possible only if no compression state is
kept from one part to the other. The compression-type dependent flags
can indicate this.

If the file being compressed is on a file system with case insensitive

names, the original name field must be forced to lower case. There is
no original file name if the data was compressed from standard input.

Compression is always performed, even if the compressed file is

slightly larger than the original. The worst case expansion is
a few bytes for the gzip file header, plus 5 bytes every 32K block,
or an expansion ratio of 0.015% for large files. Note that the actual
number of used disk blocks almost never increases.

Jean-loup Gailly
[email protected]

References:

[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
May 1977, pp. 337-343.

[RFC1952] Deutsch P., "GZIP file format specification version 4.3",

Internet RFC 1952, May 1996, <https://www.ietf.org/rfc/rfc1952.txt>.

APPNOTE.TXT documentation file in PKZIP 1.93a (October 1991). This

version no longer seems to be available online; the latest version is
in <https://www.pkware.com/documents/casestudies/APPNOTE.TXT>.

Top FAANG Interview Questions From LeetCode
No ratings yet
Top FAANG Interview Questions From LeetCode
15 pages
Woot19-Paper Fifield 0
No ratings yet
Woot19-Paper Fifield 0
11 pages
Gzip 114
No ratings yet
Gzip 114
7 pages
XZ File Format
No ratings yet
XZ File Format
20 pages
ZLib
No ratings yet
ZLib
3 pages
Tech Note
No ratings yet
Tech Note
5 pages
Tech Note
No ratings yet
Tech Note
5 pages
Tech Note
No ratings yet
Tech Note
7 pages
Gzip Is A Software Application Used For File Compression. Gzip Is Short For Gnu Zip The Program
No ratings yet
Gzip Is A Software Application Used For File Compression. Gzip Is Short For Gnu Zip The Program
37 pages
XZ File Format
No ratings yet
XZ File Format
20 pages
gzip
No ratings yet
gzip
8 pages
Inbound 5604542196618247233
No ratings yet
Inbound 5604542196618247233
2 pages
Linux Command On Archive & Compression
No ratings yet
Linux Command On Archive & Compression
11 pages
RML
No ratings yet
RML
13 pages
Lossless Data Compression Algorithm Abraham Lempel Jacob Ziv Terry Welch LZ78
No ratings yet
Lossless Data Compression Algorithm Abraham Lempel Jacob Ziv Terry Welch LZ78
9 pages
Bzip2 Format
No ratings yet
Bzip2 Format
28 pages
Windll
No ratings yet
Windll
3 pages
documentation-006-013
No ratings yet
documentation-006-013
8 pages
Misc C
No ratings yet
Misc C
3 pages
Zlib Man
No ratings yet
Zlib Man
2 pages
CS 300 Data Structures: Sabancı University Faculty of Engineering and Natural Sciences
No ratings yet
CS 300 Data Structures: Sabancı University Faculty of Engineering and Natural Sciences
6 pages
Introduction to Network Security - 2015 - Wang - Data Compression Using ZIP
No ratings yet
Introduction to Network Security - 2015 - Wang - Data Compression Using ZIP
2 pages
7-Zip Manual - P (Set Password) Switch - Documentation
No ratings yet
7-Zip Manual - P (Set Password) Switch - Documentation
107 pages
Lzma File Format
No ratings yet
Lzma File Format
3 pages
CS 11 - Machine Problem 2 PDF
No ratings yet
CS 11 - Machine Problem 2 PDF
3 pages
Project Guide Ms. Kavita Saxena
No ratings yet
Project Guide Ms. Kavita Saxena
14 pages
Name Synopsis Description
No ratings yet
Name Synopsis Description
2 pages
Chapter 6 Organizing Files For Performance Not Complete
No ratings yet
Chapter 6 Organizing Files For Performance Not Complete
65 pages
Static Int Const Unsigned Char Const Unsigned Char Int Unsigned Char
No ratings yet
Static Int Const Unsigned Char Const Unsigned Char Int Unsigned Char
3 pages
Lempel-Ziv-Welch (LZW) - Is A Universal Lossless Data Compression Algorithm Created by Abraham
No ratings yet
Lempel-Ziv-Welch (LZW) - Is A Universal Lossless Data Compression Algorithm Created by Abraham
5 pages
Run-Length Encoding
No ratings yet
Run-Length Encoding
3 pages
GZIP File Format Specification Version 4.3: Status of This Memo
No ratings yet
GZIP File Format Specification Version 4.3: Status of This Memo
12 pages
GZIP File Format Specification Version 4.3: Status of This Memo
No ratings yet
GZIP File Format Specification Version 4.3: Status of This Memo
12 pages
LZW (Lempel Ziv Welch) : 60.1 Brief History
No ratings yet
LZW (Lempel Ziv Welch) : 60.1 Brief History
4 pages
Lzma File Format
No ratings yet
Lzma File Format
3 pages
Improvised GZIP Published Eai.1!10!2019.160599
No ratings yet
Improvised GZIP Published Eai.1!10!2019.160599
8 pages
Archive Utilities: Learning Objectives
No ratings yet
Archive Utilities: Learning Objectives
20 pages
RecursiveDataCompressionMethod
No ratings yet
RecursiveDataCompressionMethod
26 pages
LZW Compression Algorithm
No ratings yet
LZW Compression Algorithm
4 pages
DEFLATE Compressed Data Format Specification Version 1.3: Status of This Memo
No ratings yet
DEFLATE Compressed Data Format Specification Version 1.3: Status of This Memo
15 pages
Class Notes CS 3137 1 LZW Encoding
No ratings yet
Class Notes CS 3137 1 LZW Encoding
5 pages
S_Run length encoding
No ratings yet
S_Run length encoding
14 pages
Lempel Ziv
No ratings yet
Lempel Ziv
22 pages
Change Log
No ratings yet
Change Log
28 pages
File System Basics File Compression Archiving and Backup
No ratings yet
File System Basics File Compression Archiving and Backup
9 pages
Framing Format (1)
No ratings yet
Framing Format (1)
3 pages
Why Needed?: Without Compression, These Applications Would Not Be Feasible
No ratings yet
Why Needed?: Without Compression, These Applications Would Not Be Feasible
11 pages
Read Me
No ratings yet
Read Me
4 pages
Spectra, Signals Report
No ratings yet
Spectra, Signals Report
8 pages
Lemp El Ziv Compression
No ratings yet
Lemp El Ziv Compression
6 pages
Gip
No ratings yet
Gip
2 pages
Huffman Zipper
No ratings yet
Huffman Zipper
11 pages
OS HOBBY (1)_merged (1)
No ratings yet
OS HOBBY (1)_merged (1)
25 pages
Chapter 4 Lossless Compression Algorithims
No ratings yet
Chapter 4 Lossless Compression Algorithims
30 pages
Unit 1 Data Compression
No ratings yet
Unit 1 Data Compression
30 pages
LZW Fundamentals: Lempel Ziv 1977 1978 Terry Welch's 1978 Algorithm 1984
No ratings yet
LZW Fundamentals: Lempel Ziv 1977 1978 Terry Welch's 1978 Algorithm 1984
9 pages
Lzma
No ratings yet
Lzma
11 pages
File Archiver
No ratings yet
File Archiver
1 page
Nintendo 64 Architecture: Architecture of Consoles: A Practical Analysis, #8
From Everand
Nintendo 64 Architecture: Architecture of Consoles: A Practical Analysis, #8
Rodrigo Copetti
No ratings yet
Nintendo DS Architecture: Architecture of Consoles: A Practical Analysis, #14
From Everand
Nintendo DS Architecture: Architecture of Consoles: A Practical Analysis, #14
Rodrigo Copetti
No ratings yet
Project Gutenberg "Best Of" CD August 2003
From Everand
Project Gutenberg "Best Of" CD August 2003
Project Gutenberg
No ratings yet
Top 50+ Java Collections Imp Interview Questions (2024)
No ratings yet
Top 50+ Java Collections Imp Interview Questions (2024)
54 pages
Cd3281 Dsa Question Bank
No ratings yet
Cd3281 Dsa Question Bank
81 pages
Plagiarism Detection Research
No ratings yet
Plagiarism Detection Research
23 pages
Cop3502 Final Study Guide 3 - 1
No ratings yet
Cop3502 Final Study Guide 3 - 1
28 pages
Reversing On Windows 2016
No ratings yet
Reversing On Windows 2016
78 pages
Calculate Client Security Hash - 2021.10 Complete Exercise Walkthrough
No ratings yet
Calculate Client Security Hash - 2021.10 Complete Exercise Walkthrough
12 pages
DataFusion Query Engine SIGMOD 2024-FINAL
No ratings yet
DataFusion Query Engine SIGMOD 2024-FINAL
13 pages
Kangaroo Methods For Solving The Interval Discrete Logarithm Problem
No ratings yet
Kangaroo Methods For Solving The Interval Discrete Logarithm Problem
54 pages
A Framework For Analyzing and Improving Content-Based Chunking Algorithms - 2005
No ratings yet
A Framework For Analyzing and Improving Content-Based Chunking Algorithms - 2005
11 pages
Unit 15
No ratings yet
Unit 15
6 pages
MMD 03
No ratings yet
MMD 03
53 pages
UCLA CS32 Project 4 Spec (Winter 2014)
No ratings yet
UCLA CS32 Project 4 Spec (Winter 2014)
33 pages
UG-comp-SY-24-25
No ratings yet
UG-comp-SY-24-25
93 pages
ADD Find Unsorted Array Sorted Array Linked List
No ratings yet
ADD Find Unsorted Array Sorted Array Linked List
27 pages
Manual Typo3 Indexed Search
100% (1)
Manual Typo3 Indexed Search
17 pages
Done DS GTU Study Material Presentations Unit-4 13032021035653AM
No ratings yet
Done DS GTU Study Material Presentations Unit-4 13032021035653AM
24 pages
Dsa Hashingppt
No ratings yet
Dsa Hashingppt
8 pages
Static and Dynamic Hashing.docx
No ratings yet
Static and Dynamic Hashing.docx
3 pages
Chapter - 2: Database Model Key-Value Data Store Document Databases Column Databases Graph Databases
No ratings yet
Chapter - 2: Database Model Key-Value Data Store Document Databases Column Databases Graph Databases
61 pages
BERLIN VERSION Beacfbd - 2022-10-24
No ratings yet
BERLIN VERSION Beacfbd - 2022-10-24
41 pages
UNIT-3 Part-A:Semantic Analysis 1. Intermediate Code Forms
No ratings yet
UNIT-3 Part-A:Semantic Analysis 1. Intermediate Code Forms
26 pages
C# Unit 2
No ratings yet
C# Unit 2
81 pages
Perl Cheat Sheet
100% (1)
Perl Cheat Sheet
2 pages
Chapter 7 - Run Time Environment
No ratings yet
Chapter 7 - Run Time Environment
12 pages
Search Courses: My Courses AWS Certi Ed Solutions Architect Associate Simple Storage Service (S3) - Quiz
No ratings yet
Search Courses: My Courses AWS Certi Ed Solutions Architect Associate Simple Storage Service (S3) - Quiz
55 pages
PPT-203105251-3
No ratings yet
PPT-203105251-3
35 pages
06_ HashMap & HashSet and how do they internally work_ What is a hashing function_ _ 800+ Big Data & Java Interview FAQs
No ratings yet
06_ HashMap & HashSet and how do they internally work_ What is a hashing function_ _ 800+ Big Data & Java Interview FAQs
7 pages
Fundamentals of C Lab Record New MANUAL2
No ratings yet
Fundamentals of C Lab Record New MANUAL2
76 pages
APDCL JM IT Paper 1 Series C JobAssam - in
No ratings yet
APDCL JM IT Paper 1 Series C JobAssam - in
15 pages

algorithm

Uploaded by

algorithm

Uploaded by

1.

The deflation algorithm used by zip and gzip is a variation of

To avoid a worst-case situation, very long hash chains are arbitrarily

zip also defers the selection of matches with a lazy evaluation

The lazy match evaluation is also subject to a runtime parameter. If

2. gzip file format

The pkzip format imposes a lot of overhead in various headers, which

2 bytes magic header 0x1f, 0x8b (\037 \213)

2 bytes optional part number (second part=1)

Bit 0 in the flags is only an optional indication, which can be set by

The extra field, if present, must consist of one or more subfields,

Ap (0x41, 0x70) : Apollo file type information

If the file being compressed is on a file system with case insensitive

Compression is always performed, even if the compressed file is

[RFC1952] Deutsch P., "GZIP file format specification version 4.3",

APPNOTE.TXT documentation file in PKZIP 1.93a (October 1991). This

You might also like