ZHLZ format info

ZHLZ (/ˈzoʊlˌziː/, for "Z-Header Lempel-Ziv") is a simple text-based lossless compression format, based on the Lempel-Ziv family of algorithms. It uses a Z-Header to declare the characters that its compression syntax relies on.

Example project: Z-Header + ZHLZ

Format description

Version: 1.0

ZHLZ compresses data by replacing repeated sections with copy operations that copy earlier data. Each copy operation is represented by a length and a distance, in that order. The distance says how far back in the data the copy starts (a distance of 1 refers to the most recent character), and the length is the number of characters to copy, starting at that position and moving forward. During decompression, a range must be copied one character at a time, so that the length can be greater than the distance; this works because the number of characters available to copy grows as the copy operation is carried out.
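The character-by-character rule can be illustrated with a minimal Python sketch. The function name `copy_back` and the use of a list as the output buffer are assumptions made for illustration; they are not part of the format.

```python
def copy_back(output: list[str], distance: int, length: int) -> None:
    """Append `length` characters to `output`, starting `distance` characters back.

    `distance` and `length` are the actual values, not the values as written.
    """
    if distance < 1 or distance > len(output):
        raise ValueError("distance points outside the data produced so far")
    for _ in range(length):
        # Appending one character at a time lets the copy read characters
        # that this same copy operation has just written.
        output.append(output[-distance])


out = list("ab")
copy_back(out, distance=2, length=5)  # length is greater than distance
print("".join(out))  # prints "abababa"
```

A slice-based copy such as `output[-distance:][:length]` would stop early whenever the length exceeds the distance, which is why the loop is used here.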

This description will use some example compressed data to demonstrate elements of the format.

The example data before compression is:

Give a man a fire and he's warm for a day, but set fire to him and he's warm for the rest of his life.

(Terry Pratchett, Jingo)

And after compression:

zhlz,,,09,11Give a man a fire and he's warm for a day,, but set,0037to him,1344the rest of his life.

The compressed data begins with a Z-Header, which should provide a list of 3 or more characters. The recommended format code for the header is zhlz. The first character provided by the header is used as an insertion marker, while the remaining characters are used as digits (the first of them representing the digit 0, the second the digit 1, and so on) for any numbers used in the compression syntax. The base used to represent the numbers is equal to the number of digits provided.

In this example, the characters provided by the header are ",0123456789". The first character, the comma, is used as the insertion marker, and the remaining characters, the numerals 0 to 9 (of which there are 10), are used as digits to represent numbers in base 10.
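As a rough illustration of how the header characters map to numbers, the sketch below splits the example character list into a marker and digits and reads a number in the resulting base. `parse_number` is a hypothetical helper name, and parsing the Z-Header itself is not shown; the character list is assumed to have been extracted already.

```python
def parse_number(text: str, digits: str) -> int:
    """Read `text` as a number whose digit symbols are listed in `digits`."""
    base = len(digits)  # the base equals the number of digits provided
    value = 0
    for ch in text:
        # str.index raises ValueError if `ch` is not one of the digits
        value = value * base + digits.index(ch)
    return value


header_chars = ",0123456789"  # characters provided by the example header
marker, digits = header_chars[0], header_chars[1:]

print(marker)                      # prints ","
print(parse_number("37", digits))  # prints 37
print(parse_number("ba", "ab"))    # base 2 with a=0, b=1: prints 2
```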

Following the header are two single-digit numbers, the length length and the distance length. For both of these values, the actual value used is 1 greater than the value written. Here, the length length is written as 1, meaning the actual length length is 2. The distance length in this case has the same value. These values give the fixed number of digits used to write the length and the distance in every copy operation.

In the body of the data, characters by default represent themselves, and during decompression they are copied to the output as-is. The one exception is the insertion marker, which is used as a prefix to indicate copy operations. To represent an insertion marker character that appears in the data, an escape sequence must be used, consisting of an insertion marker prefixed with another insertion marker. In this example, a comma appears in the data and, since the comma is the insertion marker, it is represented as a sequence of two commas in the compressed data.
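On the compression side, the escaping rule amounts to doubling every insertion marker in literal text. A minimal sketch, assuming a hypothetical helper named `escape_literal`:

```python
def escape_literal(text: str, marker: str) -> str:
    """Escape literal text by doubling each occurrence of the insertion marker."""
    return text.replace(marker, marker + marker)


print(escape_literal("a day, but set", ","))  # prints "a day,, but set"
```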

A copy operation consists of the insertion marker, followed by the length and then the distance of the operation. The actual distance is 1 greater than the distance as written, and the actual length is (length as written) + n, where n is the minimum length that actually achieves compression, i.e. (length length) + (distance length) + 2. A copy operation occupies 1 + (length length) + (distance length) characters, so a copy has to produce at least one character more than that to save space. In this case, n = 2 + 2 + 2 = 6.
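The two offsets can be written out as a small calculation. This is only a sketch; the function and parameter names are made up for illustration, and the widths passed in are the actual length length and distance length.

```python
def actual_copy_values(
    length_written: int,
    distance_written: int,
    length_width: int,
    distance_width: int,
) -> tuple[int, int]:
    """Convert written copy fields into the (length, distance) actually used."""
    # n: shortest copy that is still smaller than its own encoding
    minimum = length_width + distance_width + 2
    return length_written + minimum, distance_written + 1


# Fields written as 00 and 37, with both widths equal to 2
print(actual_copy_values(0, 37, 2, 2))   # prints (6, 38)
# Fields written as 13 and 44
print(actual_copy_values(13, 44, 2, 2))  # prints (19, 45)
```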

The first copy operation in this example has a length written as 00 and a distance written as 37 (interpreted as decimal numbers 0 and 37 using the digits provided by the header), meaning the actual length is 0 + n = 6, and the actual distance is 37 + 1 = 38. Going back 38 characters in the output data preceding the copy operation leads to the space before the first occurrence of the word "fire". The length of 6 indicates that the section of data to be copied is the word "fire" together with the single space on each side of it. This copy operation produces the word "fire" and its surrounding spaces in the phrase "set fire to him".

The second copy operation has length and distance written as 13 and 44, meaning an actual length of 13 + 6 = 19 and an actual distance of 44 + 1 = 45. This copies the fragment "and he's warm for" along with the space on each side of it.
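Putting the rules together, the body of the example can be decoded with a short Python sketch. It assumes the Z-Header has already been parsed elsewhere, so the marker, the digits, and the data following the header are passed in directly; `decompress_body` and its parameters are hypothetical names, and error handling is kept minimal.

```python
def decompress_body(data: str, marker: str, digits: str) -> str:
    """Decode ZHLZ data that follows the header (header parsing not shown)."""
    base = len(digits)

    def number(text: str) -> int:
        value = 0
        for ch in text:
            value = value * base + digits.index(ch)
        return value

    # Two single-digit fields; each actual width is 1 greater than written
    length_width = digits.index(data[0]) + 1
    distance_width = digits.index(data[1]) + 1
    minimum = length_width + distance_width + 2

    out: list[str] = []
    i = 2
    while i < len(data):
        ch = data[i]
        if ch != marker:
            out.append(ch)  # ordinary character, copied as-is
            i += 1
        elif data[i + 1 : i + 2] == marker:
            out.append(marker)  # doubled marker is an escaped literal marker
            i += 2
        else:
            end = i + 1 + length_width + distance_width
            if end > len(data):
                raise ValueError("truncated copy operation")
            length = number(data[i + 1 : i + 1 + length_width]) + minimum
            distance = number(data[i + 1 + length_width : end]) + 1
            if distance > len(out):
                raise ValueError("distance points before the start of the data")
            for _ in range(length):
                out.append(out[-distance])  # one character at a time
            i = end
    return "".join(out)


body = ("11Give a man a fire and he's warm for a day,, but set"
        ",0037to him,1344the rest of his life.")
print(decompress_body(body, ",", "0123456789"))
```

Run on the example data with the header removed, this reproduces the original quotation.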