Z-Header format info

Z-Header is a text-based header format, intended to be used in other data formats to allow implementation of those formats within arbitrary restricted character sets.

Example project: Z-Header + ZHLZ

A tool for rendering z-headers with fancy syntax highlighting can be found at ./formatter.

Format description

Version: 1.0

Fundamentally, a z-header represents a sequence of characters interspersed with non-negative integers. The host format to which the z-header is attached may then use this information in its own encoding and decoding. The characters are intended to represent the set of characters used in the syntax of the host format, and the numbers are primarily intended to delineate different sections of the character sequence. However, no prescription is made for how this information should be used, and different host formats may use it in different ways.

For an example of how Z-Header might be used in a host format, see ZHLZ.

One of the fundamental concepts used by Z-Header is the idea of syntax characters, which are characters used in the syntax of the header itself. Three syntax characters are actually used in this version of the format. Because the format is designed to work with arbitrary character sets, the syntax characters have no predefined values. Throughout most examples in this document, however, the first syntax character will be represented as / (forward slash), the second as : (colon), and the third as # (hash symbol).

As the format is fairly complex and can't be represented well using simple examples, I will begin with a comprehensive overview of the format before providing examples of complete headers showcasing specific components.

Overview

Some general notes:

In this overview, /, :, and # will be used interchangeably with "the first syntax character", "the second syntax character", and "the third syntax character", respectively. It is important to note that this is just for the sake of readability, and in practice the syntax characters may have other values.
Z-headers as defined in this document are case-insensitive; in reading the format or choosing information to be encoded in a header, "q" for example should be considered identical to "Q".
Since z-headers are meant to be components of other formats, certain z-header properties may be dependent on the host format (i.e., a general-purpose Z-Header decoder may need to be provided with some information about the format the header is attached to, and certain other decoders may be designed with a specific host format in mind).

Format code

The header begins with a 4-character code providing a hint as to what host format the header is attached to. In general, this code is not intended to reliably indicate the format; rather, it is primarily a way for human readers to tell different formats apart where possible.

One caveat is that, depending on the host format, the format code may have a specific required value, or may even be omitted or replaced with other information; in any of these cases, however, it should be made clear in the host format's own documentation when and how these atypical properties are used.

Parameters

Following the format code is a list of syntax characters. This list terminates as soon as a duplicate character appears. There must be at least one syntax character, and up to three syntax characters may be required for certain Z-Header features. More syntax characters may appear, but are not actually used in the header syntax. All of the provided syntax characters will appear at the beginning of the sequence of characters represented by the header.

Example: /: indicates / as the first syntax character and : as the second syntax character (but does not specify a third syntax character). The header's output will begin with the characters / and :. (Also, to be clear, the header could just as easily list ab instead to have the syntax characters be a and b!)
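The duplicate-terminated list can be read with a short loop. The following Python sketch is illustrative only and not part of the specification: the function name `read_syntax_chars` is hypothetical, it assumes that a 4-character format code is present (which, as noted above, a host format may change), and it uses `str.lower()` as a stand-in for the format's case-insensitivity.

```python
def read_syntax_chars(header, start=4):
    """Read the syntax character list that follows the format code.

    Returns (syntax_chars, index), where index is the position of the
    duplicate character that terminated the list.
    """
    syntax_chars = []
    for index in range(start, len(header)):
        ch = header[index].lower()  # assumption: case-folding via lower()
        if ch in syntax_chars:
            return syntax_chars, index
        syntax_chars.append(ch)
    raise ValueError("syntax character list is not terminated by a duplicate")


print(read_syntax_chars("expl/:://ab:/c/.;/"))  # (['/', ':'], 6)
```

The character found at the returned index is the one that ended the list, so a decoder would inspect it to decide which part of the header comes next.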

Directly following the list of syntax characters, an indication of what character set the header should use may optionally appear. A character set in this sense does not impose any restrictions on what characters can appear; rather, a character set allows ranges of characters to be represented efficiently using their endpoints in the character set. If present, the character set indicator consists of a # (third syntax character) followed by a predefined prefix code written using syntax characters 1-3 (see character sets below).

If a character set indicator is not present, the host format may provide a default character set. If neither the header nor the host format provides a character set, character set 1 (see again character sets) is used.

After the syntax character list and, if present, the character set indicator, there may optionally appear a list of syntax characters to exclude from output. If present, this list is indicated by a : (second syntax character), which is followed by a list of syntax characters, in the same order that they were originally specified, to be excluded from automatically appearing in the character sequence represented by the header. For example, :/# causes the first and third syntax characters to be excluded from automatic output, meaning only : will automatically appear.
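One possible reading of the exclusion list is sketched below in Python. The function names are hypothetical, and the rule used to find the end of the list (stop at the first character that is not a syntax character or that breaks the original ordering) is an assumption of this sketch, inferred from the "same order" requirement rather than stated by the format.

```python
def read_exclusions(header, pos, syntax_chars):
    """Read an exclusion list starting at header[pos], just after its marker.

    Returns (excluded, index), where index is the first position after
    the list.
    """
    excluded = []
    last = -1  # position in syntax_chars of the most recent exclusion
    while pos < len(header) and header[pos] in syntax_chars:
        rank = syntax_chars.index(header[pos])
        if rank <= last:  # out of order, so no longer part of the list
            break
        excluded.append(header[pos])
        last = rank
        pos += 1
    return excluded, pos


def automatic_output(syntax_chars, excluded):
    """Syntax characters that still appear automatically in the output."""
    return [ch for ch in syntax_chars if ch not in excluded]


print(automatic_output(["/", ":", "#"], ["/", "#"]))  # [':']
```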

Explicit output

The rest of the header is prefixed by a / (first syntax character), and constitutes the main description of information represented by the header. This information consists of a sequence of one or more partitions, separated by partition separators.

A partition separator consists of a non-negative integer enclosed by # (the third syntax character) on either side. If the integer is zero, it's represented as the empty string (i.e., the partition separator is just ## with nothing in between). Otherwise, the number is represented in binary without leading zeros, using syntax characters 1-2 as digits 0-1, respectively. For example, #::/# represents a separator with the number 6, since ::/ represents the binary number 110, i.e. decimal 6.
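The number encoding can be checked with a small Python sketch. The function name `decode_separator` and the choice to raise `ValueError` on malformed input are assumptions made for illustration.

```python
def decode_separator(separator, syntax_chars="/:#"):
    """Decode a partition separator such as '#::/#' into its integer."""
    zero, one, fence = syntax_chars[0], syntax_chars[1], syntax_chars[2]
    if len(separator) < 2 or separator[0] != fence or separator[-1] != fence:
        raise ValueError("separator must be enclosed by the third syntax character")
    digits = separator[1:-1]
    if digits.startswith(zero):
        raise ValueError("leading zeros are not allowed")
    value = 0
    for ch in digits:
        if ch not in (zero, one):
            raise ValueError(f"unexpected character {ch!r} in separator")
        value = value * 2 + (1 if ch == one else 0)
    return value


print(decode_separator("#::/#"))  # 6
print(decode_separator("##"))     # 0
```

Passing a different `syntax_chars` value, for example "0xy", decodes the same structure with other syntax characters.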

In the header's output, the numbers will, conceptually at least, appear between the characters represented by the adjacent partitions. This is intended to allow the numbers to delineate the partitions and provide information about them, but they may be used for other purposes.

A partition consists of a sequence of one or more sub-sequences, separated by #/ (the third syntax character followed by the first).

A sub-sequence consists of a char-sequence followed by a range-sequence, separated by / (the first syntax character).

A char-sequence is just a list of individual characters to be added to the output. For example, the char-sequence abc adds a, b, and c, in that order, to the list of characters represented by the header.

A range-sequence (primarily) describes ranges of characters from the header's character set to be added to the output. Importantly, these ranges automatically exclude any syntax characters that are already included automatically in the output. Elements of a range-sequence are as follows:

A character within the character set, followed by another character within the character set, represents the range of characters in the character set that begins with the first character and ends with the second character. For example, using character set 1, cf outputs characters c, d, e, and f. If the second character appears before the first in the character set, the range is still read from left to right, but wraps back to the beginning of the character set once it reaches the character set's end. For instance, |" represents the sequence | } ~ ! " – the range begins at |, continues to the end of the character set at ~, wraps back around to the beginning of the character set at !, and continues to the end of the range at ".
A character within the character set that is not followed by a character within the character set, either because it's followed by a character outside the character set or because it's at the end of the range-sequence, represents a range starting with the given character and ending at the end of the character set. For example, y at the end of a range-sequence represents the sequence y z { | } ~.
A character outside of the character set represents that individual character.

Additionally, anywhere that a literal character can appear, an escape sequence may appear instead. An escape sequence consists of a : (second syntax character) followed by another syntax character, and represents that latter syntax character. If an escape sequence appears in a range, the syntax character is included in output. For instance, although the range .1 represents . 0 1 (excluding /, which would appear before 0), the range :/1 represents / 0 1, since the escape sequence :/ is used to explicitly include / as the beginning of the range.
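The range rules and the escape rule can be combined into one Python sketch. Everything here is illustrative: `expand_range_sequence` is a hypothetical name, the input is assumed to be the raw text of a single range-sequence that has already been cut out of the header, and `auto_output` is assumed to hold the syntax characters that were output automatically.

```python
# Character set 1: printable, non-space, non-uppercase Basic Latin characters.
CHARSET_1 = "".join(chr(c) for c in range(0x21, 0x7F) if not chr(c).isupper())


def expand_range_sequence(text, syntax_chars, auto_output, charset=CHARSET_1):
    """Expand the raw text of one range-sequence into a list of characters."""
    escape = syntax_chars[1] if len(syntax_chars) > 1 else None

    # Pass 1: resolve escape sequences into (character, was_escaped) tokens.
    tokens, i = [], 0
    while i < len(text):
        if text[i] == escape and i + 1 < len(text) and text[i + 1] in syntax_chars:
            tokens.append((text[i + 1], True))
            i += 2
        else:
            tokens.append((text[i], False))
            i += 1

    # Pass 2: pair up characters into ranges.
    output, i = [], 0
    while i < len(tokens):
        ch, escaped = tokens[i]
        if ch not in charset:
            output.append(ch)  # outside the set: an individual character
            i += 1
            continue
        keep = {ch} if escaped else set()  # escaped endpoints are always kept
        if i + 1 < len(tokens) and tokens[i + 1][0] in charset:
            last, last_escaped = tokens[i + 1]
            if last_escaped:
                keep.add(last)
            end = charset.index(last)
            i += 2
        else:
            end = len(charset) - 1  # open-ended: run to the end of the set
            i += 1
        pos = charset.index(ch)
        while True:
            c = charset[pos]
            if c in keep or c not in auto_output:
                output.append(c)
            if pos == end:
                break
            pos = (pos + 1) % len(charset)  # wrap past the end of the set
    return output


print("".join(expand_range_sequence('|"', "/", "/")))     # |}~!"
print("".join(expand_range_sequence(".1", "/:", "/:")))   # .01
print("".join(expand_range_sequence(":/1", "/:", "/:")))  # /01
```

The two-pass design keeps the escape handling separate from the range pairing, so an escaped endpoint can be remembered and exempted from the automatic exclusion.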

Finally, a special case: if the "explicit output" part of the header produces no output (neither characters nor numbers), the entire character set, excluding syntax characters that have already been output, is added to the output list.

End of header

The header is terminated by a / (first syntax character), after which follows the data of the host format.
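The nesting of partitions, sub-sequences, and the closing character can be sketched as a splitter that leaves the char-sequences and range-sequences as raw text. This is an illustrative Python sketch, not a reference implementation: `split_explicit_output` is a hypothetical name, `pos` is assumed to point just past the first syntax character that opens the explicit output, and the special case for empty output is not handled.

```python
def split_explicit_output(header, pos, syntax_chars):
    """Split the explicit output that starts at header[pos].

    Returns (partitions, separators, end), where each partition is a list of
    (char_sequence, range_sequence) pairs of raw text, separators holds the
    raw digits of each partition separator, and end is the index of the
    first character of host-format data.
    """
    s1 = syntax_chars[0]
    s2 = syntax_chars[1] if len(syntax_chars) > 1 else None
    s3 = syntax_chars[2] if len(syntax_chars) > 2 else None

    def read_until(pos, stops):
        # Collect raw text up to the next unescaped character in `stops`.
        start = pos
        while header[pos] not in stops:
            if header[pos] == s2 and header[pos + 1] in syntax_chars:
                pos += 1  # step over the escape marker
            pos += 1      # step over the (possibly escaped) character
        return header[start:pos], pos

    partitions, separators = [[]], []
    try:
        while True:
            chars, pos = read_until(pos, {s1})
            ranges, pos = read_until(pos + 1, {s1, s3})
            partitions[-1].append((chars, ranges))
            if header[pos] == s1:          # closing first syntax character
                return partitions, separators, pos + 1
            if header[pos + 1] == s1:      # "#/": next sub-sequence
                pos += 2
            else:                          # "#digits#": next partition
                close = header.index(s3, pos + 1)
                separators.append(header[pos + 1:close])
                partitions.append([])
                pos = close + 1
    except IndexError:
        raise ValueError("header ended before its closing syntax character")


header = "expl/:##//asd/wz#/7q~/46#:/:#321/hk/"
print(split_explicit_output(header, 10, "/:#"))
```

Because a non-zero separator number has no leading zeros, it never starts with the first syntax character, which is what lets the sketch tell a sub-sequence separator apart from a partition separator by looking at one character.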

Example headers

The sequence of data represented by each example header is listed below the header itself. The items are space-separated, and numbers appear enclosed in pipes (for example, |5|) to distinguish them from characters.

It is assumed in these examples that the host format does not provide a default character set, and therefore predefined character set 1 is used as a default.

A minimal header

One of the simplest headers is this:

expl////

Output: / ! " # $ % & ' ( ) * + , - . 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

The header begins with the format code "expl" (an abbreviation of "example"). The next character following that defines the first syntax character as /, which will now appear at the beginning of the output. As the next character after that has already appeared in the list of syntax characters, it indicates the end of the list, meaning that / is the only syntax character. And, as the character in question is equal to the first syntax character, it marks the beginning of the "explicit output" part of the header.

The explicit output consists of a single partition, with a single sub-sequence, with an empty char-sequence and empty range-sequence. Note that, although the char- and range-sequences are empty, the syntax character separating them still must appear. Because no output is actually specified, the entire character set will be added to output, excluding / since it's a syntax character.

Finally, the header is closed with the first syntax character.

Basic output

expl//a1¢x/lpz§79/

Output: / a 1 ¢ x l m n o p z { | } ~ § 7 8 9

This header starts out in the same way as the previous example—it has the format code "expl", / as the first syntax character, and then the marker for the beginning of the explicit output. The explicit output is also similar in that it has a single partition with a single sub-sequence, but this one has some actual content. The sub-sequence is separated into two parts by the first syntax character.

The first part, i.e. the char-sequence, consists simply of individual characters that get added to the output one-by-one. Note that a character outside the character set can appear here, and since these are just individual characters, it behaves exactly the same as the other characters.

The second part is the range-sequence. It begins with the characters l and p, which are both in the character set, meaning that they represent a range from l to p taken from the character set.

The next character, z, is within the character set, but the character following it is not; therefore, the z represents a range from z to the end of the character set. Since the § is outside the character set, it represents an individual character; the following two characters then represent another range.

The header is again closed by the first syntax character.

The second syntax character

expl/:://ab:/c/.;/

Output: : a b / c . / 0 1 2 3 4 5 6 7 8 9 ;

Following the format code, this header has two syntax characters before a duplicate appears; the first syntax character is / as before, and the second syntax character is :.

The character following the syntax character list is equal to the second syntax character, signaling a list of syntax characters to exclude from automatic output. The only character listed is /, followed by another / indicating the beginning of the explicit output. Since / has been excluded from automatic output, only : will appear automatically at the beginning of the output.

The explicit output consists of a single sub-sequence. The char-sequence of this sub-sequence lists four characters, one of which is an escaped /. Note that it does need to be escaped, even though it's been excluded from automatic output, since it still functions as a syntax character.

The range-sequence has a single range from . to ;. In the character set, this range covers both of the syntax characters, but only / actually appears in the output range, since it hasn't already been automatically output.

The third syntax character

expl/:##//asd/wz#/7q~/46#:/:#321/hk/

Output: / : # a s d w x y z 7 q ~ 4 5 6 |5| 3 2 1 h i j k

This header adds a third syntax character, #, in the list of syntax characters. Following that list is an instance of the third syntax character, marking a character set indicator. The text after that matches the predefined code /, indicating character set 1. In this case, it's not especially useful to explicitly state the character set, but it could be useful if the host format has its own default character set and you want to use character set 1 instead.

After the character set indicator is the beginning of the explicit output. The explicit output consists of two partitions. The partition separator between the two partitions contains the number :/:, i.e. binary 101 or decimal 5.

The first partition contains two sub-sequences, separated by #/. Having multiple sub-sequences like this can be useful for outputting individual characters after outputting ranges. Both of these sub-sequences have individual characters and ranges.

The second partition consists of only one sub-sequence, which also has individual characters and a range.

Syntax character values

Throughout this document, I've used /, :, and # as the first three syntax characters. However, I think it's important to emphasize that the syntax characters don't have to have those values, so here's an example where they don't:

expl0xy0+-0#&y0?!.00

Output: 0 x y + - # $ % & ? ! .

This header begins with the format code "expl" as with the other examples, but then the syntax characters listed are 0, x, and y for the first, second, and third syntax characters, respectively. The syntax character list is followed by an instance of the first syntax character, 0, marking the beginning of the explicit output.

The explicit output consists of a single partition with two sub-sequences, separated by y0 (i.e. the third syntax character followed by the first). The first sub-sequence contains both individual characters and a range, separated by 0, while the second sub-sequence contains only individual characters.

The header is closed by 0, the first syntax character.

Character sets

Here is a list of defined character sets, with prefix codes represented here using syntax characters /, :, and #, and numeric IDs derived by converting the prefix codes from bijective base 3 (with syntax characters 1-3 as digits 1-3 respectively) to decimal.

Currently, there is only one character set defined here:

/ (dec 1) – Defined as the collection of all non-control, non-whitespace, non-uppercase characters from the Basic Latin Unicode block, ordered by code point; i.e.: !"#$%&'()*+,-./0123456789:;<=>?@[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
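Both the ID numbering and the contents of character set 1 can be reproduced with a few lines of Python. The function names are hypothetical, and the sketch assumes the representation used above, with /, :, and # standing for syntax characters 1-3.

```python
def prefix_code_to_id(code, syntax_chars="/:#"):
    """Convert a prefix code to its numeric ID using bijective base 3."""
    value = 0
    for ch in code:
        digit = syntax_chars.index(ch) + 1  # syntax characters 1-3 are digits 1-3
        value = value * 3 + digit
    return value


def build_charset_1():
    """Non-control, non-whitespace, non-uppercase Basic Latin, by code point."""
    return "".join(chr(c) for c in range(0x21, 0x7F) if not chr(c).isupper())


print(prefix_code_to_id("/"))  # 1
print(len(build_charset_1()))  # 68
```

Longer codes would follow the same arithmetic (for instance, // would convert to 4), but only / is defined as a character set here.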