On Encodings

Encodings, In Detail
Compact Integer Encodings
Data Model Encodings
1. path
Capabilities

Status: Proposal (as of 17.01.2024)

Those encodings referenced from the Meadowcap specification have status Candidate.

alj: TODO: improve the TOC rendering (styling, top, Home?)alj: TODO: fix preview iframe heightA perhaps curious feature of the Willow data model is that its specification does not talk about encodings. We11Let’s be honest: Aljoscha strongly believe that specifications should concern themselves with purely logical data types as long as possible, treating encodings as a minor and ultimately interchangeable detail. When specifications define concepts in terms of their encodings, results usually end up miserably underspecified (see JSON) or full of incidental complexity (see XML).

Nevertheless, protocols that deal with persistent storage and network transmissions eventually have to serialise data. In this document, we give both some generic definitions around arbitrary encodings, and some specific encodings that recur throughout the Willow family of specifications.

Encodings, In Detail

The Willow protocols are generic over user-supplied parameters. Invariably, some values of the user-supplied types need to be encoded, so there also have to be user-supplied definitions of how to encode things. Hence, we need to precisely specify some properties and terminology around encodings.

Let $S$ be a set of values we want to encode. Then an encoding relationIn mathy terms: an encoding relation is a left-unique, left-total binary relation on $S \times {0, 1, \dots, 255}^{*}$ which defines a prefix code. assigns to each value in $S$ at least one (but possibly many) finite bytestrings, called the codes of that value. The assignment must satisfy the following properties:

each bytestring is the code of at most one value in $S$ , and
no code is a prefix of any other code.

An encoding relation in which every value has exactly one corresponding codeThe one-to-one mapping between values and codes lets us deterministically hash or sign values. is called an encoding function.

Quite often, we define an encoding relation for some set of values, and then specify a subset of its codes to obtain an encoding function. We call such a specialisation to a function a canonic subset or encoding.

A nice technique for achieving small codes is to not encode a value itself, but to encode merely how a value differs from some reference value. This requires both the party doing the encoding and the party doing the decoding to know that reference value, of course. We formally capture this notion in the concepts of relative encoding relationsSince the definitions require rather careful parsing, they are followed by an example. and relative encoding functions:

alj: TODO: fix font selection in Math post punctuationLet $S$ be a set of values we want to encode relative to some set of values $R$ . For each $s$ in $S,$ let $R_{s} \subseteq R$ denote the set of values relative to which $s$ can be encoded. Then a relative encoding relation assigns to each pair of an $s$ in $S$ and an $r$ in $R_{s}$ at least one (but possibly many) finite bytestrings, called the $r$ -relative codes of $s$ . For every fixed $r$ in $R,$ the assignment must satisfy the following properties:

each bytestring is the $r$ -relative code of at most one value in $S$ , and
no $r$ -relative code is a prefix of any other $r$ -relative code.

If there is exactly one $r$ -relative code for every $s$ in $S$ and every $r$ in $R_{s},$ we call the relative encoding relation a relative encoding function.

A toy example: we can encode unsigned bytes (natural numbers between $0$ and $255$ inclusive) relative to other unsigned bytes by encoding the difference between them as a bytestring of length one (the difference as a single unsigned byte). For simplicity, we allow this only when the difference is non-negative. So the $7$ -relative code of $12$ would be the byte 0x05 (because $12 - 7 = 5$ ). It would not be possible to encode $3$ relative to $7,$ however, since $3 - 7$ is a negative number.

In this example, both $S$ and $R$ are the set of unsigned bytes. For any unsigned byte $s,$ $R_{s}$ is the set of unsigned bytes greater than or equal to $s .$ For any fixed unsigned byte $r,$ all $r$ -relative code are unique (which also implies prefix-freeness, since all codes have the same length), so we have indeed defined a valid relative encoding relation. Note that codes might be duplicated across different $r$ — the $2$ -relative code of $3$ and the $123$ -relative code of $124$ are both 0x01, for example — but this is perfectly ok.

Compact Integer Encodings

In various places, we need to encode U64 values. When we expect small values to be significantly more common than large values, we can save space by using a variable-length encoding that encodes smaller numbers in fewer bytes than larger numbers. In this section, we define some building blocks for doing just that.

The basic idea is to use a tag — a small bitstring — to encode how many bytes will be used for encoding the actual number. We allow for actual encodings of either one, two, four, or eight bytes. To indicate those four options, we need tags of at least two bits. We can also allot more bits to a tag. When we do, the four greatest numbers that can be represented in the tag indicate the four possible encoding widths in bytes. Using any smaller number as the tag indicates that the tag itself is the U64.

Some examples before we give a formal definition: when using a four-bit tag, there are five different ways of encoding the number $7 :$

the tag can be $15$ ( $1111$ in binary), indicating an encoding in eight additional bytes, or
the tag can be $14$ ( $1110$ in binary), indicating an encoding in four additional bytes, or
the tag can be $13$ ( $1101$ in binary), indicating an encoding in two additional bytes, or
the tag can be $12$ ( $1100$ in binary), indicating an encoding in one additional byte, or
the tag can be $7$ ( $0111$ in binary), indicating an encoding in zero additional bytes.

When using a two-bit tag for the number $300,$ , there are only three options:

the tag can be $3$ ( $11$ in binary), indicating an encoding in eight additional bytes, or
the tag can be $2$ ( $10$ in binary), indicating an encoding in four additional bytes, or
the tag can be $1$ ( $01$ in binary), indicating an encoding in two additional bytes.

Now for the formal definitions:

alj: Fix colors and fonts in defs, refs, and math mode thereof.Let n be a U64, and let tag_width be a natural number between two and eight inclusive. Then the natural number t is a (valid) tag for n of width tag_width in all of the following cases:

$t = 2^{t a g_w i d t h} - 1$ , or
$n < 25 6^{4}$ and $t = 2^{t a g_w i d t h} - 2,$ or
$n < 25 6^{2}$ and $t = 2^{t a g_w i d t h} - 3,$ or
$n < 256$ and $t = 2^{t a g_w i d t h} - 4,$ or
$n = t$ and $t \leq 2^{t a g_w i d t h} - 5 .$

Note that there might be multiple tags to choose from for any given combination of a U64 and a width, but once the tag is selected, the corresponding compact U64 encoding is unique.For any U64 n and any tag t for n of width tag_width, the corresponding compact U64 encoding is:

the unsigned little-endian 8-byte encoding of n if $t = 2^{t a g_w i d t h} - 1,$
the unsigned little-endian 4-byte encoding of n if $t = 2^{t a g_w i d t h} - 2,$
the unsigned little-endian 2-byte encoding of n if $t = 2^{t a g_w i d t h} - 3,$
the unsigned little-endian 1-byte encoding of n if $t = 2^{t a g_w i d t h} - 4,$ and
the empty string otherwise.

Examples: the minimal tag for $7$ of width four is $7,$ and the minimal tag for $300$ of width two is $1 .$ We call a tag minimal for some given U64 and width if the corresponding compact U64 encoding is shorter than that for any other valid tag for the same U64 and width. This notion coincides with being the numerically least tag for a given U64 and width.

Data Model Encodings

We now list some useful encodings for the types of the Willow data model.

Path Encoding

We define an encoding relation EncodePath for Path. The codes in EncodePath for any Path val are the bytestrings that are concatenations of the following form:

Bits

Big-Endian Bitfield

0 – 3

A tag of width

4

for the sum of the lengths of the Components of val.

4 – 7

A tag of width

4

for the number of Components of val.

The corresponding compact U64 encoding for bits 0 – 3.

The corresponding compact U64 encoding for bits 4 – 7.

For every Component comp of val but the final one, a concatenation of the following form:

Bits	Big-Endian Bitfield
0 – 7	A tag of width $8$ for the length of comp.
The corresponding compact U64 encoding for bits 0 – 7.
The raw bytes of comp.

The length of the final Component can be reconstructed from the total length and the lengths of all prior Components.The raw bytes of the final Component of val, or the empty string if val has zero components.

We define the encoding function encode_path as the canonic subset of EncodePath obtained by using only minimal tags.

alj: Either make this visually more pleasing or remove it again.An example: encoding the Path blogideasfun with encode_path yields

0C 03 00 04 62 6C 6F 67 00 05 69 64 65 61 73 66 75 6E, because

0C is the minimal tag of width $4$ for the total path length ( $12$ ),
03 is the minimal tag of width $4$ for the number of Components,
the corresponding compact U64 encodings for both are the empty string, so they do not appear in the code,
00 04 is the minimal tag of width $8$ for the length of "blog",
the corresponding compact U64 encodings of the length of "blog" is the empty string, so it does not appear in the code,
62 6C 6F 67 is the ASCII encoding of "blog",
00 05 is the minimal tag of width $8$ for the length of "ideas",
the corresponding compact U64 encodings of the length of "ideas" is the empty string, so it does not appear in the code,
69 64 65 61 73 is the ASCII encoding of "ideas",
the length of "fun" is not encoded (it can be reconstructed as $12 - (4 + 5) = 3$ ), and
66 75 6E is the ASCII encoding of "fun".

We define a relative encoding relation EncodePathRelativePath for any Path relative to any Path. Let val be any Path, and let rel be any Path.

Let prefix_count be a natural number such that the first prefix_count Components of val are equal to the first prefix_count Components of rel.

Then the codes in EncodePathRelativePath for val relative to rel are the bytestrings that are concatenations of the following form:

Bits	Big-Endian Bitfield
0 – 7	A tag of width $8$ for prefix_count.
The corresponding compact U64 encoding for bits 0 – 7.
Any code in EncodePath for the path obtained from val by removing the first prefix_count Components.

We define the relative encoding function path_rel_path as the canonic subset of EncodePathRelativePath obtained by

choosing the greatest possible prefix_count,
using only minimal tags, and
using only the canonic subsets of all sub-encodings.

Capabilities

Encodings for Meadowcap and McSubspaceCapabilities.alj: fix Meadowcap link color

alj: TODO mc subspace cap absolute

alj: TODO mc cap absolute

We define a relative encoding relation EncodeMcSubspaceCapabilityRelativePrivateInterest for any McSubspaceCapability relative to any PersonalPrivateInterest with rel.private_interest.subspace_id == any and rel.private_interest.namespace_id == val.namespace_key. The codes in EncodeMcSubspaceCapabilityRelativePrivateInterest for any McSubspaceCapability val relative to any fitting PersonalPrivateInterest rel are the bytestrings that are concatenations of the following form:

Bits

Big-Endian Bitfield

0 – 7

A tag of width

8

for the number of pairs in val.delegations.

The corresponding compact U64 encoding for bits 0 – 7.

Any code in encode_namespace_sig for val.initial_authorisation.

If the number of pairs in val.delegations is nonzero: any code in encode_user_pk for val.user_key,
otherwise: the empty string.

For every pair (pk, sig) of val.delegations but the final one, a concatenation of the following form:alj: TODO fix styling of nested encoding without bitfields

Any code in encode_user_pk for pk.
Any code in encode_user_sig for sig.

If the number of pairs in val.delegations is nonzero: any code in encode_user_pk for the final UserSignature in val.delegations,
otherwise: the empty string.