======================
TESID design rationale
======================

TESID is made up of three parts:

2. Encryption,
3. String encoding, and
1. A convention for reducing occupation of the ID space and attaching a
   discriminant for types or other such details.

(I’ve numbered these items in their encoding order, but I’m going to discuss
them in this altered order. Things would get boring if list item numbers
always had to be in order.)

See the README for a basic overview of it all. This file explains why things
are the way they are.

2. 5peck: Speck, but with five-based word sizes and just one key
================================================================

Encryption is done with some new instances of the Speck cipher family, which
I have called 5peck (because it amused me). 5peck is pronounced “fpeck”, not
to be confused with “ſpeck”. (This also amuses me.) In places where an
identifier is required that cannot start with a numeral, I use fpeck.

There are two particular properties of 5peck ciphers:

• All instances share an expanded key, taking as many bits as they require of
  each round key. This is a significant deviation from Speck for small word
  sizes, since there’s not much point in longer keys in usual lightweight
  block cipher situations, due to other weaknesses like birthday attacks. But
  such reasons are inapplicable to 5peck.

• Word sizes are multiples of five, in order to match the base-32 alphabet
  that follows (32 is five bits). This is not normal because it’s more
  computationally expensive with no real benefit for normal cryptographic
  purposes, but there’s no reason not to do it when you have a reason like I
  do.

5peck ciphers don’t get names of their own (unlike Speck32/64, Speck128/128,
&c.). Wouldn’t want them getting self-important or anything.

In the reference implementation, all 5peck ciphers use a 64-bit machine word,
with appropriate rotation and wrapping addition and subtraction via masking,
but all that is implementation detail.

For key scheduling: a 128-bit key is expanded with n=64, m=2, T=30, α=9, β=2.
Encryption takes the n least significant bits of each round key.

5peck is defined for word sizes n={10, 15, 20, 25, 30, 35, 40, 45, 50}. In
all cases, T=30, α=9, β=2, and m is not applicable.

The notable omission from the sequence is n=5, which would produce 1024
two-character TESIDs. I omitted it because it’s not *that* useful, I’d need
to support a different value of at least α, and I was concerned about its
security, given how extremely it deviates from the pattern I see in the Speck
paper. (To reach 128-bit security, I think it’d need at least a few more
rounds, T=30 yielding only 150 bits in its expanded key.) So instead I start
at n=10 and four-character TESIDs.
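For concreteness, here is roughly what that comes out to in code. This is a
sketch assembled from the parameters just described, not the reference
implementation: the function names are mine, and which half of the 128-bit
key is taken as which key word is an arbitrary choice here.

    // Sketch of 5peck: one shared key expansion (Speck with n=64, m=2, T=30),
    // and encryption at word size n (a multiple of five), keeping each word
    // in a u64 and masking arithmetic to n bits.

    const T: usize = 30; // rounds, for key expansion and for every word size
    const ALPHA: u32 = 9;
    const BETA: u32 = 2;

    /// Expand a 128-bit key into T 64-bit round keys (the usual Speck schedule).
    fn expand_key(key: [u64; 2]) -> [u64; T] {
        let (mut k, mut l) = (key[0], key[1]);
        let mut round_keys = [0u64; T];
        for i in 0..T {
            round_keys[i] = k;
            l = k.wrapping_add(l.rotate_right(ALPHA)) ^ i as u64;
            k = k.rotate_left(BETA) ^ l;
        }
        round_keys
    }

    /// Encrypt one 2n-bit block (x, y), each word held in the low n bits of a u64.
    fn encrypt(round_keys: &[u64; T], n: u32, mut x: u64, mut y: u64) -> (u64, u64) {
        let mask = (1u64 << n) - 1;
        debug_assert!(x <= mask && y <= mask);
        // Rotations of an n-bit word stored in the low bits of a u64.
        let ror = |v: u64, r: u32| ((v >> r) | (v << (n - r))) & mask;
        let rol = |v: u64, r: u32| ((v << r) | (v >> (n - r))) & mask;
        for &rk in round_keys {
            // The usual Speck round, taking the n least significant bits of
            // the 64-bit round key, with addition wrapping at 2^n via the mask.
            x = (ror(x, ALPHA).wrapping_add(y) & mask) ^ (rk & mask);
            y = rol(y, BETA) ^ x;
        }
        (x, y)
    }

Decryption is the same rounds run in reverse with the inverse operations
(wrapping subtraction and the opposite rotations).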
Justification for the single expanded key and T=30
--------------------------------------------------

Probably the most significant deviation from Speck is the 128-bit key used
all round: does it or can it lead to 128-bit security, even at n=10? And how
many rounds do we need?

These are the numbers of rounds for the defined Speck instances:

• For n=16: T = 18 + m
• For n=24: T = 19 + m
• For n=32: T = 23 + m
• For n=48: T = 26 + m
• For n=64: T = 30 + m

I’m going to generalise and call this T = w + m, where w is very
approximately proportional to n + c for some constant c. n=24 is a mild
outlier from my predictions: I’d have expected w=20, not w=19. And assuming
that w is *linearly* correlated to n is hubris.

The fact that m is added rather than multiplied shows that security is not
particularly strongly proportional to the number of round-key bits:
Speck128/128 has 16 bits of round key per bit of security, Speck128/256 has
only 8.5, and Speck32/64 has 5.5. You clearly can’t take this too far, as
there’s an obvious hard lower bound of 1, and imperfections in diffusion will
mean that the real lower bound is higher. (Skipping to the end state: for
n=10 I end up with 2.34 bits of round key per claimed bit of security, which
is not a huge amount of margin, but still sufficient, I think. n=5 would have
been 1.17, which I suspect would be insufficient.)

(I’m treating the number of bits of security as equivalent to the number of
bits of key, because that is the intent of the cipher.)

Well then, the number of rounds. Or actually, the key expansion technique,
that’s important too. Originally, I expanded the key for each value of n, by
consuming one word of the key until it was all gone (plus however many extra
zero bits were needed at the end); this led to m=ceil(128 / n). I also
tentatively chose T=30, leading to values of w from 17 (for n=10) to 27 (for
n=50). This seemed fine for a while, but I wasn’t enthusiastic about having
to store 2160 bytes of expanded keys (as I was using u64 for all n), and
n=10’s m=13 was rather a lot.

Then I thought of just expanding it once, with n=64, m=2, and during
encryption taking the n bits of each round key. The purpose of key expansion
is to diffuse the key bits into the round keys, so this shouldn’t reduce
security in any way. And it gets the whole key into the mix sooner, so it
should lead to *increased* security in the round keys for small values of n.

This effectively reduces the very high values of m for small values of n, so
that my w is higher for them. This suggests that I could probably reduce the
number of rounds for smaller n. A nice simple solution here is T = 20 + n/5,
which stays higher than Speck, but by much less. The most important one to
consider here is n=10: this takes it down to T=22, meaning 220 bits of round
key. This is heading too low for my liking. With this approach to key
scheduling, I suspect you may need more rounds to fit this much security in.
(This is why I eliminated n=5: it’s hard to get such high security out of
such a block size with this type of cipher; even n=10 may be pushing it.)

And really, performance is sufficiently excellent that it’s not worth fussing
about reducing rounds by 25%. My Rust benchmark of encrypting all n=10 values
(2²⁰ of them, around a million) shows it taking under half a nanosecond for
each one (under 0.5ms for all) on my laptop, and this will hold for all block
sizes, since all are using u64 machine words with masking.

So I’ll just leave it a flat 30 rounds, which I’m fairly confident is
sufficient for 128-bit security at n=10, and significant overkill for n=15 to
n=40 or so. I’ve run the numbers in a few different ways and am as confident
in the results as I could hope to be as a crypto layman.

Justification for α = 9, β = 2
------------------------------

The Speck designers’ choice of rotation values is explained in their notes on
the design and analysis of Simon and Speck: they started with (7, 2) and it
worked well, but switched to (8, 3), which was still good enough, for better
x86 performance due to SIMD byte-shuffle, except for Speck32 where (8, 3) was
too low-quality on a 16-bit word and they didn’t want to increase the number
of rounds to compensate.
They acknowledge that there are other values like (9, 2) that are better in a
vacuum, but that the attacks they handle better aren’t security-limiting
attacks, and that performance is worse on some platforms.

In 5peck, word sizes are multiples of 5, so the SIMD byte-shuffle argument
fails, as do the other performance arguments. Intuition suggested to me that
α − β = 5 might suffer slightly; some basic cryptanalysis concurred and
suggested (9, 2) works well. (I use that for the n=64, m=2 key expansion
too.)

In related news, it *looks* to me from some basic CryptoSMT analysis like
multiples of five for n fare better than multiples of eight. Not sure if I’m
barking up the wrong tree or if that’s a real effect. If you happen to be a
cryptographer, I’d quite like to know more, for curiosity’s sake.

Other remarks
-------------

I do not know if my key reuse across ciphers weakens anything. In principle,
I don’t believe it should, especially given that a particular value will only
ever be encrypted with one value of n.

Remember that even if I can figure out how to nudge cryptographers’ tools
along, I am not a professional cryptographer and really don’t know what I’m
doing. But I do think that the cryptography covered here is sound and robust,
to approximately 128 bits of security, which is frankly pretty extreme
overkill for ID scrambling. *Even if* I have gravely miscalculated on n=10, I
don’t think there’s any way that it could reduce it below even 80 bits, which
would still be significant overkill for the applicable threat model.

I’ve used quite a bit of logic and reasoning here, which I imagine is a bad
sign in cryptographic circles, where it’s better to use concrete
measurements. I invite analysis by experts whose fancy is tickled.

If I’ve scared you in all of this, I would like to reassure you: I’ve been
handling the cryptography carefully just out of principle, and in the real
world it doesn’t actually matter, because in the intended use case of this
library, literally no currently-known or hypothesised attack will *ever* be
viable. Your IDs are safe.

3. Base-32 encoding
===================

I have chosen this alphabet:

    23456789abcdefghijkmnpqrstuvwxyz

That is, ASCII digits and lowercase letters, minus the following characters:

• 0, as visually confusable with O and aurally confusable with o (since zero
  is extremely commonly pronounced “oh”, just like o);
• 1, as visually confusable with l and I;
• l, as visually confusable with 1 and I; and
• o, as aurally confusable with 0.

For some purposes it might be nice to remove more:

• i, as confusable in uppercase with l and 1;
• 5 and s, as mutually confusable in handwriting and uppercase;
• 2 and z, as mutually confusable in handwriting and uppercase; and
• v and u, as mutually confusable in some fonts and handwriting.

But with bit-level block sizes (and without messing around with more esoteric
format-preserving encryption, which I did try), a power-of-two base yields
the best results for performance, code size and compactness of
representation, and I didn’t want to go down to 16. So base-32 it is, meaning
five bits per character.

An interesting consequence of the technical choices is that TESIDs are an
even number of characters. This means that you could even provide an
*optional* check digit, if it took your fancy: even-length string means no
check digit, odd-length means check digit. If you want a check digit, I
suggest the Damm algorithm with an order-32 table. At this time, I offer no
assistance in this matter.
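Back to the encoding itself: in code it is just a matter of taking the block
five bits at a time and indexing into that alphabet. A minimal sketch (my own
function names, emitting the most significant character first; the reference
implementation may differ in such details):

    // The 32-character alphabet described above, in order.
    const ALPHABET: &[u8; 32] = b"23456789abcdefghijkmnpqrstuvwxyz";

    /// Encode `block` as exactly `len` characters, five bits per character.
    fn encode_base32(mut block: u128, len: usize) -> String {
        let mut out = vec![0u8; len];
        for slot in out.iter_mut().rev() {
            *slot = ALPHABET[(block & 31) as usize]; // low five bits pick a character
            block >>= 5;
        }
        debug_assert_eq!(block, 0, "block too large for the requested length");
        String::from_utf8(out).unwrap()
    }

    /// The reverse: look each character up in the alphabet.
    /// Returns None for any character outside the alphabet.
    fn decode_base32(s: &str) -> Option<u128> {
        s.bytes().try_fold(0u128, |acc, b| {
            let value = ALPHABET.iter().position(|&a| a == b)? as u128;
            Some((acc << 5) | value)
        })
    }

For example, encode_base32(0, 4) is “2222”, and decoding that string gives
back Some(0). Grouping into fives like this is also why an even number of
characters falls out of the design: every word is a multiple of five bits,
and every block is two words.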
When thinking of ID lengths: every five bits is one character, or more
practically (because of the block sizes used) every ten bits two characters.
(Speck can only support even block sizes, so I couldn’t support odd-length
strings without compromising my design a bit, or finding another cipher as
good that could support odd block sizes, which I failed to do, though I did
toy with Hasty Pudding for a while.)

One final obvious question: why base-32, why not a different base?

• To begin with, only ASCII is considered. Non-ASCII would be troublesome.

• If convenient, a power of two is good, for its bit alignment, leading to a
  consistent pattern of block sizes and lengths, easier reasoning, and
  slightly improved efficiency of execution and length.

• Base-64 is eliminated because it’s not sufficiently human-friendly: (a) it
  requires two non-alphanumeric characters in its alphabet, which messes up
  wordwise selection, since underscore is the only non-alphanumeric ASCII
  character that counts as a “word” character; and (b) confusable characters
  are a problem, and mixed case is an inconvenience. In its favour, exactly
  six bits per character *is* nice, and its evenness means you could fully
  efficiently support all string lengths, odd as well as even (with Speck
  n=3×length).

• Base-62 (alphanumeric) solves the wordwise selection problem, but doesn’t
  solve confusables or mixed case, and loses the neatness of bit alignment.

• In early TESID design I used Base58 (alphanumeric minus 0, I, O, l), and
  was considering going down to base-57 (removing 1), which still yielded
  6-/11-character strings for the 32- and 64-bit block sizes I used at the
  time. But a couple of people convinced me that case-sensitivity was
  undesirable, and once I got into custom block sizes, the difference in base
  doesn’t make such a difference in length.

• Base-32 is pretty much as large as you can go while removing all the usual
  confusables.

• Going down perhaps as far as base-25 to eliminate more diverse confusables
  (mentioned above) was tempting, but I like the neatness of 32.

• Going down to base-16 (whether hexadecimal or something else) *would* solve
  all the problems identified, including even minimising concerns about
  undesirable patterns in strings, but I’m not fond of the longer IDs it
  leads to: a 40-bit value is 10 characters, compared to base-32’s 8. (And
  with a little sparsity or type discrimination, a 40-bit value isn’t
  unreasonable.) Also there’s something I can’t quite put my finger on where
  hexadecimal just feels… *programmery*. The base is too small to feel right
  in some way I can’t properly express. A larger base somehow ends up feeling
  more human. (Don’t take from this that programmers are not human. Unless it
  amuses you to do so, in which case I won’t try to stop you.)

• Decimal is unsuitable as it tempts numeric treatment, while leading zeroes
  are actually significant due to the cipher range overlaps (so stripping
  them, as numeric treatment would, corrupts the ID). Also it’s just
  misleading for people, because if they see numbers, they kinda expect a
  sequence.

If designing with a different alphabet, different block sizes would make
sense. If encoding to base-*b* and padding to length *l*, the largest block
size that can be used is ceil(log₂(b**l − b**(l − 1))) (this formula includes
rezeroing the block range, which is useful for some values of b and l).
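A quick sketch of evaluating that formula (the function name is mine), which
reproduces the lists that follow:

    /// Largest usable block size in bits for strings of exactly l characters
    /// in base b: ceil(log2(b^l − b^(l−1))).
    fn max_block_bits(b: u128, l: u32) -> u32 {
        if l == 0 {
            return 0;
        }
        let span = b.pow(l) - b.pow(l - 1); // count of strings of exactly length l
        // ceil(log2(span)): the bit length of span, minus one if span is an
        // exact power of two.
        let bits = 128 - span.leading_zeros();
        if span.is_power_of_two() { bits - 1 } else { bits }
    }

    fn main() {
        for b in [16u128, 20, 25, 32, 58, 64] {
            // Keep listing block sizes until 64-bit values are covered.
            let mut sizes = Vec::new();
            for l in 0u32.. {
                let bits = max_block_bits(b, l);
                sizes.push(bits);
                if bits >= 64 {
                    break;
                }
            }
            println!("b={b}: {sizes:?}");
        }
    }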
For l={0, 1, 2, 3, …}, block sizes up to 64 bits:

• b=16 yields block sizes 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48,
  52, 56, 60, 64;
• b=20 yields block sizes 0, 5, 9, 13, 18, 22, 26, 31, 35, 39, 44, 48, 52,
  57, 61, 65;
• b=25 yields block sizes 0, 5, 10, 14, 19, 24, 28, 33, 38, 42, 47, 52, 56,
  61, 65;
• b=32 yields block sizes 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60,
  65;
• b=58 yields block sizes 0, 6, 12, 18, 24, 30, 36, 41, 47, 53, 59, 65;
• b=64 yields block sizes 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66.

But with Speck, block sizes must be multiples of two, so any odd-numbered
block sizes must be reduced by one. (That is, a five-character string in
base-32 could only go up to 2²⁴, not 2²⁵.)

To my mind, the interesting variants of TESID here are:

• Supporting odd-length strings, which for base-32 means Speck n={2, 5, 7,
  10, 12, 15, 17, 20, 22, …}. Presuming you still started at four characters
  (n=10), I think the values α=9 and β=2 would still be sound.

• Going down to base-16, with Speck n={2, 4, 6, 8, 10, 12, …, 64}, officially
  supporting the full u128 range of up to 32 characters, though probably
  still starting at five characters (n=10). If you wanted the first 65536 IDs
  to be less than five characters, I’d suggest officially dropping below
  128-bit security. (Aside: n=2 is fundamentally incompatible with 128-bit
  security, as there are too few possible permutations of a block:
  (2⁴)! < 2¹²⁸.) The fact that base-16 aligns with bytes is quite nice. In
  the Rust implementation, since encode_long would no longer be fallible, I
  think I’d drop the u64/u128 divide and go u128 throughout.

Supporting odd-length strings does disclose more epochs, but it doesn’t
matter, especially not when you can tweak all that with⸺

1. Sparsity and discrimination
==============================

This is a simple extra step that can be added before encryption, in order to
make valid IDs harder to guess, and to avoid overlap between IDs of different
types:

    id → id × sparsity + discriminant

There are two purposes to this step:

1. Reducing the occupation of the ID space.

   If you don’t do this, then by the time you have a million or so IDs
   assigned (all the ones up to 2²⁰), *every* four-character string made up
   of base-32 characters will correspond to a valid ID. This can be
   undesirable, for example if you would like to use IDs as authentication,
   or if you just plain don’t want valid IDs to be guessable.

   (“Using IDs as authentication”: that is, having the ID is sufficient to
   access and/or mutate an entity. This means things like magic login links,
   and unlisted videos on a site like YouTube.)

   Given a 64-bit ID and 64-bit discriminant, the largest value that can
   safely be used for sparsity is 2³⁶ − 1. I suggest using powers of two
   (assigning a certain number of bits), but you don’t have to.

   If you allot 36 bits of sparsity (the most that can safely be done with
   64-bit IDs, as 36 + 64 = 100, which is the largest block size supported by
   the 5peck ciphers, though in practice nothing will fill even 50 bits of
   that 64-bit ID, so with care you can go higher than 36 if you want), then
   only one in every 68 billion possible IDs will be valid, and it’s very
   probably reasonable to use IDs as authentication. With 36 bits of
   sparseness, you’d get 16 8-character IDs, then 16384 10-character IDs,
   then 16 million 12-character IDs, &c.

2. Avoiding ID reuse across types or tenants, by adding a discriminant.
   Without this, a given ID number will produce the same TESID, which could
   lead to type confusion, allow discovery of valid IDs of different types
   (or on different tenants, or whatever), or disclose details about types’
   sequences.

   With this applied correctly and consistently, TESIDs will effectively
   represent the encryption of the (type, id) tuple, instead of just the ID
   number. Because of this, a decoder can also determine which type an ID
   pertains to.

   If you are going to use TESID for more than one type, I recommend adding
   bits for a type discriminant.

As for how much sparsity to add, decide that yourself. Every bit you add
increases the amortised ID length by 0.2 characters. (Or, every ten bits you
add increases the ID length by exactly 2 characters.)

Here’s a reference table for some possible values, choosing powers of two
(though you’re not limited to them):

┌─────────────┬────────┬───────┬────────────┬─────────────┬─────────────┬────────────────┬───┐
│ Sparsity    │ (Bits) │ Extra │ № 4-char   │ № 6-char    │ № 8-char    │ № 10-char      │ … │
│             │        │ chars │ IDs        │ IDs         │ IDs         │ IDs            │ … │
┝━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━━━━━┿━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━┥
│ 1           │ 0      │ 0     │ ~1 million │ ~1 billion  │ ~1 trillion │ ~1 quadrillion │ … │
│ 32          │ 5      │ 1     │ 32768      │ ~32 million │ ~32 billion │ ~32 trillion   │ … │
│ 256         │ 8      │ 1.6   │ 4096       │ ~4 million  │ ~4 billion  │ ~4 trillion    │ … │
│ 1024        │ 10     │ 2     │ 1024       │ ~1 million  │ ~1 billion  │ ~1 trillion    │ … │
│ 65536       │ 16     │ 3.2   │ 16         │ 16368       │ ~16 million │ ~16 billion    │ … │
│ 4294967296  │ 32     │ 6.4   │ 0          │ 0           │ 256         │ 261888         │ … │
│ 68719476735 │ 35.99… │ 7.2   │ 0          │ 0           │ 16          │ 16368          │ … │
└─────────────┴────────┴───────┴────────────┴─────────────┴─────────────┴────────────────┴───┘

Values above 8 are not particularly likely to be useful for type
discrimination, but could be for other forms of discrimination (e.g. shared
tenantry), or for pure sparseness (especially for authenticatory IDs).

The discriminant should generally be smaller than the sparsity; otherwise the
ID reuse problem will occur, and type-unaware decoding will no longer be able
to retrieve the discriminant in full. But this is not checked, partly for
reasons of efficiency, and partly because an overly-large discriminant can be
used legitimately, as a simple offset to lengthen IDs (though sparsity and
skipping ID 0 is probably generally a better tool).

I’ve introduced the concept of typed TESID coders to the libraries. TESID
offers no assistance in maintaining discriminant values for your types; that
you must maintain yourself.

It’s possible to achieve these effects in the database, by adjusting the
start (discriminant) and step (sparseness) of the sequence. TESID provides
this functionality anyway for three reasons:

(a) Most databases don’t provide the tools to readily adjust sequence stride;

(b) Your database is probably limited to 63 bits, but TESID performs these
    operations in 100-bit space, allowing sparseness and discrimination even
    on implausibly large ID values; and

(c) You may be applying TESID to an existing sequence.

—⁂—

I identify three main parameter value groups:

• Big sparsity, no or small discriminant: longer, much less guessable IDs.
  Go far enough and this can be reasonable for authentication: at 2³⁰, you’re
  adding six characters to the string length, and only one in a billion IDs
  will be valid. And you can go further still if you really want.

• Big discriminant, no sparsity: just longer IDs, but fully guessable once
  you have enough.
  If you go big enough, this can give the perfect illusion of a fixed-length
  string: with a u64 ID, you can safely use 2⁶⁰ as a discriminant and get
  always-14-character TESIDs; these will *in practice* be quite suitable for
  use as authentication. But you have to be careful in choosing the big
  discriminant to use, since the strings *are* still variable-width, and full
  subranges are perfectly guessable. All up, I don’t generally recommend this
  form.

• Small sparsity, type discriminant: slightly longer and slightly less
  guessable IDs. This is a good choice for most purposes. (When I say “small
  sparsity”, I suggest values like 32, 256 or 1024. You don’t *have* to line
  up with bits, but I mildly prefer to.)

—⁂—

Sequence ideas:

• Auto-incrementing database ID. This is the obvious and most likely use.

• Time. If you use a unix timestamp in seconds, you will get 8-character IDs
  until the year 36812, which is hopefully enough for you. (Times before
  2004-01-11T00:37:04 will be six characters, and from 1970-01-01T00:00:00 to
  1970-01-13T13:16:15 four.) If you use YYYYMMDDHHMMSS, you’ll get
  10-character IDs; subtract 2000 from the year and go with YYMMDDHHMMSS, and
  you’re back to 8 characters until the end of 2109.

  Practical example that I am thinking of using: private but sharable (that
  is, the URL is the only authentication) pages on my website, where I add a
  random-looking token to the slug, generated from the publication date.
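Finally, to make the sparsity and discriminant arithmetic of this section
concrete, here it is in code. This is purely illustrative and the function
names are mine; the published libraries expose this through their typed
coders rather than anything shaped like this.

    /// id → id × sparsity + discriminant, applied before encryption.
    /// u128 comfortably holds the up-to-100-bit values involved.
    fn spread(id: u128, sparsity: u128, discriminant: u128) -> u128 {
        id * sparsity + discriminant
    }

    /// The reverse, applied after decryption. Returns None if the value does
    /// not carry the expected discriminant, i.e. it is not a valid ID of
    /// this type.
    fn unspread(value: u128, sparsity: u128, discriminant: u128) -> Option<u128> {
        if value % sparsity == discriminant {
            Some(value / sparsity)
        } else {
            None
        }
    }

    fn main() {
        // For example, sparsity 256 with discriminants 0 and 1 for two types:
        assert_eq!(spread(7, 256, 1), 7 * 256 + 1);
        assert_eq!(unspread(7 * 256 + 1, 256, 1), Some(7));
        assert_eq!(unspread(7 * 256 + 1, 256, 0), None); // wrong type
    }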