TESID is made up of three parts:
2. Encryption,
3. String encoding, and
1. A convention for reducing occupation of the ID space
   and attaching a discriminant for types or other such details.

(I’ve numbered these items in their encoding order,
but I’m going to discuss them in this altered order.
Things would get boring if list item numbers always had to be in order.)

See <algorithms.txt> for a basic overview of it all.
This file explains why things are the way they are.

2. 5peck: Speck, but with five-based word sizes and just one key
================================================================

Encryption is done with some new instances of the Speck cipher family,
which I have called 5peck (because it amused me).
5peck is pronounced “fpeck”, not to be confused with “ſpeck”.
(This also amuses me.)
In places where an identifier is required that cannot start with a numeral, I use fpeck.

There are two particular properties of 5peck ciphers:

• All instances share an expanded key,
taking as many bits as they require of each round key.
This is a significant deviation from Speck for small word sizes,
since there’s not much point in longer keys in usual lightweight block cipher situations,
due to other weaknesses like birthday attacks.
But such reasons are inapplicable to 5peck.

• Word sizes are multiples of five,
in order to match the base-32 alphabet that follows (32 is five bits).
This is not normal because it’s more computationally expensive
with no real benefit for normal cryptographic purposes,
but there’s no reason not to do it when you have a reason like I do.

5peck ciphers don’t get names of their own (unlike Speck32/64, Speck128/128, &c.).
Wouldn’t want them getting self-important or anything.

In the reference implementation, all 5peck ciphers use a 64-bit machine word,
with appropriate rotation and wrapping addition and subtraction via masking,
but all that is implementation detail.

For key scheduling: a 128-bit key is expanded with n=64, m=2, T=30, α=9, β=2.
Encryption takes the n least significant bits of each round key.

5peck is defined for word sizes n={10, 15, 20, 25, 30, 35, 40, 45, 50}.
In all cases, T=30, α=9, β=2, and m is not applicable.

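To make that concrete, here is a rough Rust sketch of the scheme as described above: one Speck-style key expansion with n=64, m=2, T=30, α=9, β=2, and encryption with rotations and masked wrapping arithmetic in a u64 carrier. The function names and key word order are mine, not the reference implementation’s, and nothing here is verified against it beyond round-tripping.

```rust
// Sketch only: not the reference implementation, and unchecked against it.

const T: usize = 30; // rounds
const ALPHA: u32 = 9;
const BETA: u32 = 2;

/// Expand a 128-bit key once, Speck-style, with n=64, m=2.
fn expand_key(key: u128) -> [u64; T] {
    let mut k = key as u64; // low word (assumed word order)
    let mut l = (key >> 64) as u64; // high word
    let mut round_keys = [0u64; T];
    for i in 0..T {
        round_keys[i] = k;
        l = k.wrapping_add(l.rotate_right(ALPHA)) ^ i as u64;
        k = k.rotate_left(BETA) ^ l;
    }
    round_keys
}

/// Rotate right within an n-bit word carried in a u64.
fn ror(v: u64, r: u32, n: u32) -> u64 {
    let mask = (1u64 << n) - 1;
    ((v >> r) | (v << (n - r))) & mask
}

/// Rotate left within an n-bit word carried in a u64.
fn rol(v: u64, r: u32, n: u32) -> u64 {
    ror(v, n - r, n)
}

/// Encrypt the 2n-bit block (x, y), taking the n least significant
/// bits of each round key.
fn encrypt(round_keys: &[u64; T], n: u32, mut x: u64, mut y: u64) -> (u64, u64) {
    let mask = (1u64 << n) - 1;
    for &rk in round_keys {
        x = (ror(x, ALPHA, n).wrapping_add(y) & mask) ^ (rk & mask);
        y = rol(y, BETA, n) ^ x;
    }
    (x, y)
}

/// Invert `encrypt`.
fn decrypt(round_keys: &[u64; T], n: u32, mut x: u64, mut y: u64) -> (u64, u64) {
    let mask = (1u64 << n) - 1;
    for &rk in round_keys.iter().rev() {
        y = ror(y ^ x, BETA, n);
        x = rol((x ^ (rk & mask)).wrapping_sub(y) & mask, ALPHA, n);
    }
    (x, y)
}

fn main() {
    // One expanded key serves every word size.
    let rks = expand_key(0x0011223344556677_8899aabbccddeeff);
    for n in [10, 15, 20, 25, 30, 35, 40, 45, 50] {
        let x = 123 & ((1 << n) - 1);
        let (cx, cy) = encrypt(&rks, n, x, 456);
        assert_eq!(decrypt(&rks, n, cx, cy), (x, 456));
    }
    println!("round trips ok");
}
```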
The notable omission from the sequence is n=5,
which would produce 1024 two-character TESIDs.
I omitted it because it’s not *that* useful,
I’d need to support a different value of at least α,
and I was concerned about its integrity,
given how extremely it deviates from the pattern I see in the Speck paper.
(To reach 128-bit security, I think it’d need at least a few more rounds,
T=30 yielding only 150 bits in its expanded key.)
So instead I start at n=10 and four-character TESIDs.

Justification for the single expanded key and T=30
--------------------------------------------------

Probably the most significant deviation from Speck is the 128-bit key used all round:
does it or can it lead to 128-bit security, even at n=10?
And how many rounds do we need?

These are the numbers of rounds for the defined Speck instances:

• For n=16: T = 18 + m
• For n=24: T = 19 + m
• For n=32: T = 23 + m
• For n=48: T = 26 + m
• For n=64: T = 30 + m

I’m going to generalise and call this T = w + m,
where w is very approximately proportional to n + c for some constant c.
n=24 is a mild outlier from my predictions: I’d have expected w=20, not w=19.
And assuming that w is *linearly* correlated to n is hubris.

From the addition (rather than multiplication) of m,
it is clear that security is not particularly strongly proportional to the number of round key bits:
Speck128/128 has 16 bits of round key per bit of security,
but Speck128/256 has only 8.5 bits of round key per bit of security,
and Speck32/64 has 5.5 bits of round key per bit of security.
You clearly can’t take this too far, as there’s an obvious hard lower bound of 1,
and imperfections in diffusion will mean that the real lower bound is higher.
(Skipping to the end state:
for n=10 I end up with 2.34 bits of round key per claimed bit of security,
which is not a huge amount of margin, but still sufficient, I think.
n=5 would have been 1.17, which I suspect would be insufficient.)

(I’m treating the number of bits of security as equivalent to the number of bits of key,
because that is the intent of the cipher.)

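Those ratios are just the total round key bits (T × n) divided by the key size; a quick arithmetic check, with the Speck round counts taken from the Speck paper:

```rust
/// Bits of round key per bit of security: T rounds of n-bit round keys,
/// divided by the key size in bits.
fn ratio(t: u32, n: u32, key_bits: u32) -> f64 {
    (t * n) as f64 / key_bits as f64
}

fn main() {
    assert_eq!(ratio(32, 64, 128), 16.0); // Speck128/128: T=32
    assert_eq!(ratio(34, 64, 256), 8.5);  // Speck128/256: T=34
    assert_eq!(ratio(22, 16, 64), 5.5);   // Speck32/64: T=22
    println!("5peck n=10: {:.2}", ratio(30, 10, 128)); // 2.34
    println!("5peck n=5 would be: {:.2}", ratio(30, 5, 128)); // 1.17
}
```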
Well then, the number of rounds.
Or actually, the key expansion technique, that’s important too.

Originally, I expanded the key for each value of n,
by consuming one word of the key until it was all gone
(plus however many extra zero bits were needed at the end);
this led to m=ceil(128 / n).
I also tentatively chose T=30,
leading to values of w from 17 (for n=10) to 27 (for n=50).
This seemed fine for a while,
but I wasn’t enthusiastic about having to store 2160 bytes of expanded keys
(as I was using u64 for all n), and n=10’s m=13 was rather a lot.
Then I thought of just expanding it once, with n=64, m=2,
and during encryption taking the n bits of each round key.
The purpose of key expansion is to diffuse the key bits into the round keys,
so this shouldn’t reduce security in any way.
And it gets the whole key into the mix sooner,
so it should lead to *increased* security in the round keys for small values of n.
This effectively reduces the very high values of m for small values of n,
so that my w is higher for them.
This suggests that I could probably reduce the number of rounds for smaller n.
A nice simple solution here is T = 20 + n/5,
which stays higher than Speck, but by much less.
The most important one to consider here is n=10: this takes it down to T=22,
meaning 220 bits of round key.
This is heading too low for my liking.
With this approach to key scheduling,
I suspect you may need more rounds to fit this much security in.

(This is why I eliminated n=5: it’s hard to get such high security
out of such a small block size with this type of cipher; even n=10 may be pushing it.)

And really, performance is sufficiently excellent that it’s not worth fussing about reducing rounds by 25%.
My Rust benchmark of encrypting all n=10 values (2²⁰ of them, around a million)
shows it taking under half a nanosecond for each one (under 0.5ms for all) on my laptop,
and this will hold for all block sizes, since all are using u64 machine words with masking.
So I’ll just leave it a flat 30 rounds,
which I’m fairly confident is sufficient for 128-bit security at n=10,
and significant overkill for n=15 to n=40 or so.
I’ve run the numbers in a few different ways
and am as confident in the results as I could hope to be as a crypto layman.

Justification for α = 9, β = 2
------------------------------

Speck’s decision for their rotation values is explained in
<https://eprint.iacr.org/2017/560.pdf#page=8>:
they started with (7, 2) and it worked well,
but switched to (8, 3), which was still good enough,
for better x86 performance due to SIMD byte-shuffle,
except for Speck32 where (8, 3) was too low-quality on a 16-bit word
and they didn’t want to increase the number of rounds to compensate.
They acknowledge that there are other values like (9, 2) that are better in a vacuum,
but that the attacks they handle better aren’t security-limiting attacks,
and that performance is worse on some platforms.

In 5peck, word sizes are multiples of 5,
so the SIMD byte-shuffle argument fails,
as do the other performance arguments.
Intuition suggested to me that α − β = 5 might suffer slightly;
some basic cryptanalysis concurred and suggested (9, 2) works well.
(I use that for the n=64, m=2 key expansion too.)
In related news, it *looks* to me from some basic CryptoSMT analysis
like multiples of five for n fare better than multiples of eight.
Not sure if I’m barking up the wrong tree or if that’s a real effect.
If you happen to be a cryptographer, I’d quite like to know more, for curiosity’s sake.

I do not know if my key reuse across ciphers weakens anything.
In principle, I don’t believe it should,
especially given that a particular value will only ever be encrypted with one value of n.

Remember that even if I can figure out how to nudge cryptographers’ tools along,
I am not a professional cryptographer and really don’t know what I’m doing.
But I do think that the cryptography covered here is sound and robust,
to approximately 128 bits of security,
which is frankly pretty extreme overkill for ID scrambling.
*Even if* I have gravely miscalculated on n=10,
I don’t think there’s any way that it could reduce it below even 80 bits,
which would still be significant overkill for the applicable threat model.

I’ve used quite a bit of logic and reasoning here,
which I imagine is a bad sign in cryptographic circles,
where it’s better to use concrete measurements.
I invite analysis by experts whose fancy is tickled.

If I’ve scared you in all of this, I would like to reassure you:
I’ve been handling the cryptography carefully just out of principle,
and in the real world it doesn’t actually matter,
because in the intended use case of this library,
literally no currently-known or hypothesised attack will *ever* be viable.

3. String encoding
==================

I have chosen this alphabet:

23456789abcdefghijkmnpqrstuvwxyz

That is, ASCII digits and lowercase letters, minus the following characters:

• 0, as visually confusable with O and aurally confusable with o
(since zero is extremely commonly pronounced “oh”, just like o);

• 1, as visually confusable with l and I;

• l, as visually confusable with 1 and I; and

• o, as aurally confusable with 0.

For some purposes it might be nice to remove more:

• i, as confusable in uppercase with l and 1;

• 5 and s, as mutually confusable in handwriting and uppercase;

• 2 and z, as mutually confusable in handwriting and uppercase; and

• v and u, as mutually confusable in some fonts and handwriting.

But with bit-level block sizes
(and without messing around with more esoteric format-preserving encryption, which I did try),
a power-of-two base yields the best results for performance, code size and compactness of representation,
and I didn’t want to go down to 16.
So base-32 it is, meaning five bits per character.

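A minimal sketch of such an encoding, five bits per character with this alphabet (hypothetical helper names, not the libraries’ actual API):

```rust
/// The chosen alphabet: ASCII digits and lowercase letters minus 0, 1, l, o.
const ALPHABET: &[u8; 32] = b"23456789abcdefghijkmnpqrstuvwxyz";

/// Encode `value` as `len` characters, five bits each, most significant first.
fn encode(value: u64, len: u32) -> String {
    (0..len)
        .rev()
        .map(|i| ALPHABET[(value >> (5 * i)) as usize & 31] as char)
        .collect()
}

/// Decode, rejecting any character outside the alphabet.
fn decode(s: &str) -> Option<u64> {
    s.bytes().try_fold(0u64, |acc, b| {
        let v = ALPHABET.iter().position(|&a| a == b)?;
        Some(acc << 5 | v as u64)
    })
}

fn main() {
    assert_eq!(encode(0, 4), "2222"); // zero bits map to the first character
    assert_eq!(decode(&encode(123456, 4)), Some(123456));
    assert_eq!(decode("10"), None); // 1 and 0 are not in the alphabet
    println!("ok");
}
```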
An interesting consequence of the technical choices is that TESIDs are always an even number of characters long.
This means that you could even provide an *optional* check digit, if it took your fancy:
even-length string means no check digit, odd-length means check digit.
If you want a check digit, I suggest the Damm algorithm with an order-32 table.
At this time, I offer no assistance in this matter.

When thinking of ID lengths: every five bits is one character,
or more practically (because of the block sizes used) every ten bits two characters.
(Speck can only support even block sizes,
so I couldn’t support odd-length strings without compromising my design a bit,
or finding another cipher as good that could support odd block sizes,
which I failed to do, though I did toy with Hasty Pudding for a while.)

One final obvious question: why base-32, why not a different base?

• To begin with, only ASCII is considered. Non-ASCII would be troublesome.

• If convenient, a power of two is good, for its bit alignment,
leading to a consistent pattern of block sizes and lengths,
easier reasoning, and slightly improved efficiency of execution and length.

• Base-64 is eliminated because it’s not sufficiently human-friendly:

(a) it requires two non-alphanumeric characters in its alphabet,
which messes up wordwise selection since, of ASCII,
underscore is the only non-alphanumeric character that counts as “word”.

(b) confusable characters are a problem,
and mixed case an inconvenience.

In its favour, exactly six bits per character *is* nice,
and its evenness means you could fully efficiently support all string lengths,
odd as well as even (with Speck n=3×length).

• Base-62 (alphanumeric) solves the wordwise selection problem,
but doesn’t solve confusables or mixed case,
and loses the neatness of bit alignment.

• In early TESID design I used base-58 (alphanumeric minus 0, I, O, l),
and was considering going down to base-57 (removing 1),
which still yielded 6-/11-character strings for the 32- and 64-bit block sizes I used at the time.
But a couple of people convinced me that case-sensitivity was undesirable,
and once I got into custom block sizes,
the difference in base doesn’t make much difference to length.

• Base-32 is pretty much as large as you can go while removing all the usual confusables.

• Going down perhaps as far as base-25 to eliminate more diverse confusables
(mentioned above) was tempting, but I like the neatness of 32.

• Going down to base-16 (whether hexadecimal or something else)
*would* solve all the problems identified,
including even minimising concerns about undesirable patterns in strings,
but I’m not fond of the longer IDs it leads to:
a 40-bit value is 10 characters,
compared to base-32’s 8.
(And with a little sparsity or type discrimination,
a 40-bit value isn’t unreasonable.)
Also there’s something I can’t quite put my finger on where hexadecimal just feels… *programmery*.
The base is too small to feel right in some way I can’t properly express.
A larger base somehow ends up feeling more human.
(Don’t take from this that programmers are not human.
Unless it amuses you to do so, in which case I won’t try to stop you.)

• Decimal is unsuitable as it tempts numeric treatment,
but leading zeroes are significant due to the cipher range overlaps.
Also it’s just misleading for people, because if they see numbers,
they kinda expect a sequence.

If designing with a different alphabet, different block sizes would make sense.
If encoding to base-*b* and padding to length *l*,
the largest block size that can be used is ceil(log₂(b**l - b**(l - 1)))
(this formula includes rezeroing the block range,
which is useful for some values of b and l).
For l={0, 1, 2, …}, up to the first block size of at least 64 bits:
b=16 yields block sizes 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64;
b=20 yields block sizes 0, 5, 9, 13, 18, 22, 26, 31, 35, 39, 44, 48, 52, 57, 61, 65;
b=25 yields block sizes 0, 5, 10, 14, 19, 24, 28, 33, 38, 42, 47, 52, 56, 61, 65;
b=32 yields block sizes 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65;
b=58 yields block sizes 0, 6, 12, 18, 24, 30, 36, 41, 47, 53, 59, 65;
b=64 yields block sizes 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66.
But with Speck, block sizes must be multiples of two,
so any odd-numbered block sizes must be reduced by one.
(That is, a five-character string in base-32 could only go up to 2²⁴, not 2²⁵.)

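These lists can be reproduced mechanically from that formula; a sketch (my own helper, stopping at the first size of at least 64 bits, and before the reduce-odd-sizes-to-even Speck adjustment):

```rust
/// Block sizes for base `b`, for l = 0, 1, 2, …: ceil(log₂(bˡ − bˡ⁻¹)).
fn block_sizes(b: u128) -> Vec<u32> {
    let mut sizes = vec![0]; // l=0
    for l in 1.. {
        let range = b.pow(l) - b.pow(l - 1);
        // ceil(log₂(range)), via the bit length of range − 1.
        let bits = 128 - (range - 1).leading_zeros();
        sizes.push(bits);
        if bits >= 64 {
            break;
        }
    }
    sizes
}

fn main() {
    assert_eq!(
        block_sizes(32),
        [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
    );
    assert_eq!(block_sizes(58), [0, 6, 12, 18, 24, 30, 36, 41, 47, 53, 59, 65]);
    println!("ok");
}
```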
To my mind, the interesting variants of TESID here are:

• Supporting odd-length strings,
which for base-32 means Speck n={2, 5, 7, 10, 12, 15, 17, 20, 22, …}.
Presuming you still started at four characters (n=10),
I think the values α=9 and β=2 would still be sound.

• Going down to base-16, with Speck n={2, 4, 6, 8, 10, 12, …, 64},
officially supporting the full u128 range of up to 32 characters,
though probably still starting at five characters (n=10).
If you wanted the first 65536 IDs to be less than five characters,
I’d suggest officially dropping below 128-bit security.
(Aside: n=2 is fundamentally incompatible with 128-bit security,
as there are too few possible permutations of a block: 2⁴! < 2¹²⁸.)
The fact that base-16 aligns with bytes is quite nice.
In the Rust implementation, since encode_long would no longer be fallible,
I think I’d drop the u64/u128 divide and go u128 throughout.

Supporting odd-length strings does disclose more epochs,
but it doesn’t matter, especially not when you can tweak all that with⸺

1. Sparsity and discrimination
==============================

This is a simple extra step that can be added before encryption,
in order to make valid IDs harder to guess,
and to avoid overlap between IDs of different types.

id → id × sparsity + discriminant

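As a sketch (my own names, not the libraries’ API; u128 because these operations happen in up-to-100-bit space):

```rust
/// Apply sparsity and a discriminant before encryption.
fn spread(id: u128, sparsity: u128, discriminant: u128) -> u128 {
    // A discriminant at least as large as the sparsity reintroduces
    // ID reuse across types; see the discussion below.
    debug_assert!(discriminant < sparsity);
    id * sparsity + discriminant
}

/// Inverse, after decryption: recover (id, discriminant).
fn unspread(value: u128, sparsity: u128) -> (u128, u128) {
    (value / sparsity, value % sparsity)
}

fn main() {
    let v = spread(42, 256, 7);
    assert_eq!(v, 42 * 256 + 7);
    assert_eq!(unspread(v, 256), (42, 7));
    println!("ok");
}
```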
There are two purposes for sparsity:

1. Reducing the occupation of the ID space.
If you don’t do this, then by the time you have a million or so IDs assigned (all the ones up to 2²⁰),
*every* four-character string made up of base-32 characters will correspond to a valid ID.
This can be undesirable, for example if you would like to use IDs as authentication,
or if you just plain don’t want valid IDs to be guessable.

(“Using IDs as authentication”: that is,
having the ID is sufficient to access and/or mutate an entity.
This means things like magic login links,
and unlisted videos on a site like YouTube.)

Given a 64-bit ID and 64-bit discriminant, the largest value that can safely be used for sparsity is 2³⁶ − 1.
I suggest using powers of two (assigning a certain number of bits), but you don’t have to.
If you allot 36 bits of sparsity (the most that can safely be done with 64-bit IDs,
as 36 + 64 = 100, which is the largest block size supported by the 5peck ciphers,
though in practice nothing will fill even 50 bits of that 64-bit ID,
so with care you can go higher than 36 if you want),
then only one in every 68 billion possible IDs will be valid,
and it’s very probably reasonable to use IDs as authentication.
With 36 bits of sparseness, you’d get 16 8-character IDs,
then 16368 10-character IDs, then 16 million 12-character IDs, &c.

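That 2³⁶ − 1 limit is easy to verify with 128-bit arithmetic: with a full 64-bit ID and a full 64-bit discriminant, the result must still fit in the 100-bit block.

```rust
fn main() {
    let id_max = u64::MAX as u128; // largest 64-bit ID (and discriminant)
    let block = 1u128 << 100; // largest 5peck block: 2n = 100 bits
    let s = (1u128 << 36) - 1; // the stated maximum sparsity

    // id × sparsity + discriminant stays below 2¹⁰⁰ even at the extremes…
    assert!(id_max * s + id_max < block);
    // …but the next sparsity value up can overflow the block.
    assert!(id_max * (s + 1) + id_max >= block);
    println!("2^36 - 1 is the largest safe sparsity");
}
```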
2. Avoiding ID reuse across types or tenants, by adding a discriminant.
Without this, a given ID number will produce the same TESID regardless of type,
which could lead to type confusion,
allow discovery of valid IDs of different types
(or on different tenants, or whatever),
or disclose details about types’ sequences.
With this applied correctly and consistently,
TESIDs will effectively represent the encryption of the (type, id) tuple,
instead of just the ID number.
Because of this, a decoder can also determine which type an ID pertains to.

If you are going to use TESID for more than one type,
I recommend adding bits for a type discriminant.

As for how much sparsity to add, decide that yourself.
Every bit you add increases the amortised ID length by 0.2 characters.
(Or, every ten bits you add increases the ID length by exactly 2 characters.)

Here’s a reference table for some possible values,
choosing powers of two (though you’re not limited to it):

┌─────────────┬────────┬───────┬────────────┬─────────────┬─────────────┬────────────────┬───┐
│ Sparsity    │ (Bits) │ Extra │ № 4-char   │ № 6-char    │ № 8-char    │ № 10-char      │ … │
│             │        │ chars │ IDs        │ IDs         │ IDs         │ IDs            │ … │
┝━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━━━━━┿━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━┥
│           1 │ 0      │ 0     │ ~1 million │ ~1 billion  │ ~1 trillion │ ~1 quadrillion │ … │
│          32 │ 5      │ 1     │ 32768      │ ~32 million │ ~32 billion │ ~32 trillion   │ … │
│         256 │ 8      │ 1.6   │ 4096       │ ~4 million  │ ~4 billion  │ ~4 trillion    │ … │
│        1024 │ 10     │ 2     │ 1024       │ ~1 million  │ ~1 billion  │ ~1 trillion    │ … │
│       65536 │ 16     │ 3.2   │ 16         │ 16368       │ ~16 million │ ~16 billion    │ … │
│  4294967296 │ 32     │ 6.4   │ 0          │ 0           │ 256         │ 261888         │ … │
│ 68719476735 │ 35.99… │ 7.2   │ 0          │ 0           │ 16          │ 16368          │ … │
└─────────────┴────────┴───────┴────────────┴─────────────┴─────────────┴────────────────┴───┘

Values above 8 bits are not particularly likely to be useful for type discrimination,
but could be for other forms of discrimination (e.g. shared tenantry),
or for pure sparseness (especially for authenticatory IDs).

The discriminant should generally be smaller than sparsity;
otherwise the ID reuse problem will occur,
and type-unaware decoding will no longer be able to retrieve the discriminant in full.
But this is not checked, partly for reasons of efficiency,
and partly because an overly-large discriminant can be used legitimately,
as a simple offset to lengthen IDs (though sparsity and skipping ID 0 is probably generally a better tool).

I’ve introduced the concept of typed TESID coders to the libraries.
TESID offers no assistance in maintaining discriminant values for your types;
that you must maintain yourself.

It’s possible to achieve these effects in the database,
by adjusting the start (discriminant) and step (sparseness) of the sequence.
TESID provides this functionality anyway for three reasons:

(a) Most databases don’t provide the tools to readily adjust sequence stride;

(b) Your database is probably limited to 63 bits,
but TESID performs these operations in 100-bit space,
allowing sparseness and discrimination even on implausibly large ID values; and

(c) You may be applying TESID to an existing sequence.

I identify three main parameter value groups:

• Big sparsity, no or small discriminant: longer, much less guessable IDs.
Go far enough and this can be reasonable for authentication:
at 2³⁰, you’re adding six characters to the string length,
and only one in a billion IDs will be valid.
And you can go further still if you really want.

• Big discriminant, no sparsity: just longer IDs, but fully guessable once you have enough.
If you go big enough, this can give the perfect illusion of a fixed-length string:
with a u64 ID, you can safely use 2⁶⁰ as a discriminant and get always-14-character TESIDs;
these will *in practice* be quite suitable for use as authentication.
But you have to be careful in choosing the big discriminant to use,
since the strings *are* still variable-width,
and full subranges are perfectly guessable.
All up, I don’t generally recommend this form.

• Small sparsity, type discriminant: slightly longer and slightly less guessable IDs.
This is a good choice for most purposes.

(When I say “small sparsity”, I suggest values like 32, 256 or 1024.
You don’t *have* to line up with bits, but I mildly prefer to.)

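The always-14-characters claim in the second group can be checked with a small length helper (mine, not the library’s): the string length is the smallest supported even number of characters, minimum four, whose five-bits-per-character capacity fits the value.

```rust
/// String length for a value: the smallest even number of characters
/// (minimum four), at five bits per character.
fn tesid_len(value: u128) -> u32 {
    let mut len = 4;
    while value >> (5 * len) != 0 {
        len += 2;
    }
    len
}

fn main() {
    let d = 1u128 << 60; // the suggested big discriminant
    // With no sparsity, a u64 id becomes id + 2⁶⁰, and every such value
    // needs more than 60 bits but fewer than 70, hence 14 characters:
    assert_eq!(tesid_len(d), 14); // id 0
    assert_eq!(tesid_len(u64::MAX as u128 + d), 14); // largest u64 id
    println!("always 14 characters");
}
```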
• Auto-incrementing database ID. This is the obvious and most likely use.

• Time. If you use a unix timestamp in seconds, you will get 8-character IDs
until the year 36812, which is hopefully enough for you.
(Times before 2004-01-11T00:37:04 will be six characters,
and from 1970-01-01T00:00:00 to 1970-01-13T13:16:15 four.)
If you use YYYYMMDDHHMMSS, you’ll get 10-character IDs;
subtract 2000 from the year and go with YYMMDDHHMMSS,
and you’re back to 8 characters until the end of 2109.

Practical example that I am thinking of using:
private but sharable (that is, the URL is the only authentication) pages on my website
where I add a random token to the slug using the publication date.